Re: Apply Kmeans in partitions

2019-01-30 Thread Apostolos N. Papadopoulos

Hi Dimitri,

What is the error you are getting? Please specify.

Apostolos


On 30/1/19 16:30, dimitris plakas wrote:

Hello everyone,

I have a dataframe with 5040 rows, and these rows are split into 5 groups.
So I have a column called "Group_Id" which marks every row with a value
from 0-4 depending on which group the row belongs to. I am trying to split
my dataframe into 5 partitions and apply KMeans to every partition. I have tried


rdd=mydataframe.rdd.mapPartitions(function, True)
test = Kmeans.train(rdd, num_of_centers, "random")

but I get an error.

How can I apply KMeans to every partition?

Thank you in advance,


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Apply Kmeans in partitions

2019-01-30 Thread dimitris plakas
Hello everyone,

I have a dataframe with 5040 rows, and these rows are split into 5 groups. So
I have a column called "Group_Id" which marks every row with a value from 0-4
depending on which group the row belongs to. I am trying to split my dataframe
into 5 partitions and apply KMeans to every partition. I have tried

rdd=mydataframe.rdd.mapPartitions(function, True)
test = Kmeans.train(rdd, num_of_centers, "random")

but I get an error.

How can I apply KMeans to every partition?

Thank you in advance,
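
One way to do this - a minimal sketch, not taken from the thread, assuming `mydataframe` holds numeric feature columns plus "Group_Id" and that `num_of_centers` is the desired k - is to train a separate mllib model per group by filtering on the group column rather than relying on physical partitions:

```
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

num_of_centers = 4  # assumed value
feature_cols = [c for c in mydataframe.columns if c != "Group_Id"]  # assumed numeric

models = {}
for gid in range(5):
    # Build one RDD of feature vectors per group and train a model on it.
    group_rdd = (mydataframe
                 .filter(mydataframe["Group_Id"] == gid)
                 .rdd
                 .map(lambda row: Vectors.dense([row[c] for c in feature_cols])))
    models[gid] = KMeans.train(group_rdd, num_of_centers,
                               initializationMode="random")
```

KMeans.train expects an RDD of vectors, which is why the sketch maps each Row to a dense vector before training.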


[Spark MLib]: RDD caching behavior of KMeans

2018-07-10 Thread mkhan37
Hi All,

I was varying the storage levels of RDD caching in a KMeans program
implemented using the MLlib library and got some very confusing and
interesting results. The base code of the application is from a benchmark
suite named  SparkBench <https://github.com/CODAIT/spark-bench>  . I changed
the storage level of the data RDD passed to the KMeans train function, and
it seems like MEMORY_AND_DISK_SER performs considerably worse than the
DISK_ONLY level. The MEMORY_AND_DISK level performed the best, as expected.
But why the memory-serialized storage level performs worse than the disk-only
(serialized) level is very confusing. I am using 1 node as master and 4 nodes
as slaves, with each executor having a 48g JVM. The cached data should also
fit within memory easily.

If anyone has any idea or suggestion on why this behavior is happening
please let me know.
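
For reference, a minimal PySpark sketch of the pattern being benchmarked, persisting the input RDD at an explicit storage level before calling KMeans.train; the input path, k, and iteration count are placeholders, and the SparkBench code itself is Scala-based:

```
from pyspark import StorageLevel
from pyspark.mllib.clustering import KMeans

# Placeholder input: one comma-separated feature vector per line.
data = sc.textFile("hdfs:///path/to/kmeans-input")
vectors = data.map(lambda line: [float(x) for x in line.split(",")])

# Swap in StorageLevel.DISK_ONLY (or another level) to compare runs.
vectors.persist(StorageLevel.MEMORY_AND_DISK)

model = KMeans.train(vectors, k=10, maxIterations=20)
```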

Regards,
Muhib 



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Understanding the results from Spark's KMeans clustering object

2018-05-18 Thread shubham
Hello Everyone,

I am performing clustering on a dataset using PySpark. To find the number of
clusters I performed clustering over a range of k values (2 to 20) and found the
WSSE (within-cluster sum of squares) value for each value of k. This is where
I found something unusual. According to my understanding, when you increase
the number of clusters, the WSSE decreases monotonically. But the results I got
say otherwise. I'm displaying the WSSE for the first few values of k only.

Results from spark

For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071
If you look at the WSSE values for k=5 and k=6, you'll see the WSSE has
increased. I turned to sklearn to see if I get similar results. The code I
used for Spark and sklearn is in the appendix section towards the end of
the post. I have tried to use the same values for the parameters in the Spark
and sklearn KMeans models. The following are the results from sklearn and they
are as I expected them to be - monotonically decreasing.

Results from sklearn

For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467
I am not sure why the WSSE values increase in Spark. I tried using
different datasets and found similar behavior there as well. Is there
someplace I am going wrong? Any clues would be great.

APPENDIX
The dataset is located here.

Read the data and declare variables

# get data
import pandas as pd
url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"

df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i !=target_col]

k_min = 2   # 2 is inclusive
k_max = 21  # 21 is exclusive; will fit up to k = 20

max_iter = 1000
seed = 42
This is the code I am using for getting the sklearn results:

from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL

ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:,
numeric_cols]))

wsse_collect = []

for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
This is the code I am using for getting the Spark results:

from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans

standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_features = 'prediction'


assembler = VectorAssembler(inputCols=numeric_cols,
outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)

scaler = StandardScaler(inputCol=standard_scaler_inpt_features,
outputCol=kmeans_input_features, withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)

wsse_collect_spark = []

for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features,
                predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Bisecting Kmeans Linkage Matrix Output (Cluster Indices)

2018-03-14 Thread GabeChurch
I have been working on a project to return a linkage matrix output from the
Spark Bisecting KMeans algorithm so that it is possible to plot the
selection steps in a dendrogram. I am having trouble returning valid indices
when I use more than 3-4 clusters in the algorithm and am hoping someone
else might have the time/interest to take a look.

To achieve this I made some modifications to the Bisecting KMeans algorithm
to produce a z-linkage matrix based on yu-iskw's work. I also made some
modifications to provide more information about the selection steps in the
Bisecting KMeans algorithm to the log at run-time.

Test outputs using the Iris Dataset with both k = 3 and k = 10 clusters can
be seen on  my stack overflow post
<https://stackoverflow.com/questions/49265521/bisecting-kmeans-cluster-indices-in-apache-spark>
  

The project so far (with a simple sbt build and the compiled jars) can also
be seen on  my github repo
<https://github.com/GabeChurch/IncubatingProjects>  and is also detailed in
the aforementioned stack overflow post.




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Scott Reynolds
The train method is on the Companion Object
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans$

here is a decent resource on Companion Object usage:
https://docs.scala-lang.org/tour/singleton-objects.html

On Wed, Dec 13, 2017 at 9:16 AM Michael Segel 
wrote:

> Hi,
>
> Just came across this while looking at the docs on how to use Spark’s
> Kmeans clustering.
>
> Note: This appears to be true in both 2.1 and 2.2 documentation.
>
> The overview page:
> https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means
>
> Here’ the example contains the following line:
>
> val clusters = KMeans.train(parsedData, numClusters, numIterations)
>
> I was trying to get more information on the train() method.
> So I checked out the KMeans Scala API:
>
>
> https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans
>
> The issue is that I couldn’t find the train method…
>
> So I thought I was slowly losing my mind.
>
> I checked out the entire API page… could not find any API docs which
> describe the method train().
>
> I ended up looking at the source code and found the method in the scala
> source code.
> (You can see the code here:
> https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
>  )
>
> So the method(s) exist, but not covered in the Scala API doc.
>
> How do you raise this as a ‘bug’ ?
>
> Thx
>
> -Mike
>
> --

Scott Reynolds
Principal Engineer


EMAIL sreyno...@twilio.com


Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi,

Just came across this while looking at the docs on how to use Spark’s Kmeans 
clustering.

Note: This appears to be true in both 2.1 and 2.2 documentation.

The overview page:
https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means

Here, the example contains the following line:

val clusters = KMeans.train(parsedData, numClusters, numIterations)

I was trying to get more information on the train() method.
So I checked out the KMeans Scala API:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans

The issue is that I couldn’t find the train method…

So I thought I was slowly losing my mind.

I checked out the entire API page… could not find any API docs which describe 
the method train().

I ended up looking at the source code and found the method in the scala source 
code.
(You can see the code here: 
https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
 )

So the method(s) exist, but are not covered in the Scala API doc.

How do you raise this as a ‘bug’ ?

Thx

-Mike



Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Brücke
Hi Ankur,

thank you for answering. But my problem is not that I'm stuck in a local
extremum, but rather the reproducibility of KMeans. What I'm trying to
achieve is: when the input data and all the parameters stay the same,
especially the seed, I want to get the exact same results, even though the
partitioning changes. As far as I'm concerned, if I'm setting a seed in an ML
algorithm, I would expect that algorithm to be deterministic.

Unfortunately I couldn't find any information on whether this is a goal of
Spark's MLlib or not.

Maybe a little bit of background: I'm trying to benchmark some ML
algorithms while changing my cluster config. That is, I want to find the
best cluster config that achieves the same results. But what I see is that
when I change the number of executors, the results become incomparable,
since they differ.

So in essence my question is: are the algorithms in MLlib partition
agnostic or not?

Thanks for your help,
Christoph

On 24.05.2017 at 20:49, "Ankur Srivastava"  wrote:

Hi Christoph,

I am not an expert in ML and have not used Spark KMeans but your problem
seems to be an issue of local minimum vs global minimum. You should run
K-means multiple times with random starting point and also try with
multiple values of K (unless you are already sure).

Hope this helps.

Thanks
Ankur



On Wed, May 24, 2017 at 2:15 AM, Christoph Bruecke 
wrote:

> Hi Anastasios,
>
> thanks for the reply but caching doesn’t seem to change anything.
>
> After further investigation it really seems that the RDD#takeSample method
> is the cause of the non-reproducibility.
>
> Is this considered a bug and should I open an Issue for that?
>
> BTW: my example script contains a little type in line 3: it is `feature`
> not `features` (mind the `s`).
>
> Best,
> Christoph
>
> The script with caching
>
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.storage.StorageLevel
>
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a",
> rand(123)).withColumn("b", rand(321))
>
> val vecAssembler = new VectorAssembler().setInputCols(Array("a",
> "b")).setOutputCol("features")
>
> val data = vecAssembler.transform(randomData)
>
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
>
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> // cache the data
> dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK)
> println("1 Partition: " + kmeans.fit(dataWith1Partition)
> .computeCost(dataWith1Partition))
>
> val dataWith4Partition = data.repartition(4)
> // cache the data
> dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK)
> println("4 Partition: " + kmeans.fit(dataWith4Partition)
> .computeCost(dataWith4Partition))
>
>
> ```
>
> Output:
>
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
>
> > On 22 May 2017, at 16:33, Anastasios Zouzias  wrote:
> >
> > Hi Christoph,
> >
> > Take a look at this, you might end up having a similar case:
> >
> > http://www.spark.tc/using-sparks-cache-for-correctness-not-
> just-performance/
> >
> > If this is not the case, then I agree with you the kmeans should be
> partitioning agnostic (although I haven't check the code yet).
> >
> > Best,
> > Anastasios
> >
> >
> > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke 
> wrote:
> > Hi,
> >
> > I’m trying to figure out how to use KMeans in order to achieve
> reproducible results. I have found that running the same kmeans instance on
> the same data, with different partitioning will produce different
> clusterings.
> >
> > Given a simple KMeans run with fixed seed returns different results on
> the same
> > training data, if the training data is partitioned differently.
> >
> > Consider the following example. The same KMeans clustering set up is run
> on
> > identical data. The only difference is the partitioning of the training
> data
> > (one partition vs. four partitions).
> >
> > ```
> > import org.apache.spark.sql.DataFrame
> > import org.apache.spark.ml.clustering.KMeans
> > import org.apache.spark.ml.features.VectorAssembler
> >
> > // generate random data for clustering
> > val randomData = spark.range(1, 1000).withColumn("a",
> rand(123)).withColumn("b", rand(321))
> >
> > val vecAssembler = new VectorAssembler().setIn

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Yu Zhang
I agree with what Ankur said. The k-means seeding step (the 'takeSample'
method) runs in parallel, so each partition draws its sample points from the
local data, which is why the result depends on the partitioning. The seeding
method is based on the Bahmani et al. k-means|| algorithm, which gives
approximation guarantees on the k-means cost.

You could set the initial seeding points (an initial model), which avoids
this partition-dependence issue.
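
A minimal sketch of that suggestion, using the RDD-based mllib API (the DataFrame-based ml KMeans does not appear to expose an initial-model setter); the data and centers below are made up for illustration:

```
from pyspark.mllib.clustering import KMeans, KMeansModel

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Fixed starting centers: with the seeding pinned, the result no longer
# depends on how `points` happens to be partitioned.
init = KMeansModel([[0.5, 0.5], [8.5, 8.5]])

model = KMeans.train(points, 2, maxIterations=20, initialModel=init)
print(model.clusterCenters)
```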



Regards,
Yu Zhang

On Wed, May 24, 2017 at 1:49 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi Christoph,
>
> I am not an expert in ML and have not used Spark KMeans but your problem
> seems to be an issue of local minimum vs global minimum. You should run
> K-means multiple times with random starting point and also try with
> multiple values of K (unless you are already sure).
>
> Hope this helps.
>
> Thanks
> Ankur
>
>
>
> On Wed, May 24, 2017 at 2:15 AM, Christoph Bruecke 
> wrote:
>
>> Hi Anastasios,
>>
>> thanks for the reply but caching doesn’t seem to change anything.
>>
>> After further investigation it really seems that the RDD#takeSample
>> method is the cause of the non-reproducibility.
>>
>> Is this considered a bug and should I open an Issue for that?
>>
>> BTW: my example script contains a little type in line 3: it is `feature`
>> not `features` (mind the `s`).
>>
>> Best,
>> Christoph
>>
>> The script with caching
>>
>> ```
>> import org.apache.spark.sql.DataFrame
>> import org.apache.spark.ml.clustering.KMeans
>> import org.apache.spark.ml.feature.VectorAssembler
>> import org.apache.spark.storage.StorageLevel
>>
>> // generate random data for clustering
>> val randomData = spark.range(1, 1000).withColumn("a",
>> rand(123)).withColumn("b", rand(321))
>>
>> val vecAssembler = new VectorAssembler().setInputCols(Array("a",
>> "b")).setOutputCol("features")
>>
>> val data = vecAssembler.transform(randomData)
>>
>> // instantiate KMeans with fixed seed
>> val kmeans = new KMeans().setK(10).setSeed(9876L)
>>
>> // train the model with different partitioning
>> val dataWith1Partition = data.repartition(1)
>> // cache the data
>> dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK)
>> println("1 Partition: " + kmeans.fit(dataWith1Partition)
>> .computeCost(dataWith1Partition))
>>
>> val dataWith4Partition = data.repartition(4)
>> // cache the data
>> dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK)
>> println("4 Partition: " + kmeans.fit(dataWith4Partition)
>> .computeCost(dataWith4Partition))
>>
>>
>> ```
>>
>> Output:
>>
>> ```
>> 1 Partition: 16.028212597888057
>> 4 Partition: 16.14758460544976
>> ```
>>
>> > On 22 May 2017, at 16:33, Anastasios Zouzias  wrote:
>> >
>> > Hi Christoph,
>> >
>> > Take a look at this, you might end up having a similar case:
>> >
>> > http://www.spark.tc/using-sparks-cache-for-correctness-not-
>> just-performance/
>> >
>> > If this is not the case, then I agree with you the kmeans should be
>> partitioning agnostic (although I haven't check the code yet).
>> >
>> > Best,
>> > Anastasios
>> >
>> >
>> > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke 
>> wrote:
>> > Hi,
>> >
>> > I’m trying to figure out how to use KMeans in order to achieve
>> reproducible results. I have found that running the same kmeans instance on
>> the same data, with different partitioning will produce different
>> clusterings.
>> >
>> > Given a simple KMeans run with fixed seed returns different results on
>> the same
>> > training data, if the training data is partitioned differently.
>> >
>> > Consider the following example. The same KMeans clustering set up is
>> run on
>> > identical data. The only difference is the partitioning of the training
>> data
>> > (one partition vs. four partitions).
>> >
>> > ```
>> > import org.apache.spark.sql.DataFrame
>> > import org.apache.spark.ml.clustering.KMeans
>> > import org.apache.spark.ml.features.VectorAssembler
>> >
>> > // generate random data for clustering
>> > val randomData = spark.range(1, 1000).withColumn("a",
>> rand(123)).withColumn("b", rand(321))
>> >
>> > val vecAssembler = new VectorAssembler().se

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Ankur Srivastava
Hi Christoph,

I am not an expert in ML and have not used Spark KMeans but your problem
seems to be an issue of local minimum vs global minimum. You should run
K-means multiple times with random starting point and also try with
multiple values of K (unless you are already sure).

Hope this helps.

Thanks
Ankur



On Wed, May 24, 2017 at 2:15 AM, Christoph Bruecke 
wrote:

> Hi Anastasios,
>
> thanks for the reply but caching doesn’t seem to change anything.
>
> After further investigation it really seems that the RDD#takeSample method
> is the cause of the non-reproducibility.
>
> Is this considered a bug and should I open an Issue for that?
>
> BTW: my example script contains a little type in line 3: it is `feature`
> not `features` (mind the `s`).
>
> Best,
> Christoph
>
> The script with caching
>
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.storage.StorageLevel
>
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a",
> rand(123)).withColumn("b", rand(321))
>
> val vecAssembler = new VectorAssembler().setInputCols(Array("a",
> "b")).setOutputCol("features")
>
> val data = vecAssembler.transform(randomData)
>
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
>
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> // cache the data
> dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(
> dataWith1Partition))
>
> val dataWith4Partition = data.repartition(4)
> // cache the data
> dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(
> dataWith4Partition))
>
>
> ```
>
> Output:
>
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
>
> > On 22 May 2017, at 16:33, Anastasios Zouzias  wrote:
> >
> > Hi Christoph,
> >
> > Take a look at this, you might end up having a similar case:
> >
> > http://www.spark.tc/using-sparks-cache-for-correctness-
> not-just-performance/
> >
> > If this is not the case, then I agree with you the kmeans should be
> partitioning agnostic (although I haven't check the code yet).
> >
> > Best,
> > Anastasios
> >
> >
> > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke 
> wrote:
> > Hi,
> >
> > I’m trying to figure out how to use KMeans in order to achieve
> reproducible results. I have found that running the same kmeans instance on
> the same data, with different partitioning will produce different
> clusterings.
> >
> > Given a simple KMeans run with fixed seed returns different results on
> the same
> > training data, if the training data is partitioned differently.
> >
> > Consider the following example. The same KMeans clustering set up is run
> on
> > identical data. The only difference is the partitioning of the training
> data
> > (one partition vs. four partitions).
> >
> > ```
> > import org.apache.spark.sql.DataFrame
> > import org.apache.spark.ml.clustering.KMeans
> > import org.apache.spark.ml.features.VectorAssembler
> >
> > // generate random data for clustering
> > val randomData = spark.range(1, 1000).withColumn("a",
> rand(123)).withColumn("b", rand(321))
> >
> > val vecAssembler = new VectorAssembler().setInputCols(Array("a",
> "b")).setOutputCol("features")
> >
> > val data = vecAssembler.transform(randomData)
> >
> > // instantiate KMeans with fixed seed
> > val kmeans = new KMeans().setK(10).setSeed(9876L)
> >
> > // train the model with different partitioning
> > val dataWith1Partition = data.repartition(1)
> > println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(
> dataWith1Partition))
> >
> > val dataWith4Partition = data.repartition(4)
> > println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(
> dataWith4Partition))
> > ```
> >
> > I get the following related cost
> >
> > ```
> > 1 Partition: 16.028212597888057
> > 4 Partition: 16.14758460544976
> > ```
> >
> > What I want to achieve is that repeated computations of the KMeans
> Clustering should yield identical result on identical training data,
> regardless of the partitioning.
> >
> > Looking through the Spark source code, I guess the cause is the
> initialization method of KMeans which in turn uses the `takeSample` method,
> which does not seem to be partition agnostic.
> >
> > Is this behaviour expected? Is there anything I could do to achieve
> reproducible results?
> >
> > Best,
> > Christoph
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
> >
> >
> > --
> > -- Anastasios Zouzias
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Bruecke
Hi Anastasios,

thanks for the reply but caching doesn’t seem to change anything.

After further investigation it really seems that the RDD#takeSample method is 
the cause of the non-reproducibility.

Is this considered a bug and should I open an Issue for that?

BTW: my example script contains a little typo in line 3: it is `feature` not 
`features` (mind the `s`).

Best,
Christoph

The script with caching

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.storage.StorageLevel

// generate random data for clustering
val randomData = spark.range(1, 1000).withColumn("a", 
rand(123)).withColumn("b", rand(321))

val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
"b")).setOutputCol("features")

val data = vecAssembler.transform(randomData)

// instantiate KMeans with fixed seed
val kmeans = new KMeans().setK(10).setSeed(9876L)

// train the model with different partitioning
val dataWith1Partition = data.repartition(1)
// cache the data
dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK)
println("1 Partition: " + 
kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))

val dataWith4Partition = data.repartition(4)
// cache the data
dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK)
println("4 Partition: " + 
kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))


```

Output:

```
1 Partition: 16.028212597888057
4 Partition: 16.14758460544976
```
 
> On 22 May 2017, at 16:33, Anastasios Zouzias  wrote:
> 
> Hi Christoph,
> 
> Take a look at this, you might end up having a similar case:
> 
> http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/
> 
> If this is not the case, then I agree with you the kmeans should be 
> partitioning agnostic (although I haven't check the code yet).
> 
> Best,
> Anastasios
> 
> 
> On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke  
> wrote:
> Hi,
> 
> I’m trying to figure out how to use KMeans in order to achieve reproducible 
> results. I have found that running the same kmeans instance on the same data, 
> with different partitioning will produce different clusterings.
> 
> Given a simple KMeans run with fixed seed returns different results on the 
> same
> training data, if the training data is partitioned differently.
> 
> Consider the following example. The same KMeans clustering set up is run on
> identical data. The only difference is the partitioning of the training data
> (one partition vs. four partitions).
> 
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.features.VectorAssembler
> 
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", 
> rand(123)).withColumn("b", rand(321))
> 
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
> "b")).setOutputCol("features")
> 
> val data = vecAssembler.transform(randomData)
> 
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
> 
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + 
> kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
> 
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + 
> kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> ```
> 
> I get the following related cost
> 
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
> 
> What I want to achieve is that repeated computations of the KMeans Clustering 
> should yield identical result on identical training data, regardless of the 
> partitioning.
> 
> Looking through the Spark source code, I guess the cause is the 
> initialization method of KMeans which in turn uses the `takeSample` method, 
> which does not seem to be partition agnostic.
> 
> Is this behaviour expected? Is there anything I could do to achieve 
> reproducible results?
> 
> Best,
> Christoph
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 
> 
> -- 
> -- Anastasios Zouzias


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: KMeans Clustering is not Reproducible

2017-05-22 Thread Anastasios Zouzias
Hi Christoph,

Take a look at this, you might end up having a similar case:

http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/

If this is not the case, then I agree with you that the kmeans should be
partitioning agnostic (although I haven't checked the code yet).

Best,
Anastasios


On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke 
wrote:

> Hi,
>
> I’m trying to figure out how to use KMeans in order to achieve
> reproducible results. I have found that running the same kmeans instance on
> the same data, with different partitioning will produce different
> clusterings.
>
> Given a simple KMeans run with fixed seed returns different results on the
> same
> training data, if the training data is partitioned differently.
>
> Consider the following example. The same KMeans clustering set up is run on
> identical data. The only difference is the partitioning of the training
> data
> (one partition vs. four partitions).
>
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.features.VectorAssembler
>
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a",
> rand(123)).withColumn("b", rand(321))
>
> val vecAssembler = new VectorAssembler().setInputCols(Array("a",
> "b")).setOutputCol("features")
>
> val data = vecAssembler.transform(randomData)
>
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
>
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(
> dataWith1Partition))
>
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(
> dataWith4Partition))
> ```
>
> I get the following related cost
>
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
>
> What I want to achieve is that repeated computations of the KMeans
> Clustering should yield identical result on identical training data,
> regardless of the partitioning.
>
> Looking through the Spark source code, I guess the cause is the
> initialization method of KMeans which in turn uses the `takeSample` method,
> which does not seem to be partition agnostic.
>
> Is this behaviour expected? Is there anything I could do to achieve
> reproducible results?
>
> Best,
> Christoph
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
-- Anastasios Zouzias



KMeans Clustering is not Reproducible

2017-05-22 Thread Christoph Bruecke
Hi,

I’m trying to figure out how to use KMeans in order to achieve reproducible 
results. I have found that running the same kmeans instance on the same data, 
with different partitioning will produce different clusterings.

A simple KMeans run with a fixed seed returns different results on the same
training data if the training data is partitioned differently.

Consider the following example. The same KMeans clustering set up is run on
identical data. The only difference is the partitioning of the training data
(one partition vs. four partitions).

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.features.VectorAssembler

// generate random data for clustering
val randomData = spark.range(1, 1000).withColumn("a", 
rand(123)).withColumn("b", rand(321))

val vecAssembler = new VectorAssembler().setInputCols(Array("a", 
"b")).setOutputCol("features")

val data = vecAssembler.transform(randomData)

// instantiate KMeans with fixed seed
val kmeans = new KMeans().setK(10).setSeed(9876L)

// train the model with different partitioning
val dataWith1Partition = data.repartition(1)
println("1 Partition: " + 
kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))

val dataWith4Partition = data.repartition(4)
println("4 Partition: " + 
kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
```

I get the following related cost

```
1 Partition: 16.028212597888057
4 Partition: 16.14758460544976
```

What I want to achieve is that repeated computations of the KMeans Clustering 
should yield identical result on identical training data, regardless of the 
partitioning.

Looking through the Spark source code, I guess the cause is the initialization 
method of KMeans which in turn uses the `takeSample` method, which does not 
seem to be partition agnostic.

Is this behaviour expected? Is there anything I could do to achieve 
reproducible results?

Best,
Christoph
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Yuhao Yang
Hi Julian,

Thanks for reporting this. This is a valid issue and I created
https://issues.apache.org/jira/browse/SPARK-19957 to track it.

Right now the seed is set to this.getClass.getName.hashCode.toLong by
default, which indeed stays the same across multiple fits. Feel free to
leave your comments or send a PR for the fix.

For your problem, you may add .setSeed(new Random().nextLong()) before
fit() as a workaround.
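
A PySpark equivalent of that workaround, as a minimal sketch assuming a DataFrame `training_df` with a "features" column: supply a fresh random seed explicitly on each fit so the "random" init mode actually varies between runs.

```
import random
from pyspark.ml.clustering import KMeans

# New seed on every run instead of the class-name-derived default.
kmeans = KMeans(k=5, initMode="random", seed=random.getrandbits(63))
model = kmeans.fit(training_df)
```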

Thanks,
Yuhao

2017-03-14 5:46 GMT-07:00 Julian Keppel :

> I'm sorry, I missed some important informations. I use Spark version 2.0.2
> in Scala 2.11.8.
>
> 2017-03-14 13:44 GMT+01:00 Julian Keppel :
>
>> Hi everybody,
>>
>> I make some experiments with the Spark kmeans implementation of the new
>> DataFrame-API. I compare clustering results of different runs with
>> different parameters. I recognized that for random initialization mode, the
>> seed value is the same every time. How is it calculated? In my
>> understanding the seed should be random if it is not provided by the user.
>>
>> Thank you for you help.
>>
>> Julian
>>
>
>


Re: [MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
I'm sorry, I missed some important information. I use Spark version 2.0.2
with Scala 2.11.8.

2017-03-14 13:44 GMT+01:00 Julian Keppel :

> Hi everybody,
>
> I make some experiments with the Spark kmeans implementation of the new
> DataFrame-API. I compare clustering results of different runs with
> different parameters. I recognized that for random initialization mode, the
> seed value is the same every time. How is it calculated? In my
> understanding the seed should be random if it is not provided by the user.
>
> Thank you for you help.
>
> Julian
>


[MLlib] kmeans random initialization, same seed every time

2017-03-14 Thread Julian Keppel
Hi everybody,

I am running some experiments with the Spark KMeans implementation of the new
DataFrame API. I compare the clustering results of different runs with
different parameters. I noticed that for the random initialization mode, the
seed value is the same every time. How is it calculated? In my
understanding the seed should be random if it is not provided by the user.

Thank you for your help.

Julian


Re: ML version of Kmeans

2017-01-31 Thread Hollin Wilkins
Hey,

You could also take a look at MLeap, which provides a runtime for any Spark
transformer and does not have any dependencies on a SparkContext or Spark
libraries (excepting MLlib-local for linear algebra).

https://github.com/combust/mleap

On Tue, Jan 31, 2017 at 2:33 AM, Aseem Bansal  wrote:

> If you want to predict using dataset then transform is the way to go. If
> you want to predict on vectors then you will have to wait on this issue to
> be completed https://issues.apache.org/jira/browse/SPARK-10413
>
> On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau 
> wrote:
>
>> You most likely want the transform function on KMeansModel (although that
>> works on a dataset input rather than a single element at a time).
>>
>> On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am not able to find predict method on "ML" version of Kmeans.
>>>
>>> Mllib version has a predict method.  KMeansModel.predict(point: Vector)
>>> .
>>> How to predict the cluster point for new vectors in ML version of kmeans
>>> ?
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>>
>> --
>> Cell : 425-233-8271 <(425)%20233-8271>
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Re: ML version of Kmeans

2017-01-31 Thread Aseem Bansal
If you want to predict using dataset then transform is the way to go. If
you want to predict on vectors then you will have to wait on this issue to
be completed https://issues.apache.org/jira/browse/SPARK-10413

On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau  wrote:

> You most likely want the transform function on KMeansModel (although that
> works on a dataset input rather than a single element at a time).
>
> On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I am not able to find predict method on "ML" version of Kmeans.
>>
>> Mllib version has a predict method.  KMeansModel.predict(point: Vector)
>> .
>> How to predict the cluster point for new vectors in ML version of kmeans ?
>>
>> Regards,
>> Rajesh
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: ML version of Kmeans

2017-01-31 Thread Holden Karau
You most likely want the transform function on KMeansModel (although that
works on a dataset input rather than a single element at a time).
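
A minimal sketch of that transform route, assuming `model` is a fitted pyspark.ml KMeansModel trained on 2-dimensional features and `spark` is an active SparkSession: wrap the new points in a DataFrame and read the "prediction" column.

```
from pyspark.ml.linalg import Vectors

# Two hypothetical new points, matching the dimensionality of the training data.
new_points = spark.createDataFrame(
    [(Vectors.dense([0.1, 0.2]),), (Vectors.dense([9.0, 8.5]),)],
    ["features"])

model.transform(new_points).select("features", "prediction").show()
```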

On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> I am not able to find predict method on "ML" version of Kmeans.
>
> Mllib version has a predict method.  KMeansModel.predict(point: Vector)
> .
> How to predict the cluster point for new vectors in ML version of kmeans ?
>
> Regards,
> Rajesh
>



-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


ML version of Kmeans

2017-01-31 Thread Madabhattula Rajesh Kumar
Hi,

I am not able to find a predict method on the "ML" version of KMeans.

The MLlib version has a predict method: KMeansModel.predict(point: Vector).
How do I predict the cluster for new vectors in the ML version of KMeans?

Regards,
Rajesh


PySpark 2: Kmeans The input data is not directly cached

2016-11-03 Thread Zakaria Hili
Hi,

I don't know why I receive the message

 WARN KMeans: The input data is not directly cached, which may hurt
performance if its parent RDDs are also uncached.

when I try to use Spark KMeans:

df_Part = assembler.transform(df_Part)
df_Part.cache()

while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1

It says that my input (DataFrame) is not cached!

I tried printing df_Part.is_cached and I received True, which means that my
DataFrame is cached. So why is Spark still warning me about this?

thank you in advance




Re: Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Bryan Cutler
The algorithm update is just broken into 2 steps: trainOn, which
learns/updates the cluster centers, and predictOn, which predicts the cluster
assignment for data.

The StreamingKMeansExample you reference breaks up data into training and
test because you might want to score the predictions.  If you don't care
about that, you could just use a single stream for both steps.
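
A minimal PySpark sketch of that single-stream variant, with placeholder paths, k, and dimension; the same DStream feeds both trainOn and predictOn:

```
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

ssc = StreamingContext(sc, 5)  # 5-second batches
# Placeholder directory; each file line is a vector string like "[1.0,2.0]".
vectors = ssc.textFileStream("hdfs:///path/in").map(Vectors.parse)

model = StreamingKMeans(k=3, decayFactor=1.0) \
    .setRandomCenters(dim=2, weight=0.0, seed=42)

model.trainOn(vectors)             # updates the cluster centers each batch
model.predictOn(vectors).pprint()  # assigns each incoming point to a cluster

ssc.start()
ssc.awaitTermination()
```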

On Thu, Aug 11, 2016 at 9:14 AM, Ahmed Sadek  wrote:

> Dear All,
>
> I was wondering why there is training data and testing data in kmeans ?
> Shouldn't it be unsupervised learning with just access to stream data ?
>
> I found similar question but couldn't understand the answer.
> http://stackoverflow.com/questions/30972057/is-the-
> streaming-k-means-clustering-predefined-in-mllib-library-of-spark-supervi
>
> Thanks!
> Ahmed
>


Why training data in Kmeans Spark streaming clustering

2016-08-11 Thread Ahmed Sadek
Dear All,

I was wondering why there is training data and testing data in kmeans?
Shouldn't it be unsupervised learning with just access to the stream data?

I found a similar question but couldn't understand the answer.
http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering-predefined-in-mllib-library-of-spark-supervi

Thanks!
Ahmed


RE: bisecting kmeans model tree

2016-08-09 Thread Huang, Qian
There seems to be an existing JIRA for this.
https://issues.apache.org/jira/browse/SPARK-11664

From: Yanbo Liang [mailto:yblia...@gmail.com]
Sent: Saturday, July 16, 2016 6:18 PM
To: roni 
Cc: user@spark.apache.org
Subject: Re: bisecting kmeans model tree

Currently we do not expose the APIs to get the Bisecting KMeans tree structure, 
they are private in the ml.clustering package scope.
But I think we should make a plan to expose these APIs like what we did for 
Decision Tree.

Thanks
Yanbo

2016-07-12 11:45 GMT-07:00 roni 
mailto:roni.epi...@gmail.com>>:
Hi Spark,Mlib experts,
Anyone who can shine light on this?
Thanks
_R

On Thu, Apr 21, 2016 at 12:46 PM, roni 
mailto:roni.epi...@gmail.com>> wrote:
Hi ,
 I want to get the bisecting kmeans tree structure to show a dendogram  on the 
heatmap I am generating based on the hierarchical clustering of data.
 How do I get that using mlib .
Thanks
-Roni




Re: Kmeans dataset initialization

2016-08-06 Thread Tony Lane
Can anyone suggest how I can initialize the KMeans structure directly from a
Dataset of Rows?

On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane  wrote:

> I have all the data required for KMeans in a dataset in memory
>
> Standard approach to load this data from a file is
> spark.read().format("libsvm").load(filename)
>
> where the file has data in the format
> 0 1:0.0 2:0.0 3:0.0
>
>
> How do i this from an in-memory dataset already present.
> Any suggestions ?
>
> -Tony
>
>


Kmeans dataset initialization

2016-08-05 Thread Tony Lane
I have all the data required for KMeans in a dataset in memory

Standard approach to load this data from a file is
spark.read().format("libsvm").load(filename)

where the file has data in the format
0 1:0.0 2:0.0 3:0.0


How do I do this from an in-memory dataset already present?
Any suggestions?

-Tony
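
A PySpark sketch of the same idea, assuming an active SparkSession `spark` and made-up column names: assemble the numeric columns of the in-memory DataFrame into a "features" vector column and fit ml KMeans on that, with no libsvm file involved.

```
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# In-memory rows instead of spark.read().format("libsvm").load(...)
df = spark.createDataFrame(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (9.0, 8.0, 7.0)],
    ["x1", "x2", "x3"])

assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
features_df = assembler.transform(df)

model = KMeans(k=2, seed=1).fit(features_df)
print(model.clusterCenters())
```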


Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose APIs to get the Bisecting KMeans tree
structure; they are private to the ml.clustering package scope.
But I think we should make a plan to expose these APIs, like what we did for
Decision Tree.

Thanks
Yanbo

2016-07-12 11:45 GMT-07:00 roni :

> Hi Spark,Mlib experts,
> Anyone who can shine light on this?
> Thanks
> _R
>
> On Thu, Apr 21, 2016 at 12:46 PM, roni  wrote:
>
>> Hi ,
>>  I want to get the bisecting kmeans tree structure to show a dendogram
>>  on the heatmap I am generating based on the hierarchical clustering of
>> data.
>>  How do I get that using mlib .
>> Thanks
>> -Roni
>>
>
>


Re: bisecting kmeans model tree

2016-07-12 Thread roni
Hi Spark/MLlib experts,
Can anyone shed light on this?
Thanks
_R

On Thu, Apr 21, 2016 at 12:46 PM, roni  wrote:

> Hi ,
>  I want to get the bisecting kmeans tree structure to show a dendogram  on
> the heatmap I am generating based on the hierarchical clustering of data.
>  How do I get that using mlib .
> Thanks
> -Roni
>


Working of Streaming Kmeans

2016-07-05 Thread Holden Karau
Hi Biplob,

The current Streaming KMeans code only updates the model with data that comes
in through training (i.e. trainOn); predictOn does not update the model.

Cheers,

Holden :)

P.S.

Traffic on the list might have been a bit slower right now because of the
Canada Day and 4th of July weekends, respectively.

On Sunday, July 3, 2016, Biplob Biswas  wrote:

> Hi,
>
> Can anyone please explain this?
>
> Thanks & Regards
> Biplob Biswas
>
> On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas 
> wrote:
>
>> Hi,
>>
>> I wanted to ask a very basic question about the working of Streaming
>> Kmeans.
>>
>> Does the model update only when training (i.e. training dataset is used)
>> or
>> does it update on the PredictOnValues function as well for the test
>> dataset?
>>
>> Thanks and Regards
>> Biplob
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Working-of-Streaming-Kmeans-tp27268.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Working of Streaming Kmeans

2016-07-03 Thread Biplob Biswas
Hi,

Can anyone please explain this?

Thanks & Regards
Biplob Biswas

On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas 
wrote:

> Hi,
>
> I wanted to ask a very basic question about the working of Streaming
> Kmeans.
>
> Does the model update only when training (i.e. training dataset is used) or
> does it update on the PredictOnValues function as well for the test
> dataset?
>
> Thanks and Regards
> Biplob
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Working-of-Streaming-Kmeans-tp27268.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Working of Streaming Kmeans

2016-07-02 Thread Biplob Biswas
Hi,

I wanted to ask a very basic question about the working of Streaming KMeans.

Does the model update only when training (i.e. when the training dataset is used), or
does it also update in the predictOnValues function for the test dataset?

Thanks and Regards
Biplob




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Working-of-Streaming-Kmeans-tp27268.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Kmeans Streaming process flow

2016-06-10 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

I am in the process of implementing a Spark Streaming application to do 
clustering of some events. I have a DStream of vectors that I have created from 
each event. Now I am trying to apply clustering. I referred to the following example 
in the Spark GitHub repo.

There is a train method and a predictOnValues method. I am confused about how to map 
this example to my use case. In my case, I would be getting the stream of 
events 24 * 7. I am not sure how to split the "all day" data separately for the 
train and predict methods. Or should this streaming application be run in train 
mode for a few days and in predict mode later? I am not able to find a suitable 
example on the web. Please advise. Thanks.

https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala

object StreamingKMeansExample {

  def main(args: Array[String]) {
    if (args.length != 5) {
      System.err.println(
        "Usage: StreamingKMeansExample " +
          "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }

    val conf = new SparkConf().setMaster("local").setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Regards,
Anand.C


bisecting kmeans model tree

2016-04-21 Thread roni
Hi,
I want to get the bisecting KMeans tree structure to show a dendrogram on
the heatmap I am generating based on the hierarchical clustering of data.
How do I get that using MLlib?
Thanks
-Roni


bisecting kmeans tree

2016-04-20 Thread roni
Hi,
I want to get the bisecting KMeans tree structure to show on the heatmap I
am generating based on the hierarchical clustering of data.
How do I get that using MLlib?
Thanks
-R


Re: Why KMeans with mllib is so slow ?

2016-03-14 Thread Priya Ch
Hi Xi Shen,

  Changing the initialization mode from "k-means||" to "random" decreased
the execution time from 2 hrs to 6 min. However, by default the number of runs
is 1. If I try to set the number of runs to 10, I again see an increase in
job execution time.

How should I proceed on this?

By the way, how is the "random" initialization mode different from
"k-means||"?


Regards,
Padma Ch
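
For reference, a minimal sketch of the two initialization modes being compared, assuming an RDD `vectors` of feature vectors in the RDD-based mllib API; k and the iteration count are placeholders:

```
from pyspark.mllib.clustering import KMeans

# Default parallel seeding (k-means||); initializationSteps controls its rounds.
model_parallel = KMeans.train(vectors, 100, maxIterations=20,
                              initializationMode="k-means||",
                              initializationSteps=2)

# Random seeding: cheaper to initialize, but centers may start farther from optimal.
model_random = KMeans.train(vectors, 100, maxIterations=20,
                            initializationMode="random")
```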



On Sun, Mar 13, 2016 at 12:37 PM, Xi Shen  wrote:

> Hi Chitturi,
>
> Please checkout
> https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int
> ).
>
> I think it is caused by the initialization step. the "kmeans||" method
> does not initialize dataset in parallel. If your dataset is large, it takes
> a long time to initialize. Just changed to "random".
>
> Hope it helps.
>
>
> On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma <
> learnings.chitt...@gmail.com> wrote:
>
>> Hi All,
>>
>>   I  am facing the same issue. taking k values from 60 to 120
>> incrementing by 10 each time i.e k takes value 60,70,80,...120 the
>> algorithm takes around 2.5h on a 800 MB data set with 38 dimensions.
>> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
>> <[hidden email] <http:///user/SendEmail.jtp?type=node&node=26467&i=0>>
>> wrote:
>>
>>> Hi Jao,
>>>
>>> Sorry to pop up this old thread. I am have the same problem like you
>>> did. I want to know if you have figured out how to improve k-means on
>>> Spark.
>>>
>>> I am using Spark 1.2.0. My data set is about 270k vectors, each has
>>> about 350 dimensions. If I set k=500, the job takes about 3hrs on my
>>> cluster. The cluster has 7 executors, each has 8 cores...
>>>
>>> If I set k=5000 which is the required value for my task, the job goes on
>>> forever...
>>>
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> --
>>> If you reply to this email, your message will be added to the discussion
>>> below:
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
>>>
>> To start a new topic under Apache Spark User List, email [hidden email]
>>> <http:///user/SendEmail.jtp?type=node&node=26467&i=1>
>>> To unsubscribe from Apache Spark User List, click here.
>>> NAML
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>>
>>
>>
>> --
>> View this message in context: Re: Why KMeans with mllib is so slow ?
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p26467.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
> --
>
> Regards,
> David
>


Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Xi Shen
Hi Chitturi,

Please checkout
https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int
).

I think it is caused by the initialization step. The "kmeans||" method does
not initialize the dataset in parallel. If your dataset is large, it takes a
long time to initialize. Just change it to "random".

Hope it helps.


On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma 
wrote:

> Hi All,
>
>   I  am facing the same issue. taking k values from 60 to 120 incrementing
> by 10 each time i.e k takes value 60,70,80,...120 the algorithm takes
> around 2.5h on a 800 MB data set with 38 dimensions.
> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List]
> <[hidden email] <http:///user/SendEmail.jtp?type=node&node=26467&i=0>>
> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I am have the same problem like you did.
>> I want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
>> cluster has 7 executors, each has 8 cores...
>>
>> If I set k=5000 which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
>> --
>> If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
>>
> To start a new topic under Apache Spark User List, email [hidden email]
>> <http:///user/SendEmail.jtp?type=node&node=26467&i=1>
>> To unsubscribe from Apache Spark User List, click here.
>> NAML
>> <http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>>
>
>
> --
> View this message in context: Re: Why KMeans with mllib is so slow ?
> <http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p26467.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
-- 

Regards,
David


Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Chitturi Padma
Hi All,

  I am facing the same issue. Taking k values from 60 to 120, incrementing
by 10 each time (i.e. k takes values 60, 70, 80, ..., 120), the algorithm takes
around 2.5h on an 800 MB data set with 38 dimensions.

On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] <
ml-node+s1001560n2227...@n3.nabble.com> wrote:

> Hi Jao,
>
> Sorry to pop up this old thread. I am have the same problem like you did.
> I want to know if you have figured out how to improve k-means on Spark.
>
> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
> cluster has 7 executors, each has 8 cores...
>
> If I set k=5000 which is the required value for my task, the job goes on
> forever...
>
>
> Thanks,
> David
>
>





Re: Spark Mllib kmeans execution

2016-03-02 Thread Sonal Goyal
It will run distributed
On Mar 2, 2016 3:00 PM, "Priya Ch"  wrote:

> Hi All,
>
>   I am running k-means clustering algorithm. Now, when I am running the
> algorithm as -
>
> val conf = new SparkConf
> val sc = new SparkContext(conf)
> .
> .
> val kmeans = new KMeans()
> val model = kmeans.run(RDD[Vector])
> .
> .
> .
> The 'kmeans' object gets created on the driver. Now, does *kmeans.run()* get
> executed on each partition of the RDD in a distributed fashion, or is the
> entire RDD brought to the driver and processed there?
>
> Thanks,
> Padma Ch
>
>
>
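
To make "it will run distributed" concrete, here is a minimal sketch (k and iteration counts are illustrative): kmeans.run() launches distributed Spark jobs over the RDD's partitions, and only the k cluster centres come back to the driver inside the returned KMeansModel.

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// run() triggers distributed jobs over the partitions of `vectors`;
// only the resulting k centres are held on the driver, inside the model.
def fit(vectors: RDD[Vector]): Array[Vector] = {
  val model = new KMeans().setK(3).setMaxIterations(20).run(vectors)
  model.clusterCenters   // an Array of k Vectors, small enough for the driver
}
```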


Spark Mllib kmeans execution

2016-03-02 Thread Priya Ch
Hi All,

  I am running k-means clustering algorithm. Now, when I am running the
algorithm as -

val conf = new SparkConf
val sc = new SparkContext(conf)
.
.
val kmeans = new KMeans()
val model = kmeans.run(RDD[Vector])
.
.
.
The 'kmeans' object gets created on the driver. Now, does *kmeans.run()* get
executed on each partition of the RDD in a distributed fashion, or is the
entire RDD brought to the driver and processed there?

Thanks,
Padma Ch


Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-09 Thread Li Ming Tsai
Hi,


It looks like the k-means++ (k-means||) initialisation is slow
(SPARK-3424 <https://issues.apache.org/jira/browse/SPARK-3424>): the
initialisation phase runs locally on the driver, using only 1 core.


If I use random initialisation, the job completes in 1.5 minutes compared to 1 hr+.


Should I move this to the dev list?


Regards,

Liming



From: Li Ming Tsai 
Sent: Sunday, February 7, 2016 10:03 AM
To: user@spark.apache.org
Subject: Re: Slowness in Kmeans calculating fastSquaredDistance


Hi,


I did more investigation and found out that BLAS.scala calls the pure-Java 
reference implementation (f2jBLAS) for level 1 routines.


I even patched it to use nativeBlas.ddot but it has no material impact.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala#L126


private def dot(x: DenseVector, y: DenseVector): Double = {

val n = x.size

f2jBLAS.ddot(n, x.values, 1, y.values, 1)

  }


Maybe Xiangrui can comment on this?




From: Li Ming Tsai 
Sent: Friday, February 5, 2016 10:56 AM
To: user@spark.apache.org
Subject: Slowness in Kmeans calculating fastSquaredDistance


Hi,


I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl 
flag.


I am using spark local[4] mode and I run it like this:
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell ...

I have also added the following to /opt/intel/mkl/lib/intel64:
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so.3 -> libmkl_rt.so


I believe (???) that I'm using Intel MKL because the warnings went away:

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS

After collectAsMap, there is no progress but I can observe that only 1 CPU is 
being utilised with the following stack trace:

"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 
nid=0xbdc runnable [0x7fbf12205000]

   java.lang.Thread.State: RUNNABLE

at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)

at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)

at 
org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)


These last few steps take more than half of the total time for a 1M x 100 dataset.


The code is just:

val clusters = KMeans.train(parsedData, 1000, 1)


Shouldn't it be utilising all the cores for the dot product? Is this a 
misconfiguration?


Thanks!
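
For reference, a hedged sketch of the change described at the top of this message, switching the initialiser from the default k-means|| to random (k and iteration count mirror the call above):

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// "random" just samples k starting centres, avoiding the driver-side
// k-means|| initialisation phase, at the cost of possibly worse seeds.
def trainWithRandomInit(parsedData: RDD[Vector]) =
  new KMeans()
    .setK(1000)
    .setMaxIterations(1)
    .setInitializationMode(KMeans.RANDOM)   // same as the string "random"
    .run(parsedData)
```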




Re: Slowness in Kmeans calculating fastSquaredDistance

2016-02-06 Thread Li Ming Tsai
Hi,


I did more investigation and found out that BLAS.scala calls the pure-Java 
reference implementation (f2jBLAS) for level 1 routines.


I even patched it to use nativeBlas.ddot but it has no material impact.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala#L126


private def dot(x: DenseVector, y: DenseVector): Double = {

val n = x.size

f2jBLAS.ddot(n, x.values, 1, y.values, 1)

  }


Maybe Xiangrui can comment on this?




From: Li Ming Tsai 
Sent: Friday, February 5, 2016 10:56 AM
To: user@spark.apache.org
Subject: Slowness in Kmeans calculating fastSquaredDistance


Hi,


I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl 
flag.


I am using spark local[4] mode and I run it like this:
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell ...

I have also added the following to /opt/intel/mkl/lib/intel64:
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so.3 -> libmkl_rt.so


I believe (???) that I'm using Intel MKL because the warnings went away:

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS

After collectAsMap, there is no progress but I can observe that only 1 CPU is 
being utilised with the following stack trace:

"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 
nid=0xbdc runnable [0x7fbf12205000]

   java.lang.Thread.State: RUNNABLE

at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)

at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)

at 
org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)


These last few steps take more than half of the total time for a 1M x 100 dataset.


The code is just:

val clusters = KMeans.train(parsedData, 1000, 1)


Shouldn't it be utilising all the cores for the dot product? Is this a 
misconfiguration?


Thanks!




Slowness in Kmeans calculating fastSquaredDistance

2016-02-04 Thread Li Ming Tsai
Hi,


I'm using INTEL MKL on Spark 1.6.0 which I built myself with the -Pnetlib-lgpl 
flag.


I am using spark local[4] mode and I run it like this:
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell ...

I have also added the following to /opt/intel/mkl/lib/intel64:
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root12 Feb  1 09:18 liblapack.so.3 -> libmkl_rt.so


I believe (???) that I'm using Intel MKL because the warnings went away:

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS

16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS

After collectAsMap, there is no progress but I can observe that only 1 CPU is 
being utilised with the following stack trace:

"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 
nid=0xbdc runnable [0x7fbf12205000]

   java.lang.Thread.State: RUNNABLE

at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)

at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)

at 
org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)

at 
org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)

at 
org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)


These last few steps take more than half of the total time for a 1M x 100 dataset.


The code is just:

val clusters = KMeans.train(parsedData, 1000, 1)


Shouldn't it be utilising all the cores for the dot product? Is this a 
misconfiguration?


Thanks!




Visualization of KMeans cluster in Spark

2016-01-28 Thread Yogesh Vyas
Hi,

Is there any way to visualize the KMeans clusters in Spark?
Can we connect Plotly with Apache Spark in Java?

Thanks,
Yogesh




Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2016-01-01 Thread Yanbo Liang
Hi Jia,

I think the example you provided is not very suitable for illustrating what
the driver and executors do, because it does not show the internal implementation
of the KMeans algorithm.
You can refer to the source code of MLlib KMeans (
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L227
).
In short, the driver needs O(number of centers) memory, but each
executor needs O(partition size) memory. Usually the whole dataset is large and
distributed across many executors, while the set of centers is not very big,
even compared with the dataset on a single executor.

Cheers
Yanbo

2015-12-31 22:31 GMT+08:00 Jia Zou :

> Thanks, Yanbo.
> The results become much more reasonable, after I set driver memory to 5GB
> and increase worker memory to 25GB.
>
> So, my question is for following code snippet extracted from main method
> in JavaKMeans.java in examples, what will the driver do? and what will the
> worker do?
>
> I didn't understand this problem well by reading
> https://spark.apache.org/docs/1.1.0/cluster-overview.html and
> http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark
>
> SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans");
>
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
>
> JavaRDD<String> lines = sc.textFile(inputFile);
>
> JavaRDD<Vector> points = lines.map(new ParsePoint());
>
>  points.persist(StorageLevel.MEMORY_AND_DISK());
>
> KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs,
> KMeans.K_MEANS_PARALLEL());
>
>
> Thank you very much!
>
> Best Regards,
> Jia
>
> On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang  wrote:
>
>> Hi Jia,
>>
>> You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether
>> it can produce stable performance. The MEMORY_AND_DISK storage level
>> will store on disk the partitions that don't fit in memory, and read them
>> from there when they are needed.
>> Actually, it's not necessary to set so large driver memory in your case,
>> because KMeans use low memory for driver if your k is not very large.
>>
>> Cheers
>> Yanbo
>>
>> 2015-12-30 22:20 GMT+08:00 Jia Zou :
>>
>>> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8
>>> CPU cores and 30GB memory. Executor memory is set to 15GB, and driver
>>> memory is set to 15GB.
>>>
>>> The observation is that, when input data size is smaller than 15GB, the
>>> performance is quite stable. However, when input data becomes larger than
>>> that, the performance will be extremely unpredictable. For example, for
>>> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three
>>> dramatically different testing results: 27mins, 61mins and 114 mins. (All
>>> settings are the same for the 3 tests, and I will create input data
>>> immediately before running each of the tests to keep OS buffer cache hot.)
>>>
>>> Anyone can help to explain this? Thanks very much!
>>>
>>>
>>
>
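
To illustrate the split Yanbo describes, here is a hypothetical, much-simplified single k-means iteration (not MLlib's actual implementation): each executor only touches its own partition plus a broadcast copy of the k centres, and the driver only ever holds the k centres and the k (sum, count) pairs that come back.

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

def lloydStep(sc: SparkContext, data: RDD[Vector], centers: Array[Vector]): Array[Vector] = {
  val bcCenters = sc.broadcast(centers)                     // small: O(k * dimensions)
  data.map { p =>
      // executor-side work: assign each point of the local partition to its nearest centre
      val closest = bcCenters.value.indices.minBy(i => Vectors.sqdist(p, bcCenters.value(i)))
      (closest, (p.toArray, 1L))
    }
    .reduceByKey { case ((sum1, n1), (sum2, n2)) =>
      (sum1.zip(sum2).map { case (a, b) => a + b }, n1 + n2)
    }
    .collect()                                              // only k (sum, count) pairs reach the driver
    .map { case (_, (sum, n)) => Vectors.dense(sum.map(_ / n)) }  // empty clusters are simply dropped here
}
```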


Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-31 Thread Jia Zou
Thanks, Yanbo.
The results became much more reasonable after I set driver memory to 5GB
and increased worker memory to 25GB.

So, my question is for following code snippet extracted from main method in
JavaKMeans.java in examples, what will the driver do? and what will the
worker do?

I didn't understand this problem well by reading
https://spark.apache.org/docs/1.1.0/cluster-overview.html and
http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark

SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans");

JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaRDD<String> lines = sc.textFile(inputFile);

JavaRDD<Vector> points = lines.map(new ParsePoint());

 points.persist(StorageLevel.MEMORY_AND_DISK());

KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs,
KMeans.K_MEANS_PARALLEL());


Thank you very much!

Best Regards,
Jia

On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang  wrote:

> Hi Jia,
>
> You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it
> can produce stable performance. The MEMORY_AND_DISK storage level will
> store on disk the partitions that don't fit in memory, and read them from
> there when they are needed.
> Actually, it's not necessary to set so large driver memory in your case,
> because KMeans use low memory for driver if your k is not very large.
>
> Cheers
> Yanbo
>
> 2015-12-30 22:20 GMT+08:00 Jia Zou :
>
>> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU
>> cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
>> set to 15GB.
>>
>> The observation is that, when input data size is smaller than 15GB, the
>> performance is quite stable. However, when input data becomes larger than
>> that, the performance will be extremely unpredictable. For example, for
>> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three
>> dramatically different testing results: 27mins, 61mins and 114 mins. (All
>> settings are the same for the 3 tests, and I will create input data
>> immediately before running each of the tests to keep OS buffer cache hot.)
>>
>> Anyone can help to explain this? Thanks very much!
>>
>>
>


Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Yanbo Liang
Hi Jia,

You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it
can produce stable performance. The MEMORY_AND_DISK storage level will
store on disk the partitions that don't fit in memory, and read them from
there when they are needed.
Actually, it's not necessary to set so large driver memory in your case,
because KMeans use low memory for driver if your k is not very large.

Cheers
Yanbo

2015-12-30 22:20 GMT+08:00 Jia Zou :

> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU
> cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
> set to 15GB.
>
> The observation is that, when input data size is smaller than 15GB, the
> performance is quite stable. However, when input data becomes larger than
> that, the performance will be extremely unpredictable. For example, for
> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three
> dramatically different testing results: 27mins, 61mins and 114 mins. (All
> settings are the same for the 3 tests, and I will create input data
> immediately before running each of the tests to keep OS buffer cache hot.)
>
> Anyone can help to explain this? Thanks very much!
>
>


Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Jia Zou
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU
cores and 30GB memory. Executor memory is set to 15GB, and driver memory is
set to 15GB.

The observation is that, when the input data size is smaller than 15GB, the
performance is quite stable. However, when the input data becomes larger than
that, the performance becomes extremely unpredictable. For example, for
15GB input with inputRDD.persist(MEMORY_ONLY), I've got three
dramatically different test results: 27 mins, 61 mins and 114 mins. (All
settings are the same for the 3 tests, and I create the input data
immediately before running each test to keep the OS buffer cache hot.)

Can anyone help explain this? Thanks very much!


Clustering KMeans error in 1.5.1

2015-10-16 Thread robin_up
We upgraded from 1.4.0 to 1.5.1 (skipping 1.5.0) and one of our clustering jobs
hit the error below. Does anyone know what this is about, or whether it is a bug?


Traceback (most recent call last):
  File "user_clustering.py", line 137, in <module>
uig_model = KMeans.train(uigs,i,nIter, runs = nRuns)
  File
"/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/mllib/clustering.py",
line 150, in train
  File
"/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/mllib/common.py",
line 130, in callMLlibFunc
  File
"/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/mllib/common.py",
line 123, in callJavaFunc
  File
"/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File
"/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/sql/utils.py",
line 36, in deco
  File
"/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
o220.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 73
in stage 14.0 failed 4 times, most recent failure: Lost task 73.3 in stage
14.0 (TID 1357, hadoop-sandbox-dn07): ExecutorLostFailure (executor 13 lost)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at org.apache.spark.rdd.RDD.takeSample(RDD.scala:485)
at
org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:376)
at 
org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
at
org.apache.spark.mllib.api.python.PythonMLLibAPI.trainKMeansModel(PythonMLLibAPI.scala:341)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)



-
-- Robin Li


Re: Distance metrics in KMeans

2015-09-26 Thread Robineast
There is a Spark package that provides some alternative distance metrics:
http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering.
I have not used it myself.



-
Robin East 
Spark GraphX in Action Michael Malak and Robin East 
Manning Publications Co. 
http://www.manning.com/books/spark-graphx-in-action




Re: Distance metrics in KMeans

2015-09-25 Thread sethah
It looks like the distance metric is hard-coded to the L2 norm (Euclidean
distance) in MLlib. As you may expect, you are not the first person to
want other metrics, and there has been some prior effort.

Please see this PR: https://github.com/apache/spark/pull/2634

And the corresponding JIRA: https://issues.apache.org/jira/browse/SPARK-3219

It seems that adding arbitrary distance metrics is non-trivial given the
current implementation in MLlib. I am not aware of any current work on this
issue.
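
One workaround worth noting (it does not cover Tanimoto or Manhattan): if what is really wanted is cosine distance, L2-normalising the vectors first and running the stock Euclidean KMeans gives the same nearest-centre ranking, because for unit vectors ||a - b||^2 = 2 - 2 cos(a, b). A sketch:

```
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def cosineStyleKMeans(data: RDD[Vector], k: Int, maxIterations: Int) = {
  val normalizer = new Normalizer()                    // L2 norm (p = 2) by default
  val unitVectors = normalizer.transform(data).cache() // scale every vector to unit length
  KMeans.train(unitVectors, k, maxIterations)          // Euclidean on unit vectors ~ spherical k-means
}
```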






Distance metrics in KMeans

2015-09-25 Thread bobtreacy
Is it possible to use distance metrics other than Euclidean (e.g. Tanimoto,
Manhattan) with MLlib KMeans?






KMeans Model fails to run

2015-09-23 Thread Soong, Eddie
Hi,

Why am I getting this error, which prevents my KMeans clustering algorithm from 
working inside Spark? I'm trying to run a sample Scala model found on the 
Databricks website on my 1-node Cloudera Spark local VM. For completeness, the 
Scala program is as follows. Thanks

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/path/to/file")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

// Cluster the data into three classes using KMeans
val numIterations = 20
val numClusters = 3
val kmeansModel = KMeans.train(data, numClusters, numIterations)


15/09/23 19:38:11 WARN clustering.KMeans: The input data is not directly cached, 
which may hurt performance if its parent RDDs are also uncached.
java.io.IOException: No FileSystem for scheme: c
   at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
   at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
   at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
   at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
   at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
   at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
   at 
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:55)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1517)
   at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
   at org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
   at 
org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:288)
   at 
org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:162)
   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:139)
   at 
org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:420)
   at 
org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:430)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
   at $iwC$$iwC$$iwC.<init>(<console>:42)
   at $iwC$$iwC.<init>(<console>:44)
   at $iwC.<init>(<console>:46)
   at <init>(<console>:48)
   at .<init>(<console>:52)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Nati
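
The root cause is in the trace rather than in KMeans itself: "No FileSystem for scheme: c" means the string passed to sc.textFile started with something like C:\..., so Hadoop parsed "c" as a URI scheme. A hedged fix, assuming the file really does live on the local filesystem, is to give an explicit scheme:

```
import org.apache.spark.mllib.linalg.Vectors

// An explicit file:/// (or hdfs://) URI keeps the drive letter from being read as a scheme.
val data = sc.textFile("file:///C:/path/to/file")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))
```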

Using ML KMeans without hardcoded feature vector creation

2015-09-15 Thread Tóth Zoltán
Hi,

I'm wondering if there is a concise way to run ML KMeans on a DataFrame if
I have the features in multiple numeric columns.

I.e. as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa',
binomial_label=1)

I'd like to use KMeans without recreating the DataSet with the feature
vector added manually as a new column and the original columns hardcoded
repeatedly in the code.

The solution I'd like to improve:

from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa',
binomial_label=1)

df = iris.map(lambda r: Row(
id = r.id,
a1 = r.a1,
a2 = r.a2,
a3 = r.a3,
a4 = r.a4,
label = r.label,
binomial_label=r.binomial_label,
features = Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()


kmeans_estimator = KMeans()\
.setFeaturesCol("features")\
.setPredictionCol("prediction")\
kmeans_transformer = kmeans_estimator.fit(df)

predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()

# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1',
label=u'Iris-setosa', prediction=1)



I'm looking for a solution, which is something like:

feature_cols = ["a1", "a2", "a3", "a4"]

prediction_col_name = "prediction"






Thanks,

Zoltan


Kmeans issues and hierarchical clustering

2015-08-28 Thread Robust_spark
Dear All,

I am trying to cluster 350k English text phrases (each with 4-20 words) into
50k clusters with KMeans on a standalone system (8 cores, 16 GB). I am using
the Kryo serializer with MEMORY_AND_DISK_SER set. Although I get clustering
results with a lower number of features in HashingTF, the clustering quality
is poor. When I increase the number of features, I am hit with a GC overhead
limit exceeded error. How can I run the KMeans clustering with the maximum number
of features without crashing the app? I don't mind if it takes hours to get
the results, though.

Also, is there an agglomerative clustering algorithm (like hierarchical) in
Spark that can run on standalone systems?

Here is my code for reference - 

import java.io._

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.storage.StorageLevel._

object phrase_app {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Simple Application")
    conf.set("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // -- read phrases from text file ---
    val phrases = sc.textFile("phrases.txt", 10).persist(MEMORY_AND_DISK_SER)

    // -- featurize phrases --
    val no_features = 500
    val tf = new HashingTF(no_features)
    def featurize(s: String): Vector = {
      tf.transform(s.sliding(1).toSeq)
    }
    val featureVectors = phrases.map(featurize).persist(MEMORY_AND_DISK_SER)

    // -- train Kmeans and get cluster assignments --
    //val model = KMeans.train(featureVectors, 5, 10, 1, "random")
    val model = KMeans.train(featureVectors, 5, 10)
    val clusters = model.predict(featureVectors).collect()

    // -- print phrases and clusters to file --
    val pw = new PrintWriter(new File("cluster_dump.txt"))
    val phrases_array = phrases.collect()
    for (i <- 0 until phrases_array.length) {
      pw.write(phrases_array(i) + ";" + clusters(i) + "\n")
    }
    pw.close
  }
}


Thank you for your support. 






Re: mllib kmeans produce 1 large and many extremely small clusters

2015-08-10 Thread sooraj
Hi,

The issue is very likely to be in the data or the transformations you
apply, rather than anything to do with the Spark Kmeans API as such. I'd
start debugging by doing a bit of exploratory analysis of the TFIDF
vectors. That is, for instance, plot the distribution (histogram) of the
TFIDF values for each word in the vectors. It's quite possible that the
TFIDF values for most words for most documents are the same in your case,
causing all your 5000 points to crowd around the same region in the
n-dimensional space that they live in.



On 10 August 2015 at 10:28, farhan  wrote:

> I tried running mllib k-means with 20newsgroups data set from sklearn. On a
> 5000 document data set I get one cluster with most of the documents and
> other clusters just have handful of documents.
>
> #code
> newsgroups_train =
> fetch_20newsgroups(subset='train',random_state=1,remove=('headers',
> 'footers', 'quotes'))
> small_list = random.sample(newsgroups_train.data,5000)
>
> def get_word_vec(text,vocabulary):
> word_lst = tokenize_line(text)
> word_counter = Counter(word_lst)
> lst = []
> for v in vocabulary:
> if v in word_counter:
> lst.append(word_counter[v])
> else:
> lst.append(0)
> return lst
>
> docsrdd = sc.parallelize(small_list)
> tf = docsrdd.map(lambda x : get_word_vec(x,vocabulary))
> idf = IDF().fit(tf)
> tfidf = idf.transform(tf)
> clusters = KMeans.train(tfidf, 20)
>
> #documents in each cluster, using clusters.predict(x)
> Counter({0: 4978, 11: 3, 9: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8:
> 1, 10: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1})
>
>
> Please Help !
>
>
>
>


mllib kmeans produce 1 large and many extremely small clusters

2015-08-09 Thread farhan
I tried running mllib k-means on the 20newsgroups data set from sklearn. On a
5000 document data set I get one cluster with most of the documents, while the
other clusters have just a handful of documents.

#code
newsgroups_train =
fetch_20newsgroups(subset='train',random_state=1,remove=('headers',
'footers', 'quotes'))
small_list = random.sample(newsgroups_train.data,5000)

def get_word_vec(text,vocabulary):
word_lst = tokenize_line(text)
word_counter = Counter(word_lst)
lst = []
for v in vocabulary:
if v in word_counter:
lst.append(word_counter[v])
else:
lst.append(0)  
return lst

docsrdd = sc.parallelize(small_list)
tf = docsrdd.map(lambda x : get_word_vec(x,vocabulary))
idf = IDF().fit(tf)
tfidf = idf.transform(tf) 
clusters = KMeans.train(tfidf, 20)

#documents in each cluster, using clusters.predict(x)
Counter({0: 4978, 11: 3, 9: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8:
1, 10: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1})


Please Help !






RE: Kmeans Labeled Point RDD

2015-07-20 Thread Mohammed Guller
I responded to your question on SO. Let me know if this is what you wanted.

http://stackoverflow.com/a/31528274/2336943


Mohammed
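
For the RDD API, one common pattern (a hedged sketch with illustrative names, not necessarily what the linked answer does) is to zip the predictions back onto the rows they were built from; this works because predict() is a map over the same partitions, so the two RDDs line up element for element:

```
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// `rows` is the original record RDD and `features` the RDD[Vector] derived from it by a map,
// so both have the same partitioning and element counts, which is what zip() requires.
def attachClusters[T](rows: RDD[T], features: RDD[Vector], model: KMeansModel): RDD[(T, Int)] =
  rows.zip(model.predict(features))
```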

-Original Message-
From: plazaster [mailto:michaelplaz...@gmail.com] 
Sent: Sunday, July 19, 2015 11:38 PM
To: user@spark.apache.org
Subject: Re: Kmeans Labeled Point RDD

Has there been any progress on this? I am in the same boat.

I posted a similar question to Stack Exchange.

http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again






Re: Kmeans Labeled Point RDD

2015-07-19 Thread plazaster
Has there been any progress on this? I am in the same boat.

I posted a similar question to Stack Exchange.

http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again






Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Could the limited memory be causing this slowness?

On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando  wrote:

> Thanks Burak.
>
> Now it takes minutes to repartition;
>
> Active Stages (1):
>   Stage 42 - repartition at UnsupervisedSparkModelBuilder.java:120
>   Submitted: 2015/07/14 08:59:30, Duration: 3.6 min, Tasks: 0/3, Shuffle Write: 14.6 MB
>
> org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
> org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
>
> Pending Stages (1):
>   Stage 43 - sum at KMeansModel.scala:70
>   Submitted: unknown, Duration: unknown, Tasks: 0/8
>
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
> org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
> org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
>
> On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz  wrote:
>
>> Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also,
>> .cache()?
>>
>> something like, (I'm assuming you are using Java):
>> ```
>> JavaRDD<Vector> input = data.repartition(8).cache();
>> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
>> ```
>>
>> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando 
>> wrote:
>>
>>> I'm using;
>>>
>>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>>>
>>> Cpu cores: 8 (using default Spark conf thought)
>>>
>>> On partitions, I'm not sure how to find that.
>>>
>>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz  wrote:
>>>
>>>> What are the other parameters? Are you just setting k=3? What about #
>>>> of runs? How many partitions do you have? How many cores does your machine
>>>> have?
>>>>
>>>> Thanks,
>>>> Burak
>>>>
>>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando 
>>>> wrote:
>>>>
>>>>> Hi Burak,
>>>>>
>>>>> k = 3
>>>>> dimension = 785 features
>>>>> Spark 1.4
>>>>>
>>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> How are you running K-Means? What is your k? What is the dimension of
>>>>>> your dataset (columns)? Which Spark version are you using?
>>>>>>
>>>>>> Thanks,
>>>>>> Burak
>>>>>>
>>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot
>>>>>>> of time (16+ mints).
>>>>>>>
>>>>>>> It takes lot of time at this task;
>>>>>>>
>>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>>>
>>>>>>> Can this be improved?
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Thanks Burak.

Now it takes minutes to repartition;

Active Stages (1):
  Stage 42 - repartition at UnsupervisedSparkModelBuilder.java:120
  Submitted: 2015/07/14 08:59:30, Duration: 3.6 min, Tasks: 0/3, Shuffle Write: 14.6 MB

org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Pending Stages (1):
  Stage 43 - sum at KMeansModel.scala:70
  Submitted: unknown, Duration: unknown, Tasks: 0/8

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz  wrote:

> Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also,
> .cache()?
>
> something like, (I'm assuming you are using Java):
> ```
> JavaRDD<Vector> input = data.repartition(8).cache();
> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
> ```
>
> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando  wrote:
>
>> I'm using;
>>
>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>>
>> Cpu cores: 8 (using default Spark conf thought)
>>
>> On partitions, I'm not sure how to find that.
>>
>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz  wrote:
>>
>>> What are the other parameters? Are you just setting k=3? What about # of
>>> runs? How many partitions do you have? How many cores does your machine
>>> have?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando 
>>> wrote:
>>>
>>>> Hi Burak,
>>>>
>>>> k = 3
>>>> dimension = 785 features
>>>> Spark 1.4
>>>>
>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> How are you running K-Means? What is your k? What is the dimension of
>>>>> your dataset (columns)? Which Spark version are you using?
>>>>>
>>>>> Thanks,
>>>>> Burak
>>>>>
>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot
>>>>>> of time (16+ mints).
>>>>>>
>>>>>> It takes lot of time at this task;
>>>>>>
>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>>
>>>>>> Can this be improved?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Thanks & regards,
>>>>>> Nirmal
>>>>>>
>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>> Mobile: +94715779733
>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also,
.cache()?

something like, (I'm assuming you are using Java):
```
JavaRDD<Vector> input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```

On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando  wrote:

> I'm using;
>
> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>
> Cpu cores: 8 (using default Spark conf thought)
>
> On partitions, I'm not sure how to find that.
>
> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz  wrote:
>
>> What are the other parameters? Are you just setting k=3? What about # of
>> runs? How many partitions do you have? How many cores does your machine
>> have?
>>
>> Thanks,
>> Burak
>>
>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando 
>> wrote:
>>
>>> Hi Burak,
>>>
>>> k = 3
>>> dimension = 785 features
>>> Spark 1.4
>>>
>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz  wrote:
>>>
>>>> Hi,
>>>>
>>>> How are you running K-Means? What is your k? What is the dimension of
>>>> your dataset (columns)? Which Spark version are you using?
>>>>
>>>> Thanks,
>>>> Burak
>>>>
>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
>>>>> time (16+ mints).
>>>>>
>>>>> It takes lot of time at this task;
>>>>>
>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>>>
>>>>> Can this be improved?
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>>
>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>> Mobile: +94715779733
>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
I'm using;

org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

CPU cores: 8 (using the default Spark conf, though)

On partitions, I'm not sure how to find that.

On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz  wrote:

> What are the other parameters? Are you just setting k=3? What about # of
> runs? How many partitions do you have? How many cores does your machine
> have?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando  wrote:
>
>> Hi Burak,
>>
>> k = 3
>> dimension = 785 features
>> Spark 1.4
>>
>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz  wrote:
>>
>>> Hi,
>>>
>>> How are you running K-Means? What is your k? What is the dimension of
>>> your dataset (columns)? Which Spark version are you using?
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando 
>>> wrote:
>>>
 Hi,

 For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
 time (16+ mints).

 It takes lot of time at this task;

 org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
 org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

 Can this be improved?

 --

 Thanks & regards,
 Nirmal

 Associate Technical Lead - Data Technologies Team, WSO2 Inc.
 Mobile: +94715779733
 Blog: http://nirmalfdo.blogspot.com/



>>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
What are the other parameters? Are you just setting k=3? What about # of
runs? How many partitions do you have? How many cores does your machine
have?

Thanks,
Burak

On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando  wrote:

> Hi Burak,
>
> k = 3
> dimension = 785 features
> Spark 1.4
>
> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz  wrote:
>
>> Hi,
>>
>> How are you running K-Means? What is your k? What is the dimension of
>> your dataset (columns)? Which Spark version are you using?
>>
>> Thanks,
>> Burak
>>
>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando  wrote:
>>
>>> Hi,
>>>
>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
>>> time (16+ mints).
>>>
>>> It takes lot of time at this task;
>>>
>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>>
>>> Can this be improved?
>>>
>>> --
>>>
>>> Thanks & regards,
>>> Nirmal
>>>
>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>> Mobile: +94715779733
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>>
>>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi Burak,

k = 3
dimension = 785 features
Spark 1.4

On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz  wrote:

> Hi,
>
> How are you running K-Means? What is your k? What is the dimension of your
> dataset (columns)? Which Spark version are you using?
>
> Thanks,
> Burak
>
> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando  wrote:
>
>> Hi,
>>
>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
>> time (16+ mints).
>>
>> It takes lot of time at this task;
>>
>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>>
>> Can this be improved?
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>


-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Hi,

How are you running K-Means? What is your k? What is the dimension of your
dataset (columns)? Which Spark version are you using?

Thanks,
Burak

On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando  wrote:

> Hi,
>
> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of
> time (16+ mints).
>
> It takes lot of time at this task;
>
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>
> Can this be improved?
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


[MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi,

For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of time
(16+ minutes).

Most of the time is spent in this task:

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

Can this be improved?

-- 

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
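
For context on why this call is expensive: computeCost is a full pass over the dataset, summing the squared distance from every point to its nearest centre, which is why the caching and repartitioning suggested in the replies above matter. A hedged sketch of what it computes (the WSSSE):

```
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Roughly equivalent to model.computeCost(data): every point is touched once.
def wssse(data: RDD[Vector], model: KMeansModel): Double =
  data.map(p => model.clusterCenters.map(c => Vectors.sqdist(p, c)).min).sum()
```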


Re: KMeans questions

2015-07-02 Thread Feynman Liang
SPARK-7879 <https://issues.apache.org/jira/browse/SPARK-7879> seems to
address your use case (running KMeans on a dataframe and having the results
added as an additional column)

On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman 
wrote:

> In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train
> method, is there a cleaner way to create the Vectors than this?
>
> data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3),
> r.getDouble(4), r.getDouble(5), r.getDouble(6))}
>
>
> Second, once I train the model and call predict on my vectorized dataset,
> what's the best way to relate the cluster assignments back to the original
> data frame?
>
>
> That is, I started with df1, which has a bunch of domain information in
> each row and also the doubles I use to cluster.  I vectorize the doubles
> and then train on them.  I use the resulting model to predict clusters for
> the vectors.  I'd like to look at the original domain information in light
> of the clusters to which they are now assigned.
>
>
>
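
For reference, a sketch of the DataFrame route Feynman points at (introduced around Spark 1.5; column names here are illustrative): VectorAssembler builds the feature vector from named columns, and the fitted model's transform() just appends the prediction column, so all of df1's domain columns stay attached to their cluster assignments.

```
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

def clusterWithDomainColumns(df1: DataFrame, k: Int): DataFrame = {
  val assembled = new VectorAssembler()
    .setInputCols(Array("d0", "d3", "d4", "d5", "d6"))   // the numeric columns to cluster on
    .setOutputCol("features")
    .transform(df1)

  val model = new KMeans()
    .setK(k)
    .setFeaturesCol("features")
    .setPredictionCol("cluster")
    .fit(assembled)

  // transform() only adds the "cluster" column; every original domain column survives.
  model.transform(assembled).drop("features")
}
```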


KMeans questions

2015-07-01 Thread Eric Friedman
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train
method, is there a cleaner way to create the Vectors than this?

data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4),
r.getDouble(5), r.getDouble(6))}


Second, once I train the model and call predict on my vectorized dataset,
what's the best way to relate the cluster assignments back to the original
data frame?


That is, I started with df1, which has a bunch of domain information in
each row and also the doubles I use to cluster.  I vectorize the doubles
and then train on them.  I use the resulting model to predict clusters for
the vectors.  I'd like to look at the original domain information in light
of the clusters to which they are now assigned.


Re: kmeans broadcast

2015-06-29 Thread Himanshu Mehra
Hi Haviv,

Have you tried sc.broadcast(model)? The broadcast method is a member of the
SparkContext class.

Thanks
Himanshu
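
A minimal sketch of that suggestion, assuming `model` is a trained mllib KMeansModel: broadcasting ships the model to each executor once, and tasks read it through .value instead of re-serialising it with every task closure.

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def predictWithBroadcast(sc: SparkContext, model: KMeansModel, points: RDD[Vector]): RDD[(Vector, Int)] = {
  val bcModel = sc.broadcast(model)
  points.map(p => (p, bcModel.value.predict(p)))   // cluster id for each point, computed on the executors
}
```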






MLIB-KMEANS: Py4JNetworkError: An error occurred while trying to connect to the Java server , on a huge data set

2015-06-18 Thread rogersjeffreyl
Hi All,

I am trying to run KMeans clustering on a large data set with 12,000 points
and 80,000 dimensions. I have a Spark cluster in EC2 standalone mode with
8 workers running on 2 slaves, with 160 GB RAM and 40 vCPUs.

*My Code is as Follows:*

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

def convert_into_sparse_vector(A):
    # keep only the non-NaN entries and build a sparse MLlib vector from them
    non_nan_indices = np.nonzero(~np.isnan(A))
    non_nan_values = A[non_nan_indices]
    dictionary = dict(zip(non_nan_indices[0], non_nan_values))
    return Vectors.sparse(len(A), dictionary)

X = [convert_into_sparse_vector(A) for A in complete_dataframe.values]
sc = SparkContext(appName="parallel_kmeans")
data = sc.parallelize(X, 10)
model = KMeans.train(data, 1000, initializationMode="k-means||")

where complete_dataframe is a pandas data frame that has my data.

I get the error: Py4JNetworkError: An error occurred while trying to connect
to the Java server.
The error trace is as follows:
>  Exception happened during
> processing of request from ('127.0.0.1', 41360) Traceback (most recent
> call last):   File "/usr/lib64/python2.6/SocketServer.py", line 283,
> in _handle_request_noblock
> self.process_request(request, client_address)   File
> "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
> self.finish_request(request, client_address)   File
> "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
> self.RequestHandlerClass(request, client_address, self)   File
> "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
> self.handle()   File "/root/spark/python/pyspark/accumulators.py",
> line 235, in handle
> num_updates = read_int(self.rfile)   File
> "/root/spark/python/pyspark/serializers.py", line 544, in read_int
> raise EOFError EOFError
> 
> ---
> Py4JNetworkError  Traceback (most recent call
> last)  in ()
> > 1 model = KMeans.train(data, 1000, initializationMode="k-means||")
> 
> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k,
> maxIterations, runs, initializationMode, seed, initializationSteps,
> epsilon)
> 134 """Train a k-means clustering model."""
> 135 model = callMLlibFunc("trainKMeansModel",
> rdd.map(_convert_to_vector), k, maxIterations,
> --> 136   runs, initializationMode, seed,
> initializationSteps, epsilon)
> 137 centers = callJavaFunc(rdd.context, model.clusterCenters)
> 138 return KMeansModel([c.toArray() for c in centers])
> 
> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name,
> *args)
> 126 sc = SparkContext._active_spark_context
> 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
> --> 128 return callJavaFunc(sc, api, *args)
> 129 
> 130 
> 
> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func,
> *args)
> 119 """ Call Java Function """
> 120 args = [_py2java(sc, a) for a in args]
> --> 121 return _java2py(sc, func(*args))
> 122 
> 123 
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> __call__(self, *args)
> 534 END_COMMAND_PART
> 535 
> --> 536 answer = self.gateway_client.send_command(command)
> 537 return_value = get_return_value(answer,
> self.gateway_client,
> 538 self.target_id, self.name)
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> send_command(self, command, retry)
> 367 if retry:
> 368 #print_exc()
> --> 369 response = self.send_command(command)
> 370 else:
> 371 response = ERROR
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> send_command(self, command, retry)
> 360  the Py4J protocol.
> 361 """
> --> 362 connection = self._get_connection()
> 363 try:
> 364 response = connection.send_command(command)
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> _get_connection(self)
> 316 connection = self.deque.pop()
> 317 except Exception:
> --> 318 connection = self._create_connection()
> 319 return connection
> 320 
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in
> _create_connection(self)
> 323 

Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman & Meethu,
Apologies---I was wrong about KMeans supporting an initial set of
centroids!  JIRA created: https://issues.apache.org/jira/browse/SPARK-8018
If you're interested in submitting a PR, please do!
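
Once it lands, usage might look roughly like the sketch below; setInitialModel
is the proposed method name here and is hypothetical until the JIRA is
resolved, and the centroids and data RDD are placeholders.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Seed the run with user-chosen centroids wrapped in a KMeansModel.
val initialCenters = Array(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(5.0, 5.0))
val model = new KMeans()
  .setK(initialCenters.length)
  .setMaxIterations(20)
  .setInitialModel(new KMeansModel(initialCenters))
  .run(data)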
Thanks,
Joseph

On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW 
wrote:

> Hi Joseph,
> I was unable to find any function in Kmeans.scala where the initial
> centroids could be specified by the user. Kindly help.
>
> Thanks & Regards,
> Meethu M
>
>
>
>   On Tuesday, 19 May 2015 6:54 AM, Joseph Bradley 
> wrote:
>
>
> Hi Suman,
>
> For maxIterations, are you using the DenseKMeans.scala example code?  (I'm
> guessing yes since you mention the command line.)  If so, then you should
> be able to specify maxIterations via an extra parameter like
> "--numIterations 50" (note the example uses "numIterations" in the current
> master instead of "maxIterations," which is sort of a bug in the example).
> If that does not cap the max iterations, then please report it as a bug.
>
> To specify the initial centroids, you will need to modify the DenseKMeans
> example code.  Please see the KMeans API docs for details.
>
> Good luck,
> Joseph
>
> On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW 
> wrote:
>
> Hi,
> I think you cant supply an initial set of centroids to kmeans
>
> Thanks & Regards,
> Meethu M
>
>
>
>   On Friday, 15 May 2015 12:37 AM, Suman Somasundar <
> suman.somasun...@oracle.com> wrote:
>
>
> Hi,,
>
> I want to run a definite number of iterations in Kmeans.  There is a
> command line argument to set maxIterations, but even if I set it to a
> number, Kmeans runs until the centroids converge.
> Is there a specific way to specify it in command line?
>
> Also, I wanted to know if we can supply the initial set of centroids to
> the program instead of it choosing the centroids in random?
>
> Thanks,
> Suman.
>
>
>
>
>
>


Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip the predictions with the points RDD to get
approximately the same thing as a LabeledPoint.
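
A rough sketch, assuming model is the trained KMeansModel and points is the
RDD[Vector] that was clustered:

// Pair each input vector with its predicted cluster id, which plays the
// role of the label in a LabeledPoint-like pair.
val predictions = model.predict(points)   // RDD[Int], one cluster id per vector
val labeled = points.zip(predictions)     // RDD[(Vector, Int)]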
Cheers


On Thu, May 21, 2015 at 6:19 PM, anneywarlord 
wrote:

> Hello,
>
> New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
> in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I
> would
> like to be able to identify which observations were grouped with each
> centroid.
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Kmeans Labeled Point RDD

2015-05-21 Thread anneywarlord
Hello,

New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I would
like to be able to identify which observations were grouped with each
centroid.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark mllib kmeans

2015-05-21 Thread Pa Rö
I want to evaluate some different distance measures for time-space clustering,
so I need an API to implement my own function in Java.

2015-05-19 22:08 GMT+02:00 Xiangrui Meng :

> Just curious, what distance measure do you need? -Xiangrui
>
> On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa 
> wrote:
> > take a look at this
> > https://github.com/derrickburns/generalized-kmeans-clustering
> >
> > Best,
> >
> > Jao
> >
> > On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko  >
> > wrote:
> >>
> >> Hi Paul,
> >>
> >> I would say that it should be possible, but you'll need a different
> >> distance measure which conforms to your coordinate system.
> >>
> >> 2015-05-11 14:59 GMT+02:00 Pa Rö :
> >>>
> >>> hi,
> >>>
> >>> it is possible to use a custom distance measure and a other data typ as
> >>> vector?
> >>> i want cluster temporal geo datas.
> >>>
> >>> best regards
> >>> paul
> >>
> >>
> >
>


Re: question about customize kmeans distance measure

2015-05-19 Thread Xiangrui Meng
MLlib only supports Euclidean distance for k-means. You can find
Bregman divergence support in Derrick's package:
http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering.
Which distance measure do you want to use? -Xiangrui

On Tue, May 12, 2015 at 7:23 PM, June  wrote:
> Dear list,
>
>
>
> I am new to spark, and I want to use the kmeans algorithm in mllib package.
>
> I am wondering whether it is possible to customize the distance measure used
> by kmeans, and how?
>
>
>
> Many thanks!
>
>
>
> June

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark mllib kmeans

2015-05-19 Thread Xiangrui Meng
Just curious, what distance measure do you need? -Xiangrui

On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa  wrote:
> take a look at this
> https://github.com/derrickburns/generalized-kmeans-clustering
>
> Best,
>
> Jao
>
> On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko 
> wrote:
>>
>> Hi Paul,
>>
>> I would say that it should be possible, but you'll need a different
>> distance measure which conforms to your coordinate system.
>>
>> 2015-05-11 14:59 GMT+02:00 Pa Rö :
>>>
>>> hi,
>>>
>>> it is possible to use a custom distance measure and a other data typ as
>>> vector?
>>> i want cluster temporal geo datas.
>>>
>>> best regards
>>> paul
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
Hi Suman,

For maxIterations, are you using the DenseKMeans.scala example code?  (I'm
guessing yes since you mention the command line.)  If so, then you should
be able to specify maxIterations via an extra parameter like
"--numIterations 50" (note the example uses "numIterations" in the current
master instead of "maxIterations," which is sort of a bug in the example).
If that does not cap the max iterations, then please report it as a bug.
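
If you call the API directly instead of going through the example script, the
cap looks roughly like this sketch (the k value and the data RDD name are
placeholders):

import org.apache.spark.mllib.clustering.KMeans

// Training stops at maxIterations or on convergence (epsilon), whichever comes first.
val model = new KMeans()
  .setK(10)
  .setMaxIterations(50)
  .setEpsilon(1e-4)
  .run(data)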

To specify the initial centroids, you will need to modify the DenseKMeans
example code.  Please see the KMeans API docs for details.

Good luck,
Joseph

On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW 
wrote:

> Hi,
> I think you cant supply an initial set of centroids to kmeans
>
> Thanks & Regards,
> Meethu M
>
>
>
>   On Friday, 15 May 2015 12:37 AM, Suman Somasundar <
> suman.somasun...@oracle.com> wrote:
>
>
> Hi,,
>
> I want to run a definite number of iterations in Kmeans.  There is a
> command line argument to set maxIterations, but even if I set it to a
> number, Kmeans runs until the centroids converge.
> Is there a specific way to specify it in command line?
>
> Also, I wanted to know if we can supply the initial set of centroids to
> the program instead of it choosing the centroids in random?
>
> Thanks,
> Suman.
>
>
>


Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread MEETHU MATHEW
Hi,
I think you can't supply an initial set of centroids to kmeans.

Thanks & Regards,
Meethu M


 On Friday, 15 May 2015 12:37 AM, Suman Somasundar 
 wrote:
   

Hi,

I want to run a definite number of iterations in Kmeans.  There is a command
line argument to set maxIterations, but even if I set it to a number, Kmeans
runs until the centroids converge. Is there a specific way to specify it on
the command line?

Also, I wanted to know if we can supply the initial set of centroids to the
program instead of it choosing the centroids at random?

Thanks,
Suman.

  

Restricting the number of iterations in Mllib Kmeans

2015-05-14 Thread Suman Somasundar
Hi,

I want to run a definite number of iterations in Kmeans.  There is a command
line argument to set maxIterations, but even if I set it to a number, Kmeans
runs until the centroids converge.

Is there a specific way to specify it on the command line?


Also, I wanted to know if we can supply the initial set of centroids to the
program instead of it choosing the centroids at random?

 

Thanks,
Suman.


question about customize kmeans distance measure

2015-05-12 Thread June
Dear list,

 

I am new to spark, and I want to use the kmeans algorithm in mllib package. 

I am wondering whether it is possible to customize the distance measure used
by kmeans, and how?

 

Many thanks!

 

June



Re: spark mllib kmeans

2015-05-11 Thread Jaonary Rabarisoa
take a look at this
https://github.com/derrickburns/generalized-kmeans-clustering

Best,

Jao

On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko 
wrote:

> Hi Paul,
>
> I would say that it should be possible, but you'll need a different
> distance measure which conforms to your coordinate system.
>
> 2015-05-11 14:59 GMT+02:00 Pa Rö :
>
>> hi,
>>
>> it is possible to use a custom distance measure and a other data typ as
>> vector?
>> i want cluster temporal geo datas.
>>
>> best regards
>> paul
>>
>
>


Re: spark mllib kmeans

2015-05-11 Thread Driesprong, Fokko
Hi Paul,

I would say that it should be possible, but you'll need a different
distance measure which conforms to your coordinate system.
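
As an illustration only (MLlib's KMeans itself is Euclidean-only, so this
would have to go into a custom or third-party k-means implementation), a
great-circle distance for (latitude, longitude) points might look like:

// Haversine distance in kilometres between two (lat, lon) points in degrees.
def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val R = 6371.0   // mean Earth radius in km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
    math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(a))
}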

2015-05-11 14:59 GMT+02:00 Pa Rö :

> hi,
>
> it is possible to use a custom distance measure and a other data typ as
> vector?
> i want cluster temporal geo datas.
>
> best regards
> paul
>


spark mllib kmeans

2015-05-11 Thread Pa Rö
hi,

Is it possible to use a custom distance measure and another data type as the
vector?
I want to cluster temporal geo data.

best regards
paul


Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Guys, great feedback by pointing out my stupidity :D

Rows and columns got intermixed, hence the weird results I was seeing.
Ignore my previous issues; I will reformat my data first.

On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga 
wrote:

> I'm mostly using example code, see here:
> http://paste.openstack.org/show/211966/
> The data has 799305 dimensions and is separated by space
>
> Please note the issues I'm seeing is because of the scala implementation
> imo as it happens also when using the Python wrappers.
>
>
>
> On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele 
> wrote:
>
>> How you are passing feature vector to K means?
>> its in 2-D space of 1-D array?
>>
>> Did you try using Streaming Kmeans?
>>
>> will you be able to paste code here?
>>
>> On 29 April 2015 at 17:23, Sam Stoelinga  wrote:
>>
>>> Hi Sparkers,
>>>
>>> I am trying to run MLib kmeans on a large dataset(50+Gb of data) and a
>>> large K but I've encountered the following issues:
>>>
>>>
>>>- Spark driver gets out of memory and dies because collect gets
>>>called as part of KMeans, which loads all data back to the driver's 
>>> memory.
>>>- At the end there is a LocalKMeans class which runs KMeansPlusPlus
>>>on the Spark driver. Why isn't this distributed? It's spending a long 
>>> time
>>>on here and this has the same problem as point 1 requires loading the 
>>> data
>>>to the driver.
>>>Also when LocakKMeans is running on driver also seeing lots of :
>>>15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>>>initialization ran out of distinct points for centers. Using duplicate
>>>point for center k = 222
>>>- Has the above behaviour been like this in previous releases? I
>>>remember running KMeans before without too much problems.
>>>
>>> Looking forward to hear you point out my stupidity or provide
>>> work-arounds that could make Spark KMeans work well on large datasets.
>>>
>>> Regards,
>>> Sam Stoelinga
>>>
>>
>>
>>
>>
>


Re: MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
I'm mostly using example code, see here:
http://paste.openstack.org/show/211966/
The data has 799,305 dimensions and is space-separated.

Please note that the issues I'm seeing are in the Scala implementation, imo,
as they also happen when using the Python wrappers.



On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele 
wrote:

> How you are passing feature vector to K means?
> its in 2-D space of 1-D array?
>
> Did you try using Streaming Kmeans?
>
> will you be able to paste code here?
>
> On 29 April 2015 at 17:23, Sam Stoelinga  wrote:
>
>> Hi Sparkers,
>>
>> I am trying to run MLib kmeans on a large dataset(50+Gb of data) and a
>> large K but I've encountered the following issues:
>>
>>
>>- Spark driver gets out of memory and dies because collect gets
>>called as part of KMeans, which loads all data back to the driver's 
>> memory.
>>- At the end there is a LocalKMeans class which runs KMeansPlusPlus
>>on the Spark driver. Why isn't this distributed? It's spending a long time
>>on here and this has the same problem as point 1 requires loading the data
>>to the driver.
>>Also when LocakKMeans is running on driver also seeing lots of :
>>15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>>initialization ran out of distinct points for centers. Using duplicate
>>point for center k = 222
>>- Has the above behaviour been like this in previous releases? I
>>    remember running KMeans before without too much problems.
>>
>> Looking forward to hear you point out my stupidity or provide
>> work-arounds that could make Spark KMeans work well on large datasets.
>>
>> Regards,
>> Sam Stoelinga
>>
>
>
>
>


Re: MLib KMeans on large dataset issues

2015-04-29 Thread Jeetendra Gangele
How are you passing the feature vectors to K-means?
Are they in a 2-D space or a 1-D array?

Did you try using Streaming KMeans?

Will you be able to paste the code here?

On 29 April 2015 at 17:23, Sam Stoelinga  wrote:

> Hi Sparkers,
>
> I am trying to run MLib kmeans on a large dataset(50+Gb of data) and a
> large K but I've encountered the following issues:
>
>
>- Spark driver gets out of memory and dies because collect gets called
>as part of KMeans, which loads all data back to the driver's memory.
>- At the end there is a LocalKMeans class which runs KMeansPlusPlus on
>the Spark driver. Why isn't this distributed? It's spending a long time on
>here and this has the same problem as point 1 requires loading the data to
>the driver.
>Also when LocakKMeans is running on driver also seeing lots of :
>15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>initialization ran out of distinct points for centers. Using duplicate
>point for center k = 222
>- Has the above behaviour been like this in previous releases? I
>remember running KMeans before without too much problems.
>
> Looking forward to hear you point out my stupidity or provide work-arounds
> that could make Spark KMeans work well on large datasets.
>
> Regards,
> Sam Stoelinga
>


MLib KMeans on large dataset issues

2015-04-29 Thread Sam Stoelinga
Hi Sparkers,

I am trying to run MLlib kmeans on a large dataset (50+ GB of data) and a
large K, but I've encountered the following issues:


   - Spark driver gets out of memory and dies because collect gets called
   as part of KMeans, which loads all data back to the driver's memory.
   - At the end there is a LocalKMeans class which runs KMeansPlusPlus on
   the Spark driver. Why isn't this distributed? It's spending a long time
   here, and this has the same problem as point 1: it requires loading the
   data to the driver.
   Also, when LocalKMeans is running on the driver I'm seeing lots of:
   15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
   initialization ran out of distinct points for centers. Using duplicate
   point for center k = 222
   - Has the above behaviour been like this in previous releases? I
   remember running KMeans before without too many problems.

Looking forward to hearing you point out my stupidity or provide workarounds
that could make Spark KMeans work well on large datasets.

Regards,
Sam Stoelinga


Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley
Yes, the count() should be the first task, and the sampling + collecting
should be the second task.  The first one is probably slow because the RDD
being sampled is not yet cached/materialized.
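
Caching and materializing the input before calling train usually moves that
cost into a single up-front pass, e.g. (sketch; the path and parameters are
placeholders and sc is a spark-shell style SparkContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Materialize and cache the vectors once, so the sampling jobs inside
// KMeans (and any later passes) read from memory rather than recomputing.
val data = sc.textFile("hdfs:///path/to/points")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
data.count()   // forces the caching up front
val model = KMeans.train(data, 20, 100, 1, KMeans.RANDOM)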

K-Means creates some RDDs internally while learning, and since they aren't
needed after learning, they are unpersisted (uncached) at the end.

Joseph

On Sat, Apr 25, 2015 at 6:36 AM, podioss  wrote:

> Hi,
> i am running k-means algorithm with initialization mode set to random and
> various dataset sizes and values for clusters and i have a question
> regarding the takeSample job of the algorithm.
> More specific i notice that in every application there are  two sampling
> jobs. The first one is consuming the most time compared to all others while
> the second one is much quicker and that sparked my interest to investigate
> what is actually happening.
> In order to explain it, i  checked the source code of the takeSample
> operation and i saw that there is a count action involved and then the
> computation of a PartiotionwiseSampledRDD with a PoissonSampler.
> So my question is,if that count action corresponds to the first takeSample
> job and if the second takeSample job is the one doing the actual sampling.
>
> I also have a question for the RDDs that are created for the k-means. In
> the
> middle of the execution under the storage tab of the web ui i can see 3
> RDDs
> with their partitions cached in memory across all nodes which is very
> helpful for monitoring reasons. The problem is that after the completion i
> can only see one of them and the portion of the cache memory it used and i
> would like to ask why the web ui doesn't display all the RDDs involded in
> the computation.
>
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


KMeans takeSample jobs and RDD cached

2015-04-25 Thread podioss
Hi,
I am running the k-means algorithm with the initialization mode set to random
and various dataset sizes and numbers of clusters, and I have a question
regarding the takeSample job of the algorithm.
More specifically, I notice that in every application there are two sampling
jobs. The first one consumes the most time compared to all others, while the
second one is much quicker, and that sparked my interest to investigate what
is actually happening.
In order to explain it, I checked the source code of the takeSample operation
and saw that there is a count action involved and then the computation of a
PartitionwiseSampledRDD with a PoissonSampler.
So my question is whether that count action corresponds to the first
takeSample job and whether the second takeSample job is the one doing the
actual sampling.

I also have a question about the RDDs that are created for the k-means. In
the middle of the execution, under the storage tab of the web UI, I can see 3
RDDs with their partitions cached in memory across all nodes, which is very
helpful for monitoring reasons. The problem is that after the completion I
can only see one of them and the portion of the cache memory it used, and I
would like to ask why the web UI doesn't display all the RDDs involved in the
computation.

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Streaming Kmeans usage in java

2015-04-23 Thread Jeetendra Gangele
Hi everyone, do we have a sample example of how to use streaming k-means
clustering with Java? I have seen some example usage in Scala. Can anybody
point me to a Java example?

regards
jeetendra


Re: kmeans|| in Spark is not real paralleled?

2015-04-03 Thread Xi Shen
Hi Xingrui,

I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and
attached the sample code. But I could not attach the test data. I will
update the bug once I find a place to host the test data.


Thanks,
David


On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng  wrote:

> This PR updated the k-means|| initialization:
> https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
> which was included in 1.3.0. It should fix kmean|| initialization with
> large k. Please create a JIRA for this issue and send me the code and the
> dataset to produce this problem. Thanks! -Xiangrui
>
> On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen  wrote:
>
>> Hi,
>>
>> I have opened a couple of threads asking about k-means performance
>> problem in Spark. I think I made a little progress.
>>
>> Previous I use the simplest way of KMeans.train(rdd, k, maxIterations).
>> It uses the "kmeans||" initialization algorithm which supposedly to be a
>> faster version of kmeans++ and give better results in general.
>>
>> But I observed that if the k is very large, the initialization step takes
>> a long time. From the CPU utilization chart, it looks like only one thread
>> is working. Please see
>> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
>> .
>>
>> I read the paper,
>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it
>> points out kmeans++ initialization algorithm will suffer if k is large.
>> That's why the paper contributed the kmeans|| algorithm.
>>
>>
>> If I invoke KMeans.train by using the random initialization algorithm, I
>> do not observe this problem, even with very large k, like k=5000. This
>> makes me suspect that the kmeans|| in Spark is not properly implemented and
>> do not utilize parallel implementation.
>>
>>
>> I have also tested my code and data set with Spark 1.3.0, and I still
>> observe this problem. I quickly checked the PR regarding the KMeans
>> algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
>> and polish, not changing/improving the algorithm.
>>
>>
>> I originally worked on Windows 64bit environment, and I also tested on
>> Linux 64bit environment. I could provide the code and data set if anyone
>> want to reproduce this problem.
>>
>>
>> I hope a Spark developer could comment on this problem and help
>> identifying if it is a bug.
>>
>>
>> Thanks,
>>
>> [image: --]
>> Xi Shen
>> [image: http://]about.me/davidshen
>> <http://about.me/davidshen?promo=email_sig>
>>   <http://about.me/davidshen>
>>
>
>


Re: Mllib kmeans #iteration

2015-04-02 Thread amoners
Have you referred to the official kmeans documentation at
https://spark.apache.org/docs/1.1.1/mllib-clustering.html ?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353p22365.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter: *maxIterations*
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

On Thu, Apr 2, 2015 at 4:42 AM, podioss  wrote:

> Hello,
> i am running the Kmeans algorithm in cluster mode from Mllib and i was
> wondering if i could run the algorithm with fixed number of iterations in
> some way.
>
> Thanks
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Mllib kmeans #iteration

2015-04-02 Thread podioss
Hello,
I am running the KMeans algorithm in cluster mode from MLlib and I was
wondering if I could run the algorithm with a fixed number of iterations in
some way.

Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: kmeans|| in Spark is not real paralleled?

2015-03-30 Thread Xiangrui Meng
This PR updated the k-means|| initialization:
https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d,
which was included in 1.3.0. It should fix k-means|| initialization with
large k. Please create a JIRA for this issue and send me the code and the
dataset to reproduce this problem. Thanks! -Xiangrui

On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen  wrote:

> Hi,
>
> I have opened a couple of threads asking about k-means performance problem
> in Spark. I think I made a little progress.
>
> Previous I use the simplest way of KMeans.train(rdd, k, maxIterations). It
> uses the "kmeans||" initialization algorithm which supposedly to be a
> faster version of kmeans++ and give better results in general.
>
> But I observed that if the k is very large, the initialization step takes
> a long time. From the CPU utilization chart, it looks like only one thread
> is working. Please see
> https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
> .
>
> I read the paper,
> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it points
> out kmeans++ initialization algorithm will suffer if k is large. That's why
> the paper contributed the kmeans|| algorithm.
>
>
> If I invoke KMeans.train by using the random initialization algorithm, I
> do not observe this problem, even with very large k, like k=5000. This
> makes me suspect that the kmeans|| in Spark is not properly implemented and
> do not utilize parallel implementation.
>
>
> I have also tested my code and data set with Spark 1.3.0, and I still
> observe this problem. I quickly checked the PR regarding the KMeans
> algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
> and polish, not changing/improving the algorithm.
>
>
> I originally worked on Windows 64bit environment, and I also tested on
> Linux 64bit environment. I could provide the code and data set if anyone
> want to reproduce this problem.
>
>
> I hope a Spark developer could comment on this problem and help
> identifying if it is a bug.
>
>
> Thanks,
>
> [image: --]
> Xi Shen
> [image: http://]about.me/davidshen
> <http://about.me/davidshen?promo=email_sig>
>   <http://about.me/davidshen>
>


kmeans|| in Spark is not real paralleled?

2015-03-29 Thread Xi Shen
Hi,

I have opened a couple of threads asking about the k-means performance problem
in Spark. I think I made a little progress.

Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It
uses the "kmeans||" initialization algorithm, which is supposed to be a
faster version of kmeans++ and to give better results in general.

But I observed that if k is very large, the initialization step takes a
long time. From the CPU utilization chart, it looks like only one thread is
working. Please see
https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark
.

I read the paper, http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf,
and it points out that the kmeans++ initialization algorithm will suffer if k
is large. That's why the paper contributed the kmeans|| algorithm.


If I invoke KMeans.train using the random initialization algorithm, I do
not observe this problem, even with very large k, like k=5000. This makes
me suspect that the kmeans|| in Spark is not properly implemented and does
not use a parallel implementation.
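
For reference, the two invocations being compared are roughly:

import org.apache.spark.mllib.clustering.KMeans

// Default initialization, i.e. k-means|| (the slow step described above):
val modelParallel = KMeans.train(rdd, k, maxIterations)

// Explicit random initialization (runs = 1), which does not show the gap:
val modelRandom = KMeans.train(rdd, k, maxIterations, 1, KMeans.RANDOM)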


I have also tested my code and data set with Spark 1.3.0, and I still
observe this problem. I quickly checked the PR regarding the KMeans
algorithm change from 1.2.0 to 1.3.0. It seems to be only code improvement
and polish, not changing/improving the algorithm.


I originally worked in a Windows 64-bit environment, and I also tested in a
Linux 64-bit environment. I could provide the code and data set if anyone
wants to reproduce this problem.


I hope a Spark developer could comment on this problem and help identify
whether it is a bug.


Thanks,

Xi Shen
about.me/davidshen <http://about.me/davidshen>


Re: Why KMeans with mllib is so slow ?

2015-03-29 Thread Xi Shen
Hi Burak,

Unfortunately, I am expected to do my work in an HDInsight environment which
only supports Spark 1.2.0 with Microsoft's flavor. I cannot simply replace
it with Spark 1.3.

I think the problem I am observing is caused by the kmeans|| initialization
step. I will open another thread to discuss it.


Thanks,
David





Xi Shen
about.me/davidshen <http://about.me/davidshen>

On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz  wrote:

> Hi David,
>
> Can you also try with Spark 1.3 if possible? I believe there was a 2x
> improvement on K-Means between 1.2 and 1.3.
>
> Thanks,
> Burak
>
>
>
> On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 
> wrote:
>
>> Hi Jao,
>>
>> Sorry to pop up this old thread. I am have the same problem like you did.
>> I
>> want to know if you have figured out how to improve k-means on Spark.
>>
>> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
>> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster.
>> The
>> cluster has 7 executors, each has 8 cores...
>>
>> If I set k=5000 which is the required value for my task, the job goes on
>> forever...
>>
>>
>> Thanks,
>> David
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread Burak Yavuz
Hi David,

Can you also try with Spark 1.3 if possible? I believe there was a 2x
improvement on K-Means between 1.2 and 1.3.

Thanks,
Burak



On Sat, Mar 28, 2015 at 9:04 PM, davidshen84  wrote:

> Hi Jao,
>
> Sorry to pop up this old thread. I am have the same problem like you did. I
> want to know if you have figured out how to improve k-means on Spark.
>
> I am using Spark 1.2.0. My data set is about 270k vectors, each has about
> 350 dimensions. If I set k=500, the job takes about 3hrs on my cluster. The
> cluster has 7 executors, each has 8 cores...
>
> If I set k=5000 which is the required value for my task, the job goes on
> forever...
>
>
> Thanks,
> David
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why KMeans with mllib is so slow ?

2015-03-28 Thread davidshen84
Hi Jao,

Sorry to pop up this old thread. I have the same problem you did. I
want to know if you have figured out how to improve k-means on Spark.

I am using Spark 1.2.0. My data set is about 270k vectors, each with about
350 dimensions. If I set k=500, the job takes about 3 hrs on my cluster. The
cluster has 7 executors, each with 8 cores...

If I set k=5000, which is the required value for my task, the job goes on
forever...


Thanks,
David




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark MLLib KMeans Top Terms

2015-03-19 Thread mvsundaresan
I'm trying to cluster short text messages using KMeans. After training the
kmeans model I want to get the top terms (5-10) per cluster. How do I get
that using clusterCenters?

full code is here
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
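
For reference, a rough sketch of this in Scala (it assumes the term-to-index
mapping from the vectorization step is available as an Array[String] named
vocab, a made-up name, and that model is the trained KMeansModel):

// For each cluster center, take the indices with the largest weights and
// map them back through the vocabulary to get the top terms.
val topTermsPerCluster = model.clusterCenters.map { center =>
  center.toArray.zipWithIndex
    .sortBy { case (weight, _) => -weight }
    .take(10)
    .map { case (_, idx) => vocab(idx) }
}
topTermsPerCluster.zipWithIndex.foreach { case (terms, cid) =>
  println(s"cluster $cid: " + terms.mkString(", "))
}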
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLib-KMeans-Top-Terms-tp22154.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


