Re: Apply Kmeans in partitions
Hi Dimitri, what is the error you are getting? Please specify.

Apostolos

On 30/1/19 16:30, dimitris plakas wrote:
Hello everyone, I have a dataframe with 5040 rows, split into 5 groups. A column called "Group_Id" marks every row with a value from 0-4 depending on which group the row belongs to. I am trying to split my dataframe into 5 partitions and apply KMeans to every partition. I have tried

```
rdd = mydataframe.rdd.mapPartitions(function, True)
test = KMeans.train(rdd, num_of_centers, "random")
```

but I get an error. How can I apply KMeans to every partition? Thank you in advance,

--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Apply Kmeans in partitions
Hello everyone, I have a dataframe with 5040 rows, split into 5 groups. A column called "Group_Id" marks every row with a value from 0-4 depending on which group the row belongs to. I am trying to split my dataframe into 5 partitions and apply KMeans to every partition. I have tried

```
rdd = mydataframe.rdd.mapPartitions(function, True)
test = KMeans.train(rdd, num_of_centers, "random")
```

but I get an error. How can I apply KMeans to every partition? Thank you in advance,
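As Apostolos asks, the actual error matters, but one structural point is worth noting: mllib's KMeans.train is a driver-side call that itself launches distributed jobs, so it cannot be fed an RDD produced inside mapPartitions this way. A common alternative is to split the data by Group_Id and fit one independent model per group. A language-agnostic sketch of that "one model per group" idea, using toy data and a hand-rolled Lloyd's k-means (not the poster's code and not Spark's API):

```python
import random
from collections import defaultdict

def kmeans(points, k, iters=20, seed=42):
    """Minimal Lloyd's k-means over a list of (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = defaultdict(list)
        # assignment step: each point goes to its nearest center
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                            + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each non-empty center to its cluster mean
        for i, members in clusters.items():
            centers[i] = (sum(m[0] for m in members) / len(members),
                          sum(m[1] for m in members) / len(members))
    return centers

# toy stand-in for the 5040-row dataframe: rows of (group_id, x, y)
rng = random.Random(0)
rows = [(g, rng.random() + g, rng.random()) for g in range(5) for _ in range(100)]

# split by Group_Id, then fit one independent model per group
by_group = defaultdict(list)
for g, x, y in rows:
    by_group[g].append((x, y))

models = {g: kmeans(pts, k=3) for g, pts in by_group.items()}
print(len(models))  # prints 5: one model per group
```

In Spark itself the same shape can be had by filtering the dataframe on each Group_Id value and calling the trainer once per filtered subset from the driver.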
[Spark MLlib]: RDD caching behavior of KMeans
Hi All,

I was varying the storage levels of RDD caching in the KMeans program implemented with the MLlib library and got some very confusing and interesting results. The base code of the application is from a benchmark suite named SparkBench <https://github.com/CODAIT/spark-bench>. I changed the storage level of the data RDD passed to the KMeans train function, and MEMORY_AND_DISK_SER performs considerably worse than DISK_ONLY. MEMORY_AND_DISK performed best, as expected. But why the memory-serialized storage level performs worse than the disk-serialized level is very confusing. I am using 1 node as master and 4 nodes as slaves, with each executor having a 48g JVM. The cached data should fit within memory easily. If anyone has any idea or suggestion on why this behavior occurs, please let me know.

Regards,
Muhib
Understanding the results from Spark's KMeans clustering object
Hello Everyone,

I am performing clustering on a dataset using PySpark. To choose the number of clusters I performed clustering over a range of values (2, 20) and computed the WSSE (within-cluster sum of squares) for each value of k. This is where I found something unusual. According to my understanding, when you increase the number of clusters the WSSE decreases monotonically. But the results I got say otherwise. I'm displaying the WSSE for the first few clusters only.

Results from Spark:

```
For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071
```

If you look at the WSSE values for k=5 and k=6, you'll see the WSSE has increased. I turned to sklearn to see if I get similar results. The code I used for Spark and sklearn is in the appendix section towards the end of the post. I have tried to use the same values for the parameters in the Spark and sklearn KMeans models. The following are the results from sklearn, and they are as I expected them to be: monotonically decreasing.

Results from sklearn:

```
For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467
```

I am not sure why the WSSE values increase in Spark. I tried different datasets and found similar behavior there as well. Is there someplace I am going wrong? Any clues would be great.

APPENDIX

The dataset is located here.
Read the data and declare variables:

```
# get data
import pandas as pd
url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"
df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)
target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i != target_col]
k_min = 2    # 2 is inclusive
k_max = 21   # 21 is exclusive; fits up to k = 20
max_iter = 1000
seed = 42
```

This is the code I am using for the sklearn results:

```
from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL

ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))

wsse_collect = []
for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)
```

This is the code I am using for the Spark results:

```
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans

standard_scaler_input_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_features = 'prediction'

assembler = VectorAssembler(inputCols=numeric_cols,
                            outputCol=standard_scaler_input_features)
assembled_df = assembler.transform(df_spark)

scaler = StandardScaler(inputCol=standard_scaler_input_features,
                        outputCol=kmeans_input_features,
                        withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)

wsse_collect_spark = []
for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features,
                predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))
```
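One likely explanation: WSSE scores the particular solution found, and k-means only converges to a local minimum, so a poorly initialized run at k=6 can legitimately cost more than a good run at k=5. sklearn masks this by running several random restarts (n_init) per k and keeping the best, which tends to restore monotonicity. A small deterministic illustration that cost compares solutions, not values of k, using toy 1-D data and hand-picked centers (not either library's output):

```python
def wsse(points, centers):
    """Within-cluster sum of squares: each point charged to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

points = [0, 1, 10, 11, 20, 21]  # three obvious 1-D clusters

good_k2 = [5.5, 20.5]            # a sensible 2-center solution
optimal_k3 = [0.5, 10.5, 20.5]   # the natural 3-center solution
poor_k3 = [0.5, 10.5, 100.0]     # k=3 with one badly placed center

print(wsse(points, good_k2))     # 101.5
print(wsse(points, optimal_k3))  # 1.5
print(wsse(points, poor_k3))     # 201.5 -- more clusters, yet higher cost
```

If the extra center ends up somewhere useless, the larger-k solution scores worse than a decent smaller-k one, which matches the k=5 vs. k=6 jump in the Spark numbers above.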
Bisecting Kmeans Linkage Matrix Output (Cluster Indices)
I have been working on a project to return a linkage-matrix output from the Spark Bisecting KMeans algorithm so that it is possible to plot the selection steps in a dendrogram. I am having trouble returning valid indices when I use more than 3-4 clusters in the algorithm, and am hoping someone else might have the time/interest to take a look. To achieve this I made some modifications to the Bisecting KMeans algorithm to produce a z-linkage matrix based on yu-iskw's work. I also made some modifications to log more information about the selection steps in the Bisecting KMeans algorithm at run-time. Test outputs using the Iris dataset with both k = 3 and k = 10 clusters can be seen in my Stack Overflow post <https://stackoverflow.com/questions/49265521/bisecting-kmeans-cluster-indices-in-apache-spark>. The project so far (with a simple sbt build and the compiled jars) can also be seen in my GitHub repo <https://github.com/GabeChurch/IncubatingProjects> and is detailed in the aforementioned Stack Overflow post.
Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.
The train method is on the companion object:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans$

Here is a decent resource on companion object usage: https://docs.scala-lang.org/tour/singleton-objects.html

On Wed, Dec 13, 2017 at 9:16 AM Michael Segel wrote:
> Hi,
>
> Just came across this while looking at the docs on how to use Spark's KMeans clustering.
>
> Note: this appears to be true in both the 2.1 and 2.2 documentation.
>
> The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means
>
> Here the example contains the following line:
>
> val clusters = KMeans.train(parsedData, numClusters, numIterations)
>
> I was trying to get more information on the train() method, so I checked out the KMeans Scala API:
>
> https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans
>
> The issue is that I couldn't find the train method, so I thought I was slowly losing my mind. I checked the entire API page and could not find any API docs which describe the method train(). I ended up looking at the source code and found the method in the Scala source.
> (You can see the code here: https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala)
>
> So the method(s) exist, but they are not covered in the Scala API doc.
>
> How do you raise this as a 'bug'?
>
> Thx
> -Mike

--
Scott Reynolds
Principal Engineer, Twilio
EMAIL sreyno...@twilio.com
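For readers coming from other languages, a companion object's train is roughly a static factory: it lives on `object KMeans` (documented on the `KMeans$` Scaladoc page), not on instances of `class KMeans`, which is why it doesn't appear on the class page. A loose Python analogy using a classmethod (a hypothetical toy class, not Spark's actual API):

```python
class KMeans:
    """Toy stand-in for a class whose factory methods live apart from instances."""
    def __init__(self, k, max_iterations):
        self.k = k
        self.max_iterations = max_iterations

    @classmethod
    def train(cls, data, k, max_iterations=20):
        # Plays the role of the companion object's KMeans.train(...):
        # callable without an instance, documented "next to" the class.
        model = cls(k, max_iterations)
        model.training_size = len(data)  # illustrative only
        return model

m = KMeans.train([(0.0, 0.0), (1.0, 1.0)], k=2)
print(m.k, m.training_size)  # 2 2
```

The Scaladoc convention is that the `$`-suffixed page holds these object-level members, so searching the class page alone will miss them.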
Apache Spark documentation on mllib's Kmeans doesn't jibe.
Hi,

Just came across this while looking at the docs on how to use Spark's KMeans clustering. Note: this appears to be true in both the 2.1 and 2.2 documentation.

The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means

Here the example contains the following line:

```
val clusters = KMeans.train(parsedData, numClusters, numIterations)
```

I was trying to get more information on the train() method, so I checked out the KMeans Scala API: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans

The issue is that I couldn't find the train method, so I thought I was slowly losing my mind. I checked the entire API page and could not find any API docs which describe the method train(). I ended up looking at the source code and found the method there. (You can see the code here: https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala)

So the method(s) exist, but they are not covered in the Scala API doc. How do you raise this as a 'bug'?

Thx
-Mike
Re: KMeans Clustering is not Reproducible
Hi Ankur, thank you for answering. But my problem is not that I'm stuck in a local extremum, but rather the reproducibility of kmeans. What I'm trying to achieve is: when the input data and all the parameters stay the same, especially the seed, I want to get the exact same results, even when the partitioning changes. As far as I'm concerned, if I set a seed in an ML algorithm, I would expect that algorithm to be deterministic. Unfortunately I couldn't find any information on whether this is a goal of Spark's mllib or not.

Maybe a little bit of background: I'm trying to benchmark some ML algorithms while changing my cluster config, i.e. I want to find the best cluster config that still achieves the same results. But what I see is that when I change the number of executors, the results differ and so become incomparable. In essence my question is: are the algorithms in mllib partition agnostic or not?

Thanks for your help,
Christoph

Am 24.05.2017 20:49 schrieb "Ankur Srivastava":

Hi Christoph, I am not an expert in ML and have not used Spark KMeans, but your problem seems to be an issue of local minimum vs. global minimum. You should run K-means multiple times with random starting points and also try multiple values of K (unless you are already sure). Hope this helps. Thanks Ankur

On Wed, May 24, 2017 at 2:15 AM, Christoph Bruecke wrote:
> Hi Anastasios,
>
> thanks for the reply, but caching doesn't seem to change anything.
>
> After further investigation it really seems that the RDD#takeSample method is the cause of the non-reproducibility.
>
> Is this considered a bug and should I open an issue for that?
>
> BTW: my example script contains a little typo in line 3: it is `feature` not `features` (mind the `s`).
> > Best, > Christoph > > The script with caching > > ``` > import org.apache.spark.sql.DataFrame > import org.apache.spark.ml.clustering.KMeans > import org.apache.spark.ml.feature.VectorAssembler > import org.apache.spark.storage.StorageLevel > > // generate random data for clustering > val randomData = spark.range(1, 1000).withColumn("a", > rand(123)).withColumn("b", rand(321)) > > val vecAssembler = new VectorAssembler().setInputCols(Array("a", > "b")).setOutputCol("features") > > val data = vecAssembler.transform(randomData) > > // instantiate KMeans with fixed seed > val kmeans = new KMeans().setK(10).setSeed(9876L) > > // train the model with different partitioning > val dataWith1Partition = data.repartition(1) > // cache the data > dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK) > println("1 Partition: " + kmeans.fit(dataWith1Partition) > .computeCost(dataWith1Partition)) > > val dataWith4Partition = data.repartition(4) > // cache the data > dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK) > println("4 Partition: " + kmeans.fit(dataWith4Partition) > .computeCost(dataWith4Partition)) > > > ``` > > Output: > > ``` > 1 Partition: 16.028212597888057 > 4 Partition: 16.14758460544976 > ``` > > > On 22 May 2017, at 16:33, Anastasios Zouzias wrote: > > > > Hi Christoph, > > > > Take a look at this, you might end up having a similar case: > > > > http://www.spark.tc/using-sparks-cache-for-correctness-not- > just-performance/ > > > > If this is not the case, then I agree with you the kmeans should be > partitioning agnostic (although I haven't check the code yet). > > > > Best, > > Anastasios > > > > > > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke > wrote: > > Hi, > > > > I’m trying to figure out how to use KMeans in order to achieve > reproducible results. I have found that running the same kmeans instance on > the same data, with different partitioning will produce different > clusterings. 
> > > > Given a simple KMeans run with fixed seed returns different results on > the same > > training data, if the training data is partitioned differently. > > > > Consider the following example. The same KMeans clustering set up is run > on > > identical data. The only difference is the partitioning of the training > data > > (one partition vs. four partitions). > > > > ``` > > import org.apache.spark.sql.DataFrame > > import org.apache.spark.ml.clustering.KMeans > > import org.apache.spark.ml.features.VectorAssembler > > > > // generate random data for clustering > > val randomData = spark.range(1, 1000).withColumn("a", > rand(123)).withColumn("b", rand(321)) > > > > val vecAssembler = new VectorAssembler().setIn
Re: KMeans Clustering is not Reproducible
I agree with what Ankur said. The kmeans seeding program (the 'takeSample' method) runs in parallel, so each partition draws its sampling points from its own local data, which is what makes the result partition-dependent. The seeding method is based on the k-means|| algorithm of Bahmani et al., which gives approximation guarantees on the kmeans cost. You could set the initial seeding points yourself, which avoids the issue.

Regards,
Yu Zhang

On Wed, May 24, 2017 at 1:49 PM, Ankur Srivastava <ankur.srivast...@gmail.com> wrote:
> Hi Christoph,
>
> I am not an expert in ML and have not used Spark KMeans, but your problem seems to be an issue of local minimum vs. global minimum. You should run K-means multiple times with random starting points and also try multiple values of K (unless you are already sure).
>
> Hope this helps.
>
> Thanks
> Ankur
>
> On Wed, May 24, 2017 at 2:15 AM, Christoph Bruecke wrote:
>
>> Hi Anastasios,
>>
>> thanks for the reply, but caching doesn't seem to change anything.
>>
>> After further investigation it really seems that the RDD#takeSample method is the cause of the non-reproducibility.
>>
>> Is this considered a bug and should I open an issue for that?
>>
>> BTW: my example script contains a little typo in line 3: it is `feature` not `features` (mind the `s`).
>> >> Best, >> Christoph >> >> The script with caching >> >> ``` >> import org.apache.spark.sql.DataFrame >> import org.apache.spark.ml.clustering.KMeans >> import org.apache.spark.ml.feature.VectorAssembler >> import org.apache.spark.storage.StorageLevel >> >> // generate random data for clustering >> val randomData = spark.range(1, 1000).withColumn("a", >> rand(123)).withColumn("b", rand(321)) >> >> val vecAssembler = new VectorAssembler().setInputCols(Array("a", >> "b")).setOutputCol("features") >> >> val data = vecAssembler.transform(randomData) >> >> // instantiate KMeans with fixed seed >> val kmeans = new KMeans().setK(10).setSeed(9876L) >> >> // train the model with different partitioning >> val dataWith1Partition = data.repartition(1) >> // cache the data >> dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK) >> println("1 Partition: " + kmeans.fit(dataWith1Partition) >> .computeCost(dataWith1Partition)) >> >> val dataWith4Partition = data.repartition(4) >> // cache the data >> dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK) >> println("4 Partition: " + kmeans.fit(dataWith4Partition) >> .computeCost(dataWith4Partition)) >> >> >> ``` >> >> Output: >> >> ``` >> 1 Partition: 16.028212597888057 >> 4 Partition: 16.14758460544976 >> ``` >> >> > On 22 May 2017, at 16:33, Anastasios Zouzias wrote: >> > >> > Hi Christoph, >> > >> > Take a look at this, you might end up having a similar case: >> > >> > http://www.spark.tc/using-sparks-cache-for-correctness-not- >> just-performance/ >> > >> > If this is not the case, then I agree with you the kmeans should be >> partitioning agnostic (although I haven't check the code yet). >> > >> > Best, >> > Anastasios >> > >> > >> > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke >> wrote: >> > Hi, >> > >> > I’m trying to figure out how to use KMeans in order to achieve >> reproducible results. 
I have found that running the same kmeans instance on >> the same data, with different partitioning will produce different >> clusterings. >> > >> > Given a simple KMeans run with fixed seed returns different results on >> the same >> > training data, if the training data is partitioned differently. >> > >> > Consider the following example. The same KMeans clustering set up is >> run on >> > identical data. The only difference is the partitioning of the training >> data >> > (one partition vs. four partitions). >> > >> > ``` >> > import org.apache.spark.sql.DataFrame >> > import org.apache.spark.ml.clustering.KMeans >> > import org.apache.spark.ml.features.VectorAssembler >> > >> > // generate random data for clustering >> > val randomData = spark.range(1, 1000).withColumn("a", >> rand(123)).withColumn("b", rand(321)) >> > >> > val vecAssembler = new VectorAssembler().se
Re: KMeans Clustering is not Reproducible
Hi Christoph, I am not an expert in ML and have not used Spark KMeans but your problem seems to be an issue of local minimum vs global minimum. You should run K-means multiple times with random starting point and also try with multiple values of K (unless you are already sure). Hope this helps. Thanks Ankur On Wed, May 24, 2017 at 2:15 AM, Christoph Bruecke wrote: > Hi Anastasios, > > thanks for the reply but caching doesn’t seem to change anything. > > After further investigation it really seems that the RDD#takeSample method > is the cause of the non-reproducibility. > > Is this considered a bug and should I open an Issue for that? > > BTW: my example script contains a little type in line 3: it is `feature` > not `features` (mind the `s`). > > Best, > Christoph > > The script with caching > > ``` > import org.apache.spark.sql.DataFrame > import org.apache.spark.ml.clustering.KMeans > import org.apache.spark.ml.feature.VectorAssembler > import org.apache.spark.storage.StorageLevel > > // generate random data for clustering > val randomData = spark.range(1, 1000).withColumn("a", > rand(123)).withColumn("b", rand(321)) > > val vecAssembler = new VectorAssembler().setInputCols(Array("a", > "b")).setOutputCol("features") > > val data = vecAssembler.transform(randomData) > > // instantiate KMeans with fixed seed > val kmeans = new KMeans().setK(10).setSeed(9876L) > > // train the model with different partitioning > val dataWith1Partition = data.repartition(1) > // cache the data > dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK) > println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost( > dataWith1Partition)) > > val dataWith4Partition = data.repartition(4) > // cache the data > dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK) > println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost( > dataWith4Partition)) > > > ``` > > Output: > > ``` > 1 Partition: 16.028212597888057 > 4 Partition: 16.14758460544976 > ``` > > > On 22 May 
2017, at 16:33, Anastasios Zouzias wrote: > > > > Hi Christoph, > > > > Take a look at this, you might end up having a similar case: > > > > http://www.spark.tc/using-sparks-cache-for-correctness- > not-just-performance/ > > > > If this is not the case, then I agree with you the kmeans should be > partitioning agnostic (although I haven't check the code yet). > > > > Best, > > Anastasios > > > > > > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke > wrote: > > Hi, > > > > I’m trying to figure out how to use KMeans in order to achieve > reproducible results. I have found that running the same kmeans instance on > the same data, with different partitioning will produce different > clusterings. > > > > Given a simple KMeans run with fixed seed returns different results on > the same > > training data, if the training data is partitioned differently. > > > > Consider the following example. The same KMeans clustering set up is run > on > > identical data. The only difference is the partitioning of the training > data > > (one partition vs. four partitions). 
> > > > ``` > > import org.apache.spark.sql.DataFrame > > import org.apache.spark.ml.clustering.KMeans > > import org.apache.spark.ml.features.VectorAssembler > > > > // generate random data for clustering > > val randomData = spark.range(1, 1000).withColumn("a", > rand(123)).withColumn("b", rand(321)) > > > > val vecAssembler = new VectorAssembler().setInputCols(Array("a", > "b")).setOutputCol("features") > > > > val data = vecAssembler.transform(randomData) > > > > // instantiate KMeans with fixed seed > > val kmeans = new KMeans().setK(10).setSeed(9876L) > > > > // train the model with different partitioning > > val dataWith1Partition = data.repartition(1) > > println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost( > dataWith1Partition)) > > > > val dataWith4Partition = data.repartition(4) > > println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost( > dataWith4Partition)) > > ``` > > > > I get the following related cost > > > > ``` > > 1 Partition: 16.028212597888057 > > 4 Partition: 16.14758460544976 > > ``` > > > > What I want to achieve is that repeated computations of the KMeans > Clustering should yield identical result on identical training data, > regardless of the partitioning. > > > > Looking through the Spark source code, I guess the cause is the > initialization method of KMeans which in turn uses the `takeSample` method, > which does not seem to be partition agnostic. > > > > Is this behaviour expected? Is there anything I could do to achieve > reproducible results? > > > > Best, > > Christoph > > - > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > > > > > > > > -- > > -- Anastasios Zouzias > > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Re: KMeans Clustering is not Reproducible
Hi Anastasios,

thanks for the reply, but caching doesn't seem to change anything.

After further investigation it really seems that the RDD#takeSample method is the cause of the non-reproducibility.

Is this considered a bug and should I open an issue for that?

BTW: my example script contains a little typo in line 3: it is `feature` not `features` (mind the `s`).

Best,
Christoph

The script with caching:

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.storage.StorageLevel

// generate random data for clustering
val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321))

val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features")

val data = vecAssembler.transform(randomData)

// instantiate KMeans with fixed seed
val kmeans = new KMeans().setK(10).setSeed(9876L)

// train the model with different partitioning
val dataWith1Partition = data.repartition(1)
// cache the data
dataWith1Partition.persist(StorageLevel.MEMORY_AND_DISK)
println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))

val dataWith4Partition = data.repartition(4)
// cache the data
dataWith4Partition.persist(StorageLevel.MEMORY_AND_DISK)
println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
```

Output:

```
1 Partition: 16.028212597888057
4 Partition: 16.14758460544976
```

> On 22 May 2017, at 16:33, Anastasios Zouzias wrote:
>
> Hi Christoph,
>
> Take a look at this, you might end up having a similar case:
>
> http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/
>
> If this is not the case, then I agree with you that kmeans should be partitioning agnostic (although I haven't checked the code yet).
> > Best, > Anastasios > > > On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke > wrote: > Hi, > > I’m trying to figure out how to use KMeans in order to achieve reproducible > results. I have found that running the same kmeans instance on the same data, > with different partitioning will produce different clusterings. > > Given a simple KMeans run with fixed seed returns different results on the > same > training data, if the training data is partitioned differently. > > Consider the following example. The same KMeans clustering set up is run on > identical data. The only difference is the partitioning of the training data > (one partition vs. four partitions). > > ``` > import org.apache.spark.sql.DataFrame > import org.apache.spark.ml.clustering.KMeans > import org.apache.spark.ml.features.VectorAssembler > > // generate random data for clustering > val randomData = spark.range(1, 1000).withColumn("a", > rand(123)).withColumn("b", rand(321)) > > val vecAssembler = new VectorAssembler().setInputCols(Array("a", > "b")).setOutputCol("features") > > val data = vecAssembler.transform(randomData) > > // instantiate KMeans with fixed seed > val kmeans = new KMeans().setK(10).setSeed(9876L) > > // train the model with different partitioning > val dataWith1Partition = data.repartition(1) > println("1 Partition: " + > kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition)) > > val dataWith4Partition = data.repartition(4) > println("4 Partition: " + > kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition)) > ``` > > I get the following related cost > > ``` > 1 Partition: 16.028212597888057 > 4 Partition: 16.14758460544976 > ``` > > What I want to achieve is that repeated computations of the KMeans Clustering > should yield identical result on identical training data, regardless of the > partitioning. 
> Looking through the Spark source code, I guess the cause is the initialization method of KMeans, which in turn uses the `takeSample` method, which does not seem to be partition agnostic.
>
> Is this behaviour expected? Is there anything I could do to achieve reproducible results?
>
> Best,
> Christoph

--
-- Anastasios Zouzias
Re: KMeans Clustering is not Reproducible
Hi Christoph,

Take a look at this, you might end up having a similar case:

http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/

If this is not the case, then I agree with you that kmeans should be partitioning agnostic (although I haven't checked the code yet).

Best,
Anastasios

On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke wrote:
> Hi,
>
> I'm trying to figure out how to use KMeans in order to achieve reproducible results. I have found that running the same kmeans instance on the same data, with different partitioning, will produce different clusterings.
>
> A simple KMeans run with a fixed seed returns different results on the same training data if the training data is partitioned differently.
>
> Consider the following example. The same KMeans clustering set-up is run on identical data. The only difference is the partitioning of the training data (one partition vs. four partitions).
>
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.features.VectorAssembler
>
> // generate random data for clustering
> val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321))
>
> val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features")
>
> val data = vecAssembler.transform(randomData)
>
> // instantiate KMeans with fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
>
> // train the model with different partitioning
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
>
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> ```
>
> I get the following related cost:
>
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
>
> What I want to achieve is that repeated computations of the KMeans clustering should yield identical results on identical training data, regardless of the partitioning.
>
> Looking through the Spark source code, I guess the cause is the initialization method of KMeans, which in turn uses the `takeSample` method, which does not seem to be partition agnostic.
>
> Is this behaviour expected? Is there anything I could do to achieve reproducible results?
>
> Best,
> Christoph

--
-- Anastasios Zouzias
KMeans Clustering is not Reproducible
Hi,

I'm trying to figure out how to use KMeans in order to achieve reproducible results. I have found that running the same kmeans instance on the same data, with different partitioning, will produce different clusterings.

A simple KMeans run with a fixed seed returns different results on the same training data if the training data is partitioned differently.

Consider the following example. The same KMeans clustering set-up is run on identical data. The only difference is the partitioning of the training data (one partition vs. four partitions).

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.features.VectorAssembler

// generate random data for clustering
val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321))

val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features")

val data = vecAssembler.transform(randomData)

// instantiate KMeans with fixed seed
val kmeans = new KMeans().setK(10).setSeed(9876L)

// train the model with different partitioning
val dataWith1Partition = data.repartition(1)
println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))

val dataWith4Partition = data.repartition(4)
println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
```

I get the following related cost:

```
1 Partition: 16.028212597888057
4 Partition: 16.14758460544976
```

What I want to achieve is that repeated computations of the KMeans clustering should yield identical results on identical training data, regardless of the partitioning.

Looking through the Spark source code, I guess the cause is the initialization method of KMeans, which in turn uses the `takeSample` method, which does not seem to be partition agnostic.

Is this behaviour expected? Is there anything I could do to achieve reproducible results?

Best,
Christoph
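The partition sensitivity is consistent with how distributed sampling generally works: the sampler derives an independent random stream per partition from the seed, so the same seed over differently partitioned data selects different rows. A simplified stdlib model of that effect (this is not Spark's actual takeSample implementation; the seed-plus-partition-index scheme here is only illustrative):

```python
import random

rows = list(range(1, 1000))  # stand-in dataset, like spark.range(1, 1000)

def sample_per_partition(rows, num_partitions, seed, per_partition):
    """Each partition samples locally from its own seeded stream
    (seed + partition index), mimicking a distributed sampler."""
    size = len(rows) // num_partitions
    picked = []
    for i in range(num_partitions):
        part = rows[i * size:(i + 1) * size]  # leftover rows ignored in this toy model
        rng = random.Random(seed + i)
        picked.extend(rng.sample(part, per_partition))
    return sorted(picked)

picked_1 = sample_per_partition(rows, 1, seed=9876, per_partition=4)
picked_4 = sample_per_partition(rows, 4, seed=9876, per_partition=1)

# same seed, same data -- yet the selected rows depend on the partitioning
print(picked_1)
print(picked_4)
```

Since k-means is sensitive to its initial centers, any difference in the sampled seed points can propagate into a different final clustering, which is the behavior reported above.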
Re: [MLlib] kmeans random initialization, same seed every time
Hi Julian, Thanks for reporting this. This is a valid issue, and I created https://issues.apache.org/jira/browse/SPARK-19957 to track it. Right now the seed defaults to this.getClass.getName.hashCode.toLong, which indeed stays the same across multiple fits. Feel free to leave your comments or send a PR for the fix. As a workaround for your problem, you may add .setSeed(new Random().nextLong()) before fit(). Thanks, Yuhao 2017-03-14 5:46 GMT-07:00 Julian Keppel : > I'm sorry, I missed some important informations. I use Spark version 2.0.2 > in Scala 2.11.8. > > 2017-03-14 13:44 GMT+01:00 Julian Keppel : > >> Hi everybody, >> >> I make some experiments with the Spark kmeans implementation of the new >> DataFrame-API. I compare clustering results of different runs with >> different parameters. I recognized that for random initialization mode, the >> seed value is the same every time. How is it calculated? In my >> understanding the seed should be random if it is not provided by the user. >> >> Thank you for you help. >> >> Julian >> > >
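The effect of that default is easy to see in miniature. A hedged, Spark-free sketch (`TinyKMeans` and `default_seed` are invented stand-ins for the real implementation; the seed is derived from the class name, mirroring Spark's default):

```python
import random
import zlib

def default_seed(cls):
    # Analogue of Spark's this.getClass.getName.hashCode.toLong default:
    # derived from the class *name*, so it is identical for every fit.
    return zlib.crc32(cls.__name__.encode())

class TinyKMeans:
    def __init__(self, seed=None):
        self.seed = default_seed(type(self)) if seed is None else seed

    def pick_initial(self, points, k):
        # "random" init that is actually deterministic under the default seed.
        return random.Random(self.seed).sample(points, k)

pts = list(range(100))
# Two separate "fits" with the default seed draw the SAME initial centers:
print(TinyKMeans().pick_initial(pts, 3) == TinyKMeans().pick_initial(pts, 3))  # True
# The workaround from the thread: pass a fresh random seed explicitly.
a = TinyKMeans(seed=random.randrange(2**63)).pick_initial(pts, 3)
```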
Re: [MLlib] kmeans random initialization, same seed every time
I'm sorry, I missed some important information. I use Spark version 2.0.2 with Scala 2.11.8. 2017-03-14 13:44 GMT+01:00 Julian Keppel : > Hi everybody, > > I make some experiments with the Spark kmeans implementation of the new > DataFrame-API. I compare clustering results of different runs with > different parameters. I recognized that for random initialization mode, the > seed value is the same every time. How is it calculated? In my > understanding the seed should be random if it is not provided by the user. > > Thank you for you help. > > Julian >
[MLlib] kmeans random initialization, same seed every time
Hi everybody, I'm running some experiments with the Spark KMeans implementation of the new DataFrame API. I compare clustering results of different runs with different parameters, and I noticed that for the random initialization mode, the seed value is the same every time. How is it calculated? In my understanding, the seed should be random if it is not provided by the user. Thank you for your help. Julian
Re: ML version of Kmeans
Hey, You could also take a look at MLeap, which provides a runtime for any Spark transformer and does not have any dependencies on a SparkContext or Spark libraries (excepting MLlib-local for linear algebra). https://github.com/combust/mleap On Tue, Jan 31, 2017 at 2:33 AM, Aseem Bansal wrote: > If you want to predict using dataset then transform is the way to go. If > you want to predict on vectors then you will have to wait on this issue to > be completed https://issues.apache.org/jira/browse/SPARK-10413 > > On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau > wrote: > >> You most likely want the transform function on KMeansModel (although that >> works on a dataset input rather than a single element at a time). >> >> On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar < >> mrajaf...@gmail.com> wrote: >> >>> Hi, >>> >>> I am not able to find predict method on "ML" version of Kmeans. >>> >>> Mllib version has a predict method. KMeansModel.predict(point: Vector) >>> . >>> How to predict the cluster point for new vectors in ML version of kmeans >>> ? >>> >>> Regards, >>> Rajesh >>> >> >> >> >> -- >> Cell : 425-233-8271 <(425)%20233-8271> >> Twitter: https://twitter.com/holdenkarau >> > >
Re: ML version of Kmeans
If you want to predict using dataset then transform is the way to go. If you want to predict on vectors then you will have to wait on this issue to be completed https://issues.apache.org/jira/browse/SPARK-10413 On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau wrote: > You most likely want the transform function on KMeansModel (although that > works on a dataset input rather than a single element at a time). > > On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar < > mrajaf...@gmail.com> wrote: > >> Hi, >> >> I am not able to find predict method on "ML" version of Kmeans. >> >> Mllib version has a predict method. KMeansModel.predict(point: Vector) >> . >> How to predict the cluster point for new vectors in ML version of kmeans ? >> >> Regards, >> Rajesh >> > > > > -- > Cell : 425-233-8271 > Twitter: https://twitter.com/holdenkarau >
Re: ML version of Kmeans
You most likely want the transform function on KMeansModel (although that works on a dataset input rather than a single element at a time). On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I am not able to find predict method on "ML" version of Kmeans. > > Mllib version has a predict method. KMeansModel.predict(point: Vector) > . > How to predict the cluster point for new vectors in ML version of kmeans ? > > Regards, > Rajesh > -- Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau
ML version of Kmeans
Hi, I am not able to find a predict method on the "ML" version of KMeans. The MLlib version has a predict method: KMeansModel.predict(point: Vector). How can I predict the cluster for new vectors in the ML version of KMeans? Regards, Rajesh
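Whichever API you use, prediction for k-means boils down to nearest-center assignment: MLlib's predict(point) does it for one vector, while ml's transform does it for every row of a dataset. A small Spark-free sketch of both shapes (the function names are mine, not Spark's):

```python
def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_one(centers, point):
    # MLlib-style: predict the cluster index for a single vector.
    return min(range(len(centers)), key=lambda i: squared_dist(centers[i], point))

def transform(centers, dataset):
    # ml-style: operate on a whole dataset, appending a prediction per row.
    return [(row, predict_one(centers, row)) for row in dataset]

centers = [(0.0, 0.0), (10.0, 10.0)]
print(predict_one(centers, (1.0, 2.0)))              # 0
print(transform(centers, [(9.0, 9.0), (0.5, 0.1)]))  # [((9.0, 9.0), 1), ((0.5, 0.1), 0)]
```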
PySpark 2: Kmeans The input data is not directly cached
Hi, I don't know why I receive the message "WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached." when I try to use Spark KMeans:

```
df_Part = assembler.transform(df_Part)
df_Part.cache()

while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1
```

It says that my input (DataFrame) is not cached! I tried to print df_Part.is_cached and I received True, which means that my DataFrame is cached. So why does Spark still warn me about this? Thank you in advance.
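For what it's worth, the loop above is a standard threshold-based search over k: keep increasing k until the WSSSE drops below a stopping value. A Spark-free sketch of the same control flow, with a toy Lloyd's-algorithm fit standing in for KMeans.fit (all names are mine):

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_kmeans(points, k, iters=20, seed=0):
    # Toy Lloyd's algorithm; returns (centers, wssse).
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: squared_dist(centers[i], p))].append(p)
        # Recompute means; keep the old center if a cluster went empty.
        centers = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    wssse = sum(min(squared_dist(p, c) for c in centers) for p in points)
    return centers, wssse

points = [(float(i % 10), float(i // 10)) for i in range(100)]  # 10x10 grid
k, max_cluster, wssse, threshold = 2, 8, float("inf"), 50.0
while k <= max_cluster and wssse > threshold:
    _, wssse = fit_kmeans(points, k)
    k += 1
print(k - 1, wssse)  # last k tried and its WSSSE
```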
Re: Why training data in Kmeans Spark streaming clustering
The algorithm update is just broken into 2 steps: trainOn - to learn/update the cluster centers, and predictOn - predicts cluster assignment on data The StreamingKMeansExample you reference breaks up data into training and test because you might want to score the predictions. If you don't care about that, you could just use a single stream for both steps. On Thu, Aug 11, 2016 at 9:14 AM, Ahmed Sadek wrote: > Dear All, > > I was wondering why there is training data and testing data in kmeans ? > Shouldn't it be unsupervised learning with just access to stream data ? > > I found similar question but couldn't understand the answer. > http://stackoverflow.com/questions/30972057/is-the- > streaming-k-means-clustering-predefined-in-mllib-library-of-spark-supervi > > Thanks! > Ahmed >
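For intuition on what trainOn does per batch: StreamingKMeans documents a decayed update of each center, roughly c ← (c·n·a + Σx)/(n·a + m), with n the center's weight, a the decay factor, and m the batch size. A minimal sketch of one such update for a single center (variable names are my own):

```python
def update_center(center, weight, batch_points, decay=1.0):
    # One StreamingKMeans-style update for a single cluster's center:
    #   c <- (c * n * a + sum(batch)) / (n * a + m),   n <- n * a + m
    # where n is the center's weight, a the decay factor, m the batch size.
    m = len(batch_points)
    sums = [sum(p[d] for p in batch_points) for d in range(len(center))]
    new_weight = weight * decay + m
    new_center = [(c * weight * decay + s) / new_weight for c, s in zip(center, sums)]
    return new_center, new_weight

# A center at 0.0 with weight 10 pulled halfway toward a batch of ten points at 1.0:
c, w = update_center([0.0], 10.0, [[1.0]] * 10, decay=1.0)
print(c, w)  # [0.5] 20.0
```

With decay a = 1 all history is kept; with a = 0 only the latest batch matters, which is why predictOn alone never changes the model.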
Why training data in Kmeans Spark streaming clustering
Dear All, I was wondering why there are training data and testing data in kmeans? Shouldn't it be unsupervised learning with just access to the stream data? I found a similar question but couldn't understand the answer. http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering-predefined-in-mllib-library-of-spark-supervi Thanks! Ahmed
RE: bisecting kmeans model tree
There seems to be an existing JIRA for this: https://issues.apache.org/jira/browse/SPARK-11664

From: Yanbo Liang
Sent: Saturday, July 16, 2016 6:18 PM
To: roni
Cc: user@spark.apache.org
Subject: Re: bisecting kmeans model tree

Currently we do not expose the APIs to get the Bisecting KMeans tree structure; they are private to the ml.clustering package scope. But I think we should make a plan to expose these APIs, like what we did for Decision Tree. Thanks, Yanbo

2016-07-12 11:45 GMT-07:00 roni: Hi Spark/MLlib experts, anyone who can shine light on this? Thanks, _R

On Thu, Apr 21, 2016 at 12:46 PM, roni wrote: Hi, I want to get the bisecting k-means tree structure to show a dendrogram on the heatmap I am generating based on the hierarchical clustering of data. How do I get that using MLlib? Thanks, -Roni
Re: Kmeans dataset initialization
Can anyone suggest how I can initialize the KMeans structure directly from a Dataset of Rows? On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane wrote: > I have all the data required for KMeans in a dataset in memory > > Standard approach to load this data from a file is > spark.read().format("libsvm").load(filename) > > where the file has data in the format > 0 1:0.0 2:0.0 3:0.0 > > > How do i this from an in-memory dataset already present. > Any suggestions ? > > -Tony > >
Kmeans dataset initialization
I have all the data required for KMeans in a dataset in memory. The standard approach to load this data from a file is spark.read().format("libsvm").load(filename), where the file has data in the format 0 1:0.0 2:0.0 3:0.0. How do I do this from an in-memory dataset that is already present? Any suggestions? -Tony
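For reference, a libsvm line is just a label followed by 1-based index:value pairs, so building the same (label, features) rows from in-memory data is mostly a matter of reproducing that shape. A small parsing sketch (the helper is mine, not a Spark API):

```python
def parse_libsvm_line(line, num_features):
    # "0 1:0.5 3:2.0" -> (0.0, [0.5, 0.0, 2.0]): label, then sparse pairs
    # expanded to a dense vector.
    parts = line.split()
    label = float(parts[0])
    vec = [0.0] * num_features
    for item in parts[1:]:
        idx, val = item.split(":")
        vec[int(idx) - 1] = float(val)  # libsvm indices are 1-based
    return label, vec

print(parse_libsvm_line("0 1:0.5 3:2.0", 3))  # (0.0, [0.5, 0.0, 2.0])
```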
Re: bisecting kmeans model tree
Currently we do not expose the APIs to get the Bisecting KMeans tree structure, they are private in the ml.clustering package scope. But I think we should make a plan to expose these APIs like what we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni : > Hi Spark,Mlib experts, > Anyone who can shine light on this? > Thanks > _R > > On Thu, Apr 21, 2016 at 12:46 PM, roni wrote: > >> Hi , >> I want to get the bisecting kmeans tree structure to show a dendogram >> on the heatmap I am generating based on the hierarchical clustering of >> data. >> How do I get that using mlib . >> Thanks >> -Roni >> > >
Re: bisecting kmeans model tree
Hi Spark,Mlib experts, Anyone who can shine light on this? Thanks _R On Thu, Apr 21, 2016 at 12:46 PM, roni wrote: > Hi , > I want to get the bisecting kmeans tree structure to show a dendogram on > the heatmap I am generating based on the hierarchical clustering of data. > How do I get that using mlib . > Thanks > -Roni >
Working of Streaming Kmeans
Hi Biplob, The current Streaming KMeans code only updates the model with data that comes in through training (i.e. trainOn); predictOn does not update the model. Cheers, Holden :) P.S. Traffic on the list might be a bit slower right now because of the Canada Day and 4th of July weekends. On Sunday, July 3, 2016, Biplob Biswas wrote: > Hi, > > Can anyone please explain this? > > Thanks & Regards > Biplob Biswas > > On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas > wrote: > >> Hi, >> >> I wanted to ask a very basic question about the working of Streaming >> Kmeans. >> >> Does the model update only when training (i.e. training dataset is used) >> or >> does it update on the PredictOnValues function as well for the test >> dataset? >> >> Thanks and Regards >> Biplob
Re: Working of Streaming Kmeans
Hi, Can anyone please explain this? Thanks & Regards Biplob Biswas On Sat, Jul 2, 2016 at 4:48 PM, Biplob Biswas wrote: > Hi, > > I wanted to ask a very basic question about the working of Streaming > Kmeans. > > Does the model update only when training (i.e. training dataset is used) or > does it update on the PredictOnValues function as well for the test > dataset? > > Thanks and Regards > Biplob > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Working-of-Streaming-Kmeans-tp27268.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >
Working of Streaming Kmeans
Hi, I wanted to ask a very basic question about the working of Streaming Kmeans. Does the model update only when training (i.e. training dataset is used) or does it update on the PredictOnValues function as well for the test dataset? Thanks and Regards Biplob -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Working-of-Streaming-Kmeans-tp27268.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Kmeans Streaming process flow
Hi, I am in the process of implementing a Spark Streaming application to do clustering of some events. I have a DStream of vectors that I have created from each event, and now I am trying to apply clustering. I referred to the following example in the Spark GitHub repo, which has a train method and a predictOnValues method. I am confused about how to map this example to my use case. In my case, I would be getting the stream of events 24x7, and I am not sure how to split the "all day" data separately for the train and predict methods. Or should this streaming application be run in train mode for a few days and in predict mode later? I am not able to find a suitable example on the web. Please advise. Thanks.

https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala

```
object StreamingKMeansExample {

  def main(args: Array[String]) {
    if (args.length != 5) {
      System.err.println(
        "Usage: StreamingKMeansExample " +
          "<trainingDir> <testDir> <batchDuration> <numClusters> <numDimensions>")
      System.exit(1)
    }

    val conf = new SparkConf().setMaster("local").setAppName("StreamingKMeansExample")
    val ssc = new StreamingContext(conf, Seconds(args(2).toLong))

    val trainingData = ssc.textFileStream(args(0)).map(Vectors.parse)
    val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(args(3).toInt)
      .setDecayFactor(1.0)
      .setRandomCenters(args(4).toInt, 0.0)

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Regards, Anand.C
bisecting kmeans model tree
Hi, I want to get the bisecting k-means tree structure to show a dendrogram on the heatmap I am generating based on the hierarchical clustering of data. How do I get that using MLlib? Thanks, -Roni
bisecting kmeans tree
Hi, I want to get the bisecting k-means tree structure to show on the heatmap I am generating based on the hierarchical clustering of data. How do I get that using MLlib? Thanks, -R
Re: Why KMeans with mllib is so slow ?
Hi Xi Shen, Changing the initialization step from "kmeans||" to "random" decreased the execution time from 2 hrs to 6 mins. However, by default the number of runs is 1; if I try to set the number of runs to 10, then I again see an increase in job execution time. How should I proceed on this? By the way, how is the initialization mode "random" different from "k-means||"? Regards, Padma Ch

On Sun, Mar 13, 2016 at 12:37 PM, Xi Shen wrote:
> Hi Chitturi,
> Please check out https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int).
> I think it is caused by the initialization step: the "kmeans||" method does not initialize the dataset in parallel. If your dataset is large, it takes a long time to initialize. Just change to "random". Hope it helps.
>
> On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma wrote:
>> Hi All, I am facing the same issue. Taking k values from 60 to 120, incrementing by 10 each time (i.e. k takes values 60, 70, 80, ... 120), the algorithm takes around 2.5h on an 800 MB data set with 38 dimensions.
>> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] wrote:
>>> Hi Jao, Sorry to pop up this old thread. I have the same problem as you did. I want to know if you have figured out how to improve k-means on Spark.
>>> I am using Spark 1.2.0. My data set is about 270k vectors, each with about 350 dimensions. If I set k=500, the job takes about 3 hrs on my cluster. The cluster has 7 executors, each with 8 cores...
>>> If I set k=5000, which is the required value for my task, the job goes on forever...
>>> Thanks, David
>
> --
> Regards,
> David
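The trade-off discussed here can be sketched without Spark: "random" init is a single cheap draw, while k-means++ (the sequential ancestor of "k-means||") makes k weighted passes over the data, which costs more up front but usually starts from a better clustering. A toy comparison, with invented helper names:

```python
import random

def sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cost(points, centers):
    # WSSSE-style cost: each point charged its nearest-center squared distance.
    return sum(min(sq(p, c) for c in centers) for p in points)

def random_init(points, k, rng):
    # "random": one cheap draw of k points.
    return rng.sample(points, k)

def kmeanspp_init(points, k, rng):
    # k-means++: k sequential passes; each new center is drawn with
    # probability proportional to squared distance from the chosen centers,
    # so it is costlier up front but spreads the centers out.
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights)[0])
    return centers

rng = random.Random(0)
pts = [(float(i % 20), float(i // 20)) for i in range(400)]  # 20x20 grid
r = cost(pts, random_init(pts, 5, rng))
p = cost(pts, kmeanspp_init(pts, 5, rng))
print(r, p)  # k-means++ usually (not always) starts from a lower cost
```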
Re: Why KMeans with mllib is so slow ?
Hi Chitturi, Please check out https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int). I think it is caused by the initialization step: the "kmeans||" method does not initialize the dataset in parallel. If your dataset is large, it takes a long time to initialize. Just change to "random". Hope it helps.

On Sun, Mar 13, 2016 at 2:58 PM Chitturi Padma wrote:
> Hi All, I am facing the same issue. Taking k values from 60 to 120, incrementing by 10 each time (i.e. k takes values 60, 70, 80, ... 120), the algorithm takes around 2.5h on an 800 MB data set with 38 dimensions.
> On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] wrote:
>> Hi Jao, Sorry to pop up this old thread. I have the same problem as you did. I want to know if you have figured out how to improve k-means on Spark.
>> I am using Spark 1.2.0. My data set is about 270k vectors, each with about 350 dimensions. If I set k=500, the job takes about 3 hrs on my cluster. The cluster has 7 executors, each with 8 cores...
>> If I set k=5000, which is the required value for my task, the job goes on forever...
>> Thanks, David

--
Regards,
David
Re: Why KMeans with mllib is so slow ?
Hi All, I am facing the same issue. Taking k values from 60 to 120, incrementing by 10 each time (i.e. k takes values 60, 70, 80, ... 120), the algorithm takes around 2.5h on an 800 MB data set with 38 dimensions.

On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] wrote:
> Hi Jao, Sorry to pop up this old thread. I have the same problem as you did. I want to know if you have figured out how to improve k-means on Spark.
> I am using Spark 1.2.0. My data set is about 270k vectors, each with about 350 dimensions. If I set k=500, the job takes about 3 hrs on my cluster. The cluster has 7 executors, each with 8 cores...
> If I set k=5000, which is the required value for my task, the job goes on forever...
> Thanks, David

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p26467.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Mllib kmeans execution
It will run distributed On Mar 2, 2016 3:00 PM, "Priya Ch" wrote: > Hi All, > > I am running k-means clustering algorithm. Now, when I am running the > algorithm as - > > val conf = new SparkConf > val sc = new SparkContext(conf) > . > . > val kmeans = new KMeans() > val model = kmeans.run(RDD[Vector]) > . > . > . > The 'kmeans' object gets created on driver. Now does *kmeans.run() *get > executed on each partition of the rdd in distributed fashion or else does > the entire RDD is brought to driver and then gets executed at the driver on > the entire RDD ?? > > Thanks, > Padma Ch > > >
Spark Mllib kmeans execution
Hi All, I am running the k-means clustering algorithm. Now, when I am running the algorithm as:

```
val conf = new SparkConf
val sc = new SparkContext(conf)
...
val kmeans = new KMeans()
val model = kmeans.run(RDD[Vector])
...
```

The 'kmeans' object gets created on the driver. Now, does *kmeans.run()* get executed on each partition of the RDD in a distributed fashion, or is the entire RDD brought to the driver and then executed at the driver on the entire RDD? Thanks, Padma Ch
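For intuition on what runs where: in each Lloyd iteration, the executors assign their partition's points and emit only small per-cluster (sum, count) partials, and the driver merges those partials to recompute the centers; the data itself never moves to the driver. A Spark-free sketch with plain lists standing in for partitions (names are mine):

```python
def sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def partial_stats(partition, centers):
    # Executor-side work: O(partition size), emits only k (sum, count) pairs.
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in partition:
        i = min(range(k), key=lambda j: sq(centers[j], p))
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += p[d]
    return sums, counts

def update_centers(partials, centers):
    # Driver-side work: O(k) memory, merges the per-partition partials.
    k, dim = len(centers), len(centers[0])
    tot_s = [[0.0] * dim for _ in range(k)]
    tot_c = [0] * k
    for sums, counts in partials:
        for i in range(k):
            tot_c[i] += counts[i]
            for d in range(dim):
                tot_s[i][d] += sums[i][d]
    return [tuple(tot_s[i][d] / tot_c[i] for d in range(dim)) if tot_c[i] else centers[i]
            for i in range(k)]

partitions = [[(0.0, 0.0), (1.0, 1.0)], [(9.0, 9.0), (11.0, 11.0)]]
centers = [(0.0, 0.0), (10.0, 10.0)]
partials = [partial_stats(part, centers) for part in partitions]  # the "map" side
print(update_centers(partials, centers))  # [(0.5, 0.5), (10.0, 10.0)]
```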
Re: Slowness in Kmeans calculating fastSquaredDistance
Hi, It looks like the KMeans++ initialization is slow (SPARK-3424 <https://issues.apache.org/jira/browse/SPARK-3424>) and is local to the driver, using 1 core only. If I use random initialization, the job completes in 1.5 mins compared to 1 hr+. Should I move this to the dev list? Regards, Liming

From: Li Ming Tsai
Sent: Sunday, February 7, 2016 10:03 AM
To: user@spark.apache.org
Subject: Re: Slowness in Kmeans calculating fastSquaredDistance

Hi, I did more investigation and found that BLAS.scala calls the reference implementation (f2jBLAS) for level 1 routines. I even patched it to use nativeBLAS.ddot, but it had no material impact. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala#L126

```
private def dot(x: DenseVector, y: DenseVector): Double = {
  val n = x.size
  f2jBLAS.ddot(n, x.values, 1, y.values, 1)
}
```

Maybe Xiangrui can comment on this?

From: Li Ming Tsai
Sent: Friday, February 5, 2016 10:56 AM
To: user@spark.apache.org
Subject: Slowness in Kmeans calculating fastSquaredDistance

Hi, I'm using Intel MKL on Spark 1.6.0, which I built myself with the -Pnetlib-lgpl flag. I am using Spark local[4] mode and I run it like this:

```
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell
```

I have also added the following symlinks to /opt/intel/mkl/lib/intel64:

```
lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so.3 -> libmkl_rt.so
```

I believe (???) that I'm using Intel MKL, because the warnings went away:

```
16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
```

After collectAsMap, there is no progress, but I can observe that only 1 CPU is being utilised, with the following stack trace:

```
"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 nid=0xbdc runnable [0x7fbf12205000]
   java.lang.Thread.State: RUNNABLE
     at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
     at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)
     at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)
     at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)
     at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)
     at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)
     at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)
```

These last few steps take more than half of the total time for a 1M x 100 dataset. The code is just:

```
val clusters = KMeans.train(parsedData, 1000, 1)
```

Shouldn't it be utilising all the cores for the dot product? Is this a misconfiguration? Thanks!
Re: Slowness in Kmeans calculating fastSquaredDistance
Hi, I did more investigation and found that BLAS.scala calls the reference implementation (f2jBLAS) for level 1 routines. I even patched it to use nativeBLAS.ddot, but it had no material impact. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala#L126

```
private def dot(x: DenseVector, y: DenseVector): Double = {
  val n = x.size
  f2jBLAS.ddot(n, x.values, 1, y.values, 1)
}
```

Maybe Xiangrui can comment on this?

From: Li Ming Tsai
Sent: Friday, February 5, 2016 10:56 AM
To: user@spark.apache.org
Subject: Slowness in Kmeans calculating fastSquaredDistance

Hi, I'm using Intel MKL on Spark 1.6.0, which I built myself with the -Pnetlib-lgpl flag. I am using Spark local[4] mode and I run it like this:

```
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell
```

I have also added the following symlinks to /opt/intel/mkl/lib/intel64:

```
lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so.3 -> libmkl_rt.so
```

I believe (???) that I'm using Intel MKL, because the warnings went away:

```
16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
```

After collectAsMap, there is no progress, but I can observe that only 1 CPU is being utilised, with the following stack trace:

```
"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 nid=0xbdc runnable [0x7fbf12205000]
   java.lang.Thread.State: RUNNABLE
     at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
     at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)
     at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)
     at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)
     at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)
     at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)
     at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)
```

These last few steps take more than half of the total time for a 1M x 100 dataset. The code is just:

```
val clusters = KMeans.train(parsedData, 1000, 1)
```

Shouldn't it be utilising all the cores for the dot product? Is this a misconfiguration? Thanks!
Slowness in Kmeans calculating fastSquaredDistance
Hi, I'm using Intel MKL on Spark 1.6.0, which I built myself with the -Pnetlib-lgpl flag. I am using Spark local[4] mode and I run it like this:

```
# export LD_LIBRARY_PATH=/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64
# bin/spark-shell
```

I have also added the following symlinks to /opt/intel/mkl/lib/intel64:

```
lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 libblas.so.3 -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so -> libmkl_rt.so
lrwxrwxrwx 1 root root 12 Feb 1 09:18 liblapack.so.3 -> libmkl_rt.so
```

I believe (???) that I'm using Intel MKL, because the warnings went away:

```
16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/02/01 07:49:38 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
```

After collectAsMap, there is no progress, but I can observe that only 1 CPU is being utilised, with the following stack trace:

```
"ForkJoinPool-3-worker-7" #130 daemon prio=5 os_prio=0 tid=0x7fbf30ab6000 nid=0xbdc runnable [0x7fbf12205000]
   java.lang.Thread.State: RUNNABLE
     at com.github.fommil.netlib.F2jBLAS.ddot(F2jBLAS.java:71)
     at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:128)
     at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:111)
     at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:349)
     at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:587)
     at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:561)
     at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:555)
```

These last few steps take more than half of the total time for a 1M x 100 dataset. The code is just:

```
val clusters = KMeans.train(parsedData, 1000, 1)
```

Shouldn't it be utilising all the cores for the dot product? Is this a misconfiguration? Thanks!
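For context on the hot path in that stack trace: fastSquaredDistance uses precomputed norms so that each distance needs only one dot product, via the identity ||x − y||² = ||x||² + ||y||² − 2·x·y, which is exactly the ddot call at the top of the trace. A small sketch of that identity (helper names are mine):

```python
import math

def dot(x, y):
    # The level-1 BLAS ddot kernel from the stack trace boils down to this.
    return sum(a * b for a, b in zip(x, y))

def fast_squared_distance(x, norm_x, y, norm_y):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y, using precomputed norms.
    return norm_x ** 2 + norm_y ** 2 - 2.0 * dot(x, y)

x, y = [1.0, 2.0, 2.0], [0.0, 0.0, 2.0]
nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
print(fast_squared_distance(x, nx, y, ny))  # 5.0 == (1-0)^2 + (2-0)^2 + (2-2)^2
```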
Visualization of KMeans cluster in Spark
Hi, Is there any way to visualize the KMeans clusters in Spark? Can we connect Plotly with Apache Spark in Java? Thanks, Yogesh
Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge
Hi Jia, I think the examples you provided is not very suitable to illustrate what driver and executors do, because it's not show the internal implementation of the KMeans algorithm. You can refer the source code of MLlib Kmeans ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L227 ). In short words, the driver need the memory of O(centers size) but each executors need the memory of O(partition size). Usually we have large dataset and distributed the whole dataset at many executors, but the centers is not very big even compared with the dataset at one executor. Cheers Yanbo 2015-12-31 22:31 GMT+08:00 Jia Zou : > Thanks, Yanbo. > The results become much more reasonable, after I set driver memory to 5GB > and increase worker memory to 25GB. > > So, my question is for following code snippet extracted from main method > in JavaKMeans.java in examples, what will the driver do? and what will the > worker do? > > I didn't understand this problem well by reading > https://spark.apache.org/docs/1.1.0/cluster-overview.htmland > http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark > > SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans"); > > JavaSparkContext sc = new JavaSparkContext(sparkConf); > > JavaRDD lines = sc.textFile(inputFile); > > JavaRDD points = lines.map(new ParsePoint()); > > points.persist(StorageLevel.MEMORY_AND_DISK()); > > KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, > KMeans.K_MEANS_PARALLEL()); > > > Thank you very much! > > Best Regards, > Jia > > On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang wrote: > >> Hi Jia, >> >> You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether >> it can produce stable performance. The storage level of MEMORY_AND_DISK >> will store the partitions that don't fit on disk and read them from there >> when they are needed. 
>> Actually, it's not necessary to set so large driver memory in your case, >> because KMeans use low memory for driver if your k is not very large. >> >> Cheers >> Yanbo >> >> 2015-12-30 22:20 GMT+08:00 Jia Zou : >> >>> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 >>> CPU cores and 30GB memory. Executor memory is set to 15GB, and driver >>> memory is set to 15GB. >>> >>> The observation is that, when input data size is smaller than 15GB, the >>> performance is quite stable. However, when input data becomes larger than >>> that, the performance will be extremely unpredictable. For example, for >>> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three >>> dramatically different testing results: 27mins, 61mins and 114 mins. (All >>> settings are the same for the 3 tests, and I will create input data >>> immediately before running each of the tests to keep OS buffer cache hot.) >>> >>> Anyone can help to explain this? Thanks very much! >>> >>> >> >
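Yanbo's point above (driver memory scales with the centers, executor memory with the data partitions) can be sanity-checked with a back-of-envelope sketch. This is illustrative only — the function names and the assumption of dense float64 vectors are mine, not part of any Spark API:

```python
# Rough memory estimates for k-means: the driver only holds the k cluster
# centers, while each executor holds a full data partition.
# Assumes dense vectors of float64 values (8 bytes each).

def centers_memory_bytes(k, dims):
    """Memory for k dense centers of `dims` float64 features."""
    return k * dims * 8

def partition_memory_bytes(rows_per_partition, dims):
    """Memory for one executor's partition of dense float64 vectors."""
    return rows_per_partition * dims * 8

# Example: k=100 centers over 1000 features is under 1 MB on the driver...
driver_mb = centers_memory_bytes(100, 1000) / 1e6
# ...while a 1M-row partition of the same width needs ~8 GB on an executor.
executor_gb = partition_memory_bytes(1_000_000, 1000) / 1e9
print(driver_mb, executor_gb)
```

This is why increasing worker memory (rather than driver memory) stabilized Jia's runs: unless k is huge, the centers are tiny compared with any one partition.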
Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge
Thanks, Yanbo. The results become much more reasonable after I set driver memory to 5GB and increase worker memory to 25GB. So, my question is: for the following code snippet, extracted from the main method of JavaKMeans.java in the examples, what will the driver do, and what will the workers do? I didn't understand this well from reading https://spark.apache.org/docs/1.1.0/cluster-overview.html and http://stackoverflow.com/questions/27181737/how-to-deal-with-executor-memory-and-driver-memory-in-spark

SparkConf sparkConf = new SparkConf().setAppName("JavaKMeans");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = sc.textFile(inputFile);
JavaRDD<Vector> points = lines.map(new ParsePoint());
points.persist(StorageLevel.MEMORY_AND_DISK());
KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, KMeans.K_MEANS_PARALLEL());

Thank you very much! Best Regards, Jia On Wed, Dec 30, 2015 at 9:00 PM, Yanbo Liang wrote: > Hi Jia, > > You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it > can produce stable performance. The storage level of MEMORY_AND_DISK will > store the partitions that don't fit on disk and read them from there when > they are needed. > Actually, it's not necessary to set so large driver memory in your case, > because KMeans use low memory for driver if your k is not very large. > > Cheers > Yanbo > > 2015-12-30 22:20 GMT+08:00 Jia Zou : > >> I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU >> cores and 30GB memory. Executor memory is set to 15GB, and driver memory is >> set to 15GB. >> >> The observation is that, when input data size is smaller than 15GB, the >> performance is quite stable. However, when input data becomes larger than >> that, the performance will be extremely unpredictable. For example, for >> 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three >> dramatically different testing results: 27mins, 61mins and 114 mins. 
(All >> settings are the same for the 3 tests, and I will create input data >> immediately before running each of the tests to keep OS buffer cache hot.) >> >> Anyone can help to explain this? Thanks very much! >> >> >
Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge
Hi Jia, You can try to use inputRDD.persist(MEMORY_AND_DISK) and verify whether it produces stable performance. The MEMORY_AND_DISK storage level stores on disk the partitions that don't fit in memory, and reads them back from there when they are needed. Actually, it's not necessary to set such a large driver memory in your case, because KMeans uses little driver memory if your k is not very large. Cheers Yanbo 2015-12-30 22:20 GMT+08:00 Jia Zou : > I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU > cores and 30GB memory. Executor memory is set to 15GB, and driver memory is > set to 15GB. > > The observation is that, when input data size is smaller than 15GB, the > performance is quite stable. However, when input data becomes larger than > that, the performance will be extremely unpredictable. For example, for > 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three > dramatically different testing results: 27mins, 61mins and 114 mins. (All > settings are the same for the 3 tests, and I will create input data > immediately before running each of the tests to keep OS buffer cache hot.) > > Anyone can help to explain this? Thanks very much! > >
Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU cores and 30GB memory. Executor memory is set to 15GB, and driver memory is set to 15GB. The observation is that, when input data size is smaller than 15GB, the performance is quite stable. However, when input data becomes larger than that, the performance will be extremely unpredictable. For example, for 15GB input, with inputRDD.persist(MEMORY_ONLY) , I've got three dramatically different testing results: 27mins, 61mins and 114 mins. (All settings are the same for the 3 tests, and I will create input data immediately before running each of the tests to keep OS buffer cache hot.) Anyone can help to explain this? Thanks very much!
Clustering KMeans error in 1.5.1
We upgraded from 1.4.0 to 1.5.1 (skipped 1.5.0) and one of our clustering jobs hit the error below. Does anyone know what this is about, or whether it is a bug?

Traceback (most recent call last):
  File "user_clustering.py", line 137, in <module>
    uig_model = KMeans.train(uigs,i,nIter, runs = nRuns)
  File "/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/mllib/clustering.py", line 150, in train
  File "/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
  File "/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
  File "/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
  File "/mnt/yarn/nm/usercache/ds/appcache/application_1444086959272_0238/container_1444086959272_0238_01_01/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o220.trainKMeansModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 73 in stage 14.0 failed 4 times, most recent failure: Lost task 73.3 in stage 14.0 (TID 1357, hadoop-sandbox-dn07): ExecutorLostFailure (executor 13 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at org.apache.spark.rdd.RDD.takeSample(RDD.scala:485)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:376)
at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:249)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:213)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainKMeansModel(PythonMLLibAPI.scala:341)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

--
Robin Li

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Clustering-KMeans-error-in-1-5-1-tp25101.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Distance metrics in KMeans
There is a Spark Package that gives some alternative distance metrics, http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Not used it myself. - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distance-metrics-in-KMeans-tp24823p24829.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Distance metrics in KMeans
It looks like the distance metric is hardcoded to the L2 norm (Euclidean distance) in MLlib. As you may expect, you are not the first person to want other metrics, and there has been some prior effort. Please see this PR: https://github.com/apache/spark/pull/2634 and the corresponding JIRA: https://issues.apache.org/jira/browse/SPARK-3219 It seems the addition of arbitrary distance metrics is non-trivial given the current implementation in MLlib. Not sure of any current work towards this issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distance-metrics-in-KMeans-tp24823p24826.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
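To make concrete what "pluggable metrics" would mean for k-means, here is a small stand-alone sketch of the assignment step with swappable distance functions (Euclidean, Manhattan, Tanimoto). These names and this code are illustrative only — they are not part of MLlib's API, which, as noted above, hardcodes squared Euclidean distance:

```python
import math

def euclidean(x, y):
    # L2 distance: what MLlib's KMeans effectively uses.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 distance.
    return sum(abs(a - b) for a, b in zip(x, y))

def tanimoto(x, y):
    # Tanimoto (Jaccard-like) distance for real-valued vectors.
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return 1.0 - dot / denom if denom else 0.0

def assign(point, centers, dist):
    """Index of the nearest center under the given distance function."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

centers = [(0.0, 0.0), (10.0, 10.0)]
print(assign((1.0, 2.0), centers, euclidean))  # nearest under L2
print(assign((1.0, 2.0), centers, manhattan))  # nearest under L1
```

The catch, and likely why the PR above stalled: k-means' update step (the mean of each cluster) only provably minimizes the within-cluster cost for squared Euclidean distance, so swapping the metric also changes what the "center" should be.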
Distance metrics in KMeans
Is it possible to use other distance metrics than Euclidean (e.g. Tanimoto, Manhattan) with MLlib KMeans? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Distance-metrics-in-KMeans-tp24823.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
KMeans Model fails to run
Hi, Why am I getting this error, which prevents my KMeans clustering algorithm from working inside Spark? I'm trying to run a sample Scala model found on the Databricks website on my Cloudera-Spark 1-node local VM. For completeness, the Scala program is as follows: Thx

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("/path/to/file")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

// Cluster the data into three classes using KMeans
val numIterations = 20
val numClusters = 3
val kmeansModel = KMeans.train(data, numClusters, numIterations)

5/09/23 19:38:11 WARN clustering.KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
java.io.IOException: No FileSystem for scheme: c
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:55)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1517)
at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
at org.apache.spark.rdd.RDD.takeSample(RDD.scala:428)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:288)
at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:162)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:139)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:420)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:430)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:34)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:36)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:38)
at $iwC$$iwC$$iwC$$iwC.(:40)
at $iwC$$iwC$$iwC.(:42)
at $iwC$$iwC.(:44)
at $iwC.(:46)
at (:48)
at .(:52)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Nati
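A hedged note on the "No FileSystem for scheme: c" error above: Hadoop's Path treats everything before the first ":" in the input string as a URI scheme, so a Windows-style path such as "C:/data/kmeans.txt" is parsed as scheme "c" — which would produce exactly this exception. The path in the thread is elided, so this is an inference, not a confirmed diagnosis; the parsing behavior itself is easy to demonstrate with Python's URI parser:

```python
from urllib.parse import urlparse

# A Windows-style path parses as URI scheme "c" -- the likely source of
# "No FileSystem for scheme: c" when such a path is passed to sc.textFile.
print(urlparse("C:/data/kmeans.txt").scheme)

# Spelling the scheme out removes the ambiguity:
print(urlparse("file:///data/kmeans.txt").scheme)
print(urlparse("hdfs://namenode/data/kmeans.txt").scheme)
```

The same logic suggests the fix: pass an explicit `file:///...` (local) or `hdfs://...` URI to `sc.textFile` rather than a bare drive-letter path.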
Using ML KMeans without hardcoded feature vector creation
Hi, I'm wondering if there is a concise way to run ML KMeans on a DataFrame if I have the features in multiple numeric columns. I.e. as in the Iris dataset: (a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1) I'd like to use KMeans without recreating the DataFrame with the feature vector added manually as a new column and the original columns hardcoded repeatedly in the code. The solution I'd like to improve:

from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

df = iris.map(lambda r: Row(
    id = r.id,
    a1 = r.a1,
    a2 = r.a2,
    a3 = r.a3,
    a4 = r.a4,
    label = r.label,
    binomial_label = r.binomial_label,
    features = Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()

kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")
kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)

I'm looking for a solution, which is something like:

feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"

Thanks, Zoltan
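Spark ML's `VectorAssembler` (in `pyspark.ml.feature`) is designed for exactly this: it combines a list of numeric columns into a single vector column, so the feature names live in one list instead of being repeated by hand. A PySpark sketch (untested here, since it needs a running Spark session) would be `VectorAssembler(inputCols=feature_cols, outputCol="features")` followed by `assembler.transform(iris)`. The pure-Python snippet below, runnable without Spark, shows the equivalent transformation on a plain row dict; `assemble` is an illustrative name, not a Spark API:

```python
def assemble(row, feature_cols, output_col="features"):
    """Copy the row and add a dense feature tuple built from feature_cols,
    mirroring what VectorAssembler.transform does to a DataFrame row."""
    out = dict(row)
    out[output_col] = tuple(row[c] for c in feature_cols)
    return out

row = {"a1": 5.1, "a2": 3.5, "a3": 1.4, "a4": 0.2, "id": "id_1"}
feature_cols = ["a1", "a2", "a3", "a4"]
print(assemble(row, feature_cols)["features"])  # (5.1, 3.5, 1.4, 0.2)
```

The key design point is the same in both: the column list is data, not code, so changing the feature set means editing one list.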
Kmeans issues and hierarchical clustering
Dear All, I am trying to cluster 350k English text phrases (each with 4-20 words) into 50k clusters with KMeans on a standalone system (8 cores, 16 GB). I am using the Kryo serializer with MEMORY_AND_DISK_SER set. Although I get clustering results with a lower number of features in HashingTF, the clustering quality is poor. When I increase the number of features, I hit "GC overhead limit exceeded". How can I run the KMeans clustering with the maximum number of features without crashing the app? I don't mind if it takes hours to get the results. Also, is there an agglomerative clustering algorithm (like hierarchical) in Spark that can run on standalone systems? Here is my code for reference -

object phrase_app {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // -- read phrases from text file ---
    val phrases = sc.textFile("phrases.txt", 10).persist(MEMORY_AND_DISK_SER)

    // featurize phrases
    val no_features = 500
    val tf = new HashingTF(no_features)
    def featurize(s: String): Vector = {
      tf.transform(s.sliding(1).toSeq)
    }
    val featureVectors = phrases.map(featurize).persist(MEMORY_AND_DISK_SER)

    // -- train Kmeans and get cluster phrases
    //val model = KMeans.train(featureVectors, 5, 10, 1, "random")
    val model = KMeans.train(featureVectors, 5, 10)
    val clusters = model.predict(featureVectors).collect()

    // Print phrases and clusters to file
    import java.io._
    val pw = new PrintWriter(new File("cluster_dump.txt"))
    val phrases_array = phrases.collect()
    for (i <- 0 until phrases_array.length) {
      pw.write(phrases_array(i) + ";" + clusters(i) + "\n")
    }
    pw.close
  }
}

Thank you for your support. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-issues-and-hierarchical-clustering-tp24494.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
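Two hedged observations on the code above. First, on a Scala String, `s.sliding(1).toSeq` yields single-character substrings, so the HashingTF features here are character counts rather than word counts — which may partly explain the poor cluster quality; `s.split(" ")` would tokenize by word. Second, the feature-count tradeoff is the hashing trick: fewer buckets mean more hash collisions between distinct tokens. The stand-alone sketch below illustrates that mechanism; it uses a stdlib hash, so bucket indices will not match MLlib's HashingTF:

```python
import zlib

def hashing_tf(tokens, num_features):
    """Hash each token into one of num_features buckets and count.
    Same idea as Spark's HashingTF, with an illustrative stdlib hash."""
    vec = [0] * num_features
    for tok in tokens:
        vec[zlib.crc32(tok.encode()) % num_features] += 1
    return vec

tokens = "machine learning on spark".split()
small = hashing_tf(tokens, 4)        # 4 buckets: distinct words collide often
large = hashing_tf(tokens, 2 ** 20)  # ~1M buckets: collisions are rare
print(sum(small), sum(large))        # total token count is preserved either way
```

The memory cost of more buckets comes from the vectors being denser relative to the data, which is consistent with the GC pressure reported above when `no_features` grows.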
Re: mllib kmeans produce 1 large and many extremely small clusters
Hi, The issue is very likely to be in the data or the transformations you apply, rather than anything to do with the Spark Kmeans API as such. I'd start debugging by doing a bit of exploratory analysis of the TFIDF vectors. That is, for instance, plot the distribution (histogram) of the TFIDF values for each word in the vectors. It's quite possible that the TFIDF values for most words for most documents are the same in your case, causing all your 5000 points to crowd around the same region in the n-dimensional space that they live in. On 10 August 2015 at 10:28, farhan wrote: > I tried running mllib k-means with 20newsgroups data set from sklearn. On a > 5000 document data set I get one cluster with most of the documents and > other clusters just have handful of documents. > > #code > newsgroups_train = > fetch_20newsgroups(subset='train',random_state=1,remove=('headers', > 'footers', 'quotes')) > small_list = random.sample(newsgroups_train.data,5000) > > def get_word_vec(text,vocabulary): > word_lst = tokenize_line(text) > word_counter = Counter(word_lst) > lst = [] > for v in vocabulary: > if v in word_counter: > lst.append(word_counter[v]) > else: > lst.append(0) > return lst > > docsrdd = sc.parallelize(small_list) > tf = docsrdd.map(lambda x : get_word_vec(x,vocabulary)) > idf = IDF().fit(tf) > tfidf = idf.transform(tf) > clusters = KMeans.train(tfidf, 20) > > #documents in each cluster, using clusters.predict(x) > Counter({0: 4978, 11: 3, 9: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: > 1, 10: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1}) > > > Please Help ! > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/mllib-kmeans-produce-1-large-and-many-extremely-small-clusters-tp24189.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > >
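The "crowding" hypothesis above can be checked cheaply before touching Spark: if most TF-IDF dimensions have near-zero variance across documents, most points sit in the same region and k-means will dump them into one cluster. A toy stdlib-only sketch (the `tfidf` and `variance` helpers are illustrative, not the MLlib implementations):

```python
import math

def tfidf(docs):
    """Raw-count TF times log(N/df) IDF over a whitespace-tokenized corpus."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    vecs = [[d.split().count(w) * math.log(n / df[w]) for w in vocab] for d in docs]
    return vocab, vecs

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

docs = ["spark kmeans spark", "spark kmeans", "spark tuning"]
vocab, vecs = tfidf(docs)
# Dimensions for words present in every document ("spark": idf = log(3/3) = 0)
# carry no signal at all; low-variance dimensions carry very little.
for w, col in zip(vocab, zip(*vecs)):
    print(w, round(variance(list(col)), 3))
```

Applied to the 20newsgroups vectors, sorting dimensions by variance like this would show directly whether the 5000 points are differentiated on only a handful of words.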
mllib kmeans produce 1 large and many extremely small clusters
I tried running mllib k-means with 20newsgroups data set from sklearn. On a 5000 document data set I get one cluster with most of the documents and other clusters just have handful of documents.

#code
newsgroups_train = fetch_20newsgroups(subset='train', random_state=1, remove=('headers', 'footers', 'quotes'))
small_list = random.sample(newsgroups_train.data, 5000)

def get_word_vec(text, vocabulary):
    word_lst = tokenize_line(text)
    word_counter = Counter(word_lst)
    lst = []
    for v in vocabulary:
        if v in word_counter:
            lst.append(word_counter[v])
        else:
            lst.append(0)
    return lst

docsrdd = sc.parallelize(small_list)
tf = docsrdd.map(lambda x: get_word_vec(x, vocabulary))
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
clusters = KMeans.train(tfidf, 20)

#documents in each cluster, using clusters.predict(x)
Counter({0: 4978, 11: 3, 9: 2, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 10: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 1})

Please Help ! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mllib-kmeans-produce-1-large-and-many-extremely-small-clusters-tp24189.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: Kmeans Labeled Point RDD
I responded to your question on SO. Let me know if this is what you wanted. http://stackoverflow.com/a/31528274/2336943 Mohammed -Original Message- From: plazaster [mailto:michaelplaz...@gmail.com] Sent: Sunday, July 19, 2015 11:38 PM To: user@spark.apache.org Subject: Re: Kmeans Labeled Point RDD Has there been any progress on this? I am in the same boat. I posted a similar question to Stack Exchange. http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989p23907.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Kmeans Labeled Point RDD
Has there been any progress on this? I am in the same boat. I posted a similar question to Stack Exchange. http://stackoverflow.com/questions/31447141/spark-mllib-kmeans-from-dataframe-and-back-again -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989p23907.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Can it be the limited memory causing this slowness? On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando wrote: > Thanks Burak. > > Now it takes minutes to repartition; > > Active Stages (1) Stage IdDescriptionSubmittedDurationTasks: > Succeeded/TotalInputOutputShuffle Read Shuffle Write 42 (kill) > <http://localhost:4040/stages/stage/kill/?id=42&terminate=true> repartition > at UnsupervisedSparkModelBuilder.java:120 > <http://localhost:4040/stages/stage?id=42&attempt=0> +details > > org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100) > org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120) > org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84) > org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > > 2015/07/14 08:59:30 3.6 min > 0/3 > 14.6 MB Pending Stages (1) Stage IdDescriptionSubmittedDurationTasks: > Succeeded/TotalInputOutputShuffle Read Shuffle Write 43 sum at > KMeansModel.scala:70 <http://localhost:4040/stages/stage?id=43&attempt=0> > +details > > > org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33) > org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70) > org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121) > org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84) > org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > > Unknown Unknown > 0/8 > > On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz wrote: > >> Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also, >> .cache()? >> >> something like, (I'm assuming you are using Java): >> ``` >> JavaRDD input = data.repartition(8).cache(); >> org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20); >> ``` >> >> On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando >> wrote: >> >>> I'm using; >>> >>> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); >>> >>> Cpu cores: 8 (using default Spark conf thought) >>> >>> On partitions, I'm not sure how to find that. >>> >>> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz wrote: >>> >>>> What are the other parameters? Are you just setting k=3? What about # >>>> of runs? How many partitions do you have? How many cores does your machine >>>> have? >>>> >>>> Thanks, >>>> Burak >>>> >>>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando >>>> wrote: >>>> >>>>> Hi Burak, >>>>> >>>>> k = 3 >>>>> dimension = 785 features >>>>> Spark 1.4 >>>>> >>>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> How are you running K-Means? What is your k? What is the dimension of >>>>>> your dataset (columns)? Which Spark version are you using? >>>>>> >>>>>> Thanks, >>>>>> Burak >>>>>> >>>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot >>>>>>> of time (16+ mints). >>>>>>> >>>>>>> It takes lot of time at this task; >>>>>>> >>>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33) >>>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70) >>>>>>> >>>>>>> Can this be improved? 
>>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Thanks & regards, >>>>>>> Nirmal >>>>>>> >>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >>>>>>> Mobile: +94715779733 >>>>>>> Blog: http://nirmalfdo.blogspot.com/ >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> Thanks & regards, >>>>> Nirmal >>>>> >>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >>>>> Mobile: +94715779733 >>>>> Blog: http://nirmalfdo.blogspot.com/ >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> >>> Thanks & regards, >>> Nirmal >>> >>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >>> Mobile: +94715779733 >>> Blog: http://nirmalfdo.blogspot.com/ >>> >>> >>> >> > > > -- > > Thanks & regards, > Nirmal > > Associate Technical Lead - Data Technologies Team, WSO2 Inc. > Mobile: +94715779733 > Blog: http://nirmalfdo.blogspot.com/ > > > -- Thanks & regards, Nirmal Associate Technical Lead - Data Technologies Team, WSO2 Inc. Mobile: +94715779733 Blog: http://nirmalfdo.blogspot.com/
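For context on why `computeCost` is sensitive to caching and partitioning: it is one full pass over the data, summing the squared Euclidean distance from each point to its nearest center (the WSSSE), so an uncached or under-partitioned input RDD forces the whole dataset to be re-read and re-scanned. A miniature stand-alone version of that computation (illustrative only, not MLlib's code):

```python
def squared_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def compute_cost(points, centers):
    """Within-set sum of squared errors: for each point, the squared
    distance to its nearest center, summed over all points."""
    return sum(min(squared_dist(p, c) for c in centers) for p in points)

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0)]
centers = [(0.5, 0.0), (9.0, 9.0)]
print(compute_cost(points, centers))  # 0.25 + 0.25 + 0.0 = 0.5
```

Since the cost is a sum over every point, caching the input (as Burak suggests below) pays off whenever the RDD is traversed more than once — once per k-means iteration and once more for the cost.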
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Thanks Burak. Now it takes minutes to repartition;

Active Stages (1):
Stage 42 — repartition at UnsupervisedSparkModelBuilder.java:120 (submitted 2015/07/14 08:59:30, duration 3.6 min, tasks 0/3, 14.6 MB)
org.apache.spark.api.java.JavaRDD.repartition(JavaRDD.scala:100)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:120)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Pending Stages (1):
Stage 43 — sum at KMeansModel.scala:70 (tasks 0/8)
org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.buildKMeansModel(UnsupervisedSparkModelBuilder.java:121)
org.wso2.carbon.ml.core.spark.algorithms.UnsupervisedSparkModelBuilder.build(UnsupervisedSparkModelBuilder.java:84)
org.wso2.carbon.ml.core.impl.MLModelHandler$ModelBuilder.run(MLModelHandler.java:576)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

On Mon, Jul 13, 2015 at 11:44 PM, Burak Yavuz wrote: > Can you call repartition(8) or 16 on
data.rdd(), before KMeans, and also, > .cache()? > > something like, (I'm assuming you are using Java): > ``` > JavaRDD input = data.repartition(8).cache(); > org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20); > ``` > > On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando wrote: > >> I'm using; >> >> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); >> >> Cpu cores: 8 (using default Spark conf thought) >> >> On partitions, I'm not sure how to find that. >> >> On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz wrote: >> >>> What are the other parameters? Are you just setting k=3? What about # of >>> runs? How many partitions do you have? How many cores does your machine >>> have? >>> >>> Thanks, >>> Burak >>> >>> On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando >>> wrote: >>> >>>> Hi Burak, >>>> >>>> k = 3 >>>> dimension = 785 features >>>> Spark 1.4 >>>> >>>> On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz wrote: >>>> >>>>> Hi, >>>>> >>>>> How are you running K-Means? What is your k? What is the dimension of >>>>> your dataset (columns)? Which Spark version are you using? >>>>> >>>>> Thanks, >>>>> Burak >>>>> >>>>> On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot >>>>>> of time (16+ mints). >>>>>> >>>>>> It takes lot of time at this task; >>>>>> >>>>>> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33) >>>>>> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70) >>>>>> >>>>>> Can this be improved? >>>>>> >>>>>> -- >>>>>> >>>>>> Thanks & regards, >>>>>> Nirmal >>>>>> >>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >>>>>> Mobile: +94715779733 >>>>>> Blog: http://nirmalfdo.blogspot.com/ >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> >>>> Thanks & regards, >>>> Nirmal >>>> >>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. 
>>>> Mobile: +94715779733 >>>> Blog: http://nirmalfdo.blogspot.com/ >>>> >>>> >>>> >>> >> >> >> -- >> >> Thanks & regards, >> Nirmal >> >> Associate Technical Lead - Data Technologies Team, WSO2 Inc. >> Mobile: +94715779733 >> Blog: http://nirmalfdo.blogspot.com/ >> >> >> > -- Thanks & regards, Nirmal Associate Technical Lead - Data Technologies Team, WSO2 Inc. Mobile: +94715779733 Blog: http://nirmalfdo.blogspot.com/
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also, .cache()?

something like, (I'm assuming you are using Java):
```
JavaRDD input = data.repartition(8).cache();
org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20);
```

On Mon, Jul 13, 2015 at 11:10 AM, Nirmal Fernando wrote:
> I'm using;
>
> org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);
>
> Cpu cores: 8 (using default Spark conf thought)
>
> On partitions, I'm not sure how to find that.
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
I'm using;

org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20);

CPU cores: 8 (using the default Spark conf, though)

On partitions, I'm not sure how to find that.

On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz wrote:
> What are the other parameters? Are you just setting k=3? What about # of
> runs? How many partitions do you have? How many cores does your machine
> have?
>
> Thanks,
> Burak

--

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
What are the other parameters? Are you just setting k=3? What about # of runs? How many partitions do you have? How many cores does your machine have?

Thanks,
Burak

On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando wrote:
> Hi Burak,
>
> k = 3
> dimension = 785 features
> Spark 1.4
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Hi Burak,

k = 3
dimension = 785 features
Spark 1.4

On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz wrote:
> Hi,
>
> How are you running K-Means? What is your k? What is the dimension of your
> dataset (columns)? Which Spark version are you using?
>
> Thanks,
> Burak

--

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Hi,

How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using?

Thanks,
Burak

On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando wrote:
> Hi,
>
> For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of
> time (16+ minutes).
>
> Most of the time is spent in this task:
>
> org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
> org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)
>
> Can this be improved?
>
> --
>
> Thanks & regards,
> Nirmal
>
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
[MLLib][Kmeans] KMeansModel.computeCost takes lot of time
Hi,

For a fairly large dataset, 30MB, KMeansModel.computeCost takes a lot of time (16+ minutes).

Most of the time is spent in this task:

org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)
org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70)

Can this be improved?

--

Thanks & regards,
Nirmal

Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
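For readers puzzling over where the time goes: computeCost sums the squared Euclidean distance from every point to its nearest cluster center, which is a full pass over the dataset. A minimal plain-Python sketch of that quantity (an illustration of the logic, not the Spark API):

```python
def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its
    nearest center -- the quantity KMeansModel.computeCost returns."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total

points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
print(compute_cost(points, centers))  # 0 + 2 + 2 = 4.0
```

Since Spark distributes this pass, slowness at this stage usually points at partitioning or caching (as suggested in the replies above) rather than the arithmetic itself.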
Re: KMeans questions
SPARK-7879 <https://issues.apache.org/jira/browse/SPARK-7879> seems to address your use case (running KMeans on a dataframe and having the results added as an additional column) On Wed, Jul 1, 2015 at 5:53 PM, Eric Friedman wrote: > In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train > method, is there a cleaner way to create the Vectors than this? > > data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), > r.getDouble(4), r.getDouble(5), r.getDouble(6))} > > > Second, once I train the model and call predict on my vectorized dataset, > what's the best way to relate the cluster assignments back to the original > data frame? > > > That is, I started with df1, which has a bunch of domain information in > each row and also the doubles I use to cluster. I vectorize the doubles > and then train on them. I use the resulting model to predict clusters for > the vectors. I'd like to look at the original domain information in light > of the clusters to which they are now assigned. > > >
KMeans questions
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train method, is there a cleaner way to create the Vectors than this? data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6))} Second, once I train the model and call predict on my vectorized dataset, what's the best way to relate the cluster assignments back to the original data frame? That is, I started with df1, which has a bunch of domain information in each row and also the doubles I use to cluster. I vectorize the doubles and then train on them. I use the resulting model to predict clusters for the vectors. I'd like to look at the original domain information in light of the clusters to which they are now assigned.
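The zip idea from the reply can be sketched in plain Python (all names here are hypothetical; this only illustrates pairing predictions back with the original rows, not the Spark API):

```python
def assign_cluster(vec, centers):
    # index of the nearest center (what model.predict does per point)
    return min(range(len(centers)),
               key=lambda j: sum((v - c) ** 2
                                 for v, c in zip(vec, centers[j])))

rows = [("alice", (0.1, 0.2)), ("bob", (9.0, 9.5)), ("carol", (0.0, 0.4))]
centers = [(0.0, 0.3), (9.0, 9.0)]

# keep the domain columns alongside the feature vector, predict on the
# vector only, and pair the assignment back with the original row --
# the same idea as zipping model predictions with the source RDD
labeled = [(name, vec, assign_cluster(vec, centers)) for name, vec in rows]
print(labeled)
```

The key point is that the feature vectors and the predictions stay in the same order as the source rows, so a positional zip recovers the domain information for each cluster assignment.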
Re: kmeans broadcast
Hi Haviv, have you tried sc.broadcast(model)? The broadcast method is a member of the SparkContext class. Thanks Himanshu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/kmeans-broadcast-tp23511p23526.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
MLIB-KMEANS: Py4JNetworkError: An error occurred while trying to connect to the Java server , on a huge data set
Hi All,

I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a Spark cluster in EC2 stand-alone mode with 8 workers running on 2 slaves with 160 GB RAM and 40 VCPUs.

*My code is as follows:*

def convert_into_sparse_vector(A):
    non_nan_indices = np.nonzero(~np.isnan(A))
    non_nan_values = A[non_nan_indices]
    dictionary = dict(zip(non_nan_indices[0], non_nan_values))
    return Vectors.sparse(len(A), dictionary)

X = [convert_into_sparse_vector(A) for A in complete_dataframe.values]
sc = SparkContext(appName="parallel_kmeans")
data = sc.parallelize(X, 10)
model = KMeans.train(data, 1000, initializationMode="k-means||")

where complete_dataframe is a pandas data frame that has my data.

I get the error: Py4JNetworkError: An error occurred while trying to connect to the Java server.

The error trace is as follows:

> Exception happened during processing of request from ('127.0.0.1', 41360)
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/SocketServer.py", line 283, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/usr/lib64/python2.6/SocketServer.py", line 309, in process_request
>     self.finish_request(request, client_address)
>   File "/usr/lib64/python2.6/SocketServer.py", line 322, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/usr/lib64/python2.6/SocketServer.py", line 617, in __init__
>     self.handle()
>   File "/root/spark/python/pyspark/accumulators.py", line 235, in handle
>     num_updates = read_int(self.rfile)
>   File "/root/spark/python/pyspark/serializers.py", line 544, in read_int
>     raise EOFError
> EOFError
>
> ---
> Py4JNetworkError            Traceback (most recent call last)
> in ()
> ----> 1 model = KMeans.train(data, 1000, initializationMode="k-means||")
>
> /root/spark/python/pyspark/mllib/clustering.pyc in train(cls, rdd, k, maxIterations, runs, initializationMode, seed, initializationSteps, epsilon)
>     134         """Train a k-means clustering model."""
>     135
> model = callMLlibFunc("trainKMeansModel", rdd.map(_convert_to_vector), k, maxIterations,
> --> 136             runs, initializationMode, seed, initializationSteps, epsilon)
>     137         centers = callJavaFunc(rdd.context, model.clusterCenters)
>     138         return KMeansModel([c.toArray() for c in centers])
>
> /root/spark/python/pyspark/mllib/common.pyc in callMLlibFunc(name, *args)
>     126     sc = SparkContext._active_spark_context
>     127     api = getattr(sc._jvm.PythonMLLibAPI(), name)
> --> 128     return callJavaFunc(sc, api, *args)
>     129
>     130
>
> /root/spark/python/pyspark/mllib/common.pyc in callJavaFunc(sc, func, *args)
>     119     """ Call Java Function """
>     120     args = [_py2java(sc, a) for a in args]
> --> 121     return _java2py(sc, func(*args))
>     122
>     123
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     534         END_COMMAND_PART
>     535
> --> 536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
>     538                                         self.target_id, self.name)
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in send_command(self, command, retry)
>     367         if retry:
>     368             #print_exc()
> --> 369             response = self.send_command(command)
>     370         else:
>     371             response = ERROR
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in send_command(self, command, retry)
>     360         the Py4J protocol.
>     361         """
> --> 362         connection = self._get_connection()
>     363         try:
>     364             response = connection.send_command(command)
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in _get_connection(self)
>     316             connection = self.deque.pop()
>     317         except Exception:
> --> 318             connection = self._create_connection()
>     319         return connection
>     320
>
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in _create_connection(self)
>     323
Re: Restricting the number of iterations in Mllib Kmeans
Hi Suman & Meethu, Apologies---I was wrong about KMeans supporting an initial set of centroids! JIRA created: https://issues.apache.org/jira/browse/SPARK-8018 If you're interested in submitting a PR, please do! Thanks, Joseph On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW wrote: > Hi Joseph, > I was unable to find any function in Kmeans.scala where the initial > centroids could be specified by the user. Kindly help. > > Thanks & Regards, > Meethu M > > > > On Tuesday, 19 May 2015 6:54 AM, Joseph Bradley > wrote: > > > Hi Suman, > > For maxIterations, are you using the DenseKMeans.scala example code? (I'm > guessing yes since you mention the command line.) If so, then you should > be able to specify maxIterations via an extra parameter like > "--numIterations 50" (note the example uses "numIterations" in the current > master instead of "maxIterations," which is sort of a bug in the example). > If that does not cap the max iterations, then please report it as a bug. > > To specify the initial centroids, you will need to modify the DenseKMeans > example code. Please see the KMeans API docs for details. > > Good luck, > Joseph > > On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW > wrote: > > Hi, > I think you cant supply an initial set of centroids to kmeans > > Thanks & Regards, > Meethu M > > > > On Friday, 15 May 2015 12:37 AM, Suman Somasundar < > suman.somasun...@oracle.com> wrote: > > > Hi,, > > I want to run a definite number of iterations in Kmeans. There is a > command line argument to set maxIterations, but even if I set it to a > number, Kmeans runs until the centroids converge. > Is there a specific way to specify it in command line? > > Also, I wanted to know if we can supply the initial set of centroids to > the program instead of it choosing the centroids in random? > > Thanks, > Suman. > > > > > >
Re: Kmeans Labeled Point RDD
You can predict and then zip the result with the points RDD to get approximately the same as a LabeledPoint.

Cheers

On Thu, May 21, 2015 at 6:19 PM, anneywarlord wrote:
> Hello,
>
> New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
> in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I
> would like to be able to identify which observations were grouped with each
> centroid.
>
> Thanks
Kmeans Labeled Point RDD
Hello, New to Spark. I wanted to know if it is possible to use a Labeled Point RDD in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I would like to be able to identify which observations were grouped with each centroid. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kmeans-Labeled-Point-RDD-tp22989.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: spark mllib kmeans
i want to evaluate some different distance measures for time-space clustering, so i need an API to implement my own function in Java.

2015-05-19 22:08 GMT+02:00 Xiangrui Meng :
> Just curious, what distance measure do you need? -Xiangrui
Re: question about customize kmeans distance measure
MLlib only supports Euclidean distance for k-means. You can find Bregman divergence support in Derrick's package: http://spark-packages.org/package/derrickburns/generalized-kmeans-clustering. Which distance measure do you want to use? -Xiangrui On Tue, May 12, 2015 at 7:23 PM, June wrote: > Dear list, > > I am new to spark, and I want to use the kmeans algorithm in mllib package. > I am wondering whether it is possible to customize the distance measure used > by kmeans, and how? > > Many thanks! > > June
Re: spark mllib kmeans
Just curious, what distance measure do you need? -Xiangrui

On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa wrote:
> take a look at this
> https://github.com/derrickburns/generalized-kmeans-clustering
>
> Best,
>
> Jao
Re: Restricting the number of iterations in Mllib Kmeans
Hi Suman, For maxIterations, are you using the DenseKMeans.scala example code? (I'm guessing yes since you mention the command line.) If so, then you should be able to specify maxIterations via an extra parameter like "--numIterations 50" (note the example uses "numIterations" in the current master instead of "maxIterations," which is sort of a bug in the example). If that does not cap the max iterations, then please report it as a bug. To specify the initial centroids, you will need to modify the DenseKMeans example code. Please see the KMeans API docs for details. Good luck, Joseph On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW wrote: > Hi, > I think you cant supply an initial set of centroids to kmeans > > Thanks & Regards, > Meethu M > > > > On Friday, 15 May 2015 12:37 AM, Suman Somasundar < > suman.somasun...@oracle.com> wrote: > > > Hi,, > > I want to run a definite number of iterations in Kmeans. There is a > command line argument to set maxIterations, but even if I set it to a > number, Kmeans runs until the centroids converge. > Is there a specific way to specify it in command line? > > Also, I wanted to know if we can supply the initial set of centroids to > the program instead of it choosing the centroids in random? > > Thanks, > Suman. > > >
Re: Restricting the number of iterations in Mllib Kmeans
Hi, I think you can't supply an initial set of centroids to kmeans.

Thanks & Regards,
Meethu M

On Friday, 15 May 2015 12:37 AM, Suman Somasundar wrote:

Hi,

I want to run a definite number of iterations in Kmeans. There is a command line argument to set maxIterations, but even if I set it to a number, Kmeans runs until the centroids converge. Is there a specific way to specify it on the command line?

Also, I wanted to know if we can supply the initial set of centroids to the program instead of it choosing the centroids at random?

Thanks,
Suman.
Restricting the number of iterations in Mllib Kmeans
Hi,

I want to run a definite number of iterations in Kmeans. There is a command line argument to set maxIterations, but even if I set it to a number, Kmeans runs until the centroids converge. Is there a specific way to specify it on the command line?

Also, I wanted to know if we can supply the initial set of centroids to the program instead of it choosing the centroids at random?

Thanks,
Suman.
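For reference, the two behaviors asked about here can be sketched in plain Python: Lloyd's iterations with a hard cap on the iteration count, starting from caller-supplied centroids. This illustrates the assumed semantics, not MLlib's actual implementation:

```python
def kmeans(points, init_centers, max_iterations, tol=1e-9):
    """Lloyd's algorithm capped at max_iterations, starting from
    caller-supplied centers instead of random initialization."""
    centers = [list(c) for c in init_centers]
    for _ in range(max_iterations):          # hard cap on iterations
        # assignment step: nearest center by squared Euclidean distance
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # update step: move each center to the mean of its cluster
        moved = 0.0
        for j, cl in enumerate(clusters):
            if not cl:
                continue
            new = [sum(col) / len(cl) for col in zip(*cl)]
            moved = max(moved, sum((a - b) ** 2
                                   for a, b in zip(new, centers[j])))
            centers[j] = new
        if moved < tol:                       # early exit on convergence
            break
    return centers

pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(kmeans(pts, [(0.0,), (10.0,)], max_iterations=5))  # [[0.5], [10.5]]
```

Whether MLlib stops early on convergence before reaching maxIterations is exactly the behavior being questioned in this thread; the sketch simply shows both knobs in one place.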
question about customize kmeans distance measure
Dear list, I am new to spark, and I want to use the kmeans algorithm in mllib package. I am wondering whether it is possible to customize the distance measure used by kmeans, and how? Many thanks! June
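A plain-Python sketch of where a custom distance measure slots into the assignment step (illustrative only; as noted elsewhere in this archive, MLlib itself supports only Euclidean distance):

```python
def assign(points, centers, distance):
    """Assignment step of k-means with a pluggable distance measure."""
    return [min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            for p in points]

def manhattan(a, b):
    # example custom measure: L1 / Manhattan distance
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(0, 0), (5, 5), (1, 1)]
centers = [(0, 0), (5, 5)]
print(assign(points, centers, manhattan))  # [0, 1, 0]
```

One caveat worth knowing: with a non-Euclidean measure, the cluster mean used in the update step is no longer guaranteed to minimize within-cluster distance, which is why generalized packages restrict themselves to families like Bregman divergences.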
Re: spark mllib kmeans
take a look at this https://github.com/derrickburns/generalized-kmeans-clustering Best, Jao On Mon, May 11, 2015 at 3:55 PM, Driesprong, Fokko wrote: > Hi Paul, > > I would say that it should be possible, but you'll need a different > distance measure which conforms to your coordinate system. > > 2015-05-11 14:59 GMT+02:00 Pa Rö : > >> hi, >> >> it is possible to use a custom distance measure and a other data typ as >> vector? >> i want cluster temporal geo datas. >> >> best regards >> paul >> > >
Re: spark mllib kmeans
Hi Paul, I would say that it should be possible, but you'll need a different distance measure which conforms to your coordinate system. 2015-05-11 14:59 GMT+02:00 Pa Rö : > hi, > > it is possible to use a custom distance measure and a other data typ as > vector? > i want cluster temporal geo datas. > > best regards > paul >
spark mllib kmeans
hi, is it possible to use a custom distance measure and another data type as the vector? i want to cluster temporal geo data. best regards paul
Re: MLib KMeans on large dataset issues
Guys, great feedback by pointing out my stupidity :D Rows and columns got intermixed, hence the weird results I was seeing. Ignore my previous issues; I will reformat my data first.

On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga wrote:
> I'm mostly using example code, see here:
> http://paste.openstack.org/show/211966/
> The data has 799305 dimensions and is separated by space
>
> Please note the issues I'm seeing is because of the scala implementation
> imo as it happens also when using the Python wrappers.
Re: MLib KMeans on large dataset issues
I'm mostly using example code, see here: http://paste.openstack.org/show/211966/
The data has 799305 dimensions and is separated by space

Please note the issues I'm seeing is because of the scala implementation imo as it happens also when using the Python wrappers.

On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele wrote:
> How you are passing feature vector to K means?
> its in 2-D space of 1-D array?
>
> Did you try using Streaming Kmeans?
>
> will you be able to paste code here?
Re: MLib KMeans on large dataset issues
How you are passing feature vector to K means? its in 2-D space of 1-D array? Did you try using Streaming Kmeans? will you be able to paste code here? On 29 April 2015 at 17:23, Sam Stoelinga wrote: > Hi Sparkers, > > I am trying to run MLib kmeans on a large dataset(50+Gb of data) and a > large K but I've encountered the following issues: > > >- Spark driver gets out of memory and dies because collect gets called >as part of KMeans, which loads all data back to the driver's memory. >- At the end there is a LocalKMeans class which runs KMeansPlusPlus on >the Spark driver. Why isn't this distributed? It's spending a long time on >here and this has the same problem as point 1 requires loading the data to >the driver. >Also when LocakKMeans is running on driver also seeing lots of : >15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus >initialization ran out of distinct points for centers. Using duplicate >point for center k = 222 >- Has the above behaviour been like this in previous releases? I >remember running KMeans before without too much problems. > > Looking forward to hear you point out my stupidity or provide work-arounds > that could make Spark KMeans work well on large datasets. > > Regards, > Sam Stoelinga >
MLib KMeans on large dataset issues
Hi Sparkers,

I am trying to run MLlib kmeans on a large dataset (50+ GB of data) and a large K, but I've encountered the following issues:

- The Spark driver gets out of memory and dies because collect gets called as part of KMeans, which loads all data back into the driver's memory.
- At the end there is a LocalKMeans class which runs KMeansPlusPlus on the Spark driver. Why isn't this distributed? It's spending a long time here, and this has the same problem as point 1: it requires loading the data to the driver. Also when LocalKMeans is running on the driver I'm seeing lots of:
  15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus initialization ran out of distinct points for centers. Using duplicate point for center k = 222
- Has the above behaviour been like this in previous releases? I remember running KMeans before without too many problems.

Looking forward to hearing you point out my stupidity or provide work-arounds that could make Spark KMeans work well on large datasets.

Regards,
Sam Stoelinga
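The resolution later in the thread was that rows and columns had been intermixed. As a sanity check, a minimal sketch of the intended parsing, where each input line is one observation and its tokens are the features (hypothetical sample data, not the original paste):

```python
raw = [
    "0.0 0.1 0.2",
    "1.0 1.1 1.2",
]
# each input line is one observation; its tokens are the features,
# so dimensionality is tokens-per-line, not the number of lines
vectors = [tuple(float(t) for t in line.split()) for line in raw]
print(len(vectors), len(vectors[0]))  # 2 rows, 3 dimensions
```

Reading the data the other way round inflates the apparent dimensionality, which is consistent with the surprising 799305-dimension figure mentioned earlier in the thread.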
Re: KMeans takeSample jobs and RDD cached
Yes, the count() should be the first task, and the sampling + collecting should be the second task. The first one is probably slow because the RDD being sampled is not yet cached/materialized. K-Means creates some RDDs internally while learning, and since they aren't needed after learning, they are unpersisted (uncached) at the end. Joseph On Sat, Apr 25, 2015 at 6:36 AM, podioss wrote: > Hi, > i am running k-means algorithm with initialization mode set to random and > various dataset sizes and values for clusters and i have a question > regarding the takeSample job of the algorithm. > More specific i notice that in every application there are two sampling > jobs. The first one is consuming the most time compared to all others while > the second one is much quicker and that sparked my interest to investigate > what is actually happening. > In order to explain it, i checked the source code of the takeSample > operation and i saw that there is a count action involved and then the > computation of a PartiotionwiseSampledRDD with a PoissonSampler. > So my question is,if that count action corresponds to the first takeSample > job and if the second takeSample job is the one doing the actual sampling. > > I also have a question for the RDDs that are created for the k-means. In > the > middle of the execution under the storage tab of the web ui i can see 3 > RDDs > with their partitions cached in memory across all nodes which is very > helpful for monitoring reasons. The problem is that after the completion i > can only see one of them and the portion of the cache memory it used and i > would like to ask why the web ui doesn't display all the RDDs involded in > the computation. > > Thank you > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. 
KMeans takeSample jobs and RDD cached
Hi, i am running the k-means algorithm with initialization mode set to random and various dataset sizes and values for clusters, and i have a question regarding the takeSample job of the algorithm. More specifically, i notice that in every application there are two sampling jobs. The first one consumes the most time compared to all others while the second one is much quicker, and that sparked my interest to investigate what is actually happening. In order to explain it, i checked the source code of the takeSample operation and i saw that there is a count action involved and then the computation of a PartitionwiseSampledRDD with a PoissonSampler. So my question is whether that count action corresponds to the first takeSample job, and whether the second takeSample job is the one doing the actual sampling.

I also have a question about the RDDs that are created for k-means. In the middle of the execution, under the storage tab of the web UI, i can see 3 RDDs with their partitions cached in memory across all nodes, which is very helpful for monitoring reasons. The problem is that after completion i can only see one of them and the portion of the cache memory it used, and i would like to ask why the web UI doesn't display all the RDDs involved in the computation.

Thank you

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
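To illustrate why takeSample shows up as two jobs, here is a simplified plain-Python sketch of its shape: a size-determining count followed by a per-element sampling pass. This mimics the structure described in the thread, not Spark's exact implementation:

```python
import random

def take_sample(data, num, seed=42):
    """Sketch of takeSample's two jobs: a count() to learn the dataset
    size, then a per-element sampling pass sized from that count."""
    n = len(data)                        # job 1: the count action
    if num >= n:
        return list(data)
    fraction = min(1.0, num * 4.0 / n)   # oversample so we rarely fall short
    rng = random.Random(seed)
    sampled = [x for x in data if rng.random() < fraction]  # job 2
    rng.shuffle(sampled)
    return sampled[:num]

data = list(range(1000))
s = take_sample(data, 10)
print(len(s))
```

The first "job" here is cheap in plain Python, but in Spark the count has to materialize the RDD if it is not yet cached, which is why the first sampling job dominates (as the reply above also explains).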
Streaming Kmeans usage in java
Hello everyone, do we have a sample example of how to use streaming k-means clustering with Java? I have seen some example usage in Scala. Can anybody point me to a Java example? regards jeetendra
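While waiting for a Java example: StreamingKMeans lives in org.apache.spark.mllib.clustering and, as far as I can tell, is callable from Java the same way as from Scala (setK / setDecayFactor / setRandomCenters, then trainOn on a DStream). The forgetful per-batch center update it is based on can be sketched in plain Python (illustrative only, not Spark's code):

```python
def update_center(center, n, batch_points, decay=0.9):
    """One forgetful update of a single cluster center, in the spirit of the
    rule StreamingKMeans applies per mini-batch: the old center's weight is
    scaled by `decay` before the new batch is folded in. Plain-Python
    sketch, not Spark's code."""
    m = len(batch_points)
    if m == 0:
        return center, decay * n              # nothing assigned this batch
    dim = len(center)
    batch_sum = [sum(p[i] for p in batch_points) for i in range(dim)]
    new_n = decay * n + m
    new_center = [(decay * n * center[i] + batch_sum[i]) / new_n
                  for i in range(dim)]
    return new_center, new_n

# 10 old points at the origin, then one batch of 2 points, no forgetting:
c, n = update_center([0.0, 0.0], 10, [[1.0, 1.0], [1.0, 3.0]], decay=1.0)
print(c, n)  # the center moves to the overall mean: [1/6, 1/3], n = 12.0
```

With decay < 1 old batches fade out, which is the knob setDecayFactor exposes.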
Re: kmeans|| in Spark is not real paralleled?
Hi Xiangrui, I have created JIRA https://issues.apache.org/jira/browse/SPARK-6706 and attached the sample code. But I could not attach the test data. I will update the bug once I find a place to host the test data. Thanks, David On Tue, Mar 31, 2015 at 8:18 AM Xiangrui Meng wrote: > This PR updated the k-means|| initialization: > https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, > which was included in 1.3.0. It should fix kmeans|| initialization with > large k. Please create a JIRA for this issue and send me the code and the > dataset to reproduce this problem. Thanks! -Xiangrui > > On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen wrote: > >> [...]
Re: Mllib kmeans #iteration
Have you referred to the official k-means documentation at https://spark.apache.org/docs/1.1.1/mllib-clustering.html ?
Re: Mllib kmeans #iteration
Check out the Spark docs for that parameter: *maxIterations* http://spark.apache.org/docs/latest/mllib-clustering.html#k-means On Thu, Apr 2, 2015 at 4:42 AM, podioss wrote: > Hello, > I am running the KMeans algorithm in cluster mode from MLlib, and I was > wondering if I could run the algorithm with a fixed number of iterations in > some way. > > Thanks
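In the Scala RDD-based API the cap is simply the third argument, KMeans.train(data, k, maxIterations). To make the parameter concrete, here is a plain-Python Lloyd's loop with a hard iteration cap in the spirit of maxIterations (an illustrative sketch, not MLlib's implementation):

```python
import random

def kmeans(points, k, max_iterations=20, seed=0):
    """Plain Lloyd's algorithm with a hard iteration cap, analogous to the
    maxIterations argument of MLlib's KMeans.train. Illustrative sketch,
    not MLlib's implementation."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # "random" initialization mode
    for _ in range(max_iterations):           # the fixed iteration budget
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centers[j])))
            clusters[nearest].append(p)
        new_centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:            # converged before the cap
            break
        centers = new_centers
    return centers

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
centers = kmeans(pts, 2, max_iterations=10)
print(len(centers))  # 2
```

Note the cap is an upper bound: the loop may stop earlier once assignments stabilize, which is also how MLlib behaves (it additionally stops when centers move less than a tolerance).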
Mllib kmeans #iteration
Hello, I am running the KMeans algorithm in cluster mode from MLlib, and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks
Re: kmeans|| in Spark is not real paralleled?
This PR updated the k-means|| initialization: https://github.com/apache/spark/commit/ca7910d6dd7693be2a675a0d6a6fcc9eb0aaeb5d, which was included in 1.3.0. It should fix kmeans|| initialization with large k. Please create a JIRA for this issue and send me the code and the dataset to reproduce this problem. Thanks! -Xiangrui On Sun, Mar 29, 2015 at 1:20 AM, Xi Shen wrote: > [...]
kmeans|| in Spark is not real paralleled?
Hi, I have opened a couple of threads asking about a k-means performance problem in Spark. I think I have made a little progress. Previously I used the simplest form, KMeans.train(rdd, k, maxIterations). It uses the "kmeans||" initialization algorithm, which is supposedly a faster version of kmeans++ and gives better results in general. But I observed that if k is very large, the initialization step takes a long time. From the CPU utilization chart, it looks like only one thread is working. Please see https://stackoverflow.com/questions/29326433/cpu-gap-when-doing-k-means-with-spark . I read the paper http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, and it points out that the kmeans++ initialization algorithm will suffer if k is large. That's why the paper contributed the kmeans|| algorithm. If I invoke KMeans.train using the random initialization algorithm, I do not observe this problem, even with a very large k, like k=5000. This makes me suspect that kmeans|| in Spark is not properly implemented and does not actually run in parallel. I have also tested my code and data set with Spark 1.3.0, and I still observe this problem. I quickly checked the PR regarding the KMeans algorithm changes from 1.2.0 to 1.3.0. It seems to be only code improvement and polish, not changing/improving the algorithm. I originally worked in a Windows 64-bit environment, and I have also tested in a Linux 64-bit environment. I can provide the code and data set if anyone wants to reproduce this problem. I hope a Spark developer could comment on this problem and help identify whether it is a bug. Thanks, Xi Shen (about.me/davidshen)
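For reference, the parallelism difference between the two seeding methods is easy to see in a sketch: kmeans++ needs k sequential passes over the data (one per center), while kmeans|| samples many candidates per round for only a few rounds. A plain-Python sketch of the kmeans|| seeding idea from the Bahmani et al. paper (illustrative only, not Spark's implementation):

```python
import random

def cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

def kmeans_parallel_init(points, k, l=None, rounds=5, seed=0):
    """Sketch of kmeans|| seeding (Bahmani et al.): each round makes ONE
    pass over the data and samples ~l candidate centers independently, so
    the number of passes does not grow with k as it does for kmeans++.
    Illustrative only -- not Spark's implementation."""
    rng = random.Random(seed)
    l = l or 2 * k                        # oversampling factor per round
    centers = [rng.choice(points)]
    for _ in range(rounds):
        total = cost(points, centers)
        if total == 0:
            break
        snapshot = list(centers)          # sample against this round's centers
        # This loop is embarrassingly parallel: every point decides
        # independently whether it joins the candidate set.
        for p in points:
            d = min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in snapshot)
            if rng.random() < l * d / total:
                centers.append(p)
    # Spark then reduces the ~l*rounds candidates to k centers with a local
    # weighted kmeans++ run on the driver; here we simply truncate.
    return centers[:k]

grid = [[float(x), float(y)] for x in range(10) for y in range(10)]
seeds = kmeans_parallel_init(grid, 5)
print(len(seeds))  # at most 5
```

If the initialization in a given Spark version still runs single-threaded for large k, the bottleneck would be in the implementation (e.g. the local reduction step), not in the algorithm itself.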
Re: Why KMeans with mllib is so slow ?
Hi Burak, Unfortunately, I am expected to do my work in an HDInsight environment, which only supports Spark 1.2.0 with Microsoft's flavor; I cannot simply replace it with Spark 1.3. I think the problem I am observing is caused by the kmeans|| initialization step. I will open another thread to discuss it. Thanks, David (about.me/davidshen) On Sun, Mar 29, 2015 at 4:34 PM, Burak Yavuz wrote: > Hi David, > > Can you also try with Spark 1.3 if possible? I believe there was a 2x > improvement on K-Means between 1.2 and 1.3. > > Thanks, > Burak > > On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 wrote: >> [...]
Re: Why KMeans with mllib is so slow ?
Hi David, Can you also try with Spark 1.3 if possible? I believe there was a 2x improvement on K-Means between 1.2 and 1.3. Thanks, Burak On Sat, Mar 28, 2015 at 9:04 PM, davidshen84 wrote: > [...]
Re: Why KMeans with mllib is so slow ?
Hi Jao, Sorry to pop up this old thread. I am having the same problem you did, and I want to know if you have figured out how to improve k-means on Spark. I am using Spark 1.2.0. My data set is about 270k vectors, each with about 350 dimensions. If I set k=500, the job takes about 3 hours on my cluster. The cluster has 7 executors, each with 8 cores... If I set k=5000, which is the required value for my task, the job goes on forever... Thanks, David
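A back-of-envelope estimate shows why k=5000 hurts: each Lloyd iteration costs roughly n * k * d distance operations, so going from k=500 to k=5000 is a 10x jump per iteration, before the initialization cost (which also grows with k under kmeans|| seeding) is even considered:

```python
# Back-of-envelope cost of one Lloyd (k-means) iteration for the numbers in
# this thread: roughly n * k * d multiply-adds for point-to-center distances.
n, d = 270_000, 350
for k in (500, 5000):
    ops = n * k * d
    print(f"k={k}: ~{ops / 1e9:.0f} Gflop per iteration")
```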
Spark MLLib KMeans Top Terms
I'm trying to cluster short text messages using KMeans. After training the KMeans model, I want to get the top terms (5 - 10) for each cluster. How do I get those using clusterCenters? The full code is here: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
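One common approach, assuming the messages were vectorized over a fixed vocabulary (e.g. with TF-IDF): each entry of a cluster center is then the weight of one vocabulary term, so the top terms are the indices of the largest center components mapped back through the vocabulary. A plain-Python sketch (the vocabulary and helper below are hypothetical, not taken from the linked code):

```python
def top_terms(cluster_centers, vocabulary, n_terms=5):
    """Map each k-means cluster center back to its heaviest vocabulary
    terms. Assumes the messages were vectorized over `vocabulary` (e.g.
    TF-IDF), so component i of a center is the weight of vocabulary[i].
    Hypothetical helper for illustration."""
    result = []
    for center in cluster_centers:
        top = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
        result.append([vocabulary[i] for i in top[:n_terms]])
    return result

# Hypothetical 5-term vocabulary and two cluster centers:
vocab = ["spark", "kmeans", "rdd", "cache", "stream"]
centers = [[0.1, 0.9, 0.0, 0.8, 0.2],
           [0.7, 0.0, 0.6, 0.1, 0.0]]
print(top_terms(centers, vocab, n_terms=2))  # [['kmeans', 'cache'], ['spark', 'rdd']]
```

With MLlib the `cluster_centers` input would come from model.clusterCenters after training; the only extra piece you need is the index-to-term mapping from whatever vectorizer built the features.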