Re: getting the cluster elements from kmeans run
KMeansModel only returns the cluster centroids. To get the number of elements in each cluster, try calling kmeans.predict() on each of the points in the data used to build the model. See https://github.com/OryxProject/oryx/blob/master/oryx-app-mllib/src/main/java/com/cloudera/oryx/app/mllib/kmeans/KMeansUpdate.java and look at the method fetchClusterCountsFromModel().

From: Harini Srinivasan har...@us.ibm.com
To: user@spark.apache.org
Sent: Wednesday, February 11, 2015 12:36 PM
Subject: getting the cluster elements from kmeans run

Hi,

Is there a way to get the elements of each cluster after running kmeans clustering? I am using the Java version.

thanks
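The counting step suggested above can be sketched without Spark. This is an illustrative stand-in for KMeansModel.predict() (plain nearest-centroid assignment); the centroids and data points are made up:

```python
from collections import Counter
import math

# Hypothetical stand-in for KMeansModel.predict(): assign a point to
# the index of its nearest centroid by Euclidean distance.
def predict(centroids, point):
    return min(range(len(centroids)),
               key=lambda i: math.dist(centroids[i], point))

centroids = [(0.0, 0.0), (10.0, 10.0)]   # made-up trained centers
data = [(0.1, 0.2), (0.3, 0.1), (9.8, 10.1), (10.2, 9.9), (0.0, 0.5)]

# Count elements per cluster by predicting on the training data.
counts = Counter(predict(centroids, p) for p in data)
print(counts)  # cluster 0 -> 3 points, cluster 1 -> 2 points
```

In Spark the same pattern is a predict-then-countByValue over the training RDD, which is essentially what fetchClusterCountsFromModel() does in the linked Oryx code.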
Re: K-Means final cluster centers
There's a kMeansModel.clusterCenters() available if you are looking to get the centers from a KMeansModel.

From: SK skrishna...@gmail.com
To: user@spark.apache.org
Sent: Thursday, February 5, 2015 5:35 PM
Subject: K-Means final cluster centers

Hi,

I am trying to get the final cluster centers after running the KMeans algorithm in MLlib in order to characterize the clusters, but KMeansModel does not seem to have a public method to retrieve this info. There appears to be only a private method called clusterCentersWithNorm. I guess I could call predict() to get the final cluster assignment for the dataset and write my own code to compute the means based on this final assignment, but I would like to know if there is a way to get this info from the MLlib API directly after running KMeans?

thanks
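For reference, the manual fallback SK describes (computing the means from the final assignment) is only a few lines. This sketch uses made-up points and assignments; in Spark, clusterCenters() gives the same result directly:

```python
from collections import defaultdict

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
assignments = [0, 0, 1, 1]  # cluster index per point, e.g. from predict()

# Group points by their assigned cluster.
groups = defaultdict(list)
for p, c in zip(points, assignments):
    groups[c].append(p)

# Each center is the coordinate-wise mean of its cluster's points.
centers = {c: tuple(sum(col) / len(pts) for col in zip(*pts))
           for c, pts in groups.items()}
print(centers)  # {0: (0.0, 1.0), 1: (10.0, 11.0)}
```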
Re: Row similarities
Andrew, you would be better off using Mahout's RowSimilarityJob for what you are trying to accomplish:

1. It does give you pair-wise distances.
2. You can specify the distance measure you want to use.
3. There's both the old MapReduce implementation and the Spark DSL implementation, per your preference.

From: Andrew Musselman andrew.mussel...@gmail.com
To: Reza Zadeh r...@databricks.com
Cc: user user@spark.apache.org
Sent: Saturday, January 17, 2015 11:29 AM
Subject: Re: Row similarities

Thanks Reza, interesting approach. On second thought, I think what I actually want is to calculate pair-wise distances. Is there a pattern for that?

On Jan 16, 2015, at 9:53 PM, Reza Zadeh r...@databricks.com wrote:

You can use k-means with a suitably large k. Each cluster should correspond to rows that are similar to one another.

On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com wrote:

What's a good way to calculate similarities between all vector rows in a matrix or RDD[Vector]? I see RowMatrix has a columnSimilarities method, but I'm not sure transposing the matrix in order to run that is a good path.
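The pair-wise distances Andrew asks about can be sketched in plain code (Euclidean distance over all row pairs; the rows here are made up). At scale this n*(n-1)/2 pair computation is roughly what RowSimilarityJob distributes:

```python
import math
from itertools import combinations

rows = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # made-up vector rows

# Distance for every unordered pair of row indices: n*(n-1)/2 entries.
pairwise = {(i, j): math.dist(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}
print(pairwise)  # {(0, 1): 5.0, (0, 2): 10.0, (1, 2): 5.0}
```

The quadratic blow-up in pairs is exactly why a distributed job (or an approximation like columnSimilarities' sampling) is needed for large row counts.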
Re: Clustering text data with MLlib
Here's the Streaming KMeans from Spark 1.2: http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1 Streaming KMeans still needs an initial 'k' to be specified; it then progresses toward an optimal 'k', IIRC.

From: Sean Owen so...@cloudera.com
To: jatinpreet jatinpr...@gmail.com
Cc: user@spark.apache.org
Sent: Monday, December 29, 2014 6:25 AM
Subject: Re: Clustering text data with MLlib

You can try several values of k, apply some evaluation metric to the clustering, and then use that to decide what k is best, or at least pretty good. If it's a completely unsupervised problem, the metrics you can use tend to be some function of the inter-cluster and intra-cluster distances (good clustering means points are near to things in their own cluster and far from things in other clusters). If it's a supervised problem, you can bring in things like purity or mutual information, but I don't think that's the case here. You would have to implement these metrics yourself.

You can consider clustering algorithms that do not depend on k, like, say, DBSCAN, although this has its own different hyperparameter to pick. Again you'd have to implement it yourself.

What you describe sounds like topic modeling using LDA. This still requires you to pick a number of topics, but lets documents belong to several topics. Maybe that's more like what you want. This isn't in Spark per se, but there is some work done on it (https://issues.apache.org/jira/browse/SPARK-1405) and Sandy has written up some text on doing this in Spark.

Finally, there is the hierarchical Dirichlet process, which does allow the number of topics to be learned dynamically. This is relatively advanced. Finally finally, maybe someone can remind me of the streaming k-means variant that tries to pick k dynamically too. I am not finding what I'm thinking of, but I think this exists.
On Mon, Dec 29, 2014 at 10:55 AM, jatinpreet jatinpr...@gmail.com wrote:

Hi,

I wish to cluster a set of textual documents into an undefined number of classes. The clustering algorithm provided in MLlib, i.e. k-means, requires me to give a pre-defined number of classes. Is there any algorithm intelligent enough to identify how many classes should be made based on the input documents? I want to utilize the speed and agility of Spark in the process.

Thanks,
Jatin
-
Novice Big Data Programmer
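Sean's "try several values of k with an evaluation metric" can be sketched with the within-cluster sum of squared errors (the cost MLlib's KMeansModel.computeCost reports). The toy k-means below is a minimal Lloyd's-iteration stand-in, not MLlib's implementation, and the data is made up: two well-separated blobs, so the cost drops sharply from k=1 to k=2 and then flattens (the "elbow" at k=2):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: assign to nearest center, re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: math.dist(centers[i], p))].append(p)
        # Re-center each nonempty group at its mean; keep empty groups' centers.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def wssse(points, centers):
    """Within-set sum of squared errors, like KMeansModel.computeCost."""
    return sum(min(math.dist(c, p) ** 2 for c in centers) for p in points)

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),      # blob near the origin
          (9.9, 10.0), (10.1, 9.8), (10.0, 10.2)]  # blob near (10, 10)
costs = {k: wssse(points, kmeans(points, k)) for k in (1, 2, 3)}
```

Plotting `costs` against k and picking the elbow is the simplest version of the unsupervised evaluation Sean describes; silhouette scores or inter/intra-cluster distance ratios are drop-in alternatives.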
Re: K-means faster on Mahout then on Spark
Mahout does have a k-means which can be executed in MapReduce and iterative modes.

Sent from my iPhone

On Mar 25, 2014, at 9:25 AM, Prashant Sharma scrapco...@gmail.com wrote:

I think Mahout uses FuzzyKmeans, which is a different algorithm, and it is not iterative.

Prashant Sharma

On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov pahomov.e...@gmail.com wrote:

Hi, I'm running a benchmark which compares Mahout and SparkML. For now I have these results for k-means (times in seconds):

Iterations | Elements | Mahout time | Spark time
        10 |     1000 |         602 |        138
        40 |     1000 |        1917 |        330
        70 |     1000 |        3203 |        388
        10 |        1 |        1235 |       2226
        40 |        1 |        2755 |       6388
        70 |        1 |        4107 |      10967
        10 |       10 |        7070 |      25268

It runs on a YARN cluster with about 40 machines. Elements for clusterization are randomly created. When I changed the persistence level from Memory to Memory_and_disk, Spark started to work faster on big data. What am I missing? See my benchmarking code in attachment.

--
Sincerely yours
Egor Pakhomov
Scala Developer, Yandex
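Egor's observation about persistence levels comes down to k-means rescanning the full dataset on every iteration: if the input RDD isn't cached, Spark recomputes its lineage on each pass. A rough single-machine analogy (plain Python, made-up data; regenerating a generator stands in for recomputing the lineage, and the materialized list stands in for rdd.persist()):

```python
import random
import time

N = 200_000
CENTER = (0.5, 0.5)

def make_data():
    # Simulates recomputing the input from scratch on every pass (no caching).
    rng = random.Random(0)
    return ((rng.random(), rng.random()) for _ in range(N))

def scan(points):
    # One k-means-like pass: a full scan computing squared distances.
    return sum((x - CENTER[0]) ** 2 + (y - CENTER[1]) ** 2 for x, y in points)

# Uncached: the data is regenerated for each of 3 "iterations".
t0 = time.perf_counter()
for _ in range(3):
    scan(make_data())
uncached = time.perf_counter() - t0

# Cached: materialize once, then rescan -- analogous to persisting the RDD.
data = list(make_data())
t0 = time.perf_counter()
for _ in range(3):
    scan(data)
cached = time.perf_counter() - t0
```

The same effect explains the MapReduce-vs-iterative distinction in Mahout: writing intermediate state to disk between iterations pays the regeneration cost every time, while an in-memory iterative engine pays it once.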