Here's the Streaming KMeans from Spark 1.2: http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1

Streaming KMeans still needs an initial 'k' to be specified; it then progresses to come up with an optimal 'k', IIRC.
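For reference, a minimal sketch along the lines of the MLlib 1.2 streaming k-means example from those docs (the HDFS path, k = 3, and the 3-dimensional vectors are placeholders, not anything from this thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.mllib.clustering.StreamingKMeans
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(
    new SparkConf().setAppName("StreamingKMeansSketch"), Seconds(10))

  // Each line in the monitored directory is a vector like "[0.1,0.2,0.3]"
  val training = ssc.textFileStream("hdfs://.../training").map(Vectors.parse)

  val model = new StreamingKMeans()
    .setK(3)                  // k still has to be chosen up front
    .setDecayFactor(1.0)      // 1.0 = weight all past data equally
    .setRandomCenters(3, 0.0) // vector dimension, initial center weight

  model.trainOn(training)     // centers are updated as each batch arrives

  ssc.start()
  ssc.awaitTermination()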
From: Sean Owen <so...@cloudera.com>
To: jatinpreet <jatinpr...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Sent: Monday, December 29, 2014 6:25 AM
Subject: Re: Clustering text data with MLlib

You can try several values of k, apply some evaluation metric to the clustering, and then use that to decide which k is best, or at least pretty good. If it's a completely unsupervised problem, the metrics you can use tend to be some function of the inter-cluster and intra-cluster distances (a good clustering means points are near the points in their own cluster and far from points in other clusters). If it's a supervised problem, you can bring in measures like purity or mutual information, but I don't think that's the case here. You would have to implement these metrics yourself.

You can consider clustering algorithms that do not depend on k, like, say, DBSCAN, although it has its own hyperparameters to pick. Again, you'd have to implement it yourself.

What you describe sounds like topic modeling with LDA. This still requires you to pick a number of topics, but it lets documents belong to several topics. Maybe that's more like what you want. This isn't in Spark per se, but there is some work on it (https://issues.apache.org/jira/browse/SPARK-1405) and Sandy has written up some text on doing this in Spark.

Finally, there is the Hierarchical Dirichlet Process, which does allow the number of topics to be learned dynamically. This is relatively advanced.

Finally finally, maybe someone can remind me of the streaming k-means variant that tries to pick k dynamically too. I am not finding what I'm thinking of, but I think this exists.

On Mon, Dec 29, 2014 at 10:55 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> Hi,
>
> I wish to cluster a set of textual documents into an undefined number of
> classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
> me to give a pre-defined number of classes.
>
> Is there any algorithm which is intelligent enough to identify how many
> classes should be made based on the input documents? I want to utilize the
> speed and agility of Spark in the process.
>
> Thanks,
> Jatin
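One way to make the "try several values of k" suggestion above concrete with what MLlib already ships: a rough sketch that sweeps k and prints the WSSSE (within-set sum of squared errors) from KMeansModel.computeCost. This only covers the intra-cluster side of the metrics Sean mentions; the input path, the k range, and the assumption that documents are already numeric (e.g. TF-IDF) vectors are all placeholders.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val sc = new SparkContext(new SparkConf().setAppName("ChooseK"))

  // One document per line, already converted to a numeric feature vector
  // (e.g. TF-IDF), with space-separated values.
  val data = sc.textFile("hdfs://.../tfidf-vectors")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  // Sweep k and record the intra-cluster cost; look for the "elbow"
  // where adding more clusters stops helping much.
  val costs = (2 to 20 by 2).map { k =>
    val model = KMeans.train(data, k, 20) // 20 iterations per run
    (k, model.computeCost(data))
  }
  costs.foreach { case (k, wssse) => println(s"k=$k  WSSSE=$wssse") }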
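On the DBSCAN suggestion: as noted, it's not in MLlib, so you'd write it yourself. Below is only a single-machine sketch of the core algorithm with a naive O(n^2) neighbor search and Euclidean distance; it trades the choice of k for eps and minPts, and distributing it over Spark is its own project.

  // Labels: 0 = unvisited, -1 = noise, positive = cluster id
  def dbscan(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
    val n = points.length
    val labels = Array.fill(n)(0)

    def dist(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    // Naive linear scan per query; a real implementation would use a spatial index.
    def neighbors(i: Int): Seq[Int] =
      (0 until n).filter(j => dist(points(i), points(j)) <= eps)

    var cluster = 0
    for (i <- 0 until n if labels(i) == 0) {
      val seeds = neighbors(i)
      if (seeds.size < minPts) {
        labels(i) = -1                              // provisionally noise
      } else {
        cluster += 1
        labels(i) = cluster
        val frontier = scala.collection.mutable.Queue(seeds: _*)
        while (frontier.nonEmpty) {
          val j = frontier.dequeue()
          if (labels(j) == -1) labels(j) = cluster  // noise becomes a border point
          if (labels(j) == 0) {
            labels(j) = cluster
            val js = neighbors(j)
            if (js.size >= minPts) frontier ++= js  // core point: keep expanding
          }
        }
      }
    }
    labels
  }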
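On the LDA pointer: SPARK-1405 did eventually land, and MLlib gained an LDA implementation in Spark 1.3, after this thread. A minimal sketch, assuming a corpus of (document id, term-count vector) pairs; the input path and k = 10 topics are placeholders.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.LDA
  import org.apache.spark.mllib.linalg.Vectors

  val sc = new SparkContext(new SparkConf().setAppName("LDASketch"))

  // One bag-of-words count vector per document, space-separated counts.
  val termCounts = sc.textFile("hdfs://.../term-counts")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

  // LDA expects (documentId, countVector) pairs.
  val corpus = termCounts.zipWithIndex.map { case (v, id) => (id, v) }.cache()

  val ldaModel = new LDA().setK(10).run(corpus)

  // topicsMatrix is vocabSize x k: columns are topics, rows are term weights.
  val topics = ldaModel.topicsMatrix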