Re: Clustering text data with MLlib
K-means really does need the number of clusters identified in advance. There are several algorithms (X-means, ART, ...) which do not need this information. Unfortunately, none of them is implemented in MLlib at the moment (you could give a hand and help the community). Anyway, it seems to me you will not be satisfied with those algorithms (X-means, ART, ...) either. As I understand it, what you want is a precise number of clusters. Note that whenever you change the input parameters (random seed, ...), the number of clusters may come out different. Clustering is a great tool, but it won't give you one true answer (one number).

regards,
Tomas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Clustering-text-data-with-MLlib-tp20883p20899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
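A minimal pure-Python sketch of Lloyd's k-means (a toy illustration, not MLlib's actual API) makes both points above visible: k is a required input, and the outcome depends on the random seed used to pick the initial centroids.

```python
import random
from math import dist

def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's algorithm; k must be chosen up front."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # seed-dependent initialization
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        # update step: move each centroid to its cluster mean
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.1), (8.8, 0.3)]
a = kmeans(points, k=3, seed=1)
b = kmeans(points, k=3, seed=2)
# a and b may or may not match: different seeds can land in different
# local optima, so the "discovered" clusters are not a unique answer
```

This is why rerunning with a different seed can produce a different partition even though k itself is fixed.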
Re: Clustering text data with MLlib
Jatin,

One approach for determining K would be to sample the data set and run PCA on the sample. Then evaluate how many of the resulting eigenvalue/eigenvector pairs to use before you reach diminishing returns on cumulative error. That number provides a reasonably good value for K to use in KMeans.

With recent releases of Spark and MLlib you don't have to sample; you could run PCA at scale on the full data, but that may be overkill for what you need.

As Sean mentioned, there may be other algorithms that would be more effective for your use case. LDA is good for topic modeling, though in practice its results can be noisy unless the pipeline does some parsing/processing of the text ahead of training. Word2Vec can be an interesting alternative for topic modeling (also in Spark MLlib), and you may want to take a look at this tutorial/case study: http://www.yseam.com/blog/WV.html

On Mon, Dec 29, 2014 at 2:55 AM, jatinpreet wrote:
> Hi,
>
> I wish to cluster a set of textual documents into an undefined number of
> classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
> me to give a pre-defined number of classes.
>
> Is there any algorithm which is intelligent enough to identify how many
> classes should be made based on the input documents? I want to utilize the
> speed and agility of Spark in the process.
>
> Thanks,
> Jatin
>
> -
> Novice Big Data Programmer
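The "diminishing returns on cumulative error" bookkeeping above can be sketched in a few lines: given the eigenvalue spectrum from PCA, pick K as the number of leading components needed to capture some share of the total variance. The spectrum below is a made-up example, not output from any real PCA run.

```python
def k_from_spectrum(eigenvalues, threshold=0.9):
    """Smallest K whose leading eigenvalues capture `threshold`
    of the total variance; a simple proxy for 'diminishing returns'."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)

# hypothetical spectrum from PCA on a sampled document-term matrix
spectrum = [5.0, 3.0, 1.0, 0.5, 0.3, 0.2]
k = k_from_spectrum(spectrum)  # 5 + 3 + 1 = 9 of 10 total, so K = 3
```

The 90% threshold is itself a judgment call; plotting the cumulative ratio and looking for the knee is the more common practice.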
Re: Clustering text data with MLlib
Here's the Streaming KMeans from Spark 1.2:
http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1

Streaming KMeans still needs an initial 'k' to be specified; it then progresses to come up with an optimal 'k', IIRC.
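For intuition, the core of a streaming/online k-means step can be sketched in plain Python. This is a toy illustration, not MLlib's StreamingKMeans API; the `decay` parameter mirrors the idea of down-weighting old data so centroids can track a drifting stream.

```python
from math import dist

def online_update(centroids, counts, point, decay=1.0):
    """One streaming k-means step: assign `point` to its nearest
    centroid and move that centroid toward it. `decay` < 1.0
    shrinks the weight of history before folding in the new point."""
    j = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    n = counts[j] * decay                      # discounted mass of old points
    centroids[j] = tuple((c * n + x) / (n + 1) for c, x in zip(centroids[j], point))
    counts[j] = n + 1
    return j

centroids = [(0.0, 0.0), (10.0, 10.0)]   # an initial k = 2 is still required
counts = [1.0, 1.0]
for p in [(0.2, 0.1), (9.8, 10.1), (0.0, 0.3)]:
    online_update(centroids, counts, p)
```

With `decay=1.0` this reduces to the running mean of every point ever assigned to a centroid; note that k itself never changes during the stream, which matches the caveat above.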
Re: Clustering text data with MLlib
You can try several values of k, apply some evaluation metric to the clustering, and then use that to decide what k is best, or at least pretty good.

If it's a completely unsupervised problem, the metrics you can use tend to be some function of the inter-cluster and intra-cluster distances (good clustering means points are near to things in their own cluster and far from things in other clusters). If it's a supervised problem, you can bring in metrics like purity or mutual information, but I don't think that's the case here. You would have to implement these metrics yourself.

You can consider clustering algorithms that do not depend on k, like, say, DBSCAN, although this has its own, different hyperparameters to pick. Again you'd have to implement it yourself.

What you describe sounds like topic modeling using LDA. This still requires you to pick a number of topics, but lets documents belong to several topics. Maybe that's more like what you want. This isn't in Spark per se, but there is some work done on it (https://issues.apache.org/jira/browse/SPARK-1405) and Sandy has written up some text on doing this in Spark.

Finally, there is the Hierarchical Dirichlet Process, which does allow the number of topics to be learned dynamically. This is relatively advanced.

Finally finally, maybe someone can remind me of the streaming k-means variant that tries to pick k dynamically too. I am not finding what I'm thinking of, but think this exists.

On Mon, Dec 29, 2014 at 10:55 AM, jatinpreet wrote:
> Hi,
>
> I wish to cluster a set of textual documents into an undefined number of
> classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
> me to give a pre-defined number of classes.
>
> Is there any algorithm which is intelligent enough to identify how many
> classes should be made based on the input documents? I want to utilize the
> speed and agility of Spark in the process.
>
> Thanks,
> Jatin
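The "try several values of k and score each clustering" advice can be sketched end to end with an intra-cluster metric in plain Python. This is a self-contained toy (the within-set sum of squared errors, similar in spirit to what MLlib exposes as a model cost), not MLlib code.

```python
import random
from math import dist

def kmeans(points, k, seed=0, iters=20):
    """Toy Lloyd's k-means returning centroids and their clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

def wssse(centroids, clusters):
    # within-set sum of squared errors: lower means tighter clusters
    return sum(dist(p, c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.1), (8.8, 0.3)]
scores = {k: wssse(*kmeans(points, k)) for k in range(1, 5)}
# pick the k where the score stops improving sharply (the "elbow");
# averaging over several seeds per k makes the curve less noisy
```

Note that WSSSE always reaches zero when k equals the number of points, so the raw minimum is useless; it's the elbow in the curve, or a penalized score like the silhouette, that actually selects k.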