Here's the Streaming KMeans from Spark 1.2: http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1

Streaming KMeans still needs an initial 'k' to be specified; it then progresses to come up with an optimal 'k', IIRC.
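For reference, a minimal sketch along the lines of the MLlib 1.2 streaming k-means example from those docs (the HDFS path, k = 3, and the 3-dimensional vectors are placeholders, not anything from this thread):

  import org.apache.spark.SparkConf
  import org.apache.spark.mllib.clustering.StreamingKMeans
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(
    new SparkConf().setAppName("StreamingKMeansSketch"), Seconds(10))

  // Each line in the monitored directory is a vector like "[0.1,0.2,0.3]"
  val training = ssc.textFileStream("hdfs://.../training").map(Vectors.parse)

  val model = new StreamingKMeans()
    .setK(3)                  // k still has to be chosen up front
    .setDecayFactor(1.0)      // 1.0 = weight all past data equally
    .setRandomCenters(3, 0.0) // vector dimension, initial center weight

  model.trainOn(training)     // centers are updated as each batch arrives

  ssc.start()
  ssc.awaitTermination()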
From: Sean Owen <so...@cloudera.com>
To: jatinpreet <jatinpr...@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Sent: Monday, December 29, 2014 6:25 AM
Subject: Re: Clustering text data with MLlib

You can try several values of k, apply some evaluation metric to the clustering, and then use that to decide which k is best, or at least pretty good. If it's a completely unsupervised problem, the metrics you can use tend to be some function of the inter-cluster and intra-cluster distances (a good clustering means points are near the points in their own cluster and far from points in other clusters). If it's a supervised problem, you can bring in measures like purity or mutual information, but I don't think that's the case here. You would have to implement these metrics yourself.

You can consider clustering algorithms that do not depend on k, like, say, DBSCAN, although it has its own hyperparameters to pick. Again, you'd have to implement it yourself.

What you describe sounds like topic modeling with LDA. This still requires you to pick a number of topics, but it lets documents belong to several topics. Maybe that's more like what you want. This isn't in Spark per se, but there is some work on it (https://issues.apache.org/jira/browse/SPARK-1405) and Sandy has written up some text on doing this in Spark.

Finally, there is the Hierarchical Dirichlet Process, which does allow the number of topics to be learned dynamically. This is relatively advanced.

Finally finally, maybe someone can remind me of the streaming k-means variant that tries to pick k dynamically too. I am not finding what I'm thinking of, but I think this exists.

On Mon, Dec 29, 2014 at 10:55 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> Hi,
>
> I wish to cluster a set of textual documents into an undefined number of
> classes. The clustering algorithm provided in MLlib, i.e. K-means, requires
> me to give a pre-defined number of classes.
>
> Is there any algorithm which is intelligent enough to identify how many
> classes should be made based on the input documents? I want to utilize the
> speed and agility of Spark in the process.
>
> Thanks,
> Jatin
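One way to make the "try several values of k" suggestion above concrete with what MLlib already ships: a rough sketch that sweeps k and prints the WSSSE (within-set sum of squared errors) from KMeansModel.computeCost. This only covers the intra-cluster side of the metrics Sean mentions; the input path, the k range, and the assumption that documents are already numeric (e.g. TF-IDF) vectors are all placeholders.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val sc = new SparkContext(new SparkConf().setAppName("ChooseK"))

  // One document per line, already converted to a numeric feature vector
  // (e.g. TF-IDF), with space-separated values.
  val data = sc.textFile("hdfs://.../tfidf-vectors")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    .cache()

  // Sweep k and record the intra-cluster cost; look for the "elbow"
  // where adding more clusters stops helping much.
  val costs = (2 to 20 by 2).map { k =>
    val model = KMeans.train(data, k, 20) // 20 iterations per run
    (k, model.computeCost(data))
  }
  costs.foreach { case (k, wssse) => println(s"k=$k  WSSSE=$wssse") }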
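On the DBSCAN suggestion: as noted, it's not in MLlib, so you'd write it yourself. Below is only a single-machine sketch of the core algorithm with a naive O(n^2) neighbor search and Euclidean distance; it trades the choice of k for eps and minPts, and distributing it over Spark is its own project.

  // Labels: 0 = unvisited, -1 = noise, positive = cluster id
  def dbscan(points: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
    val n = points.length
    val labels = Array.fill(n)(0)

    def dist(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    // Naive linear scan per query; a real implementation would use a spatial index.
    def neighbors(i: Int): Seq[Int] =
      (0 until n).filter(j => dist(points(i), points(j)) <= eps)

    var cluster = 0
    for (i <- 0 until n if labels(i) == 0) {
      val seeds = neighbors(i)
      if (seeds.size < minPts) {
        labels(i) = -1                              // provisionally noise
      } else {
        cluster += 1
        labels(i) = cluster
        val frontier = scala.collection.mutable.Queue(seeds: _*)
        while (frontier.nonEmpty) {
          val j = frontier.dequeue()
          if (labels(j) == -1) labels(j) = cluster  // noise becomes a border point
          if (labels(j) == 0) {
            labels(j) = cluster
            val js = neighbors(j)
            if (js.size >= minPts) frontier ++= js  // core point: keep expanding
          }
        }
      }
    }
    labels
  }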
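On the LDA pointer: SPARK-1405 did eventually land, and MLlib gained an LDA implementation in Spark 1.3, after this thread. A minimal sketch, assuming a corpus of (document id, term-count vector) pairs; the input path and k = 10 topics are placeholders.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.LDA
  import org.apache.spark.mllib.linalg.Vectors

  val sc = new SparkContext(new SparkConf().setAppName("LDASketch"))

  // One bag-of-words count vector per document, space-separated counts.
  val termCounts = sc.textFile("hdfs://.../term-counts")
    .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

  // LDA expects (documentId, countVector) pairs.
  val corpus = termCounts.zipWithIndex.map { case (v, id) => (id, v) }.cache()

  val ldaModel = new LDA().setK(10).run(corpus)

  // topicsMatrix is vocabSize x k: columns are topics, rows are term weights.
  val topics = ldaModel.topicsMatrix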