On Jan 2, 2010, at 2:15 AM, Shashikant Kore wrote:

> On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[email protected]> wrote:
>> 
>> The other thing I'm interested in is people's real world feedback on using 
>> clustering to solve their text related problems.
>> For instance, what type of feature reduction did you do (stopword removal, 
>> stemming, etc.)?  What algorithms worked for you?
>> What didn't work?  Any and all insight is welcome and I don't particularly 
>> care if it is Mahout specific (for instance, part of
>> the chapter is about search result clustering using Carrot2 and so Mahout 
>> isn't applicable)
>> 
> 
> Let me start by saying Mahout works great for us. We can run k-means
> on 250k docs (10 iterations, 100 seeds) in less than 30 minutes on a
> single host.
> 
> Using vector normalization, such as the L2 norm, helped quite a bit.
> Thanks to Ted for this suggestion. In text clustering, you have lots
> of small documents, which results in very sparse vectors (a total of
> 100K features, with each vector having only around 200 of them).
> Plain TF-IDF weights on their own don't work as nicely.
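
Nice.  For anyone following along, here's roughly what that amounts to;
a plain-Java sketch that uses a Map as a stand-in for a sparse vector
(hypothetical names, untested):

    import java.util.Map;

    // Scale a sparse TF-IDF vector (term -> weight) to unit L2 length,
    // so that long and short documents become comparable when taking
    // distances between them.
    static void l2Normalize(Map<String, Double> vector) {
      double sumOfSquares = 0.0;
      for (double w : vector.values()) {
        sumOfSquares += w * w;
      }
      double norm = Math.sqrt(sumOfSquares);
      if (norm > 0.0) {
        for (Map.Entry<String, Double> e : vector.entrySet()) {
          e.setValue(e.getValue() / norm);
        }
      }
    }
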
> 
> Even if we don't do explicit stop word removal, the document-count
> thresholds do it in a better fashion. If you exclude the features
> that are extremely common (say, in more than 40% of documents) or
> extremely rare (say, in fewer than 50 documents in a corpus of 100K
> docs), you are left with a meaningful set of features. The current
> K-Means already accepts these parameters.

You mean the Lucene Driver that creates the vectors, right?
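
Wherever it lives, the filter itself is simple; something like this,
given a precomputed term-to-document-frequency map (plain Java,
hypothetical names, untested):

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Keep only the terms whose document frequency falls inside the
    // window, e.g. minDf = 50 and maxDfPercent = 40.0 for a corpus of
    // numDocs = 100000.
    static Set<String> selectFeatures(Map<String, Integer> docFreq,
                                      int numDocs, int minDf,
                                      double maxDfPercent) {
      Set<String> kept = new HashSet<String>();
      int maxDf = (int) (numDocs * maxDfPercent / 100.0);
      for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
        int df = e.getValue();
        if (df >= minDf && df <= maxDf) {
          kept.add(e.getKey());
        }
      }
      return kept;
    }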

> 
> Stemming can be used for feature reduction, but it has a minor issue.
> When you want to find the prominent features of a resulting cluster
> centroid, the stems may not be meaningful. For example, if a
> prominent feature is "beautiful", what you get back is "beauti."
> Ouch.

Right, but this is easily handled via something like Lucene's highlighter 
functionality.  I bet it could be made to work on Mahout's vectors (+ a 
dictionary) fairly easily.
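
Even without the highlighter, a side dictionary built at tokenization
time would do it: record the most frequent surface form seen for each
stem, and use that when printing centroid features.  Another untested
plain-Java sketch:

    import java.util.HashMap;
    import java.util.Map;

    // stem -> (surface form -> count), filled in while tokenizing
    Map<String, Map<String, Integer>> surfaceForms =
        new HashMap<String, Map<String, Integer>>();

    // Call this for every token, before stemming discards the original.
    void record(String stem, String original) {
      Map<String, Integer> forms = surfaceForms.get(stem);
      if (forms == null) {
        forms = new HashMap<String, Integer>();
        surfaceForms.put(stem, forms);
      }
      Integer count = forms.get(original);
      forms.put(original, count == null ? 1 : count + 1);
    }

    // Map "beauti" back to "beautiful" (or whichever form was seen most).
    String displayForm(String stem) {
      Map<String, Integer> forms = surfaceForms.get(stem);
      if (forms == null) {
        return stem;
      }
      String best = stem;
      int bestCount = 0;
      for (Map.Entry<String, Integer> e : forms.entrySet()) {
        if (e.getValue() > bestCount) {
          best = e.getKey();
          bestCount = e.getValue();
        }
      }
      return best;
    }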


> 
> I tried fuzzy K-Means for soft clustering, but I didn't get good
> results. Maybe the corpus was the issue.
> 
> One observation about the clustering process is that it is geared, by
> accident or by design, towards batch processing. There is no support
> for real-time clustering. There needs to be glue that ties all the
> components together to make the process seamless. I suppose someone
> in need of this feature will contribute it to Mahout.

Right.  This should be pretty easy to remedy, though.  One could simply use the 
previous results as the --clusters option, right?
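
In other words, run the job periodically and seed each run with the
centroids from the previous one.  Something along these lines (option
names and paths are from memory and hypothetical; check the driver's
help output):

    # Initial batch run: seed with randomly chosen centroids.
    bin/hadoop jar mahout-job.jar \
        org.apache.mahout.clustering.kmeans.KMeansDriver \
        --input vectors --clusters seeds --output run-0 --maxIter 10

    # Later runs: reuse the last run's centroids instead of reseeding.
    bin/hadoop jar mahout-job.jar \
        org.apache.mahout.clustering.kmeans.KMeansDriver \
        --input new-vectors --clusters run-0/clusters-9 --output run-1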

> 
> Grant,  If I recall more, I will mail it to you.

Great!  Thank you.

> 
> --shashi

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
