On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[email protected]> wrote:
>
> The other thing I'm interested in is people's real world feedback on using
> clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal,
> stemming, etc.)? What algorithms worked for you?
> What didn't work? Any and all insight is welcome and I don't particularly
> care if it is Mahout specific (for instance, part of
> the chapter is about search result clustering using Carrot2 and so Mahout
> isn't applicable)
>
Let me start by saying Mahout works great for us. We can run k-means on 250K docs (10 iterations, 100 seeds) in less than 30 minutes on a single host.

Vector normalization, such as the L2 norm, helped quite a bit. Thanks to Ted for this suggestion. In text clustering, you have lots of small documents. This results in very sparse vectors (a total of 100K features, with each vector having 200 features). Vanilla TFIDF weights don't work as nicely on such vectors.

Even if we don't do explicit stop word removal, document-count thresholds do the job in a better fashion. If you exclude the features which are extremely common (say, in more than 40% of the documents) or extremely rare (say, in fewer than 50 documents in a corpus of 100K docs), you are left with a meaningful set of features. The current K-Means already accepts these parameters.

Stemming can be used for feature reduction, but it has a minor issue. When you want to find the prominent features of a resulting cluster centroid, a stemmed feature may not be meaningful. For example, if a prominent feature is "beautiful", you will get back "beauti". Ouch.

I tried fuzzy K-Means for soft clustering, but I didn't get good results. Maybe the corpus was the issue.

One observation about the clustering process is that it is geared, by accident or by design, towards batch processing. There is no support for real-time clustering. There needs to be glue which ties all the components together to make the process seamless. I suppose someone in need of this feature will contribute it to Mahout.

Grant, if I recall more, I will mail it to you.

--shashi
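
For illustration, the document-frequency thresholds and L2 normalization described above can be sketched in plain Python. This is not Mahout code and does not use Mahout's parameter names: the function names, the `min_df` and `max_df_ratio` arguments, and the toy corpus are all assumptions made up for this example, and the TF-IDF formula is the simplest textbook variant.

```python
import math
from collections import Counter


def prune_by_document_frequency(docs, min_df, max_df_ratio):
    """Return the terms that are neither too rare nor too common.

    docs is a list of token lists. A term is kept if it occurs in at
    least min_df documents and in at most max_df_ratio of all documents.
    (Hypothetical helper, not a Mahout API.)
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {t for t, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}


def tfidf_l2(docs, vocab):
    """Build sparse TF-IDF vectors over vocab, scaled to unit L2 length."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & vocab)
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in vocab)
        vec = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:  # leave all-zero or empty vectors untouched
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors


# Toy corpus: "the" is in every document, several words occur only once.
docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
]
vocab = prune_by_document_frequency(docs, min_df=2, max_df_ratio=0.5)
vectors = tfidf_l2(docs, vocab)
```

On this toy corpus the pruning drops "the" without any stop word list, and it also drops the words that occur only once. Note that the last document ends up with an empty vector, which is a side effect to watch for when the thresholds are aggressive and the documents are short.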
