Here are my two cents.

I experimented with both k-means and fuzzy k-means on data similar to
MovieLens, and what I observed is that the initial cluster seeds matter a
great deal. Initially I used the canopy centroids as initial seeds, but the
results weren't good, and the number of clusters depends on the distance
thresholds we give as input. Next I tried randomly selecting some points from
the input dataset as initial seeds; again, the results were not good. Finally
I chose initial seeds from the input set in such a way that the points are
far from each other, and with those I observed better clustering using fuzzy
k-means. I have not implemented a map-reducible version of this seed
selection yet; I will do so soon and submit a patch.
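
To make the idea concrete, here is a rough sequential sketch of the seed
selection (class and method names are just placeholders, not what the
eventual patch will look like):

import java.util.ArrayList;
import java.util.List;

// Placeholder sketch, not the patch itself: greedy farthest-point seed
// selection. Assumes points is non-empty and k <= points.size().
public class FarthestPointSeeds {

  static List<double[]> selectSeeds(List<double[]> points, int k) {
    List<double[]> seeds = new ArrayList<double[]>();
    seeds.add(points.get(0));  // start from an arbitrary point
    while (seeds.size() < k) {
      double[] farthest = null;
      double best = -1.0;
      // pick the point whose nearest chosen seed is farthest away
      for (double[] p : points) {
        double nearest = Double.MAX_VALUE;
        for (double[] s : seeds) {
          nearest = Math.min(nearest, distance(p, s));
        }
        if (nearest > best) {
          best = nearest;
          farthest = p;
        }
      }
      seeds.add(farthest);
    }
    return seeds;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}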

Also, regarding canopy: I found that the canopy algorithm as defined in
theory is different from what we have in Mahout. Perhaps someone can clarify
that; for reference, my reading of the textbook definition is sketched below.
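
This is canopy formation as I read it in the original paper (McCallum, Nigam
and Ungar), sketched with placeholder names; the paper also allows a cheap
approximate distance at this stage, which I've left out. Whether Mahout
matches this step for step is exactly my question:

import java.util.ArrayList;
import java.util.List;

// Sketch of canopy formation as the paper defines it; requires t1 > t2.
public class TextbookCanopy {

  static List<List<double[]>> canopies(List<double[]> points,
                                       double t1, double t2) {
    List<double[]> candidates = new ArrayList<double[]>(points);
    List<List<double[]>> result = new ArrayList<List<double[]>>();
    while (!candidates.isEmpty()) {
      double[] center = candidates.remove(0);  // pick an arbitrary point
      List<double[]> canopy = new ArrayList<double[]>();
      canopy.add(center);
      List<double[]> stillCandidates = new ArrayList<double[]>();
      for (double[] p : candidates) {
        double d = distance(center, p);
        if (d < t1) {
          canopy.add(p);           // within t1: joins this canopy
        }
        if (d >= t2) {
          stillCandidates.add(p);  // outside t2: may seed or join others
        }
      }
      candidates = stillCandidates;
      result.add(canopy);
    }
    return result;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}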

Thanks
Pallavi
 
-----Original Message-----
From: Shashikant Kore [mailto:[email protected]] 
Sent: Saturday, January 02, 2010 12:45 PM
To: [email protected]
Subject: Re: Clustering techniques, tips and tricks

On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[email protected]> wrote:
>
> The other thing I'm interested in is people's real world feedback on using 
> clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, 
> stemming, etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't 
> particularly care if it is Mahout specific (for instance, part of the 
> chapter is about search result clustering using Carrot2 and so Mahout 
> isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means on 250k 
docs (10 iterations, 100 seeds) in less than 30 minutes on a single host.

Using vector normalization, such as the L2 norm, helped quite a bit; thanks
to Ted for this suggestion. In text clustering you have lots of small
documents, which results in very sparse vectors (around 100K features in
total, with each vector having only about 200 non-zero entries). Vanilla
TF-IDF weights don't work as nicely.
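
For anyone curious, the normalization itself is trivial; here is a minimal
sketch on a sparse featureId-to-weight map (nothing Mahout-specific, this
just shows the arithmetic):

import java.util.HashMap;
import java.util.Map;

// Minimal sketch: L2 (unit-length) normalization of a sparse TF-IDF
// vector, represented as featureId -> weight.
public class L2Normalize {

  static Map<Integer, Double> normalize(Map<Integer, Double> tfidf) {
    double sumSquares = 0.0;
    for (double w : tfidf.values()) {
      sumSquares += w * w;
    }
    double norm = Math.sqrt(sumSquares);
    if (norm == 0.0) {
      return new HashMap<Integer, Double>(tfidf);  // empty/zero vector
    }
    Map<Integer, Double> unit = new HashMap<Integer, Double>();
    for (Map.Entry<Integer, Double> e : tfidf.entrySet()) {
      unit.put(e.getKey(), e.getValue() / norm);
    }
    return unit;
  }
}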

Even if we don't do explicit stop word removal, document-frequency thresholds
do the job in a better fashion: if you exclude features that are extremely
common (say, in more than 40% of documents) or extremely rare (say, in fewer
than 50 documents in a corpus of 100K docs), you end up with a meaningful set
of features. The current k-means implementation already accepts these
parameters.
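
Roughly, the pruning amounts to this (documents represented as sets of
distinct terms; minDf and maxDfPercent are illustrative names, not the
actual option names):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of document-frequency pruning. Each document is represented
// by its set of distinct terms; minDf and maxDfPercent are illustrative
// parameter names.
public class DfPruning {

  // Count, for each term, how many documents it appears in.
  static Map<String, Integer> documentFrequency(List<Set<String>> docs) {
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (Set<String> doc : docs) {
      for (String term : doc) {
        Integer c = df.get(term);
        df.put(term, c == null ? 1 : c + 1);
      }
    }
    return df;
  }

  // Keep features seen in at least minDf docs and in at most
  // maxDfPercent of the corpus.
  static Set<String> keepFeatures(Map<String, Integer> df, int numDocs,
                                  int minDf, double maxDfPercent) {
    int maxDf = (int) (numDocs * maxDfPercent / 100.0);
    Set<String> kept = new HashSet<String>();
    for (Map.Entry<String, Integer> e : df.entrySet()) {
      if (e.getValue() >= minDf && e.getValue() <= maxDf) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }
}

With minDf = 50 and maxDfPercent = 40 on a 100K-doc corpus, that reproduces
the thresholds above.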

Stemming can be used for feature reduction, but it has a minor issue: when
you look at the prominent features of a resulting cluster centroid, the
feature may not be meaningful. For example, if a prominent feature is
"beautiful", what you get back is "beauti." Ouch.

I tried fuzzy k-means for soft clustering, but I didn't get good results.
Maybe the corpus was the issue.

One observation about the clustering process: by accident or by design, it is
geared towards batch processing. There is no support for real-time
clustering, and there needs to be glue that ties all the components together
to make the process seamless. I suppose someone who needs this feature will
contribute it to Mahout.

Grant, if I recall more, I will mail it to you.

--shashi
