Hi,

I have another suggestion that I am working on; it's not yet complete, but I 
should be able to submit a patch in the coming weeks.

The idea is to have two distances, one each for:
A) Across clusters
B) Within a cluster

These distances govern whether an input data point is included in or excluded 
from a cluster.

In iteration 1, I allow infinite distance for both of these thresholds and run 
the regular clustering logic. At the end of it, I calculate the min and max for 
both of these distances.

For the remaining iterations, do the following:

A) Find the lowest-distance cluster for the input point. If the point's 
distance exceeds the within-cluster distance computed previously, ignore that 
cluster and create a new cluster with this point. If it does not violate the 
older distance, keep the point in the cluster with the following probability:
        1. 1, if the distance of the input point is less than the min distance.
        2. Otherwise, (max_distance - distance of the input from the center) / max_distance.

B) If there are two clusters whose distance from each other is less than the 
minimum across-cluster distance, merge them and recalculate the within-cluster 
and across-cluster distances.
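To make this concrete, here is a rough single-machine sketch in Java of steps 
A and B. All class and method names are made up for illustration (this is not 
the patch itself), and I use plain Euclidean distance over double[] points:

import java.util.List;
import java.util.Random;

class AdaptiveThresholds {
    double minWithin, maxWithin; // min/max within-cluster distance from iteration 1
    double minAcross;            // minimum across-cluster distance from iteration 1
    Random rnd = new Random();

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Step A: assign a point to its nearest cluster, possibly spawning a new one.
    void assign(double[] point, List<double[]> centers) {
        int best = -1;
        double bestD = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = dist(point, centers.get(i));
            if (d < bestD) { bestD = d; best = i; }
        }
        if (bestD > maxWithin) {
            centers.add(point.clone());  // violates the old within-cluster max
        } else if (bestD < minWithin                        // keep with probability 1
                || rnd.nextDouble() < (maxWithin - bestD) / maxWithin) {
            // keep: fold 'point' into cluster 'best' (centroid update elided)
        }
    }

    // Step B: merge any two clusters closer than the minimum across-cluster distance.
    void mergeClose(List<double[]> centers) {
        for (int i = 0; i < centers.size(); i++)
            for (int j = centers.size() - 1; j > i; j--)
                if (dist(centers.get(i), centers.get(j)) < minAcross) {
                    double[] a = centers.get(i), b = centers.get(j);
                    for (int k = 0; k < a.length; k++) a[k] = (a[k] + b[k]) / 2;
                    centers.remove(j);   // midpoint becomes the merged center
                }
    }
}

After each pass, the min/max within-cluster and across-cluster distances would 
be recomputed as described above.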

I have observed that this has helped me get good clusters. Currently I have a 
single-machine version of the code with the above calculation, and I am working 
on getting this into MapReduce; I should be able to submit a patch very soon.

Let me know if you think this will generally work.

--Thanks and Regards
Vaijanath 

-----Original Message-----
From: Palleti, Pallavi [mailto:[email protected]] 
Sent: Monday, January 04, 2010 5:34 PM
To: [email protected]
Subject: RE: Clustering techniques, tips and tricks

Here are my two cents.

I experimented with both k-means and fuzzy k-means on data similar to 
MovieLens, and what I observed is that the initial cluster seeds matter a great 
deal. Initially, I used canopy clustering seeds as the initial seeds, but the 
results weren't good, and the number of clusters depends on the distance 
thresholds we give as input. Later, I tried randomly selecting some points from 
the input dataset as initial seeds. Again, the results were not good. Now I 
choose initial seeds from the input set in such a way that the points are far 
from each other, and I have observed better clustering using fuzzy k-means. I 
have not yet implemented a map-reducible version of this seed selection; I will 
implement one soon and submit a patch.
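In case it helps to see the idea: one simple way to pick seeds that are far 
from each other is farthest-first traversal, where the first seed is random and 
each subsequent seed is the point farthest from its nearest already-chosen 
seed. A rough Java sketch (illustrative only, not the exact code I used):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class FarthestFirstSeeds {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<double[]> select(List<double[]> points, int k) {
        List<double[]> seeds = new ArrayList<double[]>();
        seeds.add(points.get(new Random().nextInt(points.size())));
        while (seeds.size() < k) {
            double[] farthest = null;
            double farthestD = -1;
            for (double[] p : points) {            // distance to the nearest seed
                double nearest = Double.POSITIVE_INFINITY;
                for (double[] s : seeds) nearest = Math.min(nearest, dist(p, s));
                if (nearest > farthestD) { farthestD = nearest; farthest = p; }
            }
            seeds.add(farthest);                   // the point worst-covered so far
        }
        return seeds;
    }
}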

Also, regarding canopy: I found that the way canopy clustering is defined in 
theory differs from what we have in Mahout. Perhaps someone can clarify that.

Thanks
Pallavi
 
-----Original Message-----
From: Shashikant Kore [mailto:[email protected]]
Sent: Saturday, January 02, 2010 12:45 PM
To: [email protected]
Subject: Re: Clustering techniques, tips and tricks

On Thu, Dec 31, 2009 at 10:40 PM, Grant Ingersoll <[email protected]> wrote:
>
> The other thing I'm interested in is people's real world feedback on using 
> clustering to solve their text related problems.
> For instance, what type of feature reduction did you do (stopword removal, 
> stemming, etc.)?  What algorithms worked for you?
> What didn't work?  Any and all insight is welcome and I don't 
> particularly care if it is Mahout specific (for instance, part of the 
> chapter is about search result clustering using Carrot2 and so Mahout 
> isn't applicable)
>

Let me start by saying Mahout works great for us. We can run k-means on 250k 
docs (10 iterations, 100 seeds) in less than 30 minutes on a single host.

Using vector normalization, like the L2 norm, helped quite a bit; thanks to Ted 
for this suggestion. In text clustering you have lots of small documents, which 
results in very sparse vectors (a total of 100K features, with each vector 
having about 200 features). Using vanilla TF-IDF weights doesn't work as nicely.
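For anyone who wants to see what L2 normalization boils down to: divide every 
weight by the vector's Euclidean length, so all document vectors end up with 
unit length and dot products behave like cosine similarity. A tiny illustrative 
sketch over a sparse term-id -> weight map (a stand-in, not Mahout's Vector API):

import java.util.HashMap;
import java.util.Map;

class L2Normalize {
    static Map<Integer, Double> normalize(Map<Integer, Double> tfidf) {
        double sumSq = 0;
        for (double w : tfidf.values()) sumSq += w * w;
        double norm = Math.sqrt(sumSq);
        Map<Integer, Double> unit = new HashMap<Integer, Double>();
        if (norm == 0) return unit;                       // empty document
        for (Map.Entry<Integer, Double> e : tfidf.entrySet())
            unit.put(e.getKey(), e.getValue() / norm);    // w_i / ||v||_2
        return unit;
    }
}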

Even if we don't do explicit stop word removal, the document-count thresholds 
do that in a better fashion. If you exclude the features which are extremely 
common (say, in more than 40% of documents) or extremely rare (say, in fewer 
than 50 documents in a corpus of 100K docs), you end up with a meaningful set 
of features. The current k-means implementation already accepts these parameters.
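In code, that document-frequency pruning amounts to something like the sketch 
below (the 40% and 50-document cutoffs are the example numbers above; the 
term -> document-count map is just an illustrative representation):

import java.util.HashMap;
import java.util.Map;

class DfPruning {
    static Map<String, Integer> prune(Map<String, Integer> docFreq, int numDocs) {
        int maxDf = (int) (0.40 * numDocs);   // drop "extremely common" terms
        int minDf = 50;                       // drop "extremely rare" terms
        Map<String, Integer> kept = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet())
            if (e.getValue() >= minDf && e.getValue() <= maxDf)
                kept.put(e.getKey(), e.getValue());
        return kept;                          // the surviving feature set
    }
}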

Stemming can be used for feature reduction, but it has a minor issue.
When you want to find out the prominent features of the resulting cluster 
centroid, the features may not be meaningful. For example, if a prominent 
feature is "beautiful", when you get it back, you will get "beauti." Ouch.

I tried fuzzy k-means for soft clustering, but I didn't get good results. Maybe 
the corpus was the issue.

One observation about the clustering process is that it is geared, by accident 
or by design, towards batch processing. There is no support for real-time 
clustering, and there needs to be glue which ties all the components together 
to make the process seamless. I suppose someone in need of this feature will 
contribute it to Mahout.

Grant, if I recall more, I will mail it to you.

--shashi
