Hi Weide,

As a disclaimer, I only know enough to try to help you figure out your first problem. First of all, can you tell us about the dataset you are using? How many points are you clustering?
As a guess without knowing either of these things, part of the reason why your clusters look the same is that you're only clustering around 3 points (k = 3). The job only ran for 2 iterations, so it looks like it's just not moving your cluster centers around at all. Can you try again with a larger k? This may let it run for more iterations, so you should be able to see more changes in the results.

Good luck!
-Kate

On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang <[email protected]> wrote:
> Hi ,
>
> i have used mahout to produce kmeans clustering for my tf-idf result. I use
> the mahout command line to produce the clusters and it seems it successfully
> completes.
>
> $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o
> ./kmeans-clusters -cd 1.0 -k 3 -x 1000
>
> It seems there are two clusters directory generated.(cluster-1 and
> cluster-2) , when i use clusterdump on each of them, it seems to me that
> the clustered top terms are the same. Any idea why ?
>
> Also, how can i see which documents have been assigned to each cluster.
> Right now, i can see the number of documents assigned but not the complete
> list.
>
> Most importantly, for production purposes, i assume it makes sense for
> kmeans always runs on hadoop to generate the clustering file. But how do i
> consume these during serving ? Ideally, serving should have the doc id or
> query passed as a query, and the server should return the top document
> ranked by the score within the same cluster back. How do I do it in code ?
> Any good examples ?
>
> Thanks a lot,
>
> Weide
>
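
To make the guess about iterations a bit more concrete, here is a minimal pure-Python sketch of k-means with a convergence threshold. It is not Mahout code: the function name `kmeans`, the toy points, and the starting centers are all made up for illustration, and I am assuming that `-cd` and `-x` behave like the `convergence_delta` and `max_iterations` arguments here (stop once no center moves more than the threshold, with the iteration count only acting as a cap).

```python
import math


def kmeans(points, centers, convergence_delta, max_iterations):
    """Toy k-means. Returns (centers, assignments, iterations_run).

    Hypothetical helper for illustration only; not Mahout's implementation.
    """
    centers = [list(c) for c in centers]
    assignments = []
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        max_shift = 0.0
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if not members:
                continue  # leave an empty cluster's center where it is
            new_center = [sum(col) / len(members) for col in zip(*members)]
            max_shift = max(max_shift, math.dist(new_center, centers[j]))
            centers[j] = new_center
        # Stop as soon as no center moved more than the threshold.
        if max_shift <= convergence_delta:
            return centers, assignments, iteration
    return centers, assignments, max_iterations


# Made-up 2-D "documents"; the list index stands in for a document id.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (2.0, 0.0),
          (5.0, 0.0), (5.5, 0.0), (6.0, 0.0)]
initial = [(0.0, 0.0), (0.5, 0.0)]  # deliberately poor starting centers

# Same cap of 1000 iterations, two different thresholds.
loose = kmeans(points, initial, convergence_delta=1.0, max_iterations=1000)
tight = kmeans(points, initial, convergence_delta=1e-9, max_iterations=1000)
print("iterations with delta 1.0: ", loose[2])
print("iterations with delta 1e-9:", tight[2])

# Group document ids by the cluster they were assigned to.
members_by_cluster = {}
for doc_id, cluster_id in enumerate(tight[1]):
    members_by_cluster.setdefault(cluster_id, []).append(doc_id)
print(members_by_cluster)
```

Both runs stop after a handful of passes even though the cap is 1000, and the looser threshold stops one pass earlier. If Mahout behaves the same way, a run that ends after 2 iterations may simply mean the centers moved less than the `-cd 1.0` threshold, so trying a smaller `-cd` alongside a larger k might be worth a shot.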
