Hi Kate,

I have 60 rows of data, each with a text description. I generated tf-idf vectors using my own analyzer and passed them to the clustering algorithm. With k=3, it generates clusters-1 and clusters-2 folders. What does each folder mean? How does the clustering process generate them?
Weide

On Mon, Oct 3, 2011 at 8:04 AM, Kate Ericson <[email protected]> wrote:

> Hi Weide,
>
> As a disclaimer, I only know enough to try to help you figure out your
> first problem.
> First of all, can you tell us about the dataset you are using?
> How many points are you clustering?
>
> As a guess without knowing either of these things, part of the reason
> why your clusters look the same is that you're only clustering around
> 3 points. You're only running for 2 iterations, so it looks like it's
> just not moving your cluster centers around at all. Can you try again
> with a larger k?
> This may let it run for more iterations so you should be able to see
> more changes in results.
>
> Good luck!
>
> -Kate
>
> On Sun, Oct 2, 2011 at 9:52 PM, Walter Chang <[email protected]> wrote:
> > Hi,
> >
> > I have used Mahout to produce k-means clustering for my tf-idf result.
> > I use the mahout command line to produce the clusters, and it seems to
> > complete successfully.
> >
> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >
> > It seems two cluster directories are generated (cluster-1 and
> > cluster-2). When I use clusterdump on each of them, the clustered top
> > terms appear to be the same. Any idea why?
> >
> > Also, how can I see which documents have been assigned to each cluster?
> > Right now, I can see the number of documents assigned but not the
> > complete list.
> >
> > Most importantly, for production purposes, I assume it makes sense for
> > kmeans to always run on Hadoop to generate the clustering file. But how
> > do I consume these during serving? Ideally, serving should take a doc id
> > or query as input, and the server should return the top documents in the
> > same cluster, ranked by score. How do I do it in code?
> > Any good examples?
> >
> > Thanks a lot,
> >
> > Weide
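
To illustrate the point about iterations discussed in this thread, here is a minimal sketch of plain k-means (Lloyd's algorithm) in Python that keeps a snapshot of the cluster centers after every iteration. It is not Mahout's implementation. The function name `kmeans_snapshots`, the toy 2-D points, and the initial centers are all hypothetical, and the labelling of each snapshot as `clusters-N` assumes that each such folder holds the cluster state after iteration N. The sketch shows how two consecutive snapshots can come out identical once the centers stop moving, and how a large convergence delta can end the run early.

```python
import math


def kmeans_snapshots(points, centers, convergence_delta, max_iter):
    """Run plain k-means and return the list of centers after each iteration."""
    snapshots = []
    for _ in range(max_iter):
        # Assignment step: put every point in the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)

        # Update step: move each center to the mean of its group.
        # An empty group keeps its old center.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
        snapshots.append(new_centers)

        # Stop once no center moved farther than the convergence delta.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= convergence_delta:
            break
    return snapshots


# Hypothetical toy data: three well-separated pairs of points, k = 3.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
initial = [(0, 0), (10, 10), (20, 0)]

snaps = kmeans_snapshots(points, initial, convergence_delta=0.001, max_iter=1000)
for n, centers in enumerate(snaps, start=1):
    print(f"clusters-{n}: {centers}")

# With a convergence delta as large as 1.0 the run stops even earlier.
coarse = kmeans_snapshots(points, initial, convergence_delta=1.0, max_iter=1000)
print(f"iterations with delta 1.0: {len(coarse)}")
```

On this toy input the run stops after two iterations and both snapshots hold the same centers, which is the kind of situation in which dumping either one would show the same result. Whether that is what happened with the real tf-idf vectors depends on the data, the distance measure, and the convergence delta, so treat this only as an illustration.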
