I tried to cluster 1000 of one person's emails using k-means, but the clusters are not forming well. For example, if Facebook sends notifications about James Doe and 5 other people, I get 5 clusters like:
:VL-858{n=7
    Top Terms:
        doe => 10.066998481750488
        james => 10.066998481750488

Why are the notifications for all 5 people not getting clustered together?

I used variants of the commands from Mahout in Action (Sean Owen et al.), as follows.

Vectorizing uses lowercasing, stop words, and a length filter:

    bin/hadoop jar /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar \
        org.apache.mahout.driver.MahoutDriver seq2sparse \
        -i mymail-seqfiles -o mymail-vectors-bigram -ow \
        -a mia.clustering.ch10.MyAnalyzer \
        -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

This is for 1000 emails, and I asked for 100 clusters. If I try 50, I still get similar results, but only half as many emails "get into" any cluster.

    bin/hadoop jar /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar \
        org.apache.mahout.driver.MahoutDriver kmeans \
        -i mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters \
        -o mymail-kmeans-clusters-from-bigrams \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -cd 0.1 -k 100 -x 20 -cl

--
We don't beat the reaper by living longer. We beat the reaper by living well and living fully. The reaper will come for all of us. Question is, what do we do between the time we are born and the time he shows up? -Randy Pausch