Re: Why are clustering emails not clustering similar stuff?
How are you verifying your vectorization? What weighting do you use for words? Have you tested the distance between the notifications and other documents? Are near-duplicate documents close to each other?

On Jun 6, 2013, at 7:47, Jesvin Jose frank.einst...@gmail.com wrote:
> I tried to cluster 1000 emails of a person using Kmeans, but clusters are not forming okay. [...] Why are notifications for all 5 people not getting clustered together?
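One way to run the distance check asked about above is to vectorize a handful of messages by hand and compare them pairwise. The sketch below is plain Python, not Mahout: the toy messages, the whitespace tokenizer, and the textbook tf * log(N/df) weighting are all assumptions for illustration, and Mahout's own analyzer and TF-IDF weighting will give different numbers.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical messages standing in for real emails.
docs = [
    "james doe commented on your status",
    "james doe commented on your photo",
    "mary roe commented on your status",
    "quarterly budget meeting moved to friday",
]
vecs = tfidf_vectors(docs)

same_person = cosine_distance(vecs[0], vecs[1])   # two notifications, same person
other_person = cosine_distance(vecs[0], vecs[2])  # same template, different person
unrelated = cosine_distance(vecs[0], vecs[3])     # notification vs. unrelated mail

print("same person: ", round(same_person, 3))
print("other person:", round(other_person, 3))
print("unrelated:   ", round(unrelated, 3))
```

If notifications about different people come out nearly as far apart as unrelated mail, the name terms are probably dominating the vectors, which would be consistent with one cluster per person.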
Re: Why are clustering emails not clustering similar stuff?
My 2 cents: it is always tricky to get clustering right, especially k-means, and especially on a sparse dataset, which these word vectors tend to be (any given document contains only a small subset of the words in the whole corpus). If all you are looking for is grouping *documents* together, then I think topic modeling might give you better results.

On Wed, Jun 5, 2013 at 10:47 PM, Jesvin Jose frank.einst...@gmail.com wrote:
> I tried to cluster 1000 emails of a person using Kmeans, but clusters are not forming okay. [...] Why are notifications for all 5 people not getting clustered together?

--
Mohit
When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates
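The sparsity point can be made concrete with a quick count. The following is a rough sketch in plain Python on four made-up messages (hypothetical data, whitespace tokenization assumed); it measures how much of the corpus vocabulary each message uses and how many message pairs share no words at all.

```python
from itertools import combinations

# Hypothetical messages standing in for real emails.
docs = [
    "james doe commented on your status",
    "your invoice for may is attached",
    "lunch on friday at noon",
    "build failed for branch master",
]

token_sets = [set(doc.lower().split()) for doc in docs]
vocabulary = set().union(*token_sets)

# Fraction of the corpus vocabulary that appears in each message.
density = [len(tokens) / len(vocabulary) for tokens in token_sets]

# Pairs of messages with no term in common.
disjoint_pairs = [
    (i, j)
    for (i, a), (j, b) in combinations(enumerate(token_sets), 2)
    if not (a & b)
]

print("vocabulary size:", len(vocabulary))
print("density per message:", [round(d, 2) for d in density])
print("pairs sharing no terms:", disjoint_pairs)
```

Pairs that share no terms have a cosine distance of exactly 1, so a distance-based method such as k-means sees all of them as equally far apart, whatever they are about.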
Why are clustering emails not clustering similar stuff?
I tried to cluster 1000 emails of a person using Kmeans, but clusters are not forming okay. For example, if Facebook sends notifications about James Doe and 5 other people, I get 5 clusters like:

:VL-858{n=7
Top Terms:
doe = 10.066998481750488
james= 10.066998481750488

Why are notifications for all 5 people not getting clustered together?

I used variants of the commands used in Mahout in Action (Sean Owen et al.), as follows.

Vectorizing uses lowercasing, stop words and a length filter:

bin/hadoop jar /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.driver.MahoutDriver seq2sparse -i mymail-seqfiles -o mymail-vectors-bigram -ow -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

It's for 1000 emails, but I tried 100 clusters. If I tried 50, I still get similar results but half the number of emails get into any cluster.

bin/hadoop jar /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar org.apache.mahout.driver.MahoutDriver kmeans -i mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters -o mymail-kmeans-clusters-from-bigrams -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 100 -x 20 -cl

--
We don't beat the reaper by living longer. We beat the reaper by living well and living fully. The reaper will come for all of us. Question is, what do we do between the time we are born and the time he shows up? -Randy Pausch
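One plausible reading of the "Top Terms" output, not verified against the real data, is that a person's name appears in every message of a per-person cluster while the rest of the notification text varies, so the name terms end up with the largest centroid weights. The sketch below illustrates that effect in plain Python on a hypothetical corpus, using the textbook tf * log(N/df) weighting rather than Mahout's own formula; the messages and helper names are assumptions.

```python
import math
from collections import Counter

# Hypothetical notifications: two people, three notification templates.
docs = [
    "james doe commented on your status",
    "james doe likes your photo",
    "james doe tagged you in a post",
    "mary roe commented on your status",
    "mary roe likes your photo",
    "mary roe tagged you in a post",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Sparse TF-IDF vector for one tokenized document."""
    tf = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

def centroid(vectors):
    """Component-wise mean of sparse vectors (missing terms count as 0)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

# Centroid of the three messages about one person.
james_cluster = centroid([tfidf(tokens) for tokens in tokenized[:3]])

# Two heaviest terms, ties broken alphabetically.
top_terms = sorted(james_cluster.items(), key=lambda item: (-item[1], item[0]))[:2]
for term, weight in top_terms:
    print(term, "=", weight)
```

In this toy setup the two name terms tie for the top weight, much like the equal `doe` and `james` weights in the cluster dump. With `-k 100` on 1000 emails, clusters this small and this name-driven would not be surprising.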