Re: Why are clustering emails not clustering similar stuff?

2013-06-08 Thread Ted Dunning
How are you verifying your vectorization?

What do you use for weighting of words?

Have you tested the distance between the notifications and other documents?
Are near-duplicate documents close to each other?
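
One way to run that sanity check outside Mahout is sketched below. It is a toy illustration, not what seq2sparse does: the tokenizer is a plain whitespace split, the weighting is the textbook tf * log(N/df) rather than Mahout's exact formula, and the three sample documents are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Textbook weighting (an assumption, not Mahout's exact formula).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample documents: two notifications and one unrelated email.
docs = [
    "james doe commented on your photo",
    "mary roe commented on your photo",
    "quarterly budget meeting moved to friday",
]
vecs = tfidf_vectors(docs)
d_notifications = cosine_distance(vecs[0], vecs[1])
d_unrelated = cosine_distance(vecs[0], vecs[2])
print("notification vs notification:", d_notifications)
print("notification vs unrelated:   ", d_unrelated)
```

If two notifications that differ only in the person's name come out nearly as far apart as unrelated emails, the weighting is a more likely culprit than k-means itself.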




Re: Why are clustering emails not clustering similar stuff?

2013-06-06 Thread Mohit Singh
My 2 cents.
It is always tricky to get clustering right, especially with k-means, and
even more so on sparse data, which these word vectors tend to be (any given
document contains only a small subset of the words in the whole corpus).

If all you are looking for is grouping documents together, then I think
topic modeling might give you better results.
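
To make the sparsity point concrete, here is a small sketch that measures what fraction of the corpus vocabulary each document actually uses. The function name `vocabulary_coverage` and the sample documents are made up for illustration, and the tokenizer is a plain whitespace split rather than a Lucene analyzer.

```python
def vocabulary_coverage(docs):
    """For each document, return the fraction of the corpus vocabulary it contains."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    vocabulary = set().union(*tokenized)
    if not vocabulary:
        return [0.0] * len(tokenized)
    return [len(tokens) / len(vocabulary) for tokens in tokenized]

# Hypothetical sample documents.
sample_docs = [
    "james doe commented on your photo",
    "mary roe commented on your photo",
    "quarterly budget meeting moved to friday",
]
coverage = vocabulary_coverage(sample_docs)
for doc, fraction in zip(sample_docs, coverage):
    print(f"{fraction:.2f}  {doc}")
```

On a real mailbox the vocabulary runs to thousands of terms while each email uses a few dozen, so the fractions are far smaller than in this toy corpus and most vector entries are zero.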



-- 
Mohit



Why are clustering emails not clustering similar stuff?

2013-06-05 Thread Jesvin Jose
I tried to cluster 1000 of one person's emails using k-means, but the
clusters are not forming well. For example, if Facebook sends notifications
about James Doe and 5 other people, I get 5 clusters like:

:VL-858{n=7
Top Terms:
doe   =  10.066998481750488
james =  10.066998481750488

Why are the notifications for all 5 people not getting clustered together? I
used variants of the commands from Mahout in Action (Sean Owen et al.), as
follows:

Vectorizing uses lowercasing, stop words, and a length filter:

bin/hadoop jar \
  /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar \
  org.apache.mahout.driver.MahoutDriver seq2sparse \
  -i mymail-seqfiles -o mymail-vectors-bigram -ow \
  -a mia.clustering.ch10.MyAnalyzer -chunk 200 -wt tfidf \
  -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq

There are only 1000 emails, but I tried 100 clusters. When I tried 50, I
still got similar results, but only half as many emails ended up in any
cluster.

bin/hadoop jar \
  /home/jesvin/dev/hadoop/mahout-distribution-0.7/examples/target/mahout-examples-0.7-job.jar \
  org.apache.mahout.driver.MahoutDriver kmeans \
  -i mymail-vectors-bigram/tfidf-vectors -c mymail-initial-clusters \
  -o mymail-kmeans-clusters-from-bigrams \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 -k 100 -x 20 -cl
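
For readers unfamiliar with what this step does, below is a minimal single-machine sketch of k-means with cosine distance. It is not Mahout's implementation: the function names are made up, the defaults `max_iterations=20` and `convergence_delta=0.1` only mirror my understanding of `-x 20` and `-cd 0.1`, and the initial centroids are passed in by hand instead of being sampled from the input.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two dense vectors; 1.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def assign(points, centroids):
    """Index of the nearest centroid (by cosine distance) for each point."""
    return [
        min(range(len(centroids)), key=lambda c: cosine_distance(p, centroids[c]))
        for p in points
    ]

def kmeans(points, centroids, max_iterations=20, convergence_delta=0.1):
    for _ in range(max_iterations):
        assignments = assign(points, centroids)
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                # New centroid is the component-wise mean of its members.
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)
        moved = max(cosine_distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:
            break
    return assign(points, centroids), centroids

# Hypothetical toy data: two obvious directions in 2-D.
points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels, final_centroids = kmeans(points, centroids=[[1.0, 0.0], [0.0, 1.0]])
print(labels)
```

Because points are only ever compared to centroids by cosine distance, any term with a very large weight in a handful of documents can pull those documents into a cluster of their own.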
