I have some questions on benchmarking that I wanted to get others' opinions on.  

This week I have been trying out EMR and KMeans with the goal of doing some 
benchmarking both for the community and for Taming Text.  For starters, I put 
up a file of ~45 MB containing roughly 110K sparse vectors.  I know, pretty 
small, but it is a start.  I tried this out on 2, 4, and 8 instances.  In 
preliminary runs (I haven't done repeats yet to get an average), the time to 
complete the clustering was about the same for all variations.  I'm guessing 
this is due either to Hadoop overhead or to the fact that the file is so small 
that it isn't split, but since I'm a newbie to EMR, I thought I would ask for 
others' opinions.  I have done no Hadoop tuning at this point.  What do people 
think?  Should I be seeing more speedup at this point?
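For what it's worth, the "file isn't split" theory is easy to sanity-check with back-of-the-envelope arithmetic: with Hadoop's default FileInputFormat behavior and a default 64 MB block size (an assumption -- your cluster may be configured differently), a ~45 MB file is a single split, so each KMeans iteration gets exactly one map task no matter how many instances are running.  A minimal sketch:

```python
import math

def num_splits(file_size_mb, block_size_mb=64):
    """Approximate number of input splits Hadoop creates for a single file
    (default FileInputFormat: roughly one split per HDFS block)."""
    return max(1, math.ceil(file_size_mb / block_size_mb))

# ~45 MB input with 64 MB blocks -> 1 split -> 1 map task per iteration,
# so adding instances beyond the first can't speed up the map phase.
print(num_splits(45))

# A ~4.5 GB input would yield ~71 splits, enough to keep 8 instances busy.
print(num_splits(4500))
```

If that's the cause, the larger data set below should show real speedup.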

FWIW, I am in the process right now of copying over all ASF mail archives to S3 
(~80-100GB uncompressed, 8.5 GB compressed --thankfully, Amazon has free 
inbound now) and plan on testing on a larger set once I can get them into 
Mahout format.  If anyone has anything bigger and can share it, let me know.
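For the conversion step, my rough plan is the usual two-stage pipeline: raw text to SequenceFiles, then SequenceFiles to sparse TF-IDF vectors.  A sketch only -- the exact `bin/mahout` flags may differ across versions, and all paths here are hypothetical placeholders:

```shell
# Stage 1: turn a directory of raw mail text into SequenceFiles.
bin/mahout seqdirectory \
  --input /path/to/asf-mail-archives \
  --output /path/to/asf-mail-seqfiles \
  --charset UTF-8

# Stage 2: turn the SequenceFiles into sparse TF-IDF vectors for KMeans.
bin/mahout seq2sparse \
  --input /path/to/asf-mail-seqfiles \
  --output /path/to/asf-mail-vectors \
  --weight tfidf
```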

FTR, I ran:

elastic-mapreduce -j j-3QNGDH7H7EXG8 \
  --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job \
  --main-class org.apache.mahout.clustering.kmeans.KMeansDriver \
  --arg --input --arg s3n://PATH/part-out.vec \
  --arg --clusters --arg s3n://news-vecs/kmeans/clusters-9-11/ \
  --arg -k --arg 10 \
  --arg --output --arg s3n://PATH/out-9-11/ \
  --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure \
  --arg --convergenceDelta --arg 0.001 \
  --arg --overwrite \
  --arg --maxIter --arg 50 \
  --arg --clustering

-Grant
