I have some questions on benchmarking that I wanted to get others opinions on.
This week I have been trying out EMR and KMeans with the goal of doing some benchmarking both for the community and for Taming Text. For starters, I put up a file of ~45 MB containing roughly 110K sparse vectors. I know, pretty small, but it is a start. I tried this out on 2, 4 and 8 instances. The time to complete the clustering for all variations in preliminary runs (I haven't done repeats yet to get an average) was about the same. I'm guessing, this is due to either the overhead of Hadoop or possibly the fact that the file is so small that it isn't split, but, since I'm a newbie to EMR, I thought I would ask what others opinions are. I have done no Hadoop tuning at this point. What do people think? Should I be seeing more speedup at this point? FWIW, I am in the process right now of copying over all ASF mail archives to S3 (~80-100GB uncompressed, 8.5 GB compressed --thankfully, Amazon has free inbound now) and plan on testing on a larger set once I can get them into Mahout format. If anyone has anything bigger and can share it, let me know. FTR, I ran: elastic-mapreduce -j j-3QNGDH7H7EXG8 --jar s3n://news-vecs/mahout-core-0.4-SNAPSHOT.job --main-class org.apache.mahout.clustering.kmeans.KMeansDriver --arg --input --arg s3n://PATH/part-out.vec --arg --clusters --arg s3n://news-vecs/kmeans/clusters-9-11/ --arg -k --arg 10 --arg --output --arg s3n://PATH/out-9-11/ --arg --distanceMeasure --arg org.apache.mahout.common.distance.CosineDistanceMeasure --arg --convergenceDelta --arg 0.001 --arg --overwrite --arg --maxIter --arg 50 --arg --clustering -Grant
