I am getting more and more ideas as I try to write about scaling Mahout clustering. I added serialize and deserialize benchmarks for Vectors and checked the speed of our vectors.
Here is the output with Cardinality=1000, Sparsity=1000 (dense), numVectors=100, loop=100 (hence writing 10K vectors of int-double entries to disk and reading them back). Note that the rates are not disk MB/s but the size of the vectors serialized or deserialized per second, and the filesystem is a ramdisk.

  robinanil$ ls -lh /tmp/*vector
  -rwxrwxrwx 1 robinanil  77M May 2 21:25 /tmp/ram/dense-vector
  -rwxrwxrwx 1 robinanil 115M May 2 21:25 /tmp/ram/randsparse-vector
  -rwxrwxrwx 1 robinanil 115M May 2 21:25 /tmp/ram/seqsparse-vector

  BenchMarks     DenseVector      RandSparseVector   SeqSparseVector

  Deserialize
    nCalls       10000            10000              10000
    sum          1.30432s         2.207437s          1.681144s
    min          0.045ms          0.152ms            0.114ms
    max          74.549ms         8.446ms            3.748ms
    mean         0.130432ms       0.220743ms         0.168114ms
    stdDev       0.904858ms       0.206271ms         0.087123ms
    Speed        7666.83 /sec     4530.1406 /sec     5948.33 /sec
    Rate         92.00197 MB/s    54.361687 MB/s     71.37997 MB/s

  Serialize
    nCalls       10000            10000              10000
    sum          3.391168s        6.300965s          5.304873s
    min          0.068ms          0.135ms            0.12ms
    max          254.635ms        1183.891ms         639.583ms
    mean         0.339116ms       0.630096ms         0.530487ms
    stdDev       5.558922ms       13.460321ms        8.618806ms
    Speed        2948.8364 /sec   1587.0585 /sec     1885.0592 /sec
    Rate         35.38604 MB/s    19.044703 MB/s     22.620712 MB/s
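For anyone who wants to sanity-check the Speed and Rate figures: they appear to follow from the other numbers. Speed matches nCalls / sum, and Rate matches Speed times 1000 entries times 12 bytes (a 4-byte int index plus an 8-byte double), counted in units of 10^6 bytes. This is inferred from the reported numbers, not taken from the benchmark source, so the 12-byte entry size and the function names below are assumptions. The DenseVector rate also fits the 12-byte formula, even though its 77M file is closer to 8 bytes per entry.

```python
# Assumed constants, inferred from the reported figures rather than
# from the benchmark code itself.
CARDINALITY = 1000        # entries per vector (Cardinality=1000)
BYTES_PER_ENTRY = 4 + 8   # assumed: int index + double value


def speed_per_sec(n_calls, total_seconds):
    """Vectors processed per second."""
    return n_calls / total_seconds


def rate_mb_per_sec(speed, cardinality=CARDINALITY,
                    bytes_per_entry=BYTES_PER_ENTRY):
    """Vector data processed per second, in units of 10^6 bytes."""
    return speed * cardinality * bytes_per_entry / 1e6


# "sum" values copied from the Deserialize results above.
deserialize_sums = {
    "DenseVector": 1.30432,
    "RandSparseVector": 2.207437,
    "SeqSparseVector": 1.681144,
}

for name, total in deserialize_sums.items():
    speed = speed_per_sec(10000, total)
    print(f"{name}: {speed:.2f} /sec, {rate_mb_per_sec(speed):.2f} MB/s")
```

The same two functions can be applied to the Serialize sums to compare against the reported values.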