I am getting more and more ideas as I try to write about scaling Mahout
clustering. I added serialize and deserialize benchmarks for Vectors and
checked the speed of our vector implementations.
Here is the output with Cardinality=1000, Sparsity=1000 (dense), numVectors=100,
loop=100 (hence writing 10K vectors of int-double entries to disk and reading
them back).
Note that these are not disk MB/s but the size of the vectors serialized or
deserialized per second, and the filesystem is a ramdisk.
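
For reference, the Speed and Rate lines in the output can be reproduced from
nCalls and sum. The sketch below is only my reading of the numbers, not the
benchmark code itself: it assumes every entry is counted as a 4-byte int index
plus an 8-byte double (12 bytes), for all three vector types, and that MB means
10^6 bytes. The class and method names are made up for illustration.

```java
public class RateSketch {
    // Assumption: each entry counts as one int index (4 bytes) plus one double value (8 bytes).
    static final int BYTES_PER_ENTRY = Integer.BYTES + Double.BYTES;

    // Vectors processed per second: number of timed calls divided by total time.
    static double speedPerSec(long nCalls, double sumSeconds) {
        return nCalls / sumSeconds;
    }

    // Vector payload per second in MB (10^6 bytes), not disk throughput.
    static double rateMBPerSec(double speedPerSec, int cardinality) {
        return speedPerSec * cardinality * BYTES_PER_ENTRY / 1e6;
    }

    public static void main(String[] args) {
        // DenseVector deserialize figures from the run: nCalls = 10000, sum = 1.30432s.
        double speed = speedPerSec(10000, 1.30432);
        double rate = rateMBPerSec(speed, 1000);
        System.out.printf("Speed = %.2f /sec%n", speed);
        System.out.printf("Rate = %.2f MB/s%n", rate);
    }
}
```

With the DenseVector deserialize figures this comes out close to the reported
7666.83 /sec and 92.0 MB/s, which is why I believe the 12 bytes per entry
assumption holds.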
robinanil$ ls -lh /tmp/*vector
-rwxrwxrwx 1 robinanil 77M May 2 21:25 /tmp/ram/dense-vector
-rwxrwxrwx 1 robinanil 115M May 2 21:25 /tmp/ram/randsparse-vector
-rwxrwxrwx 1 robinanil 115M May 2 21:25 /tmp/ram/seqsparse-vector
BenchMarks   DenseVector       RandSparseVector  SeqSparseVector

Deserialize
nCalls       10000             10000             10000
sum          1.30432s          2.207437s         1.681144s
min          0.045ms           0.152ms           0.114ms
max          74.549ms          8.446ms           3.748ms
mean         0.130432ms        0.220743ms        0.168114ms
stdDev       0.904858ms        0.206271ms        0.087123ms
Speed        7666.83 /sec      4530.1406 /sec    5948.33 /sec
Rate         92.00197 MB/s     54.361687 MB/s    71.37997 MB/s

Serialize
nCalls       10000             10000             10000
sum          3.391168s         6.300965s         5.304873s
min          0.068ms           0.135ms           0.12ms
max          254.635ms         1183.891ms        639.583ms
mean         0.339116ms        0.630096ms        0.530487ms
stdDev       5.558922ms        13.460321ms       8.618806ms
Speed        2948.8364 /sec    1587.0585 /sec    1885.0592 /sec
Rate         35.38604 MB/s     19.044703 MB/s    22.620712 MB/s
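
If anyone wants to get a feel for where statistics like these come from without
pulling in the Mahout classes, here is a rough, self-contained sketch of a
timing loop. It is not the benchmark code used above: a plain double[] stands in
for a vector, each entry is written as an (int index, double value) pair, the
bytes stay in memory instead of going to a file, and all names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class VectorIoTimingSketch {
    // Writes the entry count, then one (int index, double value) pair per entry.
    static byte[] serialize(double[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (int i = 0; i < values.length; i++) {
                out.writeInt(i);
                out.writeDouble(values[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads back what serialize() wrote.
    static double[] deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            double[] values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                int index = in.readInt();
                values[index] = in.readDouble();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Running statistics over per-call durations in milliseconds.
    static final class Stats {
        long nCalls;
        double sum, sumSq, max;
        double min = Double.MAX_VALUE;

        void add(double ms) {
            nCalls++;
            sum += ms;
            sumSq += ms * ms;
            min = Math.min(min, ms);
            max = Math.max(max, ms);
        }

        double mean() { return sum / nCalls; }

        // Population standard deviation; clamped at 0 to absorb rounding error.
        double stdDev() {
            double m = mean();
            return Math.sqrt(Math.max(0.0, sumSq / nCalls - m * m));
        }
    }

    public static void main(String[] args) {
        double[] vector = new double[1000];
        for (int i = 0; i < vector.length; i++) vector[i] = i * 0.5;
        byte[] data = serialize(vector);

        Stats stats = new Stats();
        for (int call = 0; call < 10000; call++) {
            long start = System.nanoTime();
            deserialize(data);
            stats.add((System.nanoTime() - start) / 1e6);
        }
        System.out.printf("nCalls = %d; mean = %fms; min = %fms; max = %fms; stdDev = %fms%n",
                stats.nCalls, stats.mean(), stats.min, stats.max, stats.stdDev());
    }
}
```

A large max next to a small mean, as in the Serialize numbers above, shows up in
a loop like this whenever a few calls stall (JIT warm-up, GC, buffer flushes),
so the stdDev is worth reading alongside the mean.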