Streaming KMeans runs with a single reducer, and that reducer runs Ball KMeans, 
which is why you are seeing such slow performance. 

How did you come up with -km 63000?

Given that you would like 10000 clusters (= k) and have 2,000,000 data points 
(= n), k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest 
integer), and that should be the value of -km in your case (km = k * ln(n)).
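
As a quick check of the arithmetic above, here is a minimal sketch in Python. 
The helper name estimate_km is just for illustration and is not part of 
Mahout; it only evaluates km = k * ln(n) and rounds to the nearest integer.

```python
import math


def estimate_km(k: int, n: int) -> int:
    """Return k * ln(n), rounded to the nearest integer."""
    return round(k * math.log(n))  # math.log is the natural logarithm


# Values from this thread: k = 10000 clusters, n = 2,000,000 data points.
km = estimate_km(10_000, 2_000_000)
print(km)
```

The printed value is what would replace 63000 in the -km argument of the 
command quoted below.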

I'm not sure that will fix your reduce being stuck at 76% forever, but it's 
definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I 
still think there's an issue with the -rskm option in Mahout 0.9 and in trunk 
today when executing in MR mode, but it definitely works in the non-MR (-xm 
sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma <xiaojun...@fujixerox.co.jp> 
wrote:
 
I am using Mahout 0.8 embedded in CDH 5.0.0 provided by Cloudera and found 
that the reduce phase of Mahout streamingkmeans is extremely slow.

For example:
With a dataset of 2000000 objects, 128 variables, I would like to get 10000 
clusters.

The command executed is as follows.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours.
However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned the performance of Hadoop as follows. 
map task jvm = 3g 
reduce task jvm = 10g 
io.sort.mb = 512 
io.sort.factor = 50 
mapred.reduce.parallel.copies = 10 
mapred.inmem.merge.threshold = 0 

I tried to assign enough memory but the reduce is still very very very slow.


Why does the reduce take so much time?
And what can I do to speed up the job?

I wonder if it would help to set -rskm to true.
The -rskm option has a bug in Mahout 0.8, so I cannot try it. 




Yours Sincerely,
Sylvia Ma
