Re: reduce is too slow in StreamingKmeans

Suneel Marthi Mon, 17 Mar 2014 19:27:05 -0700

-rskm option works only in sequential mode and fails in MR. That's still an 
issue in present trunk that needs to be fixed.
That should explain why Streaming KMeans with -rskm works only in sequential 
mode for you.

Mahout 0.9 has been built with Hadoop 1.2.1 profile, not sure if that's gonna 
work with 0.20.

On Monday, March 17, 2014 9:50 PM, fx MA XIAOJUN <xiaojun...@fujixerox.co.jp> 
wrote:

Thank you for your extremely quick reply.

>> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
>> mean Streaming KMeans here?
I want to try using -rskm in streaming kmeans. 
But in mahout 0.8, if setting -rskm as true, errors occur.
I heard that the bug has been fixed in 0.9. So I upgraded 0.8->0.9

The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 
2.x(YARN)
cdh5-MRv1 has compatible version of mahout(mahout-0.8+cdh5.0.0b2+28) which is 
compiled by cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28, and installed apache mahout 0.9 
distribution. 
It turned out that "Mahout kmeans" runs very well on mapreduce.
However, "Mahout streamingkmeans" runs properly in sequential mode, but fails 
in mapreduce mode.

If it is the problem of incompatibility between hadoop and mahout, I don’t 
think "mahout kmeans" can run properly.

Is mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN <xiaojun...@fujixerox.co.jp> 
wrote:

Thank you for your quick reply.

As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 140000 and run mahout streamingkmeans again.(CDH 5.0 Mrv1, 
Mahout 0.8) The maps run faster than before, but the reduce was still stuck at 
76% for ever.

>> This has been my experience too both with 0.8 and 0.9. 

So, I uninstalled mahout 0.8, and installed mahout 0.9 in order to use -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

>> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got errors as following.
Hadoop I installed is cdh5-beta1-mapreduce version 1.
----------------------------------------------------------------------------------------
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at 
org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
--------------------------------------------------------------------------------------------

Seems like u r trying to execute on Hadoop 2 while Mahout 0.9 has been built 
with Hadoop 1.x profile, hence the error u r seeing.
If u would like to test on Hadoop 2, work off of present trunk and build the 
code with Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the 
slow performance that you have been experiencing. 

How did u come up with -km 63000?

Given that u would like 10000 clusters (= k) and have 2,000,000 datapoints (= 
n) so k * ln(n) = 10000 * ln(2 * 10^6)  = 145087 (rounded to nearest integer) 
and that should be the value of -km in ur case. (km = k * log (n) )

Not sure if that's gonna fix ur reduce being stuck at 76% forever but its 
definitely worth a try.

If you would like go to with -rskm option, please upgrade to Mahout 0.9.  I 
still think there's an issue with -rskm option with Mahout 0.9 and trunk today 
while executing in MR mode, but it definitely works in the nonMR (-xm 
sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma <xiaojun...@fujixerox.co.jp> 
wrote:

I am using mahout 0.8 embedded in chd5.0.0 provided by cloudera and found that 
reduce of mahout streamingkmeans is extremely slow.

For example:
With a dataset of 2000000 objects, 128 variables, I would like to get 10000 
clusters.

The command executed is as the following.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps which were all completed in 4 hours.
However, reduce took over 100 hours and it was still stuck at 76%.

I have tuned performance of hadoop as the following. 
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0 

I tried to assign enough memory but the reduce is still very very very slow.

Why does it take so much time in reduce?
And What can I do to speed up the job?

I wonder if it will be helpful to set -rskm to be true.
-rskm option has bug in Mahout 0.8, so I cannot get a try... 

Yours Sincerely,
Sylvia Ma

Re: reduce is too slow in StreamingKmeans

Reply via email to