On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN <xiaojun...@fujixerox.co.jp>
wrote:
Thank you for your quick reply.
As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 140000 and ran mahout streamingkmeans again (CDH 5.0 MRv1,
Mahout 0.8).
The maps ran faster than before, but the reduce was still stuck at 76% forever.
>> This has been my experience too, with both 0.8 and 0.9.
So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm
option.
Mahout kmeans runs properly, so I think the installation of Mahout 0.9 was
successful.
>> What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you
>> mean Streaming KMeans here?
However, when executing mahout streamingkmeans, I got the following error.
The Hadoop I installed is cdh5-beta1 with MapReduce version 1.
----------------------------------------------------------------------------------------
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
--------------------------------------------------------------------------------------------
It seems you are trying to run on Hadoop 2, while Mahout 0.9 was built with
the Hadoop 1.x profile, hence the error you are seeing.
If you would like to test on Hadoop 2, work off the current trunk and build the
code with the Hadoop 2 profile, like below:
mvn clean install -Dhadoop2.profile=<hadoop 2.x version>
Please give that a try.
-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans
Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the
slow performance that you have been experiencing.
How did you come up with -km 63000?
Given that you would like 10000 clusters (= k) and have 2,000,000 datapoints
(= n), k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest
integer), and that should be the value of -km in your case (km = k * ln(n)).
Not sure if that's going to fix your reduce being stuck at 76% forever, but it's
definitely worth a try.
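As a quick check of the arithmetic above, here is a minimal Python sketch that computes k * ln(n) for the numbers in this thread. The helper name estimate_km is hypothetical and is not part of Mahout. For comparison it also evaluates the same formula with log10, which comes out close to the -km 63000 that was originally used.

```python
import math

def estimate_km(k, n, log=math.log):
    """Hypothetical helper (not a Mahout API): k * log(n), rounded to the nearest integer."""
    return round(k * log(n))

k = 10000        # desired number of clusters (-k)
n = 2_000_000    # number of datapoints

km_ln = estimate_km(k, n)                  # natural log, as suggested in this thread
km_log10 = estimate_km(k, n, math.log10)   # base-10 log, for comparison only

print(km_ln)     # 145087
print(km_log10)  # 63010
```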
If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I
still think there's an issue with the -rskm option in Mahout 0.9 and trunk today
when executing in MR mode, but it definitely works in the non-MR (-xm
sequential) mode in 0.9.
On Monday, February 17, 2014 9:05 PM, Sylvia Ma <xiaojun...@fujixerox.co.jp>
wrote:
I am using Mahout 0.8 bundled with CDH 5.0.0 from Cloudera, and found that the
reduce phase of mahout streamingkmeans is extremely slow.
For example:
With a dataset of 2,000,000 objects and 128 variables, I would like to get 10000
clusters.
The command I executed is the following.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000
I have 15 maps, which all completed within 4 hours.
However, the reduce took over 100 hours and was still stuck at 76%.
I have tuned the Hadoop settings as follows.
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0
I tried to assign enough memory, but the reduce is still extremely slow.
Why does the reduce take so much time?
And what can I do to speed up the job?
I wonder if it would help to set -rskm to true.
The -rskm option has a bug in Mahout 0.8, so I cannot try it...
Yours Sincerely,
Sylvia Ma