RE: reduce is too slow in StreamingKmeans
Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 was successful. However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance you have been experiencing.

How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 datapoints (= n), k * ln(n) = 10000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)). Not sure if that's going to fix your reduce being stuck at 76% forever, but it's definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9. I still think there's an issue with the -rskm option in Mahout 0.9 and trunk today when executing in MR mode, but it definitely works in the non-MR (-xm sequential) mode in 0.9.

On Monday, February 17, 2014 9:05 PM, Sylvia Ma <xiaojun...@fujixerox.co.jp> wrote:

I am using Mahout 0.8 embedded in CDH 5.0.0 provided by Cloudera and found that the reduce of mahout streamingkmeans is extremely slow. For example: with a dataset of 2,000,000 objects and 128 variables, I would like to get 10,000 clusters. The command executed is the following.

mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps, which were all completed in 4 hours. However, the reduce took over 100 hours and was still stuck at 76%. I have tuned the performance of Hadoop as follows.
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0

I tried to assign enough memory, but the reduce is still very slow. Why does it take so much time in the reduce? And what can I do to speed up the job? I wonder if it would be helpful to set -rskm to true. The -rskm option has a bug in Mahout 0.8, so I cannot give it a try...

Yours Sincerely,
Sylvia Ma
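Suneel's sizing rule from the reply above (km = k * ln(n)) is easy to sanity-check offline. Below is a minimal Java sketch (the class and method names are mine, not Mahout's) reproducing the arithmetic from this thread with k = 10,000 clusters and n = 2,000,000 points; it also shows how the log10 mix-up produces the original 63000.

```java
public class StreamingKMeansSizing {

    // km = k * ln(n): the suggested number of intermediate sketch clusters
    // for streaming k-means. Math.log is the natural logarithm (ln).
    static long estimateKm(long k, long n) {
        return Math.round(k * Math.log(n));
    }

    public static void main(String[] args) {
        long k = 10_000;     // desired final clusters
        long n = 2_000_000;  // number of data points
        System.out.println(estimateKm(k, n));              // 145087, the value recommended in the thread
        System.out.println(Math.round(k * Math.log10(n))); // 63010: using log10 explains -km 63000
    }
}
```

The two printed values bracket the confusion in the thread: base-10 log gives roughly 63000, the natural log gives 145087.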
Re: reduce is too slow in StreamingKmeans
On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN <xiaojun...@fujixerox.co.jp> wrote:

> Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

This has been my experience too, both with 0.8 and 0.9.

> So I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 was successful.

What do you mean by this? kmeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

> However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1-mapreduce version 1.
>
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
>     at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Seems like you are trying to execute on Hadoop 2, while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing. If you would like to test on Hadoop 2, work off of the present trunk and build the code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.
Re: Problem with FileSystem in Kmeans
This problem is specifically to do with Canopy clustering and is not an issue with KMeans. I had seen this behavior with Canopy and, looking at the code, it's indeed an issue wherein clusters-0 is created on the local file system and the remaining clusters land on HDFS. Please file a JIRA for this if not already done.

On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:

Hi,

The problem is not with the input path; it's the way KMeans gets executed. Let me explain.

I have created CSV-to-sequence-file data using map-reduce, hence my data is in HDFS. After this I have run Canopy MR, hence that data is also in HDFS. Now these two things are pushed into the KMeans MR. If you check the KMeansDriver class, at first it tries to create the clusters-0 folder with data; here, if you don't specify the scheme, it will write to the local file system. After that the MR job is started, which expects clusters-0 in HDFS.

Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);
if (runSequential) {
  ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
} else {
  ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
}

Let me know if I am not able to explain clearly.

On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter <s...@apache.org> wrote:

Hi Bikash, have you tried adding hdfs:// to your input path? Maybe that helps.

--sebastian

On 03/11/2014 11:22 AM, Bikash Gupta wrote:

Hi,

I am running KMeans on a cluster where I am setting the configuration of fs.hdfs.impl and fs.file.impl beforehand, as mentioned below:

conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

The problem is that the clusters-0 directory is created on the local file system and clusters-1 is created in HDFS, and the KMeans map-reduce job is unable to find clusters-0. Please see the stacktrace below:

2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments: {--clustering=null, --clusters=[/3/clusters-0-final], --convergenceDelta=[0.1], --distanceMeasure=[org.apache.mahout.common.distance.EuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100], --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence Clusters In: /3/clusters-0-final Out: /5
2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max Iterations: 100
2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths to process : 3
2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: job_201403111332_0011
2014-03-11 14:52:20 o.a.h.m.JobClient [INFO] map 0% reduce 0%
2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id : attempt_201403111332_0011_m_00_0, Status : FAILED
2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: /5/clusters-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
    at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
    at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.io.FileNotFoundException: File /5/clusters-0

Please suggest!!!

--
Thanks & Regards
Bikash Kumar Gupta
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Hi,

Here is my output:

[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], --output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 11:47:31 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to process : 10
14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 4, size left: 29775
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 11:47:32 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local42076163_0001
14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001
14/03/17 11:47:32 INFO mapred.LocalJobRunner: OutputCommitter set in config null
14/03/17 11:47:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
Re: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Are you running on Hadoop 2.x? That seems to be the case here. Compile with the Hadoop 2 profile:

mvn -DskipTests clean install -Dhadoop2.profile=<your hadoop version>

On Monday, March 17, 2014 5:57 AM, Margusja <mar...@roo.ee> wrote:

Hi,

Here is my output:

[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i /user/speech/demo -o demo-seqfiles
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
Re: Problem with FileSystem in Kmeans
Suneel,

Just for information, I haven't found this issue in Canopy. The Canopy clusters-0 was created in HDFS only. However, the KMeans clusters-0 was created on the local file system and clusters-1 in HDFS, and after that it threw an error as it was unable to locate clusters-0.

On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

This problem is specifically to do with Canopy clustering and is not an issue with KMeans. I had seen this behavior with Canopy and, looking at the code, it's indeed an issue wherein clusters-0 is created on the local file system and the remaining clusters land on HDFS. Please file a JIRA for this if not already done.
Re: Problem with FileSystem in Kmeans
I have a 3-node cluster of CDH 4.6; however, I have built Mahout 0.9 with the Hadoop 2.x profile. I have also created a mount point for these nodes, and the path URI is the same as in HDFS. I have manually configured the filesystem parameters:

conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

The input data (sequence file) and cluster centers (output of Canopy) are present in HDFS. After this I am executing KMeansDriver using ToolRunner, but I got the error shown above. After debugging, I have found that clusters-0 is created on the mount point and clusters-1 in HDFS if I don't provide the file system scheme. Once I provide the file system scheme, i.e. hdfs:///, everything works like a charm.

On Mon, Mar 17, 2014 at 4:24 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

Have not seen that behavior with KMeans; what were your settings again? Sorry, joining late onto this thread, hence I have not looked at the entire history.
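The fix described in this thread generalizes: qualify every path with an explicit scheme so the clusters-0 seed directory cannot fall back onto the client's default (local) filesystem. A sketch of the equivalent command-line invocation, reusing the /2, /3, /5 paths from the log above; the short option names follow Mahout 0.9's kmeans driver, so verify against `mahout kmeans --help` for your version:

```shell
# Scheme-qualified paths; hdfs:/// picks up the authority from fs.defaultFS.
# Unqualified paths resolve against the client's default filesystem, which is
# how clusters-0 ended up on the local disk / mount point.
mahout kmeans \
  -i  hdfs:///2/sequence \
  -c  hdfs:///3/clusters-0-final \
  -o  hdfs:///5 \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -cd 0.1 -x 100 -cl
```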
Normalization in Mahout
Hi, Do we have any utility for Column and Row normalization in Mahout? -- Thanks Regards Bikash Gupta
Re: Normalization in Mahout
What are you trying to do? On Monday, March 17, 2014 7:45 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Hi, Do we have any utility for Column and Row normalization in Mahout? -- Thanks Regards Bikash Gupta
Re: Normalization in Mahout
I want to achieve a few things:
1. Normalize the input data for the clustering and classification algorithms.
2. Normalize the output data to plot on a graph.

On Mon, Mar 17, 2014 at 5:32 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: What are you trying to do? On Monday, March 17, 2014 7:45 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Hi, Do we have any utility for Column and Row normalization in Mahout? -- Thanks Regards Bikash Kumar Gupta
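The thread never names a Mahout utility for this, so as a minimal plain-Java sketch (not a Mahout API) of the two pre-processing steps described above: L2 normalization per row for clustering input, and min-max scaling per column for plotting.

```java
public class Normalize {

    // L2-normalize each row: divide every entry by the row's Euclidean norm.
    static double[][] rowsL2(double[][] m) {
        double[][] out = new double[m.length][];
        for (int i = 0; i < m.length; i++) {
            double norm = 0;
            for (double v : m[i]) norm += v * v;
            norm = Math.sqrt(norm);
            out[i] = new double[m[i].length];
            for (int j = 0; j < m[i].length; j++)
                out[i][j] = norm == 0 ? 0 : m[i][j] / norm;
        }
        return out;
    }

    // Min-max scale each column into [0, 1], e.g. for plotting on a graph.
    static double[][] colsMinMax(double[][] m) {
        int rows = m.length, cols = m[0].length;
        double[][] out = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] row : m) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (int i = 0; i < rows; i++)
                out[i][j] = range == 0 ? 0 : (m[i][j] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] r = rowsL2(new double[][]{ {3, 4} });
        System.out.println(r[0][0] + " " + r[0][1]);   // 0.6 0.8
    }
}
```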
Re: Mahout parallel K-Means - algorithms analysis
You could take a look at org.apache.mahout.clustering.classify/ClusterClassificationMapper. Enjoy, Wei Shung

On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: The clustering code is in CIMapper and CIReducer. Following the clustering, there is cluster classification, which is mapper-only. Not sure about a reference paper; this stuff has been around for a long time, but the documentation for k-means on mahout.apache.org should explain the approach. Sent from my iPhone

On Mar 15, 2014, at 5:36 PM, hiroshi leon hiroshi_8...@hotmail.com wrote: Hello Ted, Thank you so much for your reply. The program I was checking is the KMeansDriver class with the run function, then the buildClusters function in the same class, and following that the ClusterIterator class with the iterateMR function. Where can I check the code that is implemented for the mapper and the reducer? Is it in CIMapper.class and CIReducer.class? Is there a research paper or pseudo-code on which Mahout's parallel k-means was based? Thank you so much and have a nice day. Best regards

From: ted.dunn...@gmail.com Date: Sat, 15 Mar 2014 13:56:56 -0700 Subject: Re: Mahout parallel K-Means - algorithms analysis To: user@mahout.apache.org We would love to help. Can you say which program and which classes you are looking at?

On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon hiroshi_8...@hotmail.com wrote: To whom it may correspond: Hello, I have been checking the k-means MapReduce algorithm in Mahout 0.9, and I would like to know where I can see the code of what happens inside the map function and the reducer. I was debugging with NetBeans and could not find what exactly is implemented in the map and reduce functions. The reason I am doing this is that I would like to know what exactly is implemented in Mahout 0.9, in order to see which parts of the k-means MapReduce algorithm were optimized.
Do you know which research paper Mahout's k-means was based on, or where I can read the pseudo-code? Thank you so much! Best regards! Hiroshi
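For intuition, the structure that a mapper/reducer pair like CIMapper and CIReducer implements is the classic Lloyd iteration: the map phase assigns each point to its nearest current centroid and emits partial sums; the reduce phase averages them into new centroids. A self-contained sketch of that logic (plain Java, not the actual Mahout classes):

```java
import java.util.Arrays;

public class LloydSketch {

    // One k-means iteration, mimicking the map/reduce split:
    // "map" = assign each point to its nearest centroid (squared Euclidean),
    // "reduce" = average the assigned points into new centroids.
    static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length, d = centroids[0].length;
        double[][] sums = new double[k][d];
        int[] counts = new int[k];
        for (double[] p : points) {                  // map phase
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int j = 0; j < d; j++) {
                    double diff = p[j] - centroids[c][j];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            counts[best]++;                          // emit (cluster, point)
            for (int j = 0; j < d; j++) sums[best][j] += p[j];
        }
        for (int c = 0; c < k; c++)                  // reduce phase: new means
            for (int j = 0; j < d; j++)
                if (counts[c] > 0) sums[c][j] /= counts[c];
                else sums[c][j] = centroids[c][j];   // keep empty clusters in place
        return sums;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        double[][] next = iterate(pts, new double[][]{ {0, 0}, {10, 10} });
        System.out.println(Arrays.deepToString(next)); // [[0.0, 0.5], [10.0, 10.5]]
    }
}
```

The real driver repeats this until convergence (the -cd delta) or maxIterations, which is what ClusterIterator.iterateMR loops over.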
RE: reduce is too slow in StreamingKmeans
Thank you for your extremely quick reply.

> What do you mean by this? KMeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

I want to try using -rskm in Streaming KMeans. But in Mahout 0.8, setting -rskm to true causes errors. I heard that the bug had been fixed in 0.9, so I upgraded from 0.8 to 0.9. The Hadoop I installed is CDH5 MRv1, corresponding to Hadoop 0.20, not Hadoop 2.x (YARN). CDH5 MRv1 has a compatible version of Mahout (mahout-0.8+cdh5.0.0b2+28) compiled by Cloudera. So I uninstalled mahout-0.8+cdh5.0.0b2+28 and installed the Apache Mahout 0.9 distribution. It turned out that Mahout kmeans runs very well on MapReduce; however, Mahout streamingkmeans runs properly in sequential mode but fails in MapReduce mode. If this were an incompatibility between Hadoop and Mahout, I don't think Mahout kmeans could run properly. Is Mahout 0.9 compatible with Hadoop 0.20?

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp wrote:

> Thank you for your quick reply. As to -km, I thought it was log10 instead of ln. I was wrong... This time I set -km 14 and ran mahout streamingkmeans again (CDH 5.0 MRv1, Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 76% forever.

This has been my experience too, both with 0.8 and 0.9.

> So, I uninstalled Mahout 0.8 and installed Mahout 0.9 in order to use the -rskm option. Mahout kmeans can be executed properly, so I think the installation of Mahout 0.9 is successful.

What do you mean by this? KMeans hasn't changed between 0.8 and 0.9. Did you mean Streaming KMeans here?

> However, when executing mahout streamingkmeans, I got the following errors. The Hadoop I installed is cdh5-beta1, MapReduce version 1.
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

It seems like you are trying to execute on Hadoop 2 while Mahout 0.9 has been built with the Hadoop 1.x profile, hence the error you are seeing.

If you would like to test on Hadoop 2, work off of the present trunk and build the code with the Hadoop 2 profile, like so:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.

-----Original Message-----
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans, hence the slow performance that you have been experiencing. How did you come up with -km 63000? Given that you would like 10,000 clusters (= k) and have 2,000,000 data points (= n), k * ln(n) = 10,000 * ln(2 * 10^6) = 145,087 (rounded to the nearest integer), and that should be the value of -km in your case (km = k * ln(n)). Not sure if that's going to fix your reduce being stuck at 76% forever, but it's definitely worth a try.
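The -km arithmetic above is easy to check, and it also shows where the too-small 63000 came from: using log10 instead of the natural log. A minimal sketch:

```java
public class KmEstimate {

    // Streaming KMeans -km estimate: k * ln(n), rounded to the nearest integer.
    static long km(long k, long n) {
        return Math.round(k * Math.log(n));
    }

    public static void main(String[] args) {
        // k = 10,000 clusters, n = 2,000,000 data points
        System.out.println(km(10_000, 2_000_000));                        // 145087
        // The log10 variant that yields roughly the original -km 63000:
        System.out.println(Math.round(10_000 * Math.log10(2_000_000)));   // 63010
    }
}
```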
Re: reduce is too slow in StreamingKmeans
The -rskm option works only in sequential mode and fails in MR; that is still an issue in the present trunk that needs to be fixed. That should explain why Streaming KMeans with -rskm works only in sequential mode for you. Mahout 0.9 has been built with the Hadoop 1.2.1 profile; not sure if that's going to work with 0.20.

On Monday, March 17, 2014 9:50 PM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp wrote: [...]
RE: reduce is too slow in StreamingKmeans
As mahout streamingkmeans has no problems in sequential mode, I would like to try sequential mode. However, a java.lang.OutOfMemoryError occurs. Where do I set the JVM heap size for sequential mode? Is it the same as for MapReduce mode?

-----Original Message-----
From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp]
Sent: Tuesday, March 18, 2014 10:50 AM
To: Suneel Marthi; user@mahout.apache.org
Subject: RE: reduce is too slow in StreamingKmeans

[...]
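On the heap-size question: in sequential mode the driver runs inside the client JVM that the bin/mahout script launches, so the heap is controlled there, not by the task settings. A sketch, assuming the stock bin/mahout script (which honors the MAHOUT_HEAPSIZE environment variable, in MB) and hypothetical -k/-km values:

```shell
# Sequential (-xm sequential) runs in the client JVM started by bin/mahout;
# MAHOUT_HEAPSIZE (MB) overrides the script's default -Xmx for that JVM.
export MAHOUT_HEAPSIZE=4096
mahout streamingkmeans -i input -o output -k 10000 -km 145087 -xm sequential -ow

# MapReduce mode is different: the map/reduce tasks take their heap from
# mapred.child.java.opts (MRv1), e.g. -Dmapred.child.java.opts=-Xmx2048m
```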
Fwd: Need help in executing SSVD for dimensionality reduction on Mahout
Hi, I am trying to use SSVD for dimensionality reduction in Mahout; the input is sample data in CSV format. Below is a snippet of the input:

22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171

I have executed the steps below.

1. Loaded the CSV file and vectorized the data by following the steps at https://github.com/tdunning/pig-vector, with the key as TextConverter and the value as VectorWritable. Listed below is the output of this step. I believe the values 420468, 279945 are indices; please correct me if I am wrong.

Key: 1: Value: {420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
Key: 1: Value: {420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}

2. Passed the output of the above step to SSVD as follows:

bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o /user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca true -ow -t 1

Below is a snippet of the output in the USigma folder:

Key: 1: Value: {0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
Key: 1: Value: {0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}

Please let me know if my approach is correct, and help me interpret the output in the USigma folder. Thanks in advance, Pratap
Re: Need help in executing SSVD for dimensionality reduction on Mahout
If the rows in the input to SSVD are the data points you are trying to create a reduced space for, then the rows of USigma represent those same points in the PCA (reduced) space. The mapping between input rows and output rows is by identical keys in the sequence files. However, it doesn't look like your input is using distinct key values (both rows have key 1), which is not recommended. SSVD will also propagate names if NamedVector is used for the rows of the input; that is possibly another way to map input rows to PCA-space rows in USigma. However, it doesn't look like the input is using named vectors in this case.

On Mon, Mar 17, 2014 at 10:22 PM, Vijaya Pratap bvprat...@gmail.com wrote: [...]
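For intuition on what the USigma rows are: with a rank-r decomposition A ≈ U Σ Vᵀ, each input row a maps into the reduced space as a · V, so a 12-dimensional input row becomes an r-dimensional output row (r = 7 with --rank 7 above). A minimal plain-Java sketch of that projection, with a made-up 3-dimensional row and rank 2 (hypothetical values, not Mahout's actual code or basis):

```java
public class Projection {

    // Project a row vector into the reduced space: y = a · V, where V is d x r.
    static double[] project(double[] a, double[][] v) {
        int r = v[0].length;
        double[] y = new double[r];
        for (int j = 0; j < r; j++)
            for (int i = 0; i < a.length; i++)
                y[j] += a[i] * v[i][j];
        return y;
    }

    public static void main(String[] args) {
        // Toy d = 3, r = 2 basis, chosen only to show the shapes involved.
        double[][] v = { {1, 0}, {0, 1}, {0, 0} };
        double[] reduced = project(new double[]{22, 2, 44}, v);
        System.out.println(reduced.length + " dims: "
                + reduced[0] + ", " + reduced[1]);   // 2 dims: 22.0, 2.0
    }
}
```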