RE: reduce is too slow in StreamingKmeans

2014-03-17 Thread fx MA XIAOJUN
Thank you for your quick reply.

As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8).
The maps ran faster than before, but the reduce was still stuck at 76% forever.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is cdh5-beta1, MapReduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)








-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the 
slow performance that you have been experiencing. 

How did u come up with -km 63000?

Given that u would like 10,000 clusters (= k) and have 2,000,000 datapoints (= 
n), k * ln(n) = 10,000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), 
and that should be the value of -km in ur case. (km = k * ln(n))
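
For anyone double-checking that arithmetic, a minimal Java sketch (the k and n values are the ones from this thread; Math.log is the natural log, and using log10 instead is exactly what produces the ~63000 figure discussed here):

public class KmEstimate {
  public static void main(String[] args) {
    long k = 10000;                                     // desired clusters
    long n = 2000000;                                   // datapoints
    System.out.println(Math.round(k * Math.log(n)));    // ln    -> 145087
    System.out.println(Math.round(k * Math.log10(n)));  // log10 -> 63010, ~ the 63000 used
  }
}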

Not sure if that's gonna fix ur reduce being stuck at 76% forever, but it's 
definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9.  I 
still think there's an issue with the -rskm option in Mahout 0.9 and trunk today 
while executing in MR mode, but it definitely works in the non-MR (-xm 
sequential) mode in 0.9.











On Monday, February 17, 2014 9:05 PM, Sylvia Ma xiaojun...@fujixerox.co.jp 
wrote:
 
I am using mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that 
the reduce of mahout streamingkmeans is extremely slow.

For example:
With a dataset of 2,000,000 objects, 128 variables, I would like to get 10,000 
clusters.

The command executed is as follows.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps, which all completed in 4 hours.
However, the reduce took over 100 hours and was still stuck at 76%.

I have tuned the performance of hadoop as follows:
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0 

I tried to assign enough memory but the reduce is still very very very slow.


Why does it take so much time in reduce?
And what can I do to speed up the job?

I wonder if it would be helpful to set -rskm to true.
The -rskm option has a bug in Mahout 0.8, so I cannot give it a try...




Yours Sincerely,
Sylvia Ma


Re: reduce is too slow in StreamingKmeans

2014-03-17 Thread Suneel Marthi





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:
 
Thank you for your quick reply.

As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8).
The maps ran faster than before, but the reduce was still stuck at 76% forever.

>>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

>>> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
>>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is cdh5-beta1, MapReduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)



Seems like u r trying to execute on Hadoop 2 while Mahout 0.9 has been built 
with the Hadoop 1.x profile, hence the error u r seeing.
If u would like to test on Hadoop 2, work off of the present trunk and build the 
code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the 
slow performance that you have been experiencing. 

How did u come up with -km 63000?

Given that u would like 10,000 clusters (= k) and have 2,000,000 datapoints (= 
n), k * ln(n) = 10,000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), 
and that should be the value of -km in ur case. (km = k * ln(n))

Not sure if that's gonna fix ur reduce being stuck at 76% forever, but it's 
definitely worth a try.

If you would like to go with the -rskm option, please upgrade to Mahout 0.9.  I 
still think there's an issue with the -rskm option in Mahout 0.9 and trunk today 
while executing in MR mode, but it definitely works in the non-MR (-xm 
sequential) mode in 0.9.











On Monday, February 17, 2014 9:05 PM, Sylvia Ma xiaojun...@fujixerox.co.jp 
wrote:

I am using mahout 0.8 embedded in cdh5.0.0 provided by Cloudera and found that 
the reduce of mahout streamingkmeans is extremely slow.

For example:
With a dataset of 2,000,000 objects, 128 variables, I would like to get 10,000 
clusters.

The command executed is as follows.
mahout streamingkmeans -i input -o output -ow -k 10000 -km 63000

I have 15 maps, which all completed in 4 hours.
However, the reduce took over 100 hours and was still stuck at 76%.

I have tuned the performance of hadoop as follows:
map task jvm = 3g
reduce task jvm = 10g
io.sort.mb = 512
io.sort.factor = 50
mapred.reduce.parallel.copies = 10
mapred.inmem.merge.threshold = 0 

I tried to assign enough memory but the reduce is still very very very slow.


Why does it take so much time in reduce?
And what can I do to speed up the job?

Re: Problem with FileSystem in Kmeans

2014-03-17 Thread Suneel Marthi
This problem's specifically to do with Canopy clustering and is not an issue 
with KMeans. I had seen this behavior with Canopy, and looking at the code it's 
indeed an issue wherein cluster-0 is created on the local file system and the 
remaining clusters land on HDFS. 

Please file a JIRA for this if not already done so. 





On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta bikash.gupt...@gmail.com 
wrote:
 
Hi,

Problem is not with the input path, it's the way Kmeans is getting executed. Let
me explain.

I have created CSV-Sequence using map-reduce, hence my data is in HDFS.
After this I have run Canopy MR, hence that data is also in HDFS.

Now these two things are getting pushed into Kmeans MR.

If you check the KmeansDriver class, at first it tries to create the cluster-0
folder with data; here, if you don't specify the scheme, it will write to the
local file system. After that the MR job is started, which expects
cluster-0 in HDFS.

Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
prior.writeToSeqFiles(priorClustersPath);

if (runSequential) {
  ClusterIterator.iterateSeq(conf, input, priorClustersPath, output, maxIterations);
} else {
  ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
}

Let me know if I am not able to explain clearly.
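
A minimal sketch of the workaround this implies (the NameNode authority below is a placeholder, not from the thread): qualify the output path with an explicit scheme, or set the default filesystem on the Configuration, so the prior-clusters write and the MR job resolve the same filesystem.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class QualifiedOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumption: an HDFS NameNode at namenode:8020; adjust to your cluster,
    // or ship this via core-site.xml instead of setting it in code.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    // A scheme-qualified output keeps cluster-0 (written directly by the
    // driver) and cluster-1..N (written by the MR job) on one filesystem.
    Path output = new Path("hdfs://namenode:8020/5");
    System.out.println(output.getFileSystem(conf).getUri()); // hdfs://namenode:8020
  }
}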



On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Bikash,

 Have you tried adding hdfs:// to your input path? Maybe that helps.

 --sebastian


 On 03/11/2014 11:22 AM, Bikash Gupta wrote:

 Hi,

 I am running Kmeans in cluster where I am setting the configuration of
 fs.hdfs.impl and fs.file.impl before hand as mentioned below

 conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
 conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

 Problem is that the cluster-0 directory is getting created in the local file
 system and cluster-1 is getting created in HDFS, and the Kmeans map reduce
 job is unable to find cluster-0. Please see the stacktrace below

 2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
 {--clustering=null, --clusters=[/3/clusters-0-final],
 --convergenceDelta=[0.1],
 --distanceMeasure=[org.apache.mahout.common.distance.
 EuclideanDistanceMeasure],
 --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
 --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
 --tempDir=[temp]}
 2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable
 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
 Clusters In: /3/clusters-0-final Out: /5
 2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
 Iterations: 100
 2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths to
 process : 3
 2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job: job_201403111332_0011
 2014-03-11 14:52:20 o.a.h.m.JobClient [INFO]  map 0% reduce 0%
 2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id :
 attempt_201403111332_0011_m_00_0, Status : FAILED
 2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException: /5/clusters-0
         at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.java:78)
         at org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:208)
         at org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
         at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:415)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
         at org.apache.hadoop.mapred.Child.main(Child.java:262)
 Caused by: java.io.FileNotFoundException: File /5/clusters-0

 Please suggest!!!






-- 
Thanks & Regards
Bikash Kumar Gupta

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2014-03-17 Thread Margusja

Hi

Here is my output:
[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i 
/user/speech/demo -o demo-seqfiles

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and 
HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: 
/home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: 
{--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], 
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], 
--input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], 
--output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is 
deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 11:47:31 INFO Configuration.deprecation: 
mapred.compress.map.output is deprecated. Instead, use 
mapreduce.map.output.compress
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is 
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 11:47:31 INFO Configuration.deprecation: session.id is 
deprecated. Instead, use dfs.metrics.session-id
14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to 
process : 10
14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated 
node allocation with : CompletedNodes: 4, size left: 29775

14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 11:47:32 INFO Configuration.deprecation: user.name is 
deprecated. Instead, use mapreduce.job.user.name
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress 
is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is 
deprecated. Instead, use mapreduce.job.jar
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.output.value.class is deprecated. Instead, use 
mapreduce.job.output.value.class
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.mapoutput.value.class is deprecated. Instead, use 
mapreduce.map.output.value.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is 
deprecated. Instead, use mapreduce.job.map.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is 
deprecated. Instead, use mapreduce.job.name
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapreduce.inputformat.class is deprecated. Instead, use 
mapreduce.job.inputformat.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size 
is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapreduce.outputformat.class is deprecated. Instead, use 
mapreduce.job.outputformat.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.output.key.class is deprecated. Instead, use 
mapreduce.job.output.key.class
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.mapoutput.key.class is deprecated. Instead, use 
mapreduce.map.output.key.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is 
deprecated. Instead, use mapreduce.job.working.dir
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for 
job: job_local42076163_0001
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/

14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001
14/03/17 11:47:32 INFO mapred.LocalJobRunner: OutputCommitter set in 
config null
14/03/17 11:47:33 INFO mapred.LocalJobRunner: OutputCommitter is 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter


Re: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2014-03-17 Thread Suneel Marthi
R u running on Hadoop 2.x? That seems to be the case here.

Compile with hadoop 2 profile:

mvn -DskipTests clean install -Dhadoop2.profile=<ur hadoop version>





On Monday, March 17, 2014 5:57 AM, Margusja mar...@roo.ee wrote:
 
Hi

Here is my output:
[speech@h14 ~]$ mahout/bin/mahout seqdirectory -c UTF-8 -i 
/user/speech/demo -o demo-seqfiles
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/bin/hadoop and 
HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: 
/home/speech/mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
14/03/17 11:47:30 INFO common.AbstractJob: Command line arguments: 
{--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], 
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], 
--input=[/user/speech/demo], --keyPrefix=[], --method=[mapreduce], 
--output=[demo-seqfiles], --startPhase=[0], --tempDir=[temp]}
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.input.dir is 
deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/17 11:47:31 INFO Configuration.deprecation: 
mapred.compress.map.output is deprecated. Instead, use 
mapreduce.map.output.compress
14/03/17 11:47:31 INFO Configuration.deprecation: mapred.output.dir is 
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/17 11:47:31 INFO Configuration.deprecation: session.id is 
deprecated. Instead, use dfs.metrics.session-id
14/03/17 11:47:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
14/03/17 11:47:32 INFO input.FileInputFormat: Total input paths to 
process : 10
14/03/17 11:47:32 INFO input.CombineFileInputFormat: DEBUG: Terminated 
node allocation with : CompletedNodes: 4, size left: 29775
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: number of splits:1
14/03/17 11:47:32 INFO Configuration.deprecation: user.name is 
deprecated. Instead, use mapreduce.job.user.name
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.output.compress 
is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.jar is 
deprecated. Instead, use mapreduce.job.jar
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.reduce.tasks is 
deprecated. Instead, use mapreduce.job.reduces
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.output.value.class is deprecated. Instead, use 
mapreduce.job.output.value.class
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.mapoutput.value.class is deprecated. Instead, use 
mapreduce.map.output.value.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapreduce.map.class is 
deprecated. Instead, use mapreduce.job.map.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.job.name is 
deprecated. Instead, use mapreduce.job.name
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapreduce.inputformat.class is deprecated. Instead, use 
mapreduce.job.inputformat.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.max.split.size 
is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapreduce.outputformat.class is deprecated. Instead, use 
mapreduce.job.outputformat.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.map.tasks is 
deprecated. Instead, use mapreduce.job.maps
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.output.key.class is deprecated. Instead, use 
mapreduce.job.output.key.class
14/03/17 11:47:32 INFO Configuration.deprecation: 
mapred.mapoutput.key.class is deprecated. Instead, use 
mapreduce.map.output.key.class
14/03/17 11:47:32 INFO Configuration.deprecation: mapred.working.dir is 
deprecated. Instead, use mapreduce.job.working.dir
14/03/17 11:47:32 INFO mapreduce.JobSubmitter: Submitting tokens for 
job: job_local42076163_0001
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/staging/speech42076163/.staging/job_local42076163_0001/job.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/03/17 11:47:32 WARN conf.Configuration: 
file:/tmp/hadoop-speech/mapred/local/localRunner/speech/job_local42076163_0001/job_local42076163_0001.xml:an 
attempt to override final parameter: 
mapreduce.job.end-notification.max.attempts;  Ignoring.
14/03/17 11:47:32 INFO mapreduce.Job: The url to track the job: 
http://localhost:8080/
14/03/17 11:47:32 INFO mapreduce.Job: Running job: job_local42076163_0001

Re: Problem with FileSystem in Kmeans

2014-03-17 Thread Bikash Gupta
Suneel,

Just for information, I haven't found this issue in Canopy. Canopy cluster-0
was created in HDFS only.

However, Kmeans cluster-0 was created in the local file system and cluster-1 in
HDFS, and after that it spat an error as it was unable to locate cluster-0.


On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 This problem's specifically to do with Canopy clustering and is not an
 issue with KMeans. I had seen this behavior with Canopy, and looking at the
 code it's indeed an issue wherein cluster-0 is created on the local file
 system and the remaining clusters land on HDFS.

 Please file a JIRA for this if not already done so.





 On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta 
 bikash.gupt...@gmail.com wrote:

 Hi,

 Problem is not with input path, its the way Kmeans is getting executed. Let
 me explain.

 I have created CSV-Sequence using map-reduce hence my data is in HDFS
 After this I have run Canopy MR hence data is also in HDFS

 Now these two things are getting pushed in Kmeans MR.

 If you check the KmeansDriver class, at first it tries to create the cluster-0
 folder with data; here, if you don't specify the scheme, it will write to the
 local file system. After that the MR job is started, which expects
 cluster-0 in HDFS.

 Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
 ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
 ClusterClassifier prior = new ClusterClassifier(clusters, policy);
 prior.writeToSeqFiles(priorClustersPath);

 if (runSequential) {
   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output,
 maxIterations);
 } else {
   ClusterIterator.iterateMR(conf, input, priorClustersPath, output,
 maxIterations);
 }

 Let me know if I am not able to explain clearly.



 On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter s...@apache.org
 wrote:

  Hi Bikash,
 
  Have you tried adding hdfs:// to your input path? Maybe that helps.
 
  --sebastian
 
 
  On 03/11/2014 11:22 AM, Bikash Gupta wrote:
 
  Hi,
 
  I am running Kmeans in cluster where I am setting the configuration of
  fs.hdfs.impl and fs.file.impl before hand as mentioned below
 
  conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
  conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
 
  Problem is that cluster-0 directory is getting created in local file
  system
  and cluster-1 is getting created in HDFS, and Kmeans map reduce job is
  unable to find cluster-0 . Please see below the stacktrace
 
  2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
  {--clustering=null, --clusters=[/3/clusters-0-final],
  --convergenceDelta=[0.1],
  --distanceMeasure=[org.apache.mahout.common.distance.
  EuclideanDistanceMeasure],
  --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
  --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
  --tempDir=[temp]}
  2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
  native-hadoop library for your platform... using builtin-java classes
  where
  applicable
  2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
  Clusters In: /3/clusters-0-final Out: /5
  2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
  Iterations: 100
  2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser
 for
  parsing the arguments. Applications should implement Tool for the same.
  2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths
  to
  process : 3
  2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
  job_201403111332_0011
  2014-03-11 14:52:20 o.a.h.m.JobClient [INFO]  map 0% reduce 0%
  2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id :
  attempt_201403111332_0011_m_00_0, Status : FAILED
  2014-03-11 14:52:28 STDIO [ERROR] java.lang.IllegalStateException:
  /5/clusters-0
   at
  org.apache.mahout.common.iterator.sequencefile.
  SequenceFileDirValueIterable.iterator(SequenceFileDirValueIterable.
  java:78)
   at
 
 org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(
  ClusterClassifier.java:208)
   at
  org.apache.mahout.clustering.iterator.CIMapper.setup(CIMapper.java:44)
   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.
  java:672)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
   at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at
  org.apache.hadoop.security.UserGroupInformation.doAs(
  UserGroupInformation.java:1438)
   at org.apache.hadoop.mapred.Child.main(Child.java:262)
  Caused by: java.io.FileNotFoundException: File /5/clusters-0
 
  Please 

Re: Problem with FileSystem in Kmeans

2014-03-17 Thread Bikash Gupta
I have a 3-node cluster of CDH4.6; however, I have built Mahout 0.9 with the
Hadoop 2.x profile.

I have also created a mount point for these nodes, and the path URI is the same
as HDFS.

I have manually configured the filesystem parameters:

conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());

Input data (sequence file) and cluster centers (output of Canopy) are present
in HDFS. After this I am executing KmeansDriver using ToolRunner but got
the error as shown above.

After debugging I have found that cluster-0 is getting created in the mount
point and cluster-1 in HDFS if I don't provide the file system scheme. Once I
provide the file system scheme, i.e. "hdfs://", everything works like a
charm.



On Mon, Mar 17, 2014 at 4:24 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 Have not seen that behavior with KMeans, what were ur settings again?
 Sorry joining late onto this thread, hence have not looked at the entire
 history.




   On Monday, March 17, 2014 6:52 AM, Bikash Gupta 
 bikash.gupt...@gmail.com wrote:
  Suneel,

 Just for information, I haven't found this issue in Canopy. Canopy
 cluster-0 was created in HDFS only.

 However, Kmeans cluster-0 was created in the local file system and cluster-1 in
 HDFS, and after that it spat an error as it was unable to locate cluster-0.


 On Mon, Mar 17, 2014 at 3:10 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 This problem's specifically to do with Canopy clustering and is not an
 issue with KMeans. I had seen this behavior with Canopy, and looking at the
 code it's indeed an issue wherein cluster-0 is created on the local file
 system and the remaining clusters land on HDFS.

 Please file a JIRA for this if not already done so.





 On Wednesday, March 12, 2014 3:02 AM, Bikash Gupta 
 bikash.gupt...@gmail.com wrote:

 Hi,

 Problem is not with input path, its the way Kmeans is getting executed. Let
 me explain.

 I have created CSV-Sequence using map-reduce hence my data is in HDFS
 After this I have run Canopy MR hence data is also in HDFS

 Now these two things are getting pushed in Kmeans MR.

 If you check the KmeansDriver class, at first it tries to create the cluster-0
 folder with data; here, if you don't specify the scheme, it will write to the
 local file system. After that the MR job is started, which expects
 cluster-0 in HDFS.

 Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
 ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
 ClusterClassifier prior = new ClusterClassifier(clusters, policy);
 prior.writeToSeqFiles(priorClustersPath);

 if (runSequential) {
   ClusterIterator.iterateSeq(conf, input, priorClustersPath, output,
 maxIterations);
 } else {
   ClusterIterator.iterateMR(conf, input, priorClustersPath, output,
 maxIterations);
 }

 Let me know if I am not able to explain clearly.



 On Wed, Mar 12, 2014 at 11:53 AM, Sebastian Schelter s...@apache.org
 wrote:

  Hi Bikash,
 
  Have you tried adding hdfs:// to your input path? Maybe that helps.
 
  --sebastian
 
 
  On 03/11/2014 11:22 AM, Bikash Gupta wrote:
 
  Hi,
 
  I am running Kmeans in cluster where I am setting the configuration of
  fs.hdfs.impl and fs.file.impl before hand as mentioned below
 
  conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
  conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
 
  Problem is that cluster-0 directory is getting created in local file
  system
  and cluster-1 is getting created in HDFS, and Kmeans map reduce job is
  unable to find cluster-0 . Please see below the stacktrace
 
  2014-03-11 14:52:15 o.a.m.c.AbstractJob [INFO] Command line arguments:
  {--clustering=null, --clusters=[/3/clusters-0-final],
  --convergenceDelta=[0.1],
  --distanceMeasure=[org.apache.mahout.common.distance.
  EuclideanDistanceMeasure],
  --endPhase=[2147483647], --input=[/2/sequence], --maxIter=[100],
  --method=[mapreduce], --output=[/5], --overwrite=null, --startPhase=[0],
  --tempDir=[temp]}
  2014-03-11 14:52:15 o.a.h.u.NativeCodeLoader [WARN] Unable to load
  native-hadoop library for your platform... using builtin-java classes
  where
  applicable
  2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] Input: /2/sequence
  Clusters In: /3/clusters-0-final Out: /5
  2014-03-11 14:52:15 o.a.m.c.k.KMeansDriver [INFO] convergence: 0.1 max
  Iterations: 100
  2014-03-11 14:52:16 o.a.h.m.JobClient [WARN] Use GenericOptionsParser
 for
  parsing the arguments. Applications should implement Tool for the same.
  2014-03-11 14:52:17 o.a.h.m.l.i.FileInputFormat [INFO] Total input paths
  to
  process : 3
  2014-03-11 14:52:19 o.a.h.m.JobClient [INFO] Running job:
  job_201403111332_0011
  2014-03-11 14:52:20 o.a.h.m.JobClient [INFO]  map 0% reduce 0%
  2014-03-11 14:52:28 o.a.h.m.JobClient [INFO] Task Id :
  attempt_201403111332_0011_m_00_0, Status : 

Normalization in Mahout

2014-03-17 Thread Bikash Gupta
Hi,

Do we have any utility for Column and Row normalization in Mahout?

-- 
Thanks & Regards
Bikash Gupta


Re: Normalization in Mahout

2014-03-17 Thread Suneel Marthi
What r u trying to do? 





On Monday, March 17, 2014 7:45 AM, Bikash Gupta bikash.gupt...@gmail.com 
wrote:
 
Hi,

Do we have any utility for Column and Row normalization in Mahout?

-- 
Thanks & Regards
Bikash Gupta

Re: Normalization in Mahout

2014-03-17 Thread Bikash Gupta
Want to achieve a few things:

1. Normalize input data of clustering and classification algorithm
2. Normalize output data to plot in graph
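
Not an official Mahout utility, but a minimal row-normalization sketch using the mahout-math API (Vector.normalize does exist; the sample values are made up):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class NormalizeRows {
  public static void main(String[] args) {
    Vector row = new DenseVector(new double[] {22, 2, 44, 36});
    Vector l2 = row.normalize();      // divide by the Euclidean (L2) norm
    Vector l1 = row.normalize(1.0);   // divide by the L1 norm instead
    System.out.println(l2 + " / " + l1);
  }
}

For column normalization, one route could be to transpose the matrix first (Mahout ships a transpose job) and then normalize the resulting rows.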


On Mon, Mar 17, 2014 at 5:32 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 What r u trying to do?





 On Monday, March 17, 2014 7:45 AM, Bikash Gupta bikash.gupt...@gmail.com
 wrote:

 Hi,

 Do we have any utility for Column and Row normalization in Mahout?

 --
 Thanks & Regards
 Bikash Gupta




-- 
Thanks & Regards
Bikash Kumar Gupta


Re: Mahout parallel K-Means - algorithms analysis

2014-03-17 Thread Weishung Chung
You could take a look
at org.apache.mahout.clustering.classify.ClusterClassificationMapper.

Enjoy,
Wei Shung


On Sat, Mar 15, 2014 at 2:51 PM, Suneel Marthi suneel_mar...@yahoo.comwrote:

 The clustering code is CIMapper and CIReducer.  Following the clustering,
 there is cluster classification, which is mapper-only.

 Not sure about the reference paper; this stuff's been around for long, but
 the documentation for kmeans on mahout.apache.org should explain the
 approach.

 Sent from my iPhone

  On Mar 15, 2014, at 5:36 PM, hiroshi leon hiroshi_8...@hotmail.com
 wrote:
 
  Hello Ted,
 
  Thank you so much for your reply, the program that I was checking is the
 KMeansDriver class with the run function,
  the buildCluster function in the same class and following the
 ClusterIterator class with
  the iterateMR function.
 
  I would like to know where I can check the code that is implemented
  for the mapper and the reducer? Is it in CIMapper.class and CIReducer.class?
 
  Is there a research paper or pseudo-code on which Mahout parallel
 K-means was based?
 
  Thank you so much and have a nice day.
 
  Best regards
 
 
  From: ted.dunn...@gmail.com
  Date: Sat, 15 Mar 2014 13:56:56 -0700
  Subject: Re: Mahout parallel K-Means - algorithms analysis
  To: user@mahout.apache.org
 
  We would love to help.
 
  Can you say which program and which classes you are looking at?
 
 
  On Sat, Mar 15, 2014 at 12:58 PM, hiroshi leon 
 hiroshi_8...@hotmail.comwrote:
 
  To whom it may correspond,
 
  Hello, I have been checking the algorithm of Mahout 0.9 version k-means
  using MapReduce and I would like to know where can I check the code of
  what is happening inside the map function and in the reducer?
 
 
  I was debugging using NetBeans and I was not able to find what is
 exactly
  implemented in the Map and Reduce functions...
 
 
 
  The reason what I am doing this is because I would like to know what
  is exactly implemented in the version of Mahout 0.9 in order to see
  which parts where optimized on the K-Means mapReduce algorithm.
 
 
 
  Do you know  which research paper the Mahout K-means was based on or
 where
  can I read the pseudo code?
 
 
 
  Thank you so much!
 
 
 
  Best regards!
 
  Hiroshi
 



RE: reduce is too slow in StreamingKmeans

2014-03-17 Thread fx MA XIAOJUN
Thank you for your extremely quick reply.

> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
> mean Streaming KMeans here?
I want to try using -rskm in streaming kmeans. 
But in mahout 0.8, if setting -rskm as true, errors occur.
I heard that the bug has been fixed in 0.9. So I upgraded from 0.8 to 0.9.


The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 
2.x (YARN).
cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) which is 
compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28, and installed the apache mahout 0.9 
distribution. 
It turned out that Mahout kmeans runs very well on mapreduce.
However, Mahout streamingkmeans runs properly in sequential mode, but fails 
in mapreduce mode.

If it were a problem of incompatibility between hadoop and mahout, I don't 
think mahout kmeans could run properly.

Is mahout 0.9 compatible with Hadoop 0.20?





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:

Thank you for your quick reply.

As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 
76% forever.

>>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

>>> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
>>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is cdh5-beta1, MapReduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)



Seems like u r trying to execute on Hadoop 2 while Mahout 0.9 has been built 
with the Hadoop 1.x profile, hence the error u r seeing.
If u would like to test on Hadoop 2, work off of the present trunk and build the 
code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the 
slow performance that you have been experiencing. 

How did u come up with -km 63000?

Given that u would like 10,000 clusters (= k) and have 2,000,000 datapoints (= 
n), k * ln(n) = 10,000 * ln(2 * 10^6) = 145087 (rounded to the nearest integer), 
and that should be the value of -km in ur case. (km = k * ln(n))

Not sure if that's gonna fix ur reduce being stuck at 76% forever, but it's 
definitely worth a try.


Re: reduce is too slow in StreamingKmeans

2014-03-17 Thread Suneel Marthi
The -rskm option works only in sequential mode and fails in MR. That's still an 
issue in the present trunk that needs to be fixed.
That should explain why Streaming KMeans with -rskm works only in sequential 
mode for you.

Mahout 0.9 has been built with the Hadoop 1.2.1 profile; not sure if that's gonna 
work with 0.20.






On Monday, March 17, 2014 9:50 PM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:
 
Thank you for your extremely quick reply.

> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
> mean Streaming KMeans here?
I want to try using -rskm in streaming kmeans. 
But in mahout 0.8, if setting -rskm as true, errors occur.
I heard that the bug has been fixed in 0.9. So I upgraded from 0.8 to 0.9.


The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 
2.x (YARN).
cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) which is 
compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28, and installed the apache mahout 0.9 
distribution. 
It turned out that Mahout kmeans runs very well on mapreduce.
However, Mahout streamingkmeans runs properly in sequential mode, but fails 
in mapreduce mode.

If it were a problem of incompatibility between hadoop and mahout, I don't 
think mahout kmeans could run properly.

Is mahout 0.9 compatible with Hadoop 0.20?





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:

Thank you for your quick reply.

As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 
76% forever.

>>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

>>> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
>>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is cdh5-beta1, MapReduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)



Seems like u r trying to execute on Hadoop 2 while Mahout 0.9 has been built 
with the Hadoop 1.x profile, hence the error u r seeing.
If u would like to test on Hadoop 2, work off of the present trunk and build the 
code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the 
slow performance that you have been experiencing. 

RE: reduce is too slow in StreamingKmeans

2014-03-17 Thread fx MA XIAOJUN
As mahout streamingkmeans has no problems in sequential mode, 
I would like to try sequential mode.
However, java.lang.OutOfMemoryError occurs.

I wonder where to set the JVM heap size for sequential mode.
Is it the same as for mapreduce mode?
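
One hedged pointer, based on the stock bin/mahout launcher script rather than anything specific to this CDH build: the script sizes its client JVM from the MAHOUT_HEAPSIZE environment variable (in MB), and in -xm sequential mode the whole computation runs inside that client JVM (when the script delegates to the hadoop launcher, HADOOP_CLIENT_OPTS="-Xmx..." is the equivalent knob). For example:

export MAHOUT_HEAPSIZE=8192
mahout streamingkmeans -i input -o output -ow -k 10000 -km 145087 -xm sequential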



-Original Message-
From: fx MA XIAOJUN [mailto:xiaojun...@fujixerox.co.jp] 
Sent: Tuesday, March 18, 2014 10:50 AM
To: Suneel Marthi; user@mahout.apache.org
Subject: RE: reduce is too slow in StreamingKmeans

Thank you for your extremely quick reply.

> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
> mean Streaming KMeans here?
I want to try using -rskm in streaming kmeans. 
But in mahout 0.8, if setting -rskm as true, errors occur.
I heard that the bug has been fixed in 0.9. So I upgraded from 0.8 to 0.9.


The hadoop I installed is cdh5-MRv1, corresponding to hadoop 0.20, not hadoop 
2.x (YARN).
cdh5-MRv1 has a compatible version of mahout (mahout-0.8+cdh5.0.0b2+28) which is 
compiled by Cloudera.
So I uninstalled mahout-0.8+cdh5.0.0b2+28, and installed the apache mahout 0.9 
distribution. 
It turned out that Mahout kmeans runs very well on mapreduce.
However, Mahout streamingkmeans runs properly in sequential mode, but fails 
in mapreduce mode.

If it were a problem of incompatibility between hadoop and mahout, I don't 
think mahout kmeans could run properly.

Is mahout 0.9 compatible with Hadoop 0.20?





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com] 
Sent: Monday, March 17, 2014 6:21 PM
To: fx MA XIAOJUN; user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans





On Monday, March 17, 2014 3:43 AM, fx MA XIAOJUN xiaojun...@fujixerox.co.jp 
wrote:

Thank you for your quick reply.

As to -km, I thought it was log10, instead of ln. I was wrong...
This time I set -km 145087 and ran mahout streamingkmeans again (CDH 5.0 MRv1, 
Mahout 0.8). The maps ran faster than before, but the reduce was still stuck at 
76% forever.

>>> This has been my experience too, both with 0.8 and 0.9.

So, I uninstalled mahout 0.8 and installed mahout 0.9 in order to use the -rskm 
option.

Mahout kmeans can be executed properly, so I think the installation of mahout 
0.9 is successful.

>>> What do u mean by this? kmeans hasn't changed between 0.8 and 0.9. Did u 
>>> mean Streaming KMeans here?

However, when executing mahout streamingkmeans, I got the following errors.
The Hadoop I installed is cdh5-beta1, MapReduce version 1.

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
    at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.runMapReduce(StreamingKMeansDriver.java:464)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:419)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.run(StreamingKMeansDriver.java:240)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver.main(StreamingKMeansDriver.java:491)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)



Seems like u r trying to execute on Hadoop 2 while Mahout 0.9 has been built 
with the Hadoop 1.x profile, hence the error u r seeing.
If u would like to test on Hadoop 2, work off of the present trunk and build the 
code with the Hadoop 2 profile like below:

mvn clean install -Dhadoop2.profile=<hadoop 2.x version>

Please give that a try.





-Original Message-
From: Suneel Marthi [mailto:suneel_mar...@yahoo.com]
Sent: Wednesday, February 19, 2014 1:08 AM
To: user@mahout.apache.org
Subject: Re: reduce is too slow in StreamingKmeans

Streaming KMeans runs with a single reducer that runs Ball KMeans and hence the 
slow performance that you have been experiencing. 

Fwd: Need help in executing SSVD for dimensionality reduction on Mahout

2014-03-17 Thread Vijaya Pratap
Hi,

I am trying to use SSVD for dimensionality reduction on Mahout; the input
is sample data in CSV format. Below is a snippet of the input:

22,2,44,36,5,9,2824,2,4,733,285,169
25,1,150,175,3,9,4037,2,18,1822,254,171

I have executed the below steps.

1. Loaded the csv file and Vectorized the data by following the steps
mentioned at https://github.com/tdunning/pig-vector with key as
TextConverter and value as VectorWritable. Listed below is the output of
this step. I believe the values 420468, 279945 are indices; please correct
me if I am wrong.
Key: 1: Value:
{420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
Key: 1: Value:
{420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}

2. Passed the output of the above command to SSVD as follows
bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o
/user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca
true -ow -t 1

Below is a snippet of the output in USigma folder
Key: 1: Value:
{0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
Key: 1: Value:
{0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}

Please let me know if my approach is correct, and help me in interpreting
the output in the USigma folder.


Thanks in advance
Pratap


Re: Need help in executing SSVD for dimensionality reduction on Mahout

2014-03-17 Thread Dmitriy Lyubimov
If the rows in the input for SSVD are data points you are trying to create the
reduced space for, then the rows of USigma represent the same points in the PCA
(reduced) space. The mapping between the input rows and output rows is by the
same keys in the sequence files. However, it doesn't look like your input
is using distinct key values (both rows have key 1); this is not recommended.

SSVD will also propagate names if NamedVector is used for rows of the
input. That's possibly another way to map input rows to PCA space rows in
USigma. However, it doesn't look like the input is using Named vectors in
this case.
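
A minimal sketch of what that suggests (the path and the truncated sample rows below are placeholders): write each input row under a distinct sequence-file key, optionally wrapped in a NamedVector, so rows of USigma can be joined back to the inputs afterwards.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class WriteSsvdInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/cloudera/vectorized_data/part-m-00000"); // placeholder
    SequenceFile.Writer writer = SequenceFile.createWriter(
        FileSystem.get(conf), conf, path, IntWritable.class, VectorWritable.class);
    double[][] rows = {{22, 2, 44}, {25, 1, 150}}; // stand-ins for the real CSV rows
    for (int i = 0; i < rows.length; i++) {
      NamedVector v = new NamedVector(new DenseVector(rows[i]), "row-" + i);
      writer.append(new IntWritable(i), new VectorWritable(v)); // distinct key per row
    }
    writer.close();
  }
}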


On Mon, Mar 17, 2014 at 10:22 PM, Vijaya Pratap bvprat...@gmail.com wrote:

 Hi,

 I am trying to use SSVD for dimensionality reduction on Mahout, the input
 is a sample data in CSV format. Below is a snippet of the input

 22,2,44,36,5,9,2824,2,4,733,285,169
 25,1,150,175,3,9,4037,2,18,1822,254,171

 I have executed the below steps.

 1. Loaded the csv file and Vectorized the data by following the steps
 mentioned at https://github.com/tdunning/pig-vector with key as
 TextConverter and value as VectorWritable. Listed below is the output of
 this step. I believe the values 420468, 279945 are indices, please correct
 me if I am wrong.
 Key: 1: Value:

 {420468:733.0,279945:2.0,607618:285.0,107323:4.0,88330:2.0,263605:9.0,975378:169.0,796003:2824.0,899937:44.0,422862:5.0,723271:22.0,508675:36.0}
 Key: 1: Value:

 {420468:1822.0,279945:2.0,607618:254.0,107323:18.0,88330:1.0,263605:9.0,975378:171.0,796003:4037.0,899937:150.0,422862:3.0,723271:25.0,508675:175.0}

 2. Passed the output of the above command to SSVD as follows
 bin/mahout ssvd -i /user/cloudera/vectorized_data/ -o
 /user/cloudera/reduced_dimensions --rank 7 -us true -V false -U false -pca
 true -ow -t 1

 Below is a snippet of the output in USigma folder
 Key: 1: Value:

 {0:190.78376981262613,1:350.30406212052424,2:78.24932121461198,3:98.67283686605012,4:-122.95056058078157,5:-4.201436498582381,6:-1.4370820809434337}
 Key: 1: Value:

 {0:1295.933786837574,1:-698.5629072274602,2:-24.15996813349674,3:60.936737740013946,4:11.859426028893711,5:-6.379057682687426,6:0.9356299409590896}

 Please let me know if my approach is correct and help me in interpreting
 the output in USigma folder


 Thanks in advance
 Pratap