I think that would be good. I'm going to be working on MAHOUT today and tomorrow, hopefully. Finally have some free time at ApacheCon...

On Nov 5, 2008, at 11:08 PM, Palleti, Pallavi wrote:

Hi all,

The same is discussed here:
https://issues.apache.org/jira/browse/MAHOUT-79

I have a patch ready that fixes this issue. If no one is working on it, I
can open an issue in JIRA and commit it.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 06, 2008 5:50 AM
To: [email protected]
Subject: Re: Problems with KMeans clustering

Thanks Steve,

That was a subtle change that was evidently made after KMeans was
implemented and did not show up until later, when people such as
Philippe and yourself ran it with real problems on real clusters. While
the type signatures of the reducer and combiner are in fact the same,
the values provided by the mapper and combiner are different and could
indeed create the odd behavior that was reported.

The algorithm's dependence upon run-once behavior is pretty fundamental,
since summing of cluster centroids is done in the combiner and the
reducer does a merge of those clusters. I'd be interested in exactly how
you resolved this.
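
To make that failure mode concrete, here is a minimal sketch in Python
rather than Mahout's Java. The two text encodings, the "[count]" prefix
and all function names are assumptions made up for illustration; they are
not the formats the KMeans classes actually use. It shows how a combiner
and a reducer can share a text-in, text-out signature while the combiner
still cannot read its own output.

```python
# Illustrative only: the encodings and names below are hypothetical,
# not the formats used by Mahout's KMeans classes.

def encode_point(point):
    """Mapper-style output: a bare comma-separated vector, e.g. '1.0,2.0'."""
    return ",".join(str(x) for x in point)

def decode_point(text):
    """Parse a bare vector; raises ValueError on any other encoding."""
    return [float(tok) for tok in text.split(",")]

def combiner(values):
    """Sum mapper-style points into one partial cluster '[count]s1,s2,...'.

    Input and output are both text, but the output is NOT in the
    format this function accepts as input.
    """
    points = [decode_point(v) for v in values]
    total = [sum(col) for col in zip(*points)]
    return "[%d]%s" % (len(points), encode_point(total))

def reducer(values):
    """Merge partial clusters produced by exactly one combiner pass."""
    count, total = 0, None
    for v in values:
        head, body = v[1:].split("]", 1)
        part = decode_point(body)
        count += int(head)
        total = part if total is None else [a + b for a, b in zip(total, part)]
    return [t / count for t in total]

mapper_out = [encode_point(p) for p in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])]

# Run-once behavior: each value passes through the combiner exactly once.
once = [combiner(mapper_out[:2]), combiner(mapper_out[2:])]
centroid = reducer(once)

# A second combiner pass over already-combined values cannot be parsed.
try:
    combiner(once)
    second_pass_error = None
except ValueError as exc:
    second_pass_error = str(exc)
```

With run-once behavior the reducer recovers the mean of the three points.
On the second pass the combiner tries to parse a token that starts with
"[" as a number and fails, which is the same kind of error as a
NumberFormatException in Java.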

It likely applies to some of the other clustering implementations too.

Finally, can you explain why this problem no longer seems to occur with
Hadoop trunk?

Jeff


Steve Schlosser wrote:
Hi folks

A while back we upgraded our Hadoop cluster from 0.15 to 0.18.0, and I
found that Mahout Kmeans quit working.  I finally tracked it down to
the fact that the semantics of the combiner changed between 0.16,
0.17, and 0.18 from run exactly once to run zero or more times (which
is in line with how Map/Reduce was originally specified).  See:
https://issues.apache.org/jira/browse/HADOOP-3586.

The Kmeans combiner depended on running exactly once, but on our new
cluster it was running multiple times, causing hard-to-discern errors.
Basically, the second time through the Combiner, it would throw an
exception that the formatting of the vector (serialized into a Text)
was failing. In the end, I had to make some formatting changes to the
data output by the Mapper and the Combiner to match what the Reducer
expects, as well as changes to the Combiner input to .  I ended up
having to hack the Mapper to output vectors that either the Combiner
or Reducer could take as input, and make the Combiner take in the same
input that it outputs and to calculate convergence at each step.
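
The shape of that fix can be sketched as follows. This is a Python
illustration under assumed names and an assumed record layout, not
Steve's actual changes or Mahout code, and it leaves out the convergence
calculation. Every record, whether it comes from the mapper or the
combiner, is a (count, vector sum) pair, so combiner output is valid
combiner input and the reducer sees the same totals however many
combiner passes run.

```python
# Hypothetical sketch: names and record layout are assumptions.

def mapper(point):
    """Emit each point as a partial cluster of size 1."""
    return (1, list(point))

def combine(partials):
    """Merge (count, vector_sum) pairs into one pair of the same shape."""
    count = 0
    total = None
    for n, vec in partials:
        count += n
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
    return (count, total)

def reducer(partials):
    """Merge whatever partials arrive and return the centroid."""
    count, total = combine(partials)
    return [t / count for t in total]

points = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mapped = [mapper(p) for p in points]

# Zero combiner passes: the reducer sees raw mapper output.
zero_passes = reducer(mapped)

# One combiner pass per simulated map task.
one_pass = [combine(mapped[:2]), combine(mapped[2:])]
after_one = reducer(one_pass)

# A second pass over already-combined output.
after_two = reducer([combine(one_pass)])
```

All three paths give the same centroid. What matters is that the merge is
associative and that its output has the same shape as its input, which is
what a combiner that may run zero or more times needs.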

My apologies if this has already been covered and put to rest - I just
happened upon this thread this afternoon.

-steve

On Sun, Nov 2, 2008 at 10:29 AM, Philippe Lamarche
<[EMAIL PROTECTED]> wrote:

Hi there,
It also works on 0.19.0-dev, that is on hadoop/branches/branch-0.19.

I intend in the next few days to try to find out exactly what the
problem is, to make sure that it won't come back in a few revisions.

Thanks!

On Thu, Oct 30, 2008 at 9:20 AM, Grant Ingersoll
<[EMAIL PROTECTED]> wrote:


Hmm, I believe that patch has been applied in 18.2 (whatever that is),
but it also looks like it has been applied to the 0.17.3 branch as well.
So, it might be something else that "fixed" it.

At any rate, glad to hear it works on trunk.


On Oct 29, 2008, at 6:38 PM, Philippe Lamarche wrote:

I am not sure I understand the hadoop svn structure, however I was able
to make it work with hadoop trunk, or 0.20.0-dev.
It didn't work with hadoop/branch-0.18, with or without patch 4277.


Here is a copy-paste of the steps, once Hadoop is built and installed.
I am using the same exact "apache-mahout-examples-0.1-dev.job", not
rebuilt with the 0.20.0-dev jars.

It works!

That would mean that the bug/feature is not related to
HADOOP-4277 <http://issues.apache.org/jira/browse/HADOOP-4277>,
and was reintroduced (or never taken away) in hadoop/trunk.


[EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop namenode -format
08/10/29 18:27:59 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = phil/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.0-dev
STARTUP_MSG:   build =  -r ; compiled by 'philippe' on Wed Oct 29 18:25:08 EDT 2008
************************************************************/
08/10/29 18:28:00 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
08/10/29 18:28:00 INFO namenode.FSNamesystem: supergroup=supergroup
08/10/29 18:28:00 INFO namenode.FSNamesystem: isPermissionEnabled=true
08/10/29 18:28:00 INFO common.Storage: Image file of size 96 saved in 0 seconds.
08/10/29 18:28:00 INFO common.Storage: Storage directory /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name has been successfully formatted.
08/10/29 18:28:00 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at phil/127.0.1.1
************************************************************/

[EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop dfs -put \
    /home/philippe/synthetic_control.data testdata

[EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop jar \
    /home/philippe/workspace/MahoutJava/examples/build/apache-mahout-examples-0.1-dev.job \
    org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
08/10/29 18:28:45 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:28:46 INFO mapred.FileInputFormat: Total input paths to
process
: 1
08/10/29 18:28:47 INFO mapred.JobClient: Running job:
job_200810291828_0002
08/10/29 18:28:48 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:28:54 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:28:55 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:28:56 INFO mapred.JobClient: Job complete:
job_200810291828_0002
08/10/29 18:28:56 INFO mapred.JobClient: Counters: 7
08/10/29 18:28:56 INFO mapred.JobClient:   File Systems
08/10/29 18:28:56 INFO mapred.JobClient: HDFS bytes read=291644
08/10/29 18:28:56 INFO mapred.JobClient:     HDFS bytes
written=323660
08/10/29 18:28:56 INFO mapred.JobClient:   Job Counters
08/10/29 18:28:56 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:28:56 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:28:56 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:28:56 INFO mapred.JobClient:     Map input records=600
08/10/29 18:28:56 INFO mapred.JobClient:     Map input bytes=288374
08/10/29 18:28:56 INFO mapred.JobClient:     Map output records=600
08/10/29 18:28:56 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:28:56 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:28:56 INFO mapred.JobClient: Running job:
job_200810291828_0003
08/10/29 18:28:57 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:29:03 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:29:05 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:29:10 INFO mapred.JobClient:  map 100% reduce 100%
08/10/29 18:29:11 INFO mapred.JobClient: Job complete:
job_200810291828_0003
08/10/29 18:29:11 INFO mapred.JobClient: Counters: 16
08/10/29 18:29:11 INFO mapred.JobClient:   File Systems
08/10/29 18:29:11 INFO mapred.JobClient: HDFS bytes read=323660
08/10/29 18:29:11 INFO mapred.JobClient:     HDFS bytes
written=9657
08/10/29 18:29:11 INFO mapred.JobClient: Local bytes read=36119
08/10/29 18:29:11 INFO mapred.JobClient:     Local bytes
written=72300
08/10/29 18:29:11 INFO mapred.JobClient:   Job Counters
08/10/29 18:29:11 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:29:11 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:29:11 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:29:11 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:29:11 INFO mapred.JobClient:     Reduce input groups=1
08/10/29 18:29:11 INFO mapred.JobClient:     Combine output
records=28
08/10/29 18:29:11 INFO mapred.JobClient:     Map input records=600
08/10/29 18:29:11 INFO mapred.JobClient:     Reduce output
records=7
08/10/29 18:29:11 INFO mapred.JobClient:     Map output
bytes=943020
08/10/29 18:29:11 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:29:11 INFO mapred.JobClient:     Combine input
records=1732
08/10/29 18:29:11 INFO mapred.JobClient:     Map output
records=1732
08/10/29 18:29:11 INFO mapred.JobClient:     Reduce input
records=28
08/10/29 18:29:11 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:29:11 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:29:12 INFO mapred.JobClient: Running job:
job_200810291828_0004
08/10/29 18:29:13 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:29:20 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:29:22 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:29:27 INFO mapred.JobClient:  map 100% reduce 100%
08/10/29 18:29:28 INFO mapred.JobClient: Job complete:
job_200810291828_0004
08/10/29 18:29:28 INFO mapred.JobClient: Counters: 16
08/10/29 18:29:28 INFO mapred.JobClient:   File Systems
08/10/29 18:29:28 INFO mapred.JobClient: HDFS bytes read=342974
08/10/29 18:29:28 INFO mapred.JobClient:     HDFS bytes
written=3002539
08/10/29 18:29:28 INFO mapred.JobClient:     Local bytes
read=3018455
08/10/29 18:29:28 INFO mapred.JobClient:     Local bytes
written=6036972
08/10/29 18:29:28 INFO mapred.JobClient:   Job Counters
08/10/29 18:29:28 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:29:28 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:29:28 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:29:28 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:29:28 INFO mapred.JobClient:     Reduce input groups=7
08/10/29 18:29:28 INFO mapred.JobClient:     Combine output
records=0
08/10/29 18:29:28 INFO mapred.JobClient:     Map input records=600
08/10/29 18:29:28 INFO mapred.JobClient:     Reduce output
records=1591
08/10/29 18:29:28 INFO mapred.JobClient:     Map output
bytes=3008903
08/10/29 18:29:28 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:29:28 INFO mapred.JobClient:     Combine input
records=0
08/10/29 18:29:28 INFO mapred.JobClient:     Map output
records=1591
08/10/29 18:29:28 INFO mapred.JobClient:     Reduce input
records=1591
08/10/29 18:29:28 INFO kmeans.KMeansDriver: Iteration 0
08/10/29 18:29:28 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:29:28 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:29:28 INFO mapred.JobClient: Running job:
job_200810291828_0005
08/10/29 18:29:29 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:29:35 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:29:37 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:29:41 INFO mapred.JobClient: Job complete:
job_200810291828_0005
08/10/29 18:29:41 INFO mapred.JobClient: Counters: 16
08/10/29 18:29:41 INFO mapred.JobClient:   File Systems
08/10/29 18:29:41 INFO mapred.JobClient: HDFS bytes read=342974
08/10/29 18:29:41 INFO mapred.JobClient:     HDFS bytes
written=8205
08/10/29 18:29:41 INFO mapred.JobClient: Local bytes read=23227
08/10/29 18:29:41 INFO mapred.JobClient:     Local bytes
written=46516
08/10/29 18:29:41 INFO mapred.JobClient:   Job Counters
08/10/29 18:29:41 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:29:41 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:29:41 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:29:41 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:29:41 INFO mapred.JobClient:     Reduce input groups=7
08/10/29 18:29:41 INFO mapred.JobClient:     Combine output
records=10
08/10/29 18:29:41 INFO mapred.JobClient:     Map input records=600
08/10/29 18:29:41 INFO mapred.JobClient:     Reduce output
records=7
08/10/29 18:29:41 INFO mapred.JobClient:     Map output
bytes=1136504
08/10/29 18:29:41 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:29:41 INFO mapred.JobClient:     Combine input
records=600
08/10/29 18:29:41 INFO mapred.JobClient: Map output records=600
08/10/29 18:29:41 INFO mapred.JobClient:     Reduce input
records=10
08/10/29 18:29:41 INFO kmeans.KMeansDriver: Iteration 1
08/10/29 18:29:41 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:29:41 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:29:42 INFO mapred.JobClient: Running job:
job_200810291828_0006
08/10/29 18:29:43 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:29:50 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:29:51 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:29:55 INFO mapred.JobClient:  map 100% reduce 100%
08/10/29 18:29:56 INFO mapred.JobClient: Job complete:
job_200810291828_0006
08/10/29 18:29:56 INFO mapred.JobClient: Counters: 16
08/10/29 18:29:56 INFO mapred.JobClient:   File Systems
08/10/29 18:29:56 INFO mapred.JobClient: HDFS bytes read=340070
08/10/29 18:29:56 INFO mapred.JobClient:     HDFS bytes
written=8242
08/10/29 18:29:56 INFO mapred.JobClient: Local bytes read=21265
08/10/29 18:29:56 INFO mapred.JobClient:     Local bytes
written=42592
08/10/29 18:29:56 INFO mapred.JobClient:   Job Counters
08/10/29 18:29:56 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:29:56 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:29:56 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:29:56 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:29:56 INFO mapred.JobClient:     Reduce input groups=7
08/10/29 18:29:56 INFO mapred.JobClient:     Combine output
records=10
08/10/29 18:29:56 INFO mapred.JobClient:     Map input records=600
08/10/29 18:29:56 INFO mapred.JobClient:     Reduce output
records=7
08/10/29 18:29:56 INFO mapred.JobClient:     Map output
bytes=1023966
08/10/29 18:29:56 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:29:56 INFO mapred.JobClient:     Combine input
records=600
08/10/29 18:29:56 INFO mapred.JobClient: Map output records=600
08/10/29 18:29:56 INFO mapred.JobClient:     Reduce input
records=10
08/10/29 18:29:56 INFO kmeans.KMeansDriver: Iteration 2
08/10/29 18:29:56 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:29:56 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:29:56 INFO mapred.JobClient: Running job:
job_200810291828_0007
08/10/29 18:29:57 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:30:03 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:30:05 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:30:09 INFO mapred.JobClient: Job complete:
job_200810291828_0007
08/10/29 18:30:09 INFO mapred.JobClient: Counters: 16
08/10/29 18:30:09 INFO mapred.JobClient:   File Systems
08/10/29 18:30:09 INFO mapred.JobClient: HDFS bytes read=340144
08/10/29 18:30:09 INFO mapred.JobClient:     HDFS bytes
written=8280
08/10/29 18:30:09 INFO mapred.JobClient: Local bytes read=21085
08/10/29 18:30:09 INFO mapred.JobClient:     Local bytes
written=42232
08/10/29 18:30:09 INFO mapred.JobClient:   Job Counters
08/10/29 18:30:09 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:30:09 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:30:09 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:30:09 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:30:09 INFO mapred.JobClient:     Reduce input groups=7
08/10/29 18:30:09 INFO mapred.JobClient:     Combine output
records=10
08/10/29 18:30:09 INFO mapred.JobClient:     Map input records=600
08/10/29 18:30:09 INFO mapred.JobClient:     Reduce output
records=7
08/10/29 18:30:09 INFO mapred.JobClient:     Map output
bytes=1023681
08/10/29 18:30:09 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:30:09 INFO mapred.JobClient:     Combine input
records=600
08/10/29 18:30:09 INFO mapred.JobClient: Map output records=600
08/10/29 18:30:09 INFO mapred.JobClient:     Reduce input
records=10
08/10/29 18:30:09 INFO kmeans.KMeansDriver: Iteration 3
08/10/29 18:30:09 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:30:09 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:30:09 INFO mapred.JobClient: Running job:
job_200810291828_0008
08/10/29 18:30:10 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:30:17 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:30:18 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:30:22 INFO mapred.JobClient:  map 100% reduce 100%
08/10/29 18:30:23 INFO mapred.JobClient: Job complete:
job_200810291828_0008
08/10/29 18:30:23 INFO mapred.JobClient: Counters: 16
08/10/29 18:30:23 INFO mapred.JobClient:   File Systems
08/10/29 18:30:23 INFO mapred.JobClient: HDFS bytes read=340220
08/10/29 18:30:23 INFO mapred.JobClient:     HDFS bytes
written=8250
08/10/29 18:30:23 INFO mapred.JobClient: Local bytes read=21339
08/10/29 18:30:23 INFO mapred.JobClient:     Local bytes
written=42740
08/10/29 18:30:23 INFO mapred.JobClient:   Job Counters
08/10/29 18:30:23 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:30:23 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:30:23 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:30:23 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:30:23 INFO mapred.JobClient:     Reduce input groups=7
08/10/29 18:30:23 INFO mapred.JobClient:     Combine output
records=10
08/10/29 18:30:23 INFO mapred.JobClient:     Map input records=600
08/10/29 18:30:23 INFO mapred.JobClient:     Reduce output
records=7
08/10/29 18:30:23 INFO mapred.JobClient:     Map output
bytes=1028419
08/10/29 18:30:23 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:30:23 INFO mapred.JobClient:     Combine input
records=600
08/10/29 18:30:23 INFO mapred.JobClient: Map output records=600
08/10/29 18:30:23 INFO mapred.JobClient:     Reduce input
records=10
08/10/29 18:30:23 INFO kmeans.KMeansDriver: Iteration 4
08/10/29 18:30:23 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:30:23 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:30:24 INFO mapred.JobClient: Running job:
job_200810291828_0009
08/10/29 18:30:25 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:30:31 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:30:33 INFO mapred.JobClient:  map 100% reduce 0%
08/10/29 18:30:37 INFO mapred.JobClient:  map 100% reduce 100%
08/10/29 18:30:38 INFO mapred.JobClient: Job complete:
job_200810291828_0009
08/10/29 18:30:38 INFO mapred.JobClient: Counters: 16
08/10/29 18:30:38 INFO mapred.JobClient:   File Systems
08/10/29 18:30:38 INFO mapred.JobClient: HDFS bytes read=340160
08/10/29 18:30:38 INFO mapred.JobClient:     HDFS bytes
written=8200
08/10/29 18:30:38 INFO mapred.JobClient: Local bytes read=21219
08/10/29 18:30:38 INFO mapred.JobClient:     Local bytes
written=42500
08/10/29 18:30:38 INFO mapred.JobClient:   Job Counters
08/10/29 18:30:38 INFO mapred.JobClient:     Launched reduce
tasks=1
08/10/29 18:30:38 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:30:38 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:30:38 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:30:38 INFO mapred.JobClient:     Reduce input groups=7
08/10/29 18:30:38 INFO mapred.JobClient:     Combine output
records=10
08/10/29 18:30:38 INFO mapred.JobClient:     Map input records=600
08/10/29 18:30:38 INFO mapred.JobClient:     Reduce output
records=7
08/10/29 18:30:38 INFO mapred.JobClient:     Map output
bytes=1024899
08/10/29 18:30:38 INFO mapred.JobClient: Map input bytes=323660
08/10/29 18:30:38 INFO mapred.JobClient:     Combine input
records=600
08/10/29 18:30:38 INFO mapred.JobClient: Map output records=600
08/10/29 18:30:38 INFO mapred.JobClient:     Reduce input
records=10
08/10/29 18:30:38 INFO kmeans.KMeansDriver: Clustering
08/10/29 18:30:38 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:30:38 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:30:38 INFO mapred.JobClient: Running job:
job_200810291828_0010
08/10/29 18:30:39 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:30:45 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:30:47 INFO mapred.JobClient: Job complete:
job_200810291828_0010
08/10/29 18:30:47 INFO mapred.JobClient: Counters: 7
08/10/29 18:30:47 INFO mapred.JobClient:   File Systems
08/10/29 18:30:47 INFO mapred.JobClient: HDFS bytes read=340060
08/10/29 18:30:47 INFO mapred.JobClient:     HDFS bytes
written=1020535
08/10/29 18:30:47 INFO mapred.JobClient:   Job Counters
08/10/29 18:30:47 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:30:47 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:30:47 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:30:47 INFO mapred.JobClient:     Map input records=600
08/10/29 18:30:47 INFO mapred.JobClient:     Map input bytes=323660
08/10/29 18:30:47 INFO mapred.JobClient:     Map output records=600
08/10/29 18:30:47 WARN mapred.JobClient: Use GenericOptionsParser
for
parsing the arguments. Applications should implement Tool for the
same.
08/10/29 18:30:47 INFO mapred.FileInputFormat: Total input paths to
process
: 2
08/10/29 18:30:48 INFO mapred.JobClient: Running job:
job_200810291828_0011
08/10/29 18:30:49 INFO mapred.JobClient:  map 0% reduce 0%
08/10/29 18:30:56 INFO mapred.JobClient:  map 50% reduce 0%
08/10/29 18:30:57 INFO mapred.JobClient: Job complete:
job_200810291828_0011
08/10/29 18:30:57 INFO mapred.JobClient: Counters: 7
08/10/29 18:30:57 INFO mapred.JobClient:   File Systems
08/10/29 18:30:57 INFO mapred.JobClient:     HDFS bytes
read=1020535
08/10/29 18:30:57 INFO mapred.JobClient:     HDFS bytes
written=325460
08/10/29 18:30:57 INFO mapred.JobClient:   Job Counters
08/10/29 18:30:57 INFO mapred.JobClient:     Launched map tasks=2
08/10/29 18:30:57 INFO mapred.JobClient: Data-local map tasks=2
08/10/29 18:30:57 INFO mapred.JobClient:   Map-Reduce Framework
08/10/29 18:30:57 INFO mapred.JobClient:     Map input records=600
08/10/29 18:30:57 INFO mapred.JobClient:     Map input
bytes=1020535
08/10/29 18:30:57 INFO mapred.JobClient: Map output records=600





On Wed, Oct 29, 2008 at 11:10 AM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:

I will!

On 10/29/08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


Philippe, can you try the patch suggested by Arun Murthy on
[EMAIL PROTECTED]?  See
http://issues.apache.org/jira/browse/HADOOP-4277

I'm pretty swamped at the moment w/ ApacheCon coming up next week, but
if it does fix the issue, then maybe we should move forward to the 18.2
candidate (I don't think it has been released yet, those guys have a
pretty sophisticated build process going)

-Grant

On Oct 28, 2008, at 7:19 AM, Philippe Lamarche wrote:

Ubuntu Linux 2.6.24, with java-6-sun-1.6.0.07.


On Tue, Oct 28, 2008 at 7:03 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

Just a single machine.  I didn't think we were using features either.
Are you saying you can run the example using 0.18.1?

BTW, Philippe, what JVM, O/S, etc. are you using?

-Grant


On Oct 27, 2008, at 11:55 PM, Jeff Eastman wrote:

Hi,



Are you guys running on real Hadoop arrays? I can run the synthetic
control example just fine on a single machine. That code is just trying
to read a vector from a string. I'd be surprised if we were using any
"features" but will watch the threads.

Jeff



Grant Ingersoll wrote:

I started a thread on [EMAIL PROTECTED]:


http://hadoop.markmail.org/message/cczunzfhpcqz6pis


On Oct 27, 2008, at 9:49 PM, Grant Ingersoll wrote:

OK, I can confirm that the exact same code works with 0.17.2 and not w/
0.18.1.  So, it sounds like a bug in Hadoop, or we are relying on
incorrect behavior in Hadoop.


On Oct 27, 2008, at 9:33 PM, Grant Ingersoll wrote:


On Oct 26, 2008, at 10:46 AM, Philippe Lamarche wrote:


Unfortunately, I went straight from 0.17.2 to 0.18.1.  It was working
on 0.17.2.


BTW, are you saying the same exact code was working on 0.17.2, or are
you referring to some older Mahout code that worked on 17.2?




On Sun, Oct 26, 2008 at 9:48 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:


Did this work with 0.18.0 or other prior versions for you?



On Oct 25, 2008, at 7:23 PM, Philippe Lamarche wrote:

Hi,


I just updated to hadoop 0.18.1 and got a clean version of mahout from
svn. However, I am having problems with KMeans, which can be traced
down to:

2008-10-25 19:10:16,987 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
2008-10-25 19:10:16,987 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5011 bytes
2008-10-25 19:10:16,999 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810251826_0013_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
Caused by: java.lang.NumberFormatException: For input string: "["
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
    at java.lang.Double.parseDouble(Double.java:510)
    at org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java:60)
    at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256)
    at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38)
    at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2174)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134)
    ... 1 more

2008-10-25 19:10:16,999 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 0 files left.
2008-10-25 19:10:17,000 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: attempt_200810251826_0013_r_000000_0The reduce copier failed
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


This is while running the synthetic_control.data example, but I have
the same problems with any other input data.

I am able to run other map-reduce jobs without problems.

Here is the output of the jar task:

[EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop jar \
    /home/philippe/workspace/MahoutJava/examples/dist/apache-mahout-examples-0.1-dev.jar \
    org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
08/10/25 19:09:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:09:28 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/25 19:09:28 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/25 19:09:28 INFO mapred.JobClient: Running job: job_200810251826_0010
08/10/25 19:09:29 INFO mapred.JobClient:  map 0% reduce 0%
08/10/25 19:09:31 INFO mapred.JobClient:  map 50% reduce 0%
08/10/25 19:09:32 INFO mapred.JobClient: Job complete: job_200810251826_0010
08/10/25 19:09:32 INFO mapred.JobClient: Counters: 7
08/10/25 19:09:32 INFO mapred.JobClient:   File Systems
08/10/25 19:09:32 INFO mapred.JobClient:     HDFS bytes read=291644
08/10/25 19:09:32 INFO mapred.JobClient:     HDFS bytes written=323660
08/10/25 19:09:32 INFO mapred.JobClient:   Job Counters
08/10/25 19:09:32 INFO mapred.JobClient:     Launched map tasks=2
08/10/25 19:09:32 INFO mapred.JobClient:     Data-local map tasks=2
08/10/25 19:09:32 INFO mapred.JobClient:   Map-Reduce Framework
08/10/25 19:09:32 INFO mapred.JobClient:     Map input records=600
08/10/25 19:09:32 INFO mapred.JobClient:     Map input bytes=288374
08/10/25 19:09:32 INFO mapred.JobClient:     Map output records=600
08/10/25 19:09:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:09:32 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:32 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:32 INFO mapred.JobClient: Running job: job_200810251826_0011
08/10/25 19:09:33 INFO mapred.JobClient:  map 0% reduce 0%
08/10/25 19:09:37 INFO mapred.JobClient:  map 50% reduce 0%
08/10/25 19:09:39 INFO mapred.JobClient:  map 100% reduce 0%
08/10/25 19:09:44 INFO mapred.JobClient:  map 100% reduce 16%
08/10/25 19:09:52 INFO mapred.JobClient: Job complete: job_200810251826_0011
08/10/25 19:09:52 INFO mapred.JobClient: Counters: 16
08/10/25 19:09:52 INFO mapred.JobClient:   File Systems
08/10/25 19:09:52 INFO mapred.JobClient:     HDFS bytes read=323660
08/10/25 19:09:52 INFO mapred.JobClient:     HDFS bytes written=1447
08/10/25 19:09:52 INFO mapred.JobClient:     Local bytes read=1389
08/10/25 19:09:52 INFO mapred.JobClient:     Local bytes written=37878
08/10/25 19:09:52 INFO mapred.JobClient:   Job Counters
08/10/25 19:09:52 INFO mapred.JobClient:     Launched reduce tasks=1
08/10/25 19:09:52 INFO mapred.JobClient:     Launched map tasks=2
08/10/25 19:09:52 INFO mapred.JobClient:     Data-local map tasks=2
08/10/25 19:09:52 INFO mapred.JobClient:   Map-Reduce Framework
08/10/25 19:09:52 INFO mapred.JobClient:     Reduce input groups=1
08/10/25 19:09:52 INFO mapred.JobClient:     Combine output records=29
08/10/25 19:09:52 INFO mapred.JobClient:     Map input records=600
08/10/25 19:09:52 INFO mapred.JobClient:     Reduce output records=1
08/10/25 19:09:52 INFO mapred.JobClient:     Map output bytes=943020
08/10/25 19:09:52 INFO mapred.JobClient:     Map input bytes=323660
08/10/25 19:09:52 INFO mapred.JobClient:     Combine input records=1760
08/10/25 19:09:52 INFO mapred.JobClient:     Map output records=1732
08/10/25 19:09:52 INFO mapred.JobClient:     Reduce input records=1
08/10/25 19:09:53 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:09:53 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:53 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:53 INFO mapred.JobClient: Running job: job_200810251826_0012
08/10/25 19:09:54 INFO mapred.JobClient:  map 0% reduce 0%
08/10/25 19:09:56 INFO mapred.JobClient:  map 50% reduce 0%
08/10/25 19:09:58 INFO mapred.JobClient:  map 100% reduce 0%
08/10/25 19:10:02 INFO mapred.JobClient: Job complete: job_200810251826_0012
08/10/25 19:10:02 INFO mapred.JobClient: Counters: 16
08/10/25 19:10:02 INFO mapred.JobClient:   File Systems
08/10/25 19:10:02 INFO mapred.JobClient:     HDFS bytes read=326554
08/10/25 19:10:02 INFO mapred.JobClient:     HDFS bytes written=1137260
08/10/25 19:10:02 INFO mapred.JobClient:     Local bytes read=1147358
08/10/25 19:10:02 INFO mapred.JobClient:     Local bytes written=2304490
08/10/25 19:10:02 INFO mapred.JobClient:   Job Counters
08/10/25 19:10:02 INFO mapred.JobClient:     Launched reduce tasks=1
08/10/25 19:10:02 INFO mapred.JobClient:     Launched map tasks=2
08/10/25 19:10:02 INFO mapred.JobClient:     Data-local map tasks=2
08/10/25 19:10:02 INFO mapred.JobClient:   Map-Reduce Framework
08/10/25 19:10:02 INFO mapred.JobClient:     Reduce input groups=1
08/10/25 19:10:02 INFO mapred.JobClient:     Combine output records=0
08/10/25 19:10:02 INFO mapred.JobClient:     Map input records=600
08/10/25 19:10:02 INFO mapred.JobClient:     Reduce output records=600
08/10/25 19:10:02 INFO mapred.JobClient:     Map output bytes=1139660
08/10/25 19:10:02 INFO mapred.JobClient:     Map input bytes=323660
08/10/25 19:10:02 INFO mapred.JobClient:     Combine input records=0
08/10/25 19:10:02 INFO mapred.JobClient:     Map output records=600
08/10/25 19:10:02 INFO mapred.JobClient:     Reduce input records=600
08/10/25 19:10:02 INFO kmeans.KMeansDriver: Iteration 0
08/10/25 19:10:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:10:02 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:10:02 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:10:03 INFO mapred.JobClient: Running job: job_200810251826_0013
08/10/25 19:10:04 INFO mapred.JobClient:  map 0% reduce 0%
08/10/25 19:10:08 INFO mapred.JobClient:  map 50% reduce 0%
08/10/25 19:10:09 INFO mapred.JobClient:  map 100% reduce 0%
08/10/25 19:10:21 INFO mapred.JobClient: Task Id : attempt_200810251826_0013_r_000000_0, Status : FAILED
java.io.IOException: attempt_200810251826_0013_r_000000_0The reduce copier failed
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


I am not sure if I am doing something wrong here.

Thanks for the help,

Philippe.


--------------------------


Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










