I just updated to Hadoop 0.18.1 and checked out a clean version of Mahout from svn.
However, I am having problems with KMeans that can be traced back to:
2008-10-25 19:10:16,987 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
2008-10-25 19:10:16,987 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 5011 bytes
2008-10-25 19:10:16,999 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810251826_0013_r_000000_0 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2147)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2078)
Caused by: java.lang.NumberFormatException: For input string: "["
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1224)
        at java.lang.Double.parseDouble(Double.java:510)
        at org.apache.mahout.matrix.DenseVector.decodeFormat(DenseVector.java:60)
        at org.apache.mahout.matrix.AbstractVector.decodeVector(AbstractVector.java:256)
        at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:38)
        at org.apache.mahout.clustering.kmeans.KMeansCombiner.reduce(KMeansCombiner.java:31)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2174)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2134)
        ... 1 more
2008-10-25 19:10:16,999 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 0 files left.
2008-10-25 19:10:17,000 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: attempt_200810251826_0013_r_000000_0The reduce copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
This happens while running the synthetic_control.data example, but I have the same problem with any other input data. I am able to run other map-reduce jobs without problems.
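For what it's worth, the bottom of the trace is just Double.parseDouble choking on a leading "[", which I assume is the first character of a serialized vector string handed to AbstractVector.decodeVector by the combiner. A minimal standalone sketch of that parse failure (the class name and the bracketed string are only illustrative, not the actual combiner input):

    public class ParseDemo {
        public static void main(String[] args) {
            try {
                // Same call as at the bottom of the trace above; "[" is presumably
                // the start of a serialized vector such as "[1.0, 2.0]" (illustrative only).
                Double.parseDouble("[");
            } catch (NumberFormatException e) {
                System.out.println(e.getMessage()); // prints: For input string: "["
            }
        }
    }

So it looks like the value strings reaching KMeansCombiner.reduce are not in the format that DenseVector.decodeFormat expects.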
Here is the output of the jar task:

[EMAIL PROTECTED]:/usr/local/hadoop$ bin/hadoop jar /home/philippe/workspace/MahoutJava/examples/dist/apache-mahout-examples-0.1-dev.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
08/10/25 19:09:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:09:28 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/25 19:09:28 INFO mapred.FileInputFormat: Total input paths to process : 1
08/10/25 19:09:28 INFO mapred.JobClient: Running job: job_200810251826_0010
08/10/25 19:09:29 INFO mapred.JobClient: map 0% reduce 0%
08/10/25 19:09:31 INFO mapred.JobClient: map 50% reduce 0%
08/10/25 19:09:32 INFO mapred.JobClient: Job complete: job_200810251826_0010
08/10/25 19:09:32 INFO mapred.JobClient: Counters: 7
08/10/25 19:09:32 INFO mapred.JobClient:   File Systems
08/10/25 19:09:32 INFO mapred.JobClient:     HDFS bytes read=291644
08/10/25 19:09:32 INFO mapred.JobClient:     HDFS bytes written=323660
08/10/25 19:09:32 INFO mapred.JobClient:   Job Counters
08/10/25 19:09:32 INFO mapred.JobClient:     Launched map tasks=2
08/10/25 19:09:32 INFO mapred.JobClient:     Data-local map tasks=2
08/10/25 19:09:32 INFO mapred.JobClient:   Map-Reduce Framework
08/10/25 19:09:32 INFO mapred.JobClient:     Map input records=600
08/10/25 19:09:32 INFO mapred.JobClient:     Map input bytes=288374
08/10/25 19:09:32 INFO mapred.JobClient:     Map output records=600
08/10/25 19:09:32 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:09:32 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:32 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:32 INFO mapred.JobClient: Running job: job_200810251826_0011
08/10/25 19:09:33 INFO mapred.JobClient: map 0% reduce 0%
08/10/25 19:09:37 INFO mapred.JobClient: map 50% reduce 0%
08/10/25 19:09:39 INFO mapred.JobClient: map 100% reduce 0%
08/10/25 19:09:44 INFO mapred.JobClient: map 100% reduce 16%
08/10/25 19:09:52 INFO mapred.JobClient: Job complete: job_200810251826_0011
08/10/25 19:09:52 INFO mapred.JobClient: Counters: 16
08/10/25 19:09:52 INFO mapred.JobClient:   File Systems
08/10/25 19:09:52 INFO mapred.JobClient:     HDFS bytes read=323660
08/10/25 19:09:52 INFO mapred.JobClient:     HDFS bytes written=1447
08/10/25 19:09:52 INFO mapred.JobClient:     Local bytes read=1389
08/10/25 19:09:52 INFO mapred.JobClient:     Local bytes written=37878
08/10/25 19:09:52 INFO mapred.JobClient:   Job Counters
08/10/25 19:09:52 INFO mapred.JobClient:     Launched reduce tasks=1
08/10/25 19:09:52 INFO mapred.JobClient:     Launched map tasks=2
08/10/25 19:09:52 INFO mapred.JobClient:     Data-local map tasks=2
08/10/25 19:09:52 INFO mapred.JobClient:   Map-Reduce Framework
08/10/25 19:09:52 INFO mapred.JobClient:     Reduce input groups=1
08/10/25 19:09:52 INFO mapred.JobClient:     Combine output records=29
08/10/25 19:09:52 INFO mapred.JobClient:     Map input records=600
08/10/25 19:09:52 INFO mapred.JobClient:     Reduce output records=1
08/10/25 19:09:52 INFO mapred.JobClient:     Map output bytes=943020
08/10/25 19:09:52 INFO mapred.JobClient:     Map input bytes=323660
08/10/25 19:09:52 INFO mapred.JobClient:     Combine input records=1760
08/10/25 19:09:52 INFO mapred.JobClient:     Map output records=1732
08/10/25 19:09:52 INFO mapred.JobClient:     Reduce input records=1
08/10/25 19:09:53 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:09:53 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:53 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:09:53 INFO mapred.JobClient: Running job: job_200810251826_0012
08/10/25 19:09:54 INFO mapred.JobClient: map 0% reduce 0%
08/10/25 19:09:56 INFO mapred.JobClient: map 50% reduce 0%
08/10/25 19:09:58 INFO mapred.JobClient: map 100% reduce 0%
08/10/25 19:10:02 INFO mapred.JobClient: Job complete: job_200810251826_0012
08/10/25 19:10:02 INFO mapred.JobClient: Counters: 16
08/10/25 19:10:02 INFO mapred.JobClient:   File Systems
08/10/25 19:10:02 INFO mapred.JobClient:     HDFS bytes read=326554
08/10/25 19:10:02 INFO mapred.JobClient:     HDFS bytes written=1137260
08/10/25 19:10:02 INFO mapred.JobClient:     Local bytes read=1147358
08/10/25 19:10:02 INFO mapred.JobClient:     Local bytes written=2304490
08/10/25 19:10:02 INFO mapred.JobClient:   Job Counters
08/10/25 19:10:02 INFO mapred.JobClient:     Launched reduce tasks=1
08/10/25 19:10:02 INFO mapred.JobClient:     Launched map tasks=2
08/10/25 19:10:02 INFO mapred.JobClient:     Data-local map tasks=2
08/10/25 19:10:02 INFO mapred.JobClient:   Map-Reduce Framework
08/10/25 19:10:02 INFO mapred.JobClient:     Reduce input groups=1
08/10/25 19:10:02 INFO mapred.JobClient:     Combine output records=0
08/10/25 19:10:02 INFO mapred.JobClient:     Map input records=600
08/10/25 19:10:02 INFO mapred.JobClient:     Reduce output records=600
08/10/25 19:10:02 INFO mapred.JobClient:     Map output bytes=1139660
08/10/25 19:10:02 INFO mapred.JobClient:     Map input bytes=323660
08/10/25 19:10:02 INFO mapred.JobClient:     Combine input records=0
08/10/25 19:10:02 INFO mapred.JobClient:     Map output records=600
08/10/25 19:10:02 INFO mapred.JobClient:     Reduce input records=600
08/10/25 19:10:02 INFO kmeans.KMeansDriver: Iteration 0
08/10/25 19:10:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/10/25 19:10:02 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:10:02 INFO mapred.FileInputFormat: Total input paths to process : 2
08/10/25 19:10:03 INFO mapred.JobClient: Running job: job_200810251826_0013
08/10/25 19:10:04 INFO mapred.JobClient: map 0% reduce 0%
08/10/25 19:10:08 INFO mapred.JobClient: map 50% reduce 0%
08/10/25 19:10:09 INFO mapred.JobClient: map 100% reduce 0%
08/10/25 19:10:21 INFO mapred.JobClient: Task Id : attempt_200810251826_0013_r_000000_0, Status : FAILED
java.io.IOException: attempt_200810251826_0013_r_000000_0The reduce copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
I am not sure whether I am doing something wrong here.
Thanks for the help,
Philippe.