Re: mahout & hadoop compatibility

Grant Ingersoll Thu, 04 Dec 2008 18:26:13 -0800

Yeah, it should work with 0.18, with a few patches to fix the Combiner issue, if you are using the k-Means clustering stuff. I committed one of them, but forget the issue numbers (Pallavi?). Have a look in JIRA.


On Dec 4, 2008, at 8:11 PM, Pradhuman Jhala wrote:

Just wondering if Mahout is compatible with hadoop-0.18 (and later) versions. From Hadoop 0.18 onwards, the combiner execution policy has changed: the combiner now gets executed twice, first on the Mapper side (on the output of the Mapper) and then again on the Reducer side (on the output of the first Combiner).
For more details: http://issues.apache.org/jira/browse/HADOOP-3226
It seems to me that the k-means and canopy clustering in Mahout assume that the combiner gets executed on the Mapper side only, and this is a major source of error: when the Combiner gets executed on the Reducer side, it cannot parse the output of the first Combiner correctly.
To fix, only for hadoop-0.18.*, if you want to use the combiner only on the output of the mapper (like earlier hadoop versions), add the following to your job config:
job.setCombineOnlyOnce(true);
This method (setCombineOnlyOnce) is not available in the hadoop-0.19 release, so I think the Mahout code needs to be changed to take care of this issue.
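For illustration, here is a minimal sketch of the kind of change that would make a combiner safe to run more than once: the combiner's output uses the same format as its input (a partial sum plus a point count), so a second pass can consume the output of the first. This is plain Java with no Hadoop dependency, and it is not Mahout's actual code. The names RepeatableCombinerSketch, Partial, ofPoint and combine are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatableCombinerSketch {

    /** Partial cluster statistic: a coordinate-wise sum and the number of points in it. */
    static final class Partial {
        final double[] sum;
        final long count;

        Partial(double[] sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** What the mapper would emit: a single point wrapped as a partial of count 1. */
        static Partial ofPoint(double[] point) {
            return new Partial(point.clone(), 1);
        }

        /** What the reducer would compute at the very end. */
        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    /**
     * Combiner step: input and output are both Partial, so the result can be
     * fed back into combine() any number of times without a parsing mismatch.
     */
    static Partial combine(List<Partial> values) {
        double[] total = new double[values.get(0).sum.length];
        long n = 0;
        for (Partial p : values) {
            for (int i = 0; i < total.length; i++) {
                total[i] += p.sum[i];
            }
            n += p.count;
        }
        return new Partial(total, n);
    }

    public static void main(String[] args) {
        Partial a = Partial.ofPoint(new double[] {1, 2});
        Partial b = Partial.ofPoint(new double[] {3, 4});
        Partial c = Partial.ofPoint(new double[] {5, 6});

        // Single combiner pass over all three points.
        List<Partial> all = new ArrayList<>();
        all.add(a);
        all.add(b);
        all.add(c);
        Partial once = combine(all);

        // Two passes: the second pass consumes the output of the first.
        List<Partial> firstPass = new ArrayList<>();
        firstPass.add(a);
        firstPass.add(b);
        List<Partial> secondPass = new ArrayList<>();
        secondPass.add(combine(firstPass));
        secondPass.add(c);
        Partial twice = combine(secondPass);

        System.out.println(once.centroid()[0] + "," + once.centroid()[1]);
        System.out.println(twice.centroid()[0] + "," + twice.centroid()[1]);
    }
}
```

Both passes yield the same sum and count, which is the property a combiner needs if the framework may apply it zero, one, or several times.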
Pradhuman


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
