[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730128#action_12730128 ]
Ted Dunning commented on MAHOUT-140:
------------------------------------

These results look *really* promising. But I am curious about how to interpret these numbers.

It appears that you get decent speed-up with a larger cluster (5x speedup with 10x nodes). But these numbers don't seem to show speedup over the results that you gave in MAHOUT-122 where a single node seemed to be able to build 50 trees on 5% of the data in <10m.

My guess is that the comparison I am making is invalid. Can you clarify how things look so far? This seems like it ought to be much more promising than what I am saying. I don't understand how your small cluster here could be slower than the reference implementation.

A second question is why you don't see perfect speedup with an increasing cluster. Do you have any insight into how the time breaks down between hadoop MR startup, data cache loading, tree building, oob error estimation and storing output?

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of
> the dataset loaded in memory, it uses the reference implementation's code to
> build each tree and estimate the oob error.
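As a rough illustration of the speedup question raised in the comment, the "5x speedup with 10x nodes" figure can be turned into a parallel efficiency and, under an Amdahl's-law assumption, into an estimate of how much of the smaller cluster's runtime does not shrink as nodes are added (for example job startup or cache loading). This is only a sketch: the function names are hypothetical, and the model assumes the runtime splits cleanly into a fixed part and a perfectly parallel part, which the thread does not confirm.

```python
def parallel_efficiency(speedup, node_ratio):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / node_ratio


def fixed_fraction(speedup, node_ratio):
    """Estimate the non-scaling fraction f of the baseline runtime.

    Assumes Amdahl's law: speedup = 1 / (f + (1 - f) / node_ratio),
    solved here for f. The baseline is the smaller cluster.
    """
    return (node_ratio / speedup - 1.0) / (node_ratio - 1.0)


if __name__ == "__main__":
    # Figures quoted in the comment: 5x speedup with 10x nodes.
    print(parallel_efficiency(5, 10))  # 0.5
    print(fixed_fraction(5, 10))       # about 0.11, i.e. 1/9
```

Under these assumptions, roughly one ninth of the smaller cluster's runtime would be fixed overhead. Measured per-phase timings would be needed to say which phase that overhead actually comes from.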