[ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730128#action_12730128
 ] 

Ted Dunning commented on MAHOUT-140:
------------------------------------

These results look *really* promising.

But I am curious about how to interpret these numbers.  It appears that you 
get a decent speedup with a larger cluster (5x speedup with 10x nodes).

But these numbers don't seem to show a speedup over the results that you gave 
in MAHOUT-122, where a single node seemed to be able to build 50 trees on 5% 
of the data in under 10 minutes.

My guess is that the comparison I am making is invalid.

Can you clarify how things look so far?  This seems like it ought to be much 
more promising than what I am saying.  I don't understand how your small 
cluster here could be slower than the reference implementation.

A second question is why you don't see perfect speedup with an increasing 
cluster.  Do you have any insight into how the time breaks down between Hadoop 
MapReduce job startup, data cache loading, tree building, OOB error estimation, 
and storing output?
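
As a rough illustration of what the 5x-on-10x figure implies, the sketch below 
computes parallel efficiency and, under Amdahl's law, the fraction of the run 
that would have to be non-scaling overhead (job startup, cache loading, output) 
to explain it.  The function names are hypothetical, and the model assumes the 
smaller cluster is the baseline unit and that all overhead is fixed rather than 
growing with cluster size; neither assumption comes from the reported numbers.

```python
def parallel_efficiency(speedup: float, node_ratio: float) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / node_ratio


def amdahl_serial_fraction(speedup: float, node_ratio: float) -> float:
    """Solve Amdahl's law, S = 1 / (s + (1 - s) / N), for the serial fraction s.

    Assumption: the baseline cluster counts as N = 1, and the serial part
    does not change as nodes are added.
    """
    if node_ratio <= 1:
        raise ValueError("node_ratio must be greater than 1")
    return (node_ratio / speedup - 1.0) / (node_ratio - 1.0)


if __name__ == "__main__":
    # Figures quoted in the comment: 5x speedup with 10x nodes.
    eff = parallel_efficiency(5.0, 10.0)
    serial = amdahl_serial_fraction(5.0, 10.0)
    print(f"parallel efficiency: {eff:.2f}")
    print(f"implied serial fraction (Amdahl): {serial:.3f}")
```

Under these assumptions, about one ninth of the baseline run time would be 
fixed overhead.  A measured per-phase breakdown would show whether that is 
plausible or whether something else is limiting the scaling.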

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory. It uses the reference implementation's code to 
> build each tree and estimate the out-of-bag (OOB) error.

