[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730292#action_12730292 ]

Deneche A. Hakim commented on MAHOUT-140:
-----------------------------------------

bq. But these numbers don't seem to show speedup over the results that you gave 
in MAHOUT-122 where a single node seemed to be able to build 50 trees on 5% of 
the data in <10m.

bq. My guess is that the comparison I am making is invalid.

bq. Can you clarify how things look so far? This seems like it ought to be much 
more promising than what I am saying. I don't understand how your small cluster 
here could be slower than the reference implementation.

I noticed too that running BuildForest on my laptop is faster than on EC2; I 
suspect that my laptop's CPU is faster than the EC2 instance that I used 
(m1.small). To be sure, I ran the sequential version, which allows the use of a 
specific seed and is thus repeatable, and got the following results:

The program uses the reference implementation to build 50 trees with KDD 10%, 
selecting 1 random variable at each tree node, starting with seed=1, estimating 
the out-of-bag (oob) error and using the optimized IG code:

|| Instance || Build time ||
| my laptop | 9m 14s 978ms |
| 1 m1.small | 28m 59s 510ms |
| 1 c1.medium | 11m 35s 286ms |

The m1.small is indeed slower than my laptop, but the reference implementation 
running on this instance still takes only 29m, compared to 45m when using the 
mapred implementation. But because the mapred implementation does not accept 
seed values for now, a direct comparison between the sequential and mapred 
implementations will be difficult.

I'm thinking of a way to make the mapred implementation use specific seeds: the 
main program passes a specific seed value (user parameter) to InMemInputFormat, 
and this seed is used to instantiate a Random object that generates a different 
seed for each InputSplit (mapper). This way I can make the reference 
implementation use the same scheme, given the desired number of mappers, and 
thus be able to compare the two implementations. What do you think of this 
scheme?
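
To make the idea concrete, here is a minimal sketch of the seed generation I 
have in mind (the class and method names below are only illustrative, not the 
actual InMemInputFormat code):

{code:java}
import java.util.Random;

/**
 * Illustrative sketch: derive one deterministic seed per InputSplit (mapper)
 * from a single user-supplied master seed.
 */
public class SplitSeeds {

  /**
   * @param masterSeed seed passed by the main program (user parameter)
   * @param numSplits  desired number of mappers / InputSplits
   * @return one seed per split, reproducible for a given masterSeed
   */
  public static long[] generateSeeds(long masterSeed, int numSplits) {
    Random rng = new Random(masterSeed);
    long[] seeds = new long[numSplits];
    for (int split = 0; split < numSplits; split++) {
      seeds[split] = rng.nextLong();
    }
    return seeds;
  }

  public static void main(String[] args) {
    // with seed=1 and 10 mappers the same 10 seeds are produced on every run,
    // so the sequential reference implementation can be driven with the exact
    // same per-tree randomness and the two implementations become comparable
    for (long s : generateSeeds(1L, 10)) {
      System.out.println(s);
    }
  }
}
{code}

The sequential version would simply iterate over the same seed array, growing 
the same share of trees with each seed.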

bq. A second question is why you don't see perfect speedup with an increasing 
cluster. Do you have any insight into how the time breaks down between hadoop 
MR startup, data cache loading, tree building, oob error estimation and storing 
output?

I noticed that loading the data can take some time, and because all the mappers 
do the loading, the loading time stays the same whether you use a small or a 
large cluster. I also noticed that compression is activated when using Hadoop 
on EC2, and it too takes some time to initialize after the mappers finish their 
work. But I need to run more tests and collect more info to be able to answer 
your question.
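
Just to illustrate why the loading time does not shrink with cluster size: 
every mapper loads the complete dataset in its configure() phase before growing 
its trees, roughly like the following (hypothetical names, old mapred API):

{code:java}
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

/**
 * Illustrative sketch only: each mapper loads the whole dataset before
 * building its trees, so the load cost is paid once per mapper no matter
 * how many mappers (and cluster nodes) are used.
 */
public class InMemMapperSketch extends MapReduceBase {

  private double[][] dataset; // placeholder for the in-memory dataset

  @Override
  public void configure(JobConf conf) {
    // hypothetical helper: reads the full dataset (e.g. from the
    // DistributedCache); this runs in every mapper, so its cost is constant
    // regardless of cluster size
    dataset = loadFullDataset(conf);
  }

  private static double[][] loadFullDataset(JobConf conf) {
    // placeholder for the actual loading code
    return new double[0][0];
  }
}
{code}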

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory, it uses the reference implementation's code to 
> build each tree and estimate the oob error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
