[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730292#action_12730292 ]
Deneche A. Hakim commented on MAHOUT-140:
-----------------------------------------

bq. But these numbers don't seem to show speedup over the results that you gave in MAHOUT-122 where a single node seemed to be able to build 50 trees on 5% of the data in <10m.
bq. My guess is that the comparison I am making is invalid.
bq. Can you clarify how things look so far? This seems like it ought to be much more promising than what I am saying. I don't understand how your small cluster here could be slower than the reference implementation.

I noticed too that running BuildForest on my laptop is faster than on EC2; I suspect that my laptop's CPU is faster than the EC2 instance that I used (m1.small). To be sure, I ran the sequential version, which allows the use of a specific seed and is thus repeatable, and got the following results. The program uses the reference implementation to build 50 trees with Kdd 10%, selecting 1 random variable at each tree node, starting with seed=1, estimating the oob error and using the optimized IG code:

|| Instance || Build time ||
| my laptop | 9m 14s 978ms |
| 1 m1.small | 28m 59s 510ms |
| 1 c1.medium | 11m 35s 286ms |

The m1.small is indeed slower than my laptop, but the reference implementation running on this instance still takes only 29m, compared to 45m with the mapred implementation. Because the mapred implementation does not accept seed values for now, comparing the sequential and mapred implementations will be difficult.

I'm thinking of a way to make the mapred implementation use specific seeds: the main program passes a specific seed value (a user parameter) to InMemInputFormat, and this seed is used to instantiate a Random object that generates a different seed for each InputSplit (mapper). This way I can make the reference implementation use the same scheme, given the desired number of mappers, and thus be able to compare the two implementations (a rough sketch of the idea follows at the end of this message). What do you think of this scheme?

bq. A second question is why you don't see perfect speedup with an increasing cluster. Do you have any insight into how the time breaks down between hadoop MR startup, data cache loading, tree building, oob error estimation and storing output?

I noticed that loading the data can take some time, and because all the mappers do the loading, the loading time stays the same whether you use a small or a large cluster. I also noticed that compression is activated when using Hadoop on EC2, and it also takes some time after the mappers finish their work. But I need to run more tests and collect more info to be able to answer your question.

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of the dataset loaded in memory; it uses the reference implementation's code to build each tree and estimate the oob error.
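To make the seeding scheme described above concrete, here is a minimal, self-contained Java sketch of the idea. It is only an illustration, not the actual InMemInputFormat code: the class and method names (SeedScheme, splitSeeds) are hypothetical, and the only assumption is that the driver knows the master seed and the number of mappers before the splits are created.

{code:java}
import java.util.Random;

/**
 * Hypothetical sketch of the proposed seeding scheme: a single user-supplied
 * master seed is used to derive one deterministic seed per InputSplit/mapper,
 * so the mapred and sequential implementations can grow identical trees given
 * the same number of mappers.
 */
public class SeedScheme {

  /** Derives one seed per mapper from the master seed. */
  static long[] splitSeeds(long masterSeed, int numMappers) {
    Random rng = new Random(masterSeed);
    long[] seeds = new long[numMappers];
    for (int m = 0; m < numMappers; m++) {
      seeds[m] = rng.nextLong();
    }
    return seeds;
  }

  public static void main(String[] args) {
    long masterSeed = 1L; // user parameter, e.g. the seed=1 used in the benchmarks above
    int numMappers = 10;  // known when the InputSplits are created

    // The driver would hand seeds[m] to the m-th InputSplit, while the
    // sequential reference implementation would consume the same array in
    // order, making the two runs directly comparable.
    long[] seeds = splitSeeds(masterSeed, numMappers);
    for (int m = 0; m < numMappers; m++) {
      System.out.println("mapper " + m + " -> seed " + seeds[m]);
    }
  }
}
{code}

Because both the mapred driver and the sequential reference run would consume the same per-mapper seed sequence in the same order, the forests grown by the two implementations should match for a given master seed and number of mappers, which is what would make the timing comparison meaningful.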