[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732922#action_12732922 ]
Deneche A. Hakim edited comment on MAHOUT-140 at 7/18/09 11:20 AM:
-------------------------------------------------------------------

* First of all, I implemented the *in-mem-sequential* builder, which simulates the execution of many mappers sequentially. I also implemented the seed generation scheme for the *in-mem-mapred* implementation. Passing the same seed to the *in-mem-mapred* and *in-mem-sequential* implementations generates the same trees with the same output, which should make the comparison easier.

* On a 10-instance cluster (ec2 c1.medium), building 200 trees with KDD10% and seed=1 gives:

|| in-mem-sequential || in-mem-mapred ||
| 0h 52m 38s 665 | 0h 13m 3s 691 |

It's a 4x speedup. I don't know if I should expect a higher speedup, so I ran some more tests to try to find what takes most of the time.

* I noticed that speculative execution was turned on. Passing (-Dmapred.map.tasks.speculative.execution=false) gives:

|| in-mem-mapred ||
| 0h 12m 46s 150 |

It doesn't seem to be the cause of the slowdown.

* How much time does the output take? This includes computing the oob estimate and outputting the trees and the oob predictions. I added a special job parameter (debug.mahout.rf.output): when false, the mappers don't compute the oob estimates and don't output anything, they just prepare the bags and build the trees. The result is:

|| in-mem-mapred ||
| 0h 12m 35s 557 |

So the output doesn't seem to take much time.

* How much time does launching and configuring the MR job take? This includes loading the data on all the nodes. Running the *in-mem-mapred* with just 10 trees, thus 1 tree per map, gives:

|| in-mem-mapred ||
| 0h 1m 36s 335 |

Starting up the MR job doesn't seem to take a lot of time. It seems that building the trees *is* what takes most of the time.

* Because I'm running a number of maps equal to the number of cluster nodes, if one map takes 100 minutes and all the other maps take only 1 minute, the job still takes 100 minutes to finish. I added a special job parameter (debug.mahout.rf.single.seed): when true, all mappers use the same seed, thus they all behave similarly. The results are:

|| in-mem-sequential || in-mem-mapred ||
| 0h 40m 39s 829 | 0h 9m 30s 577 |

In the *in-mem-sequential* implementation, every 20 trees take about 4 minutes to build, but in the *in-mem-mapred* implementation, each map takes 9 minutes to build 20 trees. It looks like building a single tree in a sequential manner is *2x faster* than building the same tree with the cluster! I don't have a lot of experience with clusters, is this normal? Maybe 10 instances is just too small to get a good speedup, or maybe there is a bug hiding somewhere (I can hear it walking in the code when the moon...)

was (Author: adeneche):

* First of all, I implemented the *in-mem-sequential* builder, which simulates the execution of many mappers sequentially. Passing the same seed to the *in-mem-mapred* and *in-mem-sequential* implementations generates the same trees with the same output, which should make the comparison easier.

* On a 10-instance cluster (ec2 c1.medium), building 200 trees with KDD10% and seed=1 gives:

|| in-mem-sequential || in-mem-mapred ||
| 0h 52m 38s 665 | 0h 13m 3s 691 |

It's a 4x speedup. I don't know if I should expect a higher speedup, so I ran some more tests to try to find what takes most of the time.

* I noticed that speculative execution was turned on. Passing (-Dmapred.map.tasks.speculative.execution=false) gives:

|| in-mem-mapred ||
| 0h 12m 46s 150 |

It doesn't seem to be the cause of the slowdown.

* How much time does the output take? This includes computing the oob estimate and outputting the trees and the oob predictions. I added a special job parameter (debug.mahout.rf.output): when false, the mappers don't compute the oob estimates and don't output anything, they just prepare the bags and build the trees. The result is:

|| in-mem-mapred ||
| 0h 12m 35s 557 |

So the output doesn't seem to take much time.

* How much time does launching and configuring the MR job take? This includes loading the data on all the nodes. Running the *in-mem-mapred* with just 10 trees, thus 1 tree per map, gives:

|| in-mem-mapred ||
| 0h 1m 36s 335 |

It seems that building the trees *is* what's taking most of the time.

* Because I'm running a number of maps equal to the number of cluster nodes, if one map takes 100 minutes and all the other maps take only 1 minute, the job still takes 100 minutes to finish. I added a special job parameter (debug.mahout.rf.single.seed): when true, all mappers use the same seed, thus they all behave similarly. The results are:

|| in-mem-sequential || in-mem-mapred ||
| 0h 40m 39s 829 | 0h 9m 30s 577 |

It looks like building a single tree in a sequential manner is 2x faster than building the same tree with the cluster! I don't have a lot of experience with clusters, is this normal? Maybe 10 instances is just too small to get a good speedup, or maybe there is a bug hiding somewhere (I can hear it walking in the code when the moon...)

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of
> the dataset loaded in memory. It uses the reference implementation's code to
> build each tree and estimate the oob error.
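
To put the timings quoted in the comment side by side, here is a minimal sketch that converts them to milliseconds and derives the speedup and the per-node efficiency (speedup divided by the number of instances). The class `TimingReport` and its methods are hypothetical helpers, not part of Mahout or of the attached patches, and the sketch assumes the trailing number in each timing (e.g. the `665` in `0h 52m 38s 665`) is milliseconds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: parses timings written like
// "0h 52m 38s 665" and compares a sequential run with a cluster run.
final class TimingReport {

  // Assumption: the last field is milliseconds.
  private static final Pattern TIMING =
      Pattern.compile("(\\d+)h\\s+(\\d+)m\\s+(\\d+)s\\s+(\\d+)");

  /** Converts a timing such as "0h 52m 38s 665" to milliseconds. */
  static long parseMillis(String timing) {
    Matcher m = TIMING.matcher(timing.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Unrecognized timing: " + timing);
    }
    long hours = Long.parseLong(m.group(1));
    long minutes = Long.parseLong(m.group(2));
    long seconds = Long.parseLong(m.group(3));
    long millis = Long.parseLong(m.group(4));
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
  }

  /** Sequential time divided by cluster time. */
  static double speedup(String sequential, String parallel) {
    return (double) parseMillis(sequential) / parseMillis(parallel);
  }

  /** Speedup divided by the node count; 1.0 would be a perfect linear speedup. */
  static double efficiency(String sequential, String parallel, int nodes) {
    return speedup(sequential, parallel) / nodes;
  }

  public static void main(String[] args) {
    int nodes = 10; // the 10-instance cluster from the comment

    // seed=1 run: in-mem-sequential vs in-mem-mapred
    System.out.printf("seed=1:      speedup %.2f, efficiency %.2f%n",
        speedup("0h 52m 38s 665", "0h 13m 3s 691"),
        efficiency("0h 52m 38s 665", "0h 13m 3s 691", nodes));

    // debug.mahout.rf.single.seed=true run
    System.out.printf("single seed: speedup %.2f, efficiency %.2f%n",
        speedup("0h 40m 39s 829", "0h 9m 30s 577"),
        efficiency("0h 40m 39s 829", "0h 9m 30s 577", nodes));
  }
}
```

An efficiency well below 1.0 on both runs is another way of stating the observation in the comment: each map spends noticeably longer on its 20 trees than the sequential builder does on the same number of trees.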