[jira] Issue Comment Edited: (MAHOUT-140) In-memory mapreduce Random Forests

Deneche A. Hakim (JIRA) Sat, 18 Jul 2009 11:22:40 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732922#action_12732922
 ]


Deneche A. Hakim edited comment on MAHOUT-140 at 7/18/09 11:20 AM:
-------------------------------------------------------------------

* First, I implemented the *in-mem-sequential* builder, which simulates the 
execution of many mappers sequentially. I also implemented the seed generation 
scheme for the *in-mem-mapred* implementation: passing the same seed to the 
*in-mem-mapred* and *in-mem-sequential* implementations generates the same 
trees with the same output, which should make the comparison easier.
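
To illustrate the reproducibility property described above, here is a minimal 
sketch. It is not the patch's code: the class and method names are 
hypothetical, the per-mapper seeds are assumed to be drawn from a single 
java.util.Random initialised with the job seed, and a bootstrap bag of row 
indices stands in for a built tree.

```java
import java.util.Random;

/** Hypothetical sketch: one job seed is expanded into one seed per mapper. */
final class SeedSchemeSketch {

  /** Derives nbMappers seeds from jobSeed; the same jobSeed gives the same seeds. */
  static long[] mapperSeeds(long jobSeed, int nbMappers) {
    Random rng = new Random(jobSeed);
    long[] seeds = new long[nbMappers];
    for (int i = 0; i < nbMappers; i++) {
      seeds[i] = rng.nextLong();
    }
    return seeds;
  }

  /** Stand-in for building a tree: draws nbRows row indices with replacement. */
  static int[] bootstrapBag(Random rng, int nbRows) {
    int[] bag = new int[nbRows];
    for (int i = 0; i < nbRows; i++) {
      bag[i] = rng.nextInt(nbRows);
    }
    return bag;
  }

  /** Simulates one mapper: builds treesPerMapper bags from its own seed. */
  static int[][] runMapper(long mapperSeed, int treesPerMapper, int nbRows) {
    Random rng = new Random(mapperSeed);
    int[][] bags = new int[treesPerMapper][];
    for (int t = 0; t < treesPerMapper; t++) {
      bags[t] = bootstrapBag(rng, nbRows);
    }
    return bags;
  }

  /** Sequential builder: runs the simulated mappers one after another. */
  static int[][][] runSequential(long jobSeed, int nbMappers, int treesPerMapper, int nbRows) {
    long[] seeds = mapperSeeds(jobSeed, nbMappers);
    int[][][] out = new int[nbMappers][][];
    for (int m = 0; m < nbMappers; m++) {
      out[m] = runMapper(seeds[m], treesPerMapper, nbRows);
    }
    return out;
  }
}
```

Because each simulated mapper depends only on its own seed, a mapper run in 
isolation (as it would be on a cluster node) produces the same bags as the 
corresponding step of the sequential run.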

* On a 10-instance cluster (EC2 c1.medium), building 200 trees with KDD10% and 
seed=1 gives:
|| in-mem-sequential || in-mem-mapred ||
| 0h 52m 38s 665 | 0h 13m 3s 691 |

That is a 4x speedup. I don't know whether I should expect a higher one, so I 
ran some more tests to find out what takes most of the time.
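
The speedup figure can be checked directly from the two timings in the table. 
The sketch below is only an illustration: the class name is hypothetical, and 
it assumes the last field of each timing is milliseconds.

```java
/** Hypothetical helper for checking the speedup from the reported timings. */
final class SpeedupSketch {

  /** Parses "0h 52m 38s 665" into milliseconds (last field assumed to be ms). */
  static long parseMillis(String text) {
    String[] parts = text.trim().split("\\s+");
    long h = Long.parseLong(parts[0].replace("h", ""));
    long m = Long.parseLong(parts[1].replace("m", ""));
    long s = Long.parseLong(parts[2].replace("s", ""));
    long ms = Long.parseLong(parts[3]);
    return ((h * 60 + m) * 60 + s) * 1000 + ms;
  }

  /** Speedup = sequential time divided by parallel time. */
  static double speedup(long sequentialMs, long parallelMs) {
    return (double) sequentialMs / parallelMs;
  }

  /** Parallel efficiency = speedup divided by the number of nodes. */
  static double efficiency(double speedup, int nodes) {
    return speedup / nodes;
  }
}
```

With the values above, speedup(parseMillis("0h 52m 38s 665"), 
parseMillis("0h 13m 3s 691")) is about 4.03, which on 10 instances 
corresponds to a parallel efficiency of about 0.40.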

* I noticed that speculative execution was turned on. Passing 
(-Dmapred.map.tasks.speculative.execution=false) gives:
|| in-mem-mapred ||
| 0h 12m 46s 150 |
So speculative execution doesn't seem to be the cause of the slowdown.

* How much time does the output take? This includes computing the oob estimate 
and writing out the trees and the oob predictions. I added a special job 
parameter (debug.mahout.rf.output): when it is false, the mappers don't compute 
the oob estimates and don't output anything; they just prepare the bags and 
build the trees. The result is:
|| in-mem-mapred ||
| 0h 12m 35s 557 |
So the output doesn't seem to take much time.

* How much time does launching and configuring the MR job take? This includes 
loading the data on all the nodes. Running the *in-mem-mapred* implementation 
with just 10 trees, thus 1 tree per map, gives:
|| in-mem-mapred ||
| 0h 1m 36s 335 |

Starting up the MR job doesn't seem to take a lot of time; it seems that 
building the trees *is* what takes most of the time.
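
A rough way to quantify this from the numbers above is sketched below. It is 
only an estimate under stated assumptions: the class and method names are 
hypothetical, the startup cost is assumed to be the same in the 10-tree and 
200-tree runs, the job time is assumed to grow linearly with the number of 
trees per map, and the 200-tree baseline (0h 13m 3s 691) is used for 
comparison even though its settings may differ slightly from the 10-tree run.

```java
/** Hypothetical helper: bounds the startup cost using the 10-tree run. */
final class StartupCostSketch {

  /** Converts hours, minutes, seconds and milliseconds to milliseconds. */
  static long millis(int h, int m, int s, int ms) {
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
  }

  /**
   * The small job includes startup plus one tree per map, so its duration is
   * an upper bound on startup; this returns that bound as a share of the full job.
   */
  static double upperBoundShare(long smallJobMs, long fullJobMs) {
    return (double) smallJobMs / fullJobMs;
  }

  /** Cost of each extra tree per map, assuming job time is linear in trees per map. */
  static double perExtraTreeMs(long smallJobMs, int smallTreesPerMap,
                               long fullJobMs, int fullTreesPerMap) {
    return (double) (fullJobMs - smallJobMs) / (fullTreesPerMap - smallTreesPerMap);
  }
}
```

Under these assumptions, startup accounts for at most about 12% of the 
200-tree run, and each additional tree per map costs roughly 36 seconds.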

* Because I'm running a number of maps equal to the number of cluster nodes, if 
one map takes 100 minutes and all the other maps take only 1 minute, the job 
still takes 100 minutes to finish. I added a special job parameter 
(debug.mahout.rf.single.seed): when it is true, all mappers use the same seed 
and thus all behave similarly. The results are:
|| in-mem-sequential || in-mem-mapred ||
| 0h 40m 39s 829 | 0h 9m 30s 577 |

In the *in-mem-sequential* implementation, each batch of 20 trees takes about 
4 minutes to build, but in the *in-mem-mapred* implementation, each map takes 
about 9 minutes to build 20 trees. It looks like building a single tree 
sequentially is *2x faster* than building the same tree on the cluster! I don't 
have a lot of experience with clusters; is this normal? Maybe 10 instances is 
just too small to get a good speedup, or maybe there is a bug hiding somewhere 
(I can hear it walking in the code when the moon...)
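
The two observations in this last test can be worked through as follows. This 
is a sketch with hypothetical names: the first method expresses that a job 
with one wave of maps finishes when its slowest map does, and the second 
scales a total build time down to a batch of trees. The whole job time 
(0h 9m 30s 577) is used as a stand-in for the time of a single map, which is 
only given above as "9 minutes".

```java
import java.util.Arrays;

/** Hypothetical helper for the single-seed comparison. */
final class MapTimeSketch {

  /** With one wave of maps running in parallel, the job lasts as long as the slowest map. */
  static double makespanMinutes(double[] mapMinutes) {
    return Arrays.stream(mapMinutes).max().orElse(0.0);
  }

  /** Time spent on batchTrees trees if totalMs was spent evenly on totalTrees trees. */
  static double perBatchMillis(long totalMs, int totalTrees, int batchTrees) {
    return (double) totalMs * batchTrees / totalTrees;
  }
}
```

For one map of 100 minutes and nine maps of 1 minute, makespanMinutes returns 
100. For the single-seed runs, perBatchMillis(2439829, 200, 20) gives about 
244 seconds (about 4 minutes) per 20 trees sequentially, against about 571 
seconds for a map building 20 trees on the cluster, a ratio of roughly 2.3.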

      was (Author: adeneche):
    * First, I implemented the *in-mem-sequential* builder, which simulates 
the execution of many mappers sequentially. Passing the same seed to the 
*in-mem-mapred* and *in-mem-sequential* implementations generates the same 
trees with the same output, which should make the comparison easier.

* On a 10-instance cluster (EC2 c1.medium), building 200 trees with KDD10% and 
seed=1 gives:
|| in-mem-sequential || in-mem-mapred ||
| 0h 52m 38s 665 | 0h 13m 3s 691 |

That is a 4x speedup. I don't know whether I should expect a higher one, so I 
ran some more tests to find out what takes most of the time.

* I noticed that speculative execution was turned on. Passing 
(-Dmapred.map.tasks.speculative.execution=false) gives:
|| in-mem-mapred ||
| 0h 12m 46s 150 |
So speculative execution doesn't seem to be the cause of the slowdown.

* How much time does the output take? This includes computing the oob estimate 
and writing out the trees and the oob predictions. I added a special job 
parameter (debug.mahout.rf.output): when it is false, the mappers don't compute 
the oob estimates and don't output anything; they just prepare the bags and 
build the trees. The result is:
|| in-mem-mapred ||
| 0h 12m 35s 557 |
So the output doesn't seem to take much time.

* How much time does launching and configuring the MR job take? This includes 
loading the data on all the nodes. Running the *in-mem-mapred* implementation 
with just 10 trees, thus 1 tree per map, gives:
|| in-mem-mapred ||
| 0h 1m 36s 335 |

It seems that building the trees *is* what takes most of the time.

* Because I'm running a number of maps equal to the number of cluster nodes, if 
one map takes 100 minutes and all the other maps take only 1 minute, the job 
still takes 100 minutes to finish. I added a special job parameter 
(debug.mahout.rf.single.seed): when it is true, all mappers use the same seed 
and thus all behave similarly. The results are:
|| in-mem-sequential || in-mem-mapred ||
| 0h 40m 39s 829 | 0h 9m 30s 577 |

It looks like building a single tree sequentially is 2x faster than building 
the same tree on the cluster! I don't have a lot of experience with clusters; 
is this normal? Maybe 10 instances is just too small to get a good speedup, or 
maybe there is a bug hiding somewhere (I can hear it walking in the code when 
the moon...)
  
> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory. It uses the reference implementation's code to 
> build each tree and estimate the oob error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
