Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
Actually, I'm not using any reducer at all; the output of the mappers is collected and handled by the main program after the end of the job.

Running the job with 10 map tasks on a 10-instance (c1.medium) cluster takes 0h 11m 39s 209; speculative execution is on, so 12 map tasks were launched.

Running the same job with 5x10 map tasks takes 0h 11m 54s 962 (59 map tasks were launched).

Running the same job again with 5x10 map tasks and the job parameter mapred.job.reuse.jvm.num.tasks=-1 (no limit on how many tasks to run per JVM) takes 0h 11m 57s 115.

--- On Sat, 18.7.09, Ted Dunning wrote:

> From: Ted Dunning
> Subject: Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
> To: mahout-dev@lucene.apache.org
> Date: Saturday, July 18, 2009, 8:36 PM
>
> This is interesting.
>
> Is the reduce trivial here? (If so, then shuffling isn't the problem,
> and you may have demonstrated this with your no-output version.)
>
> What happens if you increase the number of maps to 5x the number of nodes?
>
> On Sat, Jul 18, 2009 at 11:11 AM, Deneche A. Hakim (JIRA) wrote:
>
> > It looks like building a single tree in a sequential manner is 2x faster
> > than building the same tree with the cluster !!! I don't have a lot of
> > experience with clusters; is it normal? Maybe 10 instances is just too
> > small to get a good speedup, or maybe there is a bug hiding somewhere (I
> > can hear it walking in the code when the moon...)
>
> --
> Ted Dunning, CTO
> DeepDyve
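For reference, the job parameters exercised in the runs above can be collected in one place. This is only a sketch, with a plain Properties object standing in for the JobConf (which isn't available here); in the real job these would be passed as -D command-line options:

```java
import java.util.Properties;

public class ForestJobParams {
    // Hadoop 0.20-era parameter names as used in the runs above,
    // gathered as plain properties for illustration only.
    static Properties tuning() {
        Properties conf = new Properties();
        conf.setProperty("mapred.map.tasks", "50");                 // 5x the 10 nodes
        conf.setProperty("mapred.job.reuse.jvm.num.tasks", "-1");   // -1 = no per-JVM task limit
        conf.setProperty("mapred.map.tasks.speculative.execution", "true"); // was on in these runs
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(tuning());
    }
}
```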
Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
This is interesting.

Is the reduce trivial here? (If so, then shuffling isn't the problem, and you may have demonstrated this with your no-output version.)

What happens if you increase the number of maps to 5x the number of nodes?

On Sat, Jul 18, 2009 at 11:11 AM, Deneche A. Hakim (JIRA) wrote:

> It looks like building a single tree in a sequential manner is 2x faster
> than building the same tree with the cluster !!! I don't have a lot of
> experience with clusters; is it normal? Maybe 10 instances is just too
> small to get a good speedup, or maybe there is a bug hiding somewhere (I
> can hear it walking in the code when the moon...)

--
Ted Dunning, CTO
DeepDyve
[jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732922#action_12732922 ]

Deneche A. Hakim commented on MAHOUT-140:
-----------------------------------------

* First of all, I implemented the *in-mem-sequential* builder, which simulates the execution of many mappers in a sequential manner. Passing the same seed to the *in-mem-mapred* and *in-mem-sequential* implementations generates the same trees with the same output; this should make the comparison easier.
* On a 10-instance cluster (EC2 c1.medium), building 200 trees with KDD10% and seed=1 gives:
|| in-mem-sequential || in-mem-mapred ||
| 0h 52m 38s 665 | 0h 13m 3s 691 |
That's a 4x speedup. I don't know if I should expect a higher speedup, so I ran some more tests to try to find what takes most of the time.
* I noticed that speculative execution was turned on; passing -Dmapred.map.tasks.speculative.execution=false gives:
|| in-mem-mapred ||
| 0h 12m 46s 150 |
It doesn't seem to be the cause of the slowdown.
* How much time does the output take? This includes computing the oob estimate and outputting the trees and the oob predictions. I added a special job parameter (debug.mahout.rf.output); when false, the mappers don't compute the oob estimates and don't output anything, they just prepare the bags and build the trees. The result is:
|| in-mem-mapred ||
| 0h 12m 35s 557 |
Actually, the output doesn't seem to take much time.
* How much time does launching and configuring the MR job take? This includes loading the data on all the nodes. Running the *in-mem-mapred* builder with just 10 trees, thus 1 tree per map, gives:
|| in-mem-mapred ||
| 0h 1m 36s 335 |
It seems that building the trees *is* what's taking most of the time.
* Because I'm running a number of maps equal to the number of cluster nodes, if one map takes 100 minutes and all the other maps take only 1 minute, the job still takes 100 minutes to finish.
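As a sanity check, the first table above works out to roughly a 4x speedup, and the last bullet's point is that job time is the maximum of the map times, not the average. A small sketch of the arithmetic (timings copied from the tables above; the 100-minute straggler is the hypothetical from the last bullet):

```java
public class SpeedupMath {
    // Convert an "Xh Ym Zs" timing from the tables above to seconds.
    static long seconds(long h, long m, long s) {
        return h * 3600 + m * 60 + s;
    }

    // A job with one map per node finishes only when the slowest map
    // does: job time is the maximum of the map times, not the average.
    static long jobMinutes(long[] mapMinutes) {
        long max = 0;
        for (long t : mapMinutes) max = Math.max(max, t);
        return max;
    }

    public static void main(String[] args) {
        long sequential = seconds(0, 52, 38); // in-mem-sequential, 200 trees
        long mapred = seconds(0, 13, 3);      // in-mem-mapred, 10 c1.medium nodes
        System.out.printf("speedup = %.1fx%n", (double) sequential / mapred); // ~4.0x

        // One hypothetical straggler dominates a job where every other
        // map finishes in 1 minute.
        long[] maps = {100, 1, 1, 1, 1, 1, 1, 1, 1, 1};
        System.out.println("job time = " + jobMinutes(maps) + " min"); // 100
    }
}
```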
I added a special job parameter (debug.mahout.rf.single.seed); when true, all mappers use the same seed, so they all behave similarly. The results are:
|| in-mem-sequential || in-mem-mapred ||
| 0h 40m 39s 829 | 0h 9m 30s 577 |
It looks like building a single tree in a sequential manner is 2x faster than building the same tree with the cluster !!! I don't have a lot of experience with clusters; is it normal? Maybe 10 instances is just too small to get a good speedup, or maybe there is a bug hiding somewhere (I can hear it walking in the code when the moon...)

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
> Each mapper is responsible for growing a number of trees with a whole copy of
> the dataset loaded in memory. It uses the reference implementation's code to
> build each tree and estimate the oob error.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
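The "2x slower on the cluster" observation above can be made concrete from the reported timings, assuming the mapred wall time is dominated by tree building and the 200 trees split evenly over the 10 mappers (both assumptions, not measurements):

```java
public class PerTreeCost {
    // Seconds per tree, given total wall-clock seconds and the number of
    // trees one worker builds in that time.
    static double perTree(long wallSeconds, int treesPerWorker) {
        return (double) wallSeconds / treesPerWorker;
    }

    public static void main(String[] args) {
        int trees = 200, mappers = 10;
        // Timings copied from the single-seed table above.
        double seq = perTree(40 * 60 + 39, trees);          // sequential builds all 200
        double mr  = perTree(9 * 60 + 30, trees / mappers); // each mapper builds 20
        System.out.printf("sequential: %.1f s/tree, cluster mapper: %.1f s/tree (%.1fx slower)%n",
                seq, mr, mr / seq);
    }
}
```

Under those assumptions a mapper spends roughly 28.5 s per tree against roughly 12.2 s sequentially, which is the "2x faster" gap reported above.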
[jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730292#action_12730292 ]

Deneche A. Hakim commented on MAHOUT-140:
-----------------------------------------

bq. But these numbers don't seem to show speedup over the results that you gave in MAHOUT-122 where a single node seemed to be able to build 50 trees on 5% of the data in <10m.
bq. My guess is that the comparison I am making is invalid.
bq. Can you clarify how things look so far? This seems like it ought to be much more promising than what I am saying. I don't understand how your small cluster here could be slower than the reference implementation.

I noticed too that running BuildForest on my laptop is faster than on EC2; I suspect that my laptop's CPU is faster than the EC2 instance that I used (m1.small). To be sure, I ran the sequential version, which allows the use of a specific seed and is thus repeatable, and got the following results. The program uses the reference implementation to build 50 trees with KDD10%, selecting 1 random variable at each tree node, starting with seed=1, estimating the oob error and using the optimized IG code:

|| Instance || build time ||
| my laptop | 9m 14s 978 |
| 1 m1.small | 28m 59s 510 |
| 1 c1.medium | 11m 35s 286 |

The m1.small is, indeed, slower than my laptop, but the reference implementation running on this instance still takes only 29m, compared to 45m when using the mapred implementation. Because the mapred implementation does not accept seed values for now, comparing the sequential and mapred implementations will be difficult. I'm thinking of a way to make the mapred implementation use specific seeds: the main program passes a specific seed value (user parameter) to InMemInputFormat; this seed is used to instantiate a Random object, which generates a different seed for each InputSplit (mapper).
This way I can make the reference implementation use the same scheme, given the desired number of mappers, and thus be able to compare the two implementations. What do you think of this scheme?

bq. A second question is why you don't see perfect speedup with an increasing cluster. Do you have any insight into how the time breaks down between hadoop MR startup, data cache loading, tree building, oob error estimation and storing output?

I noticed that loading the data can take some time, and because all the mappers do the loading, the loading time is always the same whether you use a small or a large cluster. I also noticed that compression is activated when using Hadoop on EC2, and it also takes some time to initialize after the mappers finish their work. But I need to run more tests and collect more info to be able to answer your question.
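The seeding scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the actual InMemInputFormat code: the main program supplies one master seed, and the input format derives a distinct but reproducible seed for each InputSplit, so the sequential implementation can reuse the same derivation and build identical trees:

```java
import java.util.Arrays;
import java.util.Random;

public class SplitSeeds {
    // Derive one seed per InputSplit (mapper) from a single master seed.
    // Deterministic: the same master seed always yields the same seeds,
    // which is what makes the two implementations comparable.
    static long[] seedsFor(long masterSeed, int numSplits) {
        Random rng = new Random(masterSeed);
        long[] seeds = new long[numSplits];
        for (int i = 0; i < numSplits; i++) {
            seeds[i] = rng.nextLong(); // one seed per split, in split order
        }
        return seeds;
    }

    public static void main(String[] args) {
        long[] a = seedsFor(1L, 10);
        long[] b = seedsFor(1L, 10); // same master seed => same per-split seeds
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```

A sequential builder given the same master seed and the same number of "mappers" would then walk the splits in order and grow the same trees.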
[jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730128#action_12730128 ]

Ted Dunning commented on MAHOUT-140:
------------------------------------

These results look *really* promising, but I am curious about how to interpret these numbers. It appears that you get decent speed-up with a larger cluster (5x speedup with 10x nodes). But these numbers don't seem to show speedup over the results that you gave in MAHOUT-122, where a single node seemed to be able to build 50 trees on 5% of the data in <10m.

My guess is that the comparison I am making is invalid. Can you clarify how things look so far? This seems like it ought to be much more promising than what I am saying. I don't understand how your small cluster here could be slower than the reference implementation.

A second question is why you don't see perfect speedup with an increasing cluster. Do you have any insight into how the time breaks down between hadoop MR startup, data cache loading, tree building, oob error estimation and storing output?