Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests

2009-07-18 Thread deneche abdelhakim

Actually, I'm not using any reducer at all; the output of the mappers is 
collected and handled by the main program after the job ends.
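
Roughly, the driver side looks like this (a simplified sketch with the old 
mapred API; the class, key and value types below are placeholders, not the 
actual MAHOUT-140 code, and IdentityMapper stands in for the tree-building 
mapper):

  import org.apache.hadoop.fs.*;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class MapOnlyJobSketch {
    public static void run(Path outputPath) throws Exception {
      JobConf conf = new JobConf(MapOnlyJobSketch.class);
      conf.setNumReduceTasks(0);                    // no reducer: no shuffle/sort phase at all
      conf.setMapperClass(IdentityMapper.class);    // stand-in for the tree-building mapper
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      conf.setOutputKeyClass(IntWritable.class);    // e.g. tree index
      conf.setOutputValueClass(Text.class);         // e.g. serialized tree (placeholder type)
      FileOutputFormat.setOutputPath(conf, outputPath);
      JobClient.runJob(conf);                       // blocks until the job finishes

      // the main program then collects the mappers' output itself
      FileSystem fs = outputPath.getFileSystem(conf);
      IntWritable key = new IntWritable();
      Text value = new Text();
      for (FileStatus status : fs.listStatus(outputPath)) {
        if (!status.getPath().getName().startsWith("part-")) continue;
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        while (reader.next(key, value)) {
          // handle one mapper output record (e.g. a tree and its oob predictions)
        }
        reader.close();
      }
    }
  }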

Running the job with 10 map tasks on a 10-instance (c1.medium) cluster takes 
0h 11m 39s 209; speculative execution is on, so 12 map tasks were actually launched.

Running the same job with 5x10 map tasks takes 0h 11m 54s 962; 59 map tasks 
were launched.

And running the same job again with 5x10 map tasks and the job parameter 
mapred.job.reuse.jvm.num.tasks=-1 (no limit on how many tasks to run per JVM) 
takes 0h 11m 57s 115.
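
For reference, the same setting can also be applied from the driver (a minimal 
sketch, assuming the old JobConf API):

  // equivalent to passing -Dmapred.job.reuse.jvm.num.tasks=-1 on the command line
  conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);  // -1 = unlimited tasks per JVM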

--- On Sat 18.7.09, Ted Dunning  wrote:

> From: Ted Dunning 
> Subject: Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests
> To: mahout-dev@lucene.apache.org
> Date: Saturday, 18 July 2009, 20:36
> This is interesting.
> 
> Is the reduce trivial here? (if so, then shuffling isn't the problem and
> you may have demonstrated this with your no-output version)
> 
> What happens if you increase the number of maps to 5x the
> number of nodes?
> 
> 
> 
> On Sat, Jul 18, 2009 at 11:11 AM, Deneche A. Hakim (JIRA)
> wrote:
> 
> > It looks like building a single tree sequentially is 2x faster than
> > building the same tree on the cluster!!! I don't have a lot of experience
> > with clusters; is this normal??? Maybe 10 instances is just too small to
> > get a good speedup, or maybe there is a bug hiding somewhere (I can hear
> > it walking in the code when the moon...)
> >
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 





Re: [jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests

2009-07-18 Thread Ted Dunning
This is interesting.

Is the reduce trivial here? (if so, then shuffling isn't the problem and
you may have demonstrated this with your no-output version)

What happens if you increase the number of maps to 5x the number of nodes?



On Sat, Jul 18, 2009 at 11:11 AM, Deneche A. Hakim (JIRA)
wrote:

> It looks like building a single tree sequentially is 2x faster than
> building the same tree on the cluster!!! I don't have a lot of experience
> with clusters; is this normal??? Maybe 10 instances is just too small to
> get a good speedup, or maybe there is a bug hiding somewhere (I can hear
> it walking in the code when the moon...)
>



-- 
Ted Dunning, CTO
DeepDyve


[jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests

2009-07-18 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732922#action_12732922
 ] 

Deneche A. Hakim commented on MAHOUT-140:
-

* First of all, I implemented the *in-mem-sequential* builder, which simulates 
the execution of many mappers in a sequential manner. Passing the same seed to 
the *in-mem-mapred* and *in-mem-sequential* implementations generates the same 
trees with the same output, which should make the comparison easier.

* On a 10-instance cluster (EC2 c1.medium), building 200 trees with KDD10% and 
seed=1 gives:
|| in-mem-sequential || in-mem-mapred ||
| 0h 52m 38s 665 | 0h 13m 3s 691 |

It's a 4x speedup. I don't know if I should expect a higher speedup, so I ran 
some more tests to try to find out what takes most of the time.

* I noticed that speculative execution was turned on; passing 
-Dmapred.map.tasks.speculative.execution=false gives:
|| in-mem-mapred||
| 0h 12m 46s 150 |
It doesn't seem to be the cause of the slowdown.

* How much time does the output take? This includes computing the oob estimate 
and outputting the trees and the oob predictions. I added a special job 
parameter (debug.mahout.rf.output); when it is false the mappers don't compute 
the oob estimates and don't output anything, they just prepare the bags and 
build the trees (see the sketch at the end of this comment). The result is:
|| in-mem-mapred||
| 0h 12m 35s 557 |
The output doesn't actually seem to take much time.

* How much time do launching and configuring the MR job take? This includes 
loading the data on all the nodes. Running *in-mem-mapred* with just 10 trees, 
thus 1 tree per map, gives:
|| in-mem-mapred||
| 0h 1m 36s 335 |

It seems that building the trees *is* what takes most of the time.

* Because I'm running a number of maps equal to the number of cluster nodes, 
if one map takes 100 minutes and all the other maps take only 1 minute, the job 
still takes 100 minutes to finish. I added a special job parameter 
(debug.mahout.rf.single.seed); when it is true all the mappers use the same 
seed and thus behave similarly (again, see the sketch at the end of this 
comment). The results are:
|| in-mem-sequential || in-mem-mapred ||
| 0h 40m 39s 829 | 0h 9m 30s 577 |

It looks like building a single tree sequentially is 2x faster than building 
the same tree on the cluster!!! I don't have a lot of experience with clusters; 
is this normal??? Maybe 10 instances is just too small to get a good speedup, 
or maybe there is a bug hiding somewhere (I can hear it walking in the code 
when the moon...)
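
For reference, here is a rough sketch of how the two debug parameters mentioned 
above could be read in the mapper. This is only an illustration of the idea, not 
the actual patch code; the key/value types and the mahout.rf.seed property name 
are placeholders:

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class InMemMapperSketch extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private boolean doOutput;    // debug.mahout.rf.output
  private boolean singleSeed;  // debug.mahout.rf.single.seed
  private long seed;

  @Override
  public void configure(JobConf job) {
    doOutput = job.getBoolean("debug.mahout.rf.output", true);
    singleSeed = job.getBoolean("debug.mahout.rf.single.seed", false);
    if (singleSeed) {
      // every mapper uses the same seed, so they all behave similarly
      seed = job.getLong("mahout.rf.seed", 1L);   // hypothetical property name
    }
    // otherwise the seed would come from the mapper's own InputSplit
  }

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<IntWritable, Text> output, Reporter reporter)
      throws IOException {
    // ... prepare the bags and build the trees using 'seed' ...
    if (!doOutput) {
      return;   // skip the oob estimates and emit nothing
    }
    // ... compute the oob predictions and emit the trees via 'output' ...
  }
}
{code}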

> In-memory mapreduce Random Forests
> --
>
> Key: MAHOUT-140
> URL: https://issues.apache.org/jira/browse/MAHOUT-140
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Deneche A. Hakim
>Priority: Minor
> Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory; it uses the reference implementation's code to 
> build each tree and estimate the oob error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests

2009-07-13 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730292#action_12730292
 ] 

Deneche A. Hakim commented on MAHOUT-140:
-

bq. But these numbers don't seem to show speedup over the results that you gave 
in MAHOUT-122 where a single node seemed to be able to build 50 trees on 5% of 
the data in <10m.

bq. My guess is that the comparison I am making is invalid.

bq. Can you clarify how things look so far? This seems like it ought to be much 
more promising than what I am seeing. I don't understand how your small cluster 
here could be slower than the reference implementation.

I noticed too that running BuildForest on my laptop is faster than on EC2; I 
suspect that my laptop's CPU is faster than the EC2 instance I used (m1.small). 
To be sure, I ran the sequential version, which allows the use of a specific 
seed and is thus repeatable, and got the following results.

The program uses the reference implementation to build 50 trees with KDD10%, 
selecting 1 random variable at each tree node, starting with seed=1, estimating 
the oob error, and using the optimized IG code:

|| Instance || build time ||
| my laptop | 9m 14s 978 |
| 1 m1.small | 28m 59s 510 |
| 1 c1.medium | 11m 35s 286 |

The m1.small is indeed slower than my laptop, but the reference implementation 
running on this instance still takes only 29m, compared to 45m when using the 
mapred implementation. And because the mapred implementation does not accept 
seed values for now, comparing the sequential and mapred implementations will 
be difficult.

I'm thinking of a way to make the mapred implementation use specific seeds: the 
main program passes a specific seed value (a user parameter) to InMemInputFormat, 
and this seed is used to instantiate a Random object that generates a different 
seed for each InputSplit (mapper). This way I can make the reference 
implementation use the same scheme, given the desired number of mappers, and 
thus be able to compare the two implementations. What do you think of this 
scheme?
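
In other words, something along these lines (just a sketch of the scheme; the 
actual InMemInputFormat code may end up looking different):

{code:java}
import java.util.Random;

public class SplitSeedsSketch {
  /** One deterministic seed per InputSplit/mapper, derived from the user's seed. */
  static long[] seedsFor(long masterSeed, int numSplits) {
    Random rng = new Random(masterSeed);  // masterSeed = the user-supplied seed
    long[] seeds = new long[numSplits];
    for (int i = 0; i < numSplits; i++) {
      seeds[i] = rng.nextLong();          // same masterSeed => same sequence of seeds
    }
    return seeds;
  }
}
{code}

The sequential implementation would call the same method with the same masterSeed 
and the same number of "mappers", so simulated mapper i and real mapper i would 
grow identical trees.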

bq. A second question is why you don't see perfect speedup with an increasing 
cluster. Do you have any insight into how the time breaks down between hadoop 
MR startup, data cache loading, tree building, oob error estimation and storing 
output?

I noticed that loading the data can take some time, and because all the mappers 
do the loading, the loading time is the same whether you use a small or a large 
cluster. I also noticed that compression is activated when using Hadoop on EC2, 
and it also takes some time to initialize after the mappers finish their work. 
But I need to run more tests and collect more info before I can answer your 
question.

> In-memory mapreduce Random Forests
> --
>
> Key: MAHOUT-140
> URL: https://issues.apache.org/jira/browse/MAHOUT-140
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Deneche A. Hakim
>Priority: Minor
> Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory; it uses the reference implementation's code to 
> build each tree and estimate the oob error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-140) In-memory mapreduce Random Forests

2009-07-12 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12730128#action_12730128
 ] 

Ted Dunning commented on MAHOUT-140:


These results look *really* promising.

But I am curious about how to interpret these numbers. It appears that you get 
decent speedup with a larger cluster (5x speedup with 10x the nodes).

But these numbers don't seem to show speedup over the results that you gave in 
MAHOUT-122 where a single node seemed to be able to build 50 trees on 5% of the 
data in <10m.

My guess is that the comparison I am making is invalid.

Can you clarify how things look so far?  This seems like it ought to be much 
more promising than what I am seeing.  I don't understand how your small 
cluster here could be slower than the reference implementation.

A second question is why you don't see perfect speedup with an increasing 
cluster.  Do you have any insight into how the time breaks down between hadoop 
MR startup, data cache loading, tree building, oob error estimation and storing 
output?

> In-memory mapreduce Random Forests
> --
>
> Key: MAHOUT-140
> URL: https://issues.apache.org/jira/browse/MAHOUT-140
> Project: Mahout
>  Issue Type: New Feature
>  Components: Classification
>Affects Versions: 0.2
>Reporter: Deneche A. Hakim
>Priority: Minor
> Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory; it uses the reference implementation's code to 
> build each tree and estimate the oob error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.