[ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated MAHOUT-140:
------------------------------------

    Attachment: mapred_patch.diff

org.apache.mahout.rf.mapred

To keep things simple, this patch also includes MAHOUT-122 (the reference 
implementation).

An in-memory mapreduce implementation of Random Forests. Each mapper is 
responsible for growing a number of trees with a whole copy of the dataset 
loaded in memory. It uses the reference implementation's code to build each 
tree and estimate the oob (out-of-bag) error.

No input data is needed: the dataset is distributed to the slave nodes using 
the DistributedCache. A custom InputFormat (InMemInputFormat) is configured 
with the desired number of trees and generates a number of InputSplits 
(InMemInputSplit) equal to the configured number of maps (mapred.map.tasks).
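
The description above does not say how the trees are divided among the 
splits. The following is only a sketch of one plausible scheme (an even 
division, with the first splits taking the remainder). The class 
TreeSplitSketch and the method treesPerSplit are hypothetical names, not 
taken from the patch, and InMemInputFormat may do this differently.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the patch.
class TreeSplitSketch {

    /**
     * Divides nbTrees among numSplits splits as evenly as possible.
     * The first (nbTrees % numSplits) splits receive one extra tree.
     */
    public static int[] treesPerSplit(int nbTrees, int numSplits) {
        if (nbTrees < 0 || numSplits <= 0) {
            throw new IllegalArgumentException(
                "nbTrees must be >= 0 and numSplits must be > 0");
        }
        int[] counts = new int[numSplits];
        int base = nbTrees / numSplits;
        int remainder = nbTrees % numSplits;
        for (int i = 0; i < numSplits; i++) {
            counts[i] = base + (i < remainder ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 trees over 4 map tasks
        System.out.println(Arrays.toString(treesPerSplit(10, 4)));
    }
}
```

With a scheme like this, every map task grows roughly the same number of 
trees, so no single mapper becomes the bottleneck on a balanced cluster.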

No reducer is needed either. Each map outputs (InMemOutput) the trees it 
built and, for each tree, the labels that tree predicted for its oob 
instances. This step has to be done in the mapper because only the mapper 
knows which instances are oob.
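
To illustrate why the oob instances are only known where the tree is grown, 
here is a minimal sketch that assumes the usual bagging scheme: each tree is 
trained on n instances drawn with replacement, and the instances never drawn 
are that tree's oob instances. The class OobSketch and its methods are 
hypothetical and are not the reference implementation's code.

```java
import java.util.Random;

// Hypothetical illustration of bagging; not the patch's code.
class OobSketch {

    /**
     * Draws n indices with replacement from [0, n) and returns a mask where
     * inBag[i] is true if instance i was drawn at least once.
     */
    public static boolean[] bootstrap(int n, Random rng) {
        boolean[] inBag = new boolean[n];
        for (int k = 0; k < n; k++) {
            inBag[rng.nextInt(n)] = true;
        }
        return inBag;
    }

    /** Counts the instances that were never drawn, i.e. the oob instances. */
    public static int countOob(boolean[] inBag) {
        int oob = 0;
        for (boolean b : inBag) {
            if (!b) {
                oob++;
            }
        }
        return oob;
    }

    public static void main(String[] args) {
        boolean[] inBag = bootstrap(1000, new Random(42L));
        System.out.println("oob instances: " + countOob(inBag)
            + " of " + inBag.length);
    }
}
```

The in-bag mask depends on the random draws made while growing each tree, so 
it is simplest to compute the oob predictions in the same place.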

The main program (InMemBuilder) is responsible for configuring and launching 
the job. At the end of the job it parses the output files and builds the 
corresponding RandomForest. For each tree's predictions it calls a 
PredictionCallback (if one is available), which lets the caller compute any 
error measure it needs.
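
As a rough idea of how a caller could use such a callback, the sketch below 
computes a majority-vote oob error. The method signature shown for 
PredictionCallback is an assumption made for this example, and 
OobErrorCallback is a hypothetical class; the interface in the patch may 
look different.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only; names and signatures here are assumptions.
class CallbackSketch {

    /** Hypothetical signature: one call per (tree, oob instance) pair. */
    interface PredictionCallback {
        void prediction(int treeId, int instanceId, int predictedLabel);
    }

    /** Collects the oob votes per instance and reports the error rate. */
    static class OobErrorCallback implements PredictionCallback {
        private final int[] labels;   // true label of each instance
        private final int nbLabels;   // number of distinct labels
        private final Map<Integer, int[]> votes = new HashMap<>();

        public OobErrorCallback(int[] labels, int nbLabels) {
            this.labels = labels;
            this.nbLabels = nbLabels;
        }

        @Override
        public void prediction(int treeId, int instanceId, int predictedLabel) {
            votes.computeIfAbsent(instanceId, k -> new int[nbLabels])[predictedLabel]++;
        }

        /** Fraction of voted-on instances whose majority vote is wrong. */
        public double error() {
            if (votes.isEmpty()) {
                return 0.0;
            }
            int wrong = 0;
            for (Map.Entry<Integer, int[]> e : votes.entrySet()) {
                int[] v = e.getValue();
                int best = 0;
                for (int l = 1; l < v.length; l++) {
                    if (v[l] > v[best]) {
                        best = l;
                    }
                }
                if (best != labels[e.getKey()]) {
                    wrong++;
                }
            }
            return (double) wrong / votes.size();
        }
    }
}
```

Instances that were never oob for any tree receive no votes and are left out 
of the error in this sketch. Ties are broken in favour of the lowest label.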

To test this implementation I added BuildForest, which takes a few simple 
parameters and builds a forest from the Kdd dataset using either the 
sequential (reference) or the mapreduce implementation. The basic usage is as 
follows:

hadoop jar mahout-core-....job org.apache.mahout.rf.mapred.examples.BuildForest 
[MR] path m nbtrees
 * MR      : use the mapreduce implementation
 * path    : path to the Kdd dataset
 * m       : number of variables to select at each tree-node
 * nbtrees : size of the forest

BuildForest implements the Tool interface, so you'll be able to pass Hadoop 
parameters.

I ran a small experiment on my two-node (Ubuntu) cluster and got a 2x 
speedup, but there is a lot of randomness involved and my cluster is 
unbalanced (one node is 2x faster than the other). I will run more tests this 
coming week.

PS: I also added a package.htm in org.apache.mahout.rf.mapred that contains 
this description.

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory, it uses the reference implementation's code to 
> build each tree and estimate the oob error.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
