[ https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Deneche A. Hakim updated MAHOUT-140:
------------------------------------

    Attachment: mapred_patch.diff

org.apache.mahout.rf.mapred

To keep things simple, MAHOUT-122 (the reference implementation) is also included in this patch.

This is an in-memory mapreduce implementation of Random Forests. Each mapper is responsible for growing a number of trees with a whole copy of the dataset loaded in memory. It uses the reference implementation's code to build each tree and estimate the oob (out-of-bag) error.

The job needs no input data; the dataset is distributed to the slave nodes using the DistributedCache. A custom InputFormat (InMemInputFormat) is configured with the desired number of trees and generates a number of InputSplits (InMemInputSplit) equal to the configured number of maps (mapred.map.tasks).

There is no need for a reducer. Each map outputs (InMemOutput) the trees it built and, for each tree, the labels the tree predicted for each oob instance. This step has to be done in the mapper because only there do we know which instances are oob.

The main program (InMemBuilder) is responsible for configuring and launching the job. At the end of the job it parses the output files and builds the corresponding RandomForest. For each tree's predictions it calls a PredictionCallback (if one is available), which allows the caller to compute any error needed.

To test this implementation I added BuildForest, which takes simple parameters and can build a forest from the Kdd dataset using either the sequential (reference) or the mapreduce implementation. The basic usage is as follows:

  hadoop jar mahout-core-....job org.apache.mahout.rf.mapred.examples.BuildForest [MR] path m nbtrees

* MR : use the mapreduce implementation
* path : path to the Kdd dataset
* m : number of variables to select at each tree-node
* nbtrees : size of the forest

BuildForest implements the Tool interface, so you'll be able to pass Hadoop parameters.
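
To illustrate the PredictionCallback step described above, here is a minimal, self-contained sketch of how a caller might turn per-tree oob predictions into a majority-vote oob error. The names OobCallback and OobErrorSketch and the method signature are assumptions made for this example; they are not taken from the patch, and the real PredictionCallback may look different.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical callback; the actual PredictionCallback signature is not shown in this message. */
interface OobCallback {
    /** Called once per (tree, oob instance) pair with the label that tree predicted. */
    void prediction(int treeId, int instanceId, int predictedLabel);
}

/** Accumulates oob votes per instance and reports the majority-vote oob error rate. */
final class OobErrorSketch implements OobCallback {
    private final int[] trueLabels;   // true label of each instance, indexed by instance id
    private final int numLabels;      // number of distinct class labels
    private final Map<Integer, int[]> votes = new HashMap<>();

    OobErrorSketch(int[] trueLabels, int numLabels) {
        this.trueLabels = trueLabels;
        this.numLabels = numLabels;
    }

    @Override
    public void prediction(int treeId, int instanceId, int predictedLabel) {
        // One vote for predictedLabel from a tree for which this instance was oob.
        votes.computeIfAbsent(instanceId, k -> new int[numLabels])[predictedLabel]++;
    }

    /** Fraction of instances (with at least one oob vote) whose majority vote is wrong. */
    double oobError() {
        int errors = 0;
        int counted = 0;
        for (Map.Entry<Integer, int[]> entry : votes.entrySet()) {
            int[] v = entry.getValue();
            int best = 0;
            for (int label = 1; label < v.length; label++) {
                if (v[label] > v[best]) {
                    best = label;   // ties keep the lowest label index
                }
            }
            counted++;
            if (best != trueLabels[entry.getKey()]) {
                errors++;
            }
        }
        return counted == 0 ? Double.NaN : (double) errors / counted;
    }

    public static void main(String[] args) {
        OobErrorSketch callback = new OobErrorSketch(new int[] {0, 1, 1}, 2);
        callback.prediction(0, 0, 0);   // tree 0, instance 0: correct
        callback.prediction(1, 0, 0);   // tree 1, instance 0: correct
        callback.prediction(0, 1, 0);   // tree 0, instance 1: wrong
        callback.prediction(2, 2, 1);   // tree 2, instance 2: correct
        System.out.println("oob error = " + callback.oobError());
    }
}
```

In this sketch an instance that was never oob for any tree is simply left out of the estimate, which is one reasonable choice among several.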

I ran a small experiment on my two-node (Ubuntu) cluster and got a 2x speedup, but there is a lot of randomness involved and my cluster is uneven (one node is 2x faster than the other). I will run more tests this coming week.

PS: I also added a package.htm in org.apache.mahout.rf.mapred that contains this description.

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of
> the dataset loaded in memory, it uses the reference implementation's code to
> build each tree and estimate the oob error.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.