[ 
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated MAHOUT-122:
------------------------------------

    Attachment: refimp_Jul6.diff

*Optimization patch*
Just the reference implementation; it does not contain the mapred implementation

*changes*
* the class InformationGain, responsible for finding the best split using 
Information Gain, has been renamed IgSplit and is now abstract; the class 
DefaultIgSplit contains the current implementation, and I added an optimized 
version, OptIgSplit
* examples.CpuTest is a small program (it uses CLI by the way) to test the 
optimizations. This program works as follows:
 ** the program loads kdd50%, just to reduce the loading time;
 ** using the passed seed, the program creates a random number generator;
 ** the program instantiates a TreeBuilder and passes it a DefaultIgSplit or an 
OptIgSplit, depending on the passed parameters;
 ** the program repeats N times (N is a command-line parameter):
  *** using the passed ratio (a number in the range (0,1]), the program creates 
a random subset from the data;
  *** the program builds a single tree, selecting one random variable at each 
tree-node
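
For readers unfamiliar with the split criterion, the quantity an IgSplit implementation computes can be sketched like this (a toy sketch only; the class and method names below are mine, not the patch's API):

```java
// Sketch of the information-gain computation behind IgSplit (illustrative,
// not code from the patch).
public class IgSplitSketch {

    // Shannon entropy of a class-count vector, in bits.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of splitting a node with class counts `parent`
    // into two children with counts `left` and `right`.
    static double informationGain(int[] parent, int[] left, int[] right) {
        int total = 0, nl = 0, nr = 0;
        for (int c : parent) total += c;
        for (int c : left) nl += c;
        for (int c : right) nr += c;
        return entropy(parent)
                - ((double) nl / total) * entropy(left)
                - ((double) nr / total) * entropy(right);
    }

    public static void main(String[] args) {
        // A perfectly separating split recovers the full parent entropy.
        int[] parent = {5, 5};
        System.out.println(
            informationGain(parent, new int[]{5, 0}, new int[]{0, 5})); // 1.0
    }
}
```

A split finder evaluates this gain for each candidate split of each selected attribute and keeps the maximum; the Default and Opt variants differ only in how cheaply the candidate counts are produced.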

Here are some results on the KDD dataset. For each ratio, the program is 
launched twice: once with DefaultIgSplit and once with OptIgSplit. Each run 
builds 50 trees, and the initial seed is always 1L. I measured the sum of the 
build times for all the trees; it includes neither the data loading nor the 
subset generation

|| Ratio || Default || Optimized ||
| 0.1% | 1m 1s 336 | 2s 580 |
| 0.5% | 18m 34s 523 | 1m 21s 285 |
| 1% | 36m 37s 953 | 1m 40s 699 |
| 5% | 1h 43m 4s 899 | 2m 19s 745 |
| 10% | 6h 8m 46s 947 | 5m 8s 359 |
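
The timing pattern behind the table is roughly the following (a sketch; the actual CpuTest code may differ, and TreeBuilder/buildTree here are stand-ins): only tree building is inside the timer, subset generation stays outside.

```java
import java.util.Random;

// Sketch of how the build times above were accumulated (illustrative only):
// the subset generation happens outside the timed region, so only the calls
// to buildTree contribute to the measured total.
public class BuildTimer {
    interface TreeBuilder { void buildTree(int[] subset); }

    // Returns the total build time over n trees, in milliseconds.
    static long timeBuilds(TreeBuilder builder, int n, Random rng,
                           int dataSize, double ratio) {
        long elapsed = 0;
        for (int i = 0; i < n; i++) {
            int[] subset = randomSubset(rng, dataSize, ratio); // not timed
            long start = System.nanoTime();
            builder.buildTree(subset);                         // timed
            elapsed += System.nanoTime() - start;
        }
        return elapsed / 1_000_000;
    }

    // Draws a random subset of instance indices of size dataSize * ratio.
    static int[] randomSubset(Random rng, int dataSize, double ratio) {
        int size = (int) (dataSize * ratio);
        int[] subset = new int[size];
        for (int i = 0; i < size; i++) subset[i] = rng.nextInt(dataSize);
        return subset;
    }
}
```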

Because the KDD dataset does not contain categorical attributes, these results 
cover numerical attributes only; I shall run another batch of tests using 
another dataset. There is still room for improvement, but it requires sorting 
the whole dataset on each of its numerical attributes. This could be done by 
processing the dataset once and storing an optimization-friendly version for 
later use.
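
The "process once, reuse" idea could look like this (a hypothetical sketch, not code from the patch): precompute, for each numerical attribute, the instance indices ordered by attribute value, so that split search can sweep each attribute in sorted order instead of re-sorting at every tree node.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of a presorted, optimization-friendly dataset representation
// (hypothetical; attribute/field names are illustrative). Each numerical
// attribute is sorted exactly once, at construction time.
public class PresortedData {
    private final double[][] values;  // values[attr][instance]
    private final int[][] sortedIdx;  // sortedIdx[attr] = instances ordered by value

    PresortedData(double[][] values) {
        this.values = values;
        this.sortedIdx = new int[values.length][];
        for (int a = 0; a < values.length; a++) {
            final double[] col = values[a];
            Integer[] order = new Integer[col.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, Comparator.comparingDouble(i -> col[i]));
            sortedIdx[a] = new int[order.length];
            for (int i = 0; i < order.length; i++) sortedIdx[a][i] = order[i];
        }
    }

    // Instance indices for attribute `attr`, ordered by increasing value;
    // candidate thresholds can be scanned in one linear pass over this order.
    int[] sortedOrder(int attr) { return sortedIdx[attr]; }

    double value(int attr, int instance) { return values[attr][instance]; }
}
```

Storing such a version on disk would trade a one-time preprocessing pass (plus extra storage for the index arrays) against the per-node sorting cost during building.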

Now that the building algorithm is fast enough, I tried it again with 
Breiman's example and discovered (horrified) that the out-of-bag 
classification takes a lot of time: using KDD10%, one tree is built in 5s 
with the optimized code, but the oob classification takes more than *19 
minutes*! I shall (must) investigate this issue as soon as possible.

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, 3w_patch.diff, refimp_Jul6.diff, RF 
> reference.patch
>
>
> This is the first step of my GSOC project. Implement a simple, easy to 
> understand, reference implementation of Random Forests (Building and 
> Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
