[ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Deneche A. Hakim updated MAHOUT-122:
------------------------------------

    Attachment: refimp_Jul6.diff

*Optimization patch*

Just the reference implementation; it does not contain the map-reduce implementation.

*Changes*
* The class InformationGain, responsible for finding the best split using information gain, has been renamed IgSplit and is now abstract. The class DefaultIgSplit contains the current implementation, and I added an optimized version, OptIgSplit.
* examples.CpuTest is a small program (it uses CLI, by the way) to test the optimizations. This program works as follows:
** the program loads kdd50%, just to reduce the loading time;
** using the passed seed, the program creates a random number generator;
** the program instantiates a TreeBuilder and passes it a DefaultIgSplit or an OptIgSplit, depending on the passed parameters;
** the program repeats N times (N is a command-line parameter):
*** using the passed ratio (a number in the range (0,1]), the program creates a random subset of the data;
*** the program builds a single tree, selecting one random variable at each tree node.

Here are some results on the KDD dataset. For each ratio we launch the program twice, once with DefaultIgSplit and once with OptIgSplit. Each time we build 50 trees, and the initial seed is always 1L. I measured the sum of the build times over all the trees; it does not include the data loading or the subset generation.

|| Ratio || Default || Optimized ||
| 0.1% | 1m 1s 336 | 2s 580 |
| 0.5% | 18m 34s 523 | 1m 21s 285 |
| 1% | 36m 37s 953 | 1m 40s 699 |
| 5% | 1h 43m 4s 899 | 2m 19s 745 |
| 10% | 6h 8m 46s 947 | 5m 8s 359 |

Because the KDD dataset does not contain categorical attributes, these results cover numerical attributes only; I shall run another batch of tests using another dataset. There is still room for improvement, but it requires sorting the whole dataset for each of its numerical attributes.
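The refactoring described above can be sketched as follows. The class and method names (IgSplit, DefaultIgSplit, OptIgSplit, bestSplit) follow this comment's description, but the bodies are a hypothetical illustration of the naive-rescan versus sort-once-and-sweep approaches for a single numerical attribute, not the actual patch code:

```java
import java.util.Arrays;

// Hypothetical sketch of the IgSplit hierarchy described above.
abstract class IgSplit {

  /** Best threshold (split as "value <= threshold") maximizing information gain. */
  abstract double bestSplit(double[] values, int[] labels, int numLabels);

  /** Shannon entropy (in bits) of a label-count histogram. */
  static double entropy(int[] counts, int total) {
    double h = 0.0;
    for (int c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2.0);
      }
    }
    return h;
  }
}

/** Naive version: re-scans the whole attribute for every candidate threshold. */
class DefaultIgSplit extends IgSplit {
  @Override
  double bestSplit(double[] values, int[] labels, int numLabels) {
    int n = values.length;
    int[] totalCounts = new int[numLabels];
    for (int label : labels) totalCounts[label]++;
    double baseEntropy = entropy(totalCounts, n);

    double bestIg = -1.0, bestThreshold = Double.NaN;
    for (double candidate : values) {
      int[] left = new int[numLabels];
      int[] right = new int[numLabels];
      int nLeft = 0;
      for (int i = 0; i < n; i++) { // O(n) scan per candidate -> O(n^2) overall
        if (values[i] <= candidate) { left[labels[i]]++; nLeft++; }
        else right[labels[i]]++;
      }
      if (nLeft == 0 || nLeft == n) continue; // degenerate split
      double ig = baseEntropy
          - (double) nLeft / n * entropy(left, nLeft)
          - (double) (n - nLeft) / n * entropy(right, n - nLeft);
      if (ig > bestIg) { bestIg = ig; bestThreshold = candidate; }
    }
    return bestThreshold;
  }
}

/** Optimized version: sorts once, then sweeps with incremental counts. */
class OptIgSplit extends IgSplit {
  @Override
  double bestSplit(double[] values, int[] labels, int numLabels) {
    int n = values.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = i;
    Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

    int[] left = new int[numLabels];
    int[] right = new int[numLabels];
    for (int label : labels) right[label]++;
    double baseEntropy = entropy(right, n);

    double bestIg = -1.0, bestThreshold = Double.NaN;
    for (int i = 0; i < n - 1; i++) {       // single O(n) sweep after sorting
      int idx = order[i];
      left[labels[idx]]++;                   // move one instance left...
      right[labels[idx]]--;                  // ...instead of recounting
      if (values[idx] == values[order[i + 1]]) continue; // can't split equal values
      int nLeft = i + 1;
      double ig = baseEntropy
          - (double) nLeft / n * entropy(left, nLeft)
          - (double) (n - nLeft) / n * entropy(right, n - nLeft);
      if (ig > bestIg) { bestIg = ig; bestThreshold = values[idx]; }
    }
    return bestThreshold;
  }
}
```

The naive version recomputes the class counts from scratch for every candidate threshold (O(n²) per attribute), while the sorted sweep updates the counts incrementally after a single O(n log n) sort, which is consistent with the kind of speedups reported in the table above.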
This could be possible by processing the dataset once and storing an optimization-friendly version for later use.

Now that the building algorithm is fast enough, I tried it again with Breiman's example and discovered (horrified) that the out-of-bag classification takes a lot of time: using KDD10%, one tree is built in 5s with the optimized code, and the oob classification takes more than *19 minutes*! I shall (must) investigate this issue as soon as possible.

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, 3w_patch.diff, refimp_Jul6.diff, RF reference.patch
>
>
> This is the first step of my GSOC project. Implement a simple, easy to understand, reference implementation of Random Forests (Building and Classification). The only requirement here is that "it works".

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.