[ 
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated MAHOUT-122:
------------------------------------

    Attachment: refimp_Jul7.diff

I ran some tests on the "poker hand" dataset from UCI, which contains 8 
categorical attributes and 1,000,000 instances. I got the following results 
(50 trees):

|| Ratio || Default || Optimized ||
| 100% | 11m 31s 253 | 8m 32s 446 |

It seems that the default implementation is fast enough for categorical 
attributes, and the optimized version is still faster (roughly 26% less build 
time in this run).

I also found the issue with the out-of-bag (OOB) error estimation. The old code was:
{code}
Data bag = data.bagging(rng);

Node tree = treeBuilder.build(bag);

// predict the label for the out-of-bag elements
for (int index = 0; index < data.size(); index++) {
  Instance v = data.get(index);

  if (!bag.contains(v)) {
    int prediction = tree.classify(v);
    callback.prediction(treeId, v, prediction);
  }
}
{code}

The problem was with bag.contains(): commenting out this test drops the build 
time from *21m 8s 473* to *5s 913*. I modified Data.bagging() to fill a given 
boolean array marking which instances are sampled in the bag, and used it as 
follows:

{code}
Arrays.fill(sampled, false);
Data bag = data.bagging(rng, sampled);

Node tree = treeBuilder.build(bag);

// predict the label for the out-of-bag elements
for (int index = 0; index < data.size(); index++) {
  Instance v = data.get(index);

  if (!sampled[index]) {
    int prediction = tree.classify(v);
    callback.prediction(treeId, v, prediction);
  }
}
{code}
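
For illustration, here is a minimal, self-contained sketch of what a bagging step that fills such a boolean array could look like. It is not the code from the patch: the class BaggingSketch, its methods, and the use of plain int indices instead of Mahout's Data and Instance types are all assumptions made to keep the example runnable. The point is that the out-of-bag test becomes a constant-time array lookup, whereas a contains() call presumably has to search the bag for every instance.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical stand-in for the bagging step; not Mahout's actual Data class. */
class BaggingSketch {

  /**
   * Draws n indices with replacement from [0, n) and marks every drawn
   * index in the given sampled array. The caller is expected to reset
   * the array before each call.
   */
  static int[] bagging(int n, Random rng, boolean[] sampled) {
    int[] bag = new int[n];
    for (int i = 0; i < n; i++) {
      int index = rng.nextInt(n);
      bag[i] = index;
      sampled[index] = true;
    }
    return bag;
  }

  /** Counts the instances that were never drawn, i.e. the out-of-bag ones. */
  static int countOutOfBag(boolean[] sampled) {
    int count = 0;
    for (boolean s : sampled) {
      if (!s) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int n = 1_000_000;
    boolean[] sampled = new boolean[n];
    Arrays.fill(sampled, false); // reset before building each tree
    int[] bag = bagging(n, new Random(42), sampled);
    System.out.println("bag size: " + bag.length);
    System.out.println("out-of-bag instances: " + countOutOfBag(sampled));
  }
}
```

With sampling with replacement, about 1/e (roughly 37%) of the instances are expected to end up out-of-bag for each tree, so the membership test runs on every instance and its cost matters.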

The new build time is *6s 777*. I think this issue is solved (for now...)
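
The snippets above hand each out-of-bag prediction to a callback. As a rough sketch of how such predictions could be turned into an OOB error estimate, the following accumulates votes per instance and compares the majority label with the true label. The class OobErrorSketch, the use of instance indices instead of Instance objects, and the majority-vote rule are assumptions for illustration, not the callback from the patch.

```java
/** Hypothetical vote-accumulating callback; not the class from the patch. */
class OobErrorSketch {

  // votes[instance][label] = number of trees that predicted this label
  // for this instance while the instance was out of their bag
  private final int[][] votes;

  OobErrorSketch(int numInstances, int numLabels) {
    votes = new int[numInstances][numLabels];
  }

  /** Records one tree's prediction for an out-of-bag instance. */
  void prediction(int treeId, int instanceIndex, int predictedLabel) {
    // treeId is unused here; it is kept to mirror the callback in the snippets
    votes[instanceIndex][predictedLabel]++;
  }

  /**
   * Returns the fraction of instances, among those with at least one vote,
   * whose majority-vote label differs from the true label.
   */
  double oobError(int[] trueLabels) {
    int evaluated = 0;
    int errors = 0;
    for (int i = 0; i < votes.length; i++) {
      int best = -1;
      int bestCount = 0;
      for (int label = 0; label < votes[i].length; label++) {
        if (votes[i][label] > bestCount) {
          bestCount = votes[i][label];
          best = label;
        }
      }
      if (best < 0) {
        continue; // this instance was never out-of-bag
      }
      evaluated++;
      if (best != trueLabels[i]) {
        errors++;
      }
    }
    return evaluated == 0 ? Double.NaN : (double) errors / evaluated;
  }
}
```

Ties are broken in favor of the lowest label index in this sketch; a real implementation might break them differently.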

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, 3w_patch.diff, refimp_Jul6.diff, 
> refimp_Jul7.diff, RF reference.patch
>
>
> This is the first step of my GSOC project. Implement a simple, easy to 
> understand, reference implementation of Random Forests (Building and 
> Classification). The only requirement here is that "it works".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
