[ 
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718771#action_12718771
 ] 

Deneche A. Hakim commented on MAHOUT-122:
-----------------------------------------

I was wrong about the memory usage of the current implementation: even though 
each node has its own Data object, all the Data objects still share the same 
Instance objects, which hold all the actual data.

I did some profiling and found that the "InformationGain.computeSplit()" method, 
which computes the Information Gain for the current split, takes nearly 98.5% 
of the total time. So if we later want to optimize this implementation, we'll 
have to use a better algorithm to compute the Information Gain. The one I'm 
aware of, which is available in the Weka source code, computes the sorting 
indices for the data with each attribute.
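
To illustrate the sorting-indices idea, here is a minimal sketch for one numeric 
attribute. It is neither the Mahout code nor the Weka code: the class 
SortedSplitSketch and its methods are hypothetical names, and it assumes labels 
are integers in [0, nbClasses). The attribute is sorted once, then class counts 
are updated incrementally while scanning the candidate thresholds.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper for illustration; not part of Mahout or Weka. */
final class SortedSplitSketch {

  /** Entropy (in bits) of a class-count histogram holding 'total' instances. */
  static double entropy(int[] counts, int total) {
    if (total == 0) {
      return 0.0;
    }
    double h = 0.0;
    for (int c : counts) {
      if (c > 0) {
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2.0);
      }
    }
    return h;
  }

  /**
   * Returns {bestThreshold, bestGain} for one numeric attribute.
   * bestThreshold is NaN when no split improves on the parent entropy.
   */
  static double[] bestSplit(double[] values, int[] labels, int nbClasses) {
    int n = values.length;

    // Sorting indices: instance positions ordered by attribute value.
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

    // Initially every instance is on the right side of the split.
    int[] left = new int[nbClasses];
    int[] right = new int[nbClasses];
    for (int label : labels) {
      right[label]++;
    }
    double parent = entropy(right, n);

    double bestGain = 0.0;
    double bestThreshold = Double.NaN;
    for (int k = 0; k < n - 1; k++) {
      int idx = order[k];
      // Move one instance from right to left; counts are updated in O(1).
      left[labels[idx]]++;
      right[labels[idx]]--;

      double v = values[idx];
      double next = values[order[k + 1]];
      if (v == next) {
        continue; // cannot split between two equal values
      }
      int nl = k + 1;
      int nr = n - nl;
      double gain = parent
          - ((double) nl / n) * entropy(left, nl)
          - ((double) nr / n) * entropy(right, nr);
      if (gain > bestGain) {
        bestGain = gain;
        bestThreshold = (v + next) / 2.0;
      }
    }
    return new double[] {bestThreshold, bestGain};
  }
}
```

For example, with values {1, 2, 3, 4} and labels {0, 0, 1, 1}, bestSplit returns 
a threshold of 2.5 and a gain of 1.0 bit. After the single sort, each candidate 
threshold only costs an update of the two count arrays plus two entropy 
evaluations over nbClasses entries.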

I also did some memory usage profiling using a Runnable that samples, every 
50 ms, a rough estimate of memory usage using (Runtime.totalMemory() - 
Runtime.freeMemory()). I used the KDD dataset (> 700 MB of data), then created 
different datasets using subsets of different sizes (1%, 10%, 25%, 50%). 
Here are the results:

KDD has 41 attributes (stored as "double")
KDD  1% has      49402 instances
KDD 10% has   494021 instances
KDD 25% has 1224607 instances

Dataset   Nb Trees   MUALD(*)        Max Used Memory   Nb Nodes     Max Tree Depth
KDD  1%    1          35.414.504 b    38.069.640 b     120          10
KDD  1%   10          35.144.096 b    45.669.904 b     126 (mean)   11 (mean)
KDD 10%    1         201.697.512 b   226.653.392 b     712          22
KDD 25%    1         521.515.136 b   569.795.152 b     930          26

(*) Memory used right after loading the Data
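
For reference, the sampling approach described above can be sketched roughly as 
follows. This is not the exact code used for these measurements: MemorySampler 
and its methods are hypothetical names, and the value is only a rough estimate 
because it includes garbage that has not been collected yet.

```java
/** Hypothetical sampler for illustration; records the peak of a rough heap-usage estimate. */
final class MemorySampler implements Runnable {
  private final long intervalMs;
  private volatile boolean running = true;
  private volatile long maxUsed;

  MemorySampler(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /** Rough estimate of used heap memory, in bytes. */
  static long usedMemory() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  /** Takes one sample and keeps the maximum seen so far. */
  void sample() {
    maxUsed = Math.max(maxUsed, usedMemory());
  }

  @Override
  public void run() {
    while (running) {
      sample();
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  void stop() {
    running = false;
  }

  long getMaxUsed() {
    return maxUsed;
  }
}
```

A typical use would be to start it in a daemon thread with 
new MemorySampler(50) before building the forest, call stop() afterwards, and 
read getMaxUsed() for the "Max Used Memory" column.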

I should run more tests using KDD 50% and KDD 100%, and also build more 
trees to see how the memory usage behaves. But because the current 
implementation is very slow, it may take some time.

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to 
> understand, reference implementation of Random Forests (Building and 
> Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
