[ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718771#action_12718771 ]
Deneche A. Hakim edited comment on MAHOUT-122 at 6/17/09 2:57 AM: ------------------------------------------------------------------

What's new: I added the results for KDD 10% with 10 trees. I also tried to build a single tree with KDD 50%, and after more than 12 hours (!) of computing I gave up.

I was wrong about the memory usage of the current implementation: even though each node has its own Data object, all the Data objects still share the same Instance objects, which hold all the actual data.

I did some profiling and found that the "InformationGain.computeSplit()" method, which computes the information gain for the current split, takes nearly 98.5% of the total time. So if we want to optimize this implementation later, we'll have to use a better algorithm to compute the information gain. The one I'm aware of, available in the Weka source code, precomputes for each attribute the indices that sort the data by that attribute.

I also did some memory-usage profiling using a Runnable that samples, every 50 ms, a rough estimate of memory usage via (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()). I used the KDD dataset (> 700 MB of data), then created smaller datasets from subsets of different sizes (1%, 10%, 25%, 50%). Here are the results:

KDD has 41 attributes (stored as "double")
KDD 1% has 49,402 instances
KDD 10% has 494,021 instances
KDD 25% has 1,224,607 instances

|| Dataset || Nb Trees || MUALD(*) || Max Used Memory || Nb Nodes || Max Tree Depth ||
| KDD 1% | 1 | 35,414,504 b | 38,069,640 b | 120 | 10 |
| KDD 1% | 10 | 35,144,096 b | 45,669,904 b | 126 (mean) | 11 (mean) |
| KDD 10% | 1 | 201,697,512 b | 226,653,392 b | 712 | 22 |
| KDD 10% | 10 | 201,697,512 b | 276,780,280 b | 870 (mean) | 29 (mean) |
| KDD 25% | 1 | 521,515,136 b | 569,795,152 b | 930 | 26 |

(*) Memory used right after loading the Data

I should run more tests using KDD 50% and KDD 100%, and also build more trees to see how the memory usage behaves.
But because the current implementation is very slow, it may take some time.
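For reference, the sorted-indices idea mentioned above (the approach used in the Weka source) can be sketched roughly as follows; the class and method names here are mine, not Mahout's, and this assumes numeric attributes with binary splits. Sorting the indices once per attribute lets every candidate threshold be evaluated in a single linear scan with incrementally updated class counts, instead of recounting from scratch for each split:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortedSplitSketch {

    /** Shannon entropy (base 2) of a class-count vector. */
    static double entropy(int[] counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /**
     * Finds the numeric split threshold with the best information gain.
     * The indices are sorted once, then each candidate split is scored
     * in a single pass while the left/right class counts are updated
     * incrementally.
     */
    static double bestSplit(double[] values, int[] labels, int numLabels) {
        int n = values.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        // One O(n log n) sort per attribute, reused for every candidate split.
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));

        int[] left = new int[numLabels];
        int[] right = new int[numLabels];
        for (int lab : labels) right[lab]++;

        double baseEntropy = entropy(right, n);
        double bestGain = -1.0;
        double bestThreshold = Double.NaN;

        for (int i = 0; i < n - 1; i++) {
            int lab = labels[idx[i]];
            left[lab]++;              // move one instance from right to left
            right[lab]--;
            double v = values[idx[i]];
            double next = values[idx[i + 1]];
            if (v == next) continue;  // no valid threshold between equal values
            int nl = i + 1, nr = n - nl;
            double gain = baseEntropy
                - ((double) nl / n) * entropy(left, nl)
                - ((double) nr / n) * entropy(right, nr);
            if (gain > bestGain) {
                bestGain = gain;
                bestThreshold = (v + next) / 2.0;
            }
        }
        return bestThreshold;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        int[] labels   = {0,   0,   0,   1,    1,    1};
        System.out.println(bestSplit(values, labels, 2)); // prints 6.5
    }
}
```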
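The 50 ms memory sampler described above could look something like this (the MemorySampler name is mine; note that totalMemory() and freeMemory() are instance methods obtained through Runtime.getRuntime()):

```java
/**
 * Rough memory-usage sampler: every 50 ms it records an estimate of the
 * heap in use and keeps the maximum seen so far. The estimate is coarse
 * because the garbage collector may reclaim memory between samples.
 */
public class MemorySampler implements Runnable {

    private volatile boolean running = true;
    private volatile long maxUsed = 0;

    @Override
    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (running) {
            long used = rt.totalMemory() - rt.freeMemory();
            if (used > maxUsed) {
                maxUsed = used;
            }
            try {
                Thread.sleep(50); // sampling period
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() { running = false; }

    public long maxUsedBytes() { return maxUsed; }

    public static void main(String[] args) throws InterruptedException {
        MemorySampler sampler = new MemorySampler();
        Thread t = new Thread(sampler, "mem-sampler");
        t.setDaemon(true); // don't keep the JVM alive for the sampler
        t.start();
        byte[] ballast = new byte[32 * 1024 * 1024]; // simulate a workload
        Thread.sleep(200);
        sampler.stop();
        t.join();
        System.out.println("max used: " + sampler.maxUsedBytes() + " bytes"
            + " (ballast length " + ballast.length + ")");
    }
}
```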
ps: edited the results table to make it somewhat more readable

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, 3w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy-to-understand reference implementation of Random Forests (building and classification). The only requirement here is that "it works".