[ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718771#action_12718771 ]
Deneche A. Hakim edited comment on MAHOUT-122 at 6/17/09 2:57 AM: ------------------------------------------------------------------

What's new: I added the results for KDD 10% with 10 trees. I also tried to build a single tree with KDD 50%, and after more than 12 hours (!) of computing I gave up.

I was wrong about the memory usage of the current implementation: even though each node has its own Data object, all the Data objects still share the same Instance objects, which hold all the actual data.

I did some profiling and found that the "InformationGain.computeSplit()" method, which computes the information gain for the current split, takes nearly 98.5% of the total time. So if we want to optimize this implementation later, we'll have to use a better algorithm to compute the information gain. The one I'm aware of, available in the Weka source code, precomputes for each attribute the indices that sort the data by that attribute.

I also did some memory-usage profiling using a Runnable that samples, every 50 ms, a rough estimate of memory usage via (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()). I used the KDD dataset (> 700 MB of data), then created smaller datasets from subsets of different sizes (1%, 10%, 25%, 50%). Here are the results:

KDD has 41 attributes (stored as "double")
KDD 1% has 49,402 instances
KDD 10% has 494,021 instances
KDD 25% has 1,224,607 instances

|| Dataset || Nb Trees || MUALD(*) || Max Used Memory || Nb Nodes || Max Tree Depth ||
| KDD 1% | 1 | 35,414,504 b | 38,069,640 b | 120 | 10 |
| KDD 1% | 10 | 35,144,096 b | 45,669,904 b | 126 (mean) | 11 (mean) |
| KDD 10% | 1 | 201,697,512 b | 226,653,392 b | 712 | 22 |
| KDD 10% | 10 | 201,697,512 b | 276,780,280 b | 870 (mean) | 29 (mean) |
| KDD 25% | 1 | 521,515,136 b | 569,795,152 b | 930 | 26 |

(*) Memory used right after loading the Data

I should run more tests using KDD 50% and KDD 100%, and also build more trees to see how the memory usage behaves.
But because the current implementation is very slow, it may take some time.
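For reference, the sorted-indices idea mentioned above (the approach used in the Weka source) can be sketched roughly as follows; the class and method names here are mine, not Mahout's, and this assumes numeric attributes with binary splits. Sorting the indices once per attribute lets every candidate threshold be evaluated in a single linear scan with incrementally updated class counts, instead of recounting from scratch for each split:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortedSplitSketch {

    /** Shannon entropy (base 2) of a class-count vector. */
    static double entropy(int[] counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /**
     * Finds the numeric split threshold with the best information gain.
     * The indices are sorted once, then each candidate split is scored
     * in a single pass while the left/right class counts are updated
     * incrementally.
     */
    static double bestSplit(double[] values, int[] labels, int numLabels) {
        int n = values.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        // One O(n log n) sort per attribute, reused for every candidate split.
        Arrays.sort(idx, Comparator.comparingDouble(i -> values[i]));

        int[] left = new int[numLabels];
        int[] right = new int[numLabels];
        for (int lab : labels) right[lab]++;

        double baseEntropy = entropy(right, n);
        double bestGain = -1.0;
        double bestThreshold = Double.NaN;

        for (int i = 0; i < n - 1; i++) {
            int lab = labels[idx[i]];
            left[lab]++;              // move one instance from right to left
            right[lab]--;
            double v = values[idx[i]];
            double next = values[idx[i + 1]];
            if (v == next) continue;  // no valid threshold between equal values
            int nl = i + 1, nr = n - nl;
            double gain = baseEntropy
                - ((double) nl / n) * entropy(left, nl)
                - ((double) nr / n) * entropy(right, nr);
            if (gain > bestGain) {
                bestGain = gain;
                bestThreshold = (v + next) / 2.0;
            }
        }
        return bestThreshold;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        int[] labels   = {0,   0,   0,   1,    1,    1};
        System.out.println(bestSplit(values, labels, 2)); // prints 6.5
    }
}
```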
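The 50 ms memory sampler described above could look something like this (the MemorySampler name is mine; note that totalMemory() and freeMemory() are instance methods obtained through Runtime.getRuntime()):

```java
/**
 * Rough memory-usage sampler: every 50 ms it records an estimate of the
 * heap in use and keeps the maximum seen so far. The estimate is coarse
 * because the garbage collector may reclaim memory between samples.
 */
public class MemorySampler implements Runnable {

    private volatile boolean running = true;
    private volatile long maxUsed = 0;

    @Override
    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (running) {
            long used = rt.totalMemory() - rt.freeMemory();
            if (used > maxUsed) {
                maxUsed = used;
            }
            try {
                Thread.sleep(50); // sampling period
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() { running = false; }

    public long maxUsedBytes() { return maxUsed; }

    public static void main(String[] args) throws InterruptedException {
        MemorySampler sampler = new MemorySampler();
        Thread t = new Thread(sampler, "mem-sampler");
        t.setDaemon(true); // don't keep the JVM alive for the sampler
        t.start();
        byte[] ballast = new byte[32 * 1024 * 1024]; // simulate a workload
        Thread.sleep(200);
        sampler.stop();
        t.join();
        System.out.println("max used: " + sampler.maxUsedBytes() + " bytes"
            + " (ballast length " + ballast.length + ")");
    }
}
```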
ps: edited the results table to make it somewhat more readable

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, 3w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy-to-understand reference implementation of Random Forests (building and classification). The only requirement here is that "it works".