[ 
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725125#action_12725125
 ] 

Deneche A. Hakim commented on MAHOUT-122:
-----------------------------------------

bq. The 450 byte overhead per training instance seems a little bit high, but I 
don't know the data well so it might be pretty reasonable. The original data 
size was about 100 bytes.

I may be able to explain this overhead:

* First of all, the memory estimations that I've done didn't account for the 
memory not yet garbage collected, so I've run the tests again and this time I 
launched the Garbage Collector just after loading the data;
* In a separate run, I allocated a double[nb instances][nb attributes] and 
noted how much memory is used.
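
For illustration, the second measurement can be sketched roughly as follows. This is a minimal sketch, not the code used to produce the numbers in this comment: the class and method names are hypothetical, and System.gc() is only a request that the JVM may ignore, so the result is an approximation.

```java
// Hypothetical sketch: estimate the heap used by double[nbInstances][nbAttributes].
class HeapUsageSketch {

    /** Requests a garbage collection, then returns the heap bytes currently in use. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint to the JVM, so the figure is approximate
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Allocates the array and returns it so that it stays reachable while measuring. */
    static double[][] allocate(int nbInstances, int nbAttributes) {
        return new double[nbInstances][nbAttributes];
    }

    public static void main(String[] args) {
        int nbInstances = 49_402; // KDD 1% size, taken from the comment
        int nbAttributes = 42;

        long before = usedHeapAfterGc();
        double[][] data = allocate(nbInstances, nbAttributes);
        long after = usedHeapAfterGc();

        System.out.println("Approx. bytes used by the array: " + (after - before));
        // Use the array after measuring so it cannot be collected early.
        System.out.println("Rows allocated: " + data.length);
    }
}
```

The reported value depends on the JVM, its settings, and whatever else is on the heap, so it is best compared across runs on the same setup.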

|| Dataset || Data size (nb instances x nb attributes) || Mem. used by double[nb instances][nb attributes] || MUALD ||
| KDD   1% |    49.402 x 42 |  19.050.312 B |  22.331.360 B |
| KDD  10% |   494.021 x 42 | 178.094.200 B | 204.659.576 B |
| KDD  25% | 1.224.607 x 42 | 438.395.224 B | 500.341.256 B |
| KDD  50% | 2.449.215 x 42 | 873.266.456 B | 998.331.560 B |
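
To make the table easier to compare with the 450 B figure quoted above, the totals can be divided by the number of instances. The sketch below does only that arithmetic on the numbers from the table; the class and method names are hypothetical. It gives roughly 408 to 452 B per instance for MUALD and roughly 357 to 386 B per instance for the plain double[][], against 42 x 8 = 336 B of raw double values.

```java
// Hypothetical sketch: per-instance memory derived from the totals in the table.
class InstanceFootprint {

    static final int NB_ATTRIBUTES = 42;

    /** Bytes needed for the raw double values of one instance (8 B per double). */
    static long rawBytesPerInstance() {
        return NB_ATTRIBUTES * 8L;
    }

    /** Average number of bytes per instance for a measured total. */
    static double bytesPerInstance(long totalBytes, long nbInstances) {
        return (double) totalBytes / nbInstances;
    }

    public static void main(String[] args) {
        String[] names = {"KDD 1%", "KDD 10%", "KDD 25%", "KDD 50%"};
        long[] instances = {49_402L, 494_021L, 1_224_607L, 2_449_215L};
        long[] arrayBytes = {19_050_312L, 178_094_200L, 438_395_224L, 873_266_456L};
        long[] mualdBytes = {22_331_360L, 204_659_576L, 500_341_256L, 998_331_560L};

        System.out.println("Raw doubles per instance: " + rawBytesPerInstance() + " B");
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-8s double[][]: %.1f B/instance, MUALD: %.1f B/instance%n",
                    names[i],
                    bytesPerInstance(arrayBytes[i], instances[i]),
                    bytesPerInstance(mualdBytes[i], instances[i]));
        }
    }
}
```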

Most of the overhead is caused by how the instances are represented in memory. 
I'm using a DenseVector, so all the attributes are stored in a double[], which 
means that each attribute uses 8 B of memory. By examining the original data, 
we can see that most of the attributes contain at most 3 digits, and because 
they are stored as text they take at most 4 B each if we count the separator.

I suppose that the difference between MUALD and the memory used by double[][] 
is caused by the way the JVM stores the references to the instance objects.

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, 3w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to 
> understand, reference implementation of Random Forests (Building and 
> Classification). The only requirement here is that "it works"

