[ https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725125#action_12725125 ]

Deneche A. Hakim commented on MAHOUT-122:
-----------------------------------------

bq. The 450 byte overhead per training instance seems a little bit high, but I don't know the data well so it might be pretty reasonable. The original data size was about 100 bytes.

I may be able to explain this overhead:
* First of all, my earlier memory estimates did not account for memory that had not yet been garbage collected, so I ran the tests again, this time launching the garbage collector just after loading the data;
* In a separate run, I allocated a double[nb instances][nb attributes] and noted how much memory it used.

|| Dataset || Data size (nb instances x nb attributes) || Mem. used by double[nb instances][nb attributes] || MUALD ||
| KDD 1% | 49,402 x 42 | 19,050,312 B | 22,331,360 B |
| KDD 10% | 494,021 x 42 | 178,094,200 B | 204,659,576 B |
| KDD 25% | 1,224,607 x 42 | 438,395,224 B | 500,341,256 B |
| KDD 50% | 2,449,215 x 42 | 873,266,456 B | 998,331,560 B |

Most of the overhead is caused by how the instances are represented in memory. I'm using a DenseVector, so all the attributes are stored in a double[], which means that each attribute uses 8 B of memory (336 B for the 42 attributes of one instance). Examining the original data shows that most attributes contain at most 3 digits, and because they are stored as text, they take at most 4 B each, counting the separator.

I suppose that the difference between MUALD and the memory used by double[][] is caused by the way the JVM stores the references to the instances' objects.

> Random Forests Reference Implementation
> ---------------------------------------
>
> Key: MAHOUT-122
> URL: https://issues.apache.org/jira/browse/MAHOUT-122
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.2
> Reporter: Deneche A. Hakim
> Attachments: 2w_patch.diff, 3w_patch.diff, RF reference.patch
>
> Original Estimate: 25h
> Remaining Estimate: 25h
>
> This is the first step of my GSOC project.
> Implement a simple, easy to understand, reference implementation of Random Forests (Building and Classification). The only requirement here is that "it works"

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
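
The measurement described in the comment above (request a garbage collection, read the used heap, allocate a double[nb instances][nb attributes], read the used heap again) can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the code that produced the table: the class `MemoryPerInstance` and all of its method names are hypothetical and not part of Mahout, `System.gc()` is only a hint to the JVM, and the measured value will vary with the JVM and its settings. The per-instance figures are plain divisions of the totals reported in the table.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Mahout: illustrates the arithmetic and the measurement. */
final class MemoryPerInstance {

    // From the table: every KDD sample has 42 attributes.
    static final int NB_ATTRIBUTES = 42;

    /** Raw payload of one instance stored as doubles, ignoring any JVM object overhead. */
    static long payloadBytes(int nbAttributes) {
        return (long) nbAttributes * Double.BYTES; // 8 B per attribute
    }

    /** Average number of bytes per instance for a reported total. */
    static double bytesPerInstance(long totalBytes, long nbInstances) {
        return (double) totalBytes / nbInstances;
    }

    /** Used heap after requesting a collection; System.gc() is a hint, not a guarantee. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Approximate heap growth caused by allocating a double[nbInstances][nbAttributes]. */
    static long measureDoubleMatrix(int nbInstances, int nbAttributes) {
        long before = usedHeapAfterGc();
        double[][] data = new double[nbInstances][nbAttributes];
        long after = usedHeapAfterGc();
        // Use the array after the second reading so it stays reachable during the measurement.
        if (data.length != nbInstances) {
            throw new IllegalStateException("unexpected array length");
        }
        return after - before;
    }

    public static void main(String[] args) {
        System.out.println("Payload per instance: " + payloadBytes(NB_ATTRIBUTES) + " B");

        // Rows copied from the table: {nb instances, double[][] bytes, MUALD bytes}.
        long[][] rows = {
            {49_402L, 19_050_312L, 22_331_360L},
            {494_021L, 178_094_200L, 204_659_576L},
            {1_224_607L, 438_395_224L, 500_341_256L},
            {2_449_215L, 873_266_456L, 998_331_560L},
        };
        for (long[] r : rows) {
            System.out.println(String.format(Locale.ROOT,
                "%d instances: double[][] %.1f B/instance, MUALD %.1f B/instance",
                r[0], bytesPerInstance(r[1], r[0]), bytesPerInstance(r[2], r[0])));
        }

        // Live measurement for the smallest sample; the result depends on the JVM.
        long measured = measureDoubleMatrix(49_402, NB_ATTRIBUTES);
        System.out.println("Measured for 49402 x 42: " + measured + " B");
    }
}
```

Dividing each total by its instance count puts the reported figures on the same per-instance scale as the 450 B overhead quoted at the top of the comment, which makes them directly comparable with the 8 B x 42 payload.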