[ https://issues.apache.org/jira/browse/MADLIB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983698#comment-15983698 ]
ASF GitHub Bot commented on MADLIB-1057:
----------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/120

> Reduce memory footprint for DT
> ------------------------------
>
>                 Key: MADLIB-1057
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1057
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>             Fix For: v1.11
>
>
> Follow on from spike
> https://issues.apache.org/jira/browse/MADLIB-1035
>
> Step 1
> As a madlib developer I want to recreate the RF memory issue (reported in
> https://issues.apache.org/jira/browse/MADLIB-1035).
> The current datasets we have are
> dt_adult : 32K rows 14 columns
> ecommerce : 1M rows 4 columns (ecommerce isn’t actually suitable for DT/RF)
> We need a table with ~2.2M rows and ~130 features (the actual target table
> has ~1300 features). Randomly filling them might help diagnose the issue,
> but ideally we would want a somewhat sensible dataset. The problem seems to
> involve relatively short trees (depth 5), which means a random dataset will
> probably fill the whole tree, which might not be true for a structured dataset.
>
> Step 2
> Refactoring DT for a smaller memory footprint.
> Tree Accumulator has 2 matrices for continuous and categorical variables.
> The whole structure is recreated at every level.
> Every matrix has 2^i rows (i is the level).
> The categorical matrix size depends on the total number of categories
> (weather : {sunny, cloudy, rainy}, isWeekend : {true, false} means this total
> is 3+2=5).
> The continuous matrix size depends on the number of cont. features * the
> number of bins.
> Tree accumulator works like an array, not a linked list. Even if the output is
> not a complete tree, the tree accumulator creates rows for nonexistent
> branches in proper order and fills them with 0 values.
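
To make the sizing rules quoted above concrete, here is a rough sketch of how the dense accumulator grows with tree level. It assumes the column counts are exactly the quantities named in the description (total categories, and continuous features times bins); the real MADlib layout may carry additional factors. The function name and the example values for `n_continuous` and `n_bins` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def dense_accumulator_shape(
    level: int,
    categories: Dict[str, List[object]],
    n_continuous: int,
    n_bins: int,
) -> Tuple[int, int, int]:
    """Return (rows, categorical_cols, continuous_cols) for one tree level.

    Hypothetical helper: mirrors the rules in the issue text, not MADlib code.
    """
    rows = 2 ** level  # one row per node of a complete tree at this level
    cat_cols = sum(len(values) for values in categories.values())
    cont_cols = n_continuous * n_bins
    return rows, cat_cols, cont_cols


# Example from the issue: weather has 3 categories, isWeekend has 2.
categories = {
    "weather": ["sunny", "cloudy", "rainy"],
    "isWeekend": [True, False],
}

# n_continuous=10 and n_bins=20 are made-up values for illustration.
shape = dense_accumulator_shape(5, categories, n_continuous=10, n_bins=20)
print(shape)  # (32, 5, 200)
```

Every one of the 32 rows is allocated at level 5 even when most of the corresponding branches do not exist, which is the overhead the issue sets out to remove.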
> The refactored version would create a small index table that has the same
> number of rows as the old tree accumulator (a complete tree) but only a
> single index column that points to the new tree accumulator row.
> This will allow us to keep most of the internal function interfaces the same,
> but the code to access (read/write) the tree accumulator will have to change.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
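
The index-table idea in Step 2 of the quoted description could be sketched roughly as follows. This is only an illustration of the indirection, not the code from the pull request: the class name, the `NO_ROW` sentinel, and the allocate-on-first-write policy are all assumptions.

```python
from typing import List, Optional

NO_ROW = -1  # hypothetical sentinel: this node position has no statistics row


class CompactAccumulator:
    """Hypothetical sketch: complete-tree index over a compact set of rows."""

    def __init__(self, level: int, n_cols: int) -> None:
        self.n_cols = n_cols
        # One index entry per node position of a complete tree at this level.
        self.index: List[int] = [NO_ROW] * (2 ** level)
        # Statistics rows exist only for node positions that were written to.
        self.rows: List[List[float]] = []

    def row_for(self, node_pos: int) -> List[float]:
        """Return the row for node_pos, allocating it on first access."""
        slot = self.index[node_pos]
        if slot == NO_ROW:
            slot = len(self.rows)
            self.rows.append([0.0] * self.n_cols)
            self.index[node_pos] = slot
        return self.rows[slot]

    def read(self, node_pos: int) -> Optional[List[float]]:
        """Return the row for node_pos, or None if the branch does not exist."""
        slot = self.index[node_pos]
        return None if slot == NO_ROW else self.rows[slot]


acc = CompactAccumulator(level=5, n_cols=205)
acc.row_for(3)[0] += 1.0   # only two of the 32 node positions are touched
acc.row_for(17)[4] += 2.0
print(len(acc.index), len(acc.rows))  # 32 2
```

Callers still address nodes by their position in the complete tree, so function interfaces can stay largely unchanged, while the wide statistics rows are stored only for branches that actually exist.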