[ 
https://issues.apache.org/jira/browse/MADLIB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983698#comment-15983698
 ] 

ASF GitHub Bot commented on MADLIB-1057:
----------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-madlib/pull/120


> Reduce memory footprint for DT
> ------------------------------
>
>                 Key: MADLIB-1057
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1057
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>             Fix For: v1.11
>
>
> Follow on from spike 
> https://issues.apache.org/jira/browse/MADLIB-1035
> Step 1
> As a MADlib developer, I want to recreate the RF memory issue (reported in 
> https://issues.apache.org/jira/browse/MADLIB-1035). 
> The current datasets we have are 
> dt_adult : 32K rows 14 columns
> ecommerce : 1M rows 4 columns (ecommerce isn’t actually suitable for DT/RF)
> We need a table with ~2.2M rows and ~130 features (the actual target table 
> has ~1300 features). Randomly filling them might help diagnose the issue, 
> but ideally we would want a somewhat sensible dataset. The problem seems to 
> involve relatively short trees (depth 5), which means a random dataset will 
> probably fill the whole tree, which might not be true for a structured dataset.
> Step 2
> Refactoring DT for a smaller memory footprint.
> The tree accumulator has two matrices, one for continuous and one for 
> categorical variables. The whole structure is recreated at every level. 
> Every matrix has 2^i rows (i is the level).
> The categorical matrix size depends on the total number of categories 
> (weather : {sunny, cloudy, rainy}, isWeekend : {true, false} means this total 
> is 3+2=5). 
> The continuous matrix size depends on the number of continuous features * the 
> number of bins.
> The tree accumulator works like an array, not a linked list. Even if the output is 
> not a complete tree, the tree accumulator creates rows for nonexistent 
> branches in proper order and fills them with 0 values. 
> The refactored version would create a small index table that has the same 
> number of rows as the old tree accumulator (a complete tree) but only a 
> single index column that points to the new tree accumulator row. 
> This will allow us to keep most of the internal function interfaces the same, but 
> the code to access (read/write) the tree accumulator will have to change.
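
The sizing rules quoted in Step 2 can be worked through with a little arithmetic. The Python sketch below is illustrative only and is not MADlib code: the function name, the assumption of one cell per (node, category) or (node, feature, bin), and the example of 130 continuous features with 100 bins are all hypothetical choices. The real accumulator may store additional per-response statistics.

```python
def dense_accumulator_cells(level, category_counts, n_continuous, n_bins):
    """Return (categorical_cells, continuous_cells) for one tree level.

    Follows the description in the issue: every matrix has 2**level rows,
    the categorical width is the total number of categories, and the
    continuous width is (number of continuous features) * (number of bins).
    Assumption: exactly one cell per column; the real layout may differ.
    """
    rows = 2 ** level
    categorical_cols = sum(category_counts)
    continuous_cols = n_continuous * n_bins
    return rows * categorical_cols, rows * continuous_cols


# weather has 3 categories and isWeekend has 2, so the categorical width is 5.
# 130 continuous features and 100 bins are assumed values for illustration.
cat_cells, cont_cells = dense_accumulator_cells(
    level=5, category_counts=[3, 2], n_continuous=130, n_bins=100
)
print(cat_cells, cont_cells)
```

Because the row count doubles at each level regardless of how many nodes actually exist, the dense layout grows with the complete tree rather than with the tree that was built.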

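The proposed index table can be sketched in the same spirit. This is one possible reading of the proposal, not the actual refactoring: the class and method names are hypothetical, node positions are assumed to be numbered 0 to 2^i - 1 within a level, and -1 is an assumed marker for a nonexistent branch.

```python
class CompactAccumulator:
    """Statistics rows only for nodes that exist, plus a one-column index.

    The index has one entry per position in the complete tree level, so
    callers can keep addressing nodes by their complete-tree position.
    """

    def __init__(self, existing_positions, level, n_cols):
        n_positions = 2 ** level
        positions = sorted(set(existing_positions))
        if positions and not (0 <= positions[0] and positions[-1] < n_positions):
            raise ValueError("node position outside the level")
        # -1 marks a branch that does not exist in the tree (assumed marker).
        self.index = [-1] * n_positions
        for row, pos in enumerate(positions):
            self.index[pos] = row
        # Only existing nodes get a row of statistics.
        self.rows = [[0.0] * n_cols for _ in positions]

    def add(self, pos, col, value):
        row = self.index[pos]
        if row < 0:
            raise KeyError(f"node {pos} does not exist at this level")
        self.rows[row][col] += value

    def get(self, pos, col):
        row = self.index[pos]
        # Nonexistent branches read as 0, like the zero-filled dense rows.
        return 0.0 if row < 0 else self.rows[row][col]


# Level 3 has 8 positions, but only nodes 1 and 6 exist in this example.
acc = CompactAccumulator(existing_positions={1, 6}, level=3, n_cols=5)
acc.add(6, 2, 1.5)
print(acc.index, len(acc.rows), acc.get(6, 2), acc.get(0, 2))
```

Reads and writes go through the index, so only the accessor code changes while callers continue to use complete-tree positions.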


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
