[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284 ]
Yan Facai (颜发才) commented on SPARK-3383:
----------------------------------------

[~WeichenXu123] Good work! I'd like to take a look if time allows.
Anyway, I believe that unordered features can benefit a lot from the PR.

> DecisionTree aggregate size could be smaller
> --------------------------------------------
>
>                 Key: SPARK-3383
>                 URL: https://issues.apache.org/jira/browse/SPARK-3383
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Storage and communication optimization:
> DecisionTree aggregate statistics could store less data (described below).
> The savings would be significant for datasets with many low-arity categorical
> features (binary features, or unordered categorical features). Savings would
> be negligible for continuous features.
> DecisionTree stores a vector of sufficient statistics for each (node, feature,
> bin). We could store 1 fewer bin per (node, feature): For a given (node,
> feature), if we store these vectors for all but the last bin, and also store
> the total statistics for each node, then we could compute the statistics for
> the last bin by subtraction. For binary and unordered categorical features,
> this would cut in half the number of bins to store and communicate.
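
The idea in the issue description can be illustrated with a small sketch. This is not Spark's implementation: the function names (`node_total`, `drop_last_bin`, `recover_last_bin`) are hypothetical, and the sufficient statistics are assumed to be per-class label counts for a single (node, feature) pair.

```python
# Illustrative sketch only; the names below are hypothetical, not Spark APIs.
# Each bin holds a vector of sufficient statistics, assumed here to be
# per-class label counts for one (node, feature) pair.

def node_total(bins):
    """Element-wise sum of all bin vectors: the node's total statistics."""
    return [sum(column) for column in zip(*bins)]


def drop_last_bin(bins):
    """Keep every bin except the last; this is what would be stored and sent."""
    return bins[:-1]


def recover_last_bin(stored_bins, total):
    """Recompute the dropped bin as the node total minus the stored bins."""
    remaining = list(total)
    for stats in stored_bins:
        remaining = [r - s for r, s in zip(remaining, stats)]
    return remaining


# A binary feature has two bins, so only one of them needs to be stored.
binary_feature_bins = [[4, 1], [2, 3]]   # [class-0 count, class-1 count] per bin
total = node_total(binary_feature_bins)  # computed once per node
stored = drop_last_bin(binary_feature_bins)
last = recover_last_bin(stored, total)
```

Because the node total is the same for every feature at that node, it is stored once per node, while each (node, feature) pair drops one bin. For a binary feature that means one stored bin instead of two.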