[ https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240967#comment-16240967 ]
Joseph K. Bradley commented on SPARK-22451: ------------------------------------------- How does this work? For unordered features, we need to keep track of stats for all possible 2^numCategories splits. That information cannot be reconstructed from a summary with only O(numCategories) values. > Reduce decision tree aggregate size for unordered features from > O(2^numCategories) to O(numCategories) > ------------------------------------------------------------------------------------------------------ > > Key: SPARK-22451 > URL: https://issues.apache.org/jira/browse/SPARK-22451 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Do not need generate all possible splits for unordered features before > aggregate, > in aggregete (executor side): > 1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat > with ordered features. so for unordered features, we only need > O(numCategories) space for this feature stat. > 2. After driver side get the aggregate result, generate all possible split > combinations, and compute the best split. > This will reduce decision tree aggregate size for each unordered feature from > O(2^numCategories) to O(numCategories), `numCategories` is the arity of this > unordered feature. > This also reduce the cpu cost in executor side. Reduce time complexity for > this unordered feature from O(numPoints * 2^numCategories) to O(numPoints). > This won't increase time complexity for unordered features best split > computing in driver side. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org