[ https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16241634#comment-16241634 ]
Weichen Xu commented on SPARK-22451: ------------------------------------ [~josephkb] I think it can be reconstructed from a summary with only O(numCategories) values, let me give an simple example to explain how it works (but if I am wrong correct me): Suppose an unordered features containing values [a, b, c, d, e, f], we first get separate statistics for stat[a], stat[b], stat[c], stat[d], stat[e], stat[f] then, if we want to get stats for split [a, c, d], we can compute it by stat[a] + stat[c] + stat[d]. [~Siddharth Murching] can also help to confirm the correctness of this. > Reduce decision tree aggregate size for unordered features from > O(2^numCategories) to O(numCategories) > ------------------------------------------------------------------------------------------------------ > > Key: SPARK-22451 > URL: https://issues.apache.org/jira/browse/SPARK-22451 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.2.0 > Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Do not need generate all possible splits for unordered features before > aggregate, > in aggregete (executor side): > 1. Change `mixedBinSeqOp`, for each unordered feature, we do the same stat > with ordered features. so for unordered features, we only need > O(numCategories) space for this feature stat. > 2. After driver side get the aggregate result, generate all possible split > combinations, and compute the best split. > This will reduce decision tree aggregate size for each unordered feature from > O(2^numCategories) to O(numCategories), `numCategories` is the arity of this > unordered feature. > This also reduce the cpu cost in executor side. Reduce time complexity for > this unordered feature from O(numPoints * 2^numCategories) to O(numPoints). > This won't increase time complexity for unordered features best split > computing in driver side. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org