[ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-3161:
---------------------------------
    Priority: Major  (was: Minor)
    Target Version/s: 1.2.0

> Cache example-node map for DecisionTree training
> ------------------------------------------------
>
>                 Key: SPARK-3161
>                 URL: https://issues.apache.org/jira/browse/SPARK-3161
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: worker computation
> When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none, if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree's root node down to the given level.
> Proposal: Cache this mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: Extra RDD, which must share the same partitioning as the training data.
> Design:
> * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2], where each instance is stored with an array of node indices (one node per tree).
> * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to arrays of instances. This has more data-structure overhead but could be more efficient: not all nodes are split on each iteration, and this would allow each executor to skip instances that are not used for the current node set.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
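The O(level)-vs-O(1) trade-off above can be sketched in plain Python (this is not Spark code; Node, predict_node_index, and advance_cached_indices are hypothetical names used only to illustrate the idea behind option 1, where each example carries its current node index):

```python
class Node:
    """A minimal decision-tree node; feature is None for a leaf."""
    def __init__(self, index, feature=None, threshold=None, left=None, right=None):
        self.index = index          # node id within the tree
        self.feature = feature      # split feature (None for a leaf)
        self.threshold = threshold  # split threshold
        self.left = left
        self.right = right

def predict_node_index(root, example):
    """predictNodeIndex-style lookup: trace from the root down to the
    deepest node the example reaches -- O(level) per call."""
    node = root
    while node.feature is not None:
        node = node.left if example[node.feature] <= node.threshold else node.right
    return node.index

def advance_cached_indices(nodes_by_index, cached, examples):
    """Option-1-style update: each example already knows its current node,
    so moving one level down is a single O(1) step per example."""
    new_cached = []
    for idx, ex in zip(cached, examples):
        node = nodes_by_index[idx]
        if node.feature is None:            # example already stopped at a leaf
            new_cached.append(idx)
        elif ex[node.feature] <= node.threshold:
            new_cached.append(node.left.index)
        else:
            new_cached.append(node.right.index)
    return new_cached

# Tiny one-split tree: root (0) splits on feature 0 at threshold 0.5.
leaf_l, leaf_r = Node(1), Node(2)
root = Node(0, feature=0, threshold=0.5, left=leaf_l, right=leaf_r)
nodes = {0: root, 1: leaf_l, 2: leaf_r}

examples = [[0.2], [0.9]]
cached = [0, 0]                              # every example starts at the root
cached = advance_cached_indices(nodes, cached, examples)
print(cached)                                             # [1, 2]
print([predict_node_index(root, ex) for ex in examples])  # [1, 2]
```

In Spark terms, `cached` plays the role of the extra RDD the issue mentions: it must be co-partitioned with the training data so each executor can zip its local examples with their node indices without a shuffle.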