[ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-3161:
---------------------------------
            Priority: Major  (was: Minor)
    Target Version/s: 1.2.0

> Cache example-node map for DecisionTree training
> ------------------------------------------------
>
>                 Key: SPARK-3161
>                 URL: https://issues.apache.org/jira/browse/SPARK-3161
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: worker computation
> When training each level of a DecisionTree, each example must be mapped to a 
> node in the current level (or to none, if the example already sits in a leaf 
> at a shallower depth).  This is currently done via the function 
> predictNodeIndex(), which traces from the current tree's root node down to 
> the node at the given level.
> Proposal: Cache this mapping.
> * Pro: O(1) lookup instead of O(level).
> * Con: An extra RDD that must share the same partitioning as the training 
> data and be updated after each level's splits are chosen.
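
For illustration (not part of the original report), a minimal Scala sketch 
contrasting the two lookups. The Node class and both methods are simplified, 
hypothetical stand-ins, not MLlib's actual internals:

{code:scala}
// Minimal stand-in for a binary decision tree node (hypothetical).
case class Node(
    id: Int,
    isLeaf: Boolean,
    featureIndex: Int = -1,
    threshold: Double = 0.0,
    left: Option[Node] = None,
    right: Option[Node] = None)

object NodeIndexSketch {

  // Current approach: re-trace the path from the root on every call,
  // re-evaluating the split at each internal node.  O(level) per example.
  def predictNodeIndex(root: Node, features: Array[Double], level: Int): Int = {
    var node = root
    var depth = 0
    while (depth < level && !node.isLeaf) {
      node =
        if (features(node.featureIndex) <= node.threshold) node.left.get
        else node.right.get
      depth += 1
    }
    node.id
  }

  // Proposed approach: keep each example's current node id cached and
  // advance it one step once the next level's splits are chosen, making the
  // per-example lookup during training O(1).
  def advanceCachedIndex(
      nodes: Map[Int, Node],      // node id -> node, for the current tree
      cachedId: Int,
      features: Array[Double]): Int = {
    val node = nodes(cachedId)
    if (node.isLeaf) cachedId
    else if (features(node.featureIndex) <= node.threshold) node.left.get.id
    else node.right.get.id
  }
}
{code}
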
> Design:
> * (option 1) This could be done as in [Sequoia Forests | 
> https://github.com/AlpineNow/SparkML2], where each instance is stored with 
> an array of node indices (1 node per tree).
> * (option 2) This could also be done by storing an 
> RDD[Array[Map[Int, Array[TreePoint]]]], where each partition stores an array 
> of maps from node indices to arrays of instances.  This has more overhead in 
> data structures but could be more efficient: not all nodes are split on each 
> iteration, and it would let each executor ignore instances that are not used 
> for the current node set.  Both layouts are sketched below.
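
Likewise for illustration, a hedged sketch of the two candidate layouts.  
TreePoint below is a stand-in for MLlib's binned example representation, and 
the type aliases and advance() helper are hypothetical renderings of the 
options above, not a committed design:

{code:scala}
import org.apache.spark.rdd.RDD

// Stand-in for MLlib's binned representation of a training example.
case class TreePoint(label: Double, binnedFeatures: Array[Int])

object CachedLayouts {

  // Option 1 (Sequoia Forests style): one cached node index per tree, kept
  // next to each instance; for a forest of T trees the array has length T.
  type PerInstanceIndices = RDD[(TreePoint, Array[Int])]

  // Option 2: one element per partition, holding (per tree) a Map from node
  // index to the instances currently at that node, so an executor can skip
  // entries for nodes that are not split in the current iteration.
  type PerNodeBuckets = RDD[Array[Map[Int, Array[TreePoint]]]]

  // One training iteration under option 1: advance every instance's cached
  // node ids by one level, given a step function encoding the chosen splits.
  def advance(data: PerInstanceIndices)(
      step: (TreePoint, Int) => Int): PerInstanceIndices =
    data.map { case (point, nodeIds) =>
      (point, nodeIds.map(id => step(point, id)))
    }
}
{code}

Option 1 keeps the cache dense and cheap to update; option 2 pays Map overhead 
but lets an executor touch only the node buckets that are actually being split.
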



