[jira] [Created] (SPARK-23704) PySpark access of individual trees in random forest is slow

Julian King (JIRA) Thu, 15 Mar 2018 17:58:11 -0700

Julian King created SPARK-23704:
-----------------------------------

             Summary: PySpark access of individual trees in random forest is slow
                 Key: SPARK-23704
                 URL: https://issues.apache.org/jira/browse/SPARK-23704
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.2.1
         Environment: PySpark 2.2.1 / Windows 10
            Reporter: Julian King



Making predictions from a fitted RandomForestClassifier model in PySpark is 
much faster than making predictions from an individual tree contained within 
the model's .trees attribute. 

In fact, the model.transform call alone, without any action, is more than 10x 
slower for an individual tree than for the random forest model.

See https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark
for an example with timing.
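
A minimal sketch of the comparison described above, not the code from the linked question. The helper names (time_call, compare_forest_and_tree), the toy dataset, and the local[1] master are assumptions made for illustration; the reported timings came from the reporter's own data.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def compare_forest_and_tree(num_trees=10):
    """Time model.transform for the forest and for its first tree.

    No action is run on the result, matching the report.
    """
    # Imported here so the timing helper above stays usable without Spark.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Tiny made-up dataset (hypothetical); substitute real data to reproduce.
    df = spark.createDataFrame(
        [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))] * 10,
        ["label", "features"],
    )
    model = RandomForestClassifier(numTrees=num_trees, seed=1).fit(df)

    # transform is lazy, so only the cost of building the plan is measured.
    forest_seconds = time_call(lambda: model.transform(df))
    # model.trees exposes the individual decision tree models of the forest.
    tree_seconds = time_call(lambda: model.trees[0].transform(df))
    return forest_seconds, tree_seconds
```

Calling compare_forest_and_tree() returns the two timings in seconds. Absolute numbers will vary by machine and Spark version, so only the ratio between them is meaningful.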

Ideally:
 * Getting a prediction from a single tree should be comparable to or faster 
than getting predictions from the whole forest
 * Getting all the predictions from all the individual trees should be 
comparable in speed to getting the predictions from the random forest

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

