Julian King created SPARK-23704:
-----------------------------------

             Summary: PySpark access of individual trees in random forest is slow
                 Key: SPARK-23704
                 URL: https://issues.apache.org/jira/browse/SPARK-23704
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.2.1
         Environment: PySpark 2.2.1 / Windows 10
            Reporter: Julian King

Making predictions from a RandomForestClassifier model in PySpark is much faster than making predictions from an individual tree contained within the model's .trees attribute. In fact, the model.transform call without an action is more than 10x slower for an individual tree than the model.transform call for the random forest model.

See [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark] for an example with timing.

Ideally:
 * Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest
 * Getting the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest
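
The comparison described above could be reproduced with a small timing helper along the following lines. This is only a sketch, not the code from the linked Stack Overflow question: the function names `time_transform` and `compare_forest_and_trees` are hypothetical, and the only model members assumed are `transform` and the `.trees` attribute mentioned in the report. Because no action is triggered, it measures only the cost of the `transform` call itself, which is the overhead the report is about.

```python
import time


def time_transform(model, df, repeats=3):
    """Best wall-clock time (seconds) of model.transform(df) over `repeats` calls.

    No action (count, collect, ...) is run on the result, so for a Spark
    model this times the transform call only, not the distributed job.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        model.transform(df)
        best = min(best, time.perf_counter() - start)
    return best


def compare_forest_and_trees(forest_model, df, repeats=3):
    """Time transform for the whole forest and for each tree in .trees.

    Returns (forest_time, tree_times), where tree_times has one entry
    per individual tree, in the same order as forest_model.trees.
    """
    forest_time = time_transform(forest_model, df, repeats)
    tree_times = [time_transform(tree, df, repeats) for tree in forest_model.trees]
    return forest_time, tree_times
```

With a fitted model and a DataFrame (here called `rf_model` and `test_df`, both placeholder names), `forest_time, tree_times = compare_forest_and_trees(rf_model, test_df)` would give the numbers to compare. Taking the best of several repeats is a design choice to reduce noise from one-off warm-up costs.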