Aaron Staple created SPARK-3550:
-----------------------------------

             Summary: Disable automatic rdd caching in python api for relevant learners
                 Key: SPARK-3550
                 URL: https://issues.apache.org/jira/browse/SPARK-3550
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
            Reporter: Aaron Staple

The Python MLlib API automatically caches training RDDs. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning: NaiveBayes evaluates its input RDD only once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the Python MLlib API.

See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953
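
To illustrate the reasoning, here is a toy model in plain Python. It does not use Spark at all: LazyDataset, single_pass_learner, iterative_learner and count_evaluations are hypothetical names invented for this sketch, and LazyDataset only imitates the "recompute on every pass unless cached" behaviour of an RDD. It counts how often the input is actually evaluated, with and without caching.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class).

    The data is recomputed on every collect() unless cache() was called.
    """

    def __init__(self, compute):
        self._compute = compute    # zero-argument function producing the rows
        self._use_cache = False
        self._cached = None
        self.evaluations = 0       # number of times compute() actually ran

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = list(self._compute())
        if self._use_cache:
            self._cached = rows
        return rows


def single_pass_learner(data):
    """Reads its input once, like the NaiveBayes case described above."""
    rows = data.collect()
    return sum(rows) / len(rows)


def iterative_learner(data, iterations=5):
    """Re-reads its input on every iteration, so external caching helps."""
    estimate = 0.0
    for _ in range(iterations):
        rows = data.collect()
        estimate += 0.5 * (sum(rows) / len(rows) - estimate)
    return estimate


def count_evaluations(learner, cached):
    """Run a learner on a fresh dataset and report how often it was computed."""
    data = LazyDataset(lambda: range(1, 11))
    if cached:
        data.cache()
    learner(data)
    return data.evaluations


if __name__ == "__main__":
    for learner in (single_pass_learner, iterative_learner):
        for cached in (False, True):
            print(learner.__name__, "cached" if cached else "uncached",
                  count_evaluations(learner, cached))
```

In this model the single-pass learner evaluates its input once whether or not it is cached, so caching only costs memory. The iterative learner is the case the automatic caching is meant for. The sketch does not model the ALS and DecisionTree case, where the learner persists its own derived data internally.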