Aaron Staple created SPARK-3550:
-----------------------------------

             Summary: Disable automatic rdd caching in python api for relevant learners
                 Key: SPARK-3550
                 URL: https://issues.apache.org/jira/browse/SPARK-3550
             Project: Spark
          Issue Type: Improvement
          Components: MLlib, PySpark
            Reporter: Aaron Staple


The Python MLlib API automatically caches training RDDs. However, the
NaiveBayes, ALS, and DecisionTree learners do not require external caching to
prevent repeated RDD re-evaluation during learning. NaiveBayes evaluates its
input RDD only once, while ALS and DecisionTree internally persist
transformations of their input RDDs. For these learners, we should disable the
automatic caching in the Python MLlib API.
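
As a rough illustration of the proposed behavior (not the actual PySpark
code), the wrapper could skip caching for the three learners named above. The
names LEARNERS_WITHOUT_EXTERNAL_CACHING, FakeRDD, and prepare_training_rdd are
hypothetical and exist only for this sketch; FakeRDD is a stand-in so the
example runs without a Spark installation.

```python
# Hypothetical sketch only; none of these names come from PySpark itself.

# Learners that, per this issue, do not need the wrapper to cache their input.
LEARNERS_WITHOUT_EXTERNAL_CACHING = {"NaiveBayes", "ALS", "DecisionTree"}


class FakeRDD:
    """Minimal stand-in for an RDD that only records whether cache() was called."""

    def __init__(self, data):
        self.data = list(data)
        self.cached = False

    def cache(self):
        # Mirrors the real RDD.cache() convention of returning the RDD itself.
        self.cached = True
        return self


def prepare_training_rdd(learner_name, rdd):
    """Cache the training RDD unless the learner is listed as not needing it."""
    if learner_name not in LEARNERS_WITHOUT_EXTERNAL_CACHING:
        rdd.cache()
    return rdd


# NaiveBayes reads its input once, so the wrapper leaves the RDD uncached.
nb_input = prepare_training_rdd("NaiveBayes", FakeRDD([1, 2, 3]))

# A learner outside the list (here assumed to iterate over its input several
# times) still gets the automatic caching.
lr_input = prepare_training_rdd("LogisticRegressionWithSGD", FakeRDD([1, 2, 3]))
```

With this arrangement, learners outside the list keep the existing automatic
caching, so only the three learners discussed here change behavior.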

See discussion here:
https://github.com/apache/spark/pull/2362#issuecomment-55637953



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org