[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148347#comment-14148347 ]
Aaron Staple commented on SPARK-3550:
-------------------------------------

This has been addressed in another commit:
https://github.com/apache/spark/commit/fce5e251d636c788cda91345867e0294280c074d

See comment here:
https://github.com/apache/spark/pull/2412#issuecomment-56865408

> Disable automatic rdd caching in python api for relevant learners
> -----------------------------------------------------------------
>
>                 Key: SPARK-3550
>                 URL: https://issues.apache.org/jira/browse/SPARK-3550
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>            Reporter: Aaron Staple
>
> The python mllib api automatically caches training rdds. However, the
> NaiveBayes, ALS, and DecisionTree learners do not require external caching to
> prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates
> its input RDD once, while ALS and DecisionTree internally persist
> transformations of their input RDDs. For these learners, we should disable
> the automatic caching in the python mllib api.
> See discussion here:
> https://github.com/apache/spark/pull/2362#issuecomment-55637953

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
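
The reasoning in the issue description (caching only pays off when a learner re-reads its input) can be illustrated with a small Spark-free sketch. Everything here is a toy assumption: `LazyDataset`, `one_pass_learner`, `multi_pass_learner`, and `evaluations` are hypothetical names, not PySpark or MLlib APIs, and the sketch does not model the internal persistence that ALS and DecisionTree are said to perform.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every pass unless cached."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument callable producing the records
        self._cached = None        # materialized records, filled on first cached pass
        self._use_cache = False
        self.evaluations = 0       # number of times compute() actually ran

    def cache(self):
        # Mark the dataset for caching; nothing is computed yet (lazy).
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        records = list(self._compute())
        if self._use_cache:
            self._cached = records
        return records


def one_pass_learner(data):
    """Reads its input once, the access pattern the issue ascribes to NaiveBayes."""
    records = data.collect()
    return sum(records) / len(records)


def multi_pass_learner(data, iterations=5):
    """Re-reads its input on every iteration, like an iterative optimizer."""
    estimate = 0.0
    for _ in range(iterations):
        records = data.collect()
        mean = sum(records) / len(records)
        estimate += 0.5 * (mean - estimate)
    return estimate


def evaluations(learner, cached):
    """Run a learner on a fresh dataset and report how often it was recomputed."""
    data = LazyDataset(lambda: range(10))
    if cached:
        data.cache()
    learner(data)
    return data.evaluations
```

Under these assumptions, caching changes nothing for the one-pass learner (one evaluation either way), while the iterative learner drops from five evaluations to one. For a learner of the first kind, an automatic cache would only cost memory.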