GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2362
[SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.

When running an iterative learning algorithm, it makes sense that the input RDD be cached for improved performance. When learning is applied to a python RDD, previously the python RDD was always cached, then in scala that cached RDD was mapped to an uncached deserialized RDD, and the uncached RDD was passed to the learning algorithm. Since the RDD with deserialized data was uncached, learning algorithms would implicitly deserialize the same data repeatedly, on every iteration.

This patch moves RDD caching after deserialization for learning algorithms that should be called with a cached RDD. For algorithms that implement their own caching internally, the input RDD is no longer cached.

Below I've listed the different learning routines accessible from python, the location where caching was previously enabled, and the location (if any) where caching is now enabled by this patch.

LogisticRegressionWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

SVMWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

NaiveBayes:
  was: python (in _get_unmangled_labeled_point_rdd)
  now: none

KMeans:
  was: python (in _get_unmangled_double_vector_rdd)
  now: jvm (trainKMeansModel)

ALS:
  was: python (in _get_unmangled_rdd)
  now: none

LinearRegressionWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

LassoWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

RidgeRegressionWithSGD:
  was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
  now: jvm (trainRegressionModel)

DecisionTree:
  was: python (in _get_unmangled_labeled_point_rdd)
  now: none

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/staple/spark SPARK-3488

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2362.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2362

----
commit 7042ebc2214f13c7d5d5acd28fcfa0478c1ddf2c
Author: Aaron Staple <aaron.sta...@gmail.com>
Date:   2014-09-11T05:11:11Z

    [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.

----
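
As a rough illustration of the caching change described in this pull request, the toy model below counts how often records are deserialized when the cache sits before versus after the deserializing map. It is plain Python and does not use Spark's API: `LazyDataset`, `count_deserializations`, and the sample records are hypothetical names invented for this sketch, and the model ignores partitioning, serialization formats, and storage levels.

```python
class LazyDataset:
    """Toy stand-in for an RDD: lazily evaluated, optionally cached."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument function returning a list
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        # Build a new, uncached dataset that recomputes from its parent.
        parent = self
        return LazyDataset(lambda: [fn(x) for x in parent.collect()])

    def cache(self):
        # Mark this dataset so its first computed result is reused.
        self._cache_enabled = True
        return self

    def collect(self):
        if not self._cache_enabled:
            return self._compute()
        if self._cached is None:
            self._cached = self._compute()
        return self._cached


def count_deserializations(cache_after_deserialization, iterations=5):
    """Run a fake iterative learner and count deserialize() calls."""
    calls = 0

    def deserialize(raw_record):
        nonlocal calls
        calls += 1
        return float(raw_record)

    # Three "serialized" records, standing in for the python-side RDD.
    raw = LazyDataset(lambda: ["1.0", "2.0", "3.0"])

    if cache_after_deserialization:
        # Behaviour after the patch: cache the deserialized data.
        data = raw.map(deserialize).cache()
    else:
        # Behaviour before the patch: cache the raw data, then map.
        data = raw.cache().map(deserialize)

    for _ in range(iterations):
        sum(data.collect())            # each iteration makes one full pass
    return calls


before = count_deserializations(cache_after_deserialization=False)
after = count_deserializations(cache_after_deserialization=True)
print("deserializations before patch:", before)
print("deserializations after patch:", after)
```

In this sketch, caching before the map causes every record to be deserialized again on each iteration, while caching after the map deserializes each record once. The routines listed with "now: none" correspond to skipping the `cache()` call entirely and leaving caching to the algorithm itself.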