GitHub user staple opened a pull request:

    https://github.com/apache/spark/pull/2362

    [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.

    When running an iterative learning algorithm, the input RDD should be
    cached for performance. Previously, when learning was applied to a Python
    RDD, the Python RDD was always cached; in Scala, that cached RDD was then
    mapped to an uncached deserialized RDD, and the uncached RDD was passed to
    the learning algorithm. Because the RDD holding the deserialized data was
    uncached, learning algorithms implicitly deserialized the same data again
    on every iteration.
    
    This patch moves RDD caching to after deserialization for learning
    algorithms that should be called with a cached RDD. For algorithms that
    implement their own caching internally, the input RDD is no longer cached.
    Below I’ve listed the learning routines accessible from Python, the
    location where caching was previously enabled, and the location (if any)
    where caching is now enabled by this patch.
    
    LogisticRegressionWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    SVMWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    NaiveBayes:
    was: python (in _get_unmangled_labeled_point_rdd)
    now: none
    
    KMeans:
    was: python (in _get_unmangled_double_vector_rdd)
    now: jvm (trainKMeansModel)
    
    ALS:
    was: python (in _get_unmangled_rdd)
    now: none
    
    LinearRegressionWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    LassoWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    RidgeRegressionWithSGD:
    was: python (in _regression_train_wrapper/_get_unmangled_labeled_point_rdd)
    now: jvm (trainRegressionModel)
    
    DecisionTree:
    was: python (in _get_unmangled_labeled_point_rdd)
    now: none
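
To illustrate why the placement of the cache matters, here is a minimal pure-Python sketch. It does not use Spark: `ToyRDD`, `run`, and `deserialize` are hypothetical names made up for this illustration, and `pickle` merely stands in for whatever serialization the Python-to-JVM bridge uses. The sketch only counts how often records get deserialized when the cache sits before versus after the deserializing map.

```python
import pickle


class ToyRDD:
    """Tiny lazy dataset used as a stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        parent = self
        # Lazy: fn is applied only when collect() is called on the result.
        return ToyRDD(lambda: [fn(x) for x in parent.collect()])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


def run(cache_after_deserialization, iterations=5, records=4):
    """Return how many times a record was deserialized over all iterations."""
    calls = {"deserialize": 0}

    def deserialize(blob):
        calls["deserialize"] += 1
        return pickle.loads(blob)

    serialized = ToyRDD(lambda: [pickle.dumps(float(i)) for i in range(records)])

    if cache_after_deserialization:
        # Placement after this patch: cache the deserialized data.
        data = serialized.map(deserialize).cache()
    else:
        # Previous placement: cache the serialized data, then map uncached.
        data = serialized.cache().map(deserialize)

    for _ in range(iterations):
        sum(data.collect())          # each "iteration" scans the whole input

    return calls["deserialize"]


print("cache before deserialization:", run(False))
print("cache after deserialization: ", run(True))
```

With 4 records and 5 iterations, the first placement deserializes every record on every pass, while the second deserializes each record once. Algorithms listed above with "now: none" correspond to skipping the outer `cache()` call entirely because they cache internally.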

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/staple/spark SPARK-3488

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2362.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2362
    
----
commit 7042ebc2214f13c7d5d5acd28fcfa0478c1ddf2c
Author: Aaron Staple <aaron.sta...@gmail.com>
Date:   2014-09-11T05:11:11Z

    [SPARK-3488][MLLIB] Cache python RDDs after deserialization for relevant iterative learners.

----



