[ 
https://issues.apache.org/jira/browse/SPARK-3488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135613#comment-14135613
 ] 

Aaron Staple commented on SPARK-3488:
-------------------------------------

After further discussion, it has been decided that, for now, the present 
implementation’s reduced memory footprint for cached RDDs is worth the CPU cost 
of repeated deserialization during learning.

See the discussion: https://github.com/apache/spark/pull/2362#issuecomment-55552191

> cache deserialized python RDDs before iterative learning
> --------------------------------------------------------
>
>                 Key: SPARK-3488
>                 URL: https://issues.apache.org/jira/browse/SPARK-3488
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>            Reporter: Aaron Staple
>
> When running an iterative learning algorithm, it makes sense that the input 
> RDD be cached for improved performance. When learning is applied to a Python 
> RDD, the Python RDD is currently always cached; in Scala, that cached RDD 
> is then mapped to an uncached deserialized RDD, and the uncached RDD is passed 
> to the learning algorithm. Instead, the deserialized RDD should be cached.
> This was originally discussed here:
> https://github.com/apache/spark/pull/2347#issuecomment-55181535
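
The tradeoff described in this issue can be illustrated with a small sketch. It
is a pure-Python stand-in, not Spark code: pickle plays the role of the
serializer, plain lists play the role of cached RDDs, and the names
CountingCodec and train are hypothetical helpers introduced only for this
example. It counts deserialization calls per strategy and does not measure
memory footprint.

```python
import pickle


class CountingCodec:
    """Hypothetical pickle wrapper that counts deserialization calls."""

    def __init__(self):
        self.loads_calls = 0

    def dumps(self, record):
        return pickle.dumps(record)

    def loads(self, blob):
        self.loads_calls += 1
        return pickle.loads(blob)


def train(make_pass, iterations):
    """Toy iterative 'learner': reads every record once per iteration."""
    total = 0.0
    for _ in range(iterations):
        for features in make_pass():
            total += sum(features)
    return total


data = [[float(i), float(i + 1)] for i in range(1000)]
iterations = 10

# Strategy A (assumed analogue of the current behaviour): cache the
# serialized bytes and deserialize them again on every pass.
codec_a = CountingCodec()
cached_bytes = [codec_a.dumps(r) for r in data]
result_a = train(lambda: (codec_a.loads(b) for b in cached_bytes), iterations)

# Strategy B (assumed analogue of the proposal): deserialize once and
# cache the resulting objects for all later passes.
codec_b = CountingCodec()
serialized = [codec_b.dumps(r) for r in data]
cached_objects = [codec_b.loads(b) for b in serialized]
result_b = train(lambda: iter(cached_objects), iterations)

print(codec_a.loads_calls)  # 10000: one call per record per iteration
print(codec_b.loads_calls)  # 1000: one call per record in total
```

Both strategies compute the same result; they differ only in how often each
record is deserialized. The comment above records that the extra CPU cost of
strategy A was accepted, for now, in exchange for the smaller cached footprint.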



