[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994968#comment-15994968 ]
Fernando Pereira commented on SPARK-20580:
------------------------------------------

I try to avoid any operation involving serialization. I'm using PySpark and the default RDD.cache(), so apparently there is a bug somewhere.

My case:

In [11]: fzer.fdata.morphologyRDD.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count()
Out[11]: 22431

In [12]: rdd2 = fzer.fdata.morphologyRDD.cache()

In [13]: rdd2.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count()
[Stage 7:> (0 + 8) / 128]
17/05/03 16:22:52 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 652)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
(...)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)


> Allow RDD cache with unserializable objects
> -------------------------------------------
>
>                 Key: SPARK-20580
>                 URL: https://issues.apache.org/jira/browse/SPARK-20580
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> In my current scenario we load complex Python objects on the worker nodes that are not completely serializable. We then apply certain map operations to the RDD, which at some point we collect. In this basic usage everything works well.
> However, if we cache() the RDD (which defaults to memory), it suddenly fails to execute the transformations after the caching step. Apparently caching serializes the RDD data and deserializes it whenever more transformations are required.
> It would be nice to avoid serializing the objects when they are cached to memory, and instead keep the original objects.
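
For reference, a minimal, self-contained sketch of the failure described above. It assumes a local PySpark session; the FakeMorphology class and the thread lock it holds are illustrative stand-ins for the complex, partially unserializable objects from the report, not part of the original code.

    import threading
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "spark-20580-repro")

    class FakeMorphology(object):
        """Hypothetical stand-in for a complex, partially unserializable object."""
        def __init__(self, ident):
            self.ident = ident
            self.lock = threading.Lock()  # thread locks cannot be pickled

        def has_duplicated_points(self):
            return False

    # The objects are built inside the Python workers and stay there: the
    # chained maps are pipelined in one Python process, nothing is pickled,
    # and the action succeeds.
    rdd = sc.parallelize(range(4), 2).map(lambda i: (i, FakeMorphology(i)))
    print(rdd.map(lambda kv: kv[1].has_duplicated_points()).count())

    # cache() breaks that pipeline: the cached elements must be pickled and
    # handed to the JVM block manager, so the first action on the cached RDD
    # fails with a pickling error in the Python worker, the same class of
    # failure as the crash/EOFException shown above.
    rdd2 = rdd.cache()
    print(rdd2.map(lambda kv: kv[1].has_duplicated_points()).count())

    sc.stop()

In other words, without cache() the objects never leave the Python worker process; cache() forces them through pickle so the JVM block manager can store the partition contents, which is why the serialize/deserialize round-trip, and the crash, only appears after the caching step.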