[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994968#comment-15994968 ]
Fernando Pereira commented on SPARK-20580:
------------------------------------------

I try to avoid any operation involving serialization. I'm using PySpark and the default RDD.cache(), so apparently there is a bug somewhere.

My case:

In [11]: fzer.fdata.morphologyRDD.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count()
Out[11]: 22431

In [12]: rdd2 = fzer.fdata.morphologyRDD.cache()

In [13]: rdd2.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count()
[Stage 7:> (0 + 8) / 128]
17/05/03 16:22:52 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 652)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
(...)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)


> Allow RDD cache with unserializable objects
> -------------------------------------------
>
>                 Key: SPARK-20580
>                 URL: https://issues.apache.org/jira/browse/SPARK-20580
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> In my current scenario we load complex Python objects on the worker nodes that are not completely serializable. We then apply certain map operations to the RDD, which at some point we collect. In this basic usage everything works well.
> However, if we cache() the RDD (which defaults to memory), it suddenly fails to execute the transformations after the caching step. Apparently caching serializes the RDD data and deserializes it whenever more transformations are required.
> It would be nice to avoid serializing the objects when they are cached to memory, and instead keep the original objects.
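
For reference, a minimal, self-contained sketch of the failure described above. It assumes a local PySpark session; the FakeMorphology class and the thread lock it holds are illustrative stand-ins for the complex, partially unserializable objects from the report, not part of the original code.

    import threading
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "spark-20580-repro")

    class FakeMorphology(object):
        """Hypothetical stand-in for a complex, partially unserializable object."""
        def __init__(self, ident):
            self.ident = ident
            self.lock = threading.Lock()  # thread locks cannot be pickled

        def has_duplicated_points(self):
            return False

    # The objects are built inside the Python workers and stay there: the
    # chained maps are pipelined in one Python process, nothing is pickled,
    # and the action succeeds.
    rdd = sc.parallelize(range(4), 2).map(lambda i: (i, FakeMorphology(i)))
    print(rdd.map(lambda kv: kv[1].has_duplicated_points()).count())

    # cache() breaks that pipeline: the cached elements must be pickled and
    # handed to the JVM block manager, so the first action on the cached RDD
    # fails with a pickling error in the Python worker, the same class of
    # failure as the crash/EOFException shown above.
    rdd2 = rdd.cache()
    print(rdd2.map(lambda kv: kv[1].has_duplicated_points()).count())

    sc.stop()

In other words, without cache() the objects never leave the Python worker process; cache() forces them through pickle so the JVM block manager can store the partition contents, which is why the serialize/deserialize round-trip, and the crash, only appears after the caching step.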