Hari Shreedharan created SPARK-2442:
---------------------------------------

             Summary: Add a Hadoop Writable serializer
                 Key: SPARK-2442
                 URL: https://issues.apache.org/jira/browse/SPARK-2442
             Project: Spark
          Issue Type: Bug
            Reporter: Hari Shreedharan


Using data read from Hadoop files in shuffles can cause exceptions with the 
following stack trace:
{code}
java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:179)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:679)
{code}

The exception seems to go away when the Kryo serializer is used. Writables 
don't implement Serializable, so the default Java serializer rejects them, and 
Kryo without class registration might not serialize them in the most efficient 
way. I am wondering whether adding a Hadoop Writable-friendly serializer makes 
sense, as it is likely to perform better than Kryo without registration.
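
For reference, a minimal sketch of the Kryo workaround mentioned above, 
assuming the job builds its own SparkConf. The application name is a 
placeholder, and no classes are registered with Kryo here, which is exactly 
the unregistered case discussed above:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Switch from the default Java serializer to Kryo so that shuffled
// Writables (e.g. BytesWritable) no longer hit NotSerializableException.
val conf = new SparkConf()
  .setAppName("writable-shuffle-example") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
{code}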



--
This message was sent by Atlassian JIRA
(v6.2#6252)
