Hi, I am trying to upgrade from Spark v0.9.1 to v1.0.0 and I am running into some weird behavior. When, in PySpark, I invoke sc.textFile("hdfs://hadoop-ha01:yyyy/user/xxxxx/events_2.1").take(10000), the call crashes with the stack trace below. The file resides in Hadoop 2.2; it is a large set of event data containing Unicode strings (unfortunately I cannot share the data due to privacy constraints). The same code works just fine with the Scala version of 1.0.0, and the annoying thing is that it also works without any problem in both Scala and Python on 0.9.1.
To me the problem seems to be related to serialization support in Python, but I am just guessing. Any help is appreciated,

Sagi

14/06/19 08:35:52 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/06/19 08:35:52 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/06/19 08:35:52 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/06/19 08:35:52 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/06/19 08:35:52 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/06/19 08:35:53 INFO PythonRDD: Times: total = 542, boot = 219, init = 322, finish = 1
14/06/19 08:35:53 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:332)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:304)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:303)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
14/06/19 08:35:53 ERROR PythonRDD: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
        [identical stack trace as above, elided]
14/06/19 08:35:53 INFO DAGScheduler: Failed to run take at <stdin>:1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\src\spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
    iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
  File "D:\src\spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", line 537, in __call__
  File "D:\src\spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.collectPartitions.
: java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        [identical stack trace as above, elided]
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)