Hi,

I am trying to upgrade from Spark v0.9.1 to v1.0.0 and am running into some weird
behavior.
When, in pyspark, I invoke
sc.textFile("hdfs://hadoop-ha01:yyyy/user/xxxxx/events_2.1").take(10000), the
call crashes with the stack trace below.
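For completeness, the whole shell session is just this (host, port, and user
anonymized exactly as in the path above):

    >>> rdd = sc.textFile("hdfs://hadoop-ha01:yyyy/user/xxxxx/events_2.1")
    >>> rdd.take(10000)   # crashes with the stack trace below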
The file resides on Hadoop 2.2; it is a large set of event data containing
Unicode strings (unfortunately I cannot share the data due to privacy
constraints).
The same code works just fine with the Scala shell of 1.0.0.
The annoying thing is that the same code, in both Scala and Python, works
without any problem on 0.9.1.

To me the problem looks related to string serialization support in PySpark:
the JVM side dies in PythonRDD.writeUTF, and the "Python worker exited
unexpectedly (crashed)" line suggests the worker process went down first.
But I am just guessing.
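In case it helps anyone reproduce or narrow this down, this is the kind of
bisection I can run on my side (the sample sizes are arbitrary, and I cannot
share the records themselves):

    rdd = sc.textFile("hdfs://hadoop-ha01:yyyy/user/xxxxx/events_2.1")
    # every take() pipes the text lines through PythonRDD.writeUTF, so
    # growing the sample until it crashes should point at a specific
    # record (or record size) in the file
    rdd.take(10)
    rdd.take(1000)
    rdd.take(10000)   # this is the call that crashes for me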

Any help is appreciated,
Sagi


14/06/19 08:35:52 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/06/19 08:35:52 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/06/19 08:35:52 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/06/19 08:35:52 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/06/19 08:35:52 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/06/19 08:35:53 INFO PythonRDD: Times: total = 542, boot = 219, init = 322, finish = 1
14/06/19 08:35:53 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:332)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:304)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:303)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
14/06/19 08:35:53 ERROR PythonRDD: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:332)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:304)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:303)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)
14/06/19 08:35:53 INFO DAGScheduler: Failed to run take at <stdin>:1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\src\spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
    iterator = mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()
  File "D:\src\spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py", line 537, in __call__
  File "D:\src\spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o21.collectPartitions.
: java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:332)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:304)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:303)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:303)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:200)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:175)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:174)



