Hi,

When I run the filter() method on an RDD and then try to print its results with collect(), I get a Py4JJavaError. It is not only filter(); other methods cause similar errors, and I cannot figure out what is causing this. PySpark works fine from the command line, but it does not work in the Zeppelin notebook. My setup is an AWS EMR instance running Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have included a snippet of the code and the error below. Thank you, and please let me know if you need any additional information.
%pyspark
nums = [1, 2, 3, 4, 5, 6]
rdd_nums = sc.parallelize(nums)
rdd_sq = rdd_nums.map(lambda x: pow(x, 2))
rdd_cube = rdd_nums.map(lambda x: pow(x, 3))
rdd_odd = rdd_nums.filter(lambda x: x % 2 == 1)
print "nums: %s" % rdd_nums.collect()
print "squares: %s" % rdd_sq.collect()
print "cubes: %s" % rdd_cube.collect()
print "odds: %s" % rdd_odd.collect()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
...
...
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
    ... 10 more

If I instead set rdd_odd = rdd_nums.filter(lambda x: x % 2), I don't get an error.

Thanks,
Chad Timmins
Software Engineer Intern at Trulia
B.S. Electrical Engineering, UC Davis 2015
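P.S. In case it helps to reproduce this, below is a minimal, self-contained version of the comparison I run in both environments (the pyspark shell and a Zeppelin paragraph). The only assumption is that a SparkContext named sc already exists, as both environments provide one; the variable names odds_bool and odds_int are just my own labels for the two variants.

# Minimal repro: same data, two filter predicates.
# Assumes `sc` (SparkContext) is already defined, as in Zeppelin or the pyspark shell.
nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# Variant 1: boolean predicate -- this is the one that crashes for me in Zeppelin.
odds_bool = nums.filter(lambda x: x % 2 == 1)

# Variant 2: integer predicate (0 is falsy, nonzero is truthy) -- this one works.
odds_int = nums.filter(lambda x: x % 2)

print "bool predicate: %s" % odds_bool.collect()  # raises Py4JJavaError in Zeppelin for me
print "int predicate:  %s" % odds_int.collect()   # returns [1, 3, 5]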