PySpark Unknown Opcode Error

2015-05-26 Thread Nikhil Muralidhar
Hello,
  I am trying to run a Spark job (which runs fine on the master node of the
cluster) on a Hadoop/HDFS cluster using YARN. When I run the job, which has
an rdd.saveAsTextFile() line in it, I get the following error:

*SystemError: unknown opcode*

The entire stacktrace has been appended to this message.
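
For context, here is a minimal sketch of the shape of the job; the paths and
the word-count logic are placeholders rather than the actual job, but the
reduceByKey-style aggregation matches the combineLocally frame in the
traceback below:

    from pyspark import SparkContext

    sc = SparkContext(appName="SaveAsTextFileJob")

    lines = sc.textFile("hdfs:///user/hadoop/input")     # placeholder input path
    counts = (lines.flatMap(lambda l: l.split())         # runs in the executors' Python workers
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))     # aggregation -> combineLocally in rdd.py
    counts.saveAsTextFile("hdfs:///user/hadoop/output")  # action that launches the failing tasks

    sc.stop()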

 All the nodes on the cluster, including the master, have Python 2.7.9
installed, and all of them have the variable SPARK_PYTHON set to the
Anaconda Python path. When I start the pyspark shell on these instances, it
uses Anaconda Python.
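
A small check along these lines (a rough sketch, not the failing job; the
app name is arbitrary) would report which interpreter and Python version the
executors' Python workers actually launch:

    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="PythonVersionCheck")

    def interpreter_info(_):
        # Runs inside the Python worker process on each executor.
        return [(sys.executable, sys.version.split()[0])]

    # One small task per default partition; distinct() collapses repeats.
    print(sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
            .mapPartitions(interpreter_info)
            .distinct()
            .collect())

    sc.stop()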

I installed Anaconda on all the slaves after reading about the Python
version incompatibility issues mentioned in the following post:


http://glennklockwood.blogspot.com/2014/06/spark-on-supercomputers-few-notes.html

Please let me know what the issue might be.

The Spark version we are using is Spark 1.3. The appended log and
stacktrace follow:

15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
memory on ip-10-64-10-221.ec2.internal:36266 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in 
memory on ip-10-64-10-222.ec2.internal:33470 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on ip-10-64-10-221.ec2.internal:36266 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in 
memory on ip-10-64-10-222.ec2.internal:33470 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:56 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 0.0 
(TID 7, ip-10-64-10-221.ec2.internal): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:311)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 
(TID 0) on executor ip-10-64-10-221.ec2.internal: 
org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode
) [duplicate 1]
15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 21.0 in stage 0.0 
(TID 8) on executor ip-10-64-10-221.ec2.internal: 
org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func

Re: PySpark Unknown Opcode Error

2015-05-26 Thread Davies Liu
This is most likely a case of running different Python versions on the
driver and on the slaves: the driver ships function bytecode compiled by
its Python, and a worker running a different Python version hits an opcode
it does not understand. Spark 1.4 (which will be released soon) will
double-check the Python versions.

Also, the variable should be PYSPARK_PYTHON, not SPARK_PYTHON.
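
A sketch of one way to apply this (the Anaconda path below is only an
example location, adjust it to your install): set PYSPARK_PYTHON in the
driver script before the SparkContext is created, so the executors launch
their Python workers with the same interpreter.

    import os

    # Example path only -- point this at the interpreter installed on every node.
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"

    from pyspark import SparkContext

    sc = SparkContext(appName="MyJob")
    # ... rest of the job ...

Alternatively, export PYSPARK_PYTHON in the environment the driver is
launched from (e.g. in conf/spark-env.sh), and make sure that path exists on
every node.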

