Hello,
  I am trying to run a Spark job on a Hadoop (HDFS) cluster using YARN; the same job runs fine on the master node of the cluster. When I run the job, which contains an rdd.saveAsTextFile() call, I get the following error:

SystemError: unknown opcode

The full stack trace is appended at the end of this message.
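
For context, the job is essentially of the following shape (the paths, app name, and parsing logic here are placeholders, not our actual code):

    from pyspark import SparkContext

    sc = SparkContext(appName="ExampleJob")  # hypothetical app name

    # Read, aggregate, and write back to HDFS; the failure shows up once
    # saveAsTextFile() forces the preceding stages to actually run.
    lines = sc.textFile("hdfs:///user/hadoop/input")        # placeholder path
    counts = lines.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs:///user/hadoop/output")     # placeholder path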

 All the nodes in the cluster, including the master, run Python 2.7.9, and all of them have PYSPARK_PYTHON set to the Anaconda Python path. When I start the pyspark shell on these instances, it opens with the Anaconda Python.
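
As a sanity check, something like the following can be run from the pyspark shell (a quick sketch, using the shell's sc context; not our production code) to report which interpreter each executor actually uses. If the variable is being picked up, every entry should be the Anaconda path:

    # Ask each partition which Python binary is executing it.
    def interpreter_path(_):
        import sys
        return sys.executable

    print(sc.parallelize(range(16), 16).map(interpreter_path).distinct().collect())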

I installed Anaconda on all the slaves after reading about the Python version incompatibility issues mentioned in the following post:

http://glennklockwood.blogspot.com/2014/06/spark-on-supercomputers-few-notes.html
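
As I understand it, "SystemError: unknown opcode" usually indicates bytecode compiled under one Python version being executed by a different one, so stale .pyc files shipped alongside the application code seem worth ruling out. A minimal cleanup sketch (the directory is hypothetical, standing in for wherever the job's Python code lives on each slave):

    # Remove stale .pyc files so the interpreter recompiles from source.
    import os

    APP_DIR = "/home/hadoop/app"  # hypothetical path to the application code
    for root, _dirs, files in os.walk(APP_DIR):
        for name in files:
            if name.endswith(".pyc"):
                os.remove(os.path.join(root, name))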

Please let me know what the issue might be.

The Spark version we are using is 1.3.

Stack trace:
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-64-10-221.ec2.internal:36266 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-64-10-222.ec2.internal:33470 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-64-10-221.ec2.internal:36266 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-64-10-222.ec2.internal:33470 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:56 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 0.0 (TID 7, ip-10-64-10-221.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:311)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 0) on executor ip-10-64-10-221.ec2.internal: org.apache.spark.api.python.PythonException (same traceback as above) [duplicate 1]
15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 21.0 in stage 0.0 (TID 8) on executor ip-10-64-10-221.ec2.internal: org.apache.spark.api.python.PythonException (same traceback as above) [duplicate 2]