Hello, I am trying to run a Spark job on a Hadoop/HDFS cluster using YARN; the same job runs fine on the master node of the cluster. When I run the job, which contains an rdd.saveAsTextFile() call, I get the following error:
*SystemError: unknown opcode*

The entire stack trace is appended to this message. All the nodes in the cluster, including the master, run Python 2.7.9, and all of them have the variable SPARK_PYTHON set to the Anaconda Python path. When I launch the pyspark shell on these instances, it uses Anaconda Python. I installed Anaconda on all slaves after reading about the Python version incompatibility issues described in the following post: http://glennklockwood.blogspot.com/2014/06/spark-on-supercomputers-few-notes.html

Please let me know what the issue might be. The Spark version we are using is 1.3.
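For reference, this is roughly the kind of configuration involved (a sketch, not our exact files; the Anaconda install path is an assumption, and note that the variable name Spark itself documents is PYSPARK_PYTHON):

```shell
# conf/spark-env.sh on every node (master and all slaves).
# PYSPARK_PYTHON is the variable Spark reads to pick the Python used by
# workers; it must point at the SAME Python build on every node, otherwise
# driver and executors can disagree on bytecode.
export PYSPARK_PYTHON=/home/hadoop/anaconda/bin/python

# Alternatively, it can be set per-job at submit time via
# spark.executorEnv.[EnvironmentVariableName]:
# spark-submit --conf spark.executorEnv.PYSPARK_PYTHON=/home/hadoop/anaconda/bin/python job.py
```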
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-64-10-221.ec2.internal:36266 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-64-10-222.ec2.internal:33470 (size: 5.1 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-64-10-221.ec2.internal:36266 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:55 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-64-10-222.ec2.internal:33470 (size: 18.8 KB, free: 445.4 MB)
15/05/26 18:03:56 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 0.0 (TID 7, ip-10-64-10-221.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/hadoop/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/hadoop/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/home/hadoop/spark/python/pyspark/rdd.py", line 1704, in combineLocally
    if spill else InMemoryMerger(agg)
SystemError: unknown opcode
        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:311)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 0) on executor ip-10-64-10-221.ec2.internal: org.apache.spark.api.python.PythonException (same Python traceback as above, ending in SystemError: unknown opcode) [duplicate 1]
15/05/26 18:03:56 INFO scheduler.TaskSetManager: Lost task 21.0 in stage 0.0 (TID 8) on executor ip-10-64-10-221.ec2.internal: org.apache.spark.api.python.PythonException (same Python traceback as above, ending in SystemError: unknown opcode) [duplicate 2]
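In case it helps with diagnosis: "SystemError: unknown opcode" generally means the interpreter is executing .pyc bytecode that was compiled by a different Python version (e.g. stale .pyc files under pyspark/ left over from the system Python). One way to check whether a given .pyc matches the interpreter running it is to compare its magic number; a sketch (Python 3 API shown, the equivalent on 2.7 is imp.get_magic()):

```python
import importlib.util
import os
import py_compile
import tempfile

def pyc_matches_interpreter(pyc_path):
    """Return True if the .pyc's magic number matches this interpreter.

    A mismatched magic number means the file was compiled by a different
    Python version, which is exactly the situation that produces
    "SystemError: unknown opcode" at run time.
    """
    with open(pyc_path, 'rb') as f:
        magic = f.read(len(importlib.util.MAGIC_NUMBER))
    return magic == importlib.util.MAGIC_NUMBER

# Demo: a .pyc compiled by this very interpreter must match.
src = os.path.join(tempfile.mkdtemp(), 'mod.py')
with open(src, 'w') as f:
    f.write('x = 1\n')
pyc = py_compile.compile(src)
print(pyc_matches_interpreter(pyc))
```

Running that check over the .pyc files in the pyspark directory on each slave (with the same Python that the workers use) would show whether stale bytecode is the culprit.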