Hi,

*Problem*: The Spark job fails, but the RM page says the job succeeded. Likewise, appHandle = sparkLauncher.startApplication() ... appHandle.getState() returns the FINISHED state, which indicates "The application finished with a successful status", whereas the Spark job actually failed.

*Environment*: Mac OS X (El Capitan), Hadoop 2.7.2, Spark 2.0, SparkLauncher 2.0.1. I have a Spark job (pagerank.py) running in yarn-client mode.

*Reason for job failure*: The job fails because the dependency package pagerank.zip is missing.

*Related JIRAs (which indicate the bug was fixed)*:
https://issues.apache.org/jira/browse/SPARK-7736 - this was for yarn-cluster mode; I now face the issue in yarn-client mode.
https://issues.apache.org/jira/browse/SPARK-9416 (duplicate)
I faced the same issue last year with SparkLauncher (spark-launcher_2.11) version 1.4.0. Marcelo then had a pull request which fixed the issue, and after his fix it worked for yarn-cluster mode.

*Description*: I'm launching the Spark job via SparkLauncher#startApplication().
1) The RM page says the job succeeded, even though the Spark job failed.
2) In the container logs, I see that appHandle.getState() returned the FINISHED state, which also means "The application finished with a successful status".
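To make the repro concrete, here is a minimal sketch of how I start the job. The Spark home, app resource path, and py-file path below are placeholders, not my real values; the Listener is only there to log state transitions, and addPyFile() is how I ship the pagerank.zip dependency:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class PageRankLauncher {
    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(1);

        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")        // placeholder
                .setMaster("yarn")
                .setDeployMode("client")           // yarn-client mode
                .setAppResource("pagerank.py")     // placeholder path
                .addPyFile("pagerank.zip")         // the dependency that goes missing at runtime
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle h) {
                        System.out.println("State changed to: " + h.getState());
                        if (h.getState().isFinal()) {
                            done.countDown();
                        }
                    }

                    @Override
                    public void infoChanged(SparkAppHandle h) {
                        System.out.println("App id: " + h.getAppId());
                    }
                });

        done.await();
        // Even when the PySpark job fails, getState() here reports FINISHED,
        // which the SparkAppHandle.State javadoc describes as a successful finish.
        System.out.println("Final state: " + handle.getState());
    }
}
```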
But in the same map container log lines I see that *the job actually failed* (I launched the Spark job from the map task):

493 INFO: ImportError: ('No module named pagerank', <function subimport at 0x10703f500>, ('pagerank',))
557 INFO: ImportError: ('No module named pagerank', <function subimport at 0x10703f500>, ('pagerank',))
591 INFO: ImportError: ('No module named pagerank', <function subimport at 0x10c8a9500>, ('pagerank',))
655 INFO: ImportError: ('No module named pagerank', <function subimport at 0x10c8a9500>, ('pagerank',))
659 INFO: 16/11/11 18:25:37 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
665 INFO: 16/11/11 18:25:37 INFO DAGScheduler: ShuffleMapStage 0 (distinct at /private/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0016/container_1478901028064_0016_01_000002/pagerank.py:52) failed in 3.221 s
667 INFO: 16/11/11 18:25:37 INFO DAGScheduler: *Job 0 failed*: collect at /private/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0016/container_1478901028064_0016_01_000002/pagerank.py:68, took 3.303328 s
681 INFO: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
683 INFO: : org.apache.spark.SparkException: *Job aborted due to stage failure*: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, <my-ip>): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
705 INFO: ImportError: ('No module named pagerank', <function subimport at 0x10c8a9500>, ('pagerank',))
745 INFO: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
757 INFO: at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
759 INFO: at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
763 INFO: at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
841 INFO: ImportError: ('No module named pagerank', <function subimport at 0x10c8a9500>, ('pagerank',))
887 INFO: Spark job with app id: application_1478901028064_0017, *State changed to: FINISHED* - The application finished with a successful status.
And here are the log lines from the Spark job container:

16/11/11 18:25:37 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 2)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 664, in subimport
    __import__(name)
ImportError: ('No module named pagerank', <function subimport at 0x10c8a9500>, ('pagerank',))
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

16/11/11 18:25:37 ERROR Executor: Exception in task 0.3 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/var/folders/9x/4m9lx2wj4qd8vwwq8n_qb2vx7mkj6g/T/hadoop/hadoop/nm-local-dir/usercache/<my-username>/appcache/application_1478901028064_0017/container_1478901028064_0017_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 664, in subimport
    __import__(name)
ImportError: ('No module named pagerank', <function subimport at 0x10c8a9500>, ('pagerank',))
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:390)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)