Hi all,

While running the Spark word count Python example with an intentional mistake in *yarn-cluster mode*, the Spark terminal reports the final status as SUCCEEDED, but the application log files show the correct result: the job failed.
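To make the setup concrete, here is a condensed sketch of the driver I'm running. It is essentially the stock examples/src/main/python/wordcount.py, trimmed for this email (so line numbers differ from the tracebacks below); the only change is the textFile line:

    from __future__ import print_function

    from operator import add
    from pyspark import SparkContext

    if __name__ == "__main__":
        sc = SparkContext(appName="PythonWordCount")

        # The intentional mistake: nonExistentVariable is never defined,
        # so the driver raises NameError before any Spark job is submitted.
        lines = sc.textFile(nonExistentVariable, 1)

        counts = lines.flatMap(lambda x: x.split(' ')) \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(add)
        for (word, count) in counts.collect():
            print("%s: %i" % (word, count))

        sc.stop()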
Why do the terminal log output and the application log output contradict each other? If I run the same job in *local mode*, the terminal logs and the application logs match: both state that the job failed with the expected error in the Python script.

More details:

Scenario

While running the Spark word count Python example in *yarn-cluster mode*, I make an intentional error in wordcount.py by changing this line (I'm using Spark 1.4.1, but the problem also exists in Spark 1.4.0 and 1.3.0, which I tested):

lines = sc.textFile(sys.argv[1], 1)

into this line:

lines = sc.textFile(nonExistentVariable, 1)

where the variable nonExistentVariable is never created or initialized. I then run the example with this command (I put README.md into HDFS beforehand):

*./bin/spark-submit --master yarn-cluster wordcount.py /README.md*

According to the log printed in the terminal, the job runs and finishes successfully:

*Terminal logs*:
...
15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED)
15/07/23 16:19:21 INFO yarn.Client:
	 client token: N/A
	 diagnostics: Shutdown hook called before final status was reported.
	 ApplicationMaster host: 10.0.53.59
	 ApplicationMaster RPC port: 0
	 queue: default
	 start time: 1437693551439
	 final status: *SUCCEEDED*
	 tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
	 user: edadashov
15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444

But if I look at the log files generated for this application in HDFS, they indicate failure of the job, with the correct reason:

*Application log files*:
...
stdout:
Traceback (most recent call last):
  File "wordcount.py", line 32, in <module>
    lines = sc.textFile(nonExistentVariable, 1)
*NameError: name 'nonExistentVariable' is not defined*

Why does the terminal output (final status: *SUCCEEDED*) not match the application log result (NameError: name 'nonExistentVariable' is not defined)? Is this a bug? Is there a Jira ticket related to this issue? (Is someone assigned to it?)

If I run this wordcount.py example (with the mistaken line) in local mode, then the terminal logs state that the job has failed as well:

*./bin/spark-submit wordcount.py /README.md*

*Terminal logs*:
...
15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events to hdfs:///app-logs/local-1437694314943
Traceback (most recent call last):
  File "/home/edadashov/tools/myspark/spark/wordcount.py", line 32, in <module>
    lines = sc.textFile(nonExistentVariable, 1)
NameError: name 'nonExistentVariable' is not defined
15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from shutdown hook

Thanks.
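P.S. In case it helps anyone reproduce this: instead of reading the log files from HDFS directly, the aggregated per-container logs (including the stdout traceback quoted above) can also be fetched with the standard YARN command, assuming log aggregation is enabled on the cluster:

*yarn logs -applicationId application_1437612288327_0013*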