Elkhan,

What does the ResourceManager say about the final status of the job?  Spark
jobs that run as Yarn applications can fail but still successfully clean up
their resources and give them back to the Yarn cluster. Because of this,
there's a difference between your code throwing an exception in an
executor/driver and the Yarn application itself failing. Generally you'll see a
Yarn application fail when there's a memory problem (too much memory being
allocated, or not enough, causing executors to fail multiple times and
preventing your job from finishing).
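
If you want to double-check what Yarn itself recorded for the application, the
ResourceManager REST API exposes the final status. A rough sketch in Python
(assuming the RM web UI is on localhost:8088, as in the tracking URL from your
logs, and substituting your own application id):

# Query the ResourceManager REST API for an application's report.
# Assumes the RM web UI is reachable at localhost:8088 and Python 2,
# to match the versions in this thread.
import json
import urllib2

app_id = "application_1437612288327_0013"  # replace with your application id
url = "http://localhost:8088/ws/v1/cluster/apps/%s" % app_id

app = json.load(urllib2.urlopen(url))["app"]
print("state:       %s" % app["state"])        # e.g. FINISHED
print("finalStatus: %s" % app["finalStatus"])  # e.g. SUCCEEDED or FAILED

The same information is available from "yarn application -status <application id>"
on the command line.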

What I'm seeing from your post is that you had an exception in your
application, which was caught by the Spark framework, which then proceeded to
clean up the job and shut itself down - and it did so successfully. When you
aren't running in the Yarn modes, you don't see any Yarn status telling you
the Yarn application was shut down successfully; you just see the failure(s)
from your drivers/executors.
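
One thing you can do in the meantime is make the failure explicit in your
driver instead of relying on the uncaught exception reaching the shutdown
hook. A minimal sketch (illustrative only; whether Yarn's final status picks
up the non-zero exit depends on the Spark version):

# Sketch: wrap the driver body so the error is logged clearly and the
# process exits non-zero instead of dying in the shutdown hook.
import sys
import traceback
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    try:
        lines = sc.textFile(sys.argv[1], 1)
        counts = (lines.flatMap(lambda x: x.split(' '))
                       .map(lambda x: (x, 1))
                       .reduceByKey(lambda a, b: a + b))
        for word, count in counts.collect():
            print("%s: %i" % (word, count))
    except Exception:
        traceback.print_exc()
        sc.stop()
        sys.exit(1)
    sc.stop()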



On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov <elkhan8...@gmail.com>
wrote:

> Any updates on this bug?
>
> Why do the Spark log results & the job's final status not match? (one says
> the job failed, the other says it succeeded)
>
> Thanks.
>
>
> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> While running the Spark word count Python example with an intentional mistake in *Yarn
>> cluster mode*, the Spark terminal reports the final status as SUCCEEDED, but the log
>> files show the correct result, indicating that the job failed.
>>
>> Why do the terminal log output & application log output contradict each other?
>>
>> If I run the same job in *local mode*, the terminal logs and application
>> logs match: both state that the job failed due to the expected error in the
>> Python script.
>>
>> More details: Scenario
>>
>> While running the Spark word count Python example in *Yarn cluster mode*, if
>> I make an intentional error in wordcount.py by changing this line (I'm using
>> Spark 1.4.1, but the problem also exists in Spark 1.4.0 and 1.3.0, which I
>> tested):
>>
>> lines = sc.textFile(sys.argv[1], 1)
>>
>> into this line:
>>
>> lines = sc.textFile(*nonExistentVariable*,1)
>>
>> where the nonExistentVariable variable was never created or initialized.
>>
>> Then I run the example with this command (I put README.md into HDFS
>> before running it):
>>
>> *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*
>>
>> The job runs and finishes successfully according to the logs printed in the
>> terminal:
>> *Terminal logs*:
>> ...
>> 15/07/23 16:19:17 INFO yarn.Client: Application report for
>> application_1437612288327_0013 (state: RUNNING)
>> 15/07/23 16:19:18 INFO yarn.Client: Application report for
>> application_1437612288327_0013 (state: RUNNING)
>> 15/07/23 16:19:19 INFO yarn.Client: Application report for
>> application_1437612288327_0013 (state: RUNNING)
>> 15/07/23 16:19:20 INFO yarn.Client: Application report for
>> application_1437612288327_0013 (state: RUNNING)
>> 15/07/23 16:19:21 INFO yarn.Client: Application report for
>> application_1437612288327_0013 (state: FINISHED)
>> 15/07/23 16:19:21 INFO yarn.Client:
>>  client token: N/A
>>  diagnostics: Shutdown hook called before final status was reported.
>>  ApplicationMaster host: 10.0.53.59
>>  ApplicationMaster RPC port: 0
>>  queue: default
>>  start time: 1437693551439
>>  final status: *SUCCEEDED*
>>  tracking URL:
>> http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
>>  user: edadashov
>> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
>> 15/07/23 16:19:21 INFO util.Utils: Deleting directory
>> /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
>>
>> But if I look at the log files generated for this application in HDFS, they
>> indicate failure of the job with the correct reason:
>> *Application log files*:
>> ...
>> Traceback (most recent call last):
>>   File "wordcount.py", line 32, in <module>
>>     lines = sc.textFile(nonExistentVariable,1)
>> *NameError: name 'nonExistentVariable' is not defined*
>>
>>
>> Why does the terminal output (final status: *SUCCEEDED*) not match the
>> application log results, which indicate failure of the job (NameError: name
>> 'nonExistentVariable' is not defined)?
>>
>> Is this a bug? Is there a Jira ticket related to this issue? (Is someone
>> assigned to it?)
>>
>> If I run the same wordcount.py example (with the mistaken line) in local mode,
>> then the terminal logs also state that the job has failed.
>>
>> *./bin/spark-submit wordcount.py /README.md*
>>
>> *Terminal logs*:
>>
>> ...
>> 15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events to
>> hdfs:///app-logs/local-1437694314943
>> Traceback (most recent call last):
>>   File "/home/edadashov/tools/myspark/spark/wordcount.py", line 32, in
>> <module>
>>     lines = sc.textFile(nonExistentVariable,1)
>> NameError: name 'nonExistentVariable' is not defined
>> 15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from shutdown
>> hook
>>
>>
>> Thanks.
>>
>
>
>
> --
>
> Best regards,
> Elkhan Dadashov
>
