This might be an issue with how pyspark propagates the error back to the AM. I'm pretty sure this does not happen for Scala / Java apps.

Have you filed a bug?
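In the meantime, you can check what the RM actually recorded for the
application, independently of what spark-submit prints. A rough sketch,
using the application id from your own logs (and assuming log aggregation
is enabled, since you said the application logs end up in HDFS):

    # ask the ResourceManager for the application report and the final
    # state it recorded for the app
    yarn application -status application_1437612288327_0013

    # pull the aggregated container logs, which should contain the Python
    # traceback from the failed driver
    yarn logs -applicationId application_1437612288327_0013

If you do file the bug, a minimal self-contained repro would probably help;
there is a rough sketch at the very end of this mail, below the quoted thread.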
On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:

> Thanks Corey for your answer,
>
> Do you mean that "final status : SUCCEEDED" in the terminal logs means that
> the YARN RM could clean up the resources after the application finished
> (application finishing does not necessarily mean it succeeded or failed)?
>
> With that logic it totally makes sense.
>
> Basically the YARN logs do not say anything about the Spark job itself.
> They just say that the Spark job's resources were cleaned up and returned
> back to Yarn after the job completed.
>
> It would be great if the Yarn logs could also report the outcome of the
> job, because the user is more interested in the job's final status.
>
> Yarn related logs can be found in detail in the RM, NM, DN and NN log files.
>
> Thanks again.
>
> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> Elkhan,
>>
>> What does the ResourceManager say about the final status of the job?
>> Spark jobs that run as Yarn applications can fail but still successfully
>> clean up their resources and give them back to the Yarn cluster. Because
>> of this, there's a difference between your code throwing an exception in
>> an executor/driver and the Yarn application failing. Generally you'll see
>> a yarn application fail when there's a memory problem (too much memory
>> being allocated, or not enough, causing executors to fail multiple times
>> and not allowing your job to finish).
>>
>> What I'm seeing from your post is that you had an exception in your
>> application which was caught by the Spark framework, which then proceeded
>> to clean up the job and shut itself down, which it did successfully. When
>> you aren't running in the Yarn modes, you aren't seeing any Yarn status
>> telling you the Yarn application was successfully shut down; you are just
>> seeing the failure(s) from your drivers/executors.
>>
>> On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov <elkhan8...@gmail.com>
>> wrote:
>>
>>> Any updates on this bug?
>>>
>>> Why do the Spark log results & the job final status not match? (one
>>> saying that the job has failed, the other stating that the job has
>>> succeeded)
>>>
>>> Thanks.
>>>
>>> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov <elkhan8...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> While running the Spark word count python example with an intentional
>>>> mistake in *Yarn cluster mode*, the Spark terminal states the final
>>>> status as SUCCEEDED, but the log files state the correct result,
>>>> indicating that the job failed.
>>>>
>>>> Why do the terminal log output & the application log output contradict
>>>> each other?
>>>>
>>>> If I run the same job in *local mode*, then the terminal logs and
>>>> application logs match, where both state that the job has failed due to
>>>> the expected error in the python script.
>>>>
>>>> More details: Scenario
>>>>
>>>> While running the Spark word count python example in *Yarn cluster
>>>> mode*, if I make an intentional error in wordcount.py by changing this
>>>> line (I'm using Spark 1.4.1, but this problem also exists in the Spark
>>>> 1.4.0 and 1.3.0 versions, which I tested):
>>>>
>>>> lines = sc.textFile(sys.argv[1], 1)
>>>>
>>>> into this line:
>>>>
>>>> lines = sc.textFile(*nonExistentVariable*, 1)
>>>>
>>>> where the nonExistentVariable variable was never created or initialized.
>>>>
>>>> Then I run the example with this command (I put README.md into HDFS
>>>> before running it):
>>>>
>>>> *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*
>>>>
>>>> The job runs and finishes successfully according to the log printed in
>>>> the terminal:
>>>>
>>>> *Terminal logs*:
>>>> ...
>>>> 15/07/23 16:19:17 INFO yarn.Client: Application report for
>>>> application_1437612288327_0013 (state: RUNNING)
>>>> 15/07/23 16:19:18 INFO yarn.Client: Application report for
>>>> application_1437612288327_0013 (state: RUNNING)
>>>> 15/07/23 16:19:19 INFO yarn.Client: Application report for
>>>> application_1437612288327_0013 (state: RUNNING)
>>>> 15/07/23 16:19:20 INFO yarn.Client: Application report for
>>>> application_1437612288327_0013 (state: RUNNING)
>>>> 15/07/23 16:19:21 INFO yarn.Client: Application report for
>>>> application_1437612288327_0013 (state: FINISHED)
>>>> 15/07/23 16:19:21 INFO yarn.Client:
>>>> client token: N/A
>>>> diagnostics: Shutdown hook called before final status was reported.
>>>> ApplicationMaster host: 10.0.53.59
>>>> ApplicationMaster RPC port: 0
>>>> queue: default
>>>> start time: 1437693551439
>>>> final status: *SUCCEEDED*
>>>> tracking URL:
>>>> http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
>>>> user: edadashov
>>>> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
>>>> 15/07/23 16:19:21 INFO util.Utils: Deleting directory
>>>> /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
>>>>
>>>> But if I look at the log files generated for this application in HDFS,
>>>> they indicate failure of the job with the correct reason:
>>>>
>>>> *Application log files*:
>>>> ...
>>>> \00 stdout\00 179Traceback (most recent call last):
>>>> File "wordcount.py", line 32, in <module>
>>>> lines = sc.textFile(nonExistentVariable,1)
>>>> *NameError: name 'nonExistentVariable' is not defined*
>>>>
>>>> Why is the terminal output (final status: *SUCCEEDED*) not matching the
>>>> application log results, i.e. failure of the job (NameError: name
>>>> 'nonExistentVariable' is not defined)?
>>>>
>>>> Is this a bug? Is there a Jira ticket related to this issue? (Is someone
>>>> assigned to it?)
>>>>
>>>> If I run this wordcount.py example (with the mistaken line) in local
>>>> mode, then the terminal logs state that the job has failed as well.
>>>>
>>>> *./bin/spark-submit wordcount.py /README.md*
>>>>
>>>> *Terminal logs*:
>>>>
>>>> ...
>>>> 15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events
>>>> to hdfs:///app-logs/local-1437694314943
>>>> Traceback (most recent call last):
>>>> File "/home/edadashov/tools/myspark/spark/wordcount.py", line 32, in
>>>> <module>
>>>> lines = sc.textFile(nonExistentVariable,1)
>>>> NameError: name 'nonExistentVariable' is not defined
>>>> 15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from
>>>> shutdown hook
>>>>
>>>> Thanks.
>>>
>>> --
>>>
>>> Best regards,
>>> Elkhan Dadashov
>>
>
> --
>
> Best regards,
> Elkhan Dadashov

--
Marcelo
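P.S. Here is roughly what a minimal self-contained repro could look like
(the file name is made up; the undefined-name line mirrors the change to
wordcount.py described in the quoted thread above):

    # repro_final_status.py - raises NameError before any RDD action runs
    from pyspark import SparkContext

    sc = SparkContext(appName="FinalStatusRepro")
    # 'nonExistentVariable' is intentionally never defined
    lines = sc.textFile(nonExistentVariable, 1)
    print(lines.count())
    sc.stop()

Submit it in cluster mode and compare the final status that spark-submit
reports with the traceback in the application logs:

    ./bin/spark-submit --master yarn-cluster repro_final_status.py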