Thanks a lot for the feedback, Marcelo. I've just filed a bug: SPARK-9416 <https://issues.apache.org/jira/browse/SPARK-9416>
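For anyone following the thread: Marcelo points out below that PythonRunner.scala calls System.exit(process.waitFor()), and an application calling System.exit() bypasses the YARN ApplicationMaster's normal error reporting (hence the "Shutdown hook called before final status was reported" diagnostics). Below is a rough sketch of the kind of change that would let the AM see the failure; it is illustrative only, not an actual patch, and SparkException is just a stand-in for whatever exception type a real fix would use:

    import org.apache.spark.SparkException

    object PythonExitSketch {
      // Sketch: wait for the forked Python process and surface a non-zero exit
      // code as an exception instead of calling System.exit, so the caller
      // (for example the YARN ApplicationMaster in yarn-cluster mode) can
      // observe the failure and report a FAILED final status.
      def waitForPythonApp(process: Process): Unit = {
        val exitCode = process.waitFor()
        if (exitCode != 0) {
          throw new SparkException(s"Python application exited with code $exitCode")
        }
      }

      def main(args: Array[String]): Unit = {
        // Tiny demo: fork a Python process that exits with a non-zero code.
        val process = new ProcessBuilder("python", "-c", "import sys; sys.exit(3)")
          .inheritIO()
          .start()
        waitForPythonApp(process)
      }
    }

With a change along these lines, the NameError in wordcount.py discussed below should surface as final status: FAILED in the application report rather than SUCCEEDED.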
On Tue, Jul 28, 2015 at 12:14 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> BTW this is most probably caused by this line in PythonRunner.scala:
>
>     System.exit(process.waitFor())
>
> The YARN backend doesn't like applications calling System.exit().
>
>
> On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>
>> This might be an issue with how pyspark propagates the error back to the
>> AM. I'm pretty sure this does not happen for Scala / Java apps.
>>
>> Have you filed a bug?
>>
>> On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>
>>> Thanks, Corey, for your answer.
>>>
>>> Do you mean that "final status: SUCCEEDED" in the terminal logs means
>>> that the YARN RM could clean up the resources after the application
>>> finished (an application finishing does not necessarily mean it
>>> succeeded or failed)?
>>>
>>> With that logic it totally makes sense.
>>>
>>> Basically, the YARN logs do not say anything about the Spark job itself.
>>> They just say that the Spark job's resources were cleaned up and
>>> returned to YARN after the job completed.
>>>
>>> It would be great if the YARN logs could also report the outcome of the
>>> job, because the user is mostly interested in the job's final status.
>>>
>>> Detailed YARN-related logs can be found in the RM, NM, DN, and NN log files.
>>>
>>> Thanks again.
>>>
>>> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>
>>>> Elkhan,
>>>>
>>>> What does the ResourceManager say about the final status of the job?
>>>> Spark jobs that run as YARN applications can fail but still
>>>> successfully clean up their resources and give them back to the YARN
>>>> cluster. Because of this, there is a difference between your code
>>>> throwing an exception in an executor/driver and the YARN application
>>>> failing. Generally you'll see a YARN application fail when there's a
>>>> memory problem (too much memory being allocated, or not enough,
>>>> causing executors to fail multiple times and preventing your job from
>>>> finishing).
>>>>
>>>> What I'm seeing from your post is that you had an exception in your
>>>> application which was caught by the Spark framework, which then
>>>> proceeded to clean up the job and shut itself down, which it did
>>>> successfully. When you aren't running in the YARN modes, you don't see
>>>> any YARN status telling you that the YARN application was shut down
>>>> successfully; you just see the failure(s) from your drivers/executors.
>>>>
>>>>
>>>> On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>>>
>>>>> Any updates on this bug?
>>>>>
>>>>> Why do the Spark log results and the job's final status not match?
>>>>> (One says the job failed, the other states that it succeeded.)
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> While running the Spark word count Python example with an intentional
>>>>>> mistake in *YARN cluster mode*, the Spark terminal output reports the
>>>>>> final status as SUCCEEDED, but the log files show the correct result,
>>>>>> indicating that the job failed.
>>>>>>
>>>>>> Why do the terminal log output and the application log output
>>>>>> contradict each other?
>>>>>>
>>>>>> If I run the same job in *local mode*, the terminal logs and the
>>>>>> application logs match: both state that the job failed due to the
>>>>>> expected error in the Python script.
>>>>>>
>>>>>> More details: Scenario
>>>>>>
>>>>>> While running the Spark word count Python example in *YARN cluster
>>>>>> mode*, I make an intentional error in wordcount.py by changing this
>>>>>> line (I'm using Spark 1.4.1, but the problem also exists in Spark
>>>>>> 1.4.0 and 1.3.0, which I tested):
>>>>>>
>>>>>>     lines = sc.textFile(sys.argv[1], 1)
>>>>>>
>>>>>> into this line:
>>>>>>
>>>>>>     lines = sc.textFile(*nonExistentVariable*, 1)
>>>>>>
>>>>>> where the nonExistentVariable variable was never created or initialized.
>>>>>>
>>>>>> Then I run the example with this command (I put README.md into HDFS
>>>>>> before running it):
>>>>>>
>>>>>> *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*
>>>>>>
>>>>>> According to the logs printed in the terminal, the job runs and
>>>>>> finishes successfully:
>>>>>>
>>>>>> *Terminal logs*:
>>>>>> ...
>>>>>> 15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED)
>>>>>> 15/07/23 16:19:21 INFO yarn.Client:
>>>>>>      client token: N/A
>>>>>>      diagnostics: Shutdown hook called before final status was reported.
>>>>>>      ApplicationMaster host: 10.0.53.59
>>>>>>      ApplicationMaster RPC port: 0
>>>>>>      queue: default
>>>>>>      start time: 1437693551439
>>>>>>      final status: *SUCCEEDED*
>>>>>>      tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
>>>>>>      user: edadashov
>>>>>> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
>>>>>> 15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
>>>>>>
>>>>>> But the log files generated for this application in HDFS indicate the
>>>>>> failure of the job, with the correct reason:
>>>>>>
>>>>>> *Application log files*:
>>>>>> ...
>>>>>> \00 stdout\00 179Traceback (most recent call last):
>>>>>>   File "wordcount.py", line 32, in <module>
>>>>>>     lines = sc.textFile(nonExistentVariable,1)
>>>>>> *NameError: name 'nonExistentVariable' is not defined*
>>>>>>
>>>>>> Why does the terminal output (final status: *SUCCEEDED*) not match the
>>>>>> application log results, which show the job failure (NameError: name
>>>>>> 'nonExistentVariable' is not defined)?
>>>>>>
>>>>>> Is this a bug? Is there a Jira ticket related to this issue? (Is
>>>>>> someone assigned to it?)
>>>>>>
>>>>>> If I run this wordcount.py example (with the broken line) in *local
>>>>>> mode*, then the terminal logs also state that the job failed:
>>>>>>
>>>>>> *./bin/spark-submit wordcount.py /README.md*
>>>>>>
>>>>>> *Terminal logs*:
>>>>>>
>>>>>> ...
>>>>>> 15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events to hdfs:///app-logs/local-1437694314943
>>>>>> Traceback (most recent call last):
>>>>>>   File "/home/edadashov/tools/myspark/spark/wordcount.py", line 32, in <module>
>>>>>>     lines = sc.textFile(nonExistentVariable,1)
>>>>>> NameError: name 'nonExistentVariable' is not defined
>>>>>> 15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from shutdown hook
>>>>>>
>>>>>> Thanks.

--
Best regards,
Elkhan Dadashov