BTW this is most probably caused by this line in PythonRunner.scala:

    System.exit(process.waitFor())

The YARN backend doesn't like applications calling System.exit().

On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> This might be an issue with how pyspark propagates the error back to the
> AM. I'm pretty sure this does not happen for Scala / Java apps.
>
> Have you filed a bug?
>
> On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov <elkhan8...@gmail.com>
> wrote:
>
>> Thanks Corey for your answer,
>>
>> Do you mean that "final status : SUCCEEDED" in terminal logs means that
>> YARN RM could clean the resources after the application has finished
>> (application finishing does not necessarily mean succeeded or failed) ?
>>
>> With that logic it totally makes sense.
>>
>> Basically the YARN logs does not say anything about the Spark job itself.
>> It just says that Spark job resources have been cleaned up after the job
>> completed and returned back to Yarn.
>>
>> It would be great if Yarn logs could also say about the consequence of
>> the job, because the user is interested in more about the job final status.
>>
>> Yarn related logs can be found in RM ,NM, DN, NN log files in detail.
>>
>> Thanks again.
>>
>> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>>> Elkhan,
>>>
>>> What does the ResourceManager say about the final status of the job?
>>> Spark jobs that run as Yarn applications can fail but still successfully
>>> clean up their resources and give them back to the Yarn cluster. Because of
>>> this, there's a difference between your code throwing an exception in an
>>> executor/driver and the Yarn application failing. Generally you'll see a
>>> yarn application fail when there's a memory problem (too much memory being
>>> allocated or not enough causing executors to fail multiple times not
>>> allowing your job to finish).
>>>
>>> What I'm seeing from your post is that you had an exception in your
>>> application which was caught by the Spark framework which then proceeded to
>>> clean up the job and shut itself down- which it did successfully. When you
>>> aren't running in the Yarn modes, you aren't seeing any Yarn status that's
>>> telling you the Yarn application was successfully shut down, you are just
>>> seeing the failure(s) from your drivers/executors.
>>>
>>>
>>>
>>> On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov <elkhan8...@gmail.com>
>>> wrote:
>>>
>>>> Any updates on this bug ?
>>>>
>>>> Why Spark log results & Job final status does not match ? (one saying
>>>> that job has failed, another stating that job has succeeded)
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov <elkhan8...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> While running Spark Word count python example with intentional mistake
>>>>> in *Yarn cluster mode*, Spark terminal states final status as
>>>>> SUCCEEDED, but log files state correct results indicating that the job
>>>>> failed.
>>>>>
>>>>> Why terminal log output & application log output contradict each other
>>>>> ?
>>>>>
>>>>> If i run same job on *local mode* then terminal logs and application
>>>>> logs match, where both state that job has failed to expected error in
>>>>> python script.
>>>>>
>>>>> More details: Scenario
>>>>>
>>>>> While running Spark Word count python example on *Yarn cluster mode*,
>>>>> if I make intentional error in wordcount.py by changing this line (I'm
>>>>> using Spark 1.4.1, but this problem exists in Spark 1.4.0 and in 1.3.0
>>>>> versions - which i tested):
>>>>>
>>>>> lines = sc.textFile(sys.argv[1], 1)
>>>>>
>>>>> into this line:
>>>>>
>>>>> lines = sc.textFile(*nonExistentVariable*,1)
>>>>>
>>>>> where nonExistentVariable variable was never created and initialized.
>>>>>
>>>>> then i run that example with this command (I put README.md into HDFS
>>>>> before running this command):
>>>>>
>>>>> *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*
>>>>>
>>>>> The job runs and finishes successfully according the log printed in
>>>>> the terminal :
>>>>> *Terminal logs*:
>>>>> ...
>>>>> 15/07/23 16:19:17 INFO yarn.Client: Application report for
>>>>> application_1437612288327_0013 (state: RUNNING)
>>>>> 15/07/23 16:19:18 INFO yarn.Client: Application report for
>>>>> application_1437612288327_0013 (state: RUNNING)
>>>>> 15/07/23 16:19:19 INFO yarn.Client: Application report for
>>>>> application_1437612288327_0013 (state: RUNNING)
>>>>> 15/07/23 16:19:20 INFO yarn.Client: Application report for
>>>>> application_1437612288327_0013 (state: RUNNING)
>>>>> 15/07/23 16:19:21 INFO yarn.Client: Application report for
>>>>> application_1437612288327_0013 (state: FINISHED)
>>>>> 15/07/23 16:19:21 INFO yarn.Client:
>>>>> client token: N/A
>>>>> diagnostics: Shutdown hook called before final status was reported.
>>>>> ApplicationMaster host: 10.0.53.59
>>>>> ApplicationMaster RPC port: 0
>>>>> queue: default
>>>>> start time: 1437693551439
>>>>> final status: *SUCCEEDED*
>>>>> tracking URL:
>>>>> http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
>>>>> user: edadashov
>>>>> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
>>>>> 15/07/23 16:19:21 INFO util.Utils: Deleting directory
>>>>> /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
>>>>>
>>>>> But if look at log files generated for this application in HDFS - it
>>>>> indicates failure of the job with correct reason:
>>>>> *Application log files*:
>>>>> ...
>>>>> \00 stdout\00 179Traceback (most recent call last):
>>>>> File "wordcount.py", line 32, in <module>
>>>>> lines = sc.textFile(nonExistentVariable,1)
>>>>> *NameError: name 'nonExistentVariable' is not defined*
>>>>>
>>>>>
>>>>> Why terminal output - final status: *SUCCEEDED , *is not matching
>>>>> application log results - failure of the job (NameError: name
>>>>> 'nonExistentVariable' is not defined) ?
>>>>>
>>>>> Is this bug ? Is there Jira ticket related to this issue ? (Is someone
>>>>> assigned to this issue ?)
>>>>>
>>>>> If i run this wordcount .py example (with mistake line) in local mode,
>>>>> then terminal log states that the job has failed in terminal logs too.
>>>>>
>>>>> *./bin/spark-submit wordcount.py /README.md*
>>>>>
>>>>> *Terminal logs*:
>>>>>
>>>>> ...
>>>>> 15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events
>>>>> to hdfs:///app-logs/local-1437694314943
>>>>> Traceback (most recent call last):
>>>>> File "/home/edadashov/tools/myspark/spark/wordcount.py", line 32, in
>>>>> <module>
>>>>> lines = sc.textFile(nonExistentVariable,1)
>>>>> NameError: name 'nonExistentVariable' is not defined
>>>>> 15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from
>>>>> shutdown hook
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Best regards,
>>>> Elkhan Dadashov
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Best regards,
>> Elkhan Dadashov
>>
>
>
>
> --
> Marcelo
>

--
Marcelo
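
P.S. A quick sanity check on the Python side, in case it is useful. The sketch below is not Spark code: `run_child` and `BROKEN_SCRIPT` are made-up names, with the one-line script standing in for the broken wordcount.py. It only shows that a Python process dying on an uncaught NameError hands a nonzero exit code back to whoever launched it, so the failure is presumably lost somewhere between the launcher and the AM rather than in the script itself.

```python
import subprocess
import sys

# Hypothetical stand-in for the broken wordcount.py: it uses a name that was
# never defined, so the interpreter raises NameError and exits with an error.
BROKEN_SCRIPT = "lines = nonExistentVariable\n"


def run_child(source):
    """Run `source` in a fresh Python interpreter; return (exit_code, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # requires Python 3.7+
        text=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, err = run_child(BROKEN_SCRIPT)
    # A nonzero code is the only failure signal the parent process receives.
    print("child exit code:", code)
    # The last stderr line carries the exception type and message.
    print(err.strip().splitlines()[-1])
```

Whatever launches the script has to turn that exit code into a reported final status; if the launcher just exits with it, the status can go unreported.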