Thanks a lot for the feedback, Marcelo. I've just filed a bug: SPARK-9416 <https://issues.apache.org/jira/browse/SPARK-9416>
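For anyone following the thread: Marcelo points out below that PythonRunner.scala calls System.exit(process.waitFor()), and an application calling System.exit() bypasses the YARN ApplicationMaster's normal error reporting (hence the "Shutdown hook called before final status was reported" diagnostics). Below is a rough sketch of the kind of change that would let the AM see the failure; it is illustrative only, not an actual patch, and SparkException is just a stand-in for whatever exception type a real fix would use:

    import org.apache.spark.SparkException

    object PythonExitSketch {
      // Sketch: wait for the forked Python process and surface a non-zero exit
      // code as an exception instead of calling System.exit, so the caller
      // (for example the YARN ApplicationMaster in yarn-cluster mode) can
      // observe the failure and report a FAILED final status.
      def waitForPythonApp(process: Process): Unit = {
        val exitCode = process.waitFor()
        if (exitCode != 0) {
          throw new SparkException(s"Python application exited with code $exitCode")
        }
      }

      def main(args: Array[String]): Unit = {
        // Tiny demo: fork a Python process that exits with a non-zero code.
        val process = new ProcessBuilder("python", "-c", "import sys; sys.exit(3)")
          .inheritIO()
          .start()
        waitForPythonApp(process)
      }
    }

With a change along these lines, the NameError in wordcount.py discussed below should surface as final status: FAILED in the application report rather than SUCCEEDED.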
On Tue, Jul 28, 2015 at 12:14 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> BTW this is most probably caused by this line in PythonRunner.scala:
>
>     System.exit(process.waitFor())
>
> The YARN backend doesn't like applications calling System.exit().
>
>
> On Tue, Jul 28, 2015 at 12:00 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>
>> This might be an issue with how pyspark propagates the error back to the
>> AM. I'm pretty sure this does not happen for Scala / Java apps.
>>
>> Have you filed a bug?
>>
>> On Tue, Jul 28, 2015 at 11:17 AM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>
>>> Thanks, Corey, for your answer.
>>>
>>> Do you mean that "final status: SUCCEEDED" in the terminal logs means
>>> that the YARN RM could clean up the resources after the application
>>> finished (an application finishing does not necessarily mean it
>>> succeeded or failed)?
>>>
>>> With that logic it totally makes sense.
>>>
>>> Basically, the YARN logs do not say anything about the Spark job itself.
>>> They just say that the Spark job's resources were cleaned up and
>>> returned to YARN after the job completed.
>>>
>>> It would be great if the YARN logs could also report the outcome of the
>>> job, because the user is mostly interested in the job's final status.
>>>
>>> Detailed YARN-related logs can be found in the RM, NM, DN, and NN log files.
>>>
>>> Thanks again.
>>>
>>> On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>
>>>> Elkhan,
>>>>
>>>> What does the ResourceManager say about the final status of the job?
>>>> Spark jobs that run as YARN applications can fail but still
>>>> successfully clean up their resources and give them back to the YARN
>>>> cluster. Because of this, there is a difference between your code
>>>> throwing an exception in an executor/driver and the YARN application
>>>> failing. Generally you'll see a YARN application fail when there's a
>>>> memory problem (too much memory being allocated, or not enough,
>>>> causing executors to fail multiple times and preventing your job from
>>>> finishing).
>>>>
>>>> What I'm seeing from your post is that you had an exception in your
>>>> application which was caught by the Spark framework, which then
>>>> proceeded to clean up the job and shut itself down, which it did
>>>> successfully. When you aren't running in the YARN modes, you don't see
>>>> any YARN status telling you that the YARN application was shut down
>>>> successfully; you just see the failure(s) from your drivers/executors.
>>>>
>>>>
>>>> On Mon, Jul 27, 2015 at 2:11 PM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>>>
>>>>> Any updates on this bug?
>>>>>
>>>>> Why do the Spark log results and the job's final status not match?
>>>>> (One says the job failed, the other states that it succeeded.)
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov <elkhan8...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> While running the Spark word count Python example with an intentional
>>>>>> mistake in *YARN cluster mode*, the Spark terminal output reports the
>>>>>> final status as SUCCEEDED, but the log files show the correct result,
>>>>>> indicating that the job failed.
>>>>>>
>>>>>> Why do the terminal log output and the application log output
>>>>>> contradict each other?
>>>>>>
>>>>>> If I run the same job in *local mode*, the terminal logs and the
>>>>>> application logs match: both state that the job failed due to the
>>>>>> expected error in the Python script.
>>>>>>
>>>>>> More details: Scenario
>>>>>>
>>>>>> While running the Spark word count Python example in *YARN cluster
>>>>>> mode*, I make an intentional error in wordcount.py by changing this
>>>>>> line (I'm using Spark 1.4.1, but the problem also exists in Spark
>>>>>> 1.4.0 and 1.3.0, which I tested):
>>>>>>
>>>>>>     lines = sc.textFile(sys.argv[1], 1)
>>>>>>
>>>>>> into this line:
>>>>>>
>>>>>>     lines = sc.textFile(*nonExistentVariable*, 1)
>>>>>>
>>>>>> where the nonExistentVariable variable was never created or initialized.
>>>>>>
>>>>>> Then I run the example with this command (I put README.md into HDFS
>>>>>> before running it):
>>>>>>
>>>>>> *./bin/spark-submit --master yarn-cluster wordcount.py /README.md*
>>>>>>
>>>>>> According to the logs printed in the terminal, the job runs and
>>>>>> finishes successfully:
>>>>>>
>>>>>> *Terminal logs*:
>>>>>> ...
>>>>>> 15/07/23 16:19:17 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:18 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:19 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:20 INFO yarn.Client: Application report for application_1437612288327_0013 (state: RUNNING)
>>>>>> 15/07/23 16:19:21 INFO yarn.Client: Application report for application_1437612288327_0013 (state: FINISHED)
>>>>>> 15/07/23 16:19:21 INFO yarn.Client:
>>>>>>      client token: N/A
>>>>>>      diagnostics: Shutdown hook called before final status was reported.
>>>>>>      ApplicationMaster host: 10.0.53.59
>>>>>>      ApplicationMaster RPC port: 0
>>>>>>      queue: default
>>>>>>      start time: 1437693551439
>>>>>>      final status: *SUCCEEDED*
>>>>>>      tracking URL: http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
>>>>>>      user: edadashov
>>>>>> 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
>>>>>> 15/07/23 16:19:21 INFO util.Utils: Deleting directory /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
>>>>>>
>>>>>> But the log files generated for this application in HDFS indicate the
>>>>>> failure of the job, with the correct reason:
>>>>>>
>>>>>> *Application log files*:
>>>>>> ...
>>>>>> \00 stdout\00 179Traceback (most recent call last):
>>>>>>   File "wordcount.py", line 32, in <module>
>>>>>>     lines = sc.textFile(nonExistentVariable,1)
>>>>>> *NameError: name 'nonExistentVariable' is not defined*
>>>>>>
>>>>>> Why does the terminal output (final status: *SUCCEEDED*) not match the
>>>>>> application log results, which show the job failure (NameError: name
>>>>>> 'nonExistentVariable' is not defined)?
>>>>>>
>>>>>> Is this a bug? Is there a Jira ticket related to this issue? (Is
>>>>>> someone assigned to it?)
>>>>>>
>>>>>> If I run this wordcount.py example (with the broken line) in *local
>>>>>> mode*, then the terminal logs also state that the job failed:
>>>>>>
>>>>>> *./bin/spark-submit wordcount.py /README.md*
>>>>>>
>>>>>> *Terminal logs*:
>>>>>>
>>>>>> ...
>>>>>> 15/07/23 16:31:55 INFO scheduler.EventLoggingListener: Logging events to hdfs:///app-logs/local-1437694314943
>>>>>> Traceback (most recent call last):
>>>>>>   File "/home/edadashov/tools/myspark/spark/wordcount.py", line 32, in <module>
>>>>>>     lines = sc.textFile(nonExistentVariable,1)
>>>>>> NameError: name 'nonExistentVariable' is not defined
>>>>>> 15/07/23 16:31:55 INFO spark.SparkContext: Invoking stop() from shutdown hook
>>>>>>
>>>>>> Thanks.

--
Best regards,
Elkhan Dadashov