[ https://issues.apache.org/jira/browse/SPARK-25869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715709#comment-16715709 ]

ASF GitHub Bot commented on SPARK-25869:
----------------------------------------

vanzin commented on a change in pull request #22876: [SPARK-25869][YARN] the original diagnostics is missing when job failed ma…
URL: https://github.com/apache/spark/pull/22876#discussion_r240412098
 
 

 ##########
 File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
 ##########
 @@ -293,6 +293,9 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends
         }

         if (!unregistered) {
+          logInfo("Waiting for " + sparkConf.get("spark.yarn.report.interval", "1000").toInt + "ms to unregister am," +
 
 Review comment:
   This should also be a config constant. Instead of sleeping, it might be better 
to join `userClassThread` or `reporterThread`, since they may exit more quickly 
than the configured wait.
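The reviewer's point can be sketched without any Spark dependencies: `Thread.join(timeout)` waits at most the configured interval but returns as soon as the thread dies, whereas `Thread.sleep` always burns the full interval. The thread name and interval below are hypothetical stand-ins for the AM's `reporterThread` and a `spark.yarn.report.interval` config constant, not the actual patch.

```scala
// Sketch of "join with a timeout instead of sleeping": the wait ends early
// once the joined thread exits, and is capped at the configured interval.
object JoinInsteadOfSleep {
  def main(args: Array[String]): Unit = {
    val reportIntervalMs = 1000L // stand-in for a spark.yarn.report.interval constant

    // Stand-in for the AM's reporterThread; it finishes well before the cap.
    val reporterThread = new Thread(new Runnable {
      def run(): Unit = Thread.sleep(100)
    })
    reporterThread.start()

    val start = System.nanoTime()
    reporterThread.join(reportIntervalMs) // returns as soon as the thread dies
    val waitedMs = (System.nanoTime() - start) / 1000000

    println(s"finished early: ${waitedMs < reportIntervalMs}")
  }
}
```

With a plain `Thread.sleep(reportIntervalMs)` the AM would always pay the full interval on shutdown; the join caps the delay at the same bound without forcing it.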

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Spark on YARN: the original diagnostics is missing when job failed 
> maxAppAttempts times
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-25869
>                 URL: https://issues.apache.org/jira/browse/SPARK-25869
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.1
>            Reporter: Yeliang Cang
>            Priority: Major
>
> When configuring Spark on YARN, I submit the job using the command below:
> {code}
> spark-submit --class org.apache.spark.examples.SparkPi \
>   --master yarn --deploy-mode cluster \
>   --driver-memory 127m --driver-cores 1 \
>   --executor-memory 2048m --executor-cores 1 --num-executors 10 \
>   --queue root.mr \
>   --conf spark.testing.reservedMemory=1048576 \
>   --conf spark.yarn.executor.memoryOverhead=50 \
>   --conf spark.yarn.driver.memoryOverhead=50 \
>   /opt/ZDH/parcels/lib/spark/examples/jars/spark-examples* 10000
> {code}
> Apparently, the driver memory is not enough, but this cannot be seen in the 
> Spark client log:
> {code}
> 2018-10-29 19:28:34,658 INFO org.apache.spark.deploy.yarn.Client: Application 
> report for application_1540536615315_0013 (state: ACCEPTED)
> 2018-10-29 19:28:35,660 INFO org.apache.spark.deploy.yarn.Client: Application 
> report for application_1540536615315_0013 (state: RUNNING)
> 2018-10-29 19:28:35,660 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: 10.43.183.143
>  ApplicationMaster RPC port: 0
>  queue: root.mr
>  start time: 1540812501560
>  final status: UNDEFINED
>  tracking URL: http://zdh141:8088/proxy/application_1540536615315_0013/
>  user: mr
> 2018-10-29 19:28:36,663 INFO org.apache.spark.deploy.yarn.Client: Application 
> report for application_1540536615315_0013 (state: FINISHED)
> 2018-10-29 19:28:36,663 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: Shutdown hook called before final status was reported.
>  ApplicationMaster host: 10.43.183.143
>  ApplicationMaster RPC port: 0
>  queue: root.mr
>  start time: 1540812501560
>  final status: FAILED
>  tracking URL: http://zdh141:8088/proxy/application_1540536615315_0013/
>  user: mr
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1540536615315_0013 finished with failed status
>  at org.apache.spark.deploy.yarn.Client.run(Client.scala:1137)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1183)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2018-10-29 19:28:36,694 INFO org.apache.spark.util.ShutdownHookManager: 
> Shutdown hook called
> 2018-10-29 19:28:36,695 INFO org.apache.spark.util.ShutdownHookManager: 
> Deleting directory /tmp/spark-96077be5-0dfa-496d-a6a0-96e83393a8d9
> {code}
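For context, the submission above gives the AM container roughly the driver memory plus the configured overhead, while the diagnostics later in this report show a peak usage of 255.0 MB. A rough arithmetic sketch of that budget (ignoring YARN's rounding of requests up to its minimum allocation, which can raise the real limit):

```scala
// Rough container-memory arithmetic for the submission in this report.
object ContainerMemoryBudget {
  def main(args: Array[String]): Unit = {
    val driverMemoryMb = 127    // --driver-memory 127m
    val overheadMb = 50         // spark.yarn.driver.memoryOverhead=50
    val requestedMb = driverMemoryMb + overheadMb
    val reportedUsageMb = 255.0 // "PmemUsageMBsMaxMBs is: 255.0 MB" in the diagnostics

    println(s"requested: ${requestedMb}MB, peak usage: ${reportedUsageMb}MB, " +
      s"over budget: ${reportedUsageMb > requestedMb}")
  }
}
```

This is only a plausibility check, not YARN's actual accounting; it simply shows why a 127m driver with 50 MB of overhead is a candidate for the container kill seen below.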
>  
>  
> Solution: after applying the patch, the Spark client log shows:
> {code}
> 2018-10-29 19:27:32,962 INFO org.apache.spark.deploy.yarn.Client: Application 
> report for application_1540536615315_0012 (state: RUNNING)
> 2018-10-29 19:27:32,962 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: 10.43.183.143
>  ApplicationMaster RPC port: 0
>  queue: root.mr
>  start time: 1540812436656
>  final status: UNDEFINED
>  tracking URL: http://zdh141:8088/proxy/application_1540536615315_0012/
>  user: mr
> 2018-10-29 19:27:33,964 INFO org.apache.spark.deploy.yarn.Client: Application 
> report for application_1540536615315_0012 (state: FAILED)
> 2018-10-29 19:27:33,964 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: Application application_1540536615315_0012 failed 2 times due 
> to AM Container for appattempt_1540536615315_0012_000002 exited with 
> exitCode: -104
> For more detailed output, check application tracking page: 
> http://zdh141:8088/cluster/app/application_1540536615315_0012 
> Then, click on links to logs of each attempt.
> Diagnostics: virtual memory used. Killing container.
> Dump of the process-tree for container_e53_1540536615315_0012_02_000001 :
>  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>  |- 1532 1528 1528 1528 (java) 1209 174 3472551936 65185 
> /usr/java/jdk/bin/java -server -Xmx127m 
> -Djava.io.tmpdir=/data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/tmp
>  -Xss32M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M 
> -Dspark.yarn.app.container.log.dir=/data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001
>  org.apache.spark.deploy.yarn.ApplicationMaster --class 
> org.apache.spark.examples.SparkPi --jar 
> file:/opt/ZDH/parcels/lib/spark/examples/jars/spark-examples_2.11-2.2.1-zdh8.5.1.jar
>  --arg 10000 --properties-file 
> /data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/__spark_conf__/__spark_conf__.properties
>  |- 1528 1526 1528 1528 (bash) 0 0 108642304 309 /bin/bash -c 
> LD_LIBRARY_PATH=/opt/ZDH/parcels/lib/hadoop/lib/native: 
> /usr/java/jdk/bin/java -server -Xmx127m 
> -Djava.io.tmpdir=/data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/tmp
>  '-Xss32M' '-XX:MetaspaceSize=128M' '-XX:MaxMetaspaceSize=512M' 
> -Dspark.yarn.app.container.log.dir=/data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001
>  org.apache.spark.deploy.yarn.ApplicationMaster --class 
> 'org.apache.spark.examples.SparkPi' --jar 
> file:/opt/ZDH/parcels/lib/spark/examples/jars/spark-examples_2.11-2.2.1-zdh8.5.1.jar
>  --arg '10000' --properties-file 
> /data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/__spark_conf__/__spark_conf__.properties
>  1> 
> /data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/stdout
>  2> 
> /data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/stderr
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> PmemUsageMBsMaxMBs is: 255.0 MB. Failing this attempt. Failing the application.
>  ApplicationMaster host: N/A
>  ApplicationMaster RPC port: -1
>  queue: root.mr
>  start time: 1540812436656
>  final status: FAILED
>  tracking URL: http://zdh141:8088/cluster/app/application_1540536615315_0012
>  user: mr
> 2018-10-29 19:27:34,542 INFO org.apache.spark.deploy.yarn.Client: Deleted 
> staging directory 
> hdfs://nameservice/user/mr/.sparkStaging/application_1540536615315_0012
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1540536615315_0012 finished with failed status
>  at org.apache.spark.deploy.yarn.Client.run(Client.scala:1137)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1183)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2018-10-29 19:27:34,548 INFO org.apache.spark.util.ShutdownHookManager: 
> Shutdown hook called
> 2018-10-29 19:27:34,549 INFO org.apache.spark.util.ShutdownHookManager: 
> Deleting directory /tmp/spark-ce35f2ad-ec1f-4173-9441-163e2482ed61
> {code}
> Now we can see the true reason for the job failure from the client!
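The mechanism behind the fix, as discussed in the review above, is that the AM delays its unregister step so YARN can record and propagate the attempt's real diagnostics instead of "Shutdown hook called before final status was reported." A minimal, Spark-free sketch of that pattern, using a plain map as a hypothetical stand-in for SparkConf and a flag for the unregister call:

```scala
// Sketch of "wait before unregistering the AM", with the interval read from
// configuration rather than hard-coded. All names here are illustrative.
object DelayedUnregisterSketch {
  // Stand-in for SparkConf; the key mirrors the one in the patch.
  val conf = Map("spark.yarn.report.interval" -> "1000")

  def waitIntervalMs: Int =
    conf.getOrElse("spark.yarn.report.interval", "1000").toInt

  def main(args: Array[String]): Unit = {
    var unregistered = false
    if (!unregistered) {
      println(s"Waiting ${waitIntervalMs}ms before unregistering the AM " +
        "so YARN keeps this attempt's diagnostics")
      Thread.sleep(waitIntervalMs)
      unregistered = true // stand-in for the actual unregister call
    }
    println(s"unregistered: $unregistered")
  }
}
```

Per the review comment, Spark would express the interval as a typed config constant and bound the wait by joining the reporter thread rather than sleeping unconditionally; the sketch only shows the config-driven delay itself.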



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
