[ https://issues.apache.org/jira/browse/LIVY-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915448#comment-16915448 ]

jiewang commented on LIVY-586:
------------------------------

I'm working on it.

> When a batch fails on startup, Livy continues to report the batch as 
> "starting", even though it has failed
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LIVY-586
>                 URL: https://issues.apache.org/jira/browse/LIVY-586
>             Project: Livy
>          Issue Type: Bug
>          Components: Batch
>    Affects Versions: 0.5.0
>         Environment: AWS EMR, Livy submits batches to YARN in cluster mode
>            Reporter: Sam Brougher
>            Priority: Major
>
> When starting a Livy batch, I accidentally provided it with a jar location in S3 
> that did not exist. Livy then continued to report that the job was 
> "starting", even though it had clearly failed.
> stdout:
> {code:java}
> 2019-04-05 11:24:18,149 [main] WARN org.apache.hadoop.util.NativeCodeLoader [appName=] [jobId=] [clusterId=] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Warning: Skip remote jar s3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar.
> 2019-04-05 11:24:19,152 [main] INFO org.apache.hadoop.yarn.client.RMProxy [appName=] [jobId=] [clusterId=] - Connecting to ResourceManager at ip-10-25-30-127.dev.cainc.internal/10.25.30.127:8032
> 2019-04-05 11:24:19,453 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Requesting a new application from cluster with 6 NodeManagers
> 2019-04-05 11:24:19,532 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Verifying our application has not requested more than the maximum memory capability of the cluster (54272 MB per container)
> 2019-04-05 11:24:19,533 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Will allocate AM container, with 9011 MB memory including 819 MB overhead
> 2019-04-05 11:24:19,534 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Setting up container launch context for our AM
> 2019-04-05 11:24:19,537 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Setting up the launch environment for our AM container
> 2019-04-05 11:24:19,549 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Preparing resources for our AM container
> 2019-04-05 11:24:21,059 [main] WARN org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
> 2019-04-05 11:24:23,790 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Uploading resource file:/mnt/tmp/spark-b4e4a760-77a3-4554-a3f3-c3f82675d865/__spark_libs__3639879082942366045.zip -> hdfs://ip-10-25-30-127.dev.cainc.internal:8020/user/livy/.sparkStaging/application_1554234858331_0222/__spark_libs__3639879082942366045.zip
> 2019-04-05 11:24:26,817 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Uploading resource s3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar -> hdfs://ip-10-25-30-127.dev.cainc.internal:8020/user/livy/.sparkStaging/application_1554234858331_0222/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar
> 2019-04-05 11:24:26,940 [main] INFO org.apache.spark.deploy.yarn.Client [appName=] [jobId=] [clusterId=] - Deleted staging directory hdfs://ip-10-25-30-127.dev.cainc.internal:8020/user/livy/.sparkStaging/application_1554234858331_0222
> Exception in thread "main" java.io.FileNotFoundException: No such file or directory 's3://dev-dp-local/jars/develop-fix/ap5-app-transform-0.2-thread-pool-SNAPSHOT.jar'
>       at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:805)
>       at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:536)
>       at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:340)
>       at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:292)
>       at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:356)
>       at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:478)
>       at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:577)
>       at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:576)
>       at scala.Option.foreach(Option.scala:257)
>       at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:576)
>       at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:869)
>       at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169)
>       at org.apache.spark.deploy.yarn.Client.run(Client.scala:1152)
>       at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
>       at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>       at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
>       at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
>       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
>       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2019-04-05 11:24:26,964 [pool-1-thread-1] INFO org.apache.spark.util.ShutdownHookManager [appName=] [jobId=] [clusterId=] - Shutdown hook called
> 2019-04-05 11:24:26,965 [pool-1-thread-1] INFO org.apache.spark.util.ShutdownHookManager [appName=] [jobId=] [clusterId=] - Deleting directory /mnt/tmp/spark-aa8e8eff-ca2c-4358-a24f-19eb3863ef8f
> 2019-04-05 11:24:26,966 [pool-1-thread-1] INFO org.apache.spark.util.ShutdownHookManager [appName=] [jobId=] [clusterId=] - Deleting directory /mnt/tmp/spark-b4e4a760-77a3-4554-a3f3-c3f82675d865
> {code}
> stderr is empty.
> The YARN diagnostics eventually warn that the application tag for the batch 
> cannot be found after 900 seconds.
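
The behaviour described above is easy to exercise through Livy's REST batch API: submit a batch whose jar does not exist and then poll its state. Below is a minimal reproduction sketch; the Livy endpoint, S3 path, and main class are placeholders (not taken from the report), and it uses the Python requests package.

{code:python}
# Reproduction sketch for LIVY-586 against the Livy REST batch API.
# LIVY_URL, the S3 path, and the main class are hypothetical -- substitute
# values for your own cluster.
import time

import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint

# Submit a batch whose jar does not exist. spark-submit fails almost
# immediately with the FileNotFoundException shown in the stdout log above.
payload = {
    "file": "s3://some-bucket/does-not-exist.jar",  # deliberately missing jar
    "className": "com.example.Main",                # hypothetical main class
}
batch = requests.post(f"{LIVY_URL}/batches", json=payload).json()
batch_id = batch["id"]

# Poll the batch state via GET /batches/{id}/state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    print(f"batch {batch_id}: {state}")
    if state not in ("starting", "running"):
        break
    time.sleep(10)
{code}

On an affected server the loop keeps printing "starting" long after spark-submit has already exited with the FileNotFoundException, and the state only changes once Livy gives up looking for the YARN application (the 900-second "tag not found" diagnostic mentioned in the report).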


