Sahil Takiar created HIVE-20134:
-----------------------------------

             Summary: Improve logging when HoS Driver is killed due to exceeding memory limits
                 Key: HIVE-20134
                 URL: https://issues.apache.org/jira/browse/HIVE-20134
             Project: Hive
          Issue Type: Sub-task
          Components: Spark
            Reporter: Sahil Takiar
This was improved in HIVE-18093, but more can be done. If a HoS Driver is killed because it exceeds its memory limits, YARN issues a SIGTERM to the process. The SIGTERM triggers the shutdown hook in the HoS Driver, which causes the Driver to kill all jobs that are still running. The user ends up seeing an error like the one below, which isn't very informative. We should propagate the error from the Driver shutdown hook to the user.

{code:java}
INFO  : 2018-07-09 17:48:42,580 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 0(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
INFO  : 2018-07-09 17:48:44,589 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 1(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
INFO  : 2018-07-09 17:48:45,591 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 2(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
INFO  : 2018-07-09 17:48:48,596 Stage-64_0: 526/526 Finished Stage-65_0: 1405/1405 Finished Stage-66_0: 2(+759)/1102 Stage-67_0: 0/1099 Stage-68_0: 0/1099 Stage-69_0: 0/1
ERROR : Spark job[23] failed
java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998) ~[?:1.8.0_141]
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) ~[?:1.8.0_141]
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) ~[scala-library-2.11.8.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) ~[scala-library-2.11.8.jar:?]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) ~[scala-library-2.11.8.jar:?]
	at org.apache.spark.SimpleFutureAction.ready(FutureAction.scala:125) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
	at org.apache.spark.SimpleFutureAction.ready(FutureAction.scala:114) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
	at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
	at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:264) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
	at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277) ~[spark-core_2.11-2.2.0-SNAPSHOT.jar:2.2.0-SNAPSHOT]
	at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:391) ~[hive-exec-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
	at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:352) ~[hive-exec-2.1.1-SNAPSHOT.jar:2.1.1-SNAPSHOT]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. null
INFO  : Completed executing command(queryId=hive_20180709174140_0f64ee17-f793-441a-9a77-3ee0cd0a9c32); Time taken: 249.727 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. null (state=08S01,code=1)
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
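The proposed propagation could be sketched as a JVM shutdown hook that records why the driver is terminating before jobs are cancelled, so the {{InterruptedException}} handler can surface that reason instead of a bare "null". This is a minimal illustrative sketch; the class and method names are hypothetical, not the actual {{RemoteDriver}} code:

{code:java}
// Hypothetical sketch of the proposed fix; class, field, and method names
// are illustrative and not taken from the actual Hive RemoteDriver.
public class DriverShutdownSketch {

    // Set by the shutdown hook before running jobs are cancelled; read by
    // the job-failure path when it builds the user-facing error message.
    static volatile String shutdownReason;

    // Logic the shutdown hook would run when the process receives SIGTERM.
    static void onShutdown() {
        shutdownReason = "Driver received SIGTERM (e.g. killed by YARN for "
                + "exceeding memory limits); cancelling running jobs";
    }

    // Builds the error message for a failed job, propagating the shutdown
    // reason instead of the uninformative "null" from InterruptedException.
    static String describeFailure(Throwable cause) {
        if (shutdownReason != null) {
            return "Spark job failed: " + shutdownReason;
        }
        return "Spark job failed: " + cause.getMessage();
    }

    public static void main(String[] args) {
        // Register the hook; the JVM runs it on SIGTERM before exiting.
        Runtime.getRuntime().addShutdownHook(new Thread(DriverShutdownSketch::onShutdown));

        // Simulate the SIGTERM path directly for demonstration.
        onShutdown();
        System.out.println(describeFailure(new InterruptedException()));
    }
}
{code}

The key point is that the hook only records state; the existing failure-handling path then attaches that state to the error it already reports, so no new communication channel to the client is needed.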