Chao Sun created HIVE-16984:
-------------------------------
Summary: HoS: avoid waiting for RemoteSparkJobStatus::getAppID()
when remote driver died
Key: HIVE-16984
URL: https://issues.apache.org/jira/browse/HIVE-16984
Project: Hive
Issue Type: Bug
Components: Spark
Reporter: Chao Sun
Assignee: Chao Sun
In HoS (Hive on Spark), after a RemoteDriver is launched, it may fail to
initialize a Spark context, in which case the ApplicationMaster will eventually
die. There are two issues related to RemoteSparkJobStatus::getAppID() here:
1. Currently we call {{getAppID()}} before starting the monitoring job. The
former waits for {{hive.spark.client.future.timeout}}, and the latter waits for
{{hive.spark.job.monitor.timeout}}. The error message for the latter reports
{{hive.spark.job.monitor.timeout}} as the time spent waiting for job
submission. However, this is inaccurate, as it doesn't include
{{hive.spark.client.future.timeout}}.
2. If the RemoteDriver dies suddenly, we may still wait out the full timeouts
even though the job can no longer succeed. This could be avoided if we know
that the channel between the client and the remote driver has closed.
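
A minimal sketch of the idea in item 2, not the actual Hive code: if the future
that delivers the application ID is failed as soon as the channel closes, a
waiter returns immediately instead of sitting through the timeout. The names
{{AppIdWaitSketch}}, {{DriverChannel}}, {{onChannelClosed}} and
{{waitForAppId}} are hypothetical and exist only for illustration. The error
messages also report the time actually spent waiting, which touches on the
inaccurate message described in item 1.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AppIdWaitSketch {

    /** Hypothetical stand-in for the client side of the channel to the remote driver. */
    static final class DriverChannel {
        private final CompletableFuture<String> appId = new CompletableFuture<>();

        CompletableFuture<String> appIdFuture() {
            return appId;
        }

        /** Called when the remote driver reports its application ID. */
        void onAppIdReceived(String id) {
            appId.complete(id);
        }

        /** Called when the channel closes; fails the pending future right away. */
        void onChannelClosed() {
            appId.completeExceptionally(
                new IllegalStateException("channel to remote driver closed"));
        }
    }

    /** Waits for the app ID, but returns early if the channel has already closed. */
    static String waitForAppId(DriverChannel channel, long timeoutMs) {
        long start = System.nanoTime();
        try {
            return channel.appIdFuture().get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            // The future was failed explicitly, so no timeout was consumed.
            throw new IllegalStateException(
                "Remote driver unavailable after " + elapsedMs(start) + " ms: "
                    + e.getCause().getMessage(), e.getCause());
        } catch (TimeoutException e) {
            // Report the time actually waited rather than a single config value.
            throw new IllegalStateException(
                "Timed out after " + elapsedMs(start) + " ms waiting for app ID", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for app ID", e);
        }
    }

    private static long elapsedMs(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    public static void main(String[] args) {
        DriverChannel channel = new DriverChannel();
        channel.onChannelClosed(); // simulate the remote driver dying
        try {
            waitForAppId(channel, 60_000L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the 60-second timeout passed in {{main}} is never consumed,
because the closed channel has already failed the future before the wait
begins.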