[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832582#comment-15832582 ]
Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 11:12 PM:
--------------------------------------------------------------

Patch #1 followed what [~vanzin] suggested. With it, I observed the following behavior:

1. Increasing *server.connect.timeout* makes Hive wait longer for the driver to connect back, which solves the busy-cluster problem.
2. Killing the driver while the job is running immediately fails the query on the Hive side with the following error:
{code}
2017-01-20 22:01:08,235 Stage-2_0: 7(+3)/685 Stage-3_0: 0/1
2017-01-20 22:01:09,237 Stage-2_0: 16(+6)/685 Stage-3_0: 0/1
Failed to monitor Job[ 1] with exception 'java.lang.IllegalStateException(RPC channel is closed.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}
This meets my expectation. However, I didn't test the case where the driver dies before connecting back to Hive. (It's also hard to construct such a test case.) In that case, I assume Hive will wait for *server.connect.timeout* before declaring a failure. I guess there isn't much we can do for this case, and I don't think the change here has any implication on it.


> RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
> {code}
> /**
>  * Tells the RPC server to expect a connection from a new client.
>  * ...
>  */
> public Future<Rpc> registerClient(final String clientId, String secret,
>     RpcDispatcher serverDispatcher) {
>   return registerClient(clientId, secret, serverDispatcher,
>       config.getServerConnectTimeoutMs());
> }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of *hive.spark.client.server.connect.timeout*, which is meant as the timeout for the handshake between the Hive client and the remote Spark driver. Instead, the timeout should be *hive.spark.client.connect.timeout*, which governs how long the remote Spark driver has to connect back to the Hive client.
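For reference, a minimal sketch of what the description implies, not necessarily what Patch #1 (which follows [~vanzin]'s suggestion) actually does: route the no-timeout overload of {{registerClient()}} through the client connect timeout rather than the server/client handshake timeout. The accessor name {{getConnectTimeoutMs()}} is an assumption modeled on the existing {{getServerConnectTimeoutMs()}}; the real method in {{RpcConfiguration}} may be named differently.
{code}
/**
 * Tells the RPC server to expect a connection from a new client.
 * ...
 */
public Future<Rpc> registerClient(final String clientId, String secret,
    RpcDispatcher serverDispatcher) {
  // Sketch: pass the timeout backing hive.spark.client.connect.timeout
  // (how long the remote driver has to connect back) instead of the
  // handshake timeout backing hive.spark.client.server.connect.timeout.
  // getConnectTimeoutMs() is an assumed accessor name.
  return registerClient(clientId, secret, serverDispatcher,
      config.getConnectTimeoutMs());
}
{code}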