[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832582#comment-15832582
 ] 

Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 11:12 PM:
--------------------------------------------------------------

Patch #1 followed what [~vanzin] suggested. With it, I observed the following 
behavior:

1. Increasing *server.connect.timeout* makes Hive wait longer for the 
driver to connect back, which solves the busy-cluster problem (a configuration 
sketch follows the log output below).
2. Killing the driver while the job is running immediately fails the query on 
the Hive side with the following error:
{code}
2017-01-20 22:01:08,235 Stage-2_0: 7(+3)/685    Stage-3_0: 0/1  
2017-01-20 22:01:09,237 Stage-2_0: 16(+6)/685   Stage-3_0: 0/1  
Failed to monitor Job[ 1] with exception 'java.lang.IllegalStateException(RPC 
channel is closed.)'
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}

This meets my expectation.
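
As a concrete illustration of item 1 (a sketch only, not part of the patch): the 
property can be raised programmatically through {{HiveConf}}, though most users 
would simply issue a session-level {{set}} command. The class name and the 
300-second value below are placeholders, and the millisecond-suffixed value 
assumes the usual time-typed HiveConf property format.
{code}
import org.apache.hadoop.hive.conf.HiveConf;

public class RaiseServerConnectTimeout {
  public static void main(String[] args) {
    HiveConf conf = new HiveConf();
    // Give the remote Spark driver more time to come up and connect back
    // on a busy cluster. The value is purely illustrative.
    conf.set("hive.spark.client.server.connect.timeout", "300000ms");
    System.out.println(conf.get("hive.spark.client.server.connect.timeout"));
  }
}
{code}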

However, I didn't test the case where the driver dies before connecting back to 
Hive. (It's also hard to construct such a test case.) In that case, I assume that 
Hive will wait for *server.connect.timeout* before declaring a failure. I guess 
there isn't much we can do for this case, and I don't think the change here has 
any implications for it.


> RPCServer.registerClient() erroneously uses server/client handshake timeout 
> for connection timeout
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher,
>         config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of 
> *hive.spark.client.server.connect.timeout*, which is meant to be the timeout 
> for the handshake between the Hive client and the remote Spark driver. Instead, 
> the timeout should be *hive.spark.client.connect.timeout*, which is the timeout 
> for the remote Spark driver to connect back to the Hive client.
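
To make the mechanics above concrete: {{registerClient()}} essentially hands back 
a future that is failed if the expected client never connects within the supplied 
window, so the question is which configured window gets passed in. Below is a toy, 
self-contained Java sketch of that pattern; it is not Hive's actual code, the class 
and constant names are invented, and the constants merely stand in for the two 
properties discussed in this issue.
{code}
import java.util.concurrent.*;

public class RegisterClientTimeoutSketch {

  // Invented stand-ins for the two Hive properties discussed in this issue.
  static final long CONNECT_TIMEOUT_MS = 1_000;          // ~ hive.spark.client.connect.timeout
  static final long SERVER_CONNECT_TIMEOUT_MS = 90_000;  // ~ hive.spark.client.server.connect.timeout

  // Register an expected client: the returned future is failed if nothing
  // completes it within timeoutMs, which is the behavior the chosen timeout controls.
  static CompletableFuture<String> registerClient(ScheduledExecutorService timer,
                                                  String clientId, long timeoutMs) {
    CompletableFuture<String> promise = new CompletableFuture<>();
    timer.schedule(() -> {
      promise.completeExceptionally(
          new TimeoutException("Timed out waiting for client '" + clientId + "' to connect."));
    }, timeoutMs, TimeUnit.MILLISECONDS);
    return promise;
  }

  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    // Passing the wrong constant here is the mix-up this issue is about.
    CompletableFuture<String> pending = registerClient(timer, "driver-1", CONNECT_TIMEOUT_MS);
    try {
      System.out.println("Driver connected as: " + pending.get());
    } catch (ExecutionException e) {
      System.out.println("Registration failed: " + e.getCause());
    }
    timer.shutdownNow();
  }
}
{code}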


