[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832245#comment-15832245
 ] 

Xuefu Zhang commented on HIVE-15671:
------------------------------------

[~vanzin], thanks for your insight. I think we are getting somewhere. 
I'm going to change #2 to use {{getConnectTimeoutMs()}} and try it out. Naming 
is one thing, but yes, the server-side timeout should be bigger. When I tested 
with my patch, I actually made *client.connect.timeout* much bigger than 
*server.connect.timeout* and that's why I didn't have the problem that [~lirui] 
got. 

{quote}That's kinda hard to solve, because the server doesn't know which client 
connected until...{quote}
My original problem (with no patch whatsoever) was about a busy cluster where it 
took a long time (up to 10m) to get a container to run the driver. To overcome 
that, I increased *server.connect.timeout* to 10m which worked. With that, 
however, I got a different problem when the driver suddenly dies (due to OOM, 
for instance), at which point the driver had already connected back to Hive and 
the job was running. In such a case, Hive wouldn't detect the driver was gone 
until 10m later. My patch here was to solve this problem.

With the new understanding, I'd like to make sure that both problems are 
solved: 1. users should be able to increase *server.connect.timeout* to handle 
a longer driver startup. 2. Hive should be able to immediately detect the 
death of the driver (after the connection has been made).
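
To make the two goals concrete, here is a minimal sketch of the separation I have in mind. It is not the actual {{RpcServer}} code: {{DriverLiveness}}, {{PendingDriver}} and {{expectDriver}} are hypothetical names, and the "connection closed" event is modeled as a plain future rather than a real channel listener. The idea is that the long timeout only covers the wait for the driver to connect back, and is cancelled once it does, while death after that point is reported by a separate close event with no timeout involved.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from Hive's RpcServer.
final class DriverLiveness {

  /** State tracked for one expected driver. */
  static final class PendingDriver {
    /** Completes when the driver connects back, or fails on startup timeout. */
    final CompletableFuture<String> connected = new CompletableFuture<>();
    /** Completes as soon as the established connection is closed. */
    final CompletableFuture<Void> closed = new CompletableFuture<>();
  }

  // Daemon timer thread so the sketch does not keep the JVM alive.
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "connect-timeout");
        t.setDaemon(true);
        return t;
      });

  /** Goal 1: the (possibly long) timeout applies only until the driver connects. */
  PendingDriver expectDriver(String clientId, long connectTimeoutMs) {
    PendingDriver driver = new PendingDriver();
    ScheduledFuture<?> startupTimeout = timer.schedule(() -> {
      driver.connected.completeExceptionally(
          new TimeoutException("Driver " + clientId + " did not connect in time"));
    }, connectTimeoutMs, TimeUnit.MILLISECONDS);
    // Once the driver has connected (or the wait has failed), drop the timer.
    driver.connected.whenComplete((id, error) -> startupTimeout.cancel(false));
    return driver;
  }

  public static void main(String[] args) {
    DriverLiveness server = new DriverLiveness();
    PendingDriver driver =
        server.expectDriver("client-1", TimeUnit.MINUTES.toMillis(10));

    // Goal 2: react to the close event itself, not to a timeout.
    driver.closed.thenRun(() -> System.out.println("driver gone, failing the job now"));

    driver.connected.complete("client-1"); // driver connected back
    driver.closed.complete(null);          // driver died later, e.g. due to OOM
  }
}
```

In a real server the {{closed}} future would presumably be completed from the connection's close callback, so a 10m startup timeout would no longer delay detection of a dead driver.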

Any additional thoughts?

> RPCServer.registerClient() erroneously uses server/client handshake timeout 
> for connection timeout
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.patch
>
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher,
>         config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of 
> *hive.spark.client.server.connect.timeout*, which is meant to be the timeout 
> for the handshake between the Hive client and the remote Spark driver. 
> Instead, the timeout should be *hive.spark.client.connect.timeout*, which is 
> the timeout for the remote Spark driver to connect back to the Hive client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
