[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861114#comment-15861114 ]
Xuefu Zhang edited comment on HIVE-15671 at 2/10/17 11:10 AM:
--------------------------------------------------------------

Hi [~vanzin], to backtrack a little bit, I have a follow-up question about your comment.
{quote}
That's kinda hard to solve, because the server doesn't know which client connected until two things happen: first the driver has started, second the driver completed the SASL handshake to identify itself. A lot of things can go wrong in that time. There's already some code, IIRC, that fails the session if the spark-submit job dies with an error, but aside from that, it's kinda hard to do more.
{quote}
I was talking about the server detecting a driver problem after the driver has connected back to the server. I'm wondering which timeout applies when something goes wrong on the driver side, such as a long GC pause or a stalled connection between the server and the driver. If that timeout is also server.connect.timeout, it is rather long; we increased it to 10m in our case to accommodate a busy cluster. In the absence of a heartbeat mechanism, it doesn't seem to me that such a timeout exists.


> RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-15671
>                 URL: https://issues.apache.org/jira/browse/HIVE-15671
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.1.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
> {code}
>   /**
>    * Tells the RPC server to expect a connection from a new client.
>    * ...
>    */
>   public Future<Rpc> registerClient(final String clientId, String secret,
>       RpcDispatcher serverDispatcher) {
>     return registerClient(clientId, secret, serverDispatcher,
>         config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of *hive.spark.client.server.connect.timeout*, which is meant to be the timeout for the handshake between the Hive client and the remote Spark driver. The timeout used here should instead be *hive.spark.client.connect.timeout*, which is the timeout for the remote Spark driver to connect back to the Hive client.
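
For reference, a minimal sketch of what the corrected overload could look like, assuming {{RpcConfiguration}} exposes a {{getConnectTimeoutMs()}} accessor for *hive.spark.client.connect.timeout* as the counterpart of the {{getServerConnectTimeoutMs()}} call shown above:

{code}
  /**
   * Tells the RPC server to expect a connection from a new client.
   * ...
   */
  public Future<Rpc> registerClient(final String clientId, String secret,
      RpcDispatcher serverDispatcher) {
    // Sketch: pass the client connect timeout (hive.spark.client.connect.timeout),
    // i.e. how long the remote Spark driver has to connect back to the server,
    // instead of the server/client handshake timeout used previously.
    return registerClient(clientId, secret, serverDispatcher,
        config.getConnectTimeoutMs());
  }
{code}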