rymarm commented on pull request #2333: URL: https://github.com/apache/drill/pull/2333#issuecomment-949656576
Hi @paul-rogers, thank you for your review! By answering the main questions: > ...So, the question is, does the isValid() method need to know the state at this exact instant? Or, is it good enough to know that the connection was OK when we last checked 30 seconds (or whatever) ago? I think the answer is pretty easy and very predictable — yes, the method needs to know the state at this exact instant, as it is required by JDBC API and is described in [java documentation](https://docs.oracle.com/javase/8/docs/api/java/sql/Connection.html#isValid-int-). As we implement some interface we should comply with it, so as users may simply use Drill JDBC driver without deep knowledge of its implementation and the method result will be predictable. ------------------------ And let me comment on other parts of your message. > ...So, the client always knows that the connection is still valid (or was at the time of the last heartbeat. We can check how frequently they occur; I think it is on the order of 30 seconds or so, but I may be wrong. The current implementation does not correspond to what you said. For a now, only the server always knows that the connection is still valid, because it receives a heartbeat from the client, while the client doesn't check, whether the server (drillbit) has answered on the PING with a PONG. Heartbeat mechanism was implemented in [this commit](https://github.com/apache/drill/commit/960f876a1945eee4eeb0d6a98c5c58dfe2eea1a9), and there were missed functions that on the client side checks whether the server answers for requests. There are [`IdleStateHandler`](https://github.com/apache/drill/blob/960f876a1945eee4eeb0d6a98c5c58dfe2eea1a9/exec/java-exec/src/main/java/org/apache/drill/exec/rpc/BasicClient.java#L119) that sends PING (heartbeat) request if the client doesn't send any request for a while. But this handler, [doesn't check, whether the client received a PONG answer](https://github.com/apache/drill/blob/960f876a1945eee4eeb0d6a98c5c58dfe2eea1a9/exec/java-exec/src/main/java/org/apache/drill/exec/rpc/RpcBus.java#L266) from the server. I may suppose, that author of this logic (heartbeat) believed, that netty's [`Future#isSuccess`](https://netty.io/4.1/api/io/netty/util/concurrent/Future.html#isSuccess--) will return `false` if the server [didn't receive PING message](https://github.com/apache/drill/blob/960f876a1945eee4eeb0d6a98c5c58dfe2eea1a9/exec/java-exec/src/main/java/org/apache/drill/exec/rpc/BasicClient.java#L123), but it doesn't: `Future#isSuccess` mean only successful flush to a socket and it d oesn't mean that any data was delivered ([source](https://github.com/netty/netty/issues/10343#issuecomment-641365702)). Therefore, the client doesn't have any mechanism, that checks whether the server (drillbit) still answers. And in cases when drillbit is not accessible now due to network issues, drillbit failing, etc — the client (or HikariCP for example) will think, that connection is still valid. > ...I think it is on the order of 30 seconds or so, but I may be wrong. It is 15 seconds by default and calculates as `drill.exec.rpc.user.timeout` * 0.5 ([souserce](https://github.com/apache/drill/blob/4aefcef2b665c5737471664912a26ef6ed9a6cfc/exec/rpc/src/main/java/org/apache/drill/exec/rpc/BasicClient.java#L87)). > ...a heartbeat message was backed up behind data batches and the connection ended up failing as a result. I even heard about cases, when zookeeper was losing drillbit from a cluster, because it didn't send heartbeat in time due to garbage collector freeze. And therefore in infrequent cases, queries may fail, because a foreman decides, that one of the drillbits is lost. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
