[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mutu updated ZOOKEEPER-4817:
----------------------------
    Description: 
Recently, we encountered a confusing issue: the client disconnection warning 
sometimes appears in the system log and sometimes does not. The cluster 
consists of three nodes. A client sends many creation requests and then reads 
the node created by the first request. The client's read operation fails 
because the node is missing. When we check the system log, sometimes there is 
a client disconnection warning and sometimes there is not. This incomplete 
system log misleads the client's diagnosis of the problem.

After investigating, we found that when NIOServerCnxn.doIO is stuck at any I/O 
point in the function for more than 20s, the client disconnection warning does 
not appear. If it is stuck for less than 20s, the client disconnection warning 
appears in the system log.

We found that the root cause is that selectorThread is set as a daemon thread. 
When one node has a fail-slow NIC, the client disconnects from that node. If 
NIOServerCnxn.doIO is stuck for more than 20s, the corresponding 
selectorThread is killed by the JVM. Hence, the client disconnection warning 
is lost.
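
The daemon-thread behavior described above can be illustrated with a minimal, 
self-contained sketch. It is not ZooKeeper code: the class 
DaemonSelectorSketch, the method startSelector, the in-memory LOG list, and 
the latch standing in for a blocked doIO call are all hypothetical names 
chosen for illustration. It only shows the general JVM rule that a daemon 
thread does not keep the process alive, so a log statement placed after a 
blocked call may never run. It does not reproduce the 20s threshold.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class DaemonSelectorSketch {
    // Hypothetical stand-in for the system log.
    static final List<String> LOG = new CopyOnWriteArrayList<>();

    // Starts a thread that blocks (stand-in for a stuck doIO) and logs the
    // disconnection warning only after the blocking call returns.
    static Thread startSelector(CountDownLatch ioDone, boolean daemon) {
        Thread t = new Thread(() -> {
            try {
                ioDone.await(); // assumed stand-in for I/O stuck on a slow NIC
                LOG.add("WARN client disconnected");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "selector-sketch");
        t.setDaemon(daemon); // must be set before start()
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch ioDone = new CountDownLatch(1); // never released here
        Thread t = startSelector(ioDone, true);
        Thread.sleep(200);
        System.out.println("daemon=" + t.isDaemon() + " logged=" + LOG.size());
        // main returns while the daemon thread is still blocked; the JVM
        // exits and the warning is never added to LOG.
    }
}
```

If startSelector is called with daemon set to false instead, the JVM stays 
alive until the latch is released, and the warning is then recorded.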

Are there any suggestions for resolving this issue and improving the 
diagnosability of ZooKeeper? I would very much appreciate them.

  was:
Recently, we encountered a confusing issue: the client disconnection warning 
sometimes appears in the system log and sometimes does not. The cluster 
consists of three nodes. A client sends many creation requests and then reads 
the node created by the first request. The client's read operation fails 
because the node is missing. When we check the system log, sometimes there is 
a client disconnection warning and sometimes there is not. After 
investigating, we found that when NIOServerCnxn.doIO is stuck at any I/O point 
in the function for more than 20s, the client disconnection warning does not 
appear. If it is stuck for less than 20s, the client disconnection warning 
appears in the system log.

We find that the root cause is 

 

When doIO encounters a slowdown caused by the fail-slow NIC, the situation is 
the same.

Are there any suggestions for resolving this issue? I would very much 
appreciate them.


> CancelledKeyException does not work in some cases.
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-4817
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4817
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Major
>         Attachments: node1-25.log, node1-60.log, node2-25.log, node2-60.log, 
> node3-25.log, node3-60.log
>
>
> Recently, we encountered a confusing issue: the client disconnection warning 
> sometimes appears in the system log and sometimes does not. The cluster 
> consists of three nodes. A client sends many creation requests and then 
> reads the node created by the first request. The client's read operation 
> fails because the node is missing. When we check the system log, sometimes 
> there is a client disconnection warning and sometimes there is not. This 
> incomplete system log misleads the client's diagnosis of the problem.
> After investigating, we found that when NIOServerCnxn.doIO is stuck at any 
> I/O point in the function for more than 20s, the client disconnection 
> warning does not appear. If it is stuck for less than 20s, the client 
> disconnection warning appears in the system log.
> We found that the root cause is that selectorThread is set as a daemon 
> thread. When one node has a fail-slow NIC, the client disconnects from that 
> node. If NIOServerCnxn.doIO is stuck for more than 20s, the corresponding 
> selectorThread is killed by the JVM. Hence, the client disconnection warning 
> is lost.
> Are there any suggestions for resolving this issue and improving the 
> diagnosability of ZooKeeper? I would very much appreciate them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
