[ https://issues.apache.org/jira/browse/ZOOKEEPER-4817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mutu updated ZOOKEEPER-4817:
----------------------------
    Attachment:     (was: system-20s.log)

> Client disconnection warning is missed in system log sometimes.
> ---------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4817
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4817
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.10.0
>            Reporter: mutu
>            Priority: Major
>         Attachments: system1_20s.log, system1_25s.log, system2_20s.log,
> system2_25s.log, system3_20s.log, system3_25s.log
>
>
> We recently encountered a confusing issue: the client disconnection warning
> is sometimes missing from the system log, although at other times it appears.
> The cluster consists of three nodes. A client sends many creation requests
> and then reads the node created by the first request. The read fails because
> the node is missing. When we check the system log, the client disconnection
> warning is sometimes present and sometimes absent. This incomplete log
> misleads the client's diagnosis of the problem.
>
> After investigating, we found that when NIOServerCnxn.doIO is stuck at any
> I/O point in the function for more than 20s, the client disconnection warning
> is not logged. If it is stuck for less than 20s, the warning appears in the
> system log.
>
> We found that the root cause is that the selectorThread is set as a daemon
> thread. When one node has a fail-slow NIC, the client disconnects from that
> node. If NIOServerCnxn.doIO is stuck for more than 20s, the corresponding
> selectorThread is killed by the JVM, so the client disconnection warning is
> never logged.
>
> Are there any suggestions for resolving this issue and improving the
> diagnosability of ZooKeeper? I would greatly appreciate them.
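
The daemon-thread behaviour described in the report can be illustrated with a minimal Java sketch. This is not ZooKeeper code: the class DaemonWarningSketch, the method startStuckWorker, and the thread name are hypothetical, and Thread.sleep merely stands in for a stuck I/O call inside NIOServerCnxn.doIO. It shows only the general JVM rule that daemon threads do not keep the JVM alive, so a message recorded after a long stall is lost if the JVM exits first. It does not attempt to reproduce the 20s threshold observed in the report.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class DaemonWarningSketch {

    // Starts a worker that stalls (a stand-in for a stuck doIO call) and
    // records the disconnection warning only after the stall ends.
    static Thread startStuckWorker(boolean daemon, long stuckMillis, List<String> log) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(stuckMillis); // simulated stuck I/O
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            log.add("WARN client disconnected");
        }, "selector-sketch");
        worker.setDaemon(daemon); // must be set before start()
        worker.start();
        return worker;
    }

    public static void main(String[] args) {
        List<String> log = new CopyOnWriteArrayList<>();
        startStuckWorker(true, 5_000, log);
        // main returns here. No non-daemon thread remains, so the JVM exits
        // while the worker is still stalled and the warning is never recorded.
        // With daemon=false the JVM would wait for the worker to finish.
        System.out.println("main exiting, log so far: " + log);
    }
}
```

Passing daemon=false to startStuckWorker in main makes the JVM wait for the worker, which is one way to check whether the daemon flag alone explains a missing message in a given setup.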