[ https://issues.apache.org/jira/browse/ZOOKEEPER-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16666745#comment-16666745 ]
Fangmin Lv commented on ZOOKEEPER-3182: --------------------------------------- [~andorm] the change in the updated PR looks good to me, I've approved it. > Race condition when follower syncing with leader and starting to serve > requests > ------------------------------------------------------------------------------- > > Key: ZOOKEEPER-3182 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.6.0 > Reporter: Andor Molnar > Priority: Critical > > This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues > have been implemented in CommitProcessor to improve performance. > [~abrahamfine] 's analysis is accurate on GitHub: > https://github.com/apache/zookeeper/pull/300 > He was trying to introduce synchronization between Learner.syncWithLeader() > and CommitProcessor to wait for in-flight requests to be committed before > accepting client requests. > In the affected unit test ({{testNodeDataChanged}}) there's a race between > reconnecting client's setWatches request and updates coming from the leader > according to the following logs: > {noformat} > 2018-10-25 13:59:58,556 [myid:] - DEBUG > [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: > sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe > txntype:unknown reqpath:n/a > 2018-10-25 13:59:58,556 [myid:] - DEBUG > [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: > sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe > txntype:unknown reqpath:n/a > ... > 2018-10-25 13:59:58,557 [myid:] - DEBUG > [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: > sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2 > reqpath:n/a > ... > 2018-10-25 13:59:58,561 [myid:] - DEBUG > [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: > sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1 > reqpath:n/a > 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG > [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got > WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for > sessionid 0x10005d8fc4d0000 > {noformat} > {{setWatches}} request is processed before {{delete}} and {{create}}, hence > the client receives NodeDeleted event. > In the working scenario it looks like: > {noformat} > 2018-10-25 14:04:55,247 [myid:] - DEBUG > [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: > sessionid:0x20005dd88110000 type:delete cxid: > 0x1 zxid:0x100000004 txntype:2 reqpath:n/a > 2018-10-25 14:04:55,249 [myid:] - DEBUG > [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: > sessionid:0x20005dd88110000 type:create cxid: > 0x2 zxid:0x100000005 txntype:1 reqpath:n/a > ... > 2018-10-25 14:04:56,314 [myid:] - DEBUG > [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: > sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe > txntype:unknown reqpath:n/a > 2018-10-25 14:04:56,315 [myid:] - DEBUG > [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: > sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe > txntype:unknown reqpath:n/a > ... > 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG > [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got > notification sessionid:0x10005dd88110000 > 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG > [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got > WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for > sessionid 0x10005dd88110000 > {noformat} > {{delete}} and {{create}} requests happen way before {{setWatches}} comes in > (even before the client connection is established) and client receives > NodeDataChanged event only. > Abe's approach unfortunately raises the following concerns: > - modifies CommitProcessor's code which might affect performance and > correctness ([~shralex] raised on ZOOKEEPER-2807), > - we experienced deadlocks while testing the patch: > https://github.com/apache/zookeeper/pull/300 > As a consequence I raised this Jira to capture the experiences and to put the > unit test on Ignore list, because currently I'm not sure about whether this > is a real issue or a non-backward compatible change in 3.6 with the gain of a > huge performance improvement. > Either way I don't want this flaky test to influence contributions, so I'll > mark as Ignored on trunk until the issue is resolved. -- This message was sent by Atlassian JIRA (v7.6.3#76005)