[
https://issues.apache.org/jira/browse/ZOOKEEPER-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829769#comment-17829769
]
LINFUSHOU commented on ZOOKEEPER-4444:
--------------------------------------
This issue happened in my environment as well, and afterwards the follower's
data was inconsistent with the leader's: a znode had been deleted, but could
still be read from one follower.
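To confirm the divergence, I read the same znode from each server individually
with the plain Java client; a minimal sketch (the connect strings and the znode
path below are placeholders, not my actual setup):
{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Reads the same path from every server separately; on a healthy ensemble
// all servers eventually agree that a deleted znode is gone.
public class ConsistencyCheck {
    public static void main(String[] args) throws Exception {
        String path = "/some/deleted/znode";                           // placeholder
        String[] servers = {"node1:2181", "node2:2181", "node3:2181"}; // placeholders
        for (String server : servers) {
            ZooKeeper zk = new ZooKeeper(server, 15000, event -> { });
            try {
                Stat stat = zk.exists(path, false); // null means the znode is absent
                System.out.println(server + " -> "
                        + (stat == null ? "absent" : "PRESENT, mzxid=0x" + Long.toHexString(stat.getMzxid())));
            } finally {
                zk.close();
            }
        }
    }
}
{code}
Log from the affected follower (myid=3) while it was synchronizing: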
{noformat}
2023-09-22 06:14:59,309 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):Learner@737][] - Learner received NEWLEADER message
2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):QuorumPeer@921][] - Peer state changed: following - synchronization
2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):CommitProcessor@491][] - Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1
2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):CommitProcessor@452][] - Configuring CommitProcessor with 28 worker threads.
2023-09-22 06:14:59,310 [myid:3] - INFO [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):FollowerRequestProcessor@59][] - Initialized FollowerRequestProcessor with zookeeper.follower.skipLearnerRequestToNextProcessor as false
2023-09-22 06:14:59,311 [myid:3] - WARN [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):MBeanRegistry@110][] - Failed to register MBean InMemoryDataTree
2023-09-22 06:14:59,311 [myid:3] - WARN [QuorumPeer[myid=3](plain=[0:0:0:0:0:0:0:0]:9639)(secure=disabled):LearnerZooKeeperServer@104][] - Failed to register with JMX
org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree
    at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
    at org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106)
    at org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102)
    at org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:715)
    at org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:698)
    at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1522)
2023-09-22 06:15:00,292 [myid:3] - WARN [NIOWorkerThread-18:NIOServerCnxn@380][] - Close of session 0x0
ZooKeeperServer not running
    at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:554)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2023-09-22 06:15:00,292 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370041 is <= 30065371040 for 5
2023-09-22 06:15:00,292 [myid:3] - WARN [NIOWorkerThread-10:NIOServerCnxn@380][] - Close of session 0x0
ZooKeeperServer not running
    at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:554)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:339)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370042 is <= 30065371040 for 1
2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370043 is <= 30065371040 for 1
2023-09-22 06:15:00,293 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370044 is <= 30065371040 for 1
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370045 is <= 30065371040 for 2
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370046 is <= 30065371040 for 5
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370047 is <= 30065371040 for 1
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370048 is <= 30065371040 for 1
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370049 is <= 30065371040 for 2
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370050 is <= 30065371040 for 5
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370051 is <= 30065371040 for 2
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370052 is <= 30065371040 for 5
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370053 is <= 30065371040 for 1
2023-09-22 06:15:00,294 [myid:3] - WARN [SyncThread:3:FileTxnLog@275][] - Current zxid 30065370054 is <= 30065371040 for 1
{noformat}
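For reference, the "Current zxid X is <= Y for T" warnings are emitted by
FileTxnLog.append() when the follower is asked to log a transaction whose zxid
is not newer than the last zxid it has already appended (the trailing number is
the txn type: 1 = create, 2 = delete, 5 = setData). A simplified model of that
guard (toy code, not verbatim ZooKeeper source):
{code:java}
// Toy model of the zxid guard in FileTxnLog#append that produces the
// "Current zxid X is <= Y for T" warnings above.
public class TxnLogGuard {
    private long lastZxidSeen = -1;

    public synchronized boolean append(long zxid, int txnType) {
        if (zxid <= lastZxidSeen) {
            // Incoming txn is not newer than the last appended one; the real
            // method warns like this but still appends the txn to the log.
            System.out.printf("WARN Current zxid %d is <= %d for %d%n", zxid, lastZxidSeen, txnType);
        } else {
            lastZxidSeen = zxid;
        }
        return true;
    }
}
{code}
Note that the txn is still appended after the warning, so the follower's txn
log ends up containing overlapping zxid ranges.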
> Follower doesn't get synchronized after process restart
> -------------------------------------------------------
>
> Key: ZOOKEEPER-4444
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4444
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.6.3
> Reporter: Andreas Weber
> Priority: Major
>
> Hi folks, I've got an issue with 3.6.3.
> I can't provide a simple test, because it seems to depend on timing in a
> cluster environment, but I tried to reduce the scenario as far as possible:
> * zookeeper cluster with 5 nodes, all of them full voting members (no Observers)
> * start some parallel threads which do some writes in the cluster (e.g.
> create/delete znodes; a sketch of such writers follows these steps)
> * kill one of the zookeeper processes, let's say on node X (where node X is
> not the Leader)
> * restart zookeeper process on node X
> * wait a few seconds
> * kill zookeeper process on node X again
> * restart zookeeper process on node X again
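> A minimal sketch of the writer threads from the second step (connect string,
> paths and thread count are placeholders, not the actual setup):
> {code:java}
> import org.apache.zookeeper.CreateMode;
> import org.apache.zookeeper.ZooDefs;
> import org.apache.zookeeper.ZooKeeper;
>
> // Each thread repeatedly creates and deletes its own znode to keep a steady
> // stream of write transactions flowing through the ensemble.
> public class WriteLoad {
>     public static void main(String[] args) throws Exception {
>         String connect = "node1:2181,node2:2181,node3:2181,node4:2181,node5:2181";
>         for (int i = 0; i < 8; i++) {
>             final String path = "/load-" + i;
>             new Thread(() -> {
>                 try {
>                     ZooKeeper zk = new ZooKeeper(connect, 15000, event -> { });
>                     while (true) {
>                         zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>                         zk.delete(path, -1);
>                     }
>                 } catch (Exception e) {
>                     e.printStackTrace();
>                 }
>             }).start();
>         }
>     }
> }
> {code}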
> In some cases (every 3-4 runs) I see the following in the log of node X:
> After first restart of node X:
> {noformat}
> WARN persistence.FileTxnLog - Current zxid 4294968525 is <= 4294969524 for 15
> WARN persistence.FileTxnLog - Current zxid 4294968526 is <= 4294969524 for 15
> WARN persistence.FileTxnLog - Current zxid 4294968527 is <= 4294969524 for 15
> ... (this kind of WARN is repeated some hundred times)
> WARN quorum.SendAckRequestProcessor - Closing connection to leader, exception during packet send java.net.SocketException: Socket closed ...
> ... (this kind of WARN is repeated some hundred times)
> {noformat}
> After second restart of node X:
> {noformat}
> ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969147(next log) for type 2
> WARN server.DataTree - Message:Digests are not matching. Value is Zxid. Value:4294969147
> ERROR server.DataTree - First digest mismatch on txn: 360466402305310720,3870,4294969147,1639258399998,2, ..., expected digest is 2,1365261838770, actual digest is 1098406565142,
> ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969148(next log) for type 2
> ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969149(next log) for type 5
> ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969150(next log) for type 2
> ... (this kind of ERROR is repeated some hundred times)
> {noformat}
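> For context: ZooKeeper 3.6 ships a transaction digest feature in which the
> leader records a digest of its DataTree alongside each txn and the follower
> compares it with the digest of its own tree after applying that txn. A
> simplified model of the check behind "First digest mismatch on txn" (names
> are illustrative, not verbatim ZooKeeper code):
> {code:java}
> // Toy model: a digest mismatch at some zxid means this follower's in-memory
> // tree has diverged from the state the leader had at that zxid.
> public class DigestCheck {
>     static boolean compareDigest(long zxid, long expectedDigest, long actualTreeDigest) {
>         if (expectedDigest != actualTreeDigest) {
>             System.err.printf("First digest mismatch on txn zxid=%d, expected digest is %d, actual digest is %d%n",
>                     zxid, expectedDigest, actualTreeDigest);
>             return false;
>         }
>         return true;
>     }
> }
> {code}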
> And afterwards (in the actual application), ZooKeeper on node X seems to have
> a different view of the cluster state and doesn't get synchronized, at least
> not for a few hours.
> This leads, for example, to phantom reads of znodes that were deleted a long
> time ago.
> (The resulting behaviour looks a bit similar to what is described in
> ZOOKEEPER-3911.)
> This does not happen with ZooKeeper 3.6.2! (At least I can't reproduce it
> with that version.)