[
https://issues.apache.org/jira/browse/ZOOKEEPER-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872050#comment-17872050
]
Benoit Sigoure commented on ZOOKEEPER-4394:
-------------------------------------------
One of our dev clusters ran into this bug.
Quorum: zookeeper-0, zookeeper-1(leader), zookeeper-2
Client app holding an ephemeral znode exits:
{code:java}
# zookeeper-0 log
2024-08-07 15:22:58,929 [myid:] - INFO
[NIOWorkerThread-4:o.a.z.s.NIOServerCnxn@337] - Unable to read additional data
from client, it probably closed the socket: address = /10.128.31.11:42544,
session = 0x30019eb942d0028
{code}
shortly thereafter:
{code:java}
# zookeeper-2 log
2024-08-07 15:23:00,009 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@922]
- Peer state changed: following - synchronization
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@490]
- Configuring CommitProcessor with readBatchSize -1 commitBatchSize 1
2024-08-07 15:23:00,010 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@451]
- Configuring CommitProcessor with 8 worker threads.2024-08-07 15:23:00,010
[myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@59]
- Initialized FollowerRequestProcessor with
zookeeper.follower.skipLearnerRequestToNextProcessor as false2024-08-07
15:23:00,011 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.j.MBeanRegistry@110]
- Failed to register MBean InMemoryDataTree2024-08-07 15:23:00,011 [myid:] -
WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.LearnerZooKeeperServer@104]
- Failed to register with JMXjavax.management.InstanceAlreadyExistsException:
org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Follower,name3=InMemoryDataTree
at java.management/com.sun.jmx.mbeanserver.Repository.addMBean(Unknown
Source) at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(Unknown
Source) at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(Unknown
Source) at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(Unknown
Source) at
java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(Unknown
Source) at
java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(Unknown
Source) at
org.apache.zookeeper.jmx.MBeanRegistry.register(MBeanRegistry.java:106) at
org.apache.zookeeper.server.quorum.LearnerZooKeeperServer.registerJMX(LearnerZooKeeperServer.java:102)
at
org.apache.zookeeper.server.ZooKeeperServer.startupWithServerState(ZooKeeperServer.java:730)
at
org.apache.zookeeper.server.ZooKeeperServer.startupWithoutServing(ZooKeeperServer.java:713)
at
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:760) at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:108) at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539)2024-08-07
15:23:00,012 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Learner@721]
- Learner received UPTODATE message2024-08-07 15:23:00,013 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.QuorumPeer@917]
- Peer state changed: following - broadcast2024-08-07 15:23:00,015 [myid:] -
INFO [CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing
global session 0x10000242c8c49bc2024-08-07 15:23:00,015 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bd2024-08-07 15:23:00,016 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49be2024-08-07 15:23:00,103 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@169]
- Got zxid 0x632029b34c expected 0x12024-08-07 15:23:00,604 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49bf2024-08-07 15:23:02,202 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x10000242c8c49c02024-08-07 15:23:05,426 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.LearnerSessionTracker@116] - Committing global
session 0x200001b71c42ddb2024-08-07 15:23:05,671 [myid:] - INFO
[NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from
/127.0.0.1:481922024-08-07 15:23:05,671 [myid:] - INFO
[NIOWorkerThread-8:o.a.z.s.NIOServerCnxn@522] - Processing ruok command from
/127.0.0.1:481842024-08-07 15:23:07,087 [myid:] - WARN
[SyncThread:3:o.a.z.s.p.FileTxnLog@394] - fsync-ing the write ahead log in
SyncThread:3 took 7075ms which will adversely effect operation latency.File
size is 67108880 bytes. See the ZooKeeper troubleshooting guide2024-08-07
15:23:10,107 [myid:] - WARN
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@128]
- Exception when following the leaderjava.io.EOFException: null at
java.base/java.io.DataInputStream.readInt(Unknown Source) at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:228) at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:124) at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1539)2024-08-07
15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@142]
- Disconnected from leader (with address:
zookeeper-1-zookeeper.zookeeper.svc.cluster.local./172.29.203.237:2888). Was
connected for 17136ms. Sync state: true2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.Follower@286]
- shutdown Follower2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.ZooKeeperServer@853]
- shutting down2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.RequestThrottler@256]
- Shutting down2024-08-07 15:23:10,108 [myid:] - INFO
[RequestThrottler:o.a.z.s.RequestThrottler@217] - Draining request throttler
queue2024-08-07 15:23:10,108 [myid:] - INFO
[RequestThrottler:o.a.z.s.RequestThrottler@195] - RequestThrottler shutdown.
Dropped 0 requests2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.FollowerRequestProcessor@172]
- Shutting down2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.q.CommitProcessor@631]
- Shutting down2024-08-07 15:23:10,108 [myid:] - INFO
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.FinalRequestProcessor@684]
- shutdown of request processor complete2024-08-07 15:23:10,108 [myid:] - INFO
[FollowerRequestProcessor:3:o.a.z.s.q.FollowerRequestProcessor@128] -
FollowerRequestProcessor exited loop!2024-08-07 15:23:10,108 [myid:] - INFO
[CommitProcessor:3:o.a.z.s.q.CommitProcessor@419] - CommitProcessor exited
loop!2024-08-07 15:23:10,191 [myid:] - ERROR
[QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):o.a.z.s.p.Util@166] -
Last transaction was partial.
{code}
Resulting state: ephemeral znode held by 10.128.31.11:42544 was correctly
removed on zookeeper-0 and zookeeper-1 but lingers on zookeeper-2.
> Learner.syncWithLeader got NullPointerException
> -----------------------------------------------
>
> Key: ZOOKEEPER-4394
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4394
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.7.0
> Environment: ZooKeeper 3.7.0
> Reporter: Liu Haifeng
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> ZooKeeper follower node encountered NullPointerException during
> syncWithLeader.
> Logs indicate that the follower has received NEWLEADER packet between a
> PROPOSAL packet and it's corresponding COMMIT packet. The NEWLEADER packet
> leads to packetsNotCommitted.clear(), yet the COMMIT packet still wants to do
> packetsNotCommitted.peekFirst() to get the former PROPOSAL packet, and the
> later if-statement raised NPE.
> {code:java}
> case Leader.COMMIT:
> case Leader.COMMITANDACTIVATE:
> pif = packetsNotCommitted.peekFirst();
> if (pif.hdr.getZxid() == qp.getZxid() && qp.getType() ==
> Leader.COMMITANDACTIVATE) {
> // ...
> }{code}
> After look into the Leader side, I found:
> # LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog
> (PROPOSAL/COMMIT pairs);
> # Leader.startForwarding queues toBeApplied packets(PROPOSAL/COMMIT pairs);
> # Leader.startForwarding queues outstandingProposals packets(PROSOAL only);
> # LeanerHandler.run sends NEWLEADER message.
> Seams if the outstandingProposals is not empty at the certain moment, the
> follower could then receive PROPOSAL/NEWLEADER/COMMIT packets in order.
> The follower will retry from LOOKING again and is expected to be succeed at
> last, however, under heavy load it may be too many retries. Further more, I
> my case the follower has to sync data from leader's disk, and start over
> again after the NPE(prior sync not flushed?), which may harm the leader.
> I don't know if it is designed so or not, but consider the performance, can
> we at least avoid wasting of network/disk IO?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)