[
https://issues.apache.org/jira/browse/ZOOKEEPER-4394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Liu Haifeng updated ZOOKEEPER-4394:
-----------------------------------
Description:
A ZooKeeper follower node hit a NullPointerException during syncWithLeader.
The logs indicate that the follower received a NEWLEADER packet between a
PROPOSAL packet and its corresponding COMMIT packet. The NEWLEADER packet
triggers packetsNotCommitted.clear(), yet the COMMIT handler still calls
packetsNotCommitted.peekFirst() to fetch the earlier PROPOSAL packet. That call
returns null on the now-empty queue, and the if-statement that follows throws
the NPE.
{code:java}
case Leader.COMMIT:
case Leader.COMMITANDACTIVATE:
    // pif is null here if NEWLEADER has already cleared packetsNotCommitted
    pif = packetsNotCommitted.peekFirst();
    if (pif.hdr.getZxid() == qp.getZxid() && qp.getType() == Leader.COMMITANDACTIVATE) {
        // ...
    }
{code}
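To make the failure mode concrete, here is a minimal sketch of the follower-side ordering problem. It is a simplified model, not ZooKeeper code: the class SyncOrderingSketch, its Packet and Kind types, and the method firstOrphanCommit are all hypothetical names, and a plain ArrayDeque of zxids stands in for packetsNotCommitted. It only shows that a COMMIT arriving after the queue was cleared finds nothing to peek at.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Simplified model of the follower's sync loop; none of these types are ZooKeeper classes.
class SyncOrderingSketch {
    enum Kind { PROPOSAL, COMMIT, NEWLEADER }

    static final class Packet {
        final Kind kind;
        final long zxid;
        Packet(Kind kind, long zxid) { this.kind = kind; this.zxid = zxid; }
    }

    // Replays packets in arrival order. Returns the zxid of the first COMMIT that
    // finds no pending proposal (the NPE case in the report), or -1 if none does.
    static long firstOrphanCommit(List<Packet> packets) {
        Deque<Long> notCommitted = new ArrayDeque<>(); // stands in for packetsNotCommitted
        for (Packet p : packets) {
            switch (p.kind) {
                case PROPOSAL:
                    notCommitted.addLast(p.zxid);
                    break;
                case NEWLEADER:
                    notCommitted.clear(); // behavior described in the report
                    break;
                case COMMIT:
                    Long head = notCommitted.peekFirst(); // null when the deque is empty
                    if (head == null) {
                        return p.zxid; // the reported code would dereference null here
                    }
                    if (head.longValue() == p.zxid) {
                        notCommitted.removeFirst();
                    }
                    break;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        List<Packet> reported = Arrays.asList(
            new Packet(Kind.PROPOSAL, 5L),
            new Packet(Kind.NEWLEADER, 0L),
            new Packet(Kind.COMMIT, 5L));
        System.out.println(firstOrphanCommit(reported)); // prints 5
    }
}
```

A null check like the one in the COMMIT branch would avoid the exception itself, but whether skipping, buffering, or rejecting such a COMMIT is the right behavior is a protocol question this sketch does not answer.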
After looking into the Leader side, I found:
# LearnerHandler.syncFollower queues packets with zxid <= maxCommittedLog
(PROPOSAL/COMMIT pairs);
# Leader.startForwarding queues toBeApplied packets (PROPOSAL/COMMIT pairs);
# Leader.startForwarding queues outstandingProposals packets (PROPOSAL only);
# LearnerHandler.run sends the NEWLEADER message.
It seems that if outstandingProposals is not empty at that moment, the
follower can receive PROPOSAL, NEWLEADER, and COMMIT packets in that order.
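The leader-side sequence can be modeled roughly as follows. This is only a sketch of the four steps as I understand them, not the actual Leader or LearnerHandler code: the class LeaderQueueOrderSketch and the method packetsSeenByFollower are hypothetical, packets are reduced to text labels, and it assumes the COMMIT for each outstanding proposal is sent only later, after the NEWLEADER message is already queued.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the order in which the leader queues packets for one follower.
class LeaderQueueOrderSketch {
    static List<String> packetsSeenByFollower(long[] committedLog,
                                              long[] toBeApplied,
                                              long[] outstanding) {
        List<String> out = new ArrayList<>();
        // Step 1: committed log entries go out as PROPOSAL/COMMIT pairs.
        for (long zxid : committedLog) {
            out.add("PROPOSAL " + zxid);
            out.add("COMMIT " + zxid);
        }
        // Step 2: toBeApplied entries also go out as PROPOSAL/COMMIT pairs.
        for (long zxid : toBeApplied) {
            out.add("PROPOSAL " + zxid);
            out.add("COMMIT " + zxid);
        }
        // Step 3: outstanding proposals are queued without a COMMIT.
        for (long zxid : outstanding) {
            out.add("PROPOSAL " + zxid);
        }
        // Step 4: NEWLEADER follows everything queued so far.
        out.add("NEWLEADER");
        // Assumption: the matching COMMITs are sent later, after NEWLEADER.
        for (long zxid : outstanding) {
            out.add("COMMIT " + zxid);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(packetsSeenByFollower(
            new long[] {1L}, new long[] {2L}, new long[] {3L}));
        // prints [PROPOSAL 1, COMMIT 1, PROPOSAL 2, COMMIT 2, PROPOSAL 3, NEWLEADER, COMMIT 3]
    }
}
```

With an empty outstanding array the NEWLEADER label comes last and every PROPOSAL is immediately followed by its COMMIT, which matches the case where no NPE is seen.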
The follower will retry from LOOKING and is expected to succeed eventually;
however, under heavy load this may take too many retries. Furthermore, in my
case the follower has to sync data from the leader's disk and start over again
after the NPE (was the prior sync not flushed?), which may harm the leader.
I don't know whether this is by design, but for performance reasons, can we
at least avoid wasting network/disk I/O?
> Learner.syncWithLeader got NullPointerException
> -----------------------------------------------
>
> Key: ZOOKEEPER-4394
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4394
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.7.0
> Environment: ZooKeeper 3.7.0
> Reporter: Liu Haifeng
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)