[
https://issues.apache.org/jira/browse/ZOOKEEPER-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010681#comment-13010681
]
Flavio Junqueira commented on ZOOKEEPER-1026:
---------------------------------------------
To follow up on my own comments, here is how I'm currently thinking about it.
For the issue described in this jira to happen, all of the following need to hold:
# We have a broken leader that does not have the highest last zxid in a quorum;
# There is a quorum containing the broken leader such that, for every server in
the quorum, the epoch of its last zxid is the same as or smaller than the epoch
of the broken leader's last zxid. If this condition doesn't hold, then at least
one server will not follow the broken leader;
# The broken leader must have received at least one notification during leader
election that reflects a stale state of the system.
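To make condition 2 concrete, here is a minimal sketch of the vote comparison at play, assuming the standard zxid layout (high 32 bits hold the epoch, low 32 bits the counter). The names below are mine, not the actual FastLeaderElection code:
{code:java}
// Sketch only, not the ZooKeeper source. Assumes the standard zxid
// layout: high 32 bits = epoch, low 32 bits = per-epoch counter.
final class ZxidSketch {
    static long epochOf(long zxid) { return zxid >>> 32; }

    // Election compares (zxid, sid) lexicographically: a server only
    // abandons its own vote for a candidate that wins this order.
    static boolean supersedes(long candZxid, long candSid,
                              long myZxid, long mySid) {
        return candZxid > myZxid
            || (candZxid == myZxid && candSid > mySid);
    }

    // Condition 2 restated: a server whose last epoch is strictly
    // higher than the broken leader's can never lose this comparison
    // to it, so it refuses to follow; the bug therefore needs a whole
    // quorum at the same or a smaller epoch.
    static boolean wouldRefuse(long myLastZxid, long leaderLastZxid) {
        return epochOf(leaderLastZxid) < epochOf(myLastZxid);
    }
}
{code}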
Clearly the problem Vishal has reported matches this description, since
repeated notifications cause a server to receive notifications that might
reflect a stale state of the system. However, I'm not entirely convinced that
we can completely get rid of this problem with a patch for ZOOKEEPER-975,
because of scenarios like the following, which at least abstractly seems
possible:
# There are three servers: S1, S2, S3;
# The servers all come from epoch e and are trying to elect a new leader for epoch e+1;
# Each server sends one notification to each of the other servers, proposing
itself as leader;
# S2 and S3 receive a notification from S1; suppose that S1's vote supersedes
their own votes, so S2 and S3 eventually decide to follow S1;
# S1 receives notifications from S2 and S3 and decides to lead;
# S2 follows S1 for a longer time than S3, so S3 lags behind;
# S1 drops leadership (doesn't matter why);
# S3 starts leader election and receives the notification S2 sent in step 3;
# S3 believes it is the leader and starts leading;
# S2 receives a notification from S3 claiming leadership and starts following
S3, thus truncating its log (!!!).
The last step is mainly due to the broken "if" statement I mentioned before.
Now, in this example, there was no repetition of notification messages
(ZOOKEEPER-975), and yet we've reached a problem. The question is whether this
run can happen or not. Can you see any step that can't happen? To me, the one
step that is not clear is Step 8, but I can't convince myself that it can't happen.
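To make the hazard concrete, here is a hypothetical sketch of the kind of guard whose absence I'm calling the broken "if". Every name below is a stand-in rather than the real FastLeaderElection code, and the real fix may need more than this:
{code:java}
// Hypothetical guard: only act on a LEADING claim from the current
// election round, and even then only after confirming quorum support
// for the claimed leader. 'Notification', 'ServerState', and
// 'logicalclock' are stand-ins, not the real implementation types.
enum ServerState { LOOKING, FOLLOWING, LEADING }

final class Notification {
    long leader;         // proposed leader's server id
    long zxid;           // proposed leader's last zxid
    long electionEpoch;  // election round the sender was in
    ServerState state;   // sender's state when it sent this
}

final class ElectionGuardSketch {
    long logicalclock;   // this server's current election round

    // S2 accepting S3's leadership claim in the last step is the
    // failure mode; a guard along these lines is the obvious patch,
    // though whether it suffices is the open question above.
    boolean shouldFollow(Notification n) {
        return n.state == ServerState.LEADING
            && n.electionEpoch == logicalclock;
    }
}
{code}
Whether a check along these lines closes every interleaving above is exactly what I can't yet convince myself of.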
> Sequence number assignment decreases after old node rejoins cluster
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-1026
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1026
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.3
> Reporter: Jeremy Stribling
> Attachments: ZOOKEEPER-1026.logs.tgz
>
>
> I ran into a weird case where a Zookeeper server rejoins the cluster after
> missing several operations, and then a client creates a new sequential node
> that has a number earlier than the last node it created. I don't have full
> logs, or a live system in this state, or any data directories, just some
> partial server logs and the evidence as seen by the client. Haven't tried
> reproducing it yet, just wanted to see if anyone here had any ideas. Here's
> the scenario (probably more info than necessary, but trying to be complete):
> 1) Initially (5:37:20): 3 nodes up, with ids 215, 126, and 37 (called nodes
> #1, #2, and #3 below).
> 2) Nodes periodically (and throughout this whole timeline) create sequential,
> non-ephemeral nodes under the /zkrsm parent node.
> 3) 5:46:57: Node #1 gets notified of /zkrsm/0000000000000000_record0000002116
> 4) 5:47:06: Node #1 restarts and rejoins
> 5) 5:49:26: Node #2 gets notified of /zkrsm/0000000000000000_record0000002708
> 6) 5:49:29: Node #2 restarts and rejoins
> 7) 5:52:01: Node #3 gets notified of /zkrsm/0000000000000000_record0000003291
> 8) 5:52:02: Node #3 restarts and begins the rejoining process
> 9) 5:52:08: Node #1 successfully creates
> /zkrsm/0000000000000000_record0000003348
> 10) 5:52:08: Node #2 dies after getting notified of
> /zkrsm/0000000000000000_record0000003348
> 11) 5:52:10ish: Node #3 is elected leader (the ZK server log doesn't have
> wallclock timestamps, so not exactly sure on the ordering of this step)
> 12) 5:52:15: Node #1 successfully creates
> /zkrsm/0000000000000000_record0000003292
> Note that the node created in step #12 has a lower sequence number than the
> one created in step #9, and is exactly one greater than the last node seen
> by node #3 before it restarted.
> Here is the sequence of session establishments as seen from the C client of
> node #1 after its restart (the IP address of node #1=13.0.0.11, #2=13.0.0.12,
> #3=13.0.0.13):
> 2011-03-18 05:46:59,838:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.13:2888],
> sessionId=0x252ec780a3020000, negotiated timeout=6000
> 2011-03-18 05:49:32,194:17454(0x7fc57cbda710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.13:2888],
> sessionId=0x252ec782f5100002, negotiated timeout=6000
> 2011-03-18 05:52:02,352:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.12:2888],
> sessionId=0x7e2ec782ff5f0001, negotiated timeout=6000
> 2011-03-18 05:52:08,583:17454(0x7fc57d3db710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.11:2888],
> sessionId=0x7e2ec782ff5f0001, negotiated timeout=6000
> 2011-03-18 05:52:13,834:17454(0x7fc57cbda710):ZOO_INFO@check_events@1632:
> session establishment complete on server [13.0.0.11:2888],
> sessionId=0xd72ec7856d0f0001, negotiated timeout=6000
> I will attach logs for all nodes after each of their restarts, and a partial
> log for node #3 from before its restart.
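One note on mechanism, since it explains the regression between steps 9 and 12 of the report above: the numeric suffix of a sequential znode is derived from the parent's child version (cversion) at create time, so a leader serving stale state hands out stale numbers. Below is a minimal sketch of that naming rule; it illustrates the idea and is not the server's actual code path:
{code:java}
// Sketch of how sequential znode names get their suffix: the parent's
// cversion, zero-padded to 10 digits. Illustration only.
import java.util.Locale;

final class SequentialNameSketch {
    static String sequentialName(String prefix, int parentCVersion) {
        return prefix + String.format(Locale.ENGLISH, "%010d", parentCVersion);
    }

    public static void main(String[] args) {
        // Up-to-date leader (step 9): parent cversion has reached 3348.
        System.out.println(
            sequentialName("/zkrsm/0000000000000000_record", 3348));
        // Stale leader (step 12): its copy of the parent is back at
        // 3292, so the "next" name regresses below an issued one.
        System.out.println(
            sequentialName("/zkrsm/0000000000000000_record", 3292));
    }
}
{code}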