[ https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593051#comment-15593051 ]
Martin Kuchta commented on ZOOKEEPER-2099:
------------------------------------------

Looks like the test needs to be hardened a bit. I think I see the issue - QuorumBase.waitForServerUp doesn't guarantee that the client the test is using to create the nodes is also connected. I ran the test a few dozen times on my machine and saw no failures, but that's obviously not good enough.

As for the testLE failure, that seems to be a known flaky test (ZOOKEEPER-1932).

> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2099
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.6.0
>            Reporter: Santeri (Santtu) Voutilainen
>            Assignee: Martin Kuchta
>         Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner syncs with the leader, it is possible for the leader to send
> the learner a DIFF that does NOT contain all the transactions between the
> learner's zxid and the leader's zxid, thus resulting in a corrupted
> datatree on the learner.
> For this to occur, the leader must have synced with a previous leader using a
> SNAP and the zxid requested by the learner must still exist in the current
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network
> partition, etc). The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster, resulting in the latest zxid
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the
> necessary records in the in-mem committed log, AND b) the size of the
> required txnlog to send is larger than the limit.
> # Host H1 successfully syncs with the leader (H3). At this point H1's
> txnlogs have records up to and including Z1 as well as Z2 and up. It does
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader.
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its
> FOLLOWERINFO/OBSERVERINFO packet.
> # The leader, H1, determines it can send a DIFF. It concludes this because
> although it does not have the necessary records in its in-memory commit log,
> it does have Z1 in its txnlog and the size of the log is less than the limit.
> H1 ends up with a different size calculation than H3 because H1 is missing
> all the records between Z1 and Z2, so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on
> the type of transactions that occurred between Z1 and Z2, it may not hit any
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory
> contents from the affected hosts and have them resync with the leader, at
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync
> from the txnlog by changing the database size limit to 0. This is a code
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this. A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from
> the leader.
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be
> persisted on-disk so that the txnlogs with the gap cannot be used to provide
> a DIFF even after restart.
> A couple of ways in which the state could be persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that
> the host was synced with a SNAP and thus txnlogs might be missing. Presence
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as a "sync'd-by-snap-from-leader"
> marker. Readers of the txnlog would then check for presence of this record
> when iterating through it and act appropriately.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)