[
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606438#comment-15606438
]
Hadoop QA commented on ZOOKEEPER-2099:
--------------------------------------
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12835192/ZOOKEEPER-2099.patch
against trunk revision cef5978969bedfe066f903834a9ea4af6d508844.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 12 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac
compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3)
warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results:
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3499//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3499//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3499//console
This message is automatically generated.
> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0, 3.6.0
> Reporter: Santeri (Santtu) Voutilainen
> Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch,
> ZOOKEEPER-2099.patch
>
>
> When a learner syncs with the leader, it is possible for the leader to send
> the learner a DIFF that does NOT contain all the transactions between the
> learner's zxid and the leader's zxid, resulting in a corrupted datatree on
> the learner.
> For this to occur, the leader must have sync'd with a previous leader using a
> SNAP and the zxid requested by the learner must still exist in the current
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network
> partition, etc). The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the
> necessary records in the in-memory committed log, AND b) the size of the
> txnlog it would need to send is larger than the limit.
> # Host H1 successfully syncs with the leader (H3). At this point H1's
> txnlogs have records up to and including Z1 as well as Z2 and up. It does
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader.
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its
> FOLLOWERINFO/OBSERVERINFO packet.
> # The leader, H1, determines it can send a DIFF. It concludes this because
> although it does not have the necessary records in its in-memory commit log,
> it does have Z1 in its txnlog and the size of the log is less than the limit.
> H1 ends up with a different size calculation than H3 because H1 is missing
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on
> the type of transactions that occurred between Z1 and Z2 it may not hit any
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the
> changes made by the transactions between Z1 and Z2.
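The scenario above can be illustrated with a toy model. The sketch below is
not ZooKeeper code: it assumes a txnlog is just a sorted list of integer
zxids and that the limit is a count of transactions rather than a size in
bytes. The names txns_after, choose_sync and LIMIT, and the concrete zxid
values, are made up for illustration.

```python
def txns_after(txnlog, zxid):
    """Return the logged zxids strictly greater than zxid, in order."""
    return [z for z in txnlog if z > zxid]


def choose_sync(txnlog, learner_zxid, limit):
    """Toy leader decision: DIFF if the learner's zxid is in the log and
    the tail to send fits within the limit, otherwise SNAP."""
    tail = txns_after(txnlog, learner_zxid)
    if learner_zxid in txnlog and len(tail) <= limit:
        return "DIFF", tail
    return "SNAP", []


Z1, Z2 = 100, 200   # hypothetical zxids for the scenario
LIMIT = 50          # hypothetical cap on the number of txns sent as a DIFF

# H3 logged every transaction, so the tail after Z1 exceeds the limit.
h3_log = list(range(1, 211))
# H1 logged up to Z1, was SNAP-synced at Z2, and logged from Z2 onwards.
h1_log = list(range(1, Z1 + 1)) + list(range(Z2, 211))

h3_mode, _ = choose_sync(h3_log, Z1, LIMIT)
h1_mode, h1_diff = choose_sync(h1_log, Z1, LIMIT)

# Transactions strictly between Z1 and Z2 that H2 never receives from H1.
missing = [z for z in range(Z1 + 1, Z2) if z not in h1_diff]
print(h3_mode, h1_mode, len(missing))
```

In this model H1 picks DIFF precisely because the gap makes its tail shorter,
which is the size miscalculation described in the scenario.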
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory
> contents from the affected hosts and have them resync with the leader at
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync
> from the txnlog by changing the database size limit to 0. This is a code
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this. A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be
> persisted on-disk so that the txnlogs with the gap cannot be used to provide
> a DIFF even after restart. A couple of ways in which the state could be
> persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader"
> marker. Readers of the txnlog would then check for presence of this record
> when iterating through it and act appropriately.
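The marker-file option could look roughly like the sketch below. It is only
an illustration of the idea, not a proposed patch: the function names are
hypothetical, the zxid in the file name is assumed to be hex-encoded, and the
safety rule (refuse a DIFF whenever a recorded gap lies after the learner's
zxid) is one possible reading of "checked when reading txnlogs".

```python
import tempfile
from pathlib import Path

MARKER_PREFIX = "loggap."  # file name pattern from the proposal above


def record_snap_sync(data_dir, snap_zxid):
    """Persist a marker: the txnlog may have a gap ending at snap_zxid."""
    (Path(data_dir) / f"{MARKER_PREFIX}{snap_zxid:x}").touch()


def gap_zxids(data_dir):
    """Return the zxids of all gap markers found in data_dir, sorted."""
    return sorted(
        int(p.name[len(MARKER_PREFIX):], 16)
        for p in Path(data_dir).glob(MARKER_PREFIX + "*")
    )


def can_diff_from_txnlog(data_dir, learner_zxid):
    """A txnlog DIFF is treated as safe only if no recorded gap lies
    after the learner's zxid."""
    return all(gap <= learner_zxid for gap in gap_zxids(data_dir))


# Usage: H1 receives a SNAP at zxid 200, then H2 asks to sync from 100.
with tempfile.TemporaryDirectory() as data_dir:
    record_snap_sync(data_dir, 200)
    print(can_diff_from_txnlog(data_dir, 100))  # DIFF would cross the gap
    print(can_diff_from_txnlog(data_dir, 200))  # learner is past the gap
```

Because the marker lives on disk next to the logs, the check still applies
after a restart, which is the persistence requirement stated above.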
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)