[
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593316#comment-15593316
]
Michael Han commented on ZOOKEEPER-2099:
----------------------------------------
Ah never mind, that is the test added into the patch:)
> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0, 3.6.0
> Reporter: Santeri (Santtu) Voutilainen
> Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner syncs with the leader, the leader can send the learner a DIFF
> that does NOT contain all the transactions between the learner's zxid and the
> leader's zxid, resulting in a corrupted datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a
> SNAP and the zxid requested by the learner must still exist in the current
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network
> partition, etc). The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the
> necessary records in the in-mem committed log, AND b) the size of the
> required txnlog to send is larger than the limit.
> # Host H1 successfully syncs with the leader (H3). At this point H1's
> txnlogs have records up to and including Z1 as well as Z2 and up. It does
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF. It concludes this because
> although it does not have the necessary records in its in-memory commit log,
> it does have Z1 in its txnlog and the size of the log is less than the limit.
> H1 ends up with a different size calculation than H3 because H1 is missing
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on
> the type of transactions that occurred between Z1 and Z2 it may not hit any
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the
> changes made by the transactions between Z1 and Z2.
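
The scenario above can be illustrated with a toy model. This is only a sketch, not ZooKeeper code: the class and method names (TxnLogGapDemo, diffFromTxnLog, hasGap) are hypothetical, the txnlog is modeled as a sorted set of zxids, and it assumes zxids within the range are consecutive (a single epoch), which real logs do not guarantee. It shows why a "Z1 is present in my txnlog" check alone lets H1 build a DIFF that silently skips everything between Z1 and Z2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of the scenario; all names here are hypothetical, not ZooKeeper APIs.
class TxnLogGapDemo {

    // Mimics the flawed decision: if the peer's zxid is in the leader's txnlog,
    // send every logged zxid after it as a DIFF. Returns null to mean "send SNAP".
    static List<Long> diffFromTxnLog(TreeSet<Long> leaderLog, long peerZxid) {
        if (!leaderLog.contains(peerZxid)) {
            return null;
        }
        return new ArrayList<>(leaderLog.tailSet(peerZxid, false));
    }

    // Assumption for this sketch: zxids are consecutive within the range, so any
    // jump larger than 1 means transactions are missing from the DIFF.
    static boolean hasGap(long peerZxid, List<Long> diff) {
        long expected = peerZxid + 1;
        for (long zxid : diff) {
            if (zxid != expected) {
                return true;
            }
            expected++;
        }
        return false;
    }

    public static void main(String[] args) {
        long z1 = 100;
        long z2 = 200;

        // H1's txnlog after being synced by SNAP: records up to Z1, then Z2 and up.
        TreeSet<Long> h1Log = new TreeSet<>();
        for (long z = 90; z <= z1; z++) {
            h1Log.add(z);
        }
        for (long z = z2; z <= z2 + 5; z++) {
            h1Log.add(z);
        }

        // H2 reports Z1; H1 finds Z1 in its log and builds a DIFF.
        List<Long> diff = diffFromTxnLog(h1Log, z1);
        System.out.println("DIFF size: " + diff.size());
        System.out.println("DIFF skips transactions: " + hasGap(z1, diff));
    }
}
```

In this model the DIFF is small precisely because the records between Z1 and Z2 are absent, which matches the size-calculation point in the last steps of the scenario.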
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory
> contents from the affected hosts and have them resync with the leader at
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync
> from the txnlog by changing the database size limit to 0. This is a code
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this. A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be
> persisted on-disk so that the txnlogs with the gap cannot be used to provide
> a DIFF even after restart. A couple of ways in which the state could be
> persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader"
> marker. Readers of the txnlog would then check for presence of this record
> when iterating through it and act appropriately.
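
As a rough illustration of the marker-file option above (not the approach taken in the attached patch, which this message does not describe), the sketch below writes a loggap.<zxid> file when a host is synced by SNAP and consults those files before serving a DIFF from the txnlog. Everything here is an assumption: the class LogGapMarker, its method names, the hex encoding of the zxid in the file name, and the rule that a marker at zxid G blocks txnlog sync for any peer whose zxid is below G.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; none of these names exist in ZooKeeper.
class LogGapMarker {
    static final String PREFIX = "loggap.";

    // Called after a host has been synced by SNAP up to snapZxid: records that
    // txnlog entries before snapZxid may be missing. A real implementation would
    // also need to fsync the file and its directory so the marker survives a crash.
    static Path recordSnapSync(Path dataDir, long snapZxid) throws IOException {
        Path marker = dataDir.resolve(PREFIX + Long.toHexString(snapZxid));
        if (!Files.exists(marker)) {
            Files.createFile(marker);
        }
        return marker;
    }

    // Returns true only if no recorded gap ends after peerZxid, i.e. the txnlog
    // range the peer needs cannot cross a SNAP-induced hole.
    static boolean canSyncFromTxnLog(Path dataDir, long peerZxid) throws IOException {
        try (DirectoryStream<Path> markers = Files.newDirectoryStream(dataDir, PREFIX + "*")) {
            for (Path marker : markers) {
                String hex = marker.getFileName().toString().substring(PREFIX.length());
                long gapEnd = Long.parseUnsignedLong(hex, 16);
                if (Long.compareUnsigned(gapEnd, peerZxid) > 0) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dataDir = Files.createTempDirectory("zk-data");
        long z1 = 0x100L;
        long z2 = 0x200L;

        System.out.println("before SNAP: " + canSyncFromTxnLog(dataDir, z1));
        recordSnapSync(dataDir, z2); // H1 was synced by SNAP at Z2
        System.out.println("peer at Z1:  " + canSyncFromTxnLog(dataDir, z1));
        System.out.println("peer at Z2:  " + canSyncFromTxnLog(dataDir, z2));
    }
}
```

With a check like this, H1 in the scenario would refuse to build a DIFF for H2 (whose zxid Z1 precedes the marker) and would fall back to a SNAP, while peers already at or past Z2 could still be synced from the txnlog.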