[
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299439#comment-14299439
]
Santeri (Santtu) Voutilainen commented on ZOOKEEPER-2099:
---------------------------------------------------------
No, not yet. We've been discussing options, but haven't settled on a plan.
An extreme option is to truncate/delete all txnlog entries (and the in-mem
committedLog) for zxids lower than the start of the snap whenever syncWithLeader
results in a SNAP (i.e. this would happen on the learner). This would work
since if this learner became the leader later, then the txnlog would not
contain the zxid requested by a stale learner and so a DIFF would not even be
possible. I consider this extreme and more of a last resort because it means
deleting txnlog from disk which could impact
investigations/backup-scripts/retention-policy/etc.
Another option would be to track the latest received SNAP zxid somewhere. Then
LearnerHandler#syncFollower would compare the requested zxid with the last SNAP
zxid and, if the requested zxid is lower, force a SNAP even if the
requested zxid exists in the txnlog.
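
A minimal sketch of that comparison, written in Python as a model rather than
as ZooKeeper code. The function name choose_sync_mode, its parameters, and the
list-of-zxids representation of the txnlog are all hypothetical, and the
existing txnlog size-limit check is deliberately left out:

```python
from typing import List, Optional


def choose_sync_mode(requested_zxid: int,
                     txnlog_zxids: List[int],
                     last_snap_zxid: Optional[int]) -> str:
    """Return "DIFF" or "SNAP" for a learner asking to sync from requested_zxid.

    Hypothetical model: txnlog_zxids holds the zxids present in this host's
    txnlog, and last_snap_zxid is the zxid of the most recent SNAP this host
    received while it was a learner (None if it never received one).
    """
    # Proposed guard: anything older than the last received SNAP may sit
    # before a gap in the txnlog, so a DIFF built from it cannot be trusted.
    if last_snap_zxid is not None and requested_zxid < last_snap_zxid:
        return "SNAP"
    # Simplified stand-in for the existing check: a DIFF is only possible
    # when the requested zxid is present in the txnlog.
    if requested_zxid in txnlog_zxids:
        return "DIFF"
    return "SNAP"
```

With a txnlog holding zxids 8, 9, 10, 20, 21 and a last SNAP at 20, a request
for zxid 10 is forced to SNAP; without the recorded SNAP zxid the same request
would be answered with a DIFF.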
The storage location for the latest received SNAP needs to persist with the
txnlogs (since it would still need to be known after a host restart). This
could be done by storing it in a separate file in the same directory as the
txnlog, or it could be appended to the txnlog at the time of the SNAP.
The latter storage option has the benefit that tools like LogFormatter would
also see it (and not just the latest but all snap zxids) and be able to handle
it. In the case of LogFormatter it could indicate that there is a gap in the
txnlog at that point.
I personally prefer appending a special SNAP-OCCURRED record into the txnlog,
but have not yet gone through all the investigation to determine whether that
would be safe and/or what other changes would be needed since that record
should probably use some invalid ZXID in its record (in order to avoid
confusion with a valid record with that same zxid on another host).
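
To illustrate how a reader such as LogFormatter might treat such a record,
here is a rough Python sketch. The Record type, the SNAP_OCCURRED constant,
the INVALID_ZXID sentinel of -1, and find_gaps are all made up for this
example; none of them exist in ZooKeeper, and the real record layout would
need the investigation mentioned above:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

INVALID_ZXID = -1                 # hypothetical sentinel, never a real zxid
SNAP_OCCURRED = "SNAP-OCCURRED"   # hypothetical marker record type


@dataclass(frozen=True)
class Record:
    zxid: int                      # INVALID_ZXID for marker records
    kind: str                      # "txn" or SNAP_OCCURRED
    snap_zxid: int = INVALID_ZXID  # only meaningful for marker records


def find_gaps(records: Iterable[Record]) -> List[Tuple[Optional[int], int]]:
    """Return one (last_zxid_before_marker, snap_zxid) pair per marker."""
    gaps: List[Tuple[Optional[int], int]] = []
    last_zxid: Optional[int] = None
    for record in records:
        if record.kind == SNAP_OCCURRED:
            # Transactions between last_zxid and snap_zxid may be missing.
            gaps.append((last_zxid, record.snap_zxid))
        else:
            last_zxid = record.zxid
    return gaps
```

Because the marker carries the invalid zxid in its own zxid field, it cannot be
mistaken for a transaction with the same zxid on another host; the SNAP zxid
travels in a separate field.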
> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0, 3.6.0
> Reporter: Santeri (Santtu) Voutilainen
> Attachments: ZOOKEEPER-2099-repro.patch
>
>
> When a learner syncs with the leader, it is possible for the leader to send
> the learner a DIFF that does NOT contain all the transactions between the
> learner's zxid and the leader's zxid, thus resulting in a corrupted
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a
> SNAP and the zxid requested by the learner must still exist in the current
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network
> partition, etc). The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the
> necessary records in the in-mem committed log, AND b) the size of the
> required txnlog to send is larger than the limit.
> # Host H1 successfully syncs with the leader (H3). At this point H1's
> txnlogs have records up to and including Z1 as well as Z2 and up. It does
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF. It concludes this because
> although it does not have the necessary records in its in-memory commit log,
> it does have Z1 in its txnlog and the size of the log is less than the limit.
> H1 ends up with a different size calculation than H3 because H1 is missing
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on
> the type of transactions that occurred between Z1 and Z2 it may not hit any
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory
> contents from the affected hosts and have them resync with the leader at
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync
> from the txnlog by changing the database size limit to 0. This is a code
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this. A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be
> persisted on-disk so that the txnlogs with the gap cannot be used to provide
> a DIFF even after restart. A couple of ways in which the state could be
> persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader"
> marker. Readers of the txnlog would then check for presence of this record
> when iterating through it and act appropriately.
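
The scenario in the quoted description can be modelled in a few lines of
Python. This is only a toy model: zxids are plain integers, the values chosen
for Z1 and Z2 are arbitrary, diff_from_txnlog is a hypothetical helper, and the
txnlog size limit that also feeds the real DIFF/SNAP decision is ignored:

```python
from typing import List, Optional


def diff_from_txnlog(leader_log: List[int],
                     learner_zxid: int) -> Optional[List[int]]:
    """Return the zxids a leader would send as a DIFF, or None to force a SNAP."""
    if learner_zxid not in leader_log:
        return None
    return [zxid for zxid in leader_log if zxid > learner_zxid]


Z1, Z2 = 10, 20
full_history = list(range(1, 26))  # every zxid committed on the cluster

# H1 was synced by SNAP at Z2: its txnlog has records up to Z1 and from Z2 on.
h1_log = [zxid for zxid in full_history if zxid <= Z1 or zxid >= Z2]

# H2 is still at Z1 when it asks the new leader H1 to sync.
h2_applied = [zxid for zxid in full_history if zxid <= Z1]

diff = diff_from_txnlog(h1_log, Z1)  # H1 finds Z1 in its txnlog, so it sends a DIFF
h2_applied += diff

# Transactions H2 never applied: everything strictly between Z1 and Z2.
missing = sorted(set(full_history) - set(h2_applied))
print(missing)
```

H2 ends up at the same last zxid as the leader while silently lacking every
transaction between Z1 and Z2.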