[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta updated ZOOKEEPER-2099:
-------------------------------------
    Attachment: ZOOKEEPER-2099.patch

Attached is an initial attempt at fixing this.

I was investigating an issue with massive inconsistency between ZK servers and 
developed my own test to reproduce the issue. I originally thought our issue 
was isolated to ephemeral znodes so I didn't find this issue in my searches, 
but the test I wrote ended up being almost exactly the same as the one provided 
by [~svoutil]. I did find another detail in my investigation that seems 
important - this bug not only causes the leader to not send the correct updates 
to the follower, but the leader can also incorrectly tell the follower to 
truncate its log, resulting in the loss of even more transactions on the 
follower. This requires a slightly different sequence of events (a few more 
steps). I found it easier to use my original test as a base and make some 
improvements using the test submitted here, so that's why the test I'm 
submitting looks so different.

h4. Test

Here's a description of what the test does:

# Enable forceSnapSync
# Set a very high snapshotSizeFactor to guarantee a DIFF if forceSnapSync is off
# Start 5 servers
# Create a baseline znode /w (not necessary, but shows where the data loss 
starts)
# Shutdown SID 4
# Create /x while SID 4 is down
# Shutdown SID 0
# Create /y while SIDs 0 and 4 are down
# Start SID 4 (which receives a SNAP from the current leader because of 
forceSnapSync=true)
# Create /z while SID 0 is down
# Disable forceSnapSync
# Shutdown current leader - SID 4 becomes leader
# Start SID 0 (which receives a TRUNC from SID 4 without the fix and a SNAP 
with the fix)
# Check for the presence of all znodes on all servers (without the fix, SID 0 
is missing /x and /y)

More detail on what goes wrong in step 13:

(Using W = the zxid of the transaction which creates /w, X for /x, etc.)

At this point, SID 4 has W and Z in its log and it has a snapshot containing 
the updates from W, X, and Y. It tries to sync with SID 0 (whose last zxid is 
Y), and iterates through its log until it finds a zxid > Y. It then looks back 
at the previous log entry (W), sees that W < Y, and tells SID 0 to truncate its 
log to W. After this, it starts sending updates at Z. SID 0 therefore deletes X 
and misses Y. The only correct thing for SID 4 to do here is to send a snapshot.

h4. Fix

The approach of writing a file with the last SNAP received to the data dir and 
checking that value when trying to sync with a follower seems best. The patch 
adds code to ZKDatabase to handle this file (called lastSnapReceived). 
LearnerHandler checks this lastSnapReceived value, and if it falls in the range 
of transactions a follower needs in syncFollower, a snapshot is sent.

We desperately need this fix because of the massive issues the bug is causing, 
so I will be doing as much testing as I can around it before fixing our 
internal version of ZK. It would be great to also get it polished to a state 
where it could be included in a future 3.5.x version.

Some big points to discuss:
* What should ZKDatabase/Learner do if it can't create or write to the file? It 
currently doesn't handle any exceptions which will result in the Learner 
stopping. This ensures correctness, but introduces another way for a Learner to 
fail.
* What should ZKDatabase/LearnerHandler do if it can't read the file? 
LearnerHandler currently catches all exceptions and falls back to sending a 
SNAP. This is always correct, but there will be performance loss in syncing new 
learners if the file becomes unreadable/corrupted somehow.
* Is there risk with upgrades or downgrades? It doesn't seem like there should 
be. Versions without the fix will just ignore the file if it's present in their 
data dir. Upgrading from a version without the fix to a version with the fix 
will result in the file being written when initializing the ZKDatabase.

Smaller points I couldn't decide on:
* Is it acceptable to enforce snapLog being non-null when constructing a 
ZKDatabase now? I had to modify some unit tests, but I liked that better than a 
test-only null check in the constructor.
* Should zookeeper.forceSnapSync and zookeeper.snapshotSizeFactor be settable 
system properties? The property names were included in the relevant classes but 
never used, and I wasn't sure if that was intended or not.
* Is IOUtils a good home for writeLongToFileAtomic since QuorumPeer and 
ZKDatabase both need that logic now?

Patch generated against master and seems to apply to branch-3.5. Not needed in 
branch-3.4 since the issue was introduced in 3.5.0


> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2099
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.6.0
>            Reporter: Santeri (Santtu) Voutilainen
>         Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner sync's with the leader, it is possible for the Leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and that of the leader's zxid thus resulting in a corruption 
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> required txnlog to send it larger than the limit.
> # Host H1 successfully sync's with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few of options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap cannot be used to provide 
> a DIFF even after restart.  A couple ways in which the state could be 
> persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that 
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence 
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
> marker. Readers of the txnlog would then check for presence of this record 
> when iterating through it and act appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to