[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593443#comment-15593443
 ] 

Michael Han commented on ZOOKEEPER-2099:
----------------------------------------

I just searched my archived build mails and see that a good number of tests 
fail from time to time with 'KeeperErrorCode = ConnectionLoss'. I think the 
test cases should be made more tolerant of such spurious failures. I agree 
that we should not retry blindly and that retries should be decided on a 
case-by-case basis. Let me dig more into what this new test and those failed 
existing tests do.
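
To make the "case by case" retry idea concrete, here is a minimal sketch of a
test helper that retries only on connection loss and only a bounded number of
times. It is an illustration, not code from the ZooKeeper test suite: the
class RetryOnConnectionLoss, its call method, and the local
ConnectionLossException stand-in (used so the sketch compiles without the
ZooKeeper jar) are all hypothetical names. It assumes the wrapped operation is
idempotent, such as a read.

```java
import java.util.concurrent.Callable;

// Hypothetical stand-in for ZooKeeper's KeeperException.ConnectionLossException,
// defined here only so the sketch compiles without the ZooKeeper jar.
class ConnectionLossException extends Exception {
    ConnectionLossException(String msg) { super(msg); }
}

class RetryOnConnectionLoss {
    // Runs op, retrying only when it throws ConnectionLossException, for at most
    // maxAttempts attempts. Any other exception propagates immediately.
    // Assumption: op is idempotent, so repeating it is safe.
    static <T> T call(Callable<T> op, int maxAttempts, long backoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ConnectionLossException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff before the next attempt.
                    Thread.sleep(backoffMs * attempt);
                }
            }
        }
        // All attempts failed with connection loss; surface the last failure.
        throw last;
    }
}
```

A test would wrap only the individual calls that are known to be safe to
repeat, which keeps the retry decision per call site rather than global.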

> Using txnlog to sync a learner can corrupt the learner's datatree
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2099
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0, 3.6.0
>            Reporter: Santeri (Santtu) Voutilainen
>            Assignee: Martin Kuchta
>         Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner syncs with the leader, it is possible for the leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and the leader's zxid, thus corrupting the data tree on the 
> learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> txnlog it would need to send is larger than the limit.
> # Host H1 successfully syncs with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader.
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet.
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap cannot be used to provide 
> a DIFF even after restart.  A couple of ways in which the state could be 
> persisted:
> ** Write a file (for example: loggap.<zxid>) in the data dir indicating that 
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence 
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
> marker. Readers of the txnlog would then check for presence of this record 
> when iterating through it and act appropriately.
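
The scenario and the "invalidate sync from txnlog after a SNAP" option above
can be sketched with a toy model. This is not ZooKeeper code: the class
TxnLogGapSketch and its methods are hypothetical, the size limit is simplified
to a record count, and the gap marker is held in memory where the proposal
would persist it on disk (for example as a loggap.<zxid> file).

```java
import java.util.TreeSet;

class TxnLogGapSketch {
    // Zxids of the records present in this host's txnlog files (toy model).
    private final TreeSet<Long> loggedZxids = new TreeSet<>();
    // Zxid of the last SNAP this host was synced with; -1 means never.
    // Assumption: a real fix would persist this marker across restarts.
    private long snapSyncZxid = -1L;

    void append(long zxid) {
        loggedZxids.add(zxid);
    }

    // Models receiving a SNAP ending at snapZxid: records before it may be missing.
    void recordSnapSync(long snapZxid) {
        snapSyncZxid = Math.max(snapSyncZxid, snapZxid);
    }

    // Naive check from the scenario: the learner's zxid is in the log and the
    // number of later records is within the limit.
    boolean naiveCanSendDiff(long learnerZxid, int limit) {
        return loggedZxids.contains(learnerZxid)
                && loggedZxids.tailSet(learnerZxid, false).size() <= limit;
    }

    // Guarded check: also refuse a DIFF that would have to span a recorded gap.
    boolean guardedCanSendDiff(long learnerZxid, int limit) {
        return naiveCanSendDiff(learnerZxid, limit) && learnerZxid >= snapSyncZxid;
    }

    public static void main(String[] args) {
        TxnLogGapSketch h1 = new TxnLogGapSketch();
        for (long z = 98; z <= 100; z++) {
            h1.append(z);          // records up to and including Z1 = 100
        }
        h1.recordSnapSync(200);    // SNAP from H3 at Z2 = 200; 101..199 never logged
        h1.append(200);
        h1.append(201);
        // H2 asks to sync from Z1 = 100 with a limit of 50 records.
        System.out.println("naive:   " + h1.naiveCanSendDiff(100, 50));   // true
        System.out.println("guarded: " + h1.guardedCanSendDiff(100, 50)); // false
    }
}
```

In this model the naive check accepts the DIFF precisely because the gap makes
the log look short, while the guarded check falls back to a SNAP for any
learner whose zxid is older than the recorded SNAP sync.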



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)