[
https://issues.apache.org/jira/browse/ZOOKEEPER-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Beom Heyn Kim updated ZOOKEEPER-2946:
-------------------------------------
Attachment: zookeeper-2946.patch
When the follower receives TRUNC from the leader, it is possible that there is
no transaction log at the given zxid (it is possible the given zxid is captured
by snapshot but not by the transaction log). It still needs to delete the
transactions that are not committed and newer than the zxid sent by the leader.
The patch tries to check if it is currently considering the file that is newer
than the zxid and try to delete it before the while loop in the function
truncate(long zxid) in the FileTxnLog.java.
> The truncate() function in FileTxnLog.java may fail to properly remove an
> uncommitted write resulting in data inconsistency
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2946
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2946
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.11, 3.4.12
> Reporter: Beom Heyn Kim
> Attachments: zookeeper-2946.patch
>
>
> The truncate() function in FileTxnLog.java may fail to properly remove the
> uncommitted write. This happens when the follower that has the uncommitted
> writes tries to resync with the leader after a few epochs have past. The
> failure results in data inconsistency in the in-memory data tree across
> nodes. Here is one procedure to reproduce the inconsistency.
> Initially:
> # Start the ensemble with three nodes: node 0, 1 and 2 (the node 2 is the
> leader)
> # Create 5 znodes with initial values as follow (key = value)
> {noformat}
> /testDivergenceResync0 = 0
> /testDivergenceResync1 = 1
> /testDivergenceResync2 = 2
> /testDivergenceResync3 = 3
> /testDivergenceResync4 = 4
> {noformat}
> To Reproduce:
> # Diverge the node 2
> a. Shutdown the node 0 and 1
> b. Async setData to the node 2 writing 1000 to the key
> ‘/testDivergenceResync0’
> c. Shutdown the node 2
> # Restart the node 0 and 1 (let them finish with resync)
> # Diverge the node 1
> a. Shutdown the node 0
> b. Async setData to the node 1 writing 1001 to the key
> ‘/testDivergenceResync1’
> c. Shutdown the node 1
> # Restart the node 0, 1 and 2 (let them finish with resync)
> # Diverge the node 2
> a. Shutdown the node 0 and 1
> b. Async setData to the node 2 writing 1002 to the key
> ‘/testDivergenceResync2’
> c. Shutdown the node 2
> # Restart the node 0 and 2 (let them finish with resync)
> # Diverge the node 2
> a. Shutdown the node 0
> b. Async setData to the node 2 writing 1003 to the key
> ‘/testDivergenceResync3’
> c. Shutdown the node 2
> # Restart the node 0 and 1 (let them finish with resync)
> # Diverge the node 1
> a. Shutdown the node 0
> b. Async setData to the node 1 writing 1004 to the key
> ‘/testDivergenceResync4’
> c. Shutdown the node 1
> # Restart the node 0 and 2 (let them finish with resync)
> # Restart the node 1 (let it finish with resync)
> Reading each key from each node directly will give us the output:
> {noformat}
> /testDivergenceResync0 on the node 0 = 0
> /testDivergenceResync0 on the node 1 = 0
> /testDivergenceResync0 on the node 2 = 0
> /testDivergenceResync1 on the node 0 = 1001
> /testDivergenceResync1 on the node 1 = 1001
> /testDivergenceResync1 on the node 2 = 1001
> /testDivergenceResync2 on the node 0 = 2
> /testDivergenceResync2 on the node 1 = 2
> /testDivergenceResync2 on the node 2 = 2
> /testDivergenceResync3 on the node 0 = 3
> /testDivergenceResync3 on the node 1 = 3
> /testDivergenceResync3 on the node 2 = 1003
> /testDivergenceResync4 on the node 0 = 1004
> /testDivergenceResync4 on the node 1 = 1004
> /testDivergenceResync4 on the node 2 = 1004
> {noformat}
> Thus, the value of key /testDivergenceResync3 is inconsistent across nodes.
> What seems to happen:
> # At the step 7, setData (at zxid 0x400000001) writing value 1003 is
> committed on the node 2.
> {panel:title=Log from the node 2:}
> ...
> 2017-11-16 03:08:14,123 [myid:2] - DEBUG [ProcessThread(sid:2
> cport:-1)::CommitProcessor@174] - Processing request::
> sessionid:0x2000117327c0000 type:setData cxid:0x4 zxid:0x400000001 txntype:5
> reqpath:n/a
> 2017-11-16 03:08:14,124 [myid:2] - DEBUG [ProcessThread(sid:2
> cport:-1)::Leader@787] - Proposing:: sessionid:0x2000117327c0000 type:setData
> cxid:0x4 zxid:0x400000001 txntype:5 reqpath:n/a
> 2017-11-16 03:08:14,124 [myid:2] - INFO [SyncThread:2:FileTxnLog@209] -
> Creating new log file: log.400000001
> 2017-11-16 03:08:14,188 [myid:2] - DEBUG [SyncThread:2:Leader@600] - Count
> for zxid: 0x400000001 is 1
> 2017-11-16 03:08:15,752 [myid:2] - INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Leader@512] - Shutting down
> 2017-11-16 03:08:15,753 [myid:2] - INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Leader@518] - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers
> synced, only synced with sids: [ 2 ]
> ...
> {panel}
> # At the step 10, the node 2 is restarted and supposed to be properly
> resync’ed with the node 0 which is the leader.
> a. The node 0 sends TRUNC msg so that the node 2 can truncate the setData
> at zxid 0x400000001..
> b. On the other hand, the node 2 tries to truncate log to get in sync
> with the leader 0x200000001. However, the node 2 failed to properly truncate
> the setData at zxid 0x400000001. So, even if resync was finished, the value
> 1003 is still remained intact on the node 2 while other nodes have value 3
> for the same key.
> c. It seems on the node 2, there is only log.100000001 and log.400000001
> but no log.200000001. This seems to cause failing to delete log.400000001
> during truncate(). It looks like we will be considering log.400000001 by the
> time returning from the init() of FileTxnLog.java so that we will never
> execute ‘itr.logFile.delete()’ for the log.400000001.
> d. Then, after returning from the truncate(), loadDatabase() will be
> invoked and log.400000001 will be read and the setData at zxid 0x400000001
> gets loaded into the in-memory data tree.
> {panel:title=Log from the node 2:}
> ...
> 2017-11-16 03:08:59,051 [myid:2] - INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Follower@65] - FOLLOWING - LEADER
> ELECTION TOOK - 215
> 2017-11-16 03:08:59,052 [myid:2] - INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:QuorumPeer$QuorumServer@184] -
> Resolved hostname: 127.0.0.1 to address: /127.0.0.1
> 2017-11-16 03:08:59,125 [myid:2] - WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Learner@349] - Truncating log to
> get in sync with the leader 0x200000001
> 2017-11-16 03:08:59,125 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606] -
> Created new input stream
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,125 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609] -
> Created new input archive
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,126 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@647] -
> EOF excepton java.io.EOFException
> 2017-11-16 03:08:59,126 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606] -
> Created new input stream
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,126 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609] -
> Created new input archive
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,126 [myid:2] - INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileSnap@83] - Reading snapshot
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/snapshot.200000001
> 2017-11-16 03:08:59,127 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606] -
> Created new input stream
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,127 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609] -
> Created new input archive
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.100000001
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@647] -
> EOF excepton java.io.EOFException
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@606] -
> Created new input stream
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@609] -
> Created new input archive
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/log.400000001
> 2017-11-16 03:08:59,128 [myid:2] - DEBUG
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnLog$FileTxnIterator@647] -
> EOF excepton java.io.EOFException
> 2017-11-16 03:08:59,131 [myid:2] - WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:Learner@387] - Got zxid 0x500000001
> expected 0x1
> 2017-11-16 03:08:59,132 [myid:2] - INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11227:FileTxnSnapLog@248] - Snapshotting:
> 0x500000004 to
> /home/ben/project/strata/test-5-3-ZooKeeper-3.4.11-strata-0.1/data/2/version-2/snapshot.500000004
> ...
> {panel}
> {panel:title=Log from the node 0:}
> ...
> 2017-11-16 03:08:59,050 [myid:0] - INFO
> [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11221:Leader@372] - LEADING - LEADER
> ELECTION TOOK - 222
> 2017-11-16 03:08:59,055 [myid:0] - INFO
> [LearnerHandler-/127.0.0.1:54482:LearnerHandler@346] - Follower sid: 2 : info
> : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@266e422
> 2017-11-16 03:08:59,124 [myid:0] - INFO
> [LearnerHandler-/127.0.0.1:54482:LearnerHandler@401] - Synchronizing with
> Follower sid: 2 maxCommittedLog=0x500000004 minCommittedLog=0x100000001
> peerLastZxid=0x400000001
> 2017-11-16 03:08:59,124 [myid:0] - DEBUG
> [LearnerHandler-/127.0.0.1:54482:LearnerHandler@415] - proposal size is 14
> 2017-11-16 03:08:59,124 [myid:0] - DEBUG
> [LearnerHandler-/127.0.0.1:54482:LearnerHandler@418] - Sending proposals to
> follower
> 2017-11-16 03:08:59,124 [myid:0] - INFO
> [LearnerHandler-/127.0.0.1:54482:LearnerHandler@475] - Sending TRUNC
> 2017-11-16 03:08:59,147 [myid:0] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@215] -
> Accepted socket connection from /127.0.0.1:55118
> 2017-11-16 03:08:59,184 [myid:0] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@383] - Exception
> causing close of session 0x0: ZooKeeperServer not running
> 2017-11-16 03:08:59,184 [myid:0] - DEBUG
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@386] - IOException
> stack trace
> java.io.IOException: ZooKeeperServer not running
> at
> org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:977)
> at
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:257)
> at
> org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:226)
> at java.lang.Thread.run(Thread.java:745)
> 2017-11-16 03:08:59,184 [myid:0] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@1040] - Closed
> socket connection for client /127.0.0.1:55118 (no session established for
> client)
> 2017-11-16 03:08:59,224 [myid:0] - INFO
> [LearnerHandler-/127.0.0.1:54482:LearnerHandler@535] - Received NEWLEADER-ACK
> message from 2
> 2017-11-16 03:08:59,224 [myid:0] - INFO
> [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11221:Leader@962] - Have quorum of
> supporters, sids: [ 0,2 ]; starting up and setting last processed zxid:
> 0x600000000
> ...
> {panel}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)