David Ribeiro Alves has uploaded a new change for review. http://gerrit.cloudera.org:8080/7221
Change subject: Simplify OpId/Timestamp assignment
......................................................................

Simplify OpId/Timestamp assignment

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp
assignment and registration between the TimeManager, MvccManager and
ConsensusQueue. These convoluted interactions are currently preventing
a clean and simple leader leases implementation, particularly the
limit of a single outstanding timestamp in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows us to wait for a certain timestamp t to be "safe" with
TimeManager and then wait for all transactions with a timestamp
t1 < t to commit, thus making sure that a scan at t is repeatable.

ii) The leader can't advance the follower safe timestamp to t before
all the operations with a timestamp lower than t have also been sent.
This is hard since timestamps are assigned outside of consensus. In
general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically
increasing in unison.

To address this we had some complex state and interactions between the
components. For instance, this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because it would only allow
at most one assigned timestamp in-flight, meaning we were not able to
extract the current safe time at will.

This patch completely addresses ii): timestamps and OpIds are now
assigned by PeerMessageQueue, atomically. This makes it trivial to
make sure that the safe time sent to replicas is in line with the sent
messages.
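
As a rough illustration of what "assigned atomically" means here, the
following is a minimal sketch, not the code in this patch. All names
(SketchOpId, SketchAssignment, SketchQueue, Assign,
SafeTimeForFollowers) are hypothetical and are not Kudu's actual
types or methods. It assumes a single lock guards both the OpId index
and the last assigned timestamp, so the two can only advance together.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical stand-in; NOT Kudu's real OpId type.
struct SketchOpId {
  int64_t term;
  int64_t index;
};

// Hypothetical pairing of an OpId with the timestamp assigned alongside it.
struct SketchAssignment {
  SketchOpId op_id;
  uint64_t timestamp;
};

// Hypothetical queue: one lock covers both counters, so the OpId index
// and the timestamp are assigned in the same critical section.
class SketchQueue {
 public:
  explicit SketchQueue(int64_t term) : term_(term) {}

  // Assigns the next OpId and a timestamp under the same lock.
  // 'now' stands in for a clock reading (an assumption of this sketch);
  // the result is forced to be strictly greater than any timestamp
  // assigned earlier, so both values increase in unison.
  SketchAssignment Assign(uint64_t now) {
    std::lock_guard<std::mutex> l(lock_);
    uint64_t ts = now > last_timestamp_ ? now : last_timestamp_ + 1;
    last_timestamp_ = ts;
    ++last_index_;
    return SketchAssignment{SketchOpId{term_, last_index_}, ts};
  }

  // A conservative safe time for followers in this sketch: every
  // operation with a timestamp at or below it has already been
  // assigned by this queue, and later ones will be strictly higher.
  uint64_t SafeTimeForFollowers() {
    std::lock_guard<std::mutex> l(lock_);
    return last_timestamp_;
  }

 private:
  std::mutex lock_;
  const int64_t term_;
  int64_t last_index_ = 0;
  uint64_t last_timestamp_ = 0;
};
```

In this sketch, a caller that reads SafeTimeForFollowers() can never
observe a value that is ahead of an operation still waiting for its
timestamp, because no operation holds a timestamp without also holding
an OpId from the same queue.
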
This also allows some simplifications and refactorings of the
PeerMessageQueue and TimeManager.

As for i), the way we enforce this constraint changed. Instead of
registering a transaction before submitting it to consensus (which we
now can't do, since consensus submission is also where we assign the
timestamp), we register another, dummy transaction on the MvccManager
before submission and abort it afterwards. This allows us to "pin"
safe time advancement down to a timestamp that is guaranteed to be
smaller than the one that will be assigned.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/tablet/transactions/transaction_driver.cc
10 files changed, 167 insertions(+), 135 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/1
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <davidral...@gmail.com>