Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#17). Change subject: Simplify OpId/Timestamp assignment and make it atomic ...................................................................... Simplify OpId/Timestamp assignment and make it atomic Context: Each message serialized through consensus is assigned a Timestamp and an OpId. These indexes are used in different contexts to establish ordering and, for the most part, they are monotonically increasing. In particular, for write messages, successive OpIds are guaranteed to have increasing Timestamps. For non-write messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because timestamps are assigned to messages in different places (for write messages they are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and CONFIG_CHANGE they are assigned inside consensus) there was no way, until now, to make sure that timestamps are monotonically increasing across message types, like OpIds are. Not being able to trust that timestamps are monotonically increasing across all message types, means that we can't use the timestamps from non-write messages to advance time. This is a requirement for leader leases, as we need a single source of truth regarding time advancement and need config changes in particular to have been assigned timestamps we can trust, but pehaps more importantly it's also a requirement to be able to advance time on bootstrap when there are no write messages in the WAL. Not being able to do this is the underlying cause of KUDU-2233. Implementation: There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions make it so that OpId and Timestamp are not assigned atomically. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 23 files changed, 350 insertions(+), 275 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/17 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 17 Gerrit-Owner: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org>