[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Mike Percy has abandoned this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Abandoned superseded by https://gerrit.cloudera.org/c/11428/ -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 28 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot (241) Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 28: I think we can abandon this approach now that we went with another approach to KUDU-2463 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 28 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 12 Oct 2018 21:40:58 + Gerrit-HasComments: No
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#28). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/test_workload.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 24 files changed, 358 insertions(+), 284 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/28 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 28 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 24: (7 comments) http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26 PS24, Line 26: it > t ? Done http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26 PS24, Line 26: also > drop 'also'? Done http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@29 PS24, Line 29: non-atomically with > separately from I think non-atomically here is better. after all non-separately doesn't necessarily mean atomically http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@35 PS24, Line 35: with > it ? Done http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@39 PS24, Line 39: , > replace with stop/dot ? Done http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@324 PS21, Line 324: DCHECK_EQ(driver_type, consensus::LEADER); > would it be correct here to add a DCHECK_EQ(transaction_->type(), consensus Done http://gerrit.cloudera.org:8080/#/c/7221/24/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: http://gerrit.cloudera.org:8080/#/c/7221/24/src/kudu/tablet/transactions/transaction_driver.cc@69 PS24, Line 69: using std::unique_ptr; > warning: using decl 'unique_ptr' is unused [misc-unused-using-decls] Done -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 24 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 23 Apr 2018 21:15:49 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#25). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t, before all the operations with a timestamp lower than t have been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with it would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii) Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/test_workload.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 24 files changed, 357 insertions(+), 284 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/25 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 25 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 24: (5 comments) Just posting some nits on the commit message. I'll try to look at the after two weeks, if it's still around. http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26 PS24, Line 26: also drop 'also'? http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26 PS24, Line 26: it t ? http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@29 PS24, Line 29: non-atomically with separately from http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@35 PS24, Line 35: with it ? http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@39 PS24, Line 39: , replace with stop/dot ? -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 24 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 27 Mar 2018 06:08:35 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#24). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/test_workload.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 24 files changed, 358 insertions(+), 284 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/24 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 24 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#23). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 23 files changed, 356 insertions(+), 284 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/23 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 23 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#22). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 23 files changed, 353 insertions(+), 280 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/22 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 22 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 21: (2 comments) Just looked at TransactionDriver first since David asked for a quick review of that part http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@324 PS21, Line 324: std::lock_guard lock(lock_); would it be correct here to add a DCHECK_EQ(transaction_->type(), consensus::LEADER)? I think so. Maybe would be good doc value. http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@341 PS21, Line 341: MvccManager::ScopedCleanTimeAdvancePin pin(replica->tablet()->mvcc_manager(), : replica->clock()->Now()) I'm surprised that the clock()->Now() call doesn't need to be made inside of the MVCC lock. Otherwise isn't it racey? T1: call clock->Now() T2: advance clean time to a higher value T1: call Pin(older timestamp than current clean time) .. or are we relying on the fact that there can be no T2 starting a transaction at a higher 'Now' because the only thread that starts transactions is the prepare thread? Maybe needs some comment on that if it's the case. -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 21 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 21 Mar 2018 19:53:16 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#21). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 23 files changed, 349 insertions(+), 280 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/21 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 21 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 19: > Patch Set 19: Verified-1 > > Build Failed > > http://jenkins.kudu.apache.org/job/kudu-gerrit/12511/ : FAILURE Hum, something weird is going on with a couple of tests. Should still be worth reviewing though -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 19 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 15 Mar 2018 23:41:35 + Gerrit-HasComments: No
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#19). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 23 files changed, 347 insertions(+), 282 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/19 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 19 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 17: (1 comment) http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc File src/kudu/consensus/time_manager.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307 PS17, Line 307: DCHECK(lock_.is_locked()); > can you DCHECK_EQ(mode_, LEADER) here? tried this, can't do it because of unsafe config changes. -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 17 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 14 Mar 2018 23:34:21 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#18). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/ts_tablet_manager.cc 23 files changed, 348 insertions(+), 282 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/18 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 18 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 8: (36 comments) http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h File src/kudu/consensus/consensus-test-util.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@64 PS17, Line 64: OpId* id = msg->mutable_id(); : id->set_term(term); > nit: usually it's not a good idea to have 'using' in header files, unless t Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@67 PS17, Line 67: > As I see, the pattern is always CreateDummyReplicateWithOpId().release(). I essentially split CreateDummyReplicate in 2 versions one that sets an opid and one that doesn't. I can refactor the pointer ownership in a follow up if you feel strongly, but would like to keep that out of this straightforward change. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@82 PS17, Line 82: inline RaftPeerPB FakeRaftPeerPB(const std::string& uuid) { > ditto: maybe, just return a raw pointer. same http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@115 PS17, Line 115: continue; > Instead, is it possible to rely on the mode of the object pointed to by the not sure I follow... http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@133 PS17, Line 133: _ > an extra space Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc File src/kudu/consensus/consensus_meta.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc@153 PS17, Line 153: return leader_uuid_; > Maybe, just use DCHECK() here? As I understand, the committed configuratio tbh honest I don't see a problem in having this here and being extra defensive. the point of this patch is that the pending config is no longer assigned an op_id before being marked as pendign which must be set before marking it as commited. this is here to make sure we always enforce that last bit. added a comment to that regard. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@231 PS17, Line 231: // it is alive and making progress. > update this to indicate it also does assignment of OpId and Timestamp Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@238 PS17, Line 238: bool* more_pending); : : // Called by the consensus implementation to update the queu > Is it possible just to rely on the status of the queue and have a single me the latter. I pondered the "uglyness" of having two methods/extra params versus just relying on the state of the queue, but ultimately decided on this to make it clear/explicit and have extra checks. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@361 PS17, Line 361: // Maintain a thread-safe copy of necessary members. : OpId preceding_id; > per comment below about GetNextOpId having a side effect, here's an example Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@421 PS17, Line 421: // We try to get the follower's next_index from our log. > nit: indent Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@440 PS17, Line 440: << "while preparing peer request: " : << s.ToString() << ". Destination peer: " : > hrm, this seems a little surprising that GetNextOpId has this side effect r Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc File src/kudu/consensus/quorum_util.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc@158 PS17, Line 158: > add: DCHECK(type == COMMITTED_CONFIG || type == PENDING_CONFIG) Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h File src/kudu/consensus/raft_consensus.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h@449 PS17, Line 449: t, > must Done http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS17: > Does it make sense to add some specific test for this new behavior or it's for writes the behavior is undistinguishable. for non-writes the behavior is now "correct". there's a patch furt
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 17: (10 comments) http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc File src/kudu/consensus/quorum_util.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc@158 PS17, Line 158: std::set uuids; add: DCHECK(type == COMMITTED_CONFIG || type == PENDING_CONFIG) http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h File src/kudu/consensus/raft_consensus.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h@449 PS17, Line 449: mus must http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648 PS17, Line 648: RETURN_NOT_OK(HandlePendingConfigChangeUnlocked(round)); > although this is right, it looks confusing because round probably isn't a c or just: if (round->replicate_msg()->op_type() == CHANGE_CONFIG_OP) { ... } http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@662 PS17, Line 662: if (PREDICT_TRUE(round->replicate_msg()->op_type() != CHANGE_CONFIG_OP)) return Status::OK(); Due to the method name, seems less surprising to add: DCHECK_EQ(CHANGE_CONFIG_OP, round->replicate_msg()->op_type()); http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@673 PS17, Line 673: nit: extra line http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@691 PS17, Line 691: ( nit: no need for inner parentheses http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@336 PS17, Line 336: rule s/rule/invariant/ http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@360 PS17, Line 360: constraint s/constraint/above invariant/ By the way: is there a DCHECK somewhere in mvcc to enforce the invariant that clean time < any tx timestamp? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@371 PS17, Line 371: repl_state_copy This variable masks a variable of the same name in an outer scope http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@384 PS17, Line 384: case REPLICATING can we still get to this case? -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 17 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 20 Feb 2018 22:20:34 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 17: (18 comments) http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@27 PS16, Line 27: pehaps perhaps http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@43 PS16, Line 43: t readability nit: T http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@44 PS16, Line 44: t1 < t readability nit: T1 < T http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@45 PS16, Line 45: t readability nit: T http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@74 PS16, Line 74: aditional additional http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h File src/kudu/consensus/consensus-test-util.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@64 PS17, Line 64: using log::Log; : using strings::Substitute; nit: usually it's not a good idea to have 'using' in header files, unless that's inside function body or other locally enclosed scope. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@67 PS17, Line 67: inline gscoped_ptr CreateDummyReplicateWithOpId As I see, the pattern is always CreateDummyReplicateWithOpId().release(). Why to return a gscoped_ptr at all? Maybe, just return a raw pointer instead. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@82 PS17, Line 82: inline gscoped_ptr CreateDummyReplicateWithoutOpId(int64_t payload_size) { ditto: maybe, just return a raw pointer. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@115 PS17, Line 115: AppendMode mode = APPEND_NON_LEADER, Instead, is it possible to rely on the mode of the object pointed to by the queue parameter? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@133 PS17, Line 133: an extra space http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc File src/kudu/consensus/consensus_meta.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc@153 PS17, Line 153: CHECK(config.has_opid_index()); Maybe, just use DCHECK() here? As I understand, the committed configuration is verified by VerifyRaftConfig() upon flushing the data into the disk. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@238 PS17, Line 238: Status LeaderAppendOperation(const ReplicateRefPtr& msg); : : Status NonLeaderAppendOperation(const ReplicateRefPtr& msg); Is it possible just to rely on the status of the queue and have a single method here? Or you wanted it exactly like this because this way it's easier to enforce some constraints? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: PS17: Does it make sense to add some specific test for this new behavior or it's already enough coverage? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2622 PS17, Line 2622: RaftConfigPB This could be const reference as well. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2623 PS17, Line 2623: CHECK(differencer.Equals(new_config, pending_config)) << "The pending config and" nit: add a space after 'and' http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2624 PS17, Line 2624: . nit: replace with colon? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2844 PS17, Line 2844: // We didn't know the op id index at the time we set the config as pending so set it : // so that we can still compare the configs. : pending_config.set_opid_index(config_to_commit.opid_index()); Since this is just for comparing configs in the if() closure below, move it in there? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc File src/kudu/consensus/time_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc@a176 PS17, Line 176: Maybe, add an assertion instead: ASSERT_GT(time_manager_->GetSafeTime(), safe_before); -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 17 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Ge
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 17: (10 comments) http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@231 PS17, Line 231: // Appends a single message to be replicated to the peers. update this to indicate it also does assignment of OpId and Timestamp http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@361 PS17, Line 361: op_id->CopyFrom(GetNextOpIdUnlocked()); : RETURN_NOT_OK(time_manager_->AssignTimestamp(msg->get())); per comment below about GetNextOpId having a side effect, here's an example of why I think it's dangerous. Here AssignTimestamp might fail, but then the side effect of GetNextOpId still took effect. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@421 PS17, Line 421: { nit: indent http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@440 PS17, Line 440: if (PREDICT_FALSE(queue_state_.first_index_in_current_term == boost::none)) { : queue_state_.first_index_in_current_term = id.index(); : } hrm, this seems a little surprising that GetNextOpId has this side effect rather than setting it on append. http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648 PS17, Line 648: RETURN_NOT_OK(HandlePendingConfigChangeUnlocked(round)); although this is right, it looks confusing because round probably isn't a config change. Maybe if we name it something like 'SnoopForConfigChangeUnlocked' or 'HandleConfigChangeIfNecessaryUnlocked' it would be clearer? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2609 PS17, Line 2609: RaftConfigPB this and new_config can be const refs http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2633 PS17, Line 2633: LOG_WITH_PREFIX_UNLOCKED(INFO) << "Skipping abort of non-pending config change. Status: " : << status.ToString(); it should still have an opid right? ie if we get to this path it should have been in the queue and been assigned an op? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc File src/kudu/consensus/time_manager.cc: http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@110 PS17, Line 110: // NOTE: Currently this method just updates the clock and stores the message's timestamp. : // It always returns Status::OK() if the clock returns an OK status on Update(). hrm, is this method still even very useful? Maybe it would be clearer to just use clock->Update? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@294 PS17, Line 294: case LEADER: GetSerialTimestampUnlocked(); : FALLTHROUGH_INTENDED; this looks wrong. you are falling through to the next case which breaks anyway. Also seems odd to be calling GetSerialTimestamp without doing anything with the result. Wouldn't this function be equivalent to just say 'return GetSerialTimestampUnlocked'? http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307 PS17, Line 307: DCHECK(lock_.is_locked()); can you DCHECK_EQ(mode_, LEADER) here? -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 17 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 10 Feb 2018 02:14:38 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 17: (1 comment) http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc@295 PS8, Line 295: CHECK_EQ(prepare_state_, NOT_PREPARED); > instead of using a dummy transaction, would it be possible to register it e Did you see this comment and the one above? -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 17 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 10 Feb 2018 01:22:52 + Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#17). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic Context: Each message serialized through consensus is assigned a Timestamp and an OpId. These indexes are used in different contexts to establish ordering and, for the most part, they are monotonically increasing. In particular, for write messages, successive OpIds are guaranteed to have increasing Timestamps. For non-write messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because timestamps are assigned to messages in different places (for write messages they are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and CONFIG_CHANGE they are assigned inside consensus) there was no way, until now, to make sure that timestamps are monotonically increasing across message types, like OpIds are. Not being able to trust that timestamps are monotonically increasing across all message types, means that we can't use the timestamps from non-write messages to advance time. This is a requirement for leader leases, as we need a single source of truth regarding time advancement and need config changes in particular to have been assigned timestamps we can trust, but pehaps more importantly it's also a requirement to be able to advance time on bootstrap when there are no write messages in the WAL. Not being able to do this is the underlying cause of KUDU-2233. Implementation: There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions make it so that OpId and Timestamp are not assigned atomically. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/con
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#16). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic Context: Each message serialized through consensus is assigned a Timestamp and an OpId. These indexes are used in different contexts to establish ordering and, for the most part, they are monotonically increasing. In particular, for write messages, successive OpIds are guaranteed to have increasing Timestamps. For non-write messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because timestamps are assigned to messages in different places (for write messages they are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and CONFIG_CHANGE they are assigned inside consensus) there was no way, until now, to make sure that timestamps are monotonically increasing across message types, like OpIds are. Not being able to trust that timestamps are monotonically increasing across all message types, means that we can't use the timestamps from non-write messages to advance time. This is a requirement for leader leases, as we need a single source of truth regarding time advancement and need config changes in particular to have been assigned timestamps we can trust, but pehaps more importantly it's also a requirement to be able to advance time on bootstrap when there are no write messages in the WAL. Not being able to do this is the underlying cause of KUDU-2233. Implementation: There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions make it so that OpId and Timestamp are not assigned atomically. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/con
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#15). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic Context: Each message serialized through consensus is assigned a Timestamp and an OpId. These indexes are used in different contexts to establish ordering and, for the most part, they are monotonically increasing. In particular, for write messages, successive OpIds are guaranteed to have increasing Timestamps. For non-write messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because timestamps are assigned to messages in different places (for write messages they are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and CONFIG_CHANGE they are assigned inside consensus) there was no way, until now, to make sure that timestamps are monotonically increasing across message types, like OpIds are. Not being able to trust that timestamps are monotonically increasing across all message types, means that we can't use the timestamps from non-write messages to advance time. This is a requirement for leader leases, as we need a single source of truth regarding time advancement and need config changes in particular to have been assigned timestamps we can trust, but pehaps more importantly it's also a requirement to be able to advance time on bootstrap when there are no write messages in the WAL. Not being able to do this is the underlying cause of KUDU-2233. Implementation: There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions make it so that OpId and Timestamp are not assigned atomically. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#12). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic Context: Each message serialized through consensus is assigned a Timestamp and an OpId. These indexes are used in different contexts to establish ordering and, for the most part, they are monotonically increasing. In particular, for write messages, successive OpIds are guaranteed to have increasing Timestamps. For non-write messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because timestamps are assigned to messages in different places (for write messages they are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and CONFIG_CHANGE they are assigned inside consensus) there was no way, until now, to make sure that timestamps are monotonically increasing across message types, like OpIds are. Not being able to trust that timestamps are monotonically increasing across all message types, means that we can't use the timestamps from non-write messages to advance time. This is a requirement for leader leases, as we need a single source of truth regarding time advancement and need config changes in particular to have been assigned timestamps we can trust, but pehaps more importantly it's also a requirement to be able to advance time on bootstrap when there are no write messages in the WAL. Not being able to do this is the underlying cause of KUDU-2233. Implementation: There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions make it so that OpId and Timestamp are not assigned atomically. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#10). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc 21 files changed, 335 insertions(+), 270 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/10 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 10 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 ) Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 9: this last push is just rebase over the current master, that fixes the hard conflicts that arose because of all the changes due to 3-4-3. Still need to do some follow up cleanup and address comments -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 9 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 20 Jan 2018 00:22:46 + Gerrit-HasComments: No
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#9). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc 21 files changed, 324 insertions(+), 264 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/9 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-Change-Number: 7221 Gerrit-PatchSet: 9 Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Todd Lipcon has posted comments on this change. Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 8: (2 comments) need to look in more detail at the changes in consensus/ but some quicker questions about the MvccManager design http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: Line 267: prepare_state_ = PREPARED; it's sketching me out that the state transition to PREPARED happens under this lock acquisition for followers but not until down below for leaders. Why can't we keep it as it was? Separate note: I think the big block comment in the header needs to be tweaked a bit. PS8, Line 295: // Register a dummy transaction with mvcc manager that we'll abort later. instead of using a dummy transaction, would it be possible to register it early with the current clock time, as you've done here, and add a new API like MvccManager::PushTransactionTimestamp(scoped_txn, new_timestamp) which changes the timestamp of the existing entry? This would avoid an allocation at least and can include an assertion that you can only PushTransactionTimestamp to a later timestamp -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
David Ribeiro Alves has posted comments on this change. Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Patch Set 8: Gonna pause this patch series until mike's consensus metadata changes land -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: No
[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#6). Change subject: Simplify OpId/Timestamp assignment and make it atomic .. Simplify OpId/Timestamp assignment and make it atomic There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to move safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned thus enforcing the constraint with a simpler implementation at the cost of an additional (short-lived) transaction in mvcc. Note in this regard that there will be at most one aditional transaction in mvcc. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential since, by design, we can have at most one pending config, but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc 18 files changed, 303 insertions(+), 254 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/6 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] Simplify OpId/Timestamp assignment
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#5). Change subject: Simplify OpId/Timestamp assignment .. Simplify OpId/Timestamp assignment There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to extract the current safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc 18 files changed, 301 insertions(+), 252 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/5 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] Simplify OpId/Timestamp assignment
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#4). Change subject: Simplify OpId/Timestamp assignment .. Simplify OpId/Timestamp assignment There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to extract the current safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc 18 files changed, 301 insertions(+), 252 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/4 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] Simplify OpId/Timestamp assignment
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#3). Change subject: Simplify OpId/Timestamp assignment .. Simplify OpId/Timestamp assignment There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to extract the current safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tserver/tablet_server-test.cc 18 files changed, 301 insertions(+), 245 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/3 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] Simplify OpId/Timestamp assignment
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7221 to look at the new patch set (#2). Change subject: Simplify OpId/Timestamp assignment .. Simplify OpId/Timestamp assignment There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to extract the current safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned. One side effect of this change is that, for new config changes, we won't know the OpId until after the queue submission, which needs to happen after actually setting the pending config. This is largely inconsequential but did require some refactoring. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/pending_rounds.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/integration-tests/alter_table-test.cc M src/kudu/tablet/transactions/transaction_driver.cc 15 files changed, 247 insertions(+), 206 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/2 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot
[kudu-CR] Simplify OpId/Timestamp assignment
David Ribeiro Alves has uploaded a new change for review. http://gerrit.cloudera.org:8080/7221 Change subject: Simplify OpId/Timestamp assignment .. Simplify OpId/Timestamp assignment There are a couple of constraints that are tricky to enforce and were requiring us to have convoluted interactions on OpId/Timestamp assignment and registration between the TimeManager, MvccManager and ConsensusQueue. These convoluted interactions are currently preventing a clean and simple leader leases implementation, particularly the single timestamp outstanding in TimeManager. The "tricky" constraints are: i) Before clean time is advanced in mvcc, all transactions with a timestamp lower than the current time must be registered with mvcc. This allows to wait for a certain timestamp t to be "safe" with TimeManager and then wait for all transactions with a timestamp t1 < t to commit, thus making sure that a scan at t is repeatable. ii) The leader can't advance follower safe timestamp to t before all the operations with a timestamp lower than it have also been sent. This is hard since timestamps are assigned outside of consensus. In general it's problematic to generate OpIds non-atomically with timestamps, since these are both supposed to be monotonically increasing in unison. To address this we had some complex state and interactions between the components. For instance this was requiring us to have TimeManager stop safe time advancement until a transaction being prepared was submitted to the queue, after assigning a timestamp in the TransactionDriver. This was problematic because with would only allow at most one assigned timestamp in-flight, meaning we were not able to extract the current safe time at will. This patch completely addresses ii), Timestamps and OpIds are now assigned by PeerMessageQueue, atomically. This makes it trivial to make sure that the safe time sent to replicas is inline with the sent messages. This also allows some simplifications and refactorings of the PeerMessageQueue and TimeManager. As for i) the way we enforce this constraint changed. Instead of registering a transaction before submitting it to consensus (which we now can't since consensus submission is also where we assign the timestamp) we register another, dummy transaction on the MvccManager before submission and abort it afterwards. This allows to "pin" safe time advancement down to a timestamp that is mandatorily smaller than then one that will be assigned. Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 --- M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/log_cache-test.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/time_manager-test.cc M src/kudu/consensus/time_manager.cc M src/kudu/consensus/time_manager.h M src/kudu/tablet/transactions/transaction_driver.cc 10 files changed, 167 insertions(+), 135 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/1 -- To view, visit http://gerrit.cloudera.org:8080/7221 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves