[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2017-06-22 Thread David Ribeiro Alves (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#6).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
18 files changed, 303 insertions(+), 254 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2017-06-23 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 8:

Gonna pause this patch series until mike's consensus metadata changes land

-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2017-06-27 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 8:

(2 comments)

need to look in more detail at the changes in consensus/ but some quicker 
questions about the MvccManager design

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

Line 267:   prepare_state_ = PREPARED;
it's sketching me out that the state transition to PREPARED happens under this 
lock acquisition for followers but not until down below for leaders. Why can't 
we keep it as it was?


Separate note: I think the big block comment in the header needs to be tweaked 
a bit.


PS8, Line 295:   // Register a dummy transaction with mvcc manager that 
we'll abort later.
instead of using a dummy transaction, would it be possible to register it early 
with the current clock time, as you've done here, and add a new API like 
MvccManager::PushTransactionTimestamp(scoped_txn, new_timestamp) which changes 
the timestamp of the existing entry?

This would avoid an allocation at least and can include an assertion that you 
can only PushTransactionTimestamp to a later timestamp


-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-01-19 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#9).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
21 files changed, 324 insertions(+), 264 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/9
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 9
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-01-19 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 9:

this last push is just rebase over the current master, that fixes the hard 
conflicts that arose because of all the changes due to 3-4-3. Still need to do 
some follow up cleanup and address comments


--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 9
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Sat, 20 Jan 2018 00:22:46 +
Gerrit-HasComments: No


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-01-25 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#10).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
21 files changed, 335 insertions(+), 270 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/10
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 10
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-01-30 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#12).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-01-30 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#15).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-01-30 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#16).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/con

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-02-09 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#17).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/con

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-02-09 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc@295
PS8, Line 295: CHECK_EQ(prepare_state_, NOT_PREPARED);
> instead of using a dummy transaction, would it be possible to register it e
Did you see this comment and the one above?



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Sat, 10 Feb 2018 01:22:52 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-02-09 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 17:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@231
PS17, Line 231:   // Appends a single message to be replicated to the peers.
update this to indicate it also does assignment of OpId and Timestamp


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@361
PS17, Line 361: op_id->CopyFrom(GetNextOpIdUnlocked());
  : RETURN_NOT_OK(time_manager_->AssignTimestamp(msg->get()));
per comment below about GetNextOpId having a side effect, here's an example of 
why I think it's dangerous. Here AssignTimestamp might fail, but then the side 
effect of GetNextOpId still took effect.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@421
PS17, Line 421:  {
nit: indent


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@440
PS17, Line 440:   if (PREDICT_FALSE(queue_state_.first_index_in_current_term == 
boost::none)) {
  : queue_state_.first_index_in_current_term = id.index();
  :   }
hrm, this seems a little surprising that GetNextOpId has this side effect 
rather than setting it on append.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648
PS17, Line 648:   RETURN_NOT_OK(HandlePendingConfigChangeUnlocked(round));
although this is right, it looks confusing because round probably isn't a 
config change. Maybe if we name it something like 
'SnoopForConfigChangeUnlocked' or 'HandleConfigChangeIfNecessaryUnlocked' it 
would be clearer?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2609
PS17, Line 2609: RaftConfigPB
this and new_config can be const refs


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2633
PS17, Line 2633:   LOG_WITH_PREFIX_UNLOCKED(INFO) << "Skipping abort of 
non-pending config change. Status:  "
   :  << status.ToString();
it should still have an opid right? ie if we get to this path it should have 
been in the queue and been assigned an op?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@110
PS17, Line 110:   // NOTE: Currently this method just updates the clock and 
stores the message's timestamp.
  :   //   It always returns Status::OK() if the clock returns 
an OK status on Update().
hrm, is this method still even very useful? Maybe it would be clearer to just 
use clock->Update?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@294
PS17, Line 294:  case LEADER: GetSerialTimestampUnlocked();
  : FALLTHROUGH_INTENDED;
this looks wrong. you are falling through to the next case which breaks anyway.

Also seems odd to be calling GetSerialTimestamp without doing anything with the 
result.

Wouldn't this function be equivalent to just say 'return 
GetSerialTimestampUnlocked'?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307
PS17, Line 307:   DCHECK(lock_.is_locked());
can you DCHECK_EQ(mode_, LEADER) here?



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Sat, 10 Feb 2018 02:14:38 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-02-13 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 17:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@27
PS16, Line 27: pehaps
perhaps


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@43
PS16, Line 43: t
readability nit: T


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@44
PS16, Line 44: t1 < t
readability nit: T1 < T


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@45
PS16, Line 45: t
readability nit: T


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@74
PS16, Line 74: aditional
additional


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h
File src/kudu/consensus/consensus-test-util.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@64
PS17, Line 64: using log::Log;
 : using strings::Substitute;
nit: usually it's not a good idea to have 'using' in header files, unless 
that's inside function body or other locally enclosed scope.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@67
PS17, Line 67: inline gscoped_ptr CreateDummyReplicateWithOpId
As I see, the pattern is always CreateDummyReplicateWithOpId().release().

Why to return a gscoped_ptr at all?  Maybe, just return a raw pointer instead.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@82
PS17, Line 82: inline gscoped_ptr 
CreateDummyReplicateWithoutOpId(int64_t payload_size) {
ditto: maybe, just return a raw pointer.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@115
PS17, Line 115: AppendMode mode = APPEND_NON_LEADER,
Instead, is it possible to rely on the mode of the object pointed to by the 
queue parameter?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@133
PS17, Line 133:
an extra space


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc
File src/kudu/consensus/consensus_meta.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc@153
PS17, Line 153: CHECK(config.has_opid_index());
Maybe, just use DCHECK() here?  As I understand, the committed configuration is 
verified by VerifyRaftConfig() upon flushing the data into the disk.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@238
PS17, Line 238:   Status LeaderAppendOperation(const ReplicateRefPtr& msg);
  :
  :   Status NonLeaderAppendOperation(const ReplicateRefPtr& msg);
Is it possible just to rely on the status of the queue and have a single method 
here?  Or you wanted it exactly like this because this way it's easier to 
enforce some constraints?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS17:
Does it make sense to add some specific test for this new behavior or it's 
already enough coverage?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2622
PS17, Line 2622: RaftConfigPB
This could be const reference as well.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2623
PS17, Line 2623:   CHECK(differencer.Equals(new_config, pending_config)) << 
"The pending config and"
nit: add a space after 'and'


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2624
PS17, Line 2624: .
nit: replace with colon?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2844
PS17, Line 2844: // We didn't know the op id index at the time we set the 
config as pending so set it
   : // so that we can still compare the configs.
   : 
pending_config.set_opid_index(config_to_commit.opid_index());
Since this is just for comparing configs in the if() closure below, move it in 
there?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc
File src/kudu/consensus/time_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc@a176
PS17, Line 176:
Maybe, add an assertion instead:

ASSERT_GT(time_manager_->GetSafeTime(), safe_before);



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Ge

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-02-20 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 17:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc@158
PS17, Line 158:   std::set uuids;
add: DCHECK(type == COMMITTED_CONFIG || type == PENDING_CONFIG)


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h
File src/kudu/consensus/raft_consensus.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h@449
PS17, Line 449: mus
must


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648
PS17, Line 648:   RETURN_NOT_OK(HandlePendingConfigChangeUnlocked(round));
> although this is right, it looks confusing because round probably isn't a c
or just: if (round->replicate_msg()->op_type() == CHANGE_CONFIG_OP) { ... }


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@662
PS17, Line 662:   if (PREDICT_TRUE(round->replicate_msg()->op_type() != 
CHANGE_CONFIG_OP)) return Status::OK();
Due to the method name, seems less surprising to add: 
DCHECK_EQ(CHANGE_CONFIG_OP, round->replicate_msg()->op_type());


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@673
PS17, Line 673:
nit: extra line


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@691
PS17, Line 691: (
nit: no need for inner parentheses


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@336
PS17, Line 336: rule
s/rule/invariant/


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@360
PS17, Line 360: constraint
s/constraint/above invariant/

By the way: is there a DCHECK somewhere in mvcc to enforce the invariant that 
clean time < any tx timestamp?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@371
PS17, Line 371: repl_state_copy
This variable masks a variable of the same name in an outer scope


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@384
PS17, Line 384: case REPLICATING
can we still get to this case?



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 20 Feb 2018 22:20:34 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-14 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 8:

(36 comments)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h
File src/kudu/consensus/consensus-test-util.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@64
PS17, Line 64:   OpId* id = msg->mutable_id();
 :   id->set_term(term);
> nit: usually it's not a good idea to have 'using' in header files, unless t
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@67
PS17, Line 67:
> As I see, the pattern is always CreateDummyReplicateWithOpId().release().
I essentially split CreateDummyReplicate in 2 versions one that sets an opid 
and one that doesn't. I can refactor the pointer ownership in a follow up if 
you feel strongly, but would like to keep that out of this straightforward 
change.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@82
PS17, Line 82: inline RaftPeerPB FakeRaftPeerPB(const std::string& uuid) {
> ditto: maybe, just return a raw pointer.
same


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@115
PS17, Line 115: continue;
> Instead, is it possible to rely on the mode of the object pointed to by the
not sure I follow...


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@133
PS17, Line 133: _
> an extra space
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc
File src/kudu/consensus/consensus_meta.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc@153
PS17, Line 153: return leader_uuid_;
> Maybe, just use DCHECK() here?  As I understand, the committed configuratio
tbh honest I don't see a problem in having this here and being extra defensive. 
the point of this patch is that the pending config is no longer assigned an 
op_id before being marked as pendign which must be set before marking it as 
commited. this is here to make sure we always enforce that last bit.
added a comment to that regard.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@231
PS17, Line 231:   // it is alive and making progress.
> update this to indicate it also does assignment of OpId and Timestamp
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@238
PS17, Line 238: bool* more_pending);
  :
  :   // Called by the consensus implementation to update the queu
> Is it possible just to rely on the status of the queue and have a single me
the latter. I pondered the "uglyness" of having two methods/extra params versus 
just relying on the state of the queue, but ultimately decided on this to make 
it clear/explicit and have extra checks.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@361
PS17, Line 361:   // Maintain a thread-safe copy of necessary members.
  :   OpId preceding_id;
> per comment below about GetNextOpId having a side effect, here's an example
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@421
PS17, Line 421: // We try to get the follower's next_index from our log.
> nit: indent
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@440
PS17, Line 440: << "while preparing 
peer request: "
  : << s.ToString() << ". 
Destination peer: "
  :
> hrm, this seems a little surprising that GetNextOpId has this side effect r
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc@158
PS17, Line 158:
> add: DCHECK(type == COMMITTED_CONFIG || type == PENDING_CONFIG)
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h
File src/kudu/consensus/raft_consensus.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h@449
PS17, Line 449: t,
> must
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS17:
> Does it make sense to add some specific test for this new behavior or it's
for writes the behavior is undistinguishable. for non-writes the behavior is 
now "correct". there's a patch furt

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-14 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#18).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 348 insertions(+), 282 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/18
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 18
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-14 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307
PS17, Line 307:   DCHECK(lock_.is_locked());
> can you DCHECK_EQ(mode_, LEADER) here?
tried this, can't do it because of unsafe config changes.



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 14 Mar 2018 23:34:21 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-14 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#19).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 347 insertions(+), 282 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/19
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 19
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-15 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 19:

> Patch Set 19: Verified-1
>
> Build Failed 
>
> http://jenkins.kudu.apache.org/job/kudu-gerrit/12511/ : FAILURE

Hum, something weird is going on with a couple of tests. Should still be worth 
reviewing though


--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 19
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 15 Mar 2018 23:41:35 +
Gerrit-HasComments: No


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-15 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#21).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 349 insertions(+), 280 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/21
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 21
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 21:

(2 comments)

Just looked at TransactionDriver first since David asked for a quick review of 
that part

http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@324
PS21, Line 324: std::lock_guard lock(lock_);
would it be correct here to add a DCHECK_EQ(transaction_->type(), 
consensus::LEADER)? I think so. Maybe would be good doc value.


http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@341
PS21, Line 341:  MvccManager::ScopedCleanTimeAdvancePin 
pin(replica->tablet()->mvcc_manager(),
  :
replica->clock()->Now())
I'm surprised that the clock()->Now() call doesn't need to be made inside of 
the MVCC lock. Otherwise isn't it racey?

T1: call clock->Now()
T2: advance clean time to a higher value
T1: call Pin(older timestamp than current clean time)

.. or are we relying on the fact that there can be no T2 starting a transaction 
at a higher 'Now' because the only thread that starts transactions is the 
prepare thread? Maybe needs some comment on that if it's the case.



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 21
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 21 Mar 2018 19:53:16 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-22 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#22).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 353 insertions(+), 280 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/22
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 22
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-22 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#23).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 356 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/23
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 23
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-22 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#24).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
24 files changed, 358 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/24
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 24
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-03-26 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 24:

(5 comments)

Just posting some nits on the commit message.  I'll try to look at the after 
two weeks, if it's still around.

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: also
drop 'also'?


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: it
t  ?


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@29
PS24, Line 29: non-atomically with
separately from


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@35
PS24, Line 35: with
it ?


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@39
PS24, Line 39: ,
replace with stop/dot ?



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 24
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 27 Mar 2018 06:08:35 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-04-23 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#25).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t, before
all the operations with a timestamp lower than t have been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with it would only allow at
most one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii) Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
24 files changed, 357 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/25
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 25
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-04-23 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 24:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: it
> t  ?
Done


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: also
> drop 'also'?
Done


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@29
PS24, Line 29: non-atomically with
> separately from
I think non-atomically here is better. after all non-separately doesn't 
necessarily mean atomically


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@35
PS24, Line 35: with
> it ?
Done


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@39
PS24, Line 39: ,
> replace with stop/dot ?
Done


http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@324
PS21, Line 324:   DCHECK_EQ(driver_type, consensus::LEADER);
> would it be correct here to add a DCHECK_EQ(transaction_->type(), consensus
Done


http://gerrit.cloudera.org:8080/#/c/7221/24/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/24/src/kudu/tablet/transactions/transaction_driver.cc@69
PS24, Line 69: using std::unique_ptr;
> warning: using decl 'unique_ptr' is unused [misc-unused-using-decls]
Done



--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 24
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Mon, 23 Apr 2018 21:15:49 +
Gerrit-HasComments: Yes


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-07-24 Thread David Ribeiro Alves (Code Review)
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#28).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
24 files changed, 358 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/28
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 28
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-10-12 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Patch Set 28:

I think we can abandon this approach now that we went with another approach to 
KUDU-2463


--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 28
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 12 Oct 2018 21:40:58 +
Gerrit-HasComments: No


[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

2018-11-26 Thread Mike Percy (Code Review)
Mike Percy has abandoned this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
..


Abandoned

superseded by https://gerrit.cloudera.org/c/11428/
--
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 28
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon