subject:"\[kudu\-CR\] \[catalog manager\] categorization of rw operation failures"

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 26:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS26, Line 471: const
> Why not?  And why Status is so special?
ok, I think your idea of following the same style here makes sense.


PS26, Line 484: const
> Probably, const Status& s = ...; would be more preferable?
ok, agreed.


PS26, Line 762: const
> Yep, I just want to clarify on why it's not acceptable.
Keeping the same style makes sense.

However, in this case I would prefer to keep the const modifier.  The reason is 
that this method returns that value in the very end, and for the reader it's 
good to have an easy way to know that's the exact value captured in the very 
beginning of the function.

If you feel strongly opposed to this, I'll update it removing the const 
modifier.  But I hope you will also find it's better from the readability 
standpoint.


PS26, Line 3231: LOG_WITH_PREFIX(INFO)
   : << "Latest config for has opid_index of " << 
latest_index
   : << " while this task has opid_index of "
   : << cstate_.config().opid_index() << ". Aborting task.";
> spurious reformat
Done


PS26, Line 3285:   LOG_WITH_PREFIX(WARNING)
   :   << "ChangeConfig() failed with leader "
   :   << target_ts_desc_->ToString()
   :   << " due to CAS failure. No further retry: "
   :   << status.ToString();
   :   MarkFailed();
   :   break;
   : default:
   :   LOG_WITH_PREFIX(INFO)
   :   << "ChangeConfig() failed with leader "
   :   << target_ts_desc_->ToString()
   :   << " due to error "
   :   << 
TabletServerErrorPB::Code_Name(resp_.error().code())
   :   << ". This operation will be retried. Error detail: "
   :   << status.ToString();
   :   break;
> spurious reformat
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#28).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master_service.cc
M src/kudu/master/sys_catalog-test.cc
7 files changed, 507 insertions(+), 149 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/28
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 28
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 26:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS26, Line 82:  We want this to happen as often as possible, but
 :   // due to truncation issues (double --> int64_t) two TSK 
keys might be
 :   // generated with interval less than 1 second, which is 
not desirable.
> not sure what's the intent of this sentence? do you do something to mitigat
Agreed, I think it's better to remove this comment.


PS26, Line 88: master_tsk_propagated_by_non_leaders
> this flag is a bit cryptic, at first I thought this was referring to tablet
agreed


PS26, Line 114: builder
  : .default_admin_operation_timeout(timeout)
> no need to wrap, in fact could you do all of this in the decl line?
I don't know how to do that in the decl line if not creating the second 
instance of the class, using copy constructor.


Line 126: // signed by the key which hasn't yet reached the tserver. This 
can
> with a key which hasn't yet reached the tserver or non-leader master (or so
Both.

This is workaround is no longer needed, though.


PS26, Line 132: IPKI
> can you TODO yourself?
no longer relavant -- the TODO is resolved.


PS26, Line 134: Ideally, (hb_interval_ms_ * 2) would be enough, but due
  : //   to network latency and multi-process OS, the factor 
should be greater.
> how about: We allow for extra slop beyond the minimum (hb_interval_ms_ * 2)
no longer relavant.


PS26, Line 136: SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_ * 10));
> these timeouts always come back to bite us in the form of flakyness. could 
This is no longer needed since the auto-retry for ERROR_UNAVAILABLE is already 
implemented.


PS26, Line 137: //std::min(run_time_seconds_ * 1000 / 2, 
hb_interval_ms_ * 50)));
> did you mean to leave this?
Done


PS26, Line 185: //SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_));
> left over garbage?
Done


http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS26, Line 471: const
> we never declare status const. I know you can here, but I haven't seen that
Why not?  And why Status is so special?

It's just a simple notation to explicitly say that it's not to change in the 
whole scope.


PS26, Line 484: const
> same
Probably, const Status& s = ...; would be more preferable?


PS26, Line 492: LOG(WARNING) << err_msg << "will try next time";
> this is not very informative
Well, it would be

'failed to refresh TSK: : will try next time'

It reports that adding a newly generated TSK did not come through, and the 
catalog manager will retry it on next cycle.  However, the TokenSigner still 
has a valid key to use.

I'll replace 'next time' with 'next cycle'.


PS26, Line 494: has left
> where did it go? :)
:)

Good catch!


PS26, Line 762: const
> same
Yep, I just want to clarify on why it's not acceptable.


PS26, Line 777:  leadership change in the middle:
  : // if the master server loses its leadership role, then 
there will be an
  : // error if trying to write into the system table. If that 
happens, we do
  : // not activate the newly generated CA information since it 
is no longer
  : // relevant. If it was leadership change indeed, the system 
table should
  : // contain the CA certificate information when this master 
is re-elected
  : // next.
> numbered bullets or some better way to lay this out. it's hard to parse as 
Done


PS26, Line 794:  
> a
Done


PS26, Line 857: LOG(INFO) << "Generated new certificate authority record";
> can you provide more info? it'd be good to print a seq no or some identifie
Good idea: we could print so called fingerprint of the private  key.  However, 
currently we don't have appropriate methods in PrivateKey and CaCert classes.

Let me address that in a separate changelist.


PS26, Line 914: using std::bind;
> using declarations go on the top of the file
I could not find that in the code style guide.  Putting that into the top of 
the file does not make much sense because in this file there boost::bind is 
used as well.  So, I decided to use std::bind() when necessary.


PS26, Line 919: auto wrapper = [this](std::function func,
  :   const Consensus& consensus,
  :   int64_t start_term,
  :   const char* op_description) {
  :   leader_lock_.AssertAcquiredForWriting();
  :   const Status s = func();
> not really sure what's happening here. can you simplify?
I thi

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#27).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master_service.cc
M src/kudu/master/sys_catalog-test.cc
7 files changed, 524 insertions(+), 163 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/27
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-05-04 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 26:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS26, Line 82:  We want this to happen as often as possible, but
 :   // due to truncation issues (double --> int64_t) two TSK 
keys might be
 :   // generated with interval less than 1 second, which is 
not desirable.
not sure what's the intent of this sentence? do you do something to mitigate 
the "undesirability"?


PS26, Line 88: master_tsk_propagated_by_non_leaders
this flag is a bit cryptic, at first I thought this was referring to tablet 
servers. not sure it's defined in this patch but 
master_non_leader_masters_propagate_tsk would be clearer, IMO


PS26, Line 114: builder
  : .default_admin_operation_timeout(timeout)
no need to wrap, in fact could you do all of this in the decl line?


Line 126: // signed by the key which hasn't yet reached the tserver. This 
can
with a key which hasn't yet reached the tserver or non-leader master (or 
somesuch)


PS26, Line 132: IPKI
can you TODO yourself?


PS26, Line 134: Ideally, (hb_interval_ms_ * 2) would be enough, but due
  : //   to network latency and multi-process OS, the factor 
should be greater.
how about: We allow for extra slop beyond the minimum (hb_interval_ms_ * 2) to 
account for network latency and thread preemption.


PS26, Line 136: SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_ * 10));
these timeouts always come back to bite us in the form of flakyness. could we 
sleep the minimum and keep retrying for a while (always making sure that we get 
the expected error and none other)


PS26, Line 137: //std::min(run_time_seconds_ * 1000 / 2, 
hb_interval_ms_ * 50)));
did you mean to leave this?


PS26, Line 185: //SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_));
left over garbage?


http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS26, Line 471: const
we never declare status const. I know you can here, but I haven't seen that 
anywhere, so I'd prefer to follow the same style


PS26, Line 484: const
same


PS26, Line 492: LOG(WARNING) << err_msg << "will try next time";
this is not very informative


PS26, Line 494: has left
where did it go? :)


PS26, Line 762: const
same


PS26, Line 777:  leadership change in the middle:
  : // if the master server loses its leadership role, then 
there will be an
  : // error if trying to write into the system table. If that 
happens, we do
  : // not activate the newly generated CA information since it 
is no longer
  : // relevant. If it was leadership change indeed, the system 
table should
  : // contain the CA certificate information when this master 
is re-elected
  : // next.
numbered bullets or some better way to lay this out. it's hard to parse as it is


PS26, Line 794:  
a


PS26, Line 857: LOG(INFO) << "Generated new certificate authority record";
can you provide more info? it'd be good to print a seq no or some identifier


PS26, Line 914: using std::bind;
using declarations go on the top of the file


PS26, Line 919: auto wrapper = [this](std::function func,
  :   const Consensus& consensus,
  :   int64_t start_term,
  :   const char* op_description) {
  :   leader_lock_.AssertAcquiredForWriting();
  :   const Status s = func();
not really sure what's happening here. can you simplify?


PS26, Line 961: static const char* const kLoadMetaOpDescription =
  : "Loading table and tablet metadata into memory";
can you declare this const on the top of the file?


PS26, Line 3231: LOG_WITH_PREFIX(INFO)
   : << "Latest config for has opid_index of " << 
latest_index
   : << " while this task has opid_index of "
   : << cstate_.config().opid_index() << ". Aborting task.";
spurious reformat


PS26, Line 3285:   LOG_WITH_PREFIX(WARNING)
   :   << "ChangeConfig() failed with leader "
   :   << target_ts_desc_->ToString()
   :   << " due to CAS failure. No further retry: "
   :   << status.ToString();
   :   MarkFailed();
   :   break;
   : default:
   :   LOG_WITH_PREFIX(INFO)
   :   << "ChangeConfig() failed with leader "
   :   << target_ts_desc_->ToString()

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 26:

> Build Successful
 > 
 > http://104.196.14.100/job/kudu-gerrit/6937/ : SUCCESS

1024 test iterations has passed without error:
  http://dist-test.cloudera.org//job?job_id=aserbin.1489090956.900

Prior to the with introduced in https://gerrit.cloudera.org/#/c/6329/  there 
were 3-5 failures for every 256 runs.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 26:

> Uploaded patch set 26.

The race has gone:

  http://dist-test.cloudera.org//job?job_id=aserbin.1489089006.9779

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#26).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master_service.cc
M src/kudu/master/sys_catalog-test.cc
7 files changed, 486 insertions(+), 160 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/26
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-09 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 25:

you're right, the logs show that TSK skips a number. e.g. 23, 24, 26, 25

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 25:

> I chased the bug. It's the following. Say node A is at term 10 and
 > is leader current TSK seq no is 0.
 > 1 - Starts CatalogManagerBgTasks::Run(), which runs since its
 > leader, but takes a while to actually get to the part
 > TryGenerateNewTskUnlocked() is called.
 > 2 ,- In the meanwhile A loses leadership, B takes over and
 > generates TSK 1, later TSK 2.
 > 3 - B loses leadership, A wins it again.
 > 4 - Before A gets a chance to run the "leader election callback"
 > the bg task from 1 completes (it can because it's leader again).
 > The TSK that gets written is 1, breaking monotonicity.
 > 
 > Note that this is a very contrived scenario that needs leadership
 > interleaving that is likely unrealistic when TSK's last days.

One nit: in that scenario TSK 1 should fail to be written, so there are no TSK 
1, just TSK 0 and TSK 2 in the table.  If TSK 1 is already in the table, it 
would not be possible for the bg task to write a record with the same composite 
key.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 25:

> Sorry for the rebase. Did it to post a WIP of my debugging code:
 > https://gerrit.cloudera.org/#/c/6327/

That's fine.  I think it's good to have more logging here, otherwise it 
involves too much guesswork to find what's going on :)

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-09 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

Sorry for the rebase. Did it to post a WIP of my debugging code: 
https://gerrit.cloudera.org/#/c/6327/

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

> > > Given that we're still chasing strange test failures on this,
 > and
 > > > it's on a pretty important and tricky code path, maybe we
 > should
 > > > chat about whether it's really necessary for 1.3.0? i.e are the
 > > > downside risks of not having it included in the release worse
 > > than
 > > > the downside risks of potential bugs? I haven't followed it
 > > closely
 > > > but as the 1.3 RM I'm feeling nervous about complex patches
 > > coming
 > > > in very close to the first rc being cut (hoping to do that
 > > > tomorrow)
 > >
 > > Yes, that's a very good point.  However, I think I understand
 > what
 > > is the issue.  The issue is that upon master leadership change
 > the
 > > new leader sometimes does not see the last successful write from
 > > the former leader.  That bug can affect table/tablet metadata as
 > > well.  I.e., the newly created tablet could be overlooked at
 > > leadership change, and it will be seen only on the next call of
 > > ElectedLeaderCb.
 > >
 > > E.g., it's possible to take a look at log from 
 > > https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip
 > >
 > > This is what I think happened there:
 > >
 > > 1. The master at 127.0.0.1:11032 generated and successfully wrote
 > > TSK with id 0 (I0309 04:05:44.679442   558 catalog_manager.cc:3432]
 > > Generated new TSK 0).  Later on, re-election happened and it was
 > > elected as a leader again, and it generated TSK with id 1 but
 > since
 > > there is injected latency prior to writing the key into the
 > table,
 > > it failed to write it into the system table due to leadership
 > term
 > > change.
 > >
 > > 2. Some other leader started its leadership duties but failed to
 > > caught up as a leader.
 > >
 > > 3. Our former (first) master server became the leader again and
 > it
 > > generated and successfully written TSK with id 2 (I0309
 > > 04:05:46.812899   668 catalog_manager.cc:3432] Generated new TSK
 > 2)
 > >
 > > 4. Right after that leadership changed and other master server
 > ran
 > > its ElectedAsLeaderCb and it did not see the latest TSK record in
 > > the system table.  Seeing just the record with TSK id 0, it
 > > generated and successfully written its new TSK with id 1 (I0309
 > > 04:05:47.270205   378 catalog_manager.cc:3432] Generated new TSK
 > 1)
 > >
 > > 5. Now, the client has connected to the current master leader
 > which
 > > has just generated TSK and made it current (TSK rotation period
 > is
 > > 2 seconds).  It got authn token signed by TSK with id 1.
 > >
 > > 6. The client tries to execute write operation against the tablet
 > > server which has received TSKs with id 0 and 2.  The tablet
 > server
 > > cannot see the TSK with id 1 because the new master does not send
 > > it in response since the tablet server sends 2 as the latest TSK
 > > id.
 > >
 > > 7. The tablet server responded with 'Runtime error:
 > > ERROR_UNAVAILABLE: Not authorized: authentication token signed
 > with
 > > unknown key' while the client tried to negotiate the connection.
 > 
 > Instead of the link to the artifacts it's possible to use
 > http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361
 > and retrieve the artifacts of the very first failure in the list.

David and I looked at this a little bit more and David suspects that we could 
have the following problem:

1. While still the leader master, the bg task starts doing its job
2. The leadership changes twice while the bg task was running -- the master has 
lost its leadership and the leadership returned back to the master, but the bg 
task holds that lock (in rd mode).
3. Since the ElectedAsLeaderCb cannot run (it tries to acquire the lock in wr 
mode), it bg task might come into the point where it's ready to write into the 
system table and that write would be successful.  If it were refreshing TSK to 
set it to (T), and the current TSK id is already (T + 1) since the 
in-the-middle-leader failed to write (T) TSK, so there are (T - 1)  and (T + 1) 
TSKs in the table, but there is no (T) TSK yet.  In that case, the bg task 
would write a new TSK with stale id into the system table and there would be no 
error, since the master is the leader again.

So, this patch requires some clarification in that regard.  I think we should 
proceed as Todd suggested cutting RC without this patch.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Bu

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-09 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

I chased the bug. It's the following. Say node A is at term 10 and is leader 
current TSK seq no is 0.
1 - Starts CatalogManagerBgTasks::Run(), which runs since its leader, but takes 
a while to actually get to the part TryGenerateNewTskUnlocked() is called.
2 ,- In the meanwhile A loses leadership, B takes over and generates TSK 1, 
later TSK 2.
3 - B loses leadership, A wins it again.
4 - Before A gets a chance to run the "leader election callback" the bg task 
from 1 completes (it can because it's leader again). The TSK that gets written 
is 1, breaking monotonicity.

Note that this is a very contrived scenario that needs leadership interleaving 
that is likely unrealistic when TSK's last days.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

> > Given that we're still chasing strange test failures on this, and
 > > it's on a pretty important and tricky code path, maybe we should
 > > chat about whether it's really necessary for 1.3.0? i.e are the
 > > downside risks of not having it included in the release worse
 > than
 > > the downside risks of potential bugs? I haven't followed it
 > closely
 > > but as the 1.3 RM I'm feeling nervous about complex patches
 > coming
 > > in very close to the first rc being cut (hoping to do that
 > > tomorrow)
 > 
 > Yes, that's a very good point.  However, I think I understand what
 > is the issue.  The issue is that upon master leadership change the
 > new leader sometimes does not see the last successful write from
 > the former leader.  That bug can affect table/tablet metadata as
 > well.  I.e., the newly created tablet could be overlooked at
 > leadership change, and it will be seen only on the next call of
 > ElectedLeaderCb.
 > 
 > E.g., it's possible to take a look at log from 
 > https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip
 > 
 > This is what I think happened there:
 > 
 > 1. The master at 127.0.0.1:11032 generated and successfully wrote
 > TSK with id 0 (I0309 04:05:44.679442   558 catalog_manager.cc:3432]
 > Generated new TSK 0).  Later on, re-election happened and it was
 > elected as a leader again, and it generated TSK with id 1 but since
 > there is injected latency prior to writing the key into the table,
 > it failed to write it into the system table due to leadership term
 > change.
 > 
 > 2. Some other leader started its leadership duties but failed to
 > caught up as a leader.
 > 
 > 3. Our former (first) master server became the leader again and it
 > generated and successfully written TSK with id 2 (I0309
 > 04:05:46.812899   668 catalog_manager.cc:3432] Generated new TSK 2)
 > 
 > 4. Right after that leadership changed and other master server ran
 > its ElectedAsLeaderCb and it did not see the latest TSK record in
 > the system table.  Seeing just the record with TSK id 0, it
 > generated and successfully written its new TSK with id 1 (I0309
 > 04:05:47.270205   378 catalog_manager.cc:3432] Generated new TSK 1)
 > 
 > 5. Now, the client has connected to the current master leader which
 > has just generated TSK and made it current (TSK rotation period is
 > 2 seconds).  It got authn token signed by TSK with id 1.
 > 
 > 6. The client tries to execute write operation against the tablet
 > server which has received TSKs with id 0 and 2.  The tablet server
 > cannot see the TSK with id 1 because the new master does not send
 > it in response since the tablet server sends 2 as the latest TSK
 > id.
 > 
 > 7. The tablet server responded with 'Runtime error:
 > ERROR_UNAVAILABLE: Not authorized: authentication token signed with
 > unknown key' while the client tried to negotiate the connection.

Instead of the link to the artifacts it's possible to use 
http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361 and retrieve 
the artifacts of the very first failure in the list.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

> Given that we're still chasing strange test failures on this, and
 > it's on a pretty important and tricky code path, maybe we should
 > chat about whether it's really necessary for 1.3.0? i.e are the
 > downside risks of not having it included in the release worse than
 > the downside risks of potential bugs? I haven't followed it closely
 > but as the 1.3 RM I'm feeling nervous about complex patches coming
 > in very close to the first rc being cut (hoping to do that
 > tomorrow)

Yes, that's a very good point.  However, I think I understand what is the 
issue.  The issue is that upon master leadership change the new leader 
sometimes does not see the last successful write from the former leader.  That 
bug can affect table/tablet metadata as well.  I.e., the newly created tablet 
could be overlooked at leadership change, and it will be seen only on the next 
call of ElectedLeaderCb.

E.g., it's possible to take a look at log from 
https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip

This is what I think happened there:

1. The master at 127.0.0.1:11032 generated and successfully wrote TSK with id 0 
(I0309 04:05:44.679442   558 catalog_manager.cc:3432] Generated new TSK 0).  
Later on, re-election happened and it was elected as a leader again, and it 
generated TSK with id 1 but since there is injected latency prior to writing 
the key into the table, it failed to write it into the system table due to 
leadership term change.

2. Some other leader started its leadership duties but failed to caught up as a 
leader.

3. Our former (first) master server became the leader again and it generated 
and successfully written TSK with id 2 (I0309 04:05:46.812899   668 
catalog_manager.cc:3432] Generated new TSK 2)

4. Right after that leadership changed and other master server ran its 
ElectedAsLeaderCb and it did not see the latest TSK record in the system table. 
 Seeing just the record with TSK id 0, it generated and successfully written 
its new TSK with id 1 (I0309 04:05:47.270205   378 catalog_manager.cc:3432] 
Generated new TSK 1)

5. Now, the client has connected to the current master leader which has just 
generated TSK and made it current (TSK rotation period is 2 seconds).  It got 
authn token signed by TSK with id 1.

6. The client tries to execute write operation against the tablet server which 
has received TSKs with id 0 and 2.  The tablet server cannot see the TSK with 
id 1 because the new master does not send it in response since the tablet 
server sends 2 as the latest TSK id.

7. The tablet server responded with 'Runtime error: ERROR_UNAVAILABLE: Not 
authorized: authentication token signed with unknown key' while the client 
tried to negotiate the connection.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-08 Thread Todd Lipcon (Code Review)

Todd Lipcon has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

Given that we're still chasing strange test failures on this, and it's on a 
pretty important and tricky code path, maybe we should chat about whether it's 
really necessary for 1.3.0? i.e are the downside risks of not having it 
included in the release worse than the downside risks of potential bugs? I 
haven't followed it closely but as the 1.3 RM I'm feeling nervous about complex 
patches coming in very close to the first rc being cut (hoping to do that 
tomorrow)

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 24:

> Uploaded patch set 24.

The test results at:
  http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361

There is weird failure case of happened at dist-test-slave-dist-test-slave-28mz

It seems like write to the system table was reported as success and then the 
leader stepped down, however that new record hasn't been picked up by the new 
leader.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#24).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master_service.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/security/token-test.cc
8 files changed, 492 insertions(+), 136 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/24
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 22:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/22/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 38: using std::atomic;
> warning: using decl 'atomic' is unused [misc-unused-using-decls]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#23).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
6 files changed, 418 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/23
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 23
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-08 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 22:

when you do loop in dist tests or something, please post the command/build 
used, the results a link to the run results

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#22).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
6 files changed, 420 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/22
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 21:

> could you loop the test a bit?

Sure -- that's what I did yesterday.  But I was mostly running it without 
KUDU_ALLOW_SLOW_TESTS yesterday -- will run it more with KUDU_ALLOW_SLOW_TESTS 
today.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-08 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 21: Code-Review+1

could you loop the test a bit?

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#21).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
6 files changed, 429 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/21
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#20).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
6 files changed, 428 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/20
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 20
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS18, Line 120: ErrorStatusPB::ERROR_UNAVAILABLE
> Nope, that's not transpire to the caller -- to the caller it looks like Not
Correction:

That's about a token signed with a key which hasn't yet propagated to the 
target tablet server.  However, the client already has gotten the token signed 
by the key.

That's why we have that propagation period long enough for production 
installations.  You can find more info in the TokenSigner class comments in 
src/kudu/security/token_signer.h  file.


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#19).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
6 files changed, 428 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/19
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..

Patch Set 18:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS18, Line 71: --raft_enable_pre_election=false
> ah forgot to suggest that you add this flag. nice. btw if this is a single
Yep, having this flag greatly increases chances of leadership change in this
scenario -- I found it while trying to achieve better reproduction ratio :)

That's a good observation -- yep, I think this flag makes sense only for
masters in this scenario regardless on number of involved tservers. Gooo

PS18, Line 120: ErrorStatusPB::ERROR_UNAVAILABLE
> does this error transpire all the way to the caller? i.e. could you inspect
Nope, that's not transpire to the caller -- to the caller it looks like
NotAuthorized status, and it's not possible to see the server sent
ERROR_UNAVAILABLE (except for parsing the error message, which we are not about
to do).

That is not about expired/old token. That's about a token signed with a key
which hasn't yet propagated to the target tablet server, but already received
by the client.

As for the expired token errors, there are no calls in this test which would
fail do to an expired token -- as you can see, the TokenSigner configured to
issue tokens valid for the duration of the test. Also, the token verification
happens during connection establishment, and that's the only place it's checked.

PS18, Line 121: gets authn token signed by the
: // key which hasn't yet reached the server.
> do you mean: "gets authn token signed by the master which hasn't yet reache
Here it's about the key used for token verification. So, probably it would be
easier to understand if I replace 'server' with 'tserver'.

Line 128: // NOTE: This should be removed once the client is updated to
automatically
> mark with TODO(PKI)
Done

Line 159: // Check that master servers do not crash on change of leadership
while
> yeah are you sure that the master leadership is changing?
Yes, I'm: I spent some time yesterday verifying that and discovered that
'raft_enable_pre_election' option to make it happen more often.

It's possible to see that in the log while running the test: search for
messages like 'failed to refresh TSK'. Usually, within 60 second run it
happens 20-30 times when running the debug build.

Line 159: // Check that master servers do not crash on change of leadership
while
> What's forcing leadership change in this test?
The leadership changes are provoked by the injected latency just after
generating TSK key but prior to writing it into the system table: setting
--leader_failure_max_missed_heartbeat_periods flag to just one heartbeat period
and unsetting --raft_enable_pre_election gives high chances of re-election to
happen while current leader has blocked its leadership-related activity. It's
possible to see that in logs while running the test. We want to observe
messages like 'failed to refresh TSK', and that happens pretty often.

I added the comment for the test.

Line 176: waiter.join();
> Could you change this so that SmokeTestCluster is run in the thread in a lo
Why is it any better? Instead, if some error happened in the thread, I'm not
sure it's correctly handled by the gtest -- as I remember, there was an issue
to have ASSERT_xxx() in non-main thread so it would be possible to call just
CHECK_xx() instead.

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 473: // If there is an error (e.g., we are not the leader)
abort this task
> With the additions below, this path is no longer aborting the task. Should
Since from this point it's not clear why the error happened (it might be
something else besides leadership change), I think it's worth continuing and
try to do TSK-related work. If that was leadership-related problem, it will be
detected by the code below pretty fast.

That's why I think it's better not to put 'continue' here.

PS18, Line 499: the TokenSigner has left with no key
> this wording isn't clear. It may not be needed at all.
ok, I'll remove it if you think it's not needed. The idea was to give some
reasoning why the process was forced to crash.

--
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Rev

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-08 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 18:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS18, Line 71: --raft_enable_pre_election=false
ah forgot to suggest that you add this flag. nice. btw if this is a single 
tablet server setup you just need to set these flags for the master right?


PS18, Line 120: ErrorStatusPB::ERROR_UNAVAILABLE
does this error transpire all the way to the caller? i.e. could you inspect the 
status for a particular error and/or string and just ignore it and keep trying 
until it succeeds? What are the calls in this method that might fail due to an 
expired/old token?


PS18, Line 121: gets authn token signed by the
  : // key which hasn't yet reached the server.
do you mean: "gets authn token signed by the master which hasn't yet reached 
the tablet server." ?


Line 159: // Check that master servers do not crash on change of leadership 
while
> What's forcing leadership change in this test?
yeah are you sure that the master leadership is changing?


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-08 Thread Dan Burkert (Code Review)

Dan Burkert has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..

Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 128
mark with TODO(PKI)

Line 159
What's forcing leadership change in this test?

Line 176
Could you change this so that SmokeTestCluster is run in the thread in a loop,
and the main thread sets the atomic? That's how most of these tests are
structured.

http://gerrit.cloudera.org:8080/#/c/6170/18/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 473: }
With the additions below, this path is no longer aborting the task. Should it
be? I think it may be appropriate to add a 'continue' after logging the error.

PS18, Line 499: eout expiration.
this wording isn't clear. It may not be needed at all.

--
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin
Gerrit-Reviewer: Adar Dembo
Gerrit-Reviewer: Alexey Serbin
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#18).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
6 files changed, 423 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/18
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 17: Verified+1

Flake: MasterFailoverTest.TestCreateTableSync

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#17).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
7 files changed, 419 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/17
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#16).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
7 files changed, 404 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/16
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 16
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#15).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
7 files changed, 404 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/15
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS13, Line 61: std::copy(master_flags.begin(), master_flags.end(),
 : std::back_inserter(cluster_opts_.extra_master_flags))
> oh, somehow I though that cluster_opts_ had common flags and that you had a
Thanks the clarification!


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-07 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS13, Line 61: std::copy(master_flags.begin(), master_flags.end(),
 : std::back_inserter(cluster_opts_.extra_master_flags))
> I'm not sure I follow -- I'm adding the master flags (which are specific to
oh, somehow I though that cluster_opts_ had common flags and that you had added 
the master flags to those and to the master specific flags. Now I see it 
doesn't my bad. carry on


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 13:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS13, Line 61: std::copy(master_flags.begin(), master_flags.end(),
 : std::back_inserter(cluster_opts_.extra_master_flags))
> you're adding these flags twice, once to common and one to the masters. the
I'm not sure I follow -- I'm adding the master flags (which are specific to 
masters only) to masters, and not adding to common flags.

Why do you think those are added twice?


PS13, Line 80:   // Wait while the cluster runs -- there should be multiple TSK 
generated.
 :   const double run_time_seconds = AllowSlowTests() ? 300 : 60;
 :   SleepFor(MonoDelta::FromSeconds(run_time_seconds));
> I think we need to have a workload of some sort so that we're sure that eve
We can safely use a client to do the work, if necessary.  What  do kind of 
workload do we need?  Just creation of tables or we want to populate it with 
some data as well?


http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

Line 605
> oops was commenting on the wrong rev. I see you fixed it, never mind (thoug
Done


Line 605
> random formatting changes such as this to parts of the code otherwise uncha
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#14).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
7 files changed, 330 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/14
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-07 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

Line 605
> random formatting changes such as this to parts of the code otherwise uncha
oops was commenting on the wrong rev. I see you fixed it, never mind (though I 
think my point is relevant in general :) )


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-07 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS13, Line 61: std::copy(master_flags.begin(), master_flags.end(),
 : std::back_inserter(cluster_opts_.extra_master_flags))
you're adding these flags twice, once to common and one to the masters. the 
master daemon receives both.


PS13, Line 80:   // Wait while the cluster runs -- there should be multiple TSK 
generated.
 :   const double run_time_seconds = AllowSlowTests() ? 300 : 60;
 :   SleepFor(MonoDelta::FromSeconds(run_time_seconds));
I think we need to have a workload of some sort so that we're sure that 
everything eventually is ok. since we can't use the client as it can't handle 
this situation, could you use the rpc api directly to perform some simple rpcs 
against the masters and make sure that eventually you get a valid token.
I assume such a workload would fail now since latency >> than the heartbeat 
interval.


http://gerrit.cloudera.org:8080/#/c/6170/13/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

Line 605
random formatting changes such as this to parts of the code otherwise unchanged 
are subjective (the previous person probably thought it looked better this way 
and this passed code review) and put more burden on the reviewer, which must 
then make a judgement call on whether these look better or write a comment such 
as this.

please revert this and the changes below.


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#13).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
7 files changed, 330 insertions(+), 138 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/13
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 11:

(2 comments)

Added ./bin/catalog_manager_tsk-itest

http://gerrit.cloudera.org:8080/#/c/6170/11/src/kudu/master/sys_catalog-test.cc
File src/kudu/master/sys_catalog-test.cc:

PS11, Line 425: EXPECT_EQ
> why the change to EXPECT?
Because this is not crucial for the following code (which is none), so it's 
EXPECT_EQ() is enough.  If you mean it ASSERT_EQ() is good as well, so less 
unrelated changes -- I'll return it back.


http://gerrit.cloudera.org:8080/#/c/6170/11/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

Line 76: using kudu::tserver::TabletServerErrorPB;
> warning: using decl 'TabletServerErrorPB' is unused [misc-unused-using-decl
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#12).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Added a test to verify that the master server does not crash if change
of leadership detected while trying to persist a newly generated TSK
(Token Signing Key).

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/catalog_manager_tsk-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
7 files changed, 328 insertions(+), 138 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/12
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-06 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 11:

(1 comment)

I think the previous rev demonstrated that this is lacking coverage. 
Specifically we need at one test that makes it so that a newly elected leader 
in the process of running the task loses leadership so that it can't write at 
the end of the task.

http://gerrit.cloudera.org:8080/#/c/6170/11/src/kudu/master/sys_catalog-test.cc
File src/kudu/master/sys_catalog-test.cc:

PS11, Line 425: EXPECT_EQ
why the change to EXPECT?


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-06 Thread Alexey Serbin (Code Review)

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 10:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6170/10/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 465: // If there is an error (e.g., we are not the leader) 
abort this task
> With the additions below, this path is no longer aborting the task.
I updated the message correspondingly.


Line 477: bool error_is_not_leader;
> Now we have two separate async callback codepaths using two separate method
Good point.  I originally started to use the same term comparison  approach 
here, but then I became concerned with races in this case -- since this is not 
the elected-as-a-leader callback.  After modifications on the 
ScopedLeaderSharedLock this should not happen.


Line 482:   string err_msg = "failed to refresh TSK: " + s.ToString() + 
": ";
> I'm still worried about the general chattiness of these INFO logs.  They me
Yep, that's why they are INFOs


Line 487: LOG(INFO) << err_msg << "will try next time";
> This is an unexpected codepath, right?  Should be at least a warning:
Done


Line 494: LOG(FATAL) << "TokenSigner has left with no valid key to 
use";
> This message doesn't include the failure.  Consider:
Done


Line 768: LOG(INFO) << "Did not find CA certificate and key for Kudu IPKI, "
> The way this is worded makes it sound like a failure, consider instead:
I dropped this message and left the message in StoreCertAuthorityInfo


Line 853:   LOG(INFO) << "Successfully stored the newly generated certificate 
authority "
> This information is already indicated by the log message online 768, consid
I decided to drop that at line 768 and keep this one.


Line 971: LOG(INFO) << kCaInitOpDescription << "...";
> This is mostly redundant with the log message on line 768, consider droppin
I don't think this is not redundant -- the message on line 768 triggers only 
once when there is no CA record in the system table yet.  However, this message 
reports on the initialization of CA component, which is a broader thing and 
happens on every elected-as-leader-callback.


Line 3427: LOG(INFO) << "Saved newly generated TSK " << tsk->key_seq_num()
> This is a useful log message, but I think the details about saving and syst
Done


http://gerrit.cloudera.org:8080/#/c/6170/10/src/kudu/master/sys_catalog-test.cc
File src/kudu/master/sys_catalog-test.cc:

Line 422:   const Status& s = master_->catalog_manager()->sys_catalog()->
> nit: by value
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-06 Thread Alexey Serbin (Code Review)

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#11).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
5 files changed, 232 insertions(+), 139 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/11
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-06 Thread Dan Burkert (Code Review)

Dan Burkert has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 10:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6170/10/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 465: // If there is an error (e.g., we are not the leader) 
abort this task
With the additions below, this path is no longer aborting the task.


Line 477: bool error_is_not_leader;
Now we have two separate async callback codepaths using two separate methods to 
determine if leadership has been lost.  Could you use the term comparison 
method here as well?


Line 482:   string err_msg = "failed to refresh TSK: " + s.ToString() + 
": ";
I'm still worried about the general chattiness of these INFO logs.  They 
mention failures, but they are normal and expected, and don't require operator 
action.


Line 487: LOG(INFO) << err_msg << "will try next time";
This is an unexpected codepath, right?  Should be at least a warning:

LOG(WARNING) << "Failed to generate TSK: " << s.ToString();


Line 494: LOG(FATAL) << "TokenSigner has left with no valid key to 
use";
This message doesn't include the failure.  Consider:

LOG(FATAL) << "Failed to generate TSK: " << s.ToString();


Line 768: LOG(INFO) << "Did not find CA certificate and key for Kudu IPKI, "
The way this is worded makes it sound like a failure, consider instead:

"Generating a new IPKI CA certificate and key"


Line 853:   LOG(INFO) << "Successfully stored the newly generated certificate 
authority "
This information is already indicated by the log message online 768, consider 
dropping this or reducing to VLOG


Line 971: LOG(INFO) << kCaInitOpDescription << "...";
This is mostly redundant with the log message on line 768, consider dropping 
this or reducing to VLOG.


Line 3427: LOG(INFO) << "Saved newly generated TSK " << tsk->key_seq_num()
This is a useful log message, but I think the details about saving and system 
table are not helpful.  Consider changing it to:

LOG(INFO) << "Generated new TSK " << tsk->key_seq_num();


http://gerrit.cloudera.org:8080/#/c/6170/10/src/kudu/master/sys_catalog-test.cc
File src/kudu/master/sys_catalog-test.cc:

Line 422:   const Status& s = master_->catalog_manager()->sys_catalog()->
nit: by value


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-06 Thread Alexey Serbin (Code Review)

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 10:

(1 comment)

OK, it seems I need to add some test coverage for the new scenarios.

http://gerrit.cloudera.org:8080/#/c/6170/10/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

PS10, Line 627:   const Status s = SyncWrite(&req, &resp);
  :   if (!s.ok() && error_is_not_leader != nullptr) {
  : *error_is_not_leader =
  : (resp.has_error() &&
  :  resp.error().code() == 
TabletServerErrorPB::NOT_THE_LEADER);
  :   }
> is this tested somewhere? I don't think we actually set this error code in 
You are right -- I haven't added any coverage for these scenarios.  Was relying 
on the existing coverage.


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-06 Thread David Ribeiro Alves (Code Review)

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6170/10//COMMIT_MSG
Commit Message:

PS10, Line 11:  
nit extra space


PS10, Line 14:  
same


http://gerrit.cloudera.org:8080/#/c/6170/10/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

PS10, Line 627:   const Status s = SyncWrite(&req, &resp);
  :   if (!s.ok() && error_is_not_leader != nullptr) {
  : *error_is_not_leader =
  : (resp.has_error() &&
  :  resp.error().code() == 
TabletServerErrorPB::NOT_THE_LEADER);
  :   }
is this tested somewhere? I don't think we actually set this error code in the 
write path, this makes me nervous cuz it seems like this whole patch is lacking 
coverage. Am I missing something?


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/9/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS9, Line 546: PrepareForMasterLeadershipTask
> You mean PrepareForLeadershipTask().
Yes, exactly -- thanks!


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#10).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures which happen on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If an operation against the system catalog fails in between terms of
the catalog leadership, the error is considered non-fatal.  In case of
a non-fatal error the leader post-election task bails out: the catalog
is no longer the leader at the original term and the task should be
executed by the new leader upon ElectedAsLeaderCb.

If an operation against the system catalog fails at the same term of the
catalog leadership, the error is considered fatal and that causes the
master process to crash.  This is to avoid possible inconsistency when
working with the tables/tablets metadata, the IPKI certificate authority
information and TSKs (Token Signing Keys).

Any failure of a read or write operation against the system catalog
happened during the catalog's shutdown is ignored and the leader
post-election task bails out once detecting such failure.

The same policy applies to other (i.e. not specific to read and write
operations against the system catalog) errors which might happen while
working with the IPKI certificate authority information and TokenSigner.
The rationale is the same as for handling the system catalog operation
failures: in case of an error, the leader has no consistent information
to work with, meanwhile a non-leader does not use the information
affected by the failure at all and can safely ignore the error.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
6 files changed, 235 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/10
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-03 Thread Adar Dembo (Code Review)

Adar Dembo has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 9: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/9/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS9, Line 546: PrepareForMasterLeadershipTask
You mean PrepareForLeadershipTask().


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 9: Verified+1

Flake: org.apache.kudu.spark.kudu.DefaultSourceTest.Test SparkSQL 
StringStartsWith filters

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6170/8//COMMIT_MSG
Commit Message:

PS8, Line 10: d/write operation failures happened while executing the leader
: post-election callback
> sentence fragment
Done


PS8, Line 14: /
> Make it explicit whether this means 'and' or 'or'.
Done


http://gerrit.cloudera.org:8080/#/c/6170/8/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 463:   const Status& s = 
catalog_manager_->ProcessPendingAssignments(to_process);
> Would you mind making this an owned Status here and below?  I find the by-v
Done


PS8, Line 763: found
> is found
Done


PS8, Line 770:  
> the master
Done


PS8, Line 774: already
 : // contain CA certificate information written by the new 
leader master
> contain the CA certificate information when this master is re-elected next.
Done


Line 790:   return s;
> I'm guessing this is an unnecessary copy because the status is kept as a re
That's a good point.


PS8, Line 911: const Consensus& consensus
> this could just be captured.
If you don't mind, I would prefer to leave it as it is for better modularity.


Line 914:   const int64_t term_pre =
> Why not use 'term'?
Because there is pairing 'term_post'.  I updated this code so this is no longer 
an issue.


Line 938: LOG(WARNING) << op_description << " failed; "
> This is logged at INFO level on line 899, why WARNING here?
It's in the context of the 'if (!s.ok())' path -- some operation has failed.  
But since you noticed this, I think it's better to change it to INFO as well 
because the failed operation means we are not using that anyway and some other 
master server picked this up.


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#9).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read and write operation failures happened on leader post-election
callback.  There are two categories of errors: fatal and non-fatal.

If system catalog operation fails in between terms of the system catalog
leadership, the error is considered non-fatal.  In case of a non-fatal
error the leader post-election task bails out: the catalog is no longer
the leader at the original term and the task should be executed by the
new leader upon ElectedAsLeaderCb.

If system catalog operation failure happened at the same term of the
system catalog leadership, the error is considered fatal and that causes
the master process to crash.  This is to avoid possible inconsistency
when working with the tables/tablets metadata, the IPKI certificate
authority information and TSKs (Token Signing Keys).

All read and write system catalog operation failures happened during
the catalog's shutdown are ignored and the leader post-election task
bails out when detecting that.

The same policy applies to other (i.e. not specific to the system
catalog) errors which might happen while working with the IPKI
certificate authority information and TokenSigner.  The reason is
the same as with the system catalog operation failures: in case of an
error, the leader has no consistent information to work with, meanwhile
a non-leader does not use that information at all and can safely ignore
the error.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
6 files changed, 235 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures

2017-03-03 Thread Dan Burkert (Code Review)

Dan Burkert has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6170/8//COMMIT_MSG
Commit Message:

PS8, Line 10: d/write operation failures happened while executing the leader
: post-election callback
sentence fragment


PS8, Line 14: /
Make it explicit whether this means 'and' or 'or'.


http://gerrit.cloudera.org:8080/#/c/6170/8/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 463:   const Status& s = 
catalog_manager_->ProcessPendingAssignments(to_process);
Would you mind making this an owned Status here and below?  I find the by-value 
return stored as reference pattern to be very unintuitive, and it doesn't match 
what's actually happening.


PS8, Line 763: found
is found


PS8, Line 770:  
the master


PS8, Line 774: already
 : // contain CA certificate information written by the new 
leader master
contain the CA certificate information when this master is re-elected next.


Line 790:   return s;
I'm guessing this is an unnecessary copy because the status is kept as a 
reference.  If it weren't a reference this is pretty much guaranteed NRVO.


PS8, Line 911: const Consensus& consensus
this could just be captured.


Line 914:   const int64_t term_pre =
Why not use 'term'?


Line 938: LOG(WARNING) << op_description << " failed; "
This is logged at INFO level on line 899, why WARNING here?


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes

[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
..


Patch Set 8: Verified+1

Looks like a couple of flakes:
  MultiThreadedTabletTest/3.DeleteAndReinsert
  org.apache.kudu.spark.kudu.DefaultSourceTest.Test SparkSQL StringStartsWith 
filters

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No

[kudu-CR] [catalog manager] categorization of rw operation failures

Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#8).

Change subject: [catalog_manager] categorization of rw operation failures
..

[catalog_manager] categorization of rw operation failures

This changelist introduces the categorization of the system catalog's
read/write operation failures happened while executing the leader
post-election callback.  There are two categories of errors: fatal and
non-fatal.

If a read/write operation fails in between terms of the system catalog
leadership, the error is considered non-fatal.  In case of a non-fatal
error the leader post-election task bails out: the catalog is no longer
the leader at the original term and the task should be executed by the
new leader upon the ElectedAsLeaderCb.

If a read/write operation failure happened at the same term of the
system catalog leadership, the error is considered fatal and that causes
the master process to crash.  This is to avoid possible inconsistency
when loading into memory and persisting the tables/tablets metadata,
IPKI certificate authority information and TSKs (Token Signing Keys).

All read/write operation failures happened during the system catalog's
shutdown are ignored and the leader post-election task bails out.

The same policy applies to other errors which might happen while
working with IPKI certificate authority information and TokenSigner.
This is because such errors are fatal if they happened to the currently
active system catalog leader, otherwise they can be ignored because
the related information is not used by non-leader catalog manager.
Once the catalog manager becomes a leader again, it will refresh that
information before using it.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
5 files changed, 204 insertions(+), 125 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon

[kudu-CR] [catalog manager] categorization of rw operation failures