[kudu-CR] [catalog manager] categorization of rw operation failures

Alexey Serbin (Code Review) Thu, 09 Mar 2017 00:37:57 -0800

Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................



Patch Set 24:

> > > Given that we're still chasing strange test failures on this, and
 > > > it's on a pretty important and tricky code path, maybe we should
 > > > chat about whether it's really necessary for 1.3.0? i.e are the
 > > > downside risks of not having it included in the release worse than
 > > > the downside risks of potential bugs? I haven't followed it closely
 > > > but as the 1.3 RM I'm feeling nervous about complex patches coming
 > > > in very close to the first rc being cut (hoping to do that
 > > > tomorrow)
 > >
 > > Yes, that's a very good point.  However, I think I understand what
 > > the issue is.  The issue is that upon master leadership change the
 > > new leader sometimes does not see the last successful write from
 > > the former leader.  That bug can affect table/tablet metadata as
 > > well.  I.e., the newly created tablet could be overlooked at
 > > leadership change, and it will be seen only on the next call of
 > > ElectedAsLeaderCb.
 > >
 > > E.g., take a look at the log from
 > > https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip
 > >
 > > This is what I think happened there:
 > >
 > > 1. The master at 127.0.0.1:11032 generated and successfully wrote
 > > TSK with id 0 (I0309 04:05:44.679442   558 catalog_manager.cc:3432]
 > > Generated new TSK 0).  Later on, re-election happened and it was
 > > elected as a leader again, and it generated TSK with id 1 but since
 > > there is injected latency prior to writing the key into the table,
 > > it failed to write it into the system table due to leadership term
 > > change.
 > >
 > > 2. Some other leader started its leadership duties but failed to
 > > catch up as a leader.
 > >
 > > 3. Our former (first) master server became the leader again and it
 > > generated and successfully wrote TSK with id 2 (I0309
 > > 04:05:46.812899   668 catalog_manager.cc:3432] Generated new TSK 2).
 > >
 > > 4. Right after that, leadership changed and another master server
 > > ran its ElectedAsLeaderCb, but it did not see the latest TSK record
 > > in the system table.  Seeing just the record with TSK id 0, it
 > > generated and successfully wrote its new TSK with id 1 (I0309
 > > 04:05:47.270205   378 catalog_manager.cc:3432] Generated new TSK 1).
 > >
 > > 5. Now, the client has connected to the current master leader, which
 > > has just generated a TSK and made it current (TSK rotation period is
 > > 2 seconds).  It got an authn token signed by the TSK with id 1.
 > >
 > > 6. The client tries to execute a write operation against the tablet
 > > server which has received TSKs with id 0 and 2.  The tablet server
 > > cannot see the TSK with id 1 because the new master does not send
 > > it in its response, since the tablet server reports 2 as the latest
 > > TSK id.
 > >
 > > 7. The tablet server responded with 'Runtime error:
 > > ERROR_UNAVAILABLE: Not authorized: authentication token signed with
 > > unknown key' while the client tried to negotiate the connection.
 > 
 > Instead of the link to the artifacts it's possible to use
 > http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361
 > and retrieve the artifacts of the very first failure in the list.
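
To make step 6 of the scenario quoted above concrete, here is a minimal
sketch of the key exchange.  It assumes, as described in that step, that the
master sends only the keys whose ids are greater than the latest id the tablet
server reports.  KeysToSend and CanVerifyAfterExchange are hypothetical names
used for illustration only; they are not Kudu functions.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Simplified model (assumption): the master returns only the TSK ids that are
// strictly greater than the latest id reported by the tablet server.
std::vector<int64_t> KeysToSend(const std::set<int64_t>& master_keys,
                                int64_t latest_known_by_tserver) {
  std::vector<int64_t> out;
  for (auto it = master_keys.upper_bound(latest_known_by_tserver);
       it != master_keys.end(); ++it) {
    out.push_back(*it);
  }
  return out;
}

// Runs one exchange with the master and reports whether the tablet server
// then knows the key that signed the client's token.
bool CanVerifyAfterExchange(std::set<int64_t> tserver_keys,
                            const std::set<int64_t>& master_keys,
                            int64_t token_key_id) {
  // -1 stands for "no keys known yet" in this sketch.
  const int64_t latest =
      tserver_keys.empty() ? -1 : *tserver_keys.rbegin();
  for (int64_t id : KeysToSend(master_keys, latest)) {
    tserver_keys.insert(id);
  }
  return tserver_keys.count(token_key_id) > 0;
}
```

With the tablet server holding TSKs {0, 2} and the new master holding {0, 1},
a token signed by TSK 1 cannot be verified after the exchange, because id 1 is
not greater than the reported latest id 2.  A tablet server holding only {0}
would receive TSK 1 and verify the token.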

David and I looked at this a little bit more and David suspects that we could 
have the following problem:

1. While the master is still the leader, the bg task starts doing its job.
2. The leadership changes twice while the bg task is running: the master 
loses its leadership and then regains it, but the bg task holds that lock 
(in rd mode) the whole time.
3. Since ElectedAsLeaderCb cannot run (it tries to acquire the lock in wr 
mode), the bg task might reach the point where it's ready to write into the 
system table, and that write would be successful.  Suppose it was refreshing 
the TSK to set it to (T), but the current TSK id is already (T + 1) because 
the in-the-middle leader failed to write the (T) TSK: there are (T - 1) and 
(T + 1) TSKs in the table, but there is no (T) TSK yet.  In that case, the bg 
task would write a new TSK with a stale id into the system table and there 
would be no error, since the master is the leader again.
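
The suspected interleaving can be sketched as a small model.  Everything here 
is hypothetical and heavily simplified: Master, PendingTsk, StartRefresh, 
WriteUnguarded and WriteGuarded are illustration-only names, not Kudu code, 
and the term comparison in WriteGuarded is just one possible guard, not 
necessarily what this patch does or should do.

```cpp
#include <cstdint>
#include <set>

// Toy model of a master: its current term, whether it is the leader, and the
// ids of the TSKs stored in the (replicated) system table.
struct Master {
  int64_t term = 1;
  bool is_leader = true;
  std::set<int64_t> tsk_table;
};

// A TSK the bg task has decided to write, plus the term it started in.
struct PendingTsk {
  int64_t id;
  int64_t term_at_start;
};

// The bg task picks the next id from what the table contains when it starts.
PendingTsk StartRefresh(const Master& m) {
  const int64_t next = m.tsk_table.empty() ? 0 : *m.tsk_table.rbegin() + 1;
  return {next, m.term};
}

// Checks only that the master is the leader at write time.
bool WriteUnguarded(Master* m, const PendingTsk& p) {
  if (!m->is_leader) return false;
  m->tsk_table.insert(p.id);
  return true;
}

// Additionally rejects the write if the term changed since the task started.
bool WriteGuarded(Master* m, const PendingTsk& p) {
  if (!m->is_leader || m->term != p.term_at_start) return false;
  m->tsk_table.insert(p.id);
  return true;
}
```

For example, start with the table holding {0} so the bg task prepares id 1 
(T).  Then the master loses leadership, (T + 1) = 2 lands in the table, and 
the master becomes the leader again in a later term.  WriteUnguarded accepts 
the stale id 1 without any error, while WriteGuarded rejects it because the 
term differs from the one the task started in.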

So, this patch requires some clarification in that regard.  I think we should 
proceed as Todd suggested and cut the RC without this patch.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <danburk...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: No
