Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................

Patch Set 24:

> > > Given that we're still chasing strange test failures on this, and
> > > it's on a pretty important and tricky code path, maybe we should
> > > chat about whether it's really necessary for 1.3.0? i.e are the
> > > downside risks of not having it included in the release worse than
> > > the downside risks of potential bugs? I haven't followed it closely
> > > but as the 1.3 RM I'm feeling nervous about complex patches coming
> > > in very close to the first rc being cut (hoping to do that
> > > tomorrow)
> >
> > Yes, that's a very good point. However, I think I understand what
> > is the issue. The issue is that upon master leadership change the
> > new leader sometimes does not see the last successful write from
> > the former leader. That bug can affect table/tablet metadata as
> > well. I.e., the newly created tablet could be overlooked at
> > leadership change, and it will be seen only on the next call of
> > ElectedLeaderCb.
> >
> > E.g., it's possible to take a look at log from
> > https://kudu-test-results.s3.amazonaws.com/aserbin.1489032335.25361.f2e8aa26b74185a6bd16d5d554488e6f1af190f5.13.0-artifacts.zip
> >
> > This is what I think happened there:
> >
> > 1. The master at 127.0.0.1:11032 generated and successfully wrote
> > TSK with id 0 (I0309 04:05:44.679442 558 catalog_manager.cc:3432]
> > Generated new TSK 0). Later on, re-election happened and it was
> > elected as a leader again, and it generated TSK with id 1 but since
> > there is injected latency prior to writing the key into the table,
> > it failed to write it into the system table due to leadership term
> > change.
> >
> > 2. Some other leader started its leadership duties but failed to
> > caught up as a leader.
> >
> > 3. Our former (first) master server became the leader again and it
> > generated and successfully written TSK with id 2 (I0309
> > 04:05:46.812899 668 catalog_manager.cc:3432] Generated new TSK 2)
> >
> > 4. Right after that leadership changed and other master server ran
> > its ElectedAsLeaderCb and it did not see the latest TSK record in
> > the system table. Seeing just the record with TSK id 0, it
> > generated and successfully written its new TSK with id 1 (I0309
> > 04:05:47.270205 378 catalog_manager.cc:3432] Generated new TSK 1)
> >
> > 5. Now, the client has connected to the current master leader which
> > has just generated TSK and made it current (TSK rotation period is
> > 2 seconds). It got authn token signed by TSK with id 1.
> >
> > 6. The client tries to execute write operation against the tablet
> > server which has received TSKs with id 0 and 2. The tablet server
> > cannot see the TSK with id 1 because the new master does not send
> > it in response since the tablet server sends 2 as the latest TSK
> > id.
> >
> > 7. The tablet server responded with 'Runtime error:
> > ERROR_UNAVAILABLE: Not authorized: authentication token signed with
> > unknown key' while the client tried to negotiate the connection.
>
> Instead of the link to the artifacts it's possible to use
> http://dist-test.cloudera.org//job?job_id=aserbin.1489032335.25361
> and retrieve the artifacts of the very first failure in the list.

David and I looked at this a little more, and David suspects that we could have the following problem:

1. While the master is still the leader, the bg task starts doing its job.

2. The leadership changes twice while the bg task is running: the master loses its leadership and then gets it back, but the bg task keeps holding that lock (in rd mode) the whole time.

3. Since ElectedAsLeaderCb cannot run (it tries to acquire the lock in wr mode), the bg task might reach the point where it's ready to write into the system table, and that write would be successful.
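
To make the three steps above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not Kudu code: `ToyMaster`, its fields, and `run_scenario` are hypothetical names, the lock itself is not modeled, and the write path is assumed to check only whether the master is the leader at the moment of the write, not whether the term is the one in which the bg task started.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for a master; not Kudu's catalog manager."""
    term: int = 1
    is_leader: bool = True
    sys_table: list = field(default_factory=list)

    def lose_leadership(self):
        self.is_leader = False

    def regain_leadership(self, new_term):
        self.term = new_term
        self.is_leader = True

    def write(self, record):
        # Assumed check: "am I the leader right now?" only. Nothing compares
        # the current term with the term in which the caller began its work.
        if not self.is_leader:
            return False
        self.sys_table.append((self.term, record))
        return True


def run_scenario():
    master = ToyMaster()
    # Step 1: the bg task starts while the master is the leader.
    started_in_term = master.term
    # Step 2: leadership is lost and then regained in a later term.
    master.lose_leadership()
    master.regain_leadership(new_term=3)
    # Step 3: the leader callback has not run yet (it is blocked on the lock,
    # which this toy does not model), and the bg task performs its write.
    accepted = master.write("record prepared in term %d" % started_in_term)
    return started_in_term, master.term, accepted
```

In this model `run_scenario()` reports that a record prepared in one term is accepted in a later one without any error, which is the situation suspected above.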

Suppose the bg task was refreshing the TSK to set it to (T), while the current TSK id is already (T + 1) because the in-the-middle leader failed to write the (T) TSK. The table then contains the (T - 1) and (T + 1) TSKs, but no (T) TSK yet. In that case, the bg task would write a new TSK with a stale id into the system table and there would be no error, since the master is the leader again.

So, this patch requires some clarification in that regard. I think we should proceed as Todd suggested and cut the RC without this patch.

--
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <danburk...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-HasComments: No