David Ribeiro Alves has posted comments on this change. Change subject: [catalog manager] fixed deadlock on catalog shutdown ......................................................................
Patch Set 4: (16 comments) http://gerrit.cloudera.org:8080/#/c/6134/4//COMMIT_MSG Commit Message: PS4, Line 7: [catalog manager] fixed deadlock on catalog shutdown Not so important now, but when I meant trace I meant the stack trace showing deadlock PS4, Line 10: remove extra space PS4, Line 13: system the system PS4, Line 22: like missing colon after like PS4, Line 35: spitting messages like missing colon after like http://gerrit.cloudera.org:8080/#/c/6134/4/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS4, Line 722: scoped_refptr<Consensus> consensus = : sys_catalog_->tablet_peer()->shared_consensus(); : if (!consensus || !consensus->IsRunning()) { : return Status::IllegalState("no active consensus"); : } > The WaitUntilCaughtUpAsLeader method is called only by the 'elected-as-a-le so how can consensus be null here? PS4, Line 807: This could be just a raw pointer since the tablet peer shutdown happens : // after this task finishes. so why isn't it? :) PS4, Line 811: consensus->IsRunning() do you care that it is running? can it be "not running"? (consensus->ConsensusState doesn't depend on whether its running) PS4, Line 988: scoped_refptr<Consensus> consensus = : sys_catalog_->tablet_peer()->shared_consensus(); : if (!consensus || !consensus->IsRunning()) { : return RaftPeerPB::UNKNOWN_ROLE; : } : return consensus->role(); could you replace this and other checks (and avoid reaching into the consensus locks twice, once in IsRunning() and another in role()) by having a test on the state of the catalog_manager itself? If the catalog manager is kRunning, then consensus must be alive, right? If it is alive then it doesn't matter if its running, at least in this particular case. PS4, Line 1022: AbortAndWaitForAllTasks(copy); After this statement you can short-circuit if the there is no sync_catalog_. The only instance when that might happen is when tests issue a shutdown before initialization finished meaning there will be no consensus or leader_election_pool_ tasks. This can simplify a bit by avoiding the extra ifs. Also if there is a sync_catalog_ then you don't need to ref count consensus (sync_catalog has a counted ref to tablet peer which has a counted ref to consensus, this method is idempotent) PS4, Line 1024: consensus nit "consensus configuration" or "raft config" (it's not ideal but it's what we've been calling it) PS4, Line 1026: consensus same PS4, Line 1030: Raft quorum prefer consensus configuration or raft configuration PS4, Line 1033: tablets tablet PS4, Line 1040: if (consensus && consensus->IsRunning()) { as I mentioned above this if unecessary, if we got here we have a consensus and ShutDown() is idempotent PS4, Line 1045: activity tasks -- To view, visit http://gerrit.cloudera.org:8080/6134 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org> Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-HasComments: Yes