Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/6134

to look at the new patch set (#8).

Change subject: [catalog manager] fixed deadlock on catalog shutdown
......................................................................

[catalog manager] fixed deadlock on catalog shutdown

Fixed a deadlock on system catalog manager shutdown in a multi-master
Kudu cluster. Prior to the fix, the leader master often hung in its
'elected-as-a-leader' callback while trying to write into the system
table. It was waiting for the system table operations to complete, but
those were retried indefinitely since the system catalog table's Raft
quorum was not available (the other masters were shut down).

Prior to the fix, the deadlock happened pretty often while running the
master_MasterReplicationTest.TestCycleThroughAllMasters scenario in
master_replication-itest (DEBUG build). The bug also manifested itself
in other tests where a multi-master Kudu mini-cluster is used. After
the fix, the success rate became 1024 of 1024.

The mechanics behind the deadlock are as follows:
  * The majority of the system table's peers go down (e.g. all
    non-leader masters shut down).
  * The ElectedAsLeaderCb task issues an operation to the system table
    (e.g. write newly generated TSK).
  * The code below calls Shutdown() on the leader election pool. That
    call does not return because the underlying Raft implementation
    retries indefinitely to get responses for the submitted operations.

The problem manifested itself in the following way: after outputting
something like:

  I0224 18:25:16.760793 1964126208 raft_consensus.cc:1569] T 00000000000000000000000000000000 P bd5cf976e19f4843b81cd02f14c6c87a [term 1 FOLLOWER]: Raft consensus shutting down.
  I0224 18:25:16.760815 1964126208 raft_consensus.cc:1585] T 00000000000000000000000000000000 P bd5cf976e19f4843b81cd02f14c6c87a [term 1 FOLLOWER]: Raft consensus is shut down!
  I0224 18:25:16.773479 1964126208 master.cc:214] Master@127.0.0.1:11011 shutdown complete.
  I0224 18:25:16.774673 1964126208 master.cc:210] Master@127.0.0.1:11012 shutting down...

the test continued to run indefinitely, spitting messages like:

  W0224 18:25:21.246805 62234624 consensus_peers.cc:357] T 00000000000000000000000000000000 P 51eb32e67c014327b965ae3e6f4993e1 -> Peer 14cb97657cb4407fab1ce3e097d7a71b (127.0.0.1:11010): Couldn't send request to peer 14cb97657cb4407fab1ce3e097d7a71b for tablet 00000000000000000000000000000000. Status: Network error: Client connection negotiation failed: client connection to 127.0.0.1:11010: connect: Connection refused (error 61). Retrying in the next heartbeat period. Already tried 14 times.

Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583
---
M src/kudu/master/catalog_manager.cc
1 file changed, 34 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/6134/8
--
To view, visit http://gerrit.cloudera.org:8080/6134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
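
For readers who want to see the deadlock mechanics from the commit message in isolation, here is a minimal standalone sketch in plain C++. It does not use any Kudu code: FakeConsensus and ShutdownCatalog are hypothetical names invented for this illustration, and the remedy it shows (aborting pending consensus operations before waiting on the leader task) is an assumption about one plausible way to break the cycle, not a description of what the patch actually does. The bounded polling loop stands in for the pool's Shutdown() call only so that the demo terminates in the hung case as well.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the system table's consensus layer.
// The quorum is never available here, mimicking a leader master
// whose peers have all been shut down.
class FakeConsensus {
 public:
  // Blocks until the write can be replicated or consensus is shut down.
  // Returns true if replicated, false if aborted by Shutdown().
  bool Write() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return quorum_available_ || shut_down_; });
    return quorum_available_ && !shut_down_;
  }

  // Aborts all pending and future writes.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> l(mu_);
      shut_down_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool quorum_available_ = false;  // majority of peers are down
  bool shut_down_ = false;
};

// Simulates catalog shutdown. Returns true if the leader task finished
// while the (simulated) pool shutdown was waiting for it, false if the
// task was still stuck in Write() when the bounded wait expired.
bool ShutdownCatalog(bool abort_consensus_first) {
  FakeConsensus consensus;
  std::atomic<bool> task_done{false};

  // Stand-in for the elected-as-leader task writing to the system table.
  std::thread leader_task([&] {
    consensus.Write();
    task_done = true;
  });

  if (abort_consensus_first) {
    consensus.Shutdown();  // assumed remedy: abort pending operations first
  }

  // Stand-in for Shutdown() on the leader election pool. A real pool
  // would wait without a bound; this loop gives up after about 500 ms.
  for (int i = 0; i < 50 && !task_done; ++i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  const bool finished = task_done;

  consensus.Shutdown();  // release the task so the demo itself can exit
  leader_task.join();
  return finished;
}
```

Calling ShutdownCatalog(false) reproduces the hang in miniature: the task never leaves Write(), so the wait runs to its limit. Calling ShutdownCatalog(true) lets the task observe the shutdown and return, so the wait completes.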