Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6134

to look at the new patch set (#7).

Change subject: [catalog manager] fixed deadlock on catalog shutdown
......................................................................

[catalog manager] fixed deadlock on catalog shutdown

Fixed a deadlock on system catalog manager shutdown in a multi-master
Kudu cluster. Prior to the fix, the leader master often hung in its
'elected-as-a-leader' callback while trying to write into the system
table. It was waiting for the system table operations to complete, but
those were retried indefinitely since the system catalog table's Raft
quorum was not available (the other masters had been shut down).

Prior to the fix, the deadlock happened frequently while running
the master_MasterReplicationTest.TestCycleThroughAllMasters scenario in
master_replication-itest (DEBUG build). This bug also manifested itself
in other tests where a multi-master Kudu mini-cluster is used.

The mechanics behind the deadlock are as follows:
  * The rest of the system table's Raft quorum goes down
    (i.e. non-leader masters shut down).
  * The ElectedAsLeaderCb task issues an operation to the system table
    (e.g. to write a newly generated TSK).
  * The code below calls Shutdown() on the leader election pool. That
    call does not return because the underlying Raft implementation
    retries indefinitely to get a response for the submitted operations.
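
The pattern above can be illustrated with a minimal, self-contained
sketch. It is not the code from catalog_manager.cc and does not show the
actual fix: RetryingTask, Cancel(), Join() and ShutdownWithoutQuorum()
are hypothetical names, and a plain std::thread stands in for the leader
election pool. It only shows why joining a worker that retries forever
blocks, and how an explicit cancellation signal lets shutdown complete.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a task that retries an operation until it
// succeeds. None of these names come from the Kudu code base.
class RetryingTask {
 public:
  // Starts a worker that keeps retrying until the quorum becomes
  // available or cancellation is requested.
  explicit RetryingTask(const std::atomic<bool>& quorum_available)
      : worker_([this, &quorum_available] {
          while (!quorum_available.load() && !cancelled_.load()) {
            ++attempts_;
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
          }
        }) {}

  ~RetryingTask() {
    Cancel();
    Join();
  }

  // Tells the retry loop to give up.
  void Cancel() { cancelled_.store(true); }

  // Analogous to Shutdown() on a pool: without a prior Cancel(), this
  // blocks for as long as the quorum stays unavailable.
  void Join() {
    if (worker_.joinable()) worker_.join();
  }

  int attempts() const { return attempts_.load(); }

 private:
  // Declared before worker_ so they are initialized before the thread runs.
  std::atomic<bool> cancelled_{false};
  std::atomic<int> attempts_{0};
  std::thread worker_;
};

// Simulates shutting down while the quorum never comes back.
// Returns the number of retry attempts made before shutdown completed.
int ShutdownWithoutQuorum() {
  std::atomic<bool> quorum_available{false};
  RetryingTask task(quorum_available);
  // Wait until the worker has retried at least once.
  while (task.attempts() < 1) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  task.Cancel();  // Without this call, Join() below would never return.
  task.Join();
  assert(!quorum_available.load());  // The quorum was never restored.
  return task.attempts();
}
```

Calling ShutdownWithoutQuorum() returns promptly only because Cancel()
precedes Join(); dropping the Cancel() call reproduces the hang in
miniature.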

The problem manifested itself in the following way: after outputting
something like:

I0224 18:25:16.760793 1964126208 raft_consensus.cc:1569] T
00000000000000000000000000000000 P bd5cf976e19f4843b81cd02f14c6c87a
[term 1 FOLLOWER]: Raft consensus shutting down.
I0224 18:25:16.760815 1964126208 raft_consensus.cc:1585] T
00000000000000000000000000000000 P bd5cf976e19f4843b81cd02f14c6c87a
[term 1 FOLLOWER]: Raft consensus is shut down!
I0224 18:25:16.773479 1964126208 master.cc:214] Master@127.0.0.1:11011
shutdown complete.
I0224 18:25:16.774673 1964126208 master.cc:210] Master@127.0.0.1:11012
shutting down...

the test continued to run indefinitely, spitting messages like:

W0224 18:25:21.246805 62234624 consensus_peers.cc:357] T
00000000000000000000000000000000 P 51eb32e67c014327b965ae3e6f4993e1 ->
Peer 14cb97657cb4407fab1ce3e097d7a71b (127.0.0.1:11010): Couldn't send
request to peer 14cb97657cb4407fab1ce3e097d7a71b for tablet
00000000000000000000000000000000. Status: Network error: Client
connection negotiation failed: client connection to 127.0.0.1:11010:
connect: Connection refused (error 61). Retrying in the next heartbeat
period. Already tried 14 times.

Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583
---
M src/kudu/master/catalog_manager.cc
1 file changed, 34 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/6134/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I10ad66fe33d4696adf2a02a09e2790afa8869583
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dral...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
