Mike Percy created KUDU-1365:
--------------------------------

             Summary: Leader flapping when one machine has a very slow disk
                 Key: KUDU-1365
                 URL: https://issues.apache.org/jira/browse/KUDU-1365
             Project: Kudu
          Issue Type: Bug
          Components: consensus
    Affects Versions: 0.7.0
            Reporter: Mike Percy
            Assignee: Mike Percy


There is an issue that [~decster] ran into in production where a machine had a 
bad disk. Unfortunately, the disk was not failing writes outright; it was just 
very slow. This caused the UpdateConsensus RPC queue to fill up on that machine 
while it was a follower, which looked like this in the log (from the leader's 
perspective):

{code}
W0229 00:07:14.332468 18148 consensus_peers.cc:316] T 
41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer 
7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer 
7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032 
Status: Remote error: Service unavailable: UpdateConsensus request on 
kudu.consensus.ConsensusService dropped due to backpressure. The service queue 
is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 
25 times.
{code}
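
For illustration, here is a minimal sketch (not Kudu's actual RPC code; the 
class and type names are made up) of how a bounded service queue produces 
exactly this kind of rejection: the worker threads are stuck behind the slow 
disk and stop draining the queue, so new UpdateConsensus calls, including 
heartbeats, are dropped with a backpressure error.

{code}
// Minimal sketch of a bounded RPC service queue rejecting calls under
// backpressure. Not Kudu's implementation; names are illustrative only.
#include <cstddef>
#include <cstdio>
#include <deque>
#include <mutex>
#include <string>

// Hypothetical stand-in for an inbound UpdateConsensus call.
struct InboundCall {
  std::string method;
};

class BoundedServiceQueue {
 public:
  explicit BoundedServiceQueue(std::size_t capacity) : capacity_(capacity) {}

  // Returns false (backpressure) when the queue is already full; the caller
  // would translate that into a "Service unavailable" remote error.
  bool Enqueue(InboundCall call) {
    std::lock_guard<std::mutex> l(mu_);
    if (queue_.size() >= capacity_) {
      return false;  // Dropped due to backpressure.
    }
    queue_.push_back(std::move(call));
    return true;
  }

  // Worker threads pop calls from here; if they are all blocked on slow
  // disk I/O, nothing is popped and the queue stays full.
  bool Dequeue(InboundCall* call) {
    std::lock_guard<std::mutex> l(mu_);
    if (queue_.empty()) return false;
    *call = std::move(queue_.front());
    queue_.pop_front();
    return true;
  }

 private:
  const std::size_t capacity_;
  std::mutex mu_;
  std::deque<InboundCall> queue_;
};

int main() {
  BoundedServiceQueue queue(50);  // Matches the "it has 50 items" in the log.
  // Simulate a stalled consumer: nothing is dequeued while heartbeats pour in.
  for (int i = 0; i < 55; ++i) {
    if (!queue.Enqueue({"UpdateConsensus"})) {
      std::printf("call %d dropped due to backpressure; queue is full\n", i);
    }
  }
  return 0;
}
{code}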

The result is that the follower no longer received heartbeat messages from the 
leader. The follower (st128) would decide that the leader was dead and start an 
election. Because its log was as up-to-date as the rest of the cluster's, it 
won the election. Then, for reasons we still need to investigate, it would not 
send heartbeats to its own followers. After some timeout, a different node 
(typically the previous leader) would start an election, get elected, and the 
flapping would continue.
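
For reference, here is a minimal sketch (again not Kudu's actual consensus 
code; names are illustrative) of the follower-side failure detection that 
drives this cycle: once heartbeats stop arriving within the election timeout, 
the follower declares the leader dead and starts an election.

{code}
// Minimal sketch of an election-timeout failure detector on a follower.
// Not Kudu's implementation; names and constants are illustrative only.
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

class FailureDetector {
 public:
  explicit FailureDetector(std::chrono::milliseconds election_timeout)
      : election_timeout_(election_timeout), last_heartbeat_(Clock::now()) {}

  // Called whenever a heartbeat from the leader is successfully processed.
  void Snooze() { last_heartbeat_ = Clock::now(); }

  // Called periodically; returns true when the follower should start an
  // election. In the reported scenario the heartbeats are dropped by the full
  // RPC queue, so Snooze() never runs and this eventually fires.
  bool ShouldStartElection() const {
    return Clock::now() - last_heartbeat_ > election_timeout_;
  }

 private:
  const std::chrono::milliseconds election_timeout_;
  Clock::time_point last_heartbeat_;
};

int main() {
  FailureDetector detector(std::chrono::milliseconds(1500));
  // No heartbeats arrive (they were dropped due to backpressure)...
  std::this_thread::sleep_for(std::chrono::milliseconds(1600));
  if (detector.ShouldStartElection()) {
    std::printf("election timeout expired; starting an election\n");
  }
  return 0;
}
{code}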

It's possible that the bad node, when it became leader, had only partially 
transitioned to leadership and was blocked on some disk operation before it 
started sending heartbeats. Hopefully we can get logs from the bad node so we 
can better understand what was happening from its perspective.
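
To illustrate that hypothesis only (this is speculative, not what we know the 
code does), a new leader that performs a synchronous write on the slow disk 
before its heartbeat loop starts would stay silent long enough for its 
followers' election timeouts to fire:

{code}
// Purely illustrative of the hypothesis above; not Kudu's code.
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical stand-in for a blocking write/fsync on the bad disk.
void AppendLeadershipRecordToSlowDisk() {
  std::this_thread::sleep_for(std::chrono::seconds(5));  // "very slow" disk
}

void BecomeLeader() {
  AppendLeadershipRecordToSlowDisk();    // blocks here on the slow disk...
  std::printf("starting heartbeats\n");  // ...so followers time out first
}

int main() {
  BecomeLeader();
  return 0;
}
{code}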


