[
https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179801#comment-15179801
]
Mike Percy commented on KUDU-1365:
----------------------------------
This needs a bit of careful thought, but we have some potential options for
addressing the problem of a bad node attempting to become leader.
One is to implement a candidacy pre-check, so the bad node never starts an
election in the first place. This is something we had intended to implement
anyway; a rough sketch of the idea follows the list below.
Other potential options:
* add special heartbeats, routed to a dedicated queue, for when we can't write
* allow partial processing of items in the queue to check for leader liveness
* separate heartbeats completely from UpdateConsensus
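As a rough sketch of the candidacy pre-check idea (this is only an
illustration of the general "pre-vote" technique, not a design for Kudu; all
class and method names below are hypothetical): before incrementing its term,
a would-be candidate asks its peers whether they would grant it a vote, and a
peer that still hears from a live leader refuses, so a node that has merely
lost contact with the leader because of its own problems never disrupts the
current term.
{code}
// Sketch of a candidacy pre-check ("pre-vote"); all names are hypothetical
// and this is not Kudu's consensus implementation.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PreVoteRequest {
  int64_t candidate_term;     // term the candidate would campaign in
  int64_t candidate_last_op;  // index of the candidate's last log entry
};

class PeerState {
 public:
  // Grant a pre-vote only if this peer has itself lost contact with the
  // leader and the candidate's log is at least as current as its own.
  // A healthy peer that still hears heartbeats refuses, so the bad node
  // never gets to increment the term or trigger a real election.
  bool GrantPreVote(const PreVoteRequest& req, Clock::time_point now) const {
    const bool leader_seems_dead =
        (now - last_heard_from_leader_) > election_timeout_;
    const bool log_ok = req.candidate_last_op >= last_op_index_;
    return leader_seems_dead && log_ok;
  }

 private:
  Clock::time_point last_heard_from_leader_ = Clock::now();
  std::chrono::milliseconds election_timeout_{1500};
  int64_t last_op_index_ = 0;
};

// The candidate starts a real, term-incrementing election only if a
// majority of the cluster (counting itself) granted the pre-vote.
bool ShouldStartRealElection(const std::vector<bool>& grants,
                             std::size_t cluster_size) {
  std::size_t yes = 1;  // the candidate counts itself
  for (bool granted : grants) {
    if (granted) ++yes;
  }
  return yes > cluster_size / 2;
}
{code}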
> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
> Key: KUDU-1365
> URL: https://issues.apache.org/jira/browse/KUDU-1365
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.7.0
> Reporter: Mike Percy
> Assignee: Mike Percy
>
> There is an issue that [~decster] ran into in production where a machine had
> a bad disk. Unfortunately, the disk was not failing writes outright; it was
> just very slow. This resulted in the UpdateConsensus RPC queue filling up on
> that machine when it was a follower, which looked like this in the log (from
> the leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T
> 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer
> 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer
> 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032
> Status: Remote error: Service unavailable: UpdateConsensus request on
> kudu.consensus.ConsensusService dropped due to backpressure. The service
> queue is full; it has 50 items.. Retrying in the next heartbeat period.
> Already tried 25 times.
> {code}
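> As a minimal sketch of what is happening above, assuming a single bounded
> service queue shared by all UpdateConsensus calls (the 50-item capacity
> matches the log message; the class and method names below are hypothetical
> and are not Kudu's actual RPC code): once slow log writes stall the service
> threads, the queue fills and even pure heartbeats are rejected with
> backpressure.
> {code}
> // Heartbeats and data-carrying consensus updates share one bounded inbound
> // queue. With a slow disk, each queued call blocks on the log write, so the
> // queue drains far slower than the leader fills it.
> #include <cstddef>
> #include <deque>
> #include <string>
>
> enum class AddStatus { kOk, kRejectedQueueFull };
>
> class BoundedServiceQueue {
>  public:
>   explicit BoundedServiceQueue(std::size_t capacity) : capacity_(capacity) {}
>
>   // Called for every inbound RPC. Heartbeats get no special treatment: if
>   // the queue is full they are dropped too ("dropped due to backpressure"),
>   // and the follower eventually concludes the leader is dead.
>   AddStatus Enqueue(const std::string& rpc_name) {
>     if (queue_.size() >= capacity_) {
>       return AddStatus::kRejectedQueueFull;
>     }
>     queue_.push_back(rpc_name);
>     return AddStatus::kOk;
>   }
>
>   // Drained by service threads between (slow) disk writes.
>   bool Dequeue(std::string* rpc_name) {
>     if (queue_.empty()) return false;
>     *rpc_name = queue_.front();
>     queue_.pop_front();
>     return true;
>   }
>
>  private:
>   const std::size_t capacity_;  // e.g. 50, as in the log above
>   std::deque<std::string> queue_;
> };
> {code}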
> The result is that the follower could not receive heartbeat messages from the
> leader anymore. The follower (st128) would decide that the leader was dead
> and start an election. Because it had the same amount of data as the rest of
> the cluster, it won the election. Then, for reasons we still need to
> investigate, it would not heartbeat to its own followers. After some timeout,
> a different node (typically the previous leader) would start an election, get
> elected, and the flapping process would continue.
> It's possible that the bad node, after winning the election, only partially
> transitioned to leadership and was blocking on some disk operation before it
> could start heartbeating. Hopefully we can get logs from the bad node so we
> can better understand what was happening from its perspective.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)