[
https://issues.apache.org/jira/browse/KUDU-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179801#comment-15179801
]
Mike Percy commented on KUDU-1365:
----------------------------------
This needs a bit of careful thought, but we have some potential options for
addressing the problem of a bad node attempting to become leader.
One is to implement a candidacy pre-check, so the bad node never starts an
election in the first place. This is something we had intended to implement
anyway; a rough sketch of the idea follows the list below.
Other potential options:
* add special heartbeats, routed to a dedicated queue, for when we can't write
* allow partial processing of items in the queue to check for leader liveness
* separate heartbeats completely from UpdateConsensus
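As a rough sketch of the candidacy pre-check idea (this is only an
illustration of the general "pre-vote" technique, not a design for Kudu; all
class and method names below are hypothetical): before incrementing its term,
a would-be candidate asks its peers whether they would grant it a vote, and a
peer that still hears from a live leader refuses, so a node that has merely
lost contact with the leader because of its own problems never disrupts the
current term.
{code}
// Sketch of a candidacy pre-check ("pre-vote"); all names are hypothetical
// and this is not Kudu's consensus implementation.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

struct PreVoteRequest {
  int64_t candidate_term;     // term the candidate would campaign in
  int64_t candidate_last_op;  // index of the candidate's last log entry
};

class PeerState {
 public:
  // Grant a pre-vote only if this peer has itself lost contact with the
  // leader and the candidate's log is at least as current as its own.
  // A healthy peer that still hears heartbeats refuses, so the bad node
  // never gets to increment the term or trigger a real election.
  bool GrantPreVote(const PreVoteRequest& req, Clock::time_point now) const {
    const bool leader_seems_dead =
        (now - last_heard_from_leader_) > election_timeout_;
    const bool log_ok = req.candidate_last_op >= last_op_index_;
    return leader_seems_dead && log_ok;
  }

 private:
  Clock::time_point last_heard_from_leader_ = Clock::now();
  std::chrono::milliseconds election_timeout_{1500};
  int64_t last_op_index_ = 0;
};

// The candidate starts a real, term-incrementing election only if a
// majority of the cluster (counting itself) granted the pre-vote.
bool ShouldStartRealElection(const std::vector<bool>& grants,
                             std::size_t cluster_size) {
  std::size_t yes = 1;  // the candidate counts itself
  for (bool granted : grants) {
    if (granted) ++yes;
  }
  return yes > cluster_size / 2;
}
{code}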
> Leader flapping when one machine has a very slow disk
> -----------------------------------------------------
>
> Key: KUDU-1365
> URL: https://issues.apache.org/jira/browse/KUDU-1365
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.7.0
> Reporter: Mike Percy
> Assignee: Mike Percy
>
> There is an issue that [~decster] ran into in production where a machine had
> a bad disk. Unfortunately, the disk was not failing writes outright; it was
> just very slow. This resulted in the UpdateConsensus RPC queue filling up on
> that machine when it was a follower, which looked like this in the log (from
> the leader's perspective):
> {code}
> W0229 00:07:14.332468 18148 consensus_peers.cc:316] T
> 41e637c3c2b34e8db36d138c4d37d032 P 6434970484be4d29855f05e1f6aed1b8 -> Peer
> 7f331507718d477f96d60eb1bc573baa (st128:18700): Couldn't send request to peer
> 7f331507718d477f96d60eb1bc573baa for tablet 41e637c3c2b34e8db36d138c4d37d032
> Status: Remote error: Service unavailable: UpdateConsensus request on
> kudu.consensus.ConsensusService dropped due to backpressure. The service
> queue is full; it has 50 items.. Retrying in the next heartbeat period.
> Already tried 25 times.
> {code}
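> As a minimal sketch of what is happening above, assuming a single bounded
> service queue shared by all UpdateConsensus calls (the 50-item capacity
> matches the log message; the class and method names below are hypothetical
> and are not Kudu's actual RPC code): once slow log writes stall the service
> threads, the queue fills and even pure heartbeats are rejected with
> backpressure.
> {code}
> // Heartbeats and data-carrying consensus updates share one bounded inbound
> // queue. With a slow disk, each queued call blocks on the log write, so the
> // queue drains far slower than the leader fills it.
> #include <cstddef>
> #include <deque>
> #include <string>
>
> enum class AddStatus { kOk, kRejectedQueueFull };
>
> class BoundedServiceQueue {
>  public:
>   explicit BoundedServiceQueue(std::size_t capacity) : capacity_(capacity) {}
>
>   // Called for every inbound RPC. Heartbeats get no special treatment: if
>   // the queue is full they are dropped too ("dropped due to backpressure"),
>   // and the follower eventually concludes the leader is dead.
>   AddStatus Enqueue(const std::string& rpc_name) {
>     if (queue_.size() >= capacity_) {
>       return AddStatus::kRejectedQueueFull;
>     }
>     queue_.push_back(rpc_name);
>     return AddStatus::kOk;
>   }
>
>   // Drained by service threads between (slow) disk writes.
>   bool Dequeue(std::string* rpc_name) {
>     if (queue_.empty()) return false;
>     *rpc_name = queue_.front();
>     queue_.pop_front();
>     return true;
>   }
>
>  private:
>   const std::size_t capacity_;  // e.g. 50, as in the log above
>   std::deque<std::string> queue_;
> };
> {code}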
> The result is that the follower could not receive heartbeat messages from the
> leader anymore. The follower (st128) would decide that the leader was dead
> and start an election. Because it had the same amount of data as the rest of
> the cluster, it won the election. Then, for reasons we still need to
> investigate, it would not heartbeat to its own followers. After some timeout,
> a different node (typically the previous leader) would start an election, get
> elected, and the flapping process would continue.
> It's possible that the bad node, after winning the election, only partially
> transitioned to leadership and was blocking on some disk operation before it
> could start heartbeating. Hopefully we can get logs from the bad node so we
> can better understand what was happening from its perspective.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)