Todd Lipcon created KUDU-1516:
---------------------------------
Summary: ksck should check for more raft-related status issues
Key: KUDU-1516
URL: https://issues.apache.org/jira/browse/KUDU-1516
Project: Kudu
Issue Type: Improvement
Components: consensus, ksck, supportability
Affects Versions: 0.9.1
Reporter: Todd Lipcon
Priority: Critical
We currently have a test cluster where one or more tablets became
under-replicated (1 replica remaining out of 3) and were not able to
re-replicate in time. 'ksck' still reports that the table is healthy, though,
and only reports two down tablet servers. There seems to be a lot of room for
improvement:
- for each tablet, check that at least a majority of its replicas are on live
tablet servers, and that those tablet servers consider the replica to be in
the RUNNING state
- some basic tablet "health checks", like asking followers whether they have
recently heard successfully from the leader?
- perhaps a canary request pushed to each tablet? (e.g. an empty write or no_op)
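
As a rough illustration of the first item, the majority check could look
something like the sketch below. This is not ksck's actual code: the Replica
type, the function names, and the shape of the inputs are all hypothetical,
and it assumes the tool has already collected the set of live tablet servers
and the replica state each one reports.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Replica:
    """Hypothetical record of one replica as seen by the checking tool."""
    ts_uuid: str  # tablet server hosting this replica
    state: str    # state reported by that server, e.g. "RUNNING"


def healthy_replica_count(replicas: List[Replica], live_servers: Set[str]) -> int:
    """Count replicas that are on a live server and reported as RUNNING."""
    return sum(
        1 for r in replicas
        if r.ts_uuid in live_servers and r.state == "RUNNING"
    )


def has_healthy_majority(replicas: List[Replica], live_servers: Set[str]) -> bool:
    """True if a strict majority of the configured replicas are healthy."""
    majority = len(replicas) // 2 + 1
    return healthy_replica_count(replicas, live_servers) >= majority


def find_unhealthy_tablets(
    tablets: Dict[str, List[Replica]], live_servers: Set[str]
) -> List[str]:
    """Return the sorted ids of tablets lacking a healthy majority."""
    return sorted(
        tablet_id
        for tablet_id, replicas in tablets.items()
        if not has_healthy_majority(replicas, live_servers)
    )
```

With 3 configured replicas and only 1 on a live server, as in the cluster
described above, has_healthy_majority() returns False, so the tablet would be
flagged instead of the table being reported as healthy.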