Todd Lipcon created KUDU-1516:
---------------------------------

             Summary: ksck should check for more raft-related status issues
                 Key: KUDU-1516
                 URL: https://issues.apache.org/jira/browse/KUDU-1516
             Project: Kudu
          Issue Type: Improvement
          Components: consensus, ksck, supportability
    Affects Versions: 0.9.1
            Reporter: Todd Lipcon
            Priority: Critical


We currently have a test cluster where one or more tablets became 
under-replicated (1 replica remaining out of 3) and were not able to 
re-replicate in time. Even so, 'ksck' still reports that the table is healthy 
and only reports two down tablet servers. It seems there is a lot of room for 
improvement:
- for each tablet, check that at least a majority of its replicas are on live 
tablet servers, and that those tablet servers consider the replica to be in 
the RUNNING state
- some basic tablet "health checks", like asking followers whether they have 
recently heard from the leader successfully?
- perhaps a canary request pushed to each tablet? (e.g. an empty write or no_op)
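
To make the first suggestion concrete, here is a minimal sketch of the majority check, written in Python for brevity. It is not ksck code: the Replica record, the function names, and the plain "RUNNING" string are all hypothetical stand-ins, and it assumes the set of live tablet servers and the per-replica state have already been gathered elsewhere.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Replica:
    # Hypothetical record: the tablet server hosting the replica and the
    # state that server reports for it (for example "RUNNING").
    ts_uuid: str
    state: str


def has_healthy_majority(replicas: List[Replica], live_servers: Set[str]) -> bool:
    """True if a strict majority of replicas are on live servers and RUNNING."""
    healthy = sum(
        1 for r in replicas
        if r.ts_uuid in live_servers and r.state == "RUNNING"
    )
    # Majority of the configured replica count, e.g. 2 of 3 or 3 of 5.
    return healthy >= len(replicas) // 2 + 1


def unhealthy_tablets(tablets: Dict[str, List[Replica]],
                      live_servers: Set[str]) -> List[str]:
    """Return the sorted ids of tablets that lack a healthy majority."""
    return sorted(
        tablet_id for tablet_id, replicas in tablets.items()
        if not has_healthy_majority(replicas, live_servers)
    )
```

Under these assumptions, the situation described above (3 replicas, 2 of them on down tablet servers) would be flagged, since only 1 of 3 replicas counts as healthy. A replica on a live server that is not in the RUNNING state is likewise not counted toward the majority.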



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
