[ 
https://issues.apache.org/jira/browse/KUDU-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849314#comment-15849314
 ] 

Jean-Daniel Cryans commented on KUDU-1860:
------------------------------------------

bq. To clarify: it's "evicted" meaning that there is a pending configuration 
that removes it, but the pending configuration is not yet committed?

Yes.

bq. We don't currently centralize the pending config to the master IIRC, but we 
could consider doing so, or fetching it from tservers during ksck (which might 
be less error-prone)

Reconciling the views of all the replicas could be tricky, and since the views 
wouldn't be fetched atomically it might produce some false positives, but that 
seems better than not showing bad tablets. We could also consider a health 
check, such as sending a dummy message that the config has to replicate, and 
treating the tablet as OK if that succeeds.
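
To make the first idea concrete, here is a rough sketch of the comparison ksck 
could do once it has each replica's view. It is not based on the actual ksck 
code: ReplicaView, find_config_conflicts and the ts-a/ts-b/ts-c server names 
are all made up for illustration, and how the views get fetched from the 
tservers is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ReplicaView:
    """One replica's view of a tablet's config (hypothetical structure)."""
    committed: FrozenSet[str]                 # server IDs in the committed config
    pending: Optional[FrozenSet[str]] = None  # server IDs in the pending config, if any


def find_config_conflicts(master_config: FrozenSet[str],
                          views: Dict[str, ReplicaView]) -> List[str]:
    """Return human-readable warnings; an empty list means the views agree."""
    warnings = []
    for server, view in sorted(views.items()):
        # The replica's committed config should match what the master reports.
        if view.committed != master_config:
            warnings.append(f"{server}: committed config differs from master")
        # A pending config that drops a member means an eviction is in flight.
        if view.pending is not None:
            evicted = sorted(view.committed - view.pending)
            if evicted:
                warnings.append(f"{server}: pending config evicts {evicted}")
    return warnings


master = frozenset({"ts-a", "ts-b", "ts-c"})
views = {"ts-a": ReplicaView(committed=master,
                             pending=frozenset({"ts-a", "ts-c"}))}
print(find_config_conflicts(master, views))
```

Since the views would be collected at different times, anything this reports 
should probably be shown as a warning rather than a hard failure.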

> ksck doesn't identify tablets that are evicted but still in config
> ------------------------------------------------------------------
>
>                 Key: KUDU-1860
>                 URL: https://issues.apache.org/jira/browse/KUDU-1860
>             Project: Kudu
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 1.2.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>
> As reported by a user on Slack, ksck can give misleading output such as:
> {noformat}
>   ca199fafca544df2a1b2a01be9d5266d (server1:7250): RUNNING [LEADER]
>   a077957f627c4758ab5a989aca8a1ca8 (server2:7250): RUNNING
>   5c09a555c205482b8131f15b2c249ec6 (server3:7250): bad state
>     State:       NOT_STARTED
>     Data state:  TABLET_DATA_TOMBSTONED
>     Last status: Tablet initializing...
> {noformat}
> The problem is that server2 had already been evicted from the configuration 
> (based on reading the logs), but the eviction wasn't committed in the config 
> (which contains server 1 and 3) since there's really only 1 server left out of 3.
> Ideally ksck should check what each server thinks the configuration is and 
> report any difference from what's in the master. As it is, it looks like 
> we're missing 1 replica, but in reality this is a broken tablet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
