[ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YifanZhang updated KUDU-3341:
-----------------------------
    Description: 
Sometimes a tablet server is shut down because of detected disk failures, 
and the server is later re-added to the cluster with all of its data cleared.

The affected replicas are re-replicated after 
{{--follower_unavailable_considered_failed_sec}} seconds. The master then sends 
DeleteTablet RPCs to this tserver, but receives either an RPC failure (the 
tserver was shut down) or a WRONG_SERVER_UUID error (the tserver started with a 
new UUID), and keeps retrying to delete the tablets until 
{{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.

Retrying is unnecessary when a WRONG_SERVER_UUID error is received, because 
the server UUID can only be corrected by restarting the tablet server. At that 
time, a full tablet report is sent to the master, and any outdated replicas 
can then be deleted.

  was:
Sometimes a tablet server is shut down because of detected disk failures, 
and the server is later re-added to the cluster with all of its data cleared.

The affected replicas are re-replicated after 
{{--follower_unavailable_considered_failed_sec}} seconds. The master then sends 
DeleteTablet RPCs to this tserver, but receives either an RPC failure (the 
tserver was shut down) or a WRONG_SERVER_UUID error (the tserver started with a 
new UUID), and keeps retrying to delete the tablets until 
{{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.

Retrying is unnecessary when a WRONG_SERVER_UUID error is received, because 
the server UUID can only be corrected by restarting the tablet server. At that 
time, a full tablet report is sent to the master, and outdated replicas can 
then be deleted.


> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --------------------------------------------------------------------------------------
>
>                 Key: KUDU-3341
>                 URL: https://issues.apache.org/jira/browse/KUDU-3341
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: YifanZhang
>            Priority: Minor
>
> Sometimes a tablet server is shut down because of detected disk failures, 
> and the server is later re-added to the cluster with all of its data 
> cleared.
> The affected replicas are re-replicated after 
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then 
> sends DeleteTablet RPCs to this tserver, but receives either an RPC failure 
> (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver started 
> with a new UUID), and keeps retrying to delete the tablets until 
> {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> Retrying is unnecessary when a WRONG_SERVER_UUID error is received, because 
> the server UUID can only be corrected by restarting the tablet server. At 
> that time, a full tablet report is sent to the master, and any outdated 
> replicas can then be deleted.
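
The retry decision proposed in the description could be sketched roughly as 
follows. This is only an illustration in Python, not Kudu's actual C++ catalog 
manager code: the names DeleteOutcome, RetryPolicy, and should_retry are 
hypothetical, and the timeout field merely stands in for 
{{--unresponsive_ts_rpc_timeout_ms}} with its 1-hour default.

```python
from dataclasses import dataclass
from enum import Enum, auto


class DeleteOutcome(Enum):
    """Hypothetical outcomes of one DeleteTablet RPC attempt."""
    OK = auto()
    RPC_FAILURE = auto()        # e.g. the tserver is shut down
    WRONG_SERVER_UUID = auto()  # the tserver came back with a new UUID


@dataclass
class RetryPolicy:
    # Assumed stand-in for --unresponsive_ts_rpc_timeout_ms (default 1 hour).
    timeout_ms: int = 3_600_000

    def should_retry(self, outcome: DeleteOutcome, elapsed_ms: int) -> bool:
        """Decide whether the master should retry the DeleteTablet RPC."""
        if outcome is DeleteOutcome.OK:
            return False
        if outcome is DeleteOutcome.WRONG_SERVER_UUID:
            # The UUID only changes when the tserver restarts, and a restart
            # triggers a full tablet report, so retrying here gains nothing.
            return False
        # Other failures may be transient: keep retrying until the timeout.
        return elapsed_ms < self.timeout_ms
```

With this policy, an RPC failure is retried until the timeout elapses, while a 
WRONG_SERVER_UUID error ends the retry loop immediately and leaves cleanup to 
the next full tablet report.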



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
