[ 
https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452480#comment-17452480
 ] 

ASF subversion and git services commented on KUDU-3341:
-------------------------------------------------------

Commit 0222c3163129b1d6c1c37b216482aa64f921c415 in kudu's branch 
refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=0222c31 ]

KUDU-3341: Stop retrying to DeleteTablet on wrong server

This patch improves catalog_manager's behavior when a DeleteTablet
request fails with a 'WRONG_SERVER_UUID' error. It is better to mark the
RetryTask as failed than to keep retrying and sending futile requests,
because the master would keep receiving the same error until the tablet
server with the wrong UUID restarts with the correct UUID. At that point
the tserver sends a full tablet report, which triggers the deletion of
outdated tablets.

This patch also adds a test that reproduces the scenario described in the JIRA.

Change-Id: Ieaa36086300bda7f958570c690b951dc090c342a
Reviewed-on: http://gerrit.cloudera.org:8080/18057
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Reviewed-by: Attila Bukor <abu...@apache.org>


> Catalog Manager should stop retrying DeleteTablet when receive 
> WRONG_SERVER_UUID error
> --------------------------------------------------------------------------------------
>
>                 Key: KUDU-3341
>                 URL: https://issues.apache.org/jira/browse/KUDU-3341
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: YifanZhang
>            Assignee: YifanZhang
>            Priority: Minor
>
> Sometimes a tablet server is shut down because of detected disk
> failures, and the server is later re-added to the cluster with all data
> cleared.
> Its replicas are re-replicated after
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then
> sends DeleteTablet RPCs to this tserver, but receives either an RPC
> failure (the tserver was shut down) or a WRONG_SERVER_UUID error (the
> tserver started with a new UUID), and keeps retrying to delete the tablets
> for up to {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour).
> Retrying is unnecessary when a WRONG_SERVER_UUID error is received, because
> the server UUID can only be corrected by restarting the tablet server. At
> that time a full tablet report is sent to the master, and any outdated
> replicas can finally be deleted.
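
The retry policy proposed in the issue above can be sketched roughly as
follows. This is a minimal illustration in Python, not Kudu's actual C++
implementation: the names Outcome, TaskState, and run_delete_tablet_task are
hypothetical, and a simple attempt limit stands in for the RPC timeout.

```python
from enum import Enum, auto


class Outcome(Enum):
    """Hypothetical result of one DeleteTablet attempt."""
    OK = auto()
    RPC_FAILURE = auto()        # e.g. the tserver is shut down
    WRONG_SERVER_UUID = auto()  # the tserver came back with a new UUID


class TaskState(Enum):
    COMPLETE = auto()
    FAILED = auto()


def run_delete_tablet_task(send_rpc, max_attempts):
    """Retry send_rpc() until it succeeds, hits a non-retriable error,
    or max_attempts is exhausted. Returns (final_state, attempts_made).

    max_attempts is an assumption standing in for the timeout budget.
    """
    for attempt in range(1, max_attempts + 1):
        outcome = send_rpc()
        if outcome is Outcome.OK:
            return TaskState.COMPLETE, attempt
        if outcome is Outcome.WRONG_SERVER_UUID:
            # Retrying cannot help: the UUID only changes when the tserver
            # restarts, and the full tablet report sent after that restart
            # triggers cleanup of outdated replicas anyway.
            return TaskState.FAILED, attempt
        # Any other failure is treated as transient, so try again.
    return TaskState.FAILED, max_attempts
```

In this sketch, a plain RPC failure is still retried, while a
WRONG_SERVER_UUID response ends the task immediately instead of consuming
the rest of the retry budget.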



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
