[ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452480#comment-17452480 ]
ASF subversion and git services commented on KUDU-3341:
-------------------------------------------------------

Commit 0222c3163129b1d6c1c37b216482aa64f921c415 in kudu's branch refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=0222c31 ]

KUDU-3341: Stop retrying to DeleteTablet on wrong server

This patch improves catalog_manager's behavior when a DeleteTablet
request fails with a 'WRONG_SERVER_UUID' error. It is better to mark
this RetryTask as failed than to keep retrying and sending many
requests, because the master would always receive the same error
until the "wrong uuid server" restarts with a "correct uuid". At that
time the tserver would send a full tablet report, which would trigger
the deletion of outdated tablets.

I also added a test that reproduces the scenario described in the JIRA.

Change-Id: Ieaa36086300bda7f958570c690b951dc090c342a
Reviewed-on: http://gerrit.cloudera.org:8080/18057
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Reviewed-by: Attila Bukor <abu...@apache.org>


> Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
> --------------------------------------------------------------------------------------
>
>                 Key: KUDU-3341
>                 URL: https://issues.apache.org/jira/browse/KUDU-3341
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: YifanZhang
>            Assignee: YifanZhang
>            Priority: Minor
>
> Sometimes a tablet server could be shut down because of detected disk
> failures, and this server would be re-added to the cluster with all data
> cleared.
> Replicas could be re-replicated after
> {{--follower_unavailable_considered_failed_sec}} seconds. The master then
> sends DeleteTablet RPCs to this tserver, but receives either an RPC
> failure (tserver was shut down) or a WRONG_SERVER_UUID error (tserver
> started with a new uuid), and keeps retrying to delete the tablets until
> {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> It is not necessary to retry on WRONG_SERVER_UUID errors, because the
> server uuid can only be corrected by restarting the tablet server. At
> that time a full tablet report would be sent to the master, and any
> outdated replicas could finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
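
For readers of this thread, the retry policy described in the commit message can be sketched roughly as below. This is an illustration only, not Kudu's actual C++ code: `DeleteTabletTask`, `RpcResult`, `TaskState`, `handle_response` and `run` are hypothetical names, and `max_attempts` is a simplified stand-in for the time-based limit set by --unresponsive_ts_rpc_timeout_ms.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RpcResult(Enum):
    OK = auto()
    NETWORK_ERROR = auto()      # e.g. the tserver is shut down
    WRONG_SERVER_UUID = auto()  # the tserver came back with a new uuid


class TaskState(Enum):
    RUNNING = auto()
    COMPLETE = auto()
    FAILED = auto()


@dataclass
class DeleteTabletTask:
    """Hypothetical stand-in for the master's retrying DeleteTablet task."""
    max_attempts: int  # assumption: replaces the real time-based limit
    attempts: int = 0
    state: TaskState = TaskState.RUNNING

    def handle_response(self, result: RpcResult) -> bool:
        """Record one RPC outcome; return True if the task should retry."""
        self.attempts += 1
        if result is RpcResult.OK:
            self.state = TaskState.COMPLETE
            return False
        if result is RpcResult.WRONG_SERVER_UUID:
            # Retrying cannot help: the uuid only changes when the tserver
            # restarts, and its full tablet report then drives the cleanup.
            self.state = TaskState.FAILED
            return False
        if self.attempts >= self.max_attempts:
            self.state = TaskState.FAILED
            return False
        return True  # transient failure: try again


def run(task: DeleteTabletTask, responses) -> DeleteTabletTask:
    """Feed a sequence of simulated RPC results to the task."""
    for result in responses:
        if not task.handle_response(result):
            break
    return task
```

In this sketch a NETWORK_ERROR is retried up to the limit, while a WRONG_SERVER_UUID response ends the task immediately, which is the behavior change the patch describes.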