Adar Dembo has posted comments on this change.

Change subject: master: Retry background tasks even if TS UUID not registered
......................................................................
Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6534/1//COMMIT_MSG
Commit Message:

PS1, Line 14: This patch causes the task to be retried even if the initial tablet
            : server UUID lookup fails.
> SendDeleteReplicaRequest() is called as part of CatalogManager::HandleRepor

I don't understand your response, but I think I understand the issue now. You alluded to it earlier in your commit description: the problem occurs when the _heartbeating_ tserver is known to the master, but the _destination_ tserver (of the action RPC) is not.

The issue is that some of the actions performed by HandleReportedTablet() are edge-triggered; that is, these actions are triggered because the previous tablet cstate doesn't match the new cstate, and the new cstate is written into the master tablet even if the action fails.

IIUC, the DeleteReplica() RPC sent from catalog_manager.cc:L2482 and the AddServer() RPC sent from L2493 are the two vulnerable cases. The former manifests when the tserver hosting the evicted replica hasn't yet registered, and the latter when the tserver hosting the leader replica hasn't yet registered. Is that correct?

Is there a particular reason why these cases are so strictly edge-triggered? In both cases we would expect the not-yet-registered tserver to register shortly, so if we were smart enough to take the same action in response to its full tablet report, it wouldn't matter that the RPC we sent out previously was canceled.

I'm not asking just for academic reasons. Imagine the master failing again just after writing the tablet's new cstate into the master tablet but just before the RPC is sent to the destination tserver. After the master restarts, the RPC state is gone, but the new tablet cstate was written out before the restart, which means full tablet reports will not lead to any DeleteReplica() or AddServer() RPCs, at least not until the next tablet config change event.
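
To make the edge-triggered versus level-triggered distinction concrete, here is a minimal C++ sketch of the DeleteReplica() case. It is a toy model and not Kudu code: `ConsensusState`, `Master`, `SendDeleteReplica()`, `HandleReportEdge()`, and `HandleReportLevel()` are all hypothetical names, and the real HandleReportedTablet() logic is considerably more involved. The sketch assumes a cstate can be reduced to a config index plus a set of peer UUIDs.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a tablet's consensus state.
struct ConsensusState {
  int64_t opid_index = 0;        // index of the committed config
  std::set<std::string> peers;   // UUIDs of the tservers in that config
};

// Hypothetical toy master; none of these members mirror Kudu's real API.
struct Master {
  std::set<std::string> registered;           // tservers registered since startup
  ConsensusState persisted;                   // cstate stored in the master tablet
  std::vector<std::string> delete_rpcs_sent;  // destinations of DeleteReplica RPCs

  // Fails when the destination tserver is unknown (the UUID lookup failure).
  bool SendDeleteReplica(const std::string& uuid) {
    if (registered.count(uuid) == 0) return false;
    delete_rpcs_sent.push_back(uuid);
    return true;
  }

  // Edge-triggered: acts only on the transition from the old cstate to a new one.
  void HandleReportEdge(const ConsensusState& reported) {
    if (reported.opid_index <= persisted.opid_index) return;  // no edge, no action
    const ConsensusState prev = persisted;
    persisted = reported;  // written out whether or not the RPCs below succeed
    for (const std::string& uuid : prev.peers) {
      if (reported.peers.count(uuid) == 0) {
        SendDeleteReplica(uuid);  // a failure here is silently lost
      }
    }
  }

  // Level-triggered: re-evaluates the reporter against the persisted cstate
  // on every report, so a lost RPC is retried the next time the tserver reports.
  void HandleReportLevel(const std::string& reporter,
                         const ConsensusState& reported) {
    registered.insert(reporter);
    if (reported.opid_index > persisted.opid_index) persisted = reported;
    if (persisted.peers.count(reporter) == 0) {
      SendDeleteReplica(reporter);
    }
  }
};
```

In this model, suppose the persisted config is {A, B, C} at index 1 and only A is registered. When A reports a new config {A, B} at index 2, `HandleReportEdge()` persists it, but the DeleteReplica to C fails because C is unknown. Once C registers and sends its (stale) report, the edge-triggered handler sees no cstate change and does nothing, whereas `HandleReportLevel()` notices that C is not in the persisted config and sends the RPC. The same reasoning would apply to the master-restart scenario, since the level-triggered check depends only on persisted state and the incoming report.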
--
To view, visit http://gerrit.cloudera.org:8080/6534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a3de7fe87266f11392fd3bb0c74f19ad803de9d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes