Adar Dembo has posted comments on this change.

Change subject: master: Retry background tasks even if TS UUID not registered
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6534/1//COMMIT_MSG
Commit Message:

PS1, Line 14: This patch causes the task to be retried even if the initial tablet
            : server UUID lookup fails.
> SendDeleteReplicaRequest() is called as part of CatalogManager::HandleRepor
I don't understand your response, but I think I understand the issue now. You
alluded to it earlier in your commit description: the problem occurs when
the _heartbeating_ tserver is known to the master, but the _destination_
tserver (of the action RPC) is not. The issue is that some of the actions
performed by HandleReportedTablet() are edge-triggered; that is, these actions
are triggered because the previous tablet cstate doesn't match the new cstate,
and the new cstate is written into the master tablet even if the action fails.
IIUC, the DeleteReplica() RPC sent from catalog_manager.cc:L2482 and the
AddServer() RPC sent from L2493 are the two vulnerable cases. The former
manifests when the tserver hosting the evicted replica hasn't yet registered,
and the latter when the tserver hosting the leader replica hasn't yet
registered. Is that correct?

Is there a particular reason why these cases are so strictly edge-triggered? In 
both cases we would expect the not-yet-registered tserver to register shortly, 
so if we were smart enough to take the same action in response to its full 
tablet report, it wouldn't matter that the RPC we sent out previously was 
canceled.
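
To make the edge-triggered vs. level-triggered distinction concrete, here is a
minimal sketch. It is not Kudu code: Config, Master, SendDelete(),
HandleConfigChangeEdge() and HandleFullReportLevel() are all hypothetical names,
and only the DeleteReplica() case is modeled. The edge-triggered handler acts
once on the cstate difference and persists the new cstate regardless; the
level-triggered handler re-derives the action from the persisted cstate whenever
a tserver sends a full report.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model; none of these are real Kudu types.
struct Config {
  std::set<std::string> peers;  // UUIDs of the replicas in the committed config
};

struct Master {
  Config persisted;                       // last cstate written to the master tablet
  std::set<std::string> registered;       // tservers that have registered so far
  std::vector<std::string> sent_deletes;  // destinations that were sent a delete RPC

  // Returns false (and sends nothing) if the destination is not registered.
  bool SendDelete(const std::string& uuid) {
    if (registered.count(uuid) == 0) return false;
    sent_deletes.push_back(uuid);
    return true;
  }

  // Edge-triggered: acts only on the difference between old and new cstate.
  void HandleConfigChangeEdge(const Config& reported) {
    for (const auto& uuid : persisted.peers) {
      if (reported.peers.count(uuid) == 0) SendDelete(uuid);  // failure is dropped
    }
    persisted = reported;  // written even if SendDelete() failed
  }

  // Level-triggered: on a full report, checks the reporter against the
  // persisted cstate, independent of any earlier transition.
  void HandleFullReportLevel(const std::string& reporter, bool hosts_replica) {
    assert(registered.count(reporter) == 1);  // a reporting tserver has registered
    if (hosts_replica && persisted.peers.count(reporter) == 0) SendDelete(reporter);
  }
};
```

In this model, if the evicted tserver is unregistered when the config change
arrives, repeating the same report through HandleConfigChangeEdge() never sends
the delete, whereas HandleFullReportLevel() sends it as soon as that tserver
registers and reports.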

I'm not just asking for some academic reason. Imagine if the master were to 
fail again just after writing the tablet's new cstate into the master tablet 
but just before the RPC is sent to the destination tserver. After restarting 
the master, the RPC state is gone, but the new tablet cstate had been written 
out before the restart, which means full tablet reports will not lead to any 
DeleteReplica() or AddServer() RPCs, at least not until the next tablet config 
change event.
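
The crash window can be sketched the same way. Again, this is a hypothetical
model rather than the actual master code: Disk stands in for the master tablet,
MasterProcess for in-memory RPC state, and PersistNewConfig() and QueueDeletes()
are invented names for the two steps.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: Disk survives a master restart, MasterProcess does not.
struct Disk {
  std::set<std::string> peers;  // cstate persisted in the master tablet
};

struct MasterProcess {
  std::vector<std::string> pending_deletes;  // in-memory RPC state only
};

// Step 1: persist the reported cstate and return the UUIDs it evicted.
std::vector<std::string> PersistNewConfig(Disk& disk,
                                          const std::set<std::string>& reported) {
  std::vector<std::string> evicted;
  for (const auto& uuid : disk.peers) {
    if (reported.count(uuid) == 0) evicted.push_back(uuid);
  }
  disk.peers = reported;
  return evicted;
}

// Step 2: queue one delete RPC per evicted replica.
void QueueDeletes(MasterProcess& proc, const std::vector<std::string>& evicted) {
  for (const auto& uuid : evicted) {
    assert(!uuid.empty());  // an evicted replica always has a UUID
    proc.pending_deletes.push_back(uuid);
  }
}
```

If the process dies between the two steps, a fresh MasterProcess that replays
the same report through PersistNewConfig() gets an empty eviction list, because
the old and new cstates now match, so nothing is ever queued.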


-- 
To view, visit http://gerrit.cloudera.org:8080/6534
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a3de7fe87266f11392fd3bb0c74f19ad803de9d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes
