Adar Dembo has posted comments on this change.

Change subject: WIP: Add a design doc for rpc retry/failover semantics
......................................................................
Patch Set 1:

(48 comments)

I don't have that much context on this, but I tried to focus on the implications for multi-master.

http://gerrit.cloudera.org:8080/#/c/2642/1//COMMIT_MSG
Commit Message:

Line 7: WIP: Add a design doc for rpc retry/failover semantics
Please add it to the README.md file in the design-docs directory.

http://gerrit.cloudera.org:8080/#/c/2642/1/docs/design-docs/rpc-retry-and-failover.md
File docs/design-docs/rpc-retry-and-failover.md:

Did we ever talk about how "rough" these design docs can be, as a group? I left many stylistic comments to improve readability, but I'll understand if I've incorrectly gauged where the "bar" should be on that sort of stuff.

Line 20: RPC's are not idempotent i.e. if handled twice they will lead to incorrect state
Nit: RPCs (that's what you used on L17).

Line 22: is guaranteed to execute it *Exactly One* times.
Nit: to go well with the list below, change to "execute it *Exactly Once*." Also, I'd use bold here (**foo**) instead of italics.

Line 24: Kudu currently has 4 RPC service interfaces, Tablet, TabletAdmin, Master and Consensus.
Let's refer to them more like how the .proto files refer to them: TabletServer, TabletServerAdmin, Master, and Consensus. There's also RemoteBootstrap, which should probably be included. And there's Generic, though I'm not sure if it needs to be included.

Line 29: volatile
What does volatile mean in this context?

Line 33: 3- Needs Exactly Once but has currently an alternative - Operations like in 2 but that already
Drop "currently", otherwise it sounds like CAS is a temporary thing that you'd rather replace with "proper" Exactly Once.

Line 27: 1- Doesn't need Exactly Once - Operations that don't mutate server state.
:
: 2- Would benefit from Exactly Once - Operations that only mutate volatile state or mutate
: state that is not replicated across servers and that would benefit from exactly once for
: generalized error handling, but that don't strictly need it.
:
: 3- Needs Exactly Once but has currently an alternative - Operations like in 2 but that already
: have some alternative error handling mechanism like CAS.
:
: 4- Require Exactly Once for correctness - Operations that don't currently have E.O. implemented
: and require it for correctness.
Nit: this list didn't format well in markdown. Use "1.", "2.", etc. Also, surround "Exactly Once" with stars like you did before.

Line 47: replicated rpcs, we'll use the same terminology to avoid redefining the concepts:
Nit: RPCs

Line 49: - Identification - Each RPC must have an unique identifier.
Nit: a, not an

Line 50: - Completion record - Each RPC must have a durable record of its completion.
Like "Exactly Once", let's refer to Completion Record in capital letters? Elsewhere in the doc.

Line 51: - Retry rendez-vous - When an RPC is retried it must find the completion record of the
Nit: rendezvous

Line 55: ## Design options
Having read through this list, I'm in favor of #1, but I probably don't fully grasp its inherent complexity.

Line 60: 1- Completely encapsulated at the RPC layer - This the option that [1] presents at length,
Nit: This is the option...

Line 61: allowing to "bolt-on" these properties almost independently of what the underlying rpc's
Nit: RPCs

Line 63: replicated RPCs we'd need to come up with a generic, durable, Completion record mechanism.
Nit: Completion Record. Or completion record. But it's weird to capitalize just the C.

Line 65: it's not totally clear what happens when an RPC mutates multiple server "objects" or
Could you give an example of such an RPC, so that I understand what you mean by "server objects"?

Line 66: what these server objects really are.
Also not a markdown-friendly list.

Line 70: each different RPC. For specific rpc's like Write() this option seems to map to existing
Nit: RPCs

Line 72: replay would mean that is would be relatively easy to implement a retry rendez-vous
Nit: would mean that it. Also, rendezvous

Line 73: mechanism.
: However this option lack in generality and it would likely require duplicate
Nit: lacks

Line 79: rendez
: vous
Nit: there's a space inside rendezvous and the sentence just trails off.

Line 84: Option 1 is reasonably well presented in the paper, so the discussion will mostly focus on options
Nit: replace "the paper" with "[1]".

Line 90: - Identification - Although the clients don't currently have an ID, a natural place to issue
Do clients need to persist the ID somehow? Or is it OK for a client to get a new ID when it restarts? Also, when are client IDs garbage collected by the master? Maybe that's described below...

Line 92: to fully guarantee unique id's though, we'd have to store client ids in the master tablet.
Nit: IDs. Elsewhere in the doc too.

Line 96: Note on client leases: [1] Discusses client leases at length and why they are a useful feature
Nit: discusses

Line 99: is what identifies each single RPC uniquely. On top of a lease issueing mechanism this system
Nit: issuing

Line 104: client IDs issued by the master or quite some time. Moreover it's something that we can
Nit: issued by the master for quite some time?

Line 106:
From here down, we're basically trying to squeeze as much mileage as possible out of the WAL (to solve the remaining problems). It may be clearer to state that explicitly up front.

Line 107: - Completion record - Each client request is recorded by the consensus log almost as is so
That's only true for Write requests, not every request. What of master state changes, like CreateTable? Oh, I see, you touch on that in the last sentence of this paragraph. Maybe reword a bit so it's clear up here too?

Line 113: - Retry rendez-vous - For this element we'd need to create a new component, the Replay Cache,
Nit: rendezvous. Elsewhere too.
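For concreteness, here is a rough sketch of how I picture the Replay Cache's retry rendezvous working, assuming RPCs are keyed by (client ID, sequence number) as in [1]. All names here are mine, not from the doc, and durability is elided:

```python
# Sketch (my naming, not the doc's) of a retry rendezvous through a replay
# cache, along the lines of [1]. Each RPC is identified by
# (client_id, seq_no); a completed RPC leaves a completion record that a
# retry finds instead of re-executing the operation.

class ReplayCache:
    def __init__(self):
        # In a real server the records would be made durable (e.g. via the
        # WAL) before responding; a plain dict stands in for that here.
        self._records = {}

    def handle(self, client_id, seq_no, execute):
        key = (client_id, seq_no)
        record = self._records.get(key)
        if record is not None:
            # Retry rendezvous: the RPC already completed, so return the
            # recorded result rather than executing a second time.
            return record
        result = execute()           # the operation runs exactly once
        self._records[key] = result  # completion record for future retries
        return result
```

With this, a Write() retried after a lost response would be answered from the record rather than applied twice. Garbage collection of records (tied to client leases or WAL GC, as the doc suggests) is omitted here.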
Line 125: with log (wal) garbage collection, with the added complexity that we should then remove
Nit: WAL

Line 132: interfaces, implemeted by other components, from which to fetch/store completion records
Nit: implemented

Line 138: Milestone 1 - Define interfaces for Completion Record storage and handling and implement
Nit: use some markdown to make the milestones stand out a little more. Maybe subheadings?

Line 140:
Nit: trailing whitespace in the doc. Look for red blocks in gerrit. Also issuing. Elsewhere too.

Line 143: cat. 4 operation), which is arguably the most important one. We'll define a
Would be good to explicitly state why it's most important (i.e. most common state change, probably?)

Line 145: 1 - Start new operations if this is a new RPC.
Nit: not a markdown-friendly list.

Line 146: 2 - For RPCs that have previsouly completed it will be able to fetch existing completion
Nit: previously

Line 154: wal. Note that this component must cache not only the RPC "result" (i.e. SUCCESS/ABORTED
Nit: WAL. Elsewhere too.

Line 163: Client ID issueing will be done by the master, when a client connects to the master for
Would be nice to see more detail on how the master persists client IDs.

Line 167: starting step.
As long as it's configurable by gflag, I agree this is a reasonable first step.

Line 175: in Introdction) we would benefit from having an RPC layer retry mechanism, for the sender,
Nit: the introduction

Line 181: CompletionRecordStore. This local version would not having a durability mechanism and would
Nit: have, not having

Line 184: help immensely in consoldating retry logic that we have spread all over.
Presumably there'd be a CompletionRecordStore for the master as well? If it's local, does that mean that an ill-timed master leader election (i.e. one just after a failed client request) could lead to non-E.O. semantics?

Line 189: leases are stored and garbage collection is performed with the help with a "cluster clock".
Would be good to explain in more detail what this "cluster clock" is.

Line 212:
We're currently observing errors with these operations, so they should be classified as cat. 2. Isn't it kind of like AlterSchema though? Viewed from the tserver's perspective I agree that they're category 2, e.g. the following sequence:

1. Sender 1 sends CreateTablet("foo") to the TS.
2. The request never reaches the TS.
3. Sender 2 sends CreateTablet("foo") to the TS.
4. Tablet foo is created.
5. Sender 1 retries CreateTablet("foo") to the TS.
6. The TS responds with ALREADY_PRESENT, but sender 1 doesn't know whether foo was already present to begin with, or whether step 1 created it.

However, there is no sender 1 and sender 2: there is only one leader master. Even in the case of a leader master election, sender 1 should not retry because it is no longer the leader.

Line 248: Due to a central table map, these operations are already executed exactly once, and thus
But there's still ambiguity if multiple clients try to create/alter/delete the same table and one ends up retrying. For example, a retrying CreateTable() client that receives ALREADY_PRESENT doesn't know who created the table. Was it the original RPC? Was it another client altogether, in which case the schema may not match? Moreover, a retried CreateTable() may end up recreating a table that was just deleted by a second client, in between the two RPCs sent by the first client.
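To make that ALREADY_PRESENT ambiguity concrete, here's a toy sketch (names and return codes are mine, purely illustrative) showing that the retrying client can't tell its own earlier attempt apart from another client's creation:

```python
# Toy model of the CreateTable() retry ambiguity described above: the
# response alone can't tell the retrying client whether its own first
# attempt, or some other client, created the table.

class Catalog:
    def __init__(self):
        self.tables = {}  # table name -> schema

    def create_table(self, name, schema):
        if name in self.tables:
            return "ALREADY_PRESENT"
        self.tables[name] = schema
        return "OK"

catalog = Catalog()
# Client A's CreateTable("t", schema_v1) succeeds, but the response is lost:
catalog.create_table("t", "schema_v1")
# Meanwhile client B tries to create a table with the same name but a
# different schema (B happens to lose the race here):
catalog.create_table("t", "schema_v2")
# Client A retries and sees ALREADY_PRESENT -- indistinguishable from the
# case where B's table (with the wrong schema) was there all along.
retry_result = catalog.create_table("t", "schema_v1")
```

In this run A's schema actually won, but A has no way to learn that from ALREADY_PRESENT alone; with a completion record keyed to A's original RPC, the retry would instead rendezvous with its own earlier result.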
Line 253: [1] - Implementing Linearizability at Large Scale and Low Latency
: web.stanford.edu/~ouster/cgi-bin/papers/rifl.pdf
The markdown-approved way of linking is:

[I'm an inline-style link with title](https://www.google.com "Google's Homepage")

--
To view, visit http://gerrit.cloudera.org:8080/2642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
