David Ribeiro Alves has posted comments on this change.

Change subject: Add a design doc for rpc retry/failover semantics
......................................................................
Patch Set 1:

(56 comments)

http://gerrit.cloudera.org:8080/#/c/2642/1//COMMIT_MSG
Commit Message:

Line 7: WIP: Add a design doc for rpc retry/failover semantics
> Please add it to the README.md file in the design-docs directory.

Done

http://gerrit.cloudera.org:8080/#/c/2642/1/docs/design-docs/rpc-retry-and-failover.md
File docs/design-docs/rpc-retry-and-failover.md:

> Did we ever talk about how "rough" these design docs can be, as a group? I

I don't think we really discussed anything beyond the fact that these are not supposed to be super precise and might even lag behind the implementation a little. In any case I'll happily address your comments.

Line 20: RPC's are not idempotent i.e. if handled twice they will lead to incorrect state
> Nit: RPCs (that's what you used on L17).

Done

Line 22: is guaranteed to execute it *Exactly One* times.
> Nit: to go well with the list below, change to "execute it *Exactly Once*."

Done

Line 24: Kudu currently has 4 RPC service interfaces, Tablet, TabletAdmin, Master and Consensus.
> Let's refer to them more like how the .proto files refer to them: TabletSer

Done

Line 29: volatile
> What does volatile mean in this context?

Non-persistent; will change accordingly.

Line 33: 3- Needs Exactly Once but has currently an alternative - Operations like in 2 but that already
> Drop "currently", otherwise it sounds like CAS is a temporary thing that yo

Done

Line 27: 1- Doesn't need Exactly Once - Operations that don't mutate server state.
 : 
 : 2- Would benefit from Exactly Once - Operations that only mutate volatile state or mutate
 : state that is not replicated across servers and that would benefit from exactly once for
 : generalized error handling, but that don't strictly need it.
 : 
 : 3- Needs Exactly Once but has currently an alternative - Operations like in 2 but that already
 : have some alternative error handling mechanism like CAS.
 : 
 : 4- Require Exactly Once for correctness - Operations that don't currently have E.O. implemented
 : and require it for correctness.
> Nit: this list didn't format well in markdown. Use "1.", "2.", etc.

Done. I don't think it's necessary to have "Exactly Once" in bold all over.

Line 47: replicated rpcs, we'll use the same terminology to avoid redefining the concepts:
> Nit: RPCs

Done

Line 49: - Identification - Each RPC must have an unique identifier.
> Nit: a, not an

Done

Line 50: - Completion record - Each RPC must have a durable record of its completion.
> Like "Exactly Once", let's refer to Completion Record in capital letters? E

Done

Line 51: - Retry rendez-vous - When an RPC is retried it must find the completion record of the
> Nit: rendezvous

Interesting: the French version has a hyphen, but apparently the English one doesn't. Done.

Line 55: ## Design options
> Having read through this list, I'm in favor of #1, but I probably don't ful

The problem with 1 is that we'd need a new mechanism for durably recording the completion records in the RPC layer itself, which is not ideal. The hybrid option allows both encapsulating the retries and dedup in the RPC layer and delegating the storage/handling of the completion records to other layers.

Line 60: 1- Completely encapsulated at the RPC layer - This the option that [1] presents at length,
> Nit: This is the option...

Done

Line 61: allowing to "bolt-on" these properties almost independently of what the underlying rpc's
> Nit: RPCs

Done

Line 63: replicated RPCs we'd need to come up with a generic, durable, Completion record mechanism.
> Nit: Completion Record. Or completion record. But it's weird to capitalize

Done

Line 65: it's not totally clear what happens when an RPC mutates multiple server "objects" or
> Could you give an example of such an RPC, so that I understand what you mea

As I state, it's not completely clear from the paper what these objects really are.

Line 66: what these server objects really are.
> Also not a markdown-friendly list.

Done

Line 70: each different RPC. For specific rpc's like Write() this option seems to map to existing
> Nit: RPCs

Done

Line 72: replay would mean that is would be relatively easy to implement a retry rendez-vous
> Nit: would mean that it

Done

Line 73: mechanism. However this option lack in generality and it would likely require duplicate
> Nit: lacks

Done

Line 76: 3- A hybrid of the above - Certain parts of the problem would be handled by the RPC layer,
> this is more or less what HDFS does. https://issues.apache.org/jira/browse/

Yeah, I read that; I just think that we could do a bit better. On the server side we'd have a generic, non-persistent CompletionRecordStore that simply keeps track of requests non-persistently for operations that have a simple behavior, but we would allow certain operations (like Write()) to register their own CompletionRecordStore that handles the idiosyncrasies of that particular operation. IMO this would end up saving us a bunch of error handling code and provide a bolt-on way for us to include a simple form of Exactly Once, while at the same time allowing us to have a more fully featured one when we need it.

Line 79: rendez
 : vous
> Nit: there's a space inside rendezvous and the sentence just trails off.

Done

Line 84: Option 1 is reasonably well presented in the paper, so the discussion will mostly focus on options
> Nit: replace "the paper" with "[1]".

Done

Line 90: - Identification - Although the clients don't currently have an ID, a natural place to issue
> I don't really follow why it can't just be a client-generated UUID? This is

The point of the distributed ID-issuing system, in the paper, is twofold:
- The leases allow garbage collecting the completion records in a way that is consistent with the clients that need them.
- It _guarantees_ that each request is uniquely identified.
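[To make the hybrid CompletionRecordStore idea discussed above concrete — a generic, non-persistent store keyed by (client ID, sequence number) that lets the RPC layer deduplicate retries — here is a rough, hypothetical sketch. All names and signatures are illustrative only; none of this is Kudu code.]

```python
# Hypothetical sketch of a generic, non-persistent CompletionRecordStore.
# A retry of a completed RPC replays the recorded result; a retry of an
# in-progress RPC attaches a callback instead of re-executing the handler
# or tying up a handler thread.
from enum import Enum
from typing import Callable, Dict, Tuple


class State(Enum):
    IN_PROGRESS = 1
    COMPLETED = 2


class InMemoryCompletionRecordStore:
    def __init__(self):
        # (client_id, seq_no) -> [state, result, pending callbacks]
        self._records: Dict[Tuple[str, int], list] = {}

    def attempt(self, client_id: str, seq_no: int,
                handler: Callable[[], object],
                on_done: Callable[[object], None]) -> None:
        key = (client_id, seq_no)
        rec = self._records.get(key)
        if rec is None:
            # First attempt: record it, execute, then complete.
            self._records[key] = [State.IN_PROGRESS, None, [on_done]]
            self._complete(key, handler())
        elif rec[0] is State.IN_PROGRESS:
            # Retry arrived while the original attempt is still running:
            # register a callback to be run by the executing thread.
            rec[2].append(on_done)
        else:
            # Retry of a completed RPC: replay the recorded result.
            on_done(rec[1])

    def _complete(self, key: Tuple[str, int], result: object) -> None:
        rec = self._records[key]
        rec[0], rec[1] = State.COMPLETED, result
        callbacks, rec[2] = rec[2], []
        for cb in callbacks:
            cb(result)
```

[Under this sketch, the operation-specific variant (e.g. for Write()) would implement the same attempt/complete surface but back the records with the WAL/consensus, and a NoOp variant would simply pass through to current behavior.]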
(Maybe there isn't much of a concern here though, at least for randomly generated UUIDs.) However, I think that, even if we ever deem it necessary, on a first approach we don't need anything but the UUID itself. In fact, in my implementation I have the LeaseManager just generate a UUID locally which never expires. I changed the wording of milestone 1 and elsewhere accordingly. If we decide that we will for sure never use these leases, all that is really required is to elide the lease manager class and use the raw UUID generator instead, which shouldn't be much work.

Line 90: - Identification - Although the clients don't currently have an ID, a natural place to issue
> Do clients needs to persist the ID somehow? Or is it OK for a client to get

This is not meant to tolerate crashy clients. On restart the client would be assigned a new ID.

Line 92: to fully guarantee unique id's though, we'd have to store client ids in the master tablet.
> Nit: IDs. Elsewhere in the doc too.

Done

Line 96: Note on client leases: [1] Discusses client leases at length and why they are a useful feature
> Nit: discusses

Done

Line 99: is what identifies each single RPC uniquely. On top of a lease issueing mechanism this system
> Nit: issuing

Done

Line 104: client IDs issued by the master or quite some time. Moreover it's something that we can
> Nit: issued by the master for quite some time?

Done

Line 106:
> From here down, we're basically trying to squeeze as much mileage as possib

Actually, only Completion Records (and a possible CompletionRecordStore) are directly related to the WAL. Garbage collection is pegged to the WAL, but only circumstantially, since ideally we'd like it pegged to the client leases. In fact, due to Todd's suggestion, even GC is not pegged to the WAL at all.

Line 107: - Completion record - Each client request is recorded by the consensus log almost as is so
> That's only true for Write requests, not every request. What of master stat

Done

Line 113: - Retry rendez-vous - For this element we'd need to create a new component, the Replay Cache,
> Nit: rendezvous. Elsewhere too.

Done

Line 122: the time of the retry, then the retry should be attached to the ongoing transaction.
> another option here is to bounce the client with an "IN_PROGRESS" or "SERVE

It shouldn't be too hard to do it without tying up the handler thread: a possible CompletionRecordStore would automatically either execute the request inline (or however it was executed before) or register a callback to be executed on completion by the executing thread. Changed the wording to make that clear.

Line 125: with log (wal) garbage collection, with the added complexity that we should then remove
> Nit: WAL

Done

Line 126: these records also from the Replay Cache component.
> They don't necessarily need to be coupled, right? HDFS uses a time based ex

In [1] garbage collection is tied to the leases (which is one of the goals). I went with the "safe" way, but you're right that it might not be enough. I'll mention those alternatives. Note that the record lifetime policy is likely an isolated enough issue that we can iterate on it.

Line 132: interfaces, implemeted by other components, from which to fetch/store completion records
> Nit: implemented

Done

Line 138: Milestone 1 - Define interfaces for Completion Record storage and handling and implement
> Nit: use some markdown to make the milestones stand out a little more. Mayb

Done

Line 140:
> Nit: trailing whitespace in the doc. Look for red blocks in gerrit.

Done

Line 143: cat. 4 operation), which is arguably the most important one. We'll define a
> Would be good to explicitly state why it's most important (i.e. most common

Done

Line 145: 1 - Start new operations if this is a new RPC.
> Nit: not a markdown-friendly list.

Done

Line 146: 2 - For RPCs that have previsouly completed it will be able to fetch existing completion
> Nit: previously

Done

Line 154: wal. Note that this component must cache not only the RPC "result" (i.e. SUCCESS/ABORTED
> Nit: WAL. Elsewhere too.

Done

Line 158: Wal garbage collections should be set higher
> Yea, I'm not sure it makes sense to couple the two. Assumedly the minimum t

I corrected this here and above to refer to the time and size policies only (for the case when we don't have the leases).

Line 158: Wal garbage collections should be set higher
> That's the most worrying part to me, especially since we still don't identi

See my answer below.

Line 163: Client ID issueing will be done by the master, when a client connects to the master for
> Would be nice to see more detail on how the master persists client IDs.

Those would have to go into the master tablet, but I've changed how we're going to generate the IDs (at least initially), so I won't go into that for now.

Line 167: starting step.
> As long as it's configurable by gflag, I agree this is a reasonable first s

See my answer above.

Line 174: etc.). For these and generally for all the RPCs that fall into category 2 (mentioned
> I think these RPCs are more like 'cat 3' by your terminology -- they have f

All the replicated RPCs (like create table/delete table) will require an ad-hoc completion store (maybe the same, maybe different, not sure yet). As for the non-replicated requests: we were observing problems specifically with these two RPCs a while ago, which is why they were the first that came to mind. Changed the example.

Line 175: in Introdction) we would benefit from having an RPC layer retry mechanism, for the sender,
> Nit: the introduction

Done

Line 181: CompletionRecordStore. This local version would not having a durability mechanism and would
> Nit: have, not having

Done

Line 184: help immensely in consoldating retry logic that we have spread all over.
> Presumably there'd be a CompletionRecordStore for the master as well? If it

Basically we'd have the CompletionRecordStore for the replicated operations, which would be backed by consensus (for the TSs and the Master); a generic one for most of the non-replicated operations for which we'd still want some EO semantics (probably non-persistent); and a NoOp version that just passes through to the current impl. Made these choices clearer when enunciating the options.

Line 186: Milestone 3 - Client leases and coordinated garbage collection. Include all cat. 3 operations.
> Per a comment above, I dont really see the value in client leases and maste

I would like to consider Milestone 1 as "something that works, albeit just for Write()" and, as I really don't want to be messing with log anchoring if we're going to adopt a simple time/space based policy, I moved the time based policy to M1 and left the space based policy for M2.

Line 189: leases are stored and garbage collection is performed with the help with a "cluster clock".
> Would be good to explain in more detail what this "cluster clock" is.

Done

Line 212: We're currently observing errors with these operations, so they should be classified as cat. 2.
> right, I dont see any problem with our current scheme here, where either on

Agreed. I put these as cat. 2 because there were, at the time, lingering bugs there. Maybe that's not the case anymore; demoted to cat. 3.

Line 248: Due to a central table map, these operations are already executed exactly once, and thus
> yea, I think these are more interesting. Alter Table is also a somewhat nas

Agreed on both counts; changed this to cat. 2.

Line 253: [1] - Implementing Linearizability at Large Scale and Low Latency
 : web.stanford.edu/~ouster/cgi-bin/papers/rifl.pdf
> The markdown-approved way of linking is:

Done

-- 
To view, visit http://gerrit.cloudera.org:8080/2642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
