David Ribeiro Alves has posted comments on this change.

Change subject: Add a design doc for rpc retry/failover semantics
......................................................................


Patch Set 1:

(56 comments)

http://gerrit.cloudera.org:8080/#/c/2642/1//COMMIT_MSG
Commit Message:

Line 7: WIP: Add a design doc for rpc retry/failover semantics
> Please add it to the README.md file in the design-docs directory.
Done


http://gerrit.cloudera.org:8080/#/c/2642/1/docs/design-docs/rpc-retry-and-failover.md
File docs/design-docs/rpc-retry-and-failover.md:

> Did we ever talk about how "rough" these design docs can be, as a group? I 
I don't think we really discussed anything beyond the fact that these are not 
supposed to be super precise and might even lag on the impl a little bit. In 
any case I'll happily address your comments.


Line 20: RPC's are not idempotent i.e. if handled twice they will lead to 
incorrect state
> Nit: RPCs (that's what you used on L17).
Done


Line 22: is guaranteed to execute it *Exactly One* times.
> Nit: to go well with the list below, change to "execute it *Exactly Once*."
Done


Line 24: Kudu currently has 4 RPC service interfaces, Tablet, TabletAdmin, 
Master and Consensus.
> Let's refer to them more like how the .proto files refer to them: TabletSer
Done


Line 29: volatile
> What does volatile mean in this context?
Non-persistent; will change accordingly.


Line 33: 3- Needs Exactly Once but has currently an alternative - Operations 
like in 2 but that already
> Drop "currently", otherwise it sounds like CAS is a temporary thing that yo
Done


Line 27: 1- Doesn't need Exactly Once - Operations that don't mutate server 
state.
       : 
       : 2- Would benefit from Exactly Once - Operations that only mutate 
volatile state or mutate
       : state that is not replicated across servers and that would benefit 
from exactly once for
       : generalized error handling, but that don't strictly need it.
       : 
       : 3- Needs Exactly Once but has currently an alternative - Operations 
like in 2 but that already
       : have some alternative error handling mechanism like CAS.
       : 
       : 4- Require Exactly Once for correctness - Operations that don't 
currently have E.O. implemented
       : and require it for correctness.
> Nit: this list didn't format well in markdown. Use "1.", "2.", etc.
Done. I don't think it's necessary to have "Exactly Once" in bold all over


Line 47: replicated rpcs, we'll use the same terminology to avoid redefining 
the concepts:
> Nit: RPCs
Done


Line 49: - Identification - Each RPC must have an unique identifier.
> Nit: a, not an
Done


Line 50: - Completion record - Each RPC must have a durable record of its 
completion.
> Like "Exactly Once", let's refer to Completion Record in capital letters? E
Done


Line 51: - Retry rendez-vous - When an RPC is retried it must find the 
completion record of the
> Nit: rendezvous
Interesting, the French version has a hyphen, but apparently the English one
doesn't. Done


Line 55: ## Design options
> Having read through this list, I'm in favor of #1, but I probably don't ful
The problem with 1 is that we'd need a new mechanism for durably recording the
completion records in the RPC layer itself, which is not ideal. The hybrid
option allows encapsulating the retries and dedup in the RPC layer while
delegating the storage/handling of the completion records to other layers.


Line 60: 1- Completely encapsulated at the RPC layer - This the option that [1] 
presents at length,
> Nit: This is the option...
Done


Line 61: allowing to "bolt-on" these properties almost independently of what 
the underlying rpc's
> Nit: RPCs
Done


Line 63: replicated RPCs we'd need to come up with a generic, durable, 
Completion record mechanism.
> Nit: Completion Record. Or completion record. But it's weird to capitalize 
Done


Line 65: it's not totally clear what happens when an RPC mutates multiple 
server "objects" or
> Could you give an example of such an RPC, so that I understand what you mea
As I state, it's not completely clear from the paper what these objects really
are.


Line 66: what these server objects really are.
> Also not a markdown-friendly list.
Done


Line 70: each different RPC. For specific rpc's like Write() this option seems 
to map to existing
> Nit: RPCs
Done


Line 72: replay would mean that is would be relatively easy to implement a 
retry rendez-vous
> Nit: would mean that it
Done


Line 73: mechanism. However this option lack in generality and it would likely 
require duplicate
> Nit: lacks
Done


Line 76: 3- A hybrid of the above - Certain parts of the problem would be 
handled by the RPC layer,
> this is more or less what HDFS does. https://issues.apache.org/jira/browse/
Yeah, I read that; I just think that we could do a bit better. On the server
side we'd have a generic, non-persistent CompletionRecordStore that simply
keeps track of requests for operations that have simple behavior, but we would
allow certain operations (like Write()) to register their own
CompletionRecordStore that handles the idiosyncrasies of that particular
operation. IMO this would end up saving us a bunch of error handling code and
provide a bolt-on way for us to include a simple form of Exactly Once, while
at the same time allowing us to have a more featured one when we need it.


Line 79: rendez
       : vous 
> Nit: there's a space inside rendezvous and the sentence just trails off.
Done


Line 84: Option 1 is reasonably well presented in the paper, so the discussion 
will mostly focus on options
> Nit: replace "the paper" with "[1]".
Done


Line 90: - Identification - Although the clients don't currently have an ID, a 
natural place to issue
> I don't really follow why it can't just be a client-generated UUID? This is
The point of the distributed system, in the paper, is twofold:
- The leases make it possible to garbage collect the completion records in a
way that is consistent with the clients that need them.
- It _guarantees_ that each request is uniquely identified (maybe there isn't
much of a concern here though, at least for randomly generated UUIDs).

However, I think that, even if we ever deem it necessary, on a first approach
we don't need anything but the UUID itself. In fact, in my implementation I
have the LeaseManager just generate a UUID locally which never expires. I
changed the verbiage of milestone 1 and elsewhere accordingly.

If we decide that we'll never use these leases, all that's really required is
to elide the LeaseManager class and use the raw UUID generator instead, which
shouldn't be much work.


Line 90: - Identification - Although the clients don't currently have an ID, a 
natural place to issue
> Do clients needs to persist the ID somehow? Or is it OK for a client to get
This is not meant to tolerate crashy clients. On restart the client would be
assigned a new ID.


Line 92: to fully guarantee unique id's though, we'd have to store client ids 
in the master tablet.
> Nit: IDs. Elsewhere in the doc too.
Done


Line 96: Note on client leases: [1] Discusses client leases at length and why 
they are a useful feature
> Nit: discusses
Done


Line 99: is what identifies each single RPC uniquely. On top of a lease 
issueing mechanism this system
> Nit: issuing
Done


Line 104: client IDs issued by the master or quite some time. Moreover it's 
something that we can
> Nit: issued by the master for quite some time?
Done


Line 106: 
> From here down, we're basically trying to squeeze as much mileage as possib
Actually, only Completion Records (and a possible CompletionRecordStore) are
directly related to the WAL. Garbage collection is pegged to the WAL, but only
circumstantially, since ideally we'd like it pegged to the client leases. In
fact, due to Todd's suggestion, even GC is not pegged to the WAL at all.


Line 107: - Completion record - Each client request is recorded by the 
consensus log almost as is so
> That's only true for Write requests, not every request. What of master stat
Done


Line 113: - Retry rendez-vous - For this element we'd need to create a new 
component, the Replay Cache,
> Nit: rendezvous. Elsewhere too.
Done


Line 122:     the time of the retry, then the retry should be attached to the 
ongoing transaction.
> another option here is to bounce the client with an "IN_PROGRESS" or "SERVE
It shouldn't be too hard to do it without tying up the handler thread: a
possible CompletionRecordStore would automatically either execute the request
inline (or however it was executed before) or register a callback to be
executed on completion by the executing thread.
Changed the verbiage to make that clear.


Line 125: with log (wal) garbage collection, with the added complexity that we 
should then remove
> Nit: WAL
Done


Line 126: these records also from the Replay Cache component.
> They don't necessarily need to be coupled, right? HDFS uses a time based ex
In [1] garbage collection is tied to the leases (which is one of its goals). I
went with the "safe" way, but you're right that it might not be enough. I'll
mention those alternatives. Note that the record lifetime policy is likely an
isolated enough issue that we can iterate on it.


Line 132: interfaces, implemeted by other components, from which to fetch/store 
completion records
> Nit: implemented
Done


Line 138: Milestone 1 - Define interfaces for Completion Record storage and 
handling and implement
> Nit: use some markdown to make the milestones stand out a little more. Mayb
Done


Line 140:  
> Nit: trailing whitespace in the doc. Look for red blocks in gerrit.
Done


Line 143: cat. 4 operation), which is arguably the most important one. We'll 
define a
> Would be good to explicitly state why it's most important (i.e. most common
Done


Line 145: 1 - Start new operations if this is a new RPC.
> Nit: not a markdown-friendly list.
Done


Line 146: 2 - For RPCs that have previsouly completed it will be able to fetch 
existing completion
> Nit: previously
Done


Line 154: wal. Note that this component must cache not only the RPC "result" 
(i.e. SUCCESS/ABORTED
> Nit: WAL. Elsewhere too.
Done


Line 158: Wal garbage collections should be set higher
> Yea, I'm not sure it makes sense to couple the two. Assumedly the minimum t
I corrected this here and above to refer to the time and size policies only
(for the case where we don't have the leases).


Line 158: Wal garbage collections should be set higher
> That's the most worrying part to me, especially since we still don't identi
See my answer below.


Line 163: Client ID issueing will be done by the master, when a client connects 
to the master for
> Would be nice to see more detail on how the master persists client IDs.
Those would have to go into the master tablet, but I've changed how we're going
to generate the IDs (at least initially), so I won't go into that for now.


Line 167: starting step.
> As long as it's configurable by gflag, I agree this is a reasonable first s
See my answer above.


Line 174: etc.). For these and generally for all the RPCs that fall into 
category 2 (mentioned
> I think these RPCs are more like 'cat 3' by your terminology -- they have f
All the replicated RPCs (like create table/delete table) will require an ad-hoc
completion store (maybe the same, maybe different, not sure yet). For the
non-replicated requests, we were observing problems specifically with these two
RPCs a while ago, which is why they were the first that came to mind. Changed
the example.


Line 175: in Introdction) we would benefit from having an RPC layer retry 
mechanism, for the sender,
> Nit: the introduction
Done


Line 181: CompletionRecordStore. This local version would not having a 
durability mechanism and would
> Nit: have, not having
Done


Line 184: help immensely in consoldating retry logic that we have spread all 
over. 
> Presumably there'd be a CompletionRecordStore for the master as well? If it
Basically we'd have a CompletionRecordStore for the replicated operations that
would be backed by consensus (for the TSs and the Master), a generic one for
most of the non-replicated operations for which we'd still want some EO
semantics (probably non-persistent), and a NoOp version that just passes
through to the current impl.
Made these choices clearer when enunciating the options.


Line 186: Milestone 3 - Client leases and coordinated garbage collection. 
Include all cat. 3 operations.
> Per a comment above, I dont really see the value in client leases and maste
I would like to consider Milestone 1 as "something that works, albeit just for
Write()" and, as I really don't want to be messing with log anchoring if we're
going to adopt a simple time/space based policy, I moved the time-based policy
to M1 and left the space-based policy for M2.


Line 189: leases are stored and garbage collection is performed with the help 
with a "cluster clock".
> Would be good to explain in more detail what this "cluster clock" is.
Done


Line 212: We're currently observing errors with these operations, so they 
should be classified as cat. 2.
> right, I dont see any problem with our current scheme here, where either on
Agree. I put these as cat. 2 because there were, at the time, lingering bugs
there. Maybe that's not the case anymore; demoted to cat. 3.


Line 248: Due to a central table map, these operations are already executed 
exactly once, and thus
> yea, I think these are more interesting. Alter Table is also a somewhat nas
Agree on both counts; changed this to cat. 2.


Line 253: [1] - Implementing Linearizability at Large Scale and Low Latency
        : web.stanford.edu/~ouster/cgi-bin/papers/rifl.pdf
> The markdown-approved way of linking is: 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
