Todd Lipcon has posted comments on this change.

Change subject: WIP: Add a design doc for rpc retry/failover semantics
......................................................................
Patch Set 1:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/2642/1/docs/design-docs/rpc-retry-and-failover.md
File docs/design-docs/rpc-retry-and-failover.md:

Line 76: 3- A hybrid of the above - Certain parts of the problem would be handled by the RPC layer,
This is more or less what HDFS does. https://issues.apache.org/jira/browse/HDFS-4942 has a small design doc and some good discussion. Basically, it's handled at the RPC layer for the client -- their client knows how to auto-retry certain types of errors -- but handled ad hoc per-RPC on the server side. See https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2153 for example usage.

The discussion on the above-linked JIRA has some good points about why doing it per-RPC on the server side is reasonable (in particular, because some RPCs' idempotence may only be determined from the specific request contents).


Line 90: - Identification - Although the clients don't currently have an ID, a natural place to issue
> Do clients needs to persist the ID somehow? Or is it OK for a client to get

I don't really follow why it can't just be a client-generated UUID. This is what Hadoop does, and I haven't heard any complaints: https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/ClientId.java

Seems much simpler than any "distributed system" component here.


Line 122: the time of the retry, then the retry should be attached to the ongoing transaction.
Another option here is to bounce the client with an "IN_PROGRESS" or "SERVER_TOO_BUSY" type error and let them retry later.
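To make the "bounce" option concrete, a client-side retry loop on a retriable error might look roughly like the sketch below. The status codes and function names are hypothetical illustration, not Kudu's actual RPC API; the sleep is injected so the logic stays testable, and real code would also add jitter.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical status codes for the "bounce the client" option: the
// server rejects with a retriable error instead of tying up a handler
// thread, and the client backs off and retries.
enum class RpcStatus { OK, IN_PROGRESS, SERVER_TOO_BUSY, FATAL };

// Retries a call while the server reports a retriable condition,
// doubling the backoff each attempt until a deadline is exceeded.
RpcStatus CallWithRetries(const std::function<RpcStatus()>& call,
                          const std::function<void(int64_t)>& sleep_ms,
                          int64_t initial_backoff_ms,
                          int64_t deadline_ms) {
  int64_t elapsed_ms = 0;
  int64_t backoff_ms = initial_backoff_ms;
  while (true) {
    RpcStatus s = call();
    if (s != RpcStatus::IN_PROGRESS && s != RpcStatus::SERVER_TOO_BUSY) {
      return s;  // OK or a non-retriable failure: surface immediately.
    }
    if (elapsed_ms + backoff_ms > deadline_ms) {
      return s;  // Out of time; surface the retriable error to the caller.
    }
    sleep_ms(backoff_ms);
    elapsed_ms += backoff_ms;
    backoff_ms *= 2;
  }
}
```

The appeal of this over "attach" is exactly the handler-thread concern below: the server does constant work per bounced request instead of parking a thread per waiting caller.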
We could do the "attach" thing if we can do so in a way that doesn't tie up a handler thread, but otherwise I'm worried that if an op is slow, callers with short timeouts will just end up retrying repeatedly and stacking up handler threads all blocked on the same slow request.


Line 126: these records also from the Replay Cache component.
They don't necessarily need to be coupled, right? HDFS uses time-based expiration. We could log to consensus for replication purposes but still expire from the in-memory retry cache after a timeout.

I think it's worth adding, somewhere up towards the top of this document, some of the requirements around how long we may need to cache for retries. HDFS sets the default to 10 minutes apparently, but also limits the cache to 3% of the total NN memory. I'm also wondering whether it's worth worrying about retries crossing server restarts or not. It would be worth understanding whether things get substantially simpler if we cut some of these requirements down.

Some data gathered around average response size for different RPCs would be useful, too (both compressed and uncompressed). My ballpark gut feel is that most of the RPCs in question (e.g. writes) are usually successful, and in the successful case the responses are very small (~10-20 bytes). Some analysis on this would tell us what kind of overhead we might expect if we started logging the RPC responses into the WAL.


Line 158: Wal garbage collections should be set higher
> That's the most worrying part to me, especially since we still don't identi

Yea, I'm not sure it makes sense to couple the two. Presumably the minimum time-based WAL retention is going to be at least as long as the retry cache retention (since we don't expect a client to fail an RPC and then want to retry 10 minutes later). So I don't think this should try to integrate at all with WAL retention, but instead just be time-based (and size-limited).


Line 174: etc.).
> For these and generally for all the RPCs that fall into category 2 (mentioned

I think these RPCs are more like 'cat 3' by your terminology -- they have feasible alternatives already (e.g. a duplicated CreateTablet or DeleteTablet is fine). Did you mean CreateTable and DeleteTable, which are more messy?

I still think handling these per-RPC with some simple code (as HDFS does) is better than trying to over-generalize (and the master WAL can serve as a completion store for this as well).


Line 186: Milestone 3 - Client leases and coordinated garbage collection. Include all cat. 3 operations.
Per a comment above, I don't really see the value in client leases and master-assigned client IDs over just using UUIDs. I think I'd prioritize garbage collection (a simple TTL-based thing) above the 'milestone 2' stuff.


Line 212: We're currently observing errors with these operations, so they should be classified as cat. 2.
> Isn't it kind of AlterSchema though? Viewed from the tserver's perspective

Right, I don't see any problem with our current scheme here, where either one can be safely duplicated. Sure, we had to put on our thinking caps for a few minutes to get it right, but the resulting code or policy isn't too complicated, and "design things to be idempotent when possible" is a good policy IMO.


Line 248: Due to a central table map, these operations are already executed exactly once, and thus
> But there's still ambiguity if multiple clients try to create/alter/delete

Yea, I think these are more interesting. AlterTable is also a somewhat nasty one, since on a retry you'll get "column already exists" if you try to add a new column.
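To make the TTL-based, size-capped retry cache argued for above (Line 126/158) concrete, here's a rough sketch of the shape such a component could take. All names are hypothetical illustration, not a proposed implementation: the cache is keyed by (client ID, sequence number), replays the cached response on a duplicate, expires entries by wall-clock age, and is capped by entry count as a crude stand-in for a memory limit like HDFS's 3%-of-heap bound. Time is passed in explicitly so the sketch is testable.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Retry cache keyed by (client_id, sequence_number). Entries expire
// after a fixed TTL (HDFS defaults to 10 minutes) and the cache is
// additionally capped by entry count.
class RetryCache {
 public:
  RetryCache(int64_t ttl_ms, size_t max_entries)
      : ttl_ms_(ttl_ms), max_entries_(max_entries) {}

  // Returns true and fills *response if this (client, seq_no) pair
  // already completed within the TTL, so the caller can replay the
  // stored response instead of re-executing the operation.
  bool Lookup(const std::string& client_id, int64_t seq_no,
              int64_t now_ms, std::string* response) {
    Expire(now_ms);
    auto it = entries_.find(Key(client_id, seq_no));
    if (it == entries_.end()) return false;
    *response = it->second.response;
    return true;
  }

  // Records the response of a completed RPC.
  void Insert(const std::string& client_id, int64_t seq_no,
              int64_t now_ms, std::string response) {
    Expire(now_ms);
    std::string key = Key(client_id, seq_no);
    lru_.push_back(key);
    entries_[key] = Entry{now_ms, std::move(response)};
    while (entries_.size() > max_entries_) EvictOldest();
  }

 private:
  struct Entry { int64_t insert_ms; std::string response; };

  // Simplification: assumes client IDs contain no '/'.
  static std::string Key(const std::string& c, int64_t s) {
    return c + "/" + std::to_string(s);
  }
  // Drops entries older than the TTL; insertion order approximates
  // expiry order since timestamps are monotonically supplied.
  void Expire(int64_t now_ms) {
    while (!lru_.empty()) {
      auto it = entries_.find(lru_.front());
      if (it != entries_.end() &&
          now_ms - it->second.insert_ms < ttl_ms_) break;
      EvictOldest();
    }
  }
  void EvictOldest() {
    entries_.erase(lru_.front());
    lru_.pop_front();
  }

  int64_t ttl_ms_;
  size_t max_entries_;
  std::list<std::string> lru_;  // keys in insertion order
  std::unordered_map<std::string, Entry> entries_;
};
```

Nothing here touches WAL retention: replication of the entries could still go through consensus, while expiry stays purely time- and size-based as suggested above.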
--
To view, visit http://gerrit.cloudera.org:8080/2642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
