Todd Lipcon has posted comments on this change.

Change subject: WIP: Add a design doc for rpc retry/failover semantics
......................................................................


Patch Set 1:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/2642/1/docs/design-docs/rpc-retry-and-failover.md
File docs/design-docs/rpc-retry-and-failover.md:

Line 76: 3- A hybrid of the above - Certain parts of the problem would be 
handled by the RPC layer,
This is more or less what HDFS does. 
https://issues.apache.org/jira/browse/HDFS-4942 has a small design doc and some 
good discussion. Basically what they have is that it's handled at the RPC layer 
for the client -- their client knows how to auto-retry certain types of errors 
-- but handled ad-hoc per-RPC on the server side. See 
https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2153
 for example usage.

The discussion on the above-linked JIRA has some good points about why doing it 
per-RPC on the server side is reasonable (in particular because some RPCs' 
idempotence may only be determined from the specific request contents).


Line 90: - Identification - Although the clients don't currently have an ID, a 
natural place to issue
> Do clients needs to persist the ID somehow? Or is it OK for a client to get
I don't really follow why it can't just be a client-generated UUID. This is 
what Hadoop does, and I haven't heard any complaints: 
https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/ClientId.java

Seems much simpler than any "distributed system" component here.
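For concreteness, here's a rough sketch of what purely client-side ID generation could look like (class and method names are illustrative, not actual Kudu APIs):

```python
import uuid

class RpcClient:
    """Hypothetical client sketch: the ID is generated locally at startup,
    with no coordination with the master (same approach as Hadoop's ClientId)."""
    def __init__(self):
        # 128 random bits; the collision probability is negligible in practice.
        self.client_id = uuid.uuid4().bytes  # 16 raw bytes, like Hadoop's ClientId
        self._seq = 0

    def next_call_key(self):
        # (client_id, sequence number) uniquely identifies a retryable RPC;
        # a retry resends the same key so the server can detect the duplicate.
        self._seq += 1
        return (self.client_id, self._seq)
```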


Line 122:     the time of the retry, then the retry should be attached to the 
ongoing transaction.
Another option here is to bounce the client with an "IN_PROGRESS" or 
"SERVER_TOO_BUSY" type error and let them retry later. We could do the "attach" 
thing if we can do so in a way that doesn't tie up a handler thread, but 
otherwise I'm worried that if an op is slow, callers with short timeouts will 
just end up retrying repeatedly and stacking up handler threads all blocked on 
the same slow request.


Line 126: these records also from the Replay Cache component.
They don't necessarily need to be coupled, right? HDFS uses a time-based 
expiration. We could log to consensus for replication purposes but still expire 
entries from the in-memory retry cache after a timeout.

I think it's worth adding, somewhere up towards the top of this document, some of 
the requirements around how long we may need to cache responses for retries. HDFS 
apparently sets the default to 10 minutes, but also limits the cache to 3% of 
total NN memory. I'm also wondering whether it's worth worrying about retries 
crossing server restarts. It would be worth understanding whether things get 
substantially simpler if we cut some of these requirements down.

Some data on average response size for different RPCs would be 
useful, too (both compressed and uncompressed). My ballpark gut feel is that 
most of the RPCs in question (e.g. writes) are usually successful, and in the 
successful case the responses are very small (~10-20 bytes). Some analysis here 
would tell us what kind of overhead to expect if we started logging 
the RPC responses into the WAL.
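To make the "time-based expiration, decoupled from consensus" idea concrete, here's a rough sketch, assuming a (client_id, seq_no) key; the 10-minute TTL mirrors the HDFS default, and all names are illustrative rather than actual Kudu code:

```python
import collections
import time

class RetryCache:
    """Hypothetical TTL-plus-size-bounded replay cache sketch.

    Entries expire after `ttl_secs` (600s mirrors the HDFS default) and the
    cache is additionally capped at `max_entries`, evicting oldest-first."""
    def __init__(self, ttl_secs=600, max_entries=100_000, clock=time.monotonic):
        self.ttl = ttl_secs
        self.max_entries = max_entries
        self.clock = clock
        # Maps (client_id, seq_no) -> (expiry_time, cached response).
        # Constant TTL means insertion order equals expiry order.
        self._entries = collections.OrderedDict()

    def put(self, key, response):
        self._expire()
        self._entries[key] = (self.clock() + self.ttl, response)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # size cap: drop the oldest entry

    def get(self, key):
        self._expire()
        entry = self._entries.get(key)
        return entry[1] if entry else None

    def _expire(self):
        # Drop entries from the front until we hit one that is still live.
        now = self.clock()
        while self._entries:
            key, (expiry, _) = next(iter(self._entries.items()))
            if expiry > now:
                break
            del self._entries[key]
```

Nothing here depends on WAL retention or consensus; replication of the entries could still happen via the log without changing the in-memory expiry policy.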


Line 158: Wal garbage collections should be set higher
> That's the most worrying part to me, especially since we still don't identi
Yeah, I'm not sure it makes sense to couple the two. Presumably the minimum 
time-based WAL retention is going to be at least as long as the retry cache 
retention (since we don't expect a client to fail an RPC and then want to retry 
10 minutes later). So I don't think this should try to integrate with WAL 
retention at all; it should just be time-based (and size-limited).


Line 174: etc.). For these and generally for all the RPCs that fall into 
category 2 (mentioned
I think these RPCs are more like 'cat 3' by your terminology -- they have 
feasible alternatives already (e.g. a duplicated CreateTablet or DeleteTablet is 
fine).

Did you mean CreateTable and DeleteTable, which are messier?

I still think handling these per-RPC with some simple code (as HDFS does) is 
better than trying to over-generalize (and the master WAL can serve as a 
completion store for this as well).
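The per-RPC pattern is small enough to sketch inline: check the cache before executing, record the response after. This is a hypothetical illustration of the HDFS-style approach, not actual Kudu or HDFS code, and `SimpleCache`/`CreateTableRequest` are stand-ins:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleCache:
    """Stand-in for the replay cache; a real one would be TTL/size-bounded."""
    store: dict = field(default_factory=dict)
    def get(self, key): return self.store.get(key)
    def put(self, key, value): self.store[key] = value

@dataclass(frozen=True)
class CreateTableRequest:  # illustrative, not the real Kudu proto
    client_id: str
    seq_no: int
    name: str

def handle_create_table(cache, request, do_create):
    key = (request.client_id, request.seq_no)
    cached = cache.get(key)
    if cached is not None:
        return cached              # duplicate retry: replay the stored response
    response = do_create(request)  # execute the non-idempotent op once
    cache.put(key, response)       # persisting this via the WAL would survive restarts
    return response
```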


Line 186: Milestone 3 - Client leases and coordinated garbage collection. 
Include all cat. 3 operations.
Per a comment above, I don't really see the value of client leases and 
master-assigned client IDs over just using UUIDs.

I think I'd prioritize garbage collection (a simple TTL-based thing) above the 
'milestone 2' stuff.


Line 212: We're currently observing errors with these operations, so they 
should be classified as cat. 2.
> Isn't it kind of AlterSchema though? Viewed from the tserver's perspective 
Right, I don't see any problem with our current scheme here, where either one 
can be safely duplicated. Sure, we had to put on our thinking caps for a few 
minutes to get it right, but the resulting code and policy aren't too 
complicated, and "design things to be idempotent when possible" is a good 
policy IMO.


Line 248: Due to a central table map, these operations are already executed 
exactly once, and thus
> But there's still ambiguity if multiple clients try to create/alter/delete 
Yeah, I think these are more interesting. Alter Table is also a somewhat nasty 
one, since on a retry you'll get "column already exists" if you try to add a 
new column.


-- 
To view, visit http://gerrit.cloudera.org:8080/2642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
