Todd Lipcon has posted comments on this change.

Change subject: WIP: Add a design doc for rpc retry/failover semantics
......................................................................
Patch Set 1:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/2642/1/docs/design-docs/rpc-retry-and-failover.md
File docs/design-docs/rpc-retry-and-failover.md:

Line 76: 3- A hybrid of the above - Certain parts of the problem would be handled by the RPC layer,
This is more or less what HDFS does. https://issues.apache.org/jira/browse/HDFS-4942 has a small design doc and some good discussion. Basically, it's handled at the RPC layer for the client -- their client knows how to auto-retry certain types of errors -- but handled ad hoc per-RPC on the server side. See https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L2153 for example usage.

The discussion on the above-linked JIRA has some good points about why doing it per-RPC on the server side is reasonable (in particular, because some RPCs' idempotence may only be determined from the specific request contents).


Line 90: - Identification - Although the clients don't currently have an ID, a natural place to issue
> Do clients needs to persist the ID somehow? Or is it OK for a client to get

I don't really follow why it can't just be a client-generated UUID. This is what Hadoop does, and I haven't heard any complaints: https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/ClientId.java

Seems much simpler than any "distributed system" component here.


Line 122: the time of the retry, then the retry should be attached to the ongoing transaction.
Another option here is to bounce the client with an "IN_PROGRESS" or "SERVER_TOO_BUSY" type error and let them retry later.
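To make the "bounce" option concrete, a client-side retry loop on a retriable error might look roughly like the sketch below. The status codes and function names are hypothetical illustration, not Kudu's actual RPC API; the sleep is injected so the logic stays testable, and real code would also add jitter.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical status codes for the "bounce the client" option: the
// server rejects with a retriable error instead of tying up a handler
// thread, and the client backs off and retries.
enum class RpcStatus { OK, IN_PROGRESS, SERVER_TOO_BUSY, FATAL };

// Retries a call while the server reports a retriable condition,
// doubling the backoff each attempt until a deadline is exceeded.
RpcStatus CallWithRetries(const std::function<RpcStatus()>& call,
                          const std::function<void(int64_t)>& sleep_ms,
                          int64_t initial_backoff_ms,
                          int64_t deadline_ms) {
  int64_t elapsed_ms = 0;
  int64_t backoff_ms = initial_backoff_ms;
  while (true) {
    RpcStatus s = call();
    if (s != RpcStatus::IN_PROGRESS && s != RpcStatus::SERVER_TOO_BUSY) {
      return s;  // OK or a non-retriable failure: surface immediately.
    }
    if (elapsed_ms + backoff_ms > deadline_ms) {
      return s;  // Out of time; surface the retriable error to the caller.
    }
    sleep_ms(backoff_ms);
    elapsed_ms += backoff_ms;
    backoff_ms *= 2;
  }
}
```

The appeal of this over "attach" is exactly the handler-thread concern below: the server does constant work per bounced request instead of parking a thread per waiting caller.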
We could do the "attach" thing if we can do so in a way that doesn't tie up a handler thread, but otherwise I'm worried that if an op is slow, callers with short timeouts will just end up retrying repeatedly and stacking up handler threads all blocked on the same slow request.


Line 126: these records also from the Replay Cache component.
They don't necessarily need to be coupled, right? HDFS uses time-based expiration. We could log to consensus for replication purposes but still expire from the in-memory retry cache after a timeout.

I think it's worth adding, somewhere up towards the top of this document, some of the requirements around how long we may need to cache for retries. HDFS sets the default to 10 minutes apparently, but also limits the cache to 3% of the total NN memory. I'm also wondering whether it's worth worrying about retries crossing server restarts or not. It would be worth understanding whether things get substantially simpler if we cut some of these requirements down.

Some data gathered around average response size for different RPCs would be useful, too (both compressed and uncompressed). My ballpark gut feel is that most of the RPCs in question (e.g. writes) are usually successful, and in the successful case the responses are very small (~10-20 bytes). Some analysis on this would tell us what kind of overhead we might expect if we started logging the RPC responses into the WAL.


Line 158: Wal garbage collections should be set higher
> That's the most worrying part to me, especially since we still don't identi

Yea, I'm not sure it makes sense to couple the two. Presumably the minimum time-based WAL retention is going to be at least as long as the retry cache retention (since we don't expect a client to fail an RPC and then want to retry 10 minutes later). So I don't think this should try to integrate at all with WAL retention, but instead just be time-based (and size-limited).


Line 174: etc.).
> For these and generally for all the RPCs that fall into category 2 (mentioned

I think these RPCs are more like 'cat 3' by your terminology -- they have feasible alternatives already (e.g. a duplicated CreateTablet or DeleteTablet is fine). Did you mean CreateTable and DeleteTable, which are more messy?

I still think handling these per-RPC with some simple code (as HDFS does) is better than trying to over-generalize (and the master WAL can serve as a completion store for this as well).


Line 186: Milestone 3 - Client leases and coordinated garbage collection. Include all cat. 3 operations.
Per a comment above, I don't really see the value in client leases and master-assigned client IDs over just using UUIDs. I think I'd prioritize garbage collection (a simple TTL-based thing) above the 'milestone 2' stuff.


Line 212: We're currently observing errors with these operations, so they should be classified as cat. 2.
> Isn't it kind of AlterSchema though? Viewed from the tserver's perspective

Right, I don't see any problem with our current scheme here, where either one can be safely duplicated. Sure, we had to put on our thinking caps for a few minutes to get it right, but the resulting code or policy isn't too complicated, and "design things to be idempotent when possible" is a good policy IMO.


Line 248: Due to a central table map, these operations are already executed exactly once, and thus
> But there's still ambiguity if multiple clients try to create/alter/delete

Yea, I think these are more interesting. AlterTable is also a somewhat nasty one, since on a retry you'll get "column already exists" if you try to add a new column.
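To make the TTL-based, size-capped retry cache argued for above (Line 126/158) concrete, here's a rough sketch of the shape such a component could take. All names are hypothetical illustration, not a proposed implementation: the cache is keyed by (client ID, sequence number), replays the cached response on a duplicate, expires entries by wall-clock age, and is capped by entry count as a crude stand-in for a memory limit like HDFS's 3%-of-heap bound. Time is passed in explicitly so the sketch is testable.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Retry cache keyed by (client_id, sequence_number). Entries expire
// after a fixed TTL (HDFS defaults to 10 minutes) and the cache is
// additionally capped by entry count.
class RetryCache {
 public:
  RetryCache(int64_t ttl_ms, size_t max_entries)
      : ttl_ms_(ttl_ms), max_entries_(max_entries) {}

  // Returns true and fills *response if this (client, seq_no) pair
  // already completed within the TTL, so the caller can replay the
  // stored response instead of re-executing the operation.
  bool Lookup(const std::string& client_id, int64_t seq_no,
              int64_t now_ms, std::string* response) {
    Expire(now_ms);
    auto it = entries_.find(Key(client_id, seq_no));
    if (it == entries_.end()) return false;
    *response = it->second.response;
    return true;
  }

  // Records the response of a completed RPC.
  void Insert(const std::string& client_id, int64_t seq_no,
              int64_t now_ms, std::string response) {
    Expire(now_ms);
    std::string key = Key(client_id, seq_no);
    lru_.push_back(key);
    entries_[key] = Entry{now_ms, std::move(response)};
    while (entries_.size() > max_entries_) EvictOldest();
  }

 private:
  struct Entry { int64_t insert_ms; std::string response; };

  // Simplification: assumes client IDs contain no '/'.
  static std::string Key(const std::string& c, int64_t s) {
    return c + "/" + std::to_string(s);
  }
  // Drops entries older than the TTL; insertion order approximates
  // expiry order since timestamps are monotonically supplied.
  void Expire(int64_t now_ms) {
    while (!lru_.empty()) {
      auto it = entries_.find(lru_.front());
      if (it != entries_.end() &&
          now_ms - it->second.insert_ms < ttl_ms_) break;
      EvictOldest();
    }
  }
  void EvictOldest() {
    entries_.erase(lru_.front());
    lru_.pop_front();
  }

  int64_t ttl_ms_;
  size_t max_entries_;
  std::list<std::string> lru_;  // keys in insertion order
  std::unordered_map<std::string, Entry> entries_;
};
```

Nothing here touches WAL retention: replication of the entries could still go through consensus, while expiry stays purely time- and size-based as suggested above.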
--
To view, visit http://gerrit.cloudera.org:8080/2642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes
