Adar Dembo has posted comments on this change. Change subject: Refactor retry handling logic for writes ......................................................................
Patch Set 7: (12 comments) http://gerrit.cloudera.org:8080/#/c/2970/7//COMMIT_MSG Commit Message: Line 10: th new methods: WriteRpc::AnalyzeResponse() WriteRpc::RetryIfNeeded(). Nit: "th" new methods? the? three? http://gerrit.cloudera.org:8080/#/c/2970/7/src/kudu/client/batcher.cc File src/kudu/client/batcher.cc: Line 48: #include "kudu/rpc/replicated_rpc.h" Nit: before rpc.h. Line 213: can be used to choose the action to take Nit: the enum doesn't enable a "choice" per se; it explicitly tells the caller which action to take. Line 403: void WriteRpc::LookupTabletCb(const Status& status) { Shouldn't this also call into Analyze/Retry? In theory it should; in practice it's tricky because the OK case behaves a bit differently than in the usual Analyze/Retry call: we empty followers_ and retry. Line 444: if (result.status.IsRemoteError()) { I don't think this affects control flow but should improve readability: could you move L444-452 out of the if statement beginning at L442? This helps separate the two different things going on here, the first being "prefer early failures over controller failures" and the second being "check for a retriable TOO_BUSY controller failure". It also helps emphasize that the probing for this failure condition is the same kind of probing that we do below for other conditions. Line 445: const ErrorStatusPB* err = mutable_retrier()->controller().error_response(); So part of this refactor (maybe not in this patch) will include the removal of RpcRetrier, right? Line 508: DCHECK(found) << "Tablet " << tablet_->tablet_id() << ": Unable to mark replica " Nit: can you rewrite this line using strings::Substitute() so it's more clear? Line 528: default: { Nit: why the new scope? Line 537: Two things: 1. Extract result.status into a local variable. You can use a shorter variable name plus I think it's more clear than overwrite what you got from AnalyzeResponse(). 2. Add a comment saying that from here on out, the RPC either succeeded or experienced a fatal failure. http://gerrit.cloudera.org:8080/#/c/2970/7/src/kudu/client/meta_cache.cc File src/kudu/client/meta_cache.cc: Line 78: s = Status::NetworkError("No addresses for " + hp.ToString()); I know you asked me about this in chat, but could you also mention it in the commit description? http://gerrit.cloudera.org:8080/#/c/2970/7/src/kudu/rpc/replicated_rpc.h File src/kudu/rpc/replicated_rpc.h: Line 22: // Result status of a replicated Rpc. What is a "replicated" RPC? Why is it useful to constrain this object to only "replicated" RPCs? If you mean that the RPC's contents are replicated server-side into the WAL, that seems like an implementation detail that RPC code shouldn't care about. Line 23: struct ReplicatedRpcStatus { At first I expected that this and ScanRpcStatus overlapped, but now I see that, though they're structured in the same way, the Result enum is mostly different. But there are some similarities (i.e. OK, TOO_BUSY, NON_RETRIABLE_ERROR, possibly some deadline stuff that's missing here). How do you intend for the two to interplay in the future? Given the commonalities between them I'd expect something... -- To view, visit http://gerrit.cloudera.org:8080/2970 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0e0d491f902191c88c58e3d627106cc1be1bb3cc Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
