Adar Dembo has posted comments on this change. Change subject: rpc: use earliest-deadline-first RPC scheduling and rejection ......................................................................
Patch Set 3: (11 comments) http://gerrit.cloudera.org:8080/#/c/2641/3//COMMIT_MSG Commit Message: Line 38: scheduling for the queue. Because the scan RPC retries retain the deadline What of other operations? Do we faithfully retain the 'operation' deadline across all RPCs? I think we put some effort into doing so, I just don't know how successful we were. I'm asking because I wonder how RPCs that keep resetting their deadline would be affected by this policy change. Also, what's the interplay with RPC deadlines vs. operation deadlines? IIRC a scan's deadline is "min(individual RPC deadline, operation deadline)". The individual RPC timeout defaults to 5s while the operation timeout is typically greater, so you'll usually see the former dictating the RPC's configured deadline. And that means a smaller number of discrete time points with which to determine fairness. I don't know if that's problematic or not, but it's worth thinking about and perhaps discussing. http://gerrit.cloudera.org:8080/#/c/2641/3/src/kudu/rpc/rpc_stub-test.cc File src/kudu/rpc/rpc_stub-test.cc: Line 430: ASSERT_GT(min, avg / 2) Have you tried running this test in a 'slow' environment, such as with ASAN or TSAN enabled? http://gerrit.cloudera.org:8080/#/c/2641/3/src/kudu/rpc/service_pool.cc File src/kudu/rpc/service_pool.cc: Line 116: LOG(WARNING) << err_msg; Don't think this would be excessive in production? Line 119: DLOG(INFO) << err_msg << " Contents of service queue:\n" Not necessarily in scope, but I've found this raw dumping to be really excessive. Would be great to find a way to summarize it instead, like "5 RPCs of type x from server y, 3 RPCs of type x from server z", etc. http://gerrit.cloudera.org:8080/#/c/2641/3/src/kudu/rpc/service_queue.h File src/kudu/rpc/service_queue.h: Lots of similarities and differences with BlockingQueue. I think there are enough differences that bolting optional deadlines (via traits or something equivalent) onto BlockingQueue would be hard. Was that your take as well? Line 25: #include <vector> Unused. Can you double check the others? Line 37: // Blocking queue used for passing inbound RPC calls to the service handler How painful would it be for ServiceQueue to remain a templated class so as to avoid InboundCall knowledge? You'd need to provide the deadline comparator when instantiating the queue, but maybe that's it? Besides separation of concerns and potential future reuse, it would make testing the queue in isolation far easier. Line 38: // pool. Nit: re-wrap so this word isn't on its own here? Line 44: // can evict any call that does not have a deadline. This incentivizes clients to Makes sense. In practice, I think the only deadline-less calls we have are in tests. Line 73: auto it = queue_.begin(); Does this also mean that calls are retrieved in deadline order as opposed to in queue insertion order? If so, that's definitely worth calling out in the ServiceQueue contract. Line 101: if (queue_.size() >= max_queue_size_) { Is it actually possible for queue_.size() to exceed max_queue_size_? Or is this just defensive programming? -- To view, visit http://gerrit.cloudera.org:8080/2641 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I423ce5d8c54f61aeab4909393bbcac3516fe94c6 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd Lipcon <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Anonymous Coward #80 Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
