Hello Anonymous Coward #80, Adar Dembo,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/2641

to review the following change.

Change subject: WIP: rpc: improve fairness of queue rejection
......................................................................

WIP: rpc: improve fairness of queue rejection

This changes how the service pool rejects RPCs when the queue is full. It
seeks to address an issue I noticed when testing Impala on a box with 32
cores:

- Impala is spinning up 96 clients which all operate in a loop scanning
  local tablets. The "think time" in between Scan RPCs is very small,
  since the scanner threads are just pushing the requests onto an Impala-side
  queue and not doing any processing.

- With the default settings, we have 20 RPC handlers and a queue length of
  50, i.e. 70 server-side slots in total. This causes the remaining 26 threads
  to be rejected with TOO_BUSY errors on their very first Scan() RPC.

- The unlucky threads back off by going to sleep for a little bit. Meanwhile,
  every time one of the lucky threads gets a response, it sends a new RPC and
  occupies the space in the queue that was just freed up. Because we have
  exactly 70 "lucky" threads, and 70 slots on the server side, and no "think
  time", the queue is full almost all the time.

- When one of the "unlucky" threads wakes up from its backoff sleep, it is
  extremely likely that it will not find an empty queue slot, and will thus
  just get rejected again.

The result of this behavior is extreme unfairness -- those threads that got
lucky at the beginning are successfully processing lots of scan requests, but
the ones that got unlucky at the beginning get rejected over and over again
until they eventually time out.

There are several possible ways to address this issue. The simplest is
probably the approach taken by this patch: if we receive an RPC when the
service queue is full, rather than rejecting the new call, we select a random
victim already in the queue and take its place. This spreads the rejections
out much more evenly.

The patch includes a simple functional test which spawns a bunch of threads
that act somewhat like the Impala scenario above and measures the number of
successful RPCs each thread is able to send in a 2-second period. Without the
patch, I got:

I0325 19:05:51.272783 30960 rpc_stub-test.cc:391] 0 46 47 46 47 47 48 47 47 46 
47 45 0 45 0 0 46 0

In other words, of the 18 threads, 5 did not manage to complete a single
successful request, whereas the other threads all completed between 45 and 48.
With the patch, the behavior was much more fair:

I0325 19:06:07.039108 31052 rpc_stub-test.cc:391] 40 30 20 44 43 36 38 30 37 46 
33 28 39 19 15 29 46 34

In this case, the unluckiest threads still completed less than half as many
requests as the luckiest ones, so it's not perfectly _fair_. However, it
avoids the complete starvation issue, and thus should avoid the timeouts which
are currently causing Impala queries to fail.

WIP patch because this needs some more comments/cleanup, but I'll give it a shot
on a cluster.

Change-Id: I423ce5d8c54f61aeab4909393bbcac3516fe94c6
---
M src/kudu/rpc/rpc-test-base.h
M src/kudu/rpc/rpc_stub-test.cc
M src/kudu/rpc/service_pool.cc
M src/kudu/rpc/service_pool.h
M src/kudu/util/blocking_queue-test.cc
M src/kudu/util/blocking_queue.h
6 files changed, 124 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/2641/1
-- 
To view, visit http://gerrit.cloudera.org:8080/2641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I423ce5d8c54f61aeab4909393bbcac3516fe94c6
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Anonymous Coward #80
