[DISCUSS] Improve load shedding by enforcing timeouts throughout stack

Adam Kocoloski Mon, 15 Apr 2019 17:41:10 -0700

Hi all,

For once, I’m coming to you with a topic that is not strictly about 
FoundationDB :)


CouchDB offers a few config settings (some of them undocumented) to put a limit 
on how long the server is allowed to take to generate a response. The trouble 
with many of these timeouts is that, when they fire, they do not actually clean 
up all of the work that they initiated. A couple of examples:

- Each HTTP response coordinated by the “fabric” application spawns several 
ephemeral processes via “rexi” on different nodes in the cluster to retrieve 
data and send it back to the process coordinating the response. If the request 
timeout fires, the coordinating process will be killed off, but the ephemeral 
workers might not be. In a healthy cluster they’ll exit on their own when they 
finish their jobs, but there are conditions under which they can sit around for 
extended periods of time waiting for an overloaded gen_server (e.g. 
couch_server) to respond.

- Those named gen_servers (like couch_server) responsible for serializing 
access to important data structures will dutifully process messages received 
from old requests without any regard for (or even knowledge of) the fact that 
the client that sent the message timed out long ago. This can lead to a sort of 
death spiral in which the gen_server is ultimately spending ~all of its time 
serving dead clients and every client is timing out.
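
To make the second failure mode concrete, here is a toy simulation. It is 
Python for brevity, not CouchDB code, and every name and number in it is an 
illustrative assumption. It models a single FIFO server that processes every 
message it receives, and counts how many replies still have a waiting client.

```python
from collections import deque


def simulate_fifo(arrival_interval, service_time, client_timeout, n_requests):
    """Toy model of a serializing server that never drops a message.

    Returns (useful, wasted):
      useful: replies produced while the client was still waiting
      wasted: requests fully processed after the client had timed out
    """
    # Request i arrives at time i * arrival_interval.
    arrivals = deque(i * arrival_interval for i in range(n_requests))
    now = 0.0
    useful = wasted = 0
    while arrivals:
        arrived = arrivals.popleft()
        now = max(now, arrived)      # idle until the request shows up
        now += service_time          # the work is always done, dead client or not
        if now <= arrived + client_timeout:
            useful += 1
        else:
            wasted += 1
    return useful, wasted


# Overloaded: requests arrive twice as fast as they can be served.
print(simulate_fifo(1, 2, 5, 20))   # (4, 16)
# Healthy: the server keeps up, so nothing is wasted.
print(simulate_fifo(3, 2, 5, 20))   # (20, 0)
```

In the overloaded run the backlog grows past the client timeout after the 
first four requests, and from then on all of the server's time goes to clients 
that have already given up.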

I’d like to see us introduce a documented maximum request duration for all 
requests except the _changes feed, and then use that information to aid in load 
shedding throughout the stack. We can audit the codebase for gen_server calls 
with long timeouts (I know of a few on the critical path that set their 
timeouts to `infinity`) and we can design servers that efficiently drop old 
requests, knowing that the client who made the request must have timed out. A 
couple of topics for discussion:

- the “gen_server that sheds old requests” is a very generic pattern, one that 
seems like it could be well-suited to its own behaviour. A cursory search of 
the internet didn’t turn up any prior art here, which surprises me a bit. I’m 
wondering if this is worth bringing up with the broader Erlang community.

- setting and enforcing timeouts is a healthy pattern for read-only requests as 
it gives a lot more feedback to clients about the health of the server. When it 
comes to updates things are a little murkier, because there remains 
a chance that an update is committed but the caller times out before 
learning of the successful commit. We should try to minimize the likelihood of 
that occurring.
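
As a rough sketch of the “server that sheds old requests” idea from the first 
topic above: each queued message carries the deadline of the request it 
belongs to, and the server skips any message whose deadline has already 
passed. This is Python rather than an Erlang behaviour, and the class name, 
method names, and injectable clock are all hypothetical, chosen only to show 
the shape of the pattern.

```python
import time
from collections import deque


class SheddingServer:
    """Hypothetical single-threaded server that skips expired requests."""

    def __init__(self, handler, max_request_duration, clock=time.monotonic):
        self.handler = handler
        self.max_request_duration = max_request_duration  # seconds
        self.clock = clock            # injectable so tests can control time
        self.mailbox = deque()
        self.shed_count = 0

    def submit(self, request, started_at=None):
        """Enqueue a request stamped with its deadline.

        started_at is when the originating request began; it defaults to now.
        """
        if started_at is None:
            started_at = self.clock()
        deadline = started_at + self.max_request_duration
        self.mailbox.append((deadline, request))

    def step(self):
        """Handle the next live request and return its result.

        Expired requests are dropped without calling the handler.
        Returns None when the mailbox holds no live request.
        """
        while self.mailbox:
            deadline, request = self.mailbox.popleft()
            if self.clock() >= deadline:
                self.shed_count += 1  # the caller must have timed out already
                continue
            return self.handler(request)
        return None


# Usage with a fake clock: the first request expires before it is served.
fake_now = [0.0]
server = SheddingServer(str.upper, 5.0, clock=lambda: fake_now[0])
server.submit("open db1")           # deadline 5.0
fake_now[0] = 3.0
server.submit("open db2")           # deadline 8.0
fake_now[0] = 6.0
result = server.step()              # sheds "open db1", handles "open db2"
```

The check only works because the maximum request duration is known and 
enforced on the client side too; without that guarantee the server could drop 
a message whose caller is still waiting.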

Cheers, Adam

P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of course 
FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing 
function :). Even putting FoundationDB aside, I would still argue to pursue this 
path based on our Ops experience with the current codebase.
