What happens when it turns out the client *hasn't* timed out and we just...hang up on them? Should we consider at least trying to send back some sort of HTTP status code?
-Joan

On 2019-04-18 10:58, Garren Smith wrote:
> I'm +1 on this. With partition queries, we added a few more timeouts that
> can be enabled which Cloudant enable. So having the ability to shed old
> requests when these timeouts get hit would be great.
>
> Cheers
> Garren
>
> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
>
>> Hi all,
>>
>> For once, I’m coming to you with a topic that is not strictly about
>> FoundationDB :)
>>
>> CouchDB offers a few config settings (some of them undocumented) to put a
>> limit on how long the server is allowed to take to generate a response. The
>> trouble with many of these timeouts is that, when they fire, they do not
>> actually clean up all of the work that they initiated. A couple of examples:
>>
>> - Each HTTP response coordinated by the “fabric” application spawns
>> several ephemeral processes via “rexi” on different nodes in the cluster to
>> retrieve data and send it back to the process coordinating the response. If
>> the request timeout fires, the coordinating process will be killed off, but
>> the ephemeral workers might not be. In a healthy cluster they’ll exit on
>> their own when they finish their jobs, but there are conditions under which
>> they can sit around for extended periods of time waiting for an overloaded
>> gen_server (e.g. couch_server) to respond.
>>
>> - Those named gen_servers (like couch_server) responsible for serializing
>> access to important data structures will dutifully process messages
>> received from old requests without any regard for (or even knowledge of)
>> the fact that the client that sent the message timed out long ago. This can
>> lead to a sort of death spiral in which the gen_server is ultimately
>> spending ~all of its time serving dead clients and every client is timing
>> out.
>>
>> I’d like to see us introduce a documented maximum request duration for all
>> requests except the _changes feed, and then use that information to aid in
>> load shedding throughout the stack. We can audit the codebase for
>> gen_server calls with long timeouts (I know of a few on the critical path
>> that set their timeouts to `infinity`) and we can design servers that
>> efficiently drop old requests, knowing that the client who made the request
>> must have timed out. A couple of topics for discussion:
>>
>> - the “gen_server that sheds old requests” is a very generic pattern, one
>> that seems like it could be well-suited to its own behaviour. A cursory
>> search of the internet didn’t turn up any prior art here, which surprises
>> me a bit. I’m wondering if this is worth bringing up with the broader
>> Erlang community.
>>
>> - setting and enforcing timeouts is a healthy pattern for read-only
>> requests as it gives a lot more feedback to clients about the health of the
>> server. When it comes to updates things are a little bit more muddy, just
>> because there remains a chance that an update can be committed, but the
>> caller times out before learning of the successful commit. We should try to
>> minimize the likelihood of that occurring.
>>
>> Cheers, Adam
>>
>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
>> course FDB has a hard 5 second limit on all transactions, so it is a bit of
>> a forcing function :). Even putting FoundationDB aside, I would still argue
>> to pursue this path based on our Ops experience with the current codebase.
>
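
For anyone skimming the thread, here is a rough sketch of the "server that sheds old requests" idea from Adam's message. It is written in Python rather than Erlang, purely as an illustration, and is not CouchDB code or a proposed implementation. The names `SheddingServer`, `submit`, `drain` and `max_request_duration` are all hypothetical. The one assumption taken from the proposal is that a single documented maximum request duration applies to every request, so the server can work out when the client must have given up.

```python
import collections
import time


class SheddingServer:
    """Hypothetical queue-backed server that skips requests whose
    deadline has already passed, rather than working for dead clients."""

    def __init__(self, max_request_duration, clock=time.monotonic):
        # Assumption: one documented maximum duration for all requests.
        self.max_request_duration = max_request_duration
        self.clock = clock  # injectable so the behaviour can be tested
        self.queue = collections.deque()
        self.shed = 0  # number of requests dropped as expired

    def submit(self, payload):
        # Stamp the request with the moment its client will have timed out.
        deadline = self.clock() + self.max_request_duration
        self.queue.append((deadline, payload))

    def drain(self, handler):
        """Process every queued request, dropping the expired ones.

        Returns the handler results for the requests that were still live.
        """
        results = []
        while self.queue:
            deadline, payload = self.queue.popleft()
            if self.clock() >= deadline:
                # The client must have timed out: do no work for it.
                self.shed += 1
                continue
            results.append(handler(payload))
        return results
```

With a 5 second limit, a request queued at t=0 and only reached at t=6 is counted in `shed` and never passed to the handler, while one queued at t=4 is still served. The sketch does not cover Joan's question about clients that have not actually timed out, or the update case where a commit succeeds but the caller never learns of it.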