Any reason 408 would be undesirable? https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
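Skimming RFC 7231, 408 seems to be specified for a client that was too slow to *send* its request, so maybe 503 really is the better fit for a server-side deadline. Either way the change looks mechanical; here is a sketch of what the clause might become (module name, clause shapes, and messages are my guesses for illustration, not the actual code at the chttpd.erl lines linked below, which currently returns a 500):

    %% Sketch only: the clause shapes and messages below are
    %% assumptions, not the real chttpd:error_info/1 clauses.
    -module(timeout_status).
    -export([error_info/1]).

    %% A request shed because the *server* hit its deadline:
    %% 503 invites the client to retry later.
    error_info(timeout) ->
        {503, <<"service_unavailable">>,
            <<"The server was unable to produce a response in time.">>};
    %% 408 is reserved for a client that never finished sending
    %% its request (RFC 7231, Section 6.5.7).
    error_info(Error) ->
        {500, <<"unknown_error">>,
            iolist_to_binary(io_lib:format("~p", [Error]))}.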
On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:

> 503 imo.
>
> --
> Robert Samuel Newson
> rnew...@apache.org
>
> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> > Yes, we should. Currently it’s a 500, maybe there’s something more
> > appropriate:
> >
> > https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> >
> > Adam
> >
> > > On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
> > >
> > > What happens when it turns out the client *hasn't* timed out and we
> > > just...hang up on them? Should we consider at least trying to send back
> > > some sort of HTTP status code?
> > >
> > > -Joan
> > >
> > > On 2019-04-18 10:58, Garren Smith wrote:
> > >> I'm +1 on this. With partition queries, we added a few more timeouts
> > >> that can be enabled, which Cloudant enables. So having the ability to
> > >> shed old requests when these timeouts get hit would be great.
> > >>
> > >> Cheers
> > >> Garren
> > >>
> > >> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> For once, I’m coming to you with a topic that is not strictly about
> > >>> FoundationDB :)
> > >>>
> > >>> CouchDB offers a few config settings (some of them undocumented) to
> > >>> put a limit on how long the server is allowed to take to generate a
> > >>> response. The trouble with many of these timeouts is that, when they
> > >>> fire, they do not actually clean up all of the work that they
> > >>> initiated. A couple of examples:
> > >>>
> > >>> - Each HTTP response coordinated by the “fabric” application spawns
> > >>> several ephemeral processes via “rexi" on different nodes in the
> > >>> cluster to retrieve data and send it back to the process coordinating
> > >>> the response. If the request timeout fires, the coordinating process
> > >>> will be killed off, but the ephemeral workers might not be. In a
> > >>> healthy cluster they’ll exit on their own when they finish their
> > >>> jobs, but there are conditions under which they can sit around for
> > >>> extended periods of time waiting for an overloaded gen_server (e.g.
> > >>> couch_server) to respond.
> > >>>
> > >>> - Those named gen_servers (like couch_server) responsible for
> > >>> serializing access to important data structures will dutifully
> > >>> process messages received from old requests without any regard for
> > >>> (or even knowledge of) the fact that the client that sent the message
> > >>> timed out long ago. This can lead to a sort of death spiral in which
> > >>> the gen_server is ultimately spending ~all of its time serving dead
> > >>> clients and every client is timing out.
> > >>>
> > >>> I’d like to see us introduce a documented maximum request duration
> > >>> for all requests except the _changes feed, and then use that
> > >>> information to aid in load shedding throughout the stack. We can
> > >>> audit the codebase for gen_server calls with long timeouts (I know of
> > >>> a few on the critical path that set their timeouts to `infinity`) and
> > >>> we can design servers that efficiently drop old requests, knowing
> > >>> that the client who made the request must have timed out. A couple of
> > >>> topics for discussion:
> > >>>
> > >>> - the “gen_server that sheds old requests” is a very generic pattern,
> > >>> one that seems like it could be well-suited to its own behaviour. A
> > >>> cursory search of the internet didn’t turn up any prior art here,
> > >>> which surprises me a bit. I’m wondering if this is worth bringing up
> > >>> with the broader Erlang community.
> > >>>
> > >>> - setting and enforcing timeouts is a healthy pattern for read-only
> > >>> requests as it gives a lot more feedback to clients about the health
> > >>> of the server. When it comes to updates things are a little bit more
> > >>> muddy, just because there remains a chance that an update can be
> > >>> committed, but the caller times out before learning of the successful
> > >>> commit. We should try to minimize the likelihood of that occurring.
> > >>>
> > >>> Cheers, Adam
> > >>>
> > >>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
> > >>> course FDB has a hard 5 second limit on all transactions, so it is a
> > >>> bit of a forcing function :). Even putting FoundationDB aside, I
> > >>> would still argue to pursue this path based on our Ops experience
> > >>> with the current codebase.
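To make the “gen_server that sheds old requests” idea concrete, here is a rough sketch of the pattern (the module name, message shape, and do_work/1 are all made up for illustration; nothing like this is in the tree today). The caller stamps each call with an absolute deadline, and the server discards any message whose deadline has already passed before doing the expensive work:

    -module(shed_server).
    -behaviour(gen_server).

    -export([start_link/0, call/2]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% The caller stamps the request with an absolute deadline so the
    %% server can tell whether anyone is still waiting for the answer.
    %% Passing TimeoutMs here also replaces the `infinity` timeouts
    %% Adam mentions on the critical path with a bounded one.
    call(Request, TimeoutMs) ->
        Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
        gen_server:call(?MODULE, {Deadline, Request}, TimeoutMs).

    init([]) ->
        {ok, #{}}.

    handle_call({Deadline, Request}, _From, State) ->
        case erlang:monotonic_time(millisecond) of
            Now when Now > Deadline ->
                %% The caller gave up before we dequeued the message.
                %% Skip the work entirely; a reply would go nowhere.
                {noreply, State};
            _ ->
                {reply, do_work(Request), State}
        end;
    handle_call(_Other, _From, State) ->
        {reply, {error, bad_request}, State}.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    %% Placeholder for the real (expensive) work.
    do_work(Request) ->
        {ok, Request}.

Returning {noreply, State} without ever calling gen_server:reply/2 is the sanctioned way to drop a call on the floor, and since the caller has already timed out, nobody is left waiting. One caveat: the sketch assumes caller and server share a node, because erlang:monotonic_time/1 values are only comparable within a single runtime; for the cross-node rexi case you would ship a relative time-to-live with the message instead.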