We don't spawn (/link) or monitor remote processes, just monitor the local
coordinator process. That should be cheaper performance-wise. It's also only
used for relatively long-running streaming fabric requests (changes,
all_docs). But you're right, doing this for shorter requests (doc updates,
doc GETs) might add noticeable overhead. Perhaps a pool of reusable
monitoring processes would work there...
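
The shape of it is roughly the following (a simplified sketch of the pattern
from the commit linked below, not the actual fabric code; rexi:kill/2 is the
existing worker-kill call, the other names are illustrative):

    %% Spawned once per streaming request. The coordinator adds workers
    %% as it spawns them and sends 'stop' when it runs normal cleanup.
    spawn_cleaner(Coordinator) ->
        spawn(fun() ->
            Ref = erlang:monitor(process, Coordinator),
            cleaner(Ref, [])
        end).

    cleaner(Ref, Workers) ->
        receive
            {add_worker, Node, WorkerRef} ->
                cleaner(Ref, [{Node, WorkerRef} | Workers]);
            stop ->
                %% Coordinator ran its own cleanup; exit quietly so the
                %% workers don't get a second round of kill messages.
                ok;
            {'DOWN', Ref, process, _Coordinator, _Reason} ->
                %% Coordinator died without cleaning up; kill the
                %% workers on its behalf.
                lists:foreach(fun({Node, WorkerRef}) ->
                    rexi:kill(Node, WorkerRef)
                end, Workers)
        end.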

For couch_server timeouts, I wonder if we can do a simpler thing: inspect
the `From` part of each call, and if the Pid is no longer alive, drop the
request, or at least avoid doing any expensive processing for it. For casts
it might involve including the sender Pid in the message. That doesn't
address timeouts as such, just the case where the coordinating process went
away while the message was stuck in a long message queue.
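
Something like this, as a sketch (handle_call_int/3 is a stand-in for the
existing couch_server call handling; note is_process_alive/1 only works for
local Pids, hence the guard):

    handle_call(Request, {Pid, _Tag} = From, State)
            when node(Pid) =:= node() ->
        case is_process_alive(Pid) of
            false ->
                %% The caller died while its message sat in our queue;
                %% skip the expensive work and don't bother replying.
                {noreply, State};
            true ->
                handle_call_int(Request, From, State)
        end;
    handle_call(Request, From, State) ->
        %% Remote caller; can't check liveness cheaply, handle as usual.
        handle_call_int(Request, From, State).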

On Mon, Apr 22, 2019 at 4:32 PM Robert Newson <rnew...@apache.org> wrote:

> My memory is fuzzy, but those items sound a lot like what happens with
> rex, which motivated us (i.e., Adam) to build rexi, which deliberately does
> less than the stock approach.
>
> --
>   Robert Samuel Newson
>   rnew...@apache.org
>
> On Mon, 22 Apr 2019, at 18:33, Nick Vatamaniuc wrote:
> > Hi everyone,
> >
> > We partially implemented the first part (cleaning up rexi workers) for all
> > the fabric streaming requests, which should be all_docs, changes, view map,
> > and view reduce:
> >
> > https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d
> >
> > The pattern there is the following:
> >
> >  - With every request, spawn a monitoring process that is in charge of
> >    keeping track of all the workers as they are spawned.
> >  - If regular cleanup takes place, then this monitoring process is killed,
> >    to avoid sending double the number of kill messages to the workers.
> >  - If the coordinating process doesn't run cleanup and just dies, the
> >    monitoring process performs cleanup on its behalf.
> >
> > Cheers,
> > -Nick
> >
> >
> >
> > On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson <rnew...@apache.org> wrote:
> >
> > > My view is a) the server was unavailable for this request due to all the
> > > other requests it’s currently dealing with, b) the connection was not
> > > idle, the client is not at fault.
> > >
> > > B.
> > >
> > > > On 18 Apr 2019, at 22:03, Done Collectively <sans...@inator.biz> wrote:
> > > >
> > > > Any reason 408 would be undesirable?
> > > >
> > > > https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
> > > >
> > > >
> > > > On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:
> > > >
> > > >> 503 imo.
> > > >>
> > > >> --
> > > >>  Robert Samuel Newson
> > > >>  rnew...@apache.org
> > > >>
> > > >> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> > > >>> Yes, we should. Currently it’s a 500, maybe there’s something more
> > > >>> appropriate:
> > > >>>
> > > >>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> > > >>>
> > > >>> Adam
> > > >>>
> > > >>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
> > > >>>>
> > > >>>> What happens when it turns out the client *hasn't* timed out and we
> > > >>>> just...hang up on them? Should we consider at least trying to send
> > > >>>> back some sort of HTTP status code?
> > > >>>>
> > > >>>> -Joan
> > > >>>>
> > > >>>> On 2019-04-18 10:58, Garren Smith wrote:
> > > >>>>> I'm +1 on this. With partition queries, we added a few more timeouts
> > > >>>>> that can be enabled, which Cloudant enables. So having the ability
> > > >>>>> to shed old requests when these timeouts get hit would be great.
> > > >>>>>
> > > >>>>> Cheers
> > > >>>>> Garren
> > > >>>>>
> > > >>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
> > > >>>>>
> > > >>>>>> Hi all,
> > > >>>>>>
> > > >>>>>> For once, I’m coming to you with a topic that is not strictly about
> > > >>>>>> FoundationDB :)
> > > >>>>>>
> > > >>>>>> CouchDB offers a few config settings (some of them undocumented) to
> > > >>>>>> put a limit on how long the server is allowed to take to generate a
> > > >>>>>> response. The trouble with many of these timeouts is that, when they
> > > >>>>>> fire, they do not actually clean up all of the work that they
> > > >>>>>> initiated. A couple of examples:
> > > >>>>>>
> > > >>>>>> - Each HTTP response coordinated by the “fabric” application spawns
> > > >>>>>> several ephemeral processes via “rexi” on different nodes in the
> > > >>>>>> cluster to retrieve data and send it back to the process
> > > >>>>>> coordinating the response. If the request timeout fires, the
> > > >>>>>> coordinating process will be killed off, but the ephemeral workers
> > > >>>>>> might not be. In a healthy cluster they’ll exit on their own when
> > > >>>>>> they finish their jobs, but there are conditions under which they
> > > >>>>>> can sit around for extended periods of time waiting for an
> > > >>>>>> overloaded gen_server (e.g. couch_server) to respond.
> > > >>>>>>
> > > >>>>>> - Those named gen_servers (like couch_server) responsible for
> > > >>>>>> serializing access to important data structures will dutifully
> > > >>>>>> process messages received from old requests without any regard for
> > > >>>>>> (or even knowledge of) the fact that the client that sent the
> > > >>>>>> message timed out long ago. This can lead to a sort of death spiral
> > > >>>>>> in which the gen_server is ultimately spending ~all of its time
> > > >>>>>> serving dead clients and every client is timing out.
> > > >>>>>>
> > > >>>>>> I’d like to see us introduce a documented maximum request duration
> > > >>>>>> for all requests except the _changes feed, and then use that
> > > >>>>>> information to aid in load shedding throughout the stack. We can
> > > >>>>>> audit the codebase for gen_server calls with long timeouts (I know
> > > >>>>>> of a few on the critical path that set their timeouts to
> > > >>>>>> `infinity`) and we can design servers that efficiently drop old
> > > >>>>>> requests, knowing that the client who made the request must have
> > > >>>>>> timed out. A couple of topics for discussion:
> > > >>>>>>
> > > >>>>>> - the “gen_server that sheds old requests” is a very generic
> > > >>>>>> pattern, one that seems like it could be well-suited to its own
> > > >>>>>> behaviour. A cursory search of the internet didn’t turn up any
> > > >>>>>> prior art here, which surprises me a bit. I’m wondering if this is
> > > >>>>>> worth bringing up with the broader Erlang community. A bare-bones
> > > >>>>>> sketch of the shedding idea follows after the next item.
> > > >>>>>>
> > > >>>>>> - setting and enforcing timeouts is a healthy pattern for read-only
> > > >>>>>> requests as it gives a lot more feedback to clients about the
> > > >>>>>> health of the server. When it comes to updates, things are a little
> > > >>>>>> bit more muddy, just because there remains a chance that an update
> > > >>>>>> can be committed, but the caller times out before learning of the
> > > >>>>>> successful commit. We should try to minimize the likelihood of that
> > > >>>>>> occurring.
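> > > >>>>>>
> > > >>>>>> To make the first item concrete, here is a bare-bones sketch of
> > > >>>>>> the shedding side (names are illustrative; the deadline is a
> > > >>>>>> node-local monotonic timestamp, so this only covers same-node
> > > >>>>>> callers):
> > > >>>>>>
> > > >>>>>>     %% Caller: stamp an absolute deadline into the request.
> > > >>>>>>     call(Server, Request, TimeoutMs) ->
> > > >>>>>>         Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
> > > >>>>>>         gen_server:call(Server, {Deadline, Request}, TimeoutMs).
> > > >>>>>>
> > > >>>>>>     %% Server: shed any request whose deadline has passed, since
> > > >>>>>>     %% that caller must have timed out already.
> > > >>>>>>     handle_call({Deadline, Request}, From, State) ->
> > > >>>>>>         case erlang:monotonic_time(millisecond) > Deadline of
> > > >>>>>>             true -> {noreply, State};
> > > >>>>>>             false -> do_handle_call(Request, From, State)
> > > >>>>>>         end.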
> > > >>>>>>
> > > >>>>>> Cheers, Adam
> > > >>>>>>
> > > >>>>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but
> > > >>>>>> of course FDB has a hard 5 second limit on all transactions, so it
> > > >>>>>> is a bit of a forcing function :). Even putting FoundationDB aside,
> > > >>>>>> I would still argue to pursue this path based on our Ops experience
> > > >>>>>> with the current codebase.
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
>
