My memory is fuzzy, but those items sound a lot like what happens with rex; that's what motivated us (i.e., Adam) to build rexi, which deliberately does less than the stock approach.
--
Robert Samuel Newson
rnew...@apache.org

On Mon, 22 Apr 2019, at 18:33, Nick Vatamaniuc wrote:
> Hi everyone,
>
> We partially implemented the first part (cleaning up rexi workers) for
> all the fabric streaming requests, which should be all_docs, changes,
> view map, and view reduce:
> https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d
>
> The pattern there is the following:
>
> - With every request, spawn a monitoring process that is in charge of
>   keeping track of all the workers as they are spawned.
> - If regular cleanup takes place, this monitoring process is killed to
>   avoid sending double the number of kill messages to the workers.
> - If the coordinating process doesn't run cleanup and just dies, the
>   monitoring process will perform cleanup on its behalf.
>
> Cheers,
> -Nick
>
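[A minimal sketch of the cleanup-monitor pattern Nick describes above, in
plain Erlang. All names here (cleanup_sketch, spawn_cleaner/1, the
add_worker message) are illustrative, not the actual fabric/rexi API; in
the real code the workers live on remote nodes and are killed via rexi
messages rather than exit/2.]

    -module(cleanup_sketch).
    -export([spawn_cleaner/1]).

    %% Spawned by the coordinator. Monitors the coordinator and keeps a
    %% list of the workers the coordinator reports; if the coordinator
    %% dies without running its own cleanup, kill the workers for it.
    spawn_cleaner(Coordinator) ->
        spawn(fun() ->
            Ref = erlang:monitor(process, Coordinator),
            cleaner_loop(Ref, [])
        end).

    cleaner_loop(Ref, Workers) ->
        receive
            {add_worker, Pid} ->
                %% Coordinator reports each worker as it spawns it.
                cleaner_loop(Ref, [Pid | Workers]);
            {'DOWN', Ref, process, _Coordinator, _Reason} ->
                %% Coordinator died without normal cleanup; reap the
                %% stragglers so they don't linger in gen_server queues.
                [exit(Pid, kill) || Pid <- Workers]
        end.

[If the coordinator does run its regular cleanup path, it simply kills the
cleaner first, matching the second bullet above, so the workers are not
killed twice.]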
> On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson <rnew...@apache.org> wrote:
>
> > My view is a) the server was unavailable for this request due to all
> > the other requests it’s currently dealing with, and b) the connection
> > was not idle, so the client is not at fault.
> >
> > B.
> >
> > > On 18 Apr 2019, at 22:03, Done Collectively <sans...@inator.biz> wrote:
> > >
> > > Any reason 408 would be undesirable?
> > >
> > > https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
> > >
> > > On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:
> > >
> > >> 503 imo.
> > >>
> > >> --
> > >> Robert Samuel Newson
> > >> rnew...@apache.org
> > >>
> > >> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> > >>> Yes, we should. Currently it’s a 500; maybe there’s something more
> > >>> appropriate:
> > >>>
> > >>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> > >>>
> > >>> Adam
> > >>>
> > >>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
> > >>>>
> > >>>> What happens when it turns out the client *hasn't* timed out and
> > >>>> we just...hang up on them? Should we consider at least trying to
> > >>>> send back some sort of HTTP status code?
> > >>>>
> > >>>> -Joan
> > >>>>
> > >>>> On 2019-04-18 10:58, Garren Smith wrote:
> > >>>>> I'm +1 on this. With partition queries, we added a few more
> > >>>>> timeouts that can be enabled, which Cloudant enables. So having
> > >>>>> the ability to shed old requests when these timeouts get hit
> > >>>>> would be great.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Garren
> > >>>>>
> > >>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
> > >>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> For once, I’m coming to you with a topic that is not strictly
> > >>>>>> about FoundationDB :)
> > >>>>>>
> > >>>>>> CouchDB offers a few config settings (some of them
> > >>>>>> undocumented) to put a limit on how long the server is allowed
> > >>>>>> to take to generate a response. The trouble with many of these
> > >>>>>> timeouts is that, when they fire, they do not actually clean up
> > >>>>>> all of the work that they initiated. A couple of examples:
> > >>>>>>
> > >>>>>> - Each HTTP response coordinated by the “fabric” application
> > >>>>>> spawns several ephemeral processes via “rexi” on different
> > >>>>>> nodes in the cluster to retrieve data and send it back to the
> > >>>>>> process coordinating the response. If the request timeout
> > >>>>>> fires, the coordinating process will be killed off, but the
> > >>>>>> ephemeral workers might not be. In a healthy cluster they’ll
> > >>>>>> exit on their own when they finish their jobs, but there are
> > >>>>>> conditions under which they can sit around for extended periods
> > >>>>>> of time waiting for an overloaded gen_server (e.g.
> > >>>>>> couch_server) to respond.
> > >>>>>>
> > >>>>>> - Those named gen_servers (like couch_server) responsible for
> > >>>>>> serializing access to important data structures will dutifully
> > >>>>>> process messages received from old requests without any regard
> > >>>>>> for (or even knowledge of) the fact that the client that sent
> > >>>>>> the message timed out long ago. This can lead to a sort of
> > >>>>>> death spiral in which the gen_server is ultimately spending
> > >>>>>> ~all of its time serving dead clients and every client is
> > >>>>>> timing out.
> > >>>>>>
> > >>>>>> I’d like to see us introduce a documented maximum request
> > >>>>>> duration for all requests except the _changes feed, and then
> > >>>>>> use that information to aid in load shedding throughout the
> > >>>>>> stack. We can audit the codebase for gen_server calls with long
> > >>>>>> timeouts (I know of a few on the critical path that set their
> > >>>>>> timeouts to `infinity`) and we can design servers that
> > >>>>>> efficiently drop old requests, knowing that the client who made
> > >>>>>> the request must have timed out. A couple of topics for
> > >>>>>> discussion:
> > >>>>>>
> > >>>>>> - the “gen_server that sheds old requests” is a very generic
> > >>>>>> pattern, one that seems like it could be well-suited to its own
> > >>>>>> behaviour. A cursory search of the internet didn’t turn up any
> > >>>>>> prior art here, which surprises me a bit. I’m wondering if this
> > >>>>>> is worth bringing up with the broader Erlang community.
> > >>>>>>
> > >>>>>> - setting and enforcing timeouts is a healthy pattern for
> > >>>>>> read-only requests, as it gives a lot more feedback to clients
> > >>>>>> about the health of the server. When it comes to updates,
> > >>>>>> things are a little bit more muddy, just because there remains
> > >>>>>> a chance that an update can be committed but the caller times
> > >>>>>> out before learning of the successful commit. We should try to
> > >>>>>> minimize the likelihood of that occurring.
> > >>>>>>
> > >>>>>> Cheers, Adam
> > >>>>>>
> > >>>>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB,
> > >>>>>> but of course FDB has a hard 5 second limit on all
> > >>>>>> transactions, so it is a bit of a forcing function :). Even
> > >>>>>> putting FoundationDB aside, I would still argue to pursue this
> > >>>>>> path based on our Ops experience with the current codebase.
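[A rough sketch of the “gen_server that sheds old requests” idea from
Adam’s note above: the caller stamps each call with an absolute deadline,
and the server discards any request whose deadline has already passed
instead of doing the work. Module and function names are hypothetical,
not anything in the CouchDB tree. Note that erlang:monotonic_time/1 is
node-local, so a cross-node version would need to carry a relative time
budget instead of an absolute deadline.]

    -module(shedding_sketch).
    -behaviour(gen_server).
    -export([call/3]).
    -export([init/1, handle_call/3, handle_cast/2]).

    %% Caller computes an absolute deadline from its own timeout and
    %% sends it along with the request.
    call(Server, Request, TimeoutMs) ->
        Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
        gen_server:call(Server, {deadline, Deadline, Request}, TimeoutMs).

    init([]) ->
        {ok, #{}}.

    handle_call({deadline, Deadline, Request}, _From, State) ->
        case erlang:monotonic_time(millisecond) > Deadline of
            true ->
                %% The caller has already timed out; skip the expensive
                %% work and never reply. This is the shedding step that
                %% keeps the message queue from becoming ~all dead
                %% clients.
                {noreply, State};
            false ->
                {reply, do_work(Request, State), State}
        end.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    do_work(_Request, State) ->
        %% Placeholder for the real, possibly expensive, work.
        {ok, maps:size(State)}.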
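[On Joan’s status-code question: a hedged sketch of what reporting a shed
request as something other than a 500 might look like, written in the
style of the chttpd:error_info/1 clauses linked above. The request_timeout
atom and the choice of 503 are assumptions, not the actual code; whether
408 or 503 is the right answer is exactly the open question in the
thread.]

    %% Illustrative clause only; the atom and status code are assumed.
    error_info(request_timeout) ->
        {503, <<"service_unavailable">>,
         <<"The server could not process this request in time">>};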