Hi all,

The point I'm getting at is that we should take advantage of this extra bit of information that we acquire out-of-band (e.g. we just decide as a project that all operations must complete in less than 5 seconds) and come up with smarter / cheaper / faster ways of doing load shedding based on that information.

For example, yes, it could be interesting to use is_process_alive/1 to see whether a client is still hanging around, and have the gen_server discard the work otherwise. It might also be too expensive to matter; I'm not sure anyone here has a good a priori sense of the cost of that call. But I'd certainly wager it's more expensive than calling timer:now_diff/2 in the server and discarding any request that was submitted more than 5 seconds ago.
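To make that concrete, here is a minimal sketch of the cheap, timestamp-based check. It assumes callers stamp each request with os:timestamp/0; the module and function names are made up for illustration and this is not existing CouchDB code:

%% Hypothetical sketch: a gen_server that refuses to do work for
%% requests that were submitted more than 5 seconds ago.
-module(drop_stale_server).
-behaviour(gen_server).

-export([start_link/0, call/1]).
-export([init/1, handle_call/3, handle_cast/2]).

-define(MAX_AGE_USEC, 5 * 1000 * 1000). % 5 seconds, in microseconds

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Client side: stamp the request at submission time.
call(Request) ->
    gen_server:call(?MODULE, {Request, os:timestamp()}, 5000).

init([]) ->
    {ok, nil}.

handle_call({Request, SubmittedAt}, _From, State) ->
    case timer:now_diff(os:timestamp(), SubmittedAt) > ?MAX_AGE_USEC of
        true ->
            %% The caller must have timed out already; skip the work.
            {reply, {error, request_too_old}, State};
        false ->
            {reply, do_work(Request), State}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.

do_work(_Request) ->
    ok.

The timestamp travels with the message, so the server can make the drop decision the moment it dequeues the request, without any extra round trips or monitoring.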
Most of our timeout / cleanup solutions to date have been focused top-down, without making any assumptions about the behavior of the workers or servers underneath. I think we should try to approach this problem bottom-up, forcing every call to complete within 5 seconds and handling timeouts correctly as they bubble up.

Adam

> On Apr 23, 2019, at 2:48 PM, Nick Vatamaniuc <vatam...@gmail.com> wrote:
>
> We don't spawn (/link) or monitor remote processes, just monitor the local coordinator process. That should be cheaper performance-wise. It's also for relatively long-running streaming fabric requests (changes, all_docs). But you're right, perhaps doing these for shorter requests (doc updates, doc GETs) might become noticeable. Perhaps a pool of reusable monitoring processes would work there...
>
> For couch_server timeouts, I wonder if we can do a simpler thing and inspect the `From` part of each call and, if the Pid is not alive, drop the request, or at least avoid doing any expensive processing. For casts it might involve sending a sender Pid in the message. That doesn't address timeouts, just the case where the coordinating process went away while the message was stuck in the long message queue.
>
> On Mon, Apr 22, 2019 at 4:32 PM Robert Newson <rnew...@apache.org> wrote:
>
>> My memory is fuzzy, but those items sound a lot like what happens with rex, which motivated us (i.e. Adam) to build rexi, which deliberately does less than the stock approach.
>>
>> --
>> Robert Samuel Newson
>> rnew...@apache.org
>>
>> On Mon, 22 Apr 2019, at 18:33, Nick Vatamaniuc wrote:
>>> Hi everyone,
>>>
>>> We partially implemented the first part (cleaning up rexi workers) for all the fabric streaming requests, which should be all_docs, changes, view map, and view reduce:
>>> https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d
>>>
>>> The pattern there is the following:
>>>
>>> - With every request, spawn a monitoring process that is in charge of keeping track of all the workers as they are spawned.
>>> - If regular cleanup takes place, then this monitoring process is killed, to avoid sending double the number of kill messages to workers.
>>> - If the coordinating process doesn't run cleanup and just dies, the monitoring process performs cleanup on its behalf.
>>>
>>> Cheers,
>>> -Nick
>>>
>>> On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson <rnew...@apache.org> wrote:
>>>
>>>> My view is a) the server was unavailable for this request due to all the other requests it's currently dealing with, b) the connection was not idle, the client is not at fault.
>>>>
>>>> B.
>>>>
>>>>> On 18 Apr 2019, at 22:03, Done Collectively <sans...@inator.biz> wrote:
>>>>>
>>>>> Any reason 408 would be undesirable?
>>>>>
>>>>> https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
>>>>>
>>>>> On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:
>>>>>
>>>>>> 503 imo.
>>>>>>
>>>>>> --
>>>>>> Robert Samuel Newson
>>>>>> rnew...@apache.org
>>>>>>
>>>>>> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
>>>>>>> Yes, we should. Currently it's a 500, maybe there's something more appropriate:
>>>>>>>
>>>>>>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
>>>>>>>
>>>>>>> Adam
>>>>>>>
>>>>>>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:
>>>>>>>>
>>>>>>>> What happens when it turns out the client *hasn't* timed out and we just...hang up on them? Should we consider at least trying to send back some sort of HTTP status code?
>>>>>>>>
>>>>>>>> -Joan
>>>>>>>>
>>>>>>>> On 2019-04-18 10:58, Garren Smith wrote:
>>>>>>>>> I'm +1 on this. With partition queries, we added a few more timeouts that can be enabled, which Cloudant enables. So having the ability to shed old requests when these timeouts get hit would be great.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Garren
>>>>>>>>>
>>>>>>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> For once, I'm coming to you with a topic that is not strictly about FoundationDB :)
>>>>>>>>>>
>>>>>>>>>> CouchDB offers a few config settings (some of them undocumented) to put a limit on how long the server is allowed to take to generate a response. The trouble with many of these timeouts is that, when they fire, they do not actually clean up all of the work that they initiated. A couple of examples:
>>>>>>>>>>
>>>>>>>>>> - Each HTTP response coordinated by the "fabric" application spawns several ephemeral processes via "rexi" on different nodes in the cluster to retrieve data and send it back to the process coordinating the response. If the request timeout fires, the coordinating process will be killed off, but the ephemeral workers might not be. In a healthy cluster they'll exit on their own when they finish their jobs, but there are conditions under which they can sit around for extended periods of time waiting for an overloaded gen_server (e.g. couch_server) to respond.
>>>>>>>>>>
>>>>>>>>>> - Those named gen_servers (like couch_server) responsible for serializing access to important data structures will dutifully process messages received from old requests without any regard for (or even knowledge of) the fact that the client that sent the message timed out long ago. This can lead to a sort of death spiral in which the gen_server is ultimately spending ~all of its time serving dead clients and every client is timing out.
>>>>>>>>>>
>>>>>>>>>> I'd like to see us introduce a documented maximum request duration for all requests except the _changes feed, and then use that information to aid in load shedding throughout the stack.
>>>>>>>>>> We can audit the codebase for gen_server calls with long timeouts (I know of a few on the critical path that set their timeouts to `infinity`) and we can design servers that efficiently drop old requests, knowing that the client who made the request must have timed out. A couple of topics for discussion:
>>>>>>>>>>
>>>>>>>>>> - the "gen_server that sheds old requests" is a very generic pattern, one that seems like it could be well-suited to its own behaviour. A cursory search of the internet didn't turn up any prior art here, which surprises me a bit. I'm wondering if this is worth bringing up with the broader Erlang community.
>>>>>>>>>>
>>>>>>>>>> - setting and enforcing timeouts is a healthy pattern for read-only requests, as it gives a lot more feedback to clients about the health of the server. When it comes to updates, things are a little bit more muddy, just because there remains a chance that an update can be committed but the caller times out before learning of the successful commit. We should try to minimize the likelihood of that occurring.
>>>>>>>>>>
>>>>>>>>>> Cheers, Adam
>>>>>>>>>>
>>>>>>>>>> P.S. I did say that this wasn't _strictly_ about FoundationDB, but of course FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing function :). Even putting FoundationDB aside, I would still argue to pursue this path based on our Ops experience with the current codebase.
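Coming back to Nick's suggestion above about inspecting the `From` of each call: a rough sketch of that check might look like the following. It is illustrative only, not the actual couch_server code, and note that erlang:is_process_alive/1 only accepts pids local to the node, hence the guard:

%% Illustrative sketch only: check whether the caller behind `From` is
%% still alive before doing any expensive work, and drop the request if not.
-module(liveness_check_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

init([]) ->
    {ok, nil}.

%% is_process_alive/1 only works for local pids, so guard on the node.
handle_call(Request, {FromPid, _Tag}, State) when node(FromPid) =:= node() ->
    case is_process_alive(FromPid) of
        false ->
            %% The caller is gone (presumably timed out); skip the work,
            %% since nobody is waiting for the reply.
            {noreply, State};
        true ->
            {reply, do_expensive_work(Request), State}
    end;
handle_call(Request, _From, State) ->
    %% Remote caller: no cheap liveness check available, just do the work.
    {reply, do_expensive_work(Request), State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

do_expensive_work(_Request) ->
    ok.

As discussed above, the open question is whether that per-message liveness check is cheap enough to be worth doing on every dequeued request, compared to a simple timestamp comparison.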