Hi Adam,

I'd like to bring up a concern from a recent client engagement.

They're on 1.x, where they have been doing 50k bulk update operations in a single request; 1.x doesn't time out. The updates are constructed so that none will result in a conflict or be rejected, so all 50k are accepted. They do this so the update appears atomic to the next reader - a read from another client can't occur in the middle of the big update, because we have a single couch_file in 1.x.

Obviously, in 2.x this doesn't work, on two levels. First, there are multiple readers and writers across a cluster, so the big bulk operation doesn't block interposed reads until it finishes. Second, you can't reliably finish 50k updates in a single batch in a cluster anyway, because you'll probably hit the fabric timeout, if not other cluster timeouts.

As a general rule of thumb, I advise people to keep bulk document updates to batches of no more than 1k documents at a time, with the understanding that in 2.x these are not treated as an atomic transaction (and they weren't strictly atomic in 1.x either, but never mind that...)

If we decide as a project that all operations must take less than 5 seconds, we're probably going to have to reduce the bulk update batch size even further. I'm betting 100 would be the upper bound on bulk updates.
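
For illustration, this is the kind of client-side batching I have in mind; post_bulk/1 is a stand-in for whatever call your HTTP client uses to POST one batch to _bulk_docs, and the batch size is just the guess above:

    %% Minimal sketch: split a large doc list into batches of at most N
    %% and post each batch separately. post_bulk/1 is a hypothetical
    %% helper that would POST one batch to /dbname/_bulk_docs.
    -module(bulk_batches).
    -export([update_in_batches/2]).

    update_in_batches(Docs, BatchSize) when BatchSize > 0 ->
        lists:foreach(fun post_bulk/1, chunk(Docs, BatchSize)).

    chunk([], _N) ->
        [];
    chunk(Docs, N) when length(Docs) =< N ->
        [Docs];
    chunk(Docs, N) ->
        {Batch, Rest} = lists:split(N, Docs),
        [Batch | chunk(Rest, N)].

    %% Placeholder: in a real client this would issue the HTTP request.
    post_bulk(_Batch) ->
        ok.

The caveat, of course, is that each batch (e.g. update_in_batches(Docs, 100)) is its own request, so a failure partway through leaves earlier batches already committed.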

Is this going to impose a significant performance penalty on bulk ops?

-Joan

On 2019-04-26 3:30 p.m., Adam Kocoloski wrote:
Hi all,

The point I'm making is that we should take advantage of this extra bit of information that we acquire out-of-band (e.g. we just decide as a project that all operations take less than 5 seconds) and come up with smarter / cheaper / faster ways of doing load shedding based on that information.

For example, yes it could be interesting to use is_process_alive/1 to see if a 
client is still hanging around, and have the gen_server discard the work 
otherwise. It might also be too expensive to matter; I’m not sure anyone here 
has a good a priori sense of the cost of that call. But I’d certainly wager 
it’s more expensive than calling timer:now_diff/2 in the server and discarding 
any requests that were submitted more than 5 seconds ago.
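
To make that concrete, here's a minimal sketch of the server-side check I mean; the module and request names are made up, and the only real point is the timer:now_diff/2 comparison against a timestamp the caller supplies:

    -module(shed_example).
    -behaviour(gen_server).
    -export([start_link/0, open_db/1]).
    -export([init/1, handle_call/3, handle_cast/2]).

    -define(MAX_AGE_USEC, 5 * 1000 * 1000). %% 5 second budget, in microseconds

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Callers stamp the request with their own submission time.
    open_db(Name) ->
        gen_server:call(?MODULE, {open_db, Name, os:timestamp()}).

    init([]) ->
        {ok, #{}}.

    handle_call({open_db, Name, SubmittedAt}, _From, State) ->
        case timer:now_diff(os:timestamp(), SubmittedAt) > ?MAX_AGE_USEC of
            true ->
                %% The caller has almost certainly timed out already;
                %% shed the request instead of doing the expensive work.
                {reply, {error, request_too_old}, State};
            false ->
                {reply, {ok, Name}, State} %% stand-in for the real work
        end.

    handle_cast(_Msg, State) ->
        {noreply, State}.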

Most of our timeout / cleanup solutions to date have been focused top-down, without making any assumptions about the behavior of the workers or servers underneath. I think we should try to approach this problem bottom-up, forcing every call to complete within 5 seconds and handling timeouts correctly as they bubble up.
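
On the "handling timeouts as they bubble up" point, the call sites would stop using `infinity` and instead catch the timeout exit, roughly like this (the message shape here is only illustrative):

    %% Sketch: cap the gen_server call at the 5 second budget and convert
    %% the timeout exit into a value the HTTP layer can map to e.g. a 503.
    open_with_budget(DbName, Options) ->
        try
            gen_server:call(couch_server, {open, DbName, Options}, 5000)
        catch
            exit:{timeout, _} ->
                {error, timeout}
        end.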

Adam

On Apr 23, 2019, at 2:48 PM, Nick Vatamaniuc <vatam...@gmail.com> wrote:

We don't spawn (/link) or monitor remote processes, just monitor the local
coordinator process. That should be cheaper performance-wise. It's also only
for relatively long-running streaming fabric requests (changes, all_docs). But
you're right, doing this for shorter requests (doc updates, doc GETs) might
become noticeable. Perhaps a pool of reusable monitoring processes would work
there...

As for couch_server timeouts, I wonder if we can do something simpler: inspect
the `From` part of each call and, if that Pid is no longer alive, drop the
request, or at least avoid doing any expensive processing for it. For casts it
might involve including the sender's Pid in the message. That doesn't address
timeouts, just the case where the coordinating process went away while the
message was stuck in a long message queue.
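
Something like this in couch_server's handle_call, roughly (handle_live_call/3 is just a stand-in for the existing clauses; note erlang:is_process_alive/1 only works for Pids local to the node, which is the case here since the callers are node-local workers/coordinators):

    handle_call(Request, {FromPid, _Tag} = From, State) ->
        case erlang:is_process_alive(FromPid) of
            false ->
                %% Caller is already gone: skip the work and don't reply.
                {noreply, State};
            true ->
                handle_live_call(Request, From, State)
        end.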

On Mon, Apr 22, 2019 at 4:32 PM Robert Newson <rnew...@apache.org> wrote:

My memory is fuzzy, but those items sound a lot like what happens with rex,
which motivated us (i.e., Adam) to build rexi, which deliberately does less
than the stock approach.

--
  Robert Samuel Newson
  rnew...@apache.org

On Mon, 22 Apr 2019, at 18:33, Nick Vatamaniuc wrote:
Hi everyone,

We partially implemented the first part (cleaning up rexi workers) for all the
fabric streaming requests, which should cover all_docs, changes, view map, and
view reduce:

https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d

The pattern there is the following:

- With every request, spawn a monitoring process that is in charge of keeping
track of all the workers as they are spawned.
- If regular cleanup takes place, this monitoring process is killed, to avoid
sending double the number of kill messages to the workers.
- If the coordinating process doesn't run cleanup and just dies, the monitoring
process performs the cleanup on its behalf.
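
Roughly, the shape of that monitoring process looks like this (names here are made up for illustration; the real implementation is in the commit above):

    -module(cleanup_monitor).
    -export([start/1, add_worker/2, stop/1]).

    %% Spawned once per streaming request; watches the coordinator and
    %% remembers every {Node, Ref} worker the coordinator reports.
    start(Coordinator) ->
        spawn(fun() ->
            Ref = erlang:monitor(process, Coordinator),
            loop(Ref, [])
        end).

    add_worker(Monitor, {_Node, _Ref} = Worker) ->
        Monitor ! {add_worker, Worker}.

    %% Called when the coordinator does its own cleanup, so the workers
    %% don't get a second round of kill messages.
    stop(Monitor) ->
        Monitor ! stop.

    loop(CoordRef, Workers) ->
        receive
            {add_worker, Worker} ->
                loop(CoordRef, [Worker | Workers]);
            stop ->
                ok;
            {'DOWN', CoordRef, process, _Pid, _Reason} ->
                %% Coordinator died without cleaning up; kill the rexi
                %% workers on its behalf.
                [rexi:kill(Node, Ref) || {Node, Ref} <- Workers],
                ok
        end.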

Cheers,
-Nick



On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson <rnew...@apache.org> wrote:

My view is a) the server was unavailable for this request due to all the other
requests it's currently dealing with, and b) the connection was not idle; the
client is not at fault.

B.

On 18 Apr 2019, at 22:03, Done Collectively <sans...@inator.biz> wrote:

Any reason 408 would be undesirable?

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408


On Thu, Apr 18, 2019 at 10:37 AM Robert Newson <rnew...@apache.org> wrote:

503 imo.

--
Robert Samuel Newson
rnew...@apache.org

On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
Yes, we should. Currently it’s a 500; maybe there’s something more appropriate:

https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949

Adam

On Apr 18, 2019, at 12:50 PM, Joan Touzet <woh...@apache.org> wrote:

What happens when it turns out the client *hasn't* timed out and we just...
hang up on them? Should we consider at least trying to send back some sort of
HTTP status code?

-Joan

On 2019-04-18 10:58, Garren Smith wrote:
I'm +1 on this. With partition queries, we added a few more timeouts that can
be enabled, which Cloudant does enable. So having the ability to shed old
requests when these timeouts get hit would be great.

Cheers
Garren

On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocol...@apache.org> wrote:

Hi all,

For once, I’m coming to you with a topic that is not strictly about FoundationDB :)

CouchDB offers a few config settings (some of them undocumented) to put a limit
on how long the server is allowed to take to generate a response. The trouble
with many of these timeouts is that, when they fire, they do not actually clean
up all of the work that they initiated. A couple of examples:

- Each HTTP response coordinated by the “fabric” application spawns several
ephemeral processes via “rexi” on different nodes in the cluster to retrieve
data and send it back to the process coordinating the response. If the request
timeout fires, the coordinating process will be killed off, but the ephemeral
workers might not be. In a healthy cluster they’ll exit on their own when they
finish their jobs, but there are conditions under which they can sit around for
extended periods of time waiting for an overloaded gen_server (e.g.
couch_server) to respond.

- Those named gen_servers (like couch_server) responsible for serializing
access to important data structures will dutifully process messages received
from old requests without any regard for (or even knowledge of) the fact that
the client that sent the message timed out long ago. This can lead to a sort of
death spiral in which the gen_server is ultimately spending ~all of its time
serving dead clients and every client is timing out.

I’d like to see us introduce a documented maximum request duration for all
requests except the _changes feed, and then use that information to aid in load
shedding throughout the stack. We can audit the codebase for gen_server calls
with long timeouts (I know of a few on the critical path that set their
timeouts to `infinity`) and we can design servers that efficiently drop old
requests, knowing that the client who made the request must have timed out. A
couple of topics for discussion:

- the “gen_server that sheds old requests” is a very generic pattern, one that
seems like it could be well-suited to its own behaviour. A cursory search of
the internet didn’t turn up any prior art here, which surprises me a bit. I’m
wondering if this is worth bringing up with the broader Erlang community.

- setting and enforcing timeouts is a healthy pattern for read-only requests,
as it gives a lot more feedback to clients about the health of the server. When
it comes to updates, things are a little bit more muddy, just because there
remains a chance that an update can be committed but the caller times out
before learning of the successful commit. We should try to minimize the
likelihood of that occurring.

Cheers, Adam

P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of course
FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing
function :). Even putting FoundationDB aside, I would still argue to pursue
this path based on our Ops experience with the current codebase.









