Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack

2019-04-18 Thread Robert Samuel Newson
My view is that (a) the server was unavailable for this request because of all
the other requests it’s currently dealing with, and (b) the connection was not
idle, so the client is not at fault.

B.

> On 18 Apr 2019, at 22:03, Done Collectively  wrote:
> 
> Any reason 408 would be undesirable?
> 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
> 
> 
> On Thu, Apr 18, 2019 at 10:37 AM Robert Newson  wrote:
> 
>> 503 imo.
>> 
>> --
>>  Robert Samuel Newson
>>  rnew...@apache.org
>> 
>> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
>>> Yes, we should. Currently it’s a 500, maybe there’s something more
>>> appropriate:
>>> 
>>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
>>> 
>>> Adam
>>> 
>>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet  wrote:
>>>> 
>>>> What happens when it turns out the client *hasn't* timed out and we
>>>> just...hang up on them? Should we consider at least trying to send back
>>>> some sort of HTTP status code?
>>>> 
>>>> -Joan
>>>> 
>>>> On 2019-04-18 10:58, Garren Smith wrote:
>>>>> I'm +1 on this. With partition queries, we added a few more timeouts
>>>>> that can be enabled, which Cloudant enables. So having the ability to
>>>>> shed old requests when these timeouts get hit would be great.
>>>>> 
>>>>> Cheers
>>>>> Garren
>>>>> 
>>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski  wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> For once, I’m coming to you with a topic that is not strictly about
>>>>>> FoundationDB :)
>>>>>> 
>>>>>> CouchDB offers a few config settings (some of them undocumented) to
>>>>>> put a limit on how long the server is allowed to take to generate a
>>>>>> response. The trouble with many of these timeouts is that, when they
>>>>>> fire, they do not actually clean up all of the work that they
>>>>>> initiated. A couple of examples:
>>>>>> 
>>>>>> - Each HTTP response coordinated by the “fabric” application spawns
>>>>>> several ephemeral processes via “rexi” on different nodes in the
>>>>>> cluster to retrieve data and send it back to the process coordinating
>>>>>> the response. If the request timeout fires, the coordinating process
>>>>>> will be killed off, but the ephemeral workers might not be. In a
>>>>>> healthy cluster they’ll exit on their own when they finish their
>>>>>> jobs, but there are conditions under which they can sit around for
>>>>>> extended periods of time waiting for an overloaded gen_server (e.g.
>>>>>> couch_server) to respond.
>>>>>> 
>>>>>> - Those named gen_servers (like couch_server) responsible for
>>>>>> serializing access to important data structures will dutifully
>>>>>> process messages received from old requests without any regard for
>>>>>> (or even knowledge of) the fact that the client that sent the message
>>>>>> timed out long ago. This can lead to a sort of death spiral in which
>>>>>> the gen_server is ultimately spending ~all of its time serving dead
>>>>>> clients and every client is timing out.
>>>>>> 
>>>>>> I’d like to see us introduce a documented maximum request duration
>>>>>> for all requests except the _changes feed, and then use that
>>>>>> information to aid in load shedding throughout the stack. We can
>>>>>> audit the codebase for gen_server calls with long timeouts (I know of
>>>>>> a few on the critical path that set their timeouts to `infinity`) and
>>>>>> we can design servers that efficiently drop old requests, knowing
>>>>>> that the client who made the request must have timed out. A couple of
>>>>>> topics for discussion:
>>>>>> 
>>>>>> - the “gen_server that sheds old requests” is a very generic pattern,
>>>>>> one that seems like it could be well-suited to its own behaviour. A
>>>>>> cursory search of the internet didn’t turn up any prior art here,
>>>>>> which surprises me a bit. I’m wondering if this is worth bringing up
>>>>>> with the broader Erlang community.
>>>>>> 
>>>>>> - setting and enforcing timeouts is a healthy pattern for read-only
>>>>>> requests as it gives a lot more feedback to clients about the health
>>>>>> of the server. When it comes to updates things are a little bit more
>>>>>> muddy, just because there remains a chance that an update can be
>>>>>> committed, but the caller times out before learning of the successful
>>>>>> commit. We should try to minimize the likelihood of that occurring.
>>>>>> 
>>>>>> Cheers, Adam
>>>>>> 
>>>>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
>>>>>> course FDB has a hard 5 second limit on all transactions, so it is a
>>>>>> bit of a forcing function :). Even putting FoundationDB aside, I
>>>>>> would still argue to pursue this path based on our Ops experience
>>>>>> with the current codebase.
>>>>> 
>>>> 
>>> 
>>> 
>> 

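A minimal sketch of what the 503 suggestion above could look like, assuming an
error_info/1-style mapping from error terms to status codes along the lines of
the chttpd.erl code Adam links below. The clause is hypothetical and the
`request_shed` error term is an invented name, not an existing CouchDB error:

    %% Hypothetical clause, for illustration only: report a request that was
    %% dropped under load as 503 Service Unavailable rather than a generic
    %% 500. `request_shed` is an invented error term, not current CouchDB code.
    error_info(request_shed) ->
        {503, <<"service_unavailable">>,
         <<"The server is overloaded and dropped this request; retry later.">>}.

A 408 would say the client was too slow to send its request over an otherwise
idle connection, which is not what happens here; a 503 puts the responsibility
on the server and invites a retry, matching the reasoning above.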


Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack

2019-04-18 Thread Done Collectively
Any reason 408 would be undesirable?

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408


On Thu, Apr 18, 2019 at 10:37 AM Robert Newson  wrote:

> 503 imo.
>
> --
>   Robert Samuel Newson
>   rnew...@apache.org
>
> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> > Yes, we should. Currently it’s a 500, maybe there’s something more
> > appropriate:
> >
> > https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> >
> > Adam
> >
> > > On Apr 18, 2019, at 12:50 PM, Joan Touzet  wrote:
> > >
> > > What happens when it turns out the client *hasn't* timed out and we
> > > just...hang up on them? Should we consider at least trying to send back
> > > some sort of HTTP status code?
> > >
> > > -Joan
> > >
> > > On 2019-04-18 10:58, Garren Smith wrote:
> > >> I'm +1 on this. With partition queries, we added a few more timeouts
> > >> that can be enabled, which Cloudant enables. So having the ability to
> > >> shed old requests when these timeouts get hit would be great.
> > >>
> > >> Cheers
> > >> Garren
> > >>
> > >> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski  wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> For once, I’m coming to you with a topic that is not strictly about
> > >>> FoundationDB :)
> > >>>
> > >>> CouchDB offers a few config settings (some of them undocumented) to
> > >>> put a limit on how long the server is allowed to take to generate a
> > >>> response. The trouble with many of these timeouts is that, when they
> > >>> fire, they do not actually clean up all of the work that they
> > >>> initiated. A couple of examples:
> > >>>
> > >>> - Each HTTP response coordinated by the “fabric” application spawns
> > >>> several ephemeral processes via “rexi” on different nodes in the
> > >>> cluster to retrieve data and send it back to the process coordinating
> > >>> the response. If the request timeout fires, the coordinating process
> > >>> will be killed off, but the ephemeral workers might not be. In a
> > >>> healthy cluster they’ll exit on their own when they finish their
> > >>> jobs, but there are conditions under which they can sit around for
> > >>> extended periods of time waiting for an overloaded gen_server (e.g.
> > >>> couch_server) to respond.
> > >>>
> > >>> - Those named gen_servers (like couch_server) responsible for
> > >>> serializing access to important data structures will dutifully
> > >>> process messages received from old requests without any regard for
> > >>> (or even knowledge of) the fact that the client that sent the message
> > >>> timed out long ago. This can lead to a sort of death spiral in which
> > >>> the gen_server is ultimately spending ~all of its time serving dead
> > >>> clients and every client is timing out.
> > >>>
> > >>> I’d like to see us introduce a documented maximum request duration
> > >>> for all requests except the _changes feed, and then use that
> > >>> information to aid in load shedding throughout the stack. We can
> > >>> audit the codebase for gen_server calls with long timeouts (I know of
> > >>> a few on the critical path that set their timeouts to `infinity`) and
> > >>> we can design servers that efficiently drop old requests, knowing
> > >>> that the client who made the request must have timed out. A couple of
> > >>> topics for discussion:
> > >>>
> > >>> - the “gen_server that sheds old requests” is a very generic pattern,
> > >>> one that seems like it could be well-suited to its own behaviour. A
> > >>> cursory search of the internet didn’t turn up any prior art here,
> > >>> which surprises me a bit. I’m wondering if this is worth bringing up
> > >>> with the broader Erlang community.
> > >>>
> > >>> - setting and enforcing timeouts is a healthy pattern for read-only
> > >>> requests as it gives a lot more feedback to clients about the health
> > >>> of the server. When it comes to updates things are a little bit more
> > >>> muddy, just because there remains a chance that an update can be
> > >>> committed, but the caller times out before learning of the successful
> > >>> commit. We should try to minimize the likelihood of that occurring.
> > >>>
> > >>> Cheers, Adam
> > >>>
> > >>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
> > >>> course FDB has a hard 5 second limit on all transactions, so it is a
> > >>> bit of a forcing function :). Even putting FoundationDB aside, I
> > >>> would still argue to pursue this path based on our Ops experience
> > >>> with the current codebase.
> > >>
> > >
> >
> >
>


Re: [DISCUSS] Session token invalidation

2019-04-18 Thread Robert Samuel Newson
I think the blacklist idea is a non-starter because of the storage overhead.

However, I do agree that we should end the auto-extension of session cookies.
You should get exactly the configured duration and no more. When that cookie
expires, or sooner if you’re smart, you can request a new one, presenting
credentials again.

We don’t want to make a change like that without a major version bump, so I
suggest it be part of 3.0.

B.

> On 1 Apr 2019, at 15:29, Tabeth Nkangoh  wrote:
> 
> Hello all, my name is Tabeth and this is my first post. Please let me know if 
> I'm not following any conventions surrounding the usage of this mailing list. 
> Without further ado:
> 
> I believe it would be prudent for CouchDB to add the ability to invalidate 
> specific sessions. There was some discussion of this on GitHub recently 
> (https://github.com/apache/couchdb/issues/844).
> 
> To summarize, the concern was that if a session token were to be taken 
> somehow, it could be used to indefinitely renew valid session tokens. In the 
> thread, two methods to resolve this issue were discussed:
> 
>  1.  The administrator of the CouchDB instance could change the user salt or 
> derived_key, which would invalidate all sessions for the user as well as 
> prevent them from logging in. Presumably from this point the CouchDB 
> administrator would send them a new password to use.
>  2.  A user could log out by using their password and re-saving their 
> password to their _user document, regenerating the salt and/or derived_key, 
> invalidating all sessions but allowing them to continue to log in.
> 
> With this said, the two (coupled) issues I would like to discuss are:
> 
>  1.   Will there be support to invalidate specific sessions?
>  2.  Are we planning on removing the auto generation of session tokens in 
> CouchDB 2.X?
> 
> To expand on these two briefly:
> 
> Regarding invalidating specific sessions, I believe users are accustomed 
> to not being logged out of all sessions on potentially different clients 
> (mobile, browser, etc.) when they log out from one. The log-out scheme that 
> can currently be employed (summarized above) would log a user out of all 
> sessions, meaning all clients. If this is behavior we would like to remove, I 
> would recommend the usage of a blacklist. This blacklist state could be saved 
> in CouchDB itself, via a database, but isn't strictly necessary. An idea I 
> had for how this could be implemented is briefly described here: 
> https://github.com/apache/couchdb/issues/844#issuecomment-478357774.
> 
> The other advantage of this is that the semantics of the _session API's 
> DELETE would be better aligned with what would actually happen, as DELETE in 
> this scenario would deactivate the specified session. Currently _session's 
> DELETE method doesn't amount to much in practice.
> 
> ***
> 
> Regarding the auto-generation of session tokens, I'm not sure of the 
> historical reason why this was added, but in light of potentially 
> implementing (1) in CouchDB, it should be removed if at all possible. In 
> addition, even without considering (1), removing the auto-generation of 
> session tokens would also prevent the indefinite renewal of a session token, 
> given one. Instead, I believe session tokens should be extended only when 
> explicitly requested via a POST to _session with a valid username and 
> password.
> 
> 
> ***
> 
> As a final thought, if CouchDB were to remove auto-renewing sessions as well 
> as employ a blacklist, I would recommend that the _session API be modified 
> such that a user can specify an expiry duration in the request, with a 
> fall-back to the [couch_httpd_auth] timeout if none is specified.
> 
> ***
> 
> Finally, some open questions:
> 
> 
>  1.  If we do decide to employ a blacklist, should it be stored in an 
> internal database?
> *   What would be the performance impact of potentially thousands of 
> log-outs and their respective documents being added to this blacklist?
> *   Would said blacklist be regularly purged to remove blacklist 
> documents that have expired cookies? How would this be done?
>  2.  What effect would removing the automatic renewal of session tokens 
> have on legacy CouchDB usage?
> 
> I'd love to hear any and all feedback you all have. Let me know if anything 
> I'm saying is unclear and I'll try to elaborate.
> 
> Sincerely,
> Tabeth
> 
> 
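
A minimal sketch of the fixed-duration behaviour Robert describes above, for
illustration only; this is not the current couch_httpd_auth code, and the
function and atom names are invented. The issue timestamp is already carried
inside the AuthSession cookie, and the duration comes from the existing
[couch_httpd_auth] timeout setting; the key point is that a successful check
never re-issues a fresh cookie:

    %% Sketch: validate a session strictly against its original issue time.
    %% IssuedAt is the timestamp baked into the AuthSession cookie; Timeout is
    %% the configured session duration in seconds. No new cookie is sent on
    %% success, so the session ends exactly Timeout seconds after the
    %% credentials were presented and the client must POST to _session again.
    validate_session(IssuedAt, Timeout) ->
        Now = erlang:system_time(second),
        case Now - IssuedAt < Timeout of
            true  -> ok;
            false -> {error, session_expired}
        end.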



Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack

2019-04-18 Thread Robert Newson
503 imo.

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> Yes, we should. Currently it’s a 500, maybe there’s something more 
> appropriate:
> 
> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> 
> Adam
> 
> > On Apr 18, 2019, at 12:50 PM, Joan Touzet  wrote:
> > 
> > What happens when it turns out the client *hasn't* timed out and we
> > just...hang up on them? Should we consider at least trying to send back
> > some sort of HTTP status code?
> > 
> > -Joan
> > 
> > On 2019-04-18 10:58, Garren Smith wrote:
> >> I'm +1 on this. With partition queries, we added a few more timeouts that
> > >> can be enabled, which Cloudant enables. So having the ability to shed old
> >> requests when these timeouts get hit would be great.
> >> 
> >> Cheers
> >> Garren
> >> 
> >> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski  wrote:
> >> 
> >>> Hi all,
> >>> 
> >>> For once, I’m coming to you with a topic that is not strictly about
> >>> FoundationDB :)
> >>> 
> >>> CouchDB offers a few config settings (some of them undocumented) to put a
> >>> limit on how long the server is allowed to take to generate a response. 
> >>> The
> >>> trouble with many of these timeouts is that, when they fire, they do not
> >>> actually clean up all of the work that they initiated. A couple of 
> >>> examples:
> >>> 
> >>> - Each HTTP response coordinated by the “fabric” application spawns
> > >>> several ephemeral processes via “rexi” on different nodes in the cluster 
> >>> to
> >>> retrieve data and send it back to the process coordinating the response. 
> >>> If
> >>> the request timeout fires, the coordinating process will be killed off, 
> >>> but
> >>> the ephemeral workers might not be. In a healthy cluster they’ll exit on
> >>> their own when they finish their jobs, but there are conditions under 
> >>> which
> >>> they can sit around for extended periods of time waiting for an overloaded
> >>> gen_server (e.g. couch_server) to respond.
> >>> 
> >>> - Those named gen_servers (like couch_server) responsible for serializing
> >>> access to important data structures will dutifully process messages
> > >>> received from old requests without any regard for (or even knowledge
> >>> the fact that the client that sent the message timed out long ago. This 
> >>> can
> >>> lead to a sort of death spiral in which the gen_server is ultimately
> >>> spending ~all of its time serving dead clients and every client is timing
> >>> out.
> >>> 
> >>> I’d like to see us introduce a documented maximum request duration for all
> >>> requests except the _changes feed, and then use that information to aid in
> >>> load shedding throughout the stack. We can audit the codebase for
> >>> gen_server calls with long timeouts (I know of a few on the critical path
> >>> that set their timeouts to `infinity`) and we can design servers that
> >>> efficiently drop old requests, knowing that the client who made the 
> >>> request
> >>> must have timed out. A couple of topics for discussion:
> >>> 
> >>> - the “gen_server that sheds old requests” is a very generic pattern, one
> >>> that seems like it could be well-suited to its own behaviour. A cursory
> >>> search of the internet didn’t turn up any prior art here, which surprises
> >>> me a bit. I’m wondering if this is worth bringing up with the broader
> >>> Erlang community.
> >>> 
> >>> - setting and enforcing timeouts is a healthy pattern for read-only
> >>> requests as it gives a lot more feedback to clients about the health of 
> >>> the
> >>> server. When it comes to updates things are a little bit more muddy, just
> >>> because there remains a chance that an update can be committed, but the
> >>> caller times out before learning of the successful commit. We should try 
> >>> to
> >>> minimize the likelihood of that occurring.
> >>> 
> >>> Cheers, Adam
> >>> 
> >>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
> >>> course FDB has a hard 5 second limit on all transactions, so it is a bit 
> >>> of
> > >>> a forcing function :). Even putting FoundationDB aside, I would still
> >>> to pursue this path based on our Ops experience with the current codebase.
> >> 
> > 
> 
>


Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack

2019-04-18 Thread Adam Kocoloski
Yes, we should. Currently it’s a 500, maybe there’s something more appropriate:

https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949

Adam

> On Apr 18, 2019, at 12:50 PM, Joan Touzet  wrote:
> 
> What happens when it turns out the client *hasn't* timed out and we
> just...hang up on them? Should we consider at least trying to send back
> some sort of HTTP status code?
> 
> -Joan
> 
> On 2019-04-18 10:58, Garren Smith wrote:
>> I'm +1 on this. With partition queries, we added a few more timeouts that
> >> can be enabled, which Cloudant enables. So having the ability to shed old
>> requests when these timeouts get hit would be great.
>> 
>> Cheers
>> Garren
>> 
>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski  wrote:
>> 
>>> Hi all,
>>> 
>>> For once, I’m coming to you with a topic that is not strictly about
>>> FoundationDB :)
>>> 
>>> CouchDB offers a few config settings (some of them undocumented) to put a
>>> limit on how long the server is allowed to take to generate a response. The
>>> trouble with many of these timeouts is that, when they fire, they do not
>>> actually clean up all of the work that they initiated. A couple of examples:
>>> 
>>> - Each HTTP response coordinated by the “fabric” application spawns
> >>> several ephemeral processes via “rexi” on different nodes in the cluster to
>>> retrieve data and send it back to the process coordinating the response. If
>>> the request timeout fires, the coordinating process will be killed off, but
>>> the ephemeral workers might not be. In a healthy cluster they’ll exit on
>>> their own when they finish their jobs, but there are conditions under which
>>> they can sit around for extended periods of time waiting for an overloaded
>>> gen_server (e.g. couch_server) to respond.
>>> 
>>> - Those named gen_servers (like couch_server) responsible for serializing
>>> access to important data structures will dutifully process messages
> >>> received from old requests without any regard for (or even knowledge of)
>>> the fact that the client that sent the message timed out long ago. This can
>>> lead to a sort of death spiral in which the gen_server is ultimately
>>> spending ~all of its time serving dead clients and every client is timing
>>> out.
>>> 
>>> I’d like to see us introduce a documented maximum request duration for all
>>> requests except the _changes feed, and then use that information to aid in
>>> load shedding throughout the stack. We can audit the codebase for
>>> gen_server calls with long timeouts (I know of a few on the critical path
>>> that set their timeouts to `infinity`) and we can design servers that
>>> efficiently drop old requests, knowing that the client who made the request
>>> must have timed out. A couple of topics for discussion:
>>> 
>>> - the “gen_server that sheds old requests” is a very generic pattern, one
>>> that seems like it could be well-suited to its own behaviour. A cursory
>>> search of the internet didn’t turn up any prior art here, which surprises
>>> me a bit. I’m wondering if this is worth bringing up with the broader
>>> Erlang community.
>>> 
>>> - setting and enforcing timeouts is a healthy pattern for read-only
>>> requests as it gives a lot more feedback to clients about the health of the
>>> server. When it comes to updates things are a little bit more muddy, just
>>> because there remains a chance that an update can be committed, but the
>>> caller times out before learning of the successful commit. We should try to
>>> minimize the likelihood of that occurring.
>>> 
>>> Cheers, Adam
>>> 
>>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
>>> course FDB has a hard 5 second limit on all transactions, so it is a bit of
> >>> a forcing function :). Even putting FoundationDB aside, I would still argue
>>> to pursue this path based on our Ops experience with the current codebase.
>> 
> 



Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack

2019-04-18 Thread Joan Touzet
What happens when it turns out the client *hasn't* timed out and we
just...hang up on them? Should we consider at least trying to send back
some sort of HTTP status code?

-Joan

On 2019-04-18 10:58, Garren Smith wrote:
> I'm +1 on this. With partition queries, we added a few more timeouts that
> can be enabled, which Cloudant enables. So having the ability to shed old
> requests when these timeouts get hit would be great.
> 
> Cheers
> Garren
> 
> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski  wrote:
> 
>> Hi all,
>>
>> For once, I’m coming to you with a topic that is not strictly about
>> FoundationDB :)
>>
>> CouchDB offers a few config settings (some of them undocumented) to put a
>> limit on how long the server is allowed to take to generate a response. The
>> trouble with many of these timeouts is that, when they fire, they do not
>> actually clean up all of the work that they initiated. A couple of examples:
>>
>> - Each HTTP response coordinated by the “fabric” application spawns
>> several ephemeral processes via “rexi” on different nodes in the cluster to
>> retrieve data and send it back to the process coordinating the response. If
>> the request timeout fires, the coordinating process will be killed off, but
>> the ephemeral workers might not be. In a healthy cluster they’ll exit on
>> their own when they finish their jobs, but there are conditions under which
>> they can sit around for extended periods of time waiting for an overloaded
>> gen_server (e.g. couch_server) to respond.
>>
>> - Those named gen_servers (like couch_server) responsible for serializing
>> access to important data structures will dutifully process messages
>> received from old requests without any regard for (or even knowledge of)
>> the fact that the client that sent the message timed out long ago. This can
>> lead to a sort of death spiral in which the gen_server is ultimately
>> spending ~all of its time serving dead clients and every client is timing
>> out.
>>
>> I’d like to see us introduce a documented maximum request duration for all
>> requests except the _changes feed, and then use that information to aid in
>> load shedding throughout the stack. We can audit the codebase for
>> gen_server calls with long timeouts (I know of a few on the critical path
>> that set their timeouts to `infinity`) and we can design servers that
>> efficiently drop old requests, knowing that the client who made the request
>> must have timed out. A couple of topics for discussion:
>>
>> - the “gen_server that sheds old requests” is a very generic pattern, one
>> that seems like it could be well-suited to its own behaviour. A cursory
>> search of the internet didn’t turn up any prior art here, which surprises
>> me a bit. I’m wondering if this is worth bringing up with the broader
>> Erlang community.
>>
>> - setting and enforcing timeouts is a healthy pattern for read-only
>> requests as it gives a lot more feedback to clients about the health of the
>> server. When it comes to updates things are a little bit more muddy, just
>> because there remains a chance that an update can be committed, but the
>> caller times out before learning of the successful commit. We should try to
>> minimize the likelihood of that occurring.
>>
>> Cheers, Adam
>>
>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
>> course FDB has a hard 5 second limit on all transactions, so it is a bit of
>> a forcing function :). Even putting FoundationDB aside, I would still argue
>> to pursue this path based on our Ops experience with the current codebase.
> 



Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack

2019-04-18 Thread Garren Smith
I'm +1 on this. With partition queries, we added a few more timeouts that
can be enabled, which Cloudant enables. So having the ability to shed old
requests when these timeouts get hit would be great.

Cheers
Garren

On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski  wrote:

> Hi all,
>
> For once, I’m coming to you with a topic that is not strictly about
> FoundationDB :)
>
> CouchDB offers a few config settings (some of them undocumented) to put a
> limit on how long the server is allowed to take to generate a response. The
> trouble with many of these timeouts is that, when they fire, they do not
> actually clean up all of the work that they initiated. A couple of examples:
>
> - Each HTTP response coordinated by the “fabric” application spawns
> several ephemeral processes via “rexi” on different nodes in the cluster to
> retrieve data and send it back to the process coordinating the response. If
> the request timeout fires, the coordinating process will be killed off, but
> the ephemeral workers might not be. In a healthy cluster they’ll exit on
> their own when they finish their jobs, but there are conditions under which
> they can sit around for extended periods of time waiting for an overloaded
> gen_server (e.g. couch_server) to respond.
>
> - Those named gen_servers (like couch_server) responsible for serializing
> access to important data structures will dutifully process messages
> received from old requests without any regard for (or even knowledge of)
> the fact that the client that sent the message timed out long ago. This can
> lead to a sort of death spiral in which the gen_server is ultimately
> spending ~all of its time serving dead clients and every client is timing
> out.
>
> I’d like to see us introduce a documented maximum request duration for all
> requests except the _changes feed, and then use that information to aid in
> load shedding throughout the stack. We can audit the codebase for
> gen_server calls with long timeouts (I know of a few on the critical path
> that set their timeouts to `infinity`) and we can design servers that
> efficiently drop old requests, knowing that the client who made the request
> must have timed out. A couple of topics for discussion:
>
> - the “gen_server that sheds old requests” is a very generic pattern, one
> that seems like it could be well-suited to its own behaviour. A cursory
> search of the internet didn’t turn up any prior art here, which surprises
> me a bit. I’m wondering if this is worth bringing up with the broader
> Erlang community.
>
> - setting and enforcing timeouts is a healthy pattern for read-only
> requests as it gives a lot more feedback to clients about the health of the
> server. When it comes to updates things are a little bit more muddy, just
> because there remains a chance that an update can be committed, but the
> caller times out before learning of the successful commit. We should try to
> minimize the likelihood of that occurring.
>
> Cheers, Adam
>
> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
> course FDB has a hard 5 second limit on all transactions, so it is a bit of
> a forcing function :). Even putting FoundationDB aside, I would still argue
> to pursue this path based on our Ops experience with the current codebase.
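
To make the “gen_server that sheds old requests” idea above concrete, here is
a minimal, self-contained sketch. None of it is CouchDB code; the module,
function, and error names (shed_server, do_work, request_expired) are invented
for illustration. The caller attaches a deadline to every call and uses a
bounded timeout instead of `infinity`; the server checks that deadline before
doing any work, so a message that sat in the queue past the caller's timeout
is dropped instead of being served to a client that is no longer listening:

    -module(shed_server).
    -behaviour(gen_server).

    -export([start_link/0, call/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

    %% Stand-in for a documented maximum request duration, in milliseconds.
    -define(CALL_TIMEOUT, 5000).

    start_link() ->
        gen_server:start_link(?MODULE, [], []).

    %% The caller states how long it is willing to wait and threads the same
    %% deadline into the message, so the server can tell when a reply would
    %% be pointless.
    call(Pid, Request) ->
        Deadline = erlang:monotonic_time(millisecond) + ?CALL_TIMEOUT,
        gen_server:call(Pid, {Deadline, Request}, ?CALL_TIMEOUT).

    init([]) ->
        {ok, #{}}.

    handle_call({Deadline, Request}, _From, State) ->
        case erlang:monotonic_time(millisecond) > Deadline of
            true ->
                %% The message waited in the queue past the caller's deadline;
                %% the caller has already timed out, so skip the work entirely.
                {reply, {error, request_expired}, State};
            false ->
                {reply, do_work(Request, State), State}
        end.

    handle_cast(_Msg, State) ->
        {noreply, State}.

    handle_info(_Msg, State) ->
        {noreply, State}.

    %% Placeholder for whatever serialized work the real server protects.
    do_work(_Request, _State) ->
        ok.

The deadline check only pays off because the call timeout is bounded: once no
caller can wait longer than the documented maximum request duration, the
server can safely assume that anything older than its deadline belongs to a
client that has already given up.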