Re: [Patch] non blocking writes in core
On Tue, Nov 19, 2013 at 07:44:07PM +0200, Graham Leggett wrote: On 18 Nov 2013, at 1:24 PM, Plüm, Rüdiger, Vodafone Group ruediger.pl...@vodafone.com wrote: +rv = send_brigade_nonblocking(net-client_socket, bb, + (ctx-bytes_written), c); +if (APR_STATUS_IS_EAGAIN(rv)) { +setaside_remaining_output(f, ctx, bb, c); +} +else if (rv != APR_EAGAIN) { What if rv is APR_SUCCESS? This is indeed broken, fixed. Some more testing has revealed that mod_ssl's output filter breaks rules 2 and 5 of the 10 output filter rules published here: http://httpd.apache.org/docs/trunk/da/developer/output-filters.html#rules Those rules are written (explicitly) for resource-level filters. They would have to be a little different for CONNECTION level, e.g. EOS handling should probably be different... though I'm not sure how we'd write the rule. Probably connection filters should delete or pass EOS, since they don't (shouldn't) care about request boundaries. Dunno. mod_ssl's output filter goes to some effort to apply (10) consistently, and is agnostic to data bucket types (5), but yeah, the metadata handling is looks wrong as you say. Regards, Joe
Re: [Patch] non blocking writes in core
On 21 Nov 2013, at 10:43 AM, Joe Orton jor...@redhat.com wrote: Those rules are written (explicitly) for resource-level filters. They would have to be a little different for CONNECTION level, e.g. EOS handling should probably be different... though I'm not sure how we'd write the rule. Probably connection filters should delete or pass EOS, since they don't (shouldn't) care about request boundaries. Dunno. mod_ssl's output filter goes to some effort to apply (10) consistently, and is agnostic to data bucket types (5), but yeah, the metadata handling is looks wrong as you say. Having looked at this is a little deeper, having implemented the initial pass at nonblocking support for the core, and then trying to implement how mod_ssl might use this, it has uncovered another snag - filters right now assume that APR_EAGAIN is a fatal error and leave immediately, when in reality we might be half way through transforming and writing a brigade. Right now, it seems that filters have the property that they are obliged to consume the brigade they have been given. What this means practically is that if mod_ssl is given a 4GB file bucket, it is obliged to consume the entire 4GB file bucket and ensure it is written downstream before returning, and that breaks write completion. Ideally, the core filter should be able to return APR_EAGAIN and mod_ssl should do the right thing to set aside the remainder before passing back the error to the caller, but it doesn't do that, and neither does mod_deflate and a host of other external filters. Ideally we need a new type of output filter, an asynchronous output filter, which in addition to the ten rules have these rules: - Empty brigades must be passed upstream and not ignored like now. This is so that an upstream filter with setaside data has the chance to keep writing it, without messing about with metadata buckets or other performance sapping stuff. - Async filters must understand and react to APR_EAGAIN from upstream, APR_EAGAIN is not an error, instead it is a request to call us again when we are writable, possibly with an empty brigade. We must setaside what we have so far unwritten (probably using functionality buried today in the core output filter, suitably reference counting the c-data_in_output_filters). - The core output filter becomes non blocking by default, if we want blocking behaviour we must explicitly send a flush bucket. While it would be nice to pass a block/nonblock flag like we do with the input filter stack this competes with the flush buckets, which effectively mean block please. Defaulting to nonblock until explicitly told otherwise is nice and simple. Given that the world must be given an opportunity to migrate to async filters, we would need to support both at the same time. What I am thinking is that a filter registers itself either as a legacy synchronous filter (you'll be given a brigade, don't return until all the brigade is consumed), or an asynchronous filter (feel free to return APR_EAGAIN at any time, and setaside data if you receive APR_EAGAIN from upstream), you state explicitly which. The key bit will be the transition from an upstream async filter to an earlier sync filter, APR_EAGAIN will need to be trapped and responded to by sending a flush bucket up the stack, effectively allowing us to convert APR_EAGAIN into APR_SUCCESS which can then be safely received by the earlier sync filter. This is a simple task that could be accomplished inside ap_pass_brigade(). The event MPM would simply need to pass empty brigades to c-output_filters until c-data_in_output_filters returns to zero. The addition of a sync filter to the c-output_filters would cause write_completion to stop working, but it will still be functional until the sync filter became async. We currently have this situation where we keep track of the start of the protocol filter stack in c-output_filters instead of the real start of the stack, we could for now insert a helper protocol filter that runs first and waits until EOS is received meaning synchronous processing is done, before setting aside the data and returning APR_SUCCESS, inviting event to enter write completion in the hope that everything from this filter onwards is async and therefore eligible for write completion. This filter can also be responsible for flow control, vastly simplifying the core_output_filter we have now. All of this is still backportable to v2.4. Regards, Graham --
RE: [Patch] non blocking writes in core
-Original Message- From: Graham Leggett [mailto:minf...@sharp.fm] Sent: Donnerstag, 21. November 2013 14:58 To: dev@httpd.apache.org Subject: Re: [Patch] non blocking writes in core On 21 Nov 2013, at 10:43 AM, Joe Orton jor...@redhat.com wrote: Those rules are written (explicitly) for resource-level filters. They would have to be a little different for CONNECTION level, e.g. EOS handling should probably be different... though I'm not sure how we'd write the rule. Probably connection filters should delete or pass EOS, since they don't (shouldn't) care about request boundaries. Dunno. mod_ssl's output filter goes to some effort to apply (10) consistently, and is agnostic to data bucket types (5), but yeah, the metadata handling is looks wrong as you say. Having looked at this is a little deeper, having implemented the initial pass at nonblocking support for the core, and then trying to implement how mod_ssl might use this, it has uncovered another snag - filters right now assume that APR_EAGAIN is a fatal error and leave immediately, when in reality we might be half way through transforming and writing a brigade. Right now, it seems that filters have the property that they are obliged to consume the brigade they have been given. What this means practically is that if mod_ssl is given a 4GB file bucket, it is obliged to consume the entire 4GB file bucket and ensure it is written downstream before returning, and that breaks write completion. Ideally, the core filter should be able to return APR_EAGAIN and mod_ssl should do the right thing to set aside the remainder before passing back the error to the caller, but it doesn't do that, and neither does mod_deflate and a host of other external filters. Ideally we need a new type of output filter, an asynchronous output filter, which in addition to the ten rules have these rules: - Empty brigades must be passed upstream and not ignored like now. This is so that an upstream filter with setaside data has the chance to keep writing it, without messing about with metadata buckets or other performance sapping stuff. - Async filters must understand and react to APR_EAGAIN from upstream, APR_EAGAIN is not an error, instead it is a request to call us again when we are writable, possibly with an empty brigade. We must setaside what we have so far unwritten (probably using functionality buried today in the core output filter, suitably reference counting the c- data_in_output_filters). - The core output filter becomes non blocking by default, if we want blocking behaviour we must explicitly send a flush bucket. While it would be nice to pass a block/nonblock flag like we do with the input filter stack this competes with the flush buckets, which effectively mean block please. Defaulting to nonblock until explicitly told otherwise is nice and simple. Given that the world must be given an opportunity to migrate to async filters, we would need to support both at the same time. What I am thinking is that a filter registers itself either as a legacy synchronous filter (you'll be given a brigade, don't return until all the brigade is consumed), or an asynchronous filter (feel free to return APR_EAGAIN at any time, and setaside data if you receive APR_EAGAIN from upstream), you state explicitly which. The key bit will be the transition from an upstream async filter to an earlier sync filter, APR_EAGAIN will need to be trapped and responded to by sending a flush bucket up the stack, effectively allowing us to convert APR_EAGAIN into APR_SUCCESS which can then be safely received by the earlier sync filter. This is a simple task that could be accomplished inside ap_pass_brigade(). The event MPM would simply need to pass empty brigades to c- output_filters until c-data_in_output_filters returns to zero. The addition of a sync filter to the c-output_filters would cause write_completion to stop working, but it will still be functional until the sync filter became async. We currently have this situation where we keep track of the start of the protocol filter stack in c-output_filters instead of the real start of the stack, we could for now insert a helper protocol filter that runs first and waits until EOS is received meaning synchronous processing is If this filter sets aside all stuff until eos (or a flush bucket) is seen it might cause a huge amount of memory consumption. So we possibly need some limit for memory consuming buckets which when passed causes us to send the stuff we have down the chain even without an eos or flush bucket. As far as I know the current core output filter also only buffers a limited amount of memory buckets until it decides to go for blocking write. Otherwise sounds sensible. Regards Rüdiger
Re: [Patch] non blocking writes in core
On 21 Nov 2013, at 4:50 PM, Plüm, Rüdiger, Vodafone Group ruediger.pl...@vodafone.com wrote: If this filter sets aside all stuff until eos (or a flush bucket) is seen it might cause a huge amount of memory consumption. So we possibly need some limit for memory consuming buckets which when passed causes us to send the stuff we have down the chain even without an eos or flush bucket. As far as I know the current core output filter also only buffers a limited amount of memory buckets until it decides to go for blocking write. In theory it would be ideal if flow control could be handled at the start of the protocol stack rather than at the end. In other words, the protocol used for flow control in core be moved into a dedicated filter. We will then be able to pass file buckets through without a problem, and properly handle morphing and memory buckets up to the sensible limits we have now. Life then becomes simple, memory constrained (and hopefully exclusively async) downstream from this filter. Regards, Graham --
Re: [Patch] non blocking writes in core
/me like :) On Nov 21, 2013, at 8:57 AM, Graham Leggett minf...@sharp.fm wrote: Given that the world must be given an opportunity to migrate to async filters, we would need to support both at the same time. What I am thinking is that a filter registers itself either as a legacy synchronous filter (you'll be given a brigade, don't return until all the brigade is consumed), or an asynchronous filter (feel free to return APR_EAGAIN at any time, and setaside data if you receive APR_EAGAIN from upstream), you state explicitly which. The key bit will be the transition from an upstream async filter to an earlier sync filter, APR_EAGAIN will need to be trapped and responded to by sending a flush bucket up the stack, effectively allowing us to convert APR_EAGAIN into APR_SUCCESS which can then be safely received by the earlier sync filter. This is a simple task that could be accomplished inside ap_pass_brigade(). The event MPM would simply need to pass empty brigades to c-output_filters until c-data_in_output_filters returns to zero. The addition of a sync filter to the c-output_filters would cause write_completion to stop working, but it will still be functional until the sync filter became async. We currently have this situation where we keep track of the start of the protocol filter stack in c-output_filters instead of the real start of the stack, we could for now insert a helper protocol filter that runs first and waits until EOS is received meaning synchronous processing is done, before setting aside the data and returning APR_SUCCESS, inviting event to enter write completion in the hope that everything from this filter onwards is async and therefore eligible for write completion. This filter can also be responsible for flow control, vastly simplifying the core_output_filter we have now. All of this is still backportable to v2.4. Regards, Graham --
Re: [Patch] non blocking writes in core
On 18 Nov 2013, at 1:24 PM, Plüm, Rüdiger, Vodafone Group ruediger.pl...@vodafone.com wrote: +rv = send_brigade_nonblocking(net-client_socket, bb, + (ctx-bytes_written), c); +if (APR_STATUS_IS_EAGAIN(rv)) { +setaside_remaining_output(f, ctx, bb, c); +} +else if (rv != APR_EAGAIN) { What if rv is APR_SUCCESS? This is indeed broken, fixed. Some more testing has revealed that mod_ssl's output filter breaks rules 2 and 5 of the 10 output filter rules published here: http://httpd.apache.org/docs/trunk/da/developer/output-filters.html#rules Instead of passing on all metadata buckets as required, mod_ssl only takes into account EOS, EOC and FLUSH, and tries to consume metadata buckets as normal data buckets. SSL_write() then sees the zero length write and treats it as a noop. The core filter is never called, the event mpm interprets a non zero c-data_in_output_filters as an invitation to try again, and we go into a spin. If mod_ssl does this, we can expect other modules in the wild to be a problem. A possible solution to this dilemma is to try and detect buggy filters and compensate for the problem. We might have a protocol helper filter that run early on and passes the brigade upstream. If we started with c-data_in_output_filters as non zero and we are in WRITE_COMPLETION, and on return the filter sees APR_SUCCESS from upstream, but c-data_in_output_filters is still not zero, we may not have reached the core filter, so call the core filter directly once with a nonblock bucket simulating the behaviour we have now. This will ensure the setaside data in the core output filter will always be written at least once on each call, preventing the spin. This path in theory should only be followed if the buggy filter is present, because when the buggy filter is not present and the core is successfully reached the return code will be APR_EAGAIN, or APR_SUCCESS with c-data_in_output_filters going to zero. Regards, Graham --
Re: [Patch] non blocking writes in core
On 19 Nov 2013, at 7:44 PM, Graham Leggett minf...@sharp.fm wrote: This is indeed broken, fixed. Some more testing has revealed that mod_ssl's output filter breaks rules 2 and 5 of the 10 output filter rules published here: http://httpd.apache.org/docs/trunk/da/developer/output-filters.html#rules Instead of passing on all metadata buckets as required, mod_ssl only takes into account EOS, EOC and FLUSH, and tries to consume metadata buckets as normal data buckets. SSL_write() then sees the zero length write and treats it as a noop. The core filter is never called, the event mpm interprets a non zero c-data_in_output_filters as an invitation to try again, and we go into a spin. If mod_ssl does this, we can expect other modules in the wild to be a problem. A possible solution to this dilemma is to try and detect buggy filters and compensate for the problem. We might have a protocol helper filter that run early on and passes the brigade upstream. If we started with c-data_in_output_filters as non zero and we are in WRITE_COMPLETION, and on return the filter sees APR_SUCCESS from upstream, but c-data_in_output_filters is still not zero, we may not have reached the core filter, so call the core filter directly once with a nonblock bucket simulating the behaviour we have now. This will ensure the setaside data in the core output filter will always be written at least once on each call, preventing the spin. This path in theory should only be followed if the buggy filter is present, because when the buggy filter is not present and the core is successfully reached the return code will be APR_EAGAIN, or APR_SUCCESS with c-data_in_output_filters going to zero. Something like the attached patch, which was a lot simpler than the above. Regards, Graham -- httpd-core-nonblocking2.patch Description: Binary data
RE: [Patch] non blocking writes in core
-Original Message- From: Graham Leggett [mailto:minf...@sharp.fm] Sent: dinsdag 19 november 2013 18:44 To: dev@httpd.apache.org Subject: Re: [Patch] non blocking writes in core On 18 Nov 2013, at 1:24 PM, Plüm, Rüdiger, Vodafone Group ruediger.pl...@vodafone.com wrote: +rv = send_brigade_nonblocking(net-client_socket, bb, + (ctx-bytes_written), c); +if (APR_STATUS_IS_EAGAIN(rv)) { +setaside_remaining_output(f, ctx, bb, c); +} +else if (rv != APR_EAGAIN) { What if rv is APR_SUCCESS? This is indeed broken, fixed. This also breaks unsafe for platforms where there are multiple EAGAIN values, like on Windows where APR_STATUS_IS_EAGAIN() returns true for quite a few error codes. Bert
RE: [Patch] non blocking writes in core
Some comments: @@ -581,6 +564,33 @@ eor_buckets_in_brigade, morphing_bucket_in_brigade); } +/* Handle non blocking writes. If we saw a non blocking bucket, attempt + * a non blocking write. If the non blocking write would have returned + * APR_EGAIN, set aside the remainder and return APR_EAGAIN. + */ +if (nonblock_bucket_in_brigade) { +if (loglevel = APLOG_TRACE6) { +ap_log_cerror(APLOG_MARK, APLOG_TRACE8, APR_SUCCESS, c, + core_output_filter: non blocking write + to the network); +} +rv = send_brigade_nonblocking(net-client_socket, bb, + (ctx-bytes_written), c); +if (APR_STATUS_IS_EAGAIN(rv)) { +setaside_remaining_output(f, ctx, bb, c); +} +else if (rv != APR_EAGAIN) { What if rv is APR_SUCCESS? +/* The client has aborted the connection */ +ap_log_cerror(APLOG_MARK, APLOG_TRACE1, rv, c, + core_output_filter: writing data to the network); +c-aborted = 1; +} +return rv; +} + +/* Otherwise handle a normal write. A non blocking write will be attempted, + * but we don't return APR_EAGAIN, we return APR_SUCCESS instead. + */ if (bytes_in_brigade = THRESHOLD_MIN_WRITE) { rv = send_brigade_nonblocking(net-client_socket, bb, (ctx-bytes_written), c); +AP_DECLARE_DATA const apr_bucket_type_t ap_bucket_type_nonblock = { +EOC, 5, APR_BUCKET_METADATA, +apr_bucket_destroy_noop, I guess this should be something like NBB instead of EOC (copy and paste error?) Otherwise looks sensible from a brief look. Regards Rüdiger -Original Message- From: Graham Leggett Sent: Sonntag, 17. November 2013 20:41 To: dev@httpd.apache.org Subject: [Patch] non blocking writes in core Hi all, Continuing on from the discussion about how we might support write completion in mod_ssl, I have come up with the following patch below. I started by changing the event MPM to call all protocol filters instead of just the hard coded write filter: rv = ap_pass_brigade(c-output_filters, cs-bb); The first thing that this revealed was that we need to pass a non-empty brigade down the stack. I needed a metadata bucket to pass down the stack to make the brigade not-empty, and so chose to create a metadata bucket for this purpose (more on this below). The second thing that we needed to support for mod_ssl to have any hope of taking advantage of write completion was non-blocking writes to the core output filter. This would give mod_ssl the option to setaside any large buckets (example: file buckets) and allow the core to enter write completion as soon as it saw EAGAIN. We can't change the API for ap_pass_brigade() to add a nonblock flag like an input filter, but we can pass a NONBLOCK metadata bucket down the stack, and this dovetails nicely with our need for a metadata bucket above. So, this patch does the following: - Takes out the pass NULL to the core output filter hack that supported write completion previously in the event MPM (other MPMs to follow). - Add the nonblock bucket, and pass this bucket down the protocol stack during write completion. - Teach the core output filter how to react to the NONBLOCK bucket, returning APR_EAGAIN appropriately when the NONBLOCK bucket is present (and APR_SUCCESS when not, as per previous behaviour). In theory, this opens the door for us to support asynchronous filters that know about NONBLOCK buckets and nonblocking behaviour. In the case of naive filters, we may end up with existing blocking behaviour during write completion, but this is no worse than we have now. Also in theory, all of this can be backported to v2.4. Regards, Graham --