I think your response demonstrates fairly well how obtuse input
filtering has become. ;)

I saw so much repeated code for parsing brigades, that I created a
"readahead" API: ap_brigade_ra().  It is passed similar arguments as
those to input filters, and additionally is passed a readahead struct
and a readahead limit.  This abstraction acts as a buffer and parses
out bytes and lines, reading from upstream filters when it needs to.

Filters can store their readahead struct in their ctx and can use
the readahead API instead of calling ap_get_brigade() themselves.
(There's a lot more to say here, but I'll have to put together a
more coherent description)

In short, the readahead API attempts to address all of your concerns
listed below, and more.  (I originally started this when I wanted to
make HTTP_IN filter fully non-blocking and saw just how deep the
blocking went.)


On Thu, Aug 12, 2004 at 10:20:14AM -0700, Justin Erenkrantz wrote:
>--On Thursday, August 12, 2004 2:51 AM -0400 Glenn Strauss 
><[EMAIL PROTECTED]> wrote:
>
>>     /** The filter should pass any special buckets (not in-memory) as long
>> as it      *  does not need to perform any processing on them (translation
>> or protocol      *  delimiting) (augments AP_MODE_BYTES; mutually exclusive
>> w/ AP_MODE_PEEK)      */
>>     AP_MODE_PASS_SPECIALS = 4,
>
>This is a violation of the bucket brigade API (i.e. trying to force a type on 
>the returned bucket when it is supposed to be abstract).  I definitely would 
>like to see a *lot* more justification as to why this is mandatory.

Unlike AP_MODE_EXHAUSTIVE which is an imperative, AP_MODE_PASS_SPECIALS
is an optional request that can be made with AP_MODE_BYTES.

AP_MODE_BYTES | AP_MODE_PASS_SPECIALS is intended to be the safe equivalent
to AP_MODE_EXHAUSTIVE.  It says "pass me everything that you've got, that
you can safely pass.  I, the caller, can handle non-in-memory buckets".
For example, the core input filter could pass everything.  SSL would NOT
pass the socket because it must decrypt the data.  The HTTP_IN filter
would NOT pass everything because it would need to keep track of C-L or
chunking.  However, in the proxy response case where not C-L or chunked
(reading until connection close), then HTTP_IN could pass the socket.
The readahead API handles the special buckets for any filter that uses it.

Among other things, AP_MODE_BYTES | AP_MODE_PASS_SPECIALS allows the
input filter chain to collapse itself where it can.

>>     /** The filter read should be treated as speculative and any returned
>>      *  data should be stored for later retrieval in another mode.
>>      *  (augments AP_MODE_BYTES or AP_MODE_LINE)
>>      */
>>     AP_MODE_PEEK = 8,
>
>PEEK (aka SPECULATIVE: PEEK was a poor name for a lot of reasons which is why 
>we tossed it).  It's not possible to know whether it is going to be a line or 
>a byte, so it doesn't make sense for that to be its own independent mode. 

It is not a separate mode in my API.  It is either
(AP_MODE_BYTES | AP_MODE_PEEK) or (AP_MODE_LINE | AP_MODE_PEEK)

Input filtering is (should be) a read API which reads a set of bytes.
Sometimes the caller wants to treat these bytes as organized in a line.
Sometimes the caller wants to read a (possibly) MIME-folded set of lines.
Sometimes the caller wants to speculatively read some bytes or speculative
read a line. ...

These "sometimes" are why I need ap_input_mode_t to be a set of bitflags.

("peek" vs. "speculative" name doesn't matter to me)

>Doing a speculative read on a line isn't possible and adds a lot of overhead.

Yes it's possible and it is built into the readahead API.  Other than
copying buckets (small, but non-trivial overhead), it does not add any
additional overhead.  At the end of the readahead routines, the buckets
to be returned are either removed from the readahead struct, or copied
from them.  A simple conditional decides.

>You should only be calling SPECULATIVE when your protocol doesn't state what 
>to do next.  If you are expecting a line, then just call AP_MODE_GETLINE.  So, 
>I don't believe this option isn't combinable with anything else.

I think speculative mode is severly broken.  We should not call "across"
filters; we should not have to know which filters are above us.
AP_MODE_PEEK actually pulls data from upstream into the readahead struct.
We are actually peeking into data that we then have the opportunity to
consume again.  However, any translation from above filters will already
have been performed, and will not need to be performed again; it is 
buffered in the readahead struct.

If the only reason for SPECULATIVE mode is to peek into connection filters,
then get rid of SPECULATIVE mode and just insert a "peek" filter at
the bottom of the connection filters.

>>     /** When reading lines, look for MIME-folded continuations as well
>>      *  (augments AP_MODE_LINE)
>>      */
>>     AP_MODE_MIME_FOLDING = 16
>
>Actually, this can't be handled as efficiently as you might think it would due 
>to the presence of SSL or other connection-level filters.  Therefore, folding 
>might only be possible to do in ap_http_filter, but it can't go down further 
>as into core_input_filter (which is where we now call apr_brigade_split_line). 
>A new getline_folding filter right in front of ap_http_filter would do the 
>trick, I suppose.

But why limit this only to ap_http_filter?
The readahead API handles this for any filter that reads MIME fields,
and it buffers partial lines until the complete line is finished.

>Once again, I don't believe it makes sense to combine this with any other 
>option.  For example, this doesn't combine with AP_MODE_READBYTES or 
>AP_MODE_SPECULATIVE.  It only makes sense in combination with AP_MODE_GETLINE.
>
>So, one option is to add a new mode here: AP_MODE_GETLINE_CONTINUATION (or 
>some such) and place it as a protocol-level filter.  However, there are some 
>corner cases that have to be dealt with: up-to and including extending over 
>the line limits (see below).  Just altering AP_MODE_GETLINE to handle 
>continuations isn't allowable as there are some HTTP protocol behaviors that 
>can not be line continued: i.e. request line.
>
>Right now, we deal with these corner cases outside of the filter chain.  Now 
>that ap_rgetline() has been refactored several times: the code was too hairy 
>at that time I last re-did input filtering that it wasn't worth it to mess 
>with ap_rgetline() then.  (After I re-did the input filters, Brian Pane re-did 
>ap_rgetline() if my recollection of the timeline is correct.)  So, I believe 
>it's easier now to just add a new filter mode and try to abstract away the 
>continuation logic now.  The code wasn't very clean back then...
>
>> of code duplication between modules.  For example, the behavior of
>> line-mode is vauge and requires that callers re-parse the brigade
>> to check for complete lines.  I've got a whole litany of things,
>
>It's for the reason listed below...
>
>> including bad code examples of brigade parsing in the Apache core,
>> but that's another post.
>
>Well, pointers as to where things aren't right are certainly appreciated.

I'll put together a list and will post later or tomorrow.

>As to ap_input_mode_t as bitflags: the case doesn't yet seem to be made as to 
>why a combination of options is mandatory.  I believe adding one additional 
>AP_MODE_FOLDED_LINE or something would be sufficient.
>
>> One advantage of the new API:
>> MIME continuation lines are used so often (HTTP headers/trailers,
>> email message headers, MIME encapsulation, ...) that they should be
>> part of the core API.  Instead of code duplication (witness
>> ap_rgetline_core() and ap_get_mime_headers_core() both implementing
>> continuation line unfolding), AP_MODE_LINE | AP_MODE_MIME_FOLDING
>> does not return a partial line.  It only returns a complete line
>> including continuations.  Simplifying things a bit for now, this
>> allows caller to know that if its brigade is returned non-empty,
>> that the line is complete, including continuations, without caller
>> having to re-parse the line to double-check.
>
>It is simply a denial of service condition to allow unbounded lines to be 
>read: therefore, it must be possible to return partial lines (or possibly 
>error out).  This is why there is an internal upper limit on how much can be 
>read into a single line in the core_input_filter (via ap_brigade_splitline). 
>We *must* respect the LimitRequestLine and LimitFieldSize (?).  Dealing with 
>this config logic in a filter is a tad more complicated - this is why a 
>separate mime-continuation-only protocol filter makes a lot of sense: the code 
>isn't as trivial as you might expect.

The readahead API honors readbytes and will return APR_ENOSPC if a LF
is not detected before readbytes is reached.  APR_INCOMPLETE is returned
if EOS is seen before linefeed and a partial line has been read.  The
caller can always look into the readahead struct if s/he wants to see
the partial line.

>One thing that does concern me though is that the LimitFieldSize's only apply 
>to HTTP MIME request headers.  There are other allowable cases where the 
>limits shouldn't apply; so I'm mildly concerned as to how we'd allow a filter 
>to intentionally bypass those limits: perhaps those use cases have to continue 
>calling GETLINE and doing the continuation themselves.  -- justin

The readahead API allows the caller to specify the limit.


Please send me a private email if you would like to see the code;
the design is there, and so is a lot of the code (too long to post),
but it has not yet been tested.  I'd have waited until I was a bit
further along, but since it changes some behavior, I would like to
get the minimal behavior changes into 2.4 if I can.  Thanks!

Cheers,
Glenn

Reply via email to