On Mon, Sep 12, 2016 at 10:49 AM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem <rpl...@apache.org> wrote:
>
>>
>> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
>> > Thanks all for the feedback. Status and follow-up questions inline
>> >
>> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr <
>> wr...@rowe-clan.net <mailto:wr...@rowe-clan.net>> wrote:
>> >
>> >     4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
>> >
>> >     Real world direct observation especially appreciated from actual
>> deployments.
>> >
>> > Strict (and StrictURI) remain the default.
>>
>> StrictURI as a default only makes sense if we have our own house in order
>> (see above), otherwise it should be opt in.
>
>
> So it's not only our house [our %3B encoding in httpd isn't a showstopper
> here]... but also whether widely used user-agent browsers and tooling have
> their houses in order, so I started to study the current browser
> behaviors.
> The applicable spec is https://tools.ietf.org/html/rfc3986#section-3.3
>

The second test below has been updated with 2- and 3-byte UTF-8 sequences,
and no new surprises showed up.

Checked the unreserved set, with '?' and '/' keeping their special meanings.
Nothing here should become escaped when given as a URI:
http://localhost:8080/unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query

Checked the invalid set of characters, all of which must be encoded
per the spec, and verified that #frag is not passed to the server:
http://localhost:8080/gen-delims-[]/invalid- "<>\^`{|}§‡#frag

Checked the reserved set including '#' '%' '?' by their encoded value
to determine if there are any unpleasant reverse surprises lurking;
http://localhost:8080/encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D

Checked a list of unreserved/unassigned gen-delims and sub-delims
to determine if the user agent normalizes while composing the request:
http://localhost:8080/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E
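
For anyone who wants to build similar test URLs, here is a minimal Python
sketch that percent-encodes every byte of a string, whether or not the
character needs it. The helper name pct_encode_all is made up for this
illustration; it is not part of any tool used for the tests above.

```python
def pct_encode_all(text: str) -> str:
    """Percent-encode every byte of the UTF-8 form of text.

    Unlike urllib.parse.quote, nothing is left literal, which is what
    the "encoded-" and "plain-" test paths need.
    """
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


# Reserved characters used in the third test.
encoded_test = "/encoded-" + pct_encode_all("#%/:;=?@[\\]{|}")

# Characters used in the fourth test, including unreserved ones.
plain_test = "/plain-" + pct_encode_all("!$&'()*+,-.123ABC_abc~")

# The 2- and 3-byte UTF-8 sequences from the second test.
utf8_tail = pct_encode_all("§‡")

print(encoded_test)
print(plain_test)
print(utf8_tail)
```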

Using the simplistic $ nc -kl localhost 8080, here are the results
I obtained from a few current browsers. More observations of other
user agents, sent to this list, would be appreciated.
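
Where nc is not available, something like the following Python sketch
should serve the same purpose. It is an assumption-laden stand-in, not what
I ran: the function names are invented here, it reads a single recv() per
connection (enough for a browser request line in practice, though not
guaranteed), and it answers 204 so the browser does not hang.

```python
import socket


def request_line(raw: bytes) -> str:
    """Return the first line of a raw HTTP request, bytes kept as Latin-1."""
    return raw.split(b"\r\n", 1)[0].decode("latin-1")


def listen(host: str = "localhost", port: int = 8080) -> None:
    """Print the request line of every connection until interrupted."""
    with socket.create_server((host, port)) as server:
        while True:
            conn, _addr = server.accept()
            with conn:
                data = conn.recv(65536)
                if data:
                    print(request_line(data))
                # Minimal reply so the user agent closes the request.
                conn.sendall(
                    b"HTTP/1.1 204 No Content\r\nConnection: close\r\n\r\n"
                )


# Call listen() to start capturing; stop it with Ctrl-C.
```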

Chrome 53:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
odd>            ^^                     ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^

Firefox 48:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D HTTP/1.1
odd>            ^^                     ^         ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^

IE 11:
GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
odd>            ^^                     ^
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>        ^  ^  ^  ^  ^  ^  ^  ^  ^
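
The Chrome and IE output for the fourth test looks like what you would get
by decoding only the percent-encoded unreserved characters and leaving
everything else alone. A rough Python sketch of that behavior follows; the
function name is invented, and this is my reading of the wire output, not
a claim about how either browser is implemented.

```python
import re
import string

# RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(path: str) -> str:
    """Decode %XX only when it stands for an unreserved character."""

    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Reserved and invalid characters stay encoded (hex uppercased).
        return char if char in UNRESERVED else match.group(0).upper()

    return PCT_TRIPLET.sub(replace, path)


sent = ("/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E"
        "%31%32%33%41%42%43%5F%61%62%63%7E")
print(decode_unreserved(sent))
```

Firefox, by contrast, sent the fourth test path through unchanged.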



> The character '\' is converted to a '/' by both browsers, in a nod either
> to Microsoft insanity, or a less-accessible '/' key. (Which suggests that
> the yen sign might be treated similarly in some jp locales.) Invalid as a
> literal '\' character, both browsers support an explicit %5C for those who
> really want to use that in a URI. No actual issue here.
>

Ditto for Microsoft IE.


> Interestingly, gen-delims '@' and ':' are explicitly allowed by the 3.3
> grammar (as I've tested above), while '[' and ']' are omitted and
> therefore not allowed according to spec. (On this, StrictURI won't care
> yet, because we are simply checking for any valid URI character, not by
> section, and '[' ']' are obviously allowed for the IPv6 address literal
> in the authority, so we don't reject yet.) When we add strict parsing to
> the apr uri parsing function, we will trip over this, from all browsers,
> in spite of these being prohibited and declared unwise for the past
> 18 years or more.
>
>

IE also suffers the '[' ']' defect above, and does not share the
Firefox-specific defect of '|' below. In short, Chrome and IE behavior
appears to be identical over the wire.


> The character '|' is also invalid. However, Firefox fails to follow the
> spec again here (although Chrome gets it right).
>
> With respect to these characters, recall this 18 year old document,
> whose last paragraph describes the rationale:
> https://tools.ietf.org/html/rfc2396.html#section-2.4.3
>
>    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
>    Data corresponding to excluded characters must be escaped in order to
>    be properly represented within a URI.
>
>
> Which replaced https://tools.ietf.org/html/rfc1738#section-2.2 now
> almost 22 years old, without changing the rules;
>
>    Unsafe:
>
>    Characters can be unsafe for a number of reasons.  The space
>    character is unsafe because significant spaces may disappear and
>    insignificant spaces may be introduced when URLs are transcribed or
>    typeset or subjected to the treatment of word-processing programs.
>    The characters "<" and ">" are unsafe because they are used as the
>    delimiters around URLs in free text; the quote mark (""") is used to
>    delimit URLs in some systems.  The character "#" is unsafe and should
>    always be encoded because it is used in World Wide Web and in other
>    systems to delimit a URL from a fragment/anchor identifier that might
>    follow it.  The character "%" is unsafe because it is used for
>    encodings of other characters.  Other characters are unsafe because
>    gateways and other transport agents are known to sometimes modify
>    such characters. These characters are "{", "}", "|", "\", "^", "~",
>    "[", "]", and "`".
>
>    All unsafe characters must always be encoded within a URL.
>
>
> While it was labeled 'unsafe', 'unwise', and now disallowed-by-omission
> from RFC3986, the 'must' designation couldn't have been any clearer.
> We've had this right for 2 decades at httpd.
>
> Second paragraph of https://tools.ietf.org/html/rfc3986#appendix-D.1
> goes into some detail about this change, and while it is hard to parse,
> the paragraph is stating that '[' ']' were once invalid, now are reserved,
> and remain disallowed in all other path segments and use cases.
>
> The upshot: right now StrictURI will accept '[' and ']', but this won't
> survive a rewrite of the apr parser operating with a 'strict' toggle.
> StrictURI does not accept '|'. The remaining question is what to do, if
> anything, about carving a specific exception here due to modern Firefox
> issues.
>
> Thoughts/Comments/Additional test data?  TIA!
>
>
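
To make the two policies concrete, here is a small Python sketch that
contrasts "any valid URI character" with "valid in a path per section 3.3".
It illustrates the distinction only and is not httpd or apr code; the set
names and the rejected() helper are invented, and it does not check that
'%' is followed by two hex digits.

```python
import string

# Character classes from RFC 3986 section 2.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
GEN_DELIMS = set(":/?#[]@")

# Policy 1: any character that is valid somewhere in a URI.
ANY_URI_CHAR = UNRESERVED | SUB_DELIMS | GEN_DELIMS | {"%"}

# Policy 2: only what section 3.3 allows in a path (pchar plus "/").
PATH_CHAR = UNRESERVED | SUB_DELIMS | set(":@/%")


def rejected(path: str, allowed: set) -> list:
    """Return the distinct characters of path that allowed does not cover."""
    return sorted({char for char in path if char not in allowed})


firefox_path = "/gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D"
chrome_path = ("/gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D"
               "%C2%A7%E2%80%A1")

print(rejected(firefox_path, ANY_URI_CHAR))  # only '|'
print(rejected(firefox_path, PATH_CHAR))     # '[', ']' and '|'
print(rejected(chrome_path, ANY_URI_CHAR))   # nothing
print(rejected(chrome_path, PATH_CHAR))      # '[' and ']'
```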
