On Mon, Sep 12, 2016 at 10:49 AM, William A Rowe Jr <wr...@rowe-clan.net> wrote:
> On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem <rpl...@apache.org> wrote:
>
>>
>> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
>> > Thanks all for the feedback. Status and follow-up questions inline
>> >
>> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr
>> > <wr...@rowe-clan.net> wrote:
>> >
>> > 4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
>> >
>> > Real world direct observation especially appreciated from actual
>> > deployments.
>> >
>> > Strict (and StrictURI) remain the default.
>>
>> StrictURI as a default only makes sense if we have our own house in order
>> (see above), otherwise it should be opt in.
>
>
> So it's not only our house [our %3B encoding in httpd isn't a showstopper
> here]... but also whether widely used user-agent browsers and tooling have
> their houses in order, so I started to study the current browser
> behaviors.
>
> The applicable spec is https://tools.ietf.org/html/rfc3986#section-3.3
>
The second test below has been updated with 2 and 3 byte utf-8 sequences,
and no new surprises showed up.

Checked the unreserved set with '?' and '/' observing special meanings.
Nothing here should become escaped when given as a URI;

http://localhost:8080/unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query

Checked the invalid set of characters, all of which must be encoded per the
spec, and verified that #frag is not passed to the server;

http://localhost:8080/gen-delims-[]/invalid- "<>\^`{|}§‡#frag

Checked the reserved set including '#' '%' '?' by their encoded value to
determine if there are any unpleasant reverse surprises lurking;

http://localhost:8080/encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D

Checked a list of unreserved/unassigned gen-delims and sub-delims to
determine if the user agent normalizes while composing the request;

http://localhost:8080/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E

Using the simplistic

$ nc -kl localhost 8080

here are the results I obtained from a couple of current browsers.
More observations and feedback of other user-agents to this list would
be appreciated.

Chrome 53:

GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1

GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
odd>            ^^                     ^

GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1

GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>       ^  ^  ^  ^  ^  ^  ^  ^  ^

Firefox 48:

GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1

GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D HTTP/1.1
odd>            ^^                     ^         ^

GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1

GET /plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E HTTP/1.1
odd>       ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^

IE 11:

GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1

GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1 HTTP/1.1
odd>            ^^                     ^

GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1

GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
odd>       ^  ^  ^  ^  ^  ^  ^  ^  ^

> The character '\' is converted to a '/' by both browsers, in a nod either
> to Microsoft insanity, or a less-accessible '/' key. (Which suggests that
> the yen sign might be treated similarly in some jp locales.) Invalid as a
> literal '\' character, both browsers support an explicit %5C for those who
> really want to use that in a URI. No actual issue here.
>
Ditto for Microsoft IE.
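
To make the 'plain-' result above concrete: Chrome and IE appear to decode
only those percent-encoded octets that map to unreserved characters and
leave the encoded sub-delims alone, which is consistent with the
normalization described in RFC 3986 section 6.2.2.2. Below is a minimal
Python sketch of that behavior. The name decode_unreserved is hypothetical,
and this is an illustration only, not code taken from any browser or from
httpd.

```python
import re
import string

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(target):
    """Decode only the %XX triplets whose octet is an unreserved character.

    Hypothetical helper for illustration. Any other triplet is kept
    percent-encoded, with its hex digits uppercased.
    """
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()

    return _PCT_TRIPLET.sub(replace, target)


sent = ("/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E"
        "%31%32%33%41%42%43%5F%61%62%63%7E")
print(decode_unreserved(sent))
# prints /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~
```

The output matches the request line observed from Chrome 53 and IE 11,
while Firefox 48 sent the string exactly as it was given.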
> Interestingly, gen-delims '@' and ':' are explicitly allowed by 3.3
> grammar (as I've tested above), while '[' and ']' are omitted and
> therefore not allowed according to spec. (On this, StrictURI won't care
> yet, because we are simply correcting for any valid URI character, not by
> section, and '[' ']' are obviously allowed for the IPv6 port
> specification - so we don't reject yet.)
> When we add strict parsing to the apr uri parsing function, we will trip
> over this, from all browsers, in spite of these being prohibited and
> declared unwise for the past 18 years or more.
>
IE also suffers the '[' ']' defect above, and does not share the
Firefox-specific defect of '|' below. In short, Chrome and IE behavior
appears to be identical over the wire.

> The character '|' is also invalid. However, Firefox fails to follow the
> spec again here (although Chrome gets it right).
>
> With respect to these characters, recall this 18 year old document,
> last paragraph describes the rationale;
> https://tools.ietf.org/html/rfc2396.html#section-2.4.3
>
>    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
>    Data corresponding to excluded characters must be escaped in order to
>    be properly represented within a URI.
>
>
> Which replaced https://tools.ietf.org/html/rfc1738#section-2.2 now
> almost 22 years old, without changing the rules;
>
>    Unsafe:
>
>    Characters can be unsafe for a number of reasons. The space
>    character is unsafe because significant spaces may disappear and
>    insignificant spaces may be introduced when URLs are transcribed or
>    typeset or subjected to the treatment of word-processing programs.
>    The characters "<" and ">" are unsafe because they are used as the
>    delimiters around URLs in free text; the quote mark (""") is used to
>    delimit URLs in some systems. The character "#" is unsafe and should
>    always be encoded because it is used in World Wide Web and in other
>    systems to delimit a URL from a fragment/anchor identifier that might
>    follow it. The character "%" is unsafe because it is used for
>    encodings of other characters. Other characters are unsafe because
>    gateways and other transport agents are known to sometimes modify
>    such characters. These characters are "{", "}", "|", "\", "^", "~",
>    "[", "]", and "`".
>
>    All unsafe characters must always be encoded within a URL.
>
>
> While it was labeled 'unsafe', 'unwise', and now disallowed-by-omission
> from RFC3986, the 'must' designation couldn't have been any clearer.
> We've had this right for 2 decades at httpd.
>
> Second paragraph of https://tools.ietf.org/html/rfc3986#appendix-D.1
> goes into some detail about this change, and while it is hard to parse,
> the paragraph is stating that '[' ']' were once invalid, now are reserved,
> and remain disallowed in all other path segments and use cases.
>
> The upshot, right now StrictURI will accept '[' and ']', but this won't
> survive a rewrite of the apr parser operating with a 'strict' toggle.
> StrictURI does not accept '|'. The remaining question is what to do, if
> anything, about carving a specific exception here due to modern Firefox
> issues.
>
> Thoughts/Comments/Additional test data? TIA!
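
As a rough illustration of what a by-section check could look like, here is
a minimal Python sketch that tests a path against the pchar rule of RFC 3986
section 3.3 (unreserved, pct-encoded, sub-delims, ':' and '@', with '/'
separating segments). The name path_violations is hypothetical, and this is
neither the current StrictURI logic nor the apr parser. It only shows which
literal characters from the request lines above such a check would flag.

```python
import string

# RFC 3986 section 2.3 and 2.2 character sets
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | set(":@")
HEXDIG = set(string.hexdigits)


def path_violations(path):
    """Return the characters in `path` that the section 3.3 grammar rejects.

    Hypothetical helper for illustration. A '%' counts as a violation
    unless it is followed by exactly two hex digits.
    """
    bad = []
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "%":
            triplet = path[i + 1:i + 3]
            if len(triplet) == 2 and all(c in HEXDIG for c in triplet):
                i += 3  # well-formed pct-encoded octet, skip it whole
                continue
            bad.append(ch)
        elif ch != "/" and ch not in PCHAR_LITERALS:
            bad.append(ch)
        i += 1
    return bad


# Path portions of the request lines observed above
chrome = "/gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D%C2%A7%E2%80%A1"
firefox = "/gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D"
print(path_violations(chrome))   # prints ['[', ']']
print(path_violations(firefox))  # prints ['[', ']', '|']
```

Under this sketch all three browsers would trip on '[' and ']', and only
Firefox would additionally trip on '|'.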