On Mon, Aug 29, 2016 at 1:04 PM, Ruediger Pluem <rpl...@apache.org> wrote:
> On 08/29/2016 06:25 PM, William A Rowe Jr wrote:
> > Thanks all for the feedback. Status and follow-up questions inline
> >
> > On Thu, Aug 25, 2016 at 10:02 PM, William A Rowe Jr
> > <wr...@rowe-clan.net> wrote:
> >
> >     4. Should the next 2.4/2.2 releases default to Strict[URI] at all?
> >
> >     Real world direct observation especially appreciated from actual
> >     deployments.
> >
> > Strict (and StrictURI) remain the default.
>
> StrictURI as a default only makes sense if we have our own house in
> order (see above), otherwise it should be opt in.

So it's not only our house [our %3B encoding in httpd isn't a showstopper
here]... but also whether widely used user-agent browsers and tooling have
their houses in order, so I started to study the current browser behaviors.
The applicable spec is https://tools.ietf.org/html/rfc3986#section-3.3

Checked the unreserved set, with '?' and '/' observing special meanings.
Nothing here should become escaped when given as a URI;

http://localhost:8080/unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query

Checked the invalid set of characters, all of which must be encoded per
the spec, and verified that #frag is not passed to the server;

http://localhost:8080/gen-delims-[]/invalid- "<>\^`{|}#frag

Checked the reserved set, including '#' '%' '?' by their encoded values,
to determine whether any unpleasant reverse surprises are lurking;

http://localhost:8080/encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D

Checked a list of unreserved/unassigned gen-delims and sub-delims to
determine whether the user agent normalizes while composing the request;

http://localhost:8080/plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E

Using the simplistic

    $ nc -kl localhost 8080

here are the results I obtained from a couple of current browsers; more
observations and feedback on other user-agents to this list would be
appreciated.
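For anyone who wants to reproduce these observations without netcat, here
is a rough Python stand-in for the `nc -kl localhost 8080` listener above;
it just prints each raw request line as the browser sends it (a sketch of
my test harness, nothing to do with httpd's own code - the host and port
simply mirror the test URIs):

```python
import socket

def request_line(raw):
    """Return just the request line from a raw HTTP request."""
    return raw.split(b"\r\n", 1)[0].decode("iso-8859-1")

def serve_once(host="localhost", port=8080):
    """Accept one connection and print its request line, nc-style."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            print(request_line(conn.recv(65536)))

# Offline example of what gets printed for a captured request:
print(request_line(b"GET /gen-delims-[] HTTP/1.1\r\nHost: localhost\r\n\r\n"))
```

Point the browser at the test URIs, call serve_once() once per request,
and compare the printed request lines against the results below.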
Chrome 53:

GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B%7C%7D HTTP/1.1
  (odd: '[' ']' sent unescaped; '\' converted to '/')
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C-.123ABC_abc~ HTTP/1.1
  (odd: the unreserved octets %2D through %7E were decoded to literals)

Firefox 48:

GET /unreserved-._~/sub-delims-!$&'()*+,;=/gen-delims-:@?query HTTP/1.1
GET /gen-delims-[]/invalid-%20%22%3C%3E/%5E%60%7B|%7D HTTP/1.1
  (odd: '[' ']' and '|' sent unescaped; '\' converted to '/')
GET /encoded-%23%25%2F%3A%3B%3D%3F%40%5B%5C%5D%7B%7C%7D HTTP/1.1
GET /plain-%21%24%26%27%28%29%2A%2B%2C%2D%2E%31%32%33%41%42%43%5F%61%62%63%7E HTTP/1.1
  (odd: no octets were decoded, not even the unreserved set)

The character '\' is converted to a '/' by both browsers, in a nod either
to Microsoft insanity, or a less-accessible '/' key. (Which suggests that
the yen sign might be treated similarly in some jp locales.) While invalid
as a literal '\' character, both browsers support an explicit %5C for
those who really want to use that in a URI. No actual issue here.

Interestingly, the gen-delims '@' and ':' are explicitly allowed by the
section 3.3 grammar (as I've tested above), while '[' and ']' are omitted
and therefore not allowed according to the spec. (On this, StrictURI won't
care yet, because we are simply checking for any valid URI character, not
validating section by section, and '[' ']' are obviously allowed for the
IPv6 port specification - so we don't reject yet.) When we add strict
parsing to the apr uri parsing function, we will trip over this, from all
browsers, in spite of these characters being prohibited and declared
unwise for the past 18 years or more.

The character '|' is also invalid. However, Firefox fails to follow the
spec again here (although Chrome gets it right).
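To make the grammar point concrete, here is a small Python sketch of the
RFC 3986 section 3.3 path character classes - pchar is unreserved /
pct-encoded / sub-delims / ":" / "@", with "/" separating segments, and
'[' ']' '|' simply absent. This is my own illustration, not httpd's
StrictURI implementation:

```python
import string

# Character classes straight from the RFC 3986 collected ABNF.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR_EXTRA = set(":@")  # gen-delims explicitly re-allowed in path segments

def invalid_path_chars(path):
    """Return the literal characters in `path` that the RFC 3986 path
    grammar does not allow (pct-encoded octets are skipped, and assumed
    well-formed, since this sketch only classifies literals)."""
    bad, i = set(), 0
    while i < len(path):
        c = path[i]
        if c == "%":
            i += 3  # skip over a %XX escape
            continue
        if c != "/" and c not in UNRESERVED | SUB_DELIMS | PCHAR_EXTRA:
            bad.add(c)
        i += 1
    return bad

# The Firefox request line above trips over three literals:
print(sorted(invalid_path_chars("/gen-delims-[]/invalid-%5E%60%7B|%7D")))
# -> ['[', ']', '|']
```

Run against the Chrome and Firefox request lines above, this flags exactly
the characters marked odd: '[' ']' from both browsers, plus '|' from
Firefox.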
With respect to these characters, recall this 18 year old document, whose
last paragraph describes the rationale;
https://tools.ietf.org/html/rfc2396.html#section-2.4.3

   unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

   Data corresponding to excluded characters must be escaped in order to
   be properly represented within a URI.

Which replaced https://tools.ietf.org/html/rfc1738#section-2.2 - now
almost 22 years old - without changing the rules;

   Unsafe:

   Characters can be unsafe for a number of reasons. The space character
   is unsafe because significant spaces may disappear and insignificant
   spaces may be introduced when URLs are transcribed or typeset or
   subjected to the treatment of word-processing programs. The characters
   "<" and ">" are unsafe because they are used as the delimiters around
   URLs in free text; the quote mark (""") is used to delimit URLs in
   some systems. The character "#" is unsafe and should always be encoded
   because it is used in World Wide Web and in other systems to delimit a
   URL from a fragment/anchor identifier that might follow it. The
   character "%" is unsafe because it is used for encodings of other
   characters. Other characters are unsafe because gateways and other
   transport agents are known to sometimes modify such characters. These
   characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".

   All unsafe characters must always be encoded within a URL.

While it was labeled 'unsafe', then 'unwise', and is now
disallowed-by-omission from RFC3986, the 'must' designation couldn't have
been any clearer. We've had this right for 2 decades at httpd.

The second paragraph of https://tools.ietf.org/html/rfc3986#appendix-D.1
goes into some detail about this change, and while it is hard to parse,
the paragraph states that '[' ']' were once invalid, are now reserved, and
remain disallowed in all other path segments and use cases.
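As a quick sanity check of the "must always be encoded" rule, any
conforming producer escapes the whole unsafe/unwise set. Using Python's
stdlib quote() here purely as a convenient stand-in for such a producer
(a sketch, not httpd code):

```python
from urllib.parse import quote

# The RFC 2396 'unwise' set; safe="" so not even '/' is exempted.
unwise = "{}|\\^[]`"
print(quote(unwise, safe=""))
# -> %7B%7D%7C%5C%5E%5B%5D%60
```

Every member of the set comes back as a %XX escape - exactly what the
browsers above should have sent, and what Firefox fails to do for '|'.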
The upshot: right now StrictURI will accept '[' and ']', but this won't
survive a rewrite of the apr parser operating with a 'strict' toggle.
StrictURI does not accept '|'. The remaining question is what to do, if
anything, about carving out a specific exception here due to the modern
Firefox issue.

Thoughts/Comments/Additional test data? TIA!