On Fri, Jul 12, 2019 at 02:13:00PM +0200, Helmut Grohne wrote:
> Hi,
>
> On Thu, Jul 11, 2019 at 02:38:19AM +0200, OHNO Tetsuji wrote:
> > lighttpd server is returnd "400 Bad Request", if %C0 (or any other
> > char.) is included in the URL.
> >
> > for example,
> > http://localhost/index.lighttpd.html : return OK (display index page)
> > http://localhost/index.lighttpd.html?%C0 : 400 Bad Request
> > http://localhost/index.lighttpd.html?%C1 : 400 Bad Request
> > http://localhost/index.lighttpd.html?%C2 : OK
> >
> > I can't understand this behavior.
>
> Thank you for the detailed report. I don't fully understand this either
> and am thus Ccing Glenn Strauss (upstream).

https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
"
The standard specifies that the correct encoding of a code point use
only the minimum number of bytes required to hold the significant bits
of the code point. Longer encodings are called overlong and are not
valid UTF-8 representations of the code point. This rule maintains a
one-to-one correspondence between code points and their valid
encodings, so that there is a unique valid encoding for each code
point. This ensures that string comparisons and searches are
well-defined.
"

https://tools.ietf.org/html/rfc3986#section-2.5
"
When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the
data should first be encoded as octets according to the UTF-8 character
encoding [STD63]; then only those octets that do not correspond to
characters in the unreserved set should be percent-encoded.
"

tl;dr: URIs must contain valid UTF-8, including percent-encoded bytes
of UTF-8 chars, as required above.

C0 might be part of the byte sequence C0 80, which is an overlong UTF-8
encoding of the NUL character. In the wrong contexts, this might be
abused in a truncation attack if C0 80 in the middle of a string were
interpreted as '\0'.

Both C0 and C1 bytes are part of overlong UTF-8 encodings, and are not
part of any UTF-8 encodings using the minimum number of bytes, as
required by the standard. Therefore, lighttpd rejects those
percent-encoded bytes when looking for potentially malicious bytes in
URLs.

If you are storing binary data in a URL and naively percent-encode the
bytes, doing so is not guaranteed to produce valid UTF-8. Please
consider a different encoding for your binary data, such as base64
modified to use URL-safe chars.

Cheers, Glenn
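
P.S. A rough illustration of the two points above, as a minimal Python
sketch. This is not lighttpd's actual check; the helper names
(is_valid_utf8, urlsafe_token, from_urlsafe_token) are made up for this
example. It uses Python's strict UTF-8 decoder, which rejects overlong
forms such as C0 80, and the standard library's URL-safe base64
alphabet for carrying arbitrary bytes. Note that the strict decoder is
stricter than the behavior in the report: it also rejects a lone %C2 as
a truncated sequence, whereas lighttpd answered OK for that URL.

```python
import base64
from urllib.parse import unquote_to_bytes


def is_valid_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded bytes are well-formed UTF-8.

    Hypothetical helper for illustration; not lighttpd's implementation.
    """
    raw = unquote_to_bytes(percent_encoded)
    try:
        # Python's strict decoder rejects overlong encodings, so the
        # bytes C0 and C1 can never appear in input that decodes cleanly.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def urlsafe_token(data: bytes) -> str:
    """Encode arbitrary bytes with the URL-safe base64 alphabet, no padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def from_urlsafe_token(token: str) -> bytes:
    """Reverse urlsafe_token(), restoring the stripped '=' padding first."""
    padding = "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(token + padding)


if __name__ == "__main__":
    for sample in ("%C0%80", "%C0", "%C1", "%C2%A9"):
        print(sample, is_valid_utf8(sample))

    binary = bytes([0xC0, 0xC1, 0xFF, 0x00])
    token = urlsafe_token(binary)
    print(token, from_urlsafe_token(token) == binary)
```

The token produced by urlsafe_token() contains only letters, digits,
'-' and '_', so it can go into a query string without any
percent-encoding at all.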