On Mon, Jan 14, 2013 at 1:36 PM, Sungjin Chun <chu...@gmail.com> wrote:
> As far as I know, revised RFC permits UTF-8 characters in the URL without
> encoding. Am I wrong here?

The latest URI RFC is 3986.  The relevant description in prose is:

   Local names, such as file system names, are stored with a local
   character encoding.  URI producing applications (e.g., origin
   servers) will typically use the local encoding as the basis for
   producing meaningful names.  The URI producer will transform the
   local encoding to one that is suitable for a public interface and
   then transform the public interface encoding into the restricted
   set of URI characters (reserved, unreserved, and percent-encodings).
   Those characters are, in turn, encoded as octets to be used as a
   reference within a data format (e.g., a document charset), and such
   data formats are often subsequently encoded for transmission over
   Internet protocols.

The relevant parts of the BNF are:

   pct-encoded   = "%" HEXDIG HEXDIG

   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                   ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

Thus you can't use raw non-ASCII bytes in a URI - they must be
percent-encoded, and how the escaped octets are interpreted is up to the
origin (and is overwhelmingly UTF-8 these days).  A couple of small
sketches at the end of this mail make this concrete.

> Even Solr (the search engine) permits them.

It would of course be possible for any tool or web server to accept URIs
containing non-ASCII bytes, but I don't know of any browsers which would
_send_ such a request, because in general it would be rejected.  I tried
a non-ASCII search on whitehouse.gov (which uses Solr) and indeed it
generated a percent-encoded query.  My browser (Chrome) rendered the
percent escapes as UTF-8 for me, though.

There's also punycode, which can be used to represent Unicode domain
names (which otherwise don't even allow percent escapes).  In some cases
certain browsers will render this for you (generally if the encoded
script matches the country-code top-level domain, e.g. for a .kr domain
Hangul would be shown), but in general it's a dangerous extension
because it makes phishing attempts easier.
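Just to make the grammar concrete, here's a minimal sketch in Python
(only an illustration, not Chicken code; the regex is hand-translated
from the pchar rule above, so treat it as an approximation rather than
a real URI parser):

    import re

    # pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
    SEGMENT = re.compile("^" + PCHAR + "*$")

    print(bool(SEGMENT.match("caf%C3%A9")))  # True:  percent-encoded UTF-8 is a valid segment
    print(bool(SEGMENT.match("café")))       # False: a raw non-ASCII character is not a pchar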
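And the percent-encoding step itself, roughly what a browser does before
sending a non-ASCII query (again only a sketch; the Hangul string is an
arbitrary example):

    from urllib.parse import quote, unquote

    s = "한글"                    # two Hangul syllables
    encoded = quote(s)            # the UTF-8 octets, percent-encoded
    print(encoded)                # %ED%95%9C%EA%B8%80
    print(unquote(encoded) == s)  # True: decoding the escapes as UTF-8 round-trips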
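Finally the punycode case (a sketch using Python's built-in "idna"
codec, which implements the older IDNA 2003 rules; the domain is a
made-up example):

    domain = "한글.kr"               # hypothetical Hangul domain, just for illustration
    ace = domain.encode("idna")      # ASCII-compatible form: "xn--" + punycode, per label
    print(ace)
    print(ace.startswith(b"xn--"))   # True: the Hangul label becomes an xn-- label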
--
Alex

_______________________________________________
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users