Marc-Andre Lemburg <[EMAIL PROTECTED]> added the comment:

On 2008-08-07 23:17, Bill Janssen wrote:
> Bill Janssen <[EMAIL PROTECTED]> added the comment:
>
> My main fear with this patch is that "unquote" will become seen as
> unreliable, because naive software trying to parse URLs will encounter
> uses of percent-encoding where the encoded octets are not in fact UTF-8
> bytes. They're just some set of bytes.

unquote will have to be able to deal with old-style URLs that use the
Latin-1 encoding. HTML uses (or used to use) Latin-1 as its default
encoding, and that's how URLs ended up using it as well:

http://www.w3schools.com/TAGS/ref_urlencode.asp

I'd suggest having it first try UTF-8 decoding and then fall back to
Latin-1 decoding.

> A secondary concern is that it
> will invisibly produce invalid data, because it decodes some
> non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> as the wrong string value.

It's rather unlikely that someone will have used a Latin-1 encoded URL
which happens to decode as valid UTF-8: the valid UTF-8 combinations
don't really make any sense when used as text, e.g. Ã¤Ã¶Ã¼

> Now, I have to confess that I don't know how common these use cases are
> in actual URL usage. It would be nice if there was some organization
> that had a large collection of URLs, and could provide a test set we
> could run a scanner over :-).
>
> As a workaround, though, I've sent a message off to Larry Masinter to
> ask about this case. He's one of the authors of the URI spec.

----------
nosy: +lemburg

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________
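
For illustration only, the "try UTF-8 first, then fall back to Latin-1" suggestion could be sketched roughly as follows. This is not the patch under discussion: the helper name `unquote_with_fallback` is hypothetical, and the sketch assumes the `urllib.parse.unquote_to_bytes` function as it exists in current Python 3.

```python
from urllib.parse import unquote_to_bytes


def unquote_with_fallback(s):
    """Percent-decode s, trying UTF-8 first and falling back to Latin-1.

    Hypothetical helper for illustration; not part of the standard library.
    """
    raw = unquote_to_bytes(s)  # turn %XX escapes into raw octets
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


print(unquote_with_fallback("caf%C3%A9"))  # octets are valid UTF-8
print(unquote_with_fallback("caf%E9"))     # lone 0xE9 is not valid UTF-8
```

Both calls should give the same string, since the second falls through to the Latin-1 branch. The ambiguous case from the secondary concern still exists in this sketch: `%C3%A4` is always read as UTF-8 ("ä"), even if the author of the URL meant the two Latin-1 characters "Ã¤".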