Matt Giuca <[EMAIL PROTECTED]> added the comment:

New patch (patch10). Details are on the Rietveld review tracker (http://codereview.appspot.com/2827).
Another update on the remaining "outstanding issues":

Resolved issues since last time:

> Should unquote accept a bytes/bytearray as well as a str?

No. But see below.

> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?

Implemented Antoine's fix ("or 'ascii'").

> Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?

No.

New issues:

unquote_to_bytes doesn't cope well with non-ASCII characters in a str input (it currently encodes them as UTF-8 - there is not a lot we can do, since this is a str->bytes operation). However, we can allow it to accept bytes as input (while unquote does not), in which case it preserves the bytes precisely. Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

I have *implemented* that suggestion, so unquote_to_bytes now accepts either a bytes or a str, while unquote accepts only a str. No changes need to be made unless there is disagreement on that decision.

I also emailed Barry Warsaw about the email/utils.py patch (because we weren't sure exactly what that code was doing). However, I'm sure that this patch isn't breaking anything there, because I call unquote with encoding="latin-1", which is the same behaviour as the current head.

Those are all the issues I have left over in this patch.

Attaching patch 10 (for revision 65675).

Commit log for patch 10:

Fix for issue 3300.

urllib.parse.unquote:
  Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1).
  Fixed a bug in which mixed-case hex digits (such as "%aF") weren't being decoded at all.

urllib.parse.quote:
  Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8).
  Characters/bytes outside the ASCII range (128 and above) are no longer allowed to be "safe".
  Now accepts either bytes or strings.
  Optimised "Quoter"; now inherits from defaultdict.

Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.

Doc/library/urllib.parse.rst:
  Updated docs on quote and unquote to reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py:
  Added many new test cases testing encoding and decoding Unicode strings with various encodings, as well as testing the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py:
  Updated and added test cases to deal with UTF-8-encoded URIs.

Lib/email/utils.py:
  Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the email module is dependent upon).

Added file: http://bugs.python.org/file11111/parse.py.patch10

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________
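
For illustration, the interface described in the commit log can be exercised roughly as follows. This is a sketch written against urllib.parse as it ships in released Python 3, not code taken from the patch; the variable names are made up for the example, and the patched revision itself may differ in details (for instance, later Python releases relaxed the str-only restriction on unquote).

```python
from urllib.parse import quote, unquote, quote_from_bytes, unquote_to_bytes

# quote: non-ASCII characters are encoded as UTF-8 by default,
# or with a caller-chosen encoding.
utf8_quoted = quote("é")                        # '%C3%A9'
latin1_quoted = quote("é", encoding="latin-1")  # '%E9'

# unquote: percent-encoded octets are decoded as UTF-8 by default,
# or with a caller-chosen encoding (as email/utils.py does with latin-1).
utf8_unquoted = unquote("%C3%A9")                      # 'é'
latin1_unquoted = unquote("%E9", encoding="latin-1")   # 'é'

# Mixed-case hex digits such as "%aF" are decoded like any other escape.
mixed_case = unquote("%aF", encoding="latin-1")        # '\xaf'

# The bytes-oriented functions skip character encoding entirely.
# unquote_to_bytes accepts either a str or a bytes argument.
raw_from_str = unquote_to_bytes("%aF")      # b'\xaf'
raw_from_bytes = unquote_to_bytes(b"%aF")   # b'\xaf'
requoted = quote_from_bytes(b"\xaf")        # '%AF'
```

The bytes round trip (unquote_to_bytes followed by quote_from_bytes) is the one to reach for when the octets are not known to be text in any particular encoding.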