[issue3300] urllib.quote and unquote - Unicode issues

Matt Giuca Thu, 14 Aug 2008 08:28:48 -0700

Matt Giuca <[EMAIL PROTECTED]> added the comment:

New patch (patch10). Details on Rietveld review tracker
(http://codereview.appspot.com/2827).


Another update on the remaining "outstanding issues":

Resolved issues since last time:

> Should unquote accept a bytes/bytearray as well as a str?
No. But see below.

> Lib/email/utils.py:
> Should encode_rfc2231 with charset=None accept strings with non-ASCII
> characters, and just encode them to UTF-8?
Implemented Antoine's fix ("or 'ascii'").

> Should quote accept safe characters outside the
> ASCII range (thereby potentially producing invalid URIs)?
No.

New issues:

unquote_to_bytes has no good way to handle non-ASCII characters in a str
input (it currently encodes them as UTF-8; there is not much else we can
do, since this is a str->bytes operation). However, we can allow it to
accept bytes as input (while unquote does not), in which case it
preserves the bytes exactly.
Discussion at http://codereview.appspot.com/2827/diff/82/84, line 265.

I have *implemented* that suggestion - so unquote_to_bytes now accepts
either a bytes or str, while unquote accepts only a str. No changes need
to be made unless there is disagreement on that decision.
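
A minimal sketch of that behaviour, assuming unquote_to_bytes behaves as it
does in the urllib.parse shipped with released Python 3 (the patch itself
may differ in detail; the variable names are only for illustration):

```python
from urllib.parse import unquote_to_bytes

# Percent-escapes decode to the same raw bytes for str and bytes input.
from_str = unquote_to_bytes("caf%C3%A9")
from_bytes = unquote_to_bytes(b"caf%C3%A9")

# A non-ASCII character in a str has no byte value of its own,
# so it is encoded as UTF-8 before unquoting.
non_ascii_str = unquote_to_bytes("caf\u00e9")

# Bytes input is passed through untouched, even when it is not valid UTF-8;
# only the percent-escape (%20) is decoded.
raw = unquote_to_bytes(b"caf\xe9%20")

print(from_str, from_bytes, non_ascii_str, raw)
```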

I also emailed Barry Warsaw about the email/utils.py patch (because we
weren't sure exactly what that code was doing). However, I'm sure that
this patch isn't breaking anything there, because I call unquote with
encoding="latin-1", which is the same behaviour as the current head.

That's all the issues I have left over in this patch.

Attaching patch 10 (for revision 65675).

Commit log for patch 10:

Fix for issue 3300.

urllib.parse.unquote:
  Added "encoding" and "errors" optional arguments, allowing the caller
  to determine the decoding of percent-encoded octets.
  As per RFC 3986, default is "utf-8" (previously implicitly decoded
  as ISO-8859-1).
  Fixed a bug in which mixed-case hex digits (such as "%aF") weren't
  being decoded at all.
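
  A short illustration of the new arguments, assuming the behaviour of
  unquote in released Python 3 (where "errors" defaults to "replace"):

```python
from urllib.parse import unquote

# Default: percent-encoded octets are decoded as UTF-8.
utf8_default = unquote("%E2%82%AC")           # the euro sign, U+20AC

# The caller can ask for a different decoding of the same kind of octets.
latin1 = unquote("%E9", encoding="latin-1")   # U+00E9

# Mixed-case hex digits are decoded like any other escape.
mixed = unquote("%aF", encoding="latin-1")    # U+00AF

# A lone 0xE9 is not valid UTF-8; with the default error handler it
# becomes the replacement character U+FFFD instead of raising.
replaced = unquote("%E9")

# errors="strict" turns the same input into an exception.
try:
    unquote("%E9", errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```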

urllib.parse.quote:
  Added "encoding" and "errors" optional arguments, allowing the
  caller to determine the encoding of non-ASCII characters
  before being percent-encoded.
  Default is "utf-8" (previously characters in range(128, 256)
  were encoded as ISO-8859-1, and characters above that as UTF-8).
  Characters/bytes with values of 128 or above (i.e. non-ASCII) are no
  longer allowed to be "safe".
  Now allows either bytes or strings.
  Optimised "Quoter"; now inherits from defaultdict.
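
  A short illustration of the quoting side, again assuming the behaviour
  of released Python 3 rather than this exact patch:

```python
from urllib.parse import quote, quote_from_bytes

# Default: non-ASCII characters are encoded as UTF-8, then percent-encoded.
utf8_default = quote("caf\u00e9")

# The caller can pick another encoding for the same character.
latin1 = quote("caf\u00e9", encoding="latin-1")

# Bytes are accepted too; each non-safe byte is percent-encoded as-is.
from_bytes = quote(b"caf\xe9")
explicit_bytes = quote_from_bytes(b"a b\xff")

# A non-ASCII character listed in "safe" is ignored, so it is still quoted
# and the result stays pure ASCII.
not_safe = quote("caf\u00e9", safe="\u00e9")
```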

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.
All quote/unquote functions now exported from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the email
module is dependent upon).
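
The reason latin-1 preserves the old behaviour is that it maps every code
point from U+0000 to U+00FF to exactly one byte and back. A quick
round-trip check of that property, as a sketch using only quote and unquote
(it does not exercise the email module itself):

```python
from urllib.parse import quote, unquote

# Every character that latin-1 can represent, one per byte value.
text = "".join(chr(i) for i in range(256))

# Quote with no safe characters so every non-alphanumeric byte is escaped.
quoted = quote(text, safe="", encoding="latin-1")

# Decoding with the same encoding restores the original string exactly.
restored = unquote(quoted, encoding="latin-1")

print(restored == text)
```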

Added file: http://bugs.python.org/file11111/parse.py.patch10

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3300>
_______________________________________