[issue4426] UTF7 decoding is far too strict
STINNER Victor victor.stin...@haypocalc.com added the comment: Copy of msg76404 from duplicate issue (#4425): '/'.encode('utf7') returns '+AC8-'. It should return '/'. See RFC 2152. '/'.decode('utf7') raises an exception (this is a special case of a general problem with UTF-7 decoding, which I will report as a separate bug). -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
STINNER Victor victor.stin...@haypocalc.com added the comment: I updated the patch to python trunk. I was hard because patch -p1 failed at many places, so please double check the updated patch. Changes from utf7patch: * I removed the test c =0 in DECODE_DIRECT(c) because c type is Py_UNICODE and the type is always unsigned (right?) * I added shiftOutStart = p; at the beginning of PyUnicode_DecodeUTF7Stateful() to fix the gcc warning reported by pitrou. * I fixed the failing test in test_unicode.py (also reported by pitrou) * I added the tests listed in msg76414 (by nick) I didn't read the UTF-7 encoder or decoder code. I just updated the patch. -- Added file: http://bugs.python.org/file13867/issue4426.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Changes by STINNER Victor victor.stin...@haypocalc.com: Removed file: http://bugs.python.org/file13867/issue4426.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
STINNER Victor victor.stin...@haypocalc.com added the comment: (oops, i stripped spaces in my last patch) -- Added file: http://bugs.python.org/file13868/issue4426.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Antoine Pitrou pit...@free.fr added the comment: A quick comment on the patch: it seems to invert (quite futily I'd say) the meaning of the arguments given to PyUnicode_EncodeUTF7, which is an incompatible API change. I'm in favour of reworking this patch in order to keep the original API. If I'm not mistaken, it's just a matter of using the boolean complement of these two arguments (encodeSetO, encodeWhiteSpace) inside the body of PyUnicode_EncodeUTF7. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Nick Barnes nick.bar...@pobox.com added the comment: This was my first contribution to Python. I don't know what the rules are on changing the arguments of an internal function such as PyUnicode_EncodeUTF7(). Since I was rewriting the whole function anyway, I tried to give it arguments which made more sense with respect to the defining RFC. If you want us to revert to the original meanings of these arguments (so the third argument means use base-64 encoding for characters in set O, and not use direct encoding for characters in set O), please can we have better names for them? The name encodeSetO was meaningless: the function encodes *all* the characters in the string. What the argument specifies is whether the set O characters are self-encoded or base-64 encoded. So maybe it should be called base64SetO. Ditto for the encodeWhiteSpace name. Here's a trunk patch with the meaning of those parameters reverted, and better names. -- Added file: http://bugs.python.org/file13872/python-unicode-trunk-patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Antoine Pitrou pit...@free.fr added the comment: Thanks for the update. Functions like PyUnicode_EncodeUTF7() are part of the public C API, therefore their semantics can't be changed lightly. The patch looks ok to me, apart from minor style issues. -- assignee: - pitrou resolution: - accepted versions: +Python 2.7, Python 3.1 -Python 2.6 ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Antoine Pitrou pit...@free.fr added the comment: Committed in r72283 and r72285. Thanks! -- resolution: accepted - fixed status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Antoine Pitrou pit...@free.fr added the comment: I'm not in a position to comment on the encoding algorithm itself but I have a couple of comments: * I get the following compilation warning: Objects/unicodeobject.c: In function ‘PyUnicode_DecodeUTF7Stateful’: Objects/unicodeobject.c:1531: attention : ‘shiftOutStart’ may be used uninitialized in this function * shouldn't there be an additional test for the '/' behaviour (unless it is already there and I have overlooked it)? * your patch fails on another test: Traceback (most recent call last): File Lib/test/test_unicode.py, line 538, in test_codecs_utf7 self.assertRaises(UnicodeDecodeError, '+\xc1'.decode, 'utf-7') AssertionError: UnicodeDecodeError not raised -- keywords: +needs review, patch priority: - critical stage: - patch review ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Changes by Gabriel Genellina [EMAIL PROTECTED]: -- nosy: +gagenellina ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Nick Barnes [EMAIL PROTECTED] added the comment: My original defect report here was incorrect, or possibly only relates to a particular older Python installation. It is still the case that UTF-7 decoding is fussier than it need be (decoding should be permissive), and is broken specifically for the '/' character (ASCII 47). I'm probably going to rewrite the whole codec for greater clarity and decoding permissiveness. Any UTF-7 encoder has two boolean parameters: whether to base-64 encode whitespace (sp, ht, nl, cr), and whether to base-64 encode set O characters. The existing Python UTF-7 encoder says no to both of these. It would be useful to have them as options. How should encoding parameters such as these be passed? As setstate() methods on the IncrementalEncoder and StreamWriter objects? Or should I provide four separate codecs (retaining the existing behaviour in the 'utf7' codec, of course). ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Nick Barnes [EMAIL PROTECTED] added the comment: Here is my patch. This is a rewrite of the UTF7 encoder and decoder. It now handles surrogate pairs correctly, so non-BMP characters work with this codec. And my motivating example ('/'.decode('utf7')) works OK. I'm not totally confident of the error-handling code here, but in 'strict' mode this is definitely a better codec than the one it replaces. This patch is based on the Python 2.6 distribution. Added file: http://bugs.python.org/file12179/utf7patch ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Changes by Antoine Pitrou [EMAIL PROTECTED]: -- nosy: +pitrou ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment: On 2008-11-25 19:56, Nick Barnes wrote: Nick Barnes [EMAIL PROTECTED] added the comment: Well, I could submit a diff for unicodeobject.c, but I have never contributed to Python (or used this particular tracking system) before. Is there a standard form for contributing changes? Unified diff? Please send unified diffs and attach them to the ticket. While you're at it: perhaps you could try to refactor the code a bit in order to get rid off the many macros use throughout the codec. They make the code a lot less readable. Thanks, -- Marc-Andre Lemburg eGenix.com 2008-11-12: Released mxODBC.Connect 0.9.3 http://python.egenix.com/ Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
New submission from Nick Barnes [EMAIL PROTECTED]: UTF-7 decoding raises an exception for any character not in the RFC2152 Set D (directly encoded characters). In particular, it raises an exception for characters in Set O (optional direct characters), such as = [ ] @ etc. These characters can legitimately appear in UTF-7-encoded text, and should be decoded (as themselves). As it is, the UTF-7 decoder can't reliably be used to decode any UTF-7 text other than that encoded by Python's own UTF-7 encoder. Looking at the source of unicodeobject.c, the call to the SPECIAL macro on line 1009 has hardcoded second and third arguments of zero. Maybe changing the second argument to 1 would fix this. Maybe. -- components: Unicode messages: 76405 nosy: Nick Barnes severity: normal status: open title: UTF7 decoding is far too strict type: behavior versions: Python 2.6 ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
STINNER Victor [EMAIL PROTECTED] added the comment: Can you write some tests to help fixing this issue? Stupid example (I don't know UTF-8 encoding): all((byte.encode(utf-7) == byte) for byte in '=[]@') all((byte.decode(utf-7) == byte) for byte in '=[]@') -- nosy: +haypo ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Nick Barnes [EMAIL PROTECTED] added the comment: # Note, this test covers issues 4425 and 4426 # Direct encoded characters: set_d = ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'(),-./:? # Optional direct characters: set_o = '!#$%*;=@[]^_`{|}' all((c.encode('utf7') == c) for c in set_d) all((c.decode('utf7') == c) for c in set_d) all((c.decode('utf7') == c) for c in set_o) ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment: On 2008-11-25 12:11, Nick Barnes wrote: New submission from Nick Barnes [EMAIL PROTECTED]: UTF-7 decoding raises an exception for any character not in the RFC2152 Set D (directly encoded characters). In particular, it raises an exception for characters in Set O (optional direct characters), such as = [ ] @ etc. These characters can legitimately appear in UTF-7-encoded text, and should be decoded (as themselves). As it is, the UTF-7 decoder can't reliably be used to decode any UTF-7 text other than that encoded by Python's own UTF-7 encoder. Thanks for noticing this. Apparently, the UTF-7 codec is not used a lot by Python users, since it's been like this for years. The tests we have do check round-trip safety, but not the special characteristics of the UTF-7 codec. Also note that the code for the codec was contributed and is, AFAIK, not maintained by any of the Python developers. -- nosy: +lemburg ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
Nick Barnes [EMAIL PROTECTED] added the comment: Well, I could submit a diff for unicodeobject.c, but I have never contributed to Python (or used this particular tracking system) before. Is there a standard form for contributing changes? Unified diff? ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4426] UTF7 decoding is far too strict
STINNER Victor [EMAIL PROTECTED] added the comment: Is there a standard form for contributing changes? Unified diff? Attach a patch file (unified diff, yes). ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4426 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com