[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-16 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

The error handling in unicode_escape_decode() consumes one or more bytes 
following an illegal escape sequence, so those bytes are lost from the output.

>>> import codecs
>>> codecs.unicode_escape_decode(br'\u!@#', 'replace')
('�', 5)
>>> codecs.unicode_escape_decode(br'\u!@#$', 'replace')
('�@#$', 6)

raw_unicode_escape_decode() works correctly:

>>> codecs.raw_unicode_escape_decode(br'\u!@#', 'replace')
('�!@#', 5)
>>> codecs.raw_unicode_escape_decode(br'\u!@#$', 'replace')
('�!@#$', 6)
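
One way to see exactly which bytes the decoder hands to the error handler is to register a custom handler that prints the offending span. The following is a minimal sketch for Python 3; the handler name "demo.show_span" and the function show_span are hypothetical and not part of the report. It assumes an interpreter that already contains the fix, in which case only the truncated escape itself is reported and the bytes after it are kept.

```python
import codecs


def show_span(exc):
    """Hypothetical handler: replace the offending bytes with their hex dump."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Resume decoding right after the span the decoder reported as bad.
    return ("[" + bad.hex() + "]", exc.end)


# The handler name is an arbitrary label chosen for this sketch.
codecs.register_error("demo.show_span", show_span)

text, consumed = codecs.unicode_escape_decode(br"\u!@#", "demo.show_span")
# With the fix applied, the bytes following the bad escape stay in the output.
print(repr(text), consumed)
```

On an affected interpreter the printed span would also include bytes after the escape, which is the behavior this report describes.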

See also issue16975.

--
assignee: serhiy.storchaka
components: Unicode
messages: 180077
nosy: ezio.melotti, serhiy.storchaka
priority: normal
severity: normal
stage: needs patch
status: open
title: Broken error handling in codecs.unicode_escape_decode()
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a patch for 3.4. Patches for the other versions will differ considerably.

--
dependencies: +SystemError in codecs.unicode_escape_decode()
keywords: +patch
stage: needs patch -> patch review
Added file: 
http://bugs.python.org/file28752/unicode_escape_decode_error_handling-3.4.patch




[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-25 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


Removed file: 
http://bugs.python.org/file28752/unicode_escape_decode_error_handling-3.4.patch




[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-25 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a set of patches for all versions (patch for 3.4 updated).

--
Added file: 
http://bugs.python.org/file28833/unicode_escape_decode_error_handling-2.7.patch
Added file: 
http://bugs.python.org/file28834/unicode_escape_decode_error_handling-3.2.patch
Added file: 
http://bugs.python.org/file28835/unicode_escape_decode_error_handling-3.3.patch
Added file: 
http://bugs.python.org/file28836/unicode_escape_decode_error_handling-3.4.patch

diff -r 5970c90dd8d1 Lib/test/test_codeccallbacks.py
--- a/Lib/test/test_codeccallbacks.py   Fri Jan 25 23:30:50 2013 +0200
+++ b/Lib/test/test_codeccallbacks.py   Sat Jan 26 00:51:30 2013 +0200
@@ -262,12 +262,12 @@
 
         self.assertEqual(
             "\\u3042\u3xxx".decode("unicode-escape", "test.handler1"),
-            u"\u3042[<92><117><51><120>]xx"
+            u"\u3042[<92><117><51>]xxx"
         )
 
         self.assertEqual(
             "\\u3042\u3xx".decode("unicode-escape", "test.handler1"),
-            u"\u3042[<92><117><51><120><120>]"
+            u"\u3042[<92><117><51>]xx"
         )
 
         self.assertEqual(
diff -r 5970c90dd8d1 Lib/test/test_codecs.py
--- a/Lib/test/test_codecs.py   Fri Jan 25 23:30:50 2013 +0200
+++ b/Lib/test/test_codecs.py   Sat Jan 26 00:51:30 2013 +0200
@@ -1786,6 +1786,84 @@
         self.assertEqual(srw.read(), u"\xfc")
 
 
+class UnicodeEscapeTest(unittest.TestCase):
+    def test_empty(self):
+        self.assertEqual(codecs.unicode_escape_encode(u""), ("", 0))
+        self.assertEqual(codecs.unicode_escape_decode(""), (u"", 0))
+
+    def test_raw_encode(self):
+        encode = codecs.unicode_escape_encode
+        for b in range(32, 127):
+            if b != ord('\\'):
+                self.assertEqual(encode(unichr(b)), (chr(b), 1))
+
+    def test_raw_decode(self):
+        decode = codecs.unicode_escape_decode
+        for b in range(256):
+            if b != ord('\\'):
+                self.assertEqual(decode(chr(b) + '0'), (unichr(b) + u'0', 2))
+
+    def test_escape_encode(self):
+        encode = codecs.unicode_escape_encode
+        self.assertEqual(encode(u'\t'), (r'\t', 1))
+        self.assertEqual(encode(u'\n'), (r'\n', 1))
+        self.assertEqual(encode(u'\r'), (r'\r', 1))
+        self.assertEqual(encode(u'\\'), (r'\\', 1))
+        for b in range(32):
+            if chr(b) not in '\t\n\r':
+                self.assertEqual(encode(unichr(b)), ('\\x%02x' % b, 1))
+        for b in range(127, 256):
+            self.assertEqual(encode(unichr(b)), ('\\x%02x' % b, 1))
+        self.assertEqual(encode(u'\u20ac'), (r'\u20ac', 1))
+        self.assertEqual(encode(u'\U0001d120'),
+                         (r'\U0001d120', len(u'\U0001d120')))
+
+    def test_escape_decode(self):
+        decode = codecs.unicode_escape_decode
+        self.assertEqual(decode("[\\\n]"), (u"[]", 4))
+        self.assertEqual(decode(r'[\"]'), (u'["]', 4))
+        self.assertEqual(decode(r"[\']"), (u"[']", 4))
+        self.assertEqual(decode(r"[\\]"), (ur"[\]", 4))
+        self.assertEqual(decode(r"[\a]"), (u"[\x07]", 4))
+        self.assertEqual(decode(r"[\b]"), (u"[\x08]", 4))
+        self.assertEqual(decode(r"[\t]"), (u"[\x09]", 4))
+        self.assertEqual(decode(r"[\n]"), (u"[\x0a]", 4))
+        self.assertEqual(decode(r"[\v]"), (u"[\x0b]", 4))
+        self.assertEqual(decode(r"[\f]"), (u"[\x0c]", 4))
+        self.assertEqual(decode(r"[\r]"), (u"[\x0d]", 4))
+        self.assertEqual(decode(r"[\7]"), (u"[\x07]", 4))
+        self.assertEqual(decode(r"[\8]"), (ur"[\8]", 4))
+        self.assertEqual(decode(r"[\78]"), (u"[\x078]", 5))
+        self.assertEqual(decode(r"[\41]"), (u"[!]", 5))
+        self.assertEqual(decode(r"[\418]"), (u"[!8]", 6))
+        self.assertEqual(decode(r"[\101]"), (u"[A]", 6))
+        self.assertEqual(decode(r"[\1010]"), (u"[A0]", 7))
+        self.assertEqual(decode(r"[\x41]"), (u"[A]", 6))
+        self.assertEqual(decode(r"[\x410]"), (u"[A0]", 7))
+        self.assertEqual(decode(r"\u20ac"), (u"\u20ac", 6))
+        self.assertEqual(decode(r"\U0001d120"), (u"\U0001d120", 10))
+        for b in range(256):
+            if chr(b) not in '\n"\'\\abtnvfr01234567xuUN':
+                self.assertEqual(decode('\\' + chr(b)),
+                                 (u'\\' + unichr(b), 2))
+
+    def test_decode_errors(self):
+        decode = codecs.unicode_escape_decode
+        for c, d in ('x', 2), ('u', 4), ('U', 4):
+            for i in range(d):
+                self.assertRaises(UnicodeDecodeError, decode,
+                                  "\\" + c + "0"*i)
+                self.assertRaises(UnicodeDecodeError, decode,
+                                  "[\\" + c + "0"*i + "]")
+                data = "[\\" + c + "0"*i + "]\\" + c + "0"*i
+                self.assertEqual(decode(data, "ignore"), (u"[]", le

[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-28 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Ezio, is this a good factorization?

def check(self, coder):
    def checker(input, expect):
        self.assertEqual(coder(input), (expect, len(input)))
    return checker

def test_escape_decode(self):
    decode = codecs.unicode_escape_decode
    check = self.check(decode)
    check(b"[\\\n]", "[]")
    check(br'[\"]', '["]')
    check(br"[\']", "[']")
    # other 20 checks ...

The same would apply to test_escape_encode and to the bytes escape decoder.




[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-28 Thread Ezio Melotti

Ezio Melotti added the comment:

LGTM.
If you want to push it even further, you could make a list of (input, expected) 
pairs and call check() in a loop.  That way it will also be easier to refactor 
if/when we add subtests (#16997).
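
A table-driven version of the test along these lines might look like the sketch below, written for Python 3. The class name and the CASES attribute are hypothetical; the three pairs are the checks quoted earlier in this thread, and the remaining checks would be added to the list in the same way.

```python
import codecs
import unittest


class UnicodeEscapeTableTest(unittest.TestCase):
    # (input, expected) pairs; the class and attribute names are illustrative.
    CASES = [
        (b"[\\\n]", "[]"),
        (br'[\"]', '["]'),
        (br"[\']", "[']"),
    ]

    def test_escape_decode(self):
        decode = codecs.unicode_escape_decode
        for data, expected in self.CASES:
            # The decoder returns (decoded text, number of bytes consumed).
            self.assertEqual(decode(data), (expected, len(data)))
```

Saved as a test module, this can be run with `python -m unittest`. One trade-off of the plain loop is that the first failing pair stops the test, so later pairs are not checked.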




[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-29 Thread Roundup Robot

Roundup Robot added the comment:

New changeset a242ac99161f by Serhiy Storchaka in branch '2.7':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/a242ac99161f

New changeset 084bec5443d6 by Serhiy Storchaka in branch '3.2':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/084bec5443d6

New changeset 086defaf16fe by Serhiy Storchaka in branch '3.3':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/086defaf16fe

New changeset 218da678bb8b by Serhiy Storchaka in branch 'default':
Issue #16979: Fix error handling bugs in the unicode-escape-decode decoder.
http://hg.python.org/cpython/rev/218da678bb8b

--
nosy: +python-dev




[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-29 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Until subtests are added, an explicit call looks better to me. When subtests 
are added, we will just add a subtest inside the helper function.
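
As a rough illustration of that plan, here is how the helper could wrap its assertion in a subtest once TestCase.subTest() is available (Python 3.4 and later). This is a sketch under that assumption, not the committed test code; the class name is hypothetical.

```python
import codecs
import unittest


class UnicodeEscapeSubtestDemo(unittest.TestCase):
    def check(self, coder):
        def checker(data, expected):
            # Each call becomes its own subtest, so one failing input
            # is reported without hiding the checks that follow it.
            with self.subTest(data=data):
                self.assertEqual(coder(data), (expected, len(data)))
        return checker

    def test_escape_decode(self):
        check = self.check(codecs.unicode_escape_decode)
        check(b"[\\\n]", "[]")
        check(br'[\"]', '["]')
        check(br"[\']", "[']")
```

The call sites stay as explicit check(...) lines; only the helper changes.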




[issue16979] Broken error handling in codecs.unicode_escape_decode()

2013-01-29 Thread Serhiy Storchaka

Changes by Serhiy Storchaka:


--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed
