[issue24848] Warts in UTF-7 error handling

2015-11-10 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
assignee:  -> serhiy.storchaka
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24848] Warts in UTF-7 error handling

2015-10-10 Thread Roundup Robot

Roundup Robot added the comment:

New changeset ff1366ff2761 by Serhiy Storchaka in branch '2.7':
Issue #24848: Fixed yet one bug in UTF-7 decoder.  Testing for BASE64 character
https://hg.python.org/cpython/rev/ff1366ff2761

--




[issue24848] Warts in UTF-7 error handling

2015-10-08 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The difference between 2.7 and 3.x is that 2.7 uses isalnum() in IS_BASE64, while 
3.x tests concrete ranges. Therefore, depending on the platform and locale, 2.7 
can accept wrong bytes as BASE64 characters and return an incorrect result. The 
following patch makes the 2.7 code the same as in 3.x. The tests are changed so 
that they fail with high probability on unpatched code ('\xe1' is alphanumeric 
in almost all 8-bit locales).
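
As an illustration of the difference, here is a minimal sketch in Python of a range-based test. It is not the actual C macro, and the function name is_base64_byte is hypothetical. Because it compares against fixed ASCII ranges, the result cannot depend on the locale:

```python
def is_base64_byte(c: int) -> bool:
    """Return True if byte value c belongs to the BASE64 alphabet (set B)."""
    return (
        0x41 <= c <= 0x5A      # 'A'..'Z'
        or 0x61 <= c <= 0x7A   # 'a'..'z'
        or 0x30 <= c <= 0x39   # '0'..'9'
        or c in (0x2B, 0x2F)   # '+' and '/'
    )


# '\xe1' may be alphanumeric for a locale-aware isalnum(), but it is
# never a BASE64 character under the range-based test.
print(is_base64_byte(0xE1))
print(is_base64_byte(ord("w")))
```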

--
Added file: http://bugs.python.org/file40719/decode_utf7_locale-2.7.patch




[issue24848] Warts in UTF-7 error handling

2015-10-08 Thread STINNER Victor

STINNER Victor added the comment:

The patch looks good to me.

--




[issue24848] Warts in UTF-7 error handling

2015-10-04 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The test failure is random. With build 3435 the tests pass; with all other 
builds they fail. The same happens on another buildbot: 
http://buildbot.python.org/all/builders/x86%20Windows7%202.7/ . Builds 3345 and 
3347 are green, the others are red.

--




[issue24848] Warts in UTF-7 error handling

2015-10-02 Thread STINNER Victor

STINNER Victor added the comment:

http://buildbot.python.org/all/builders/x86%20XP-4%202.7/builds/3431/steps/test/logs/stdio


==
FAIL: test_errors (test.test_codecs.UTF7Test)
--
Traceback (most recent call last):
  File "d:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_codecs.py", line 709, in test_errors
    self.assertEqual(raw.decode('utf-7', 'replace'), expected)
AssertionError: u'a\u20ac\ufffd' != u'a\u20ac\ufffdb'
- a\u20ac\ufffd
+ a\u20ac\ufffdb
?+


==
FAIL: test_lone_surrogates (test.test_codecs.UTF7Test)
--
Traceback (most recent call last):
  File "d:\cygwin\home\db3l\buildarea\2.7.bolen-windows\build\lib\test\test_codecs.py", line 743, in test_lone_surrogates
    self.assertEqual(raw.decode('utf-7', 'replace'), expected)
AssertionError: u'a\ufffd' != u'a\ufffdb'
- a\ufffd
+ a\ufffdb
?   +

--




[issue24848] Warts in UTF-7 error handling

2015-10-02 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 3c13567ea642 by Serhiy Storchaka in branch '3.4':
Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
https://hg.python.org/cpython/rev/3c13567ea642

New changeset a61fa2b08f87 by Serhiy Storchaka in branch '3.5':
Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
https://hg.python.org/cpython/rev/a61fa2b08f87

New changeset 037253b7cd6d by Serhiy Storchaka in branch 'default':
Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
https://hg.python.org/cpython/rev/037253b7cd6d

New changeset c6eaa722e2c1 by Serhiy Storchaka in branch '2.7':
Issue #24848: Fixed bugs in UTF-7 decoding of misformed data:
https://hg.python.org/cpython/rev/c6eaa722e2c1

--
nosy: +python-dev




[issue24848] Warts in UTF-7 error handling

2015-10-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I have no idea why the tests fail, and only on this buildbot.

--




[issue24848] Warts in UTF-7 error handling

2015-10-02 Thread STINNER Victor

STINNER Victor added the comment:

Oops, ignore my comment, I forgot to recompile Python. After running "make" the 
bug is gone :-)

--




[issue24848] Warts in UTF-7 error handling

2015-10-02 Thread STINNER Victor

STINNER Victor added the comment:

> I have no idea why the tests fail, and only on this buildbot.

test_codecs always crashes on Python 3.6 when Python is compiled in debug mode:

test_errors (test.test_codecs.UTF7Test) ... python: Objects/unicodeobject.c:1263: _copy_characters: Assertion `ch <= to_maxchar' failed.
Fatal Python error: Aborted

Current thread 0x7f1489057700 (most recent call first):
  File "/home/haypo/prog/python/default/Lib/encodings/utf_7.py", line 12 in decode
  File "/home/haypo/prog/python/default/Lib/test/test_codecs.py", line 1021 in test_errors
  File "/home/haypo/prog/python/default/Lib/unittest/case.py", line 600 in run
  File "/home/haypo/prog/python/default/Lib/unittest/case.py", line 648 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 122 in run
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 84 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 122 in run
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 84 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 122 in run
  File "/home/haypo/prog/python/default/Lib/unittest/suite.py", line 84 in __call__
  File "/home/haypo/prog/python/default/Lib/unittest/runner.py", line 176 in run
  File "/home/haypo/prog/python/default/Lib/test/support/__init__.py", line 1775 in _run_suite
  File "/home/haypo/prog/python/default/Lib/test/support/__init__.py", line 1809 in run_unittest
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/runtest.py", line 159 in test_runner
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/runtest.py", line 160 in runtest_inner
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/runtest.py", line 124 in runtest
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 285 in run_tests_sequential
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 344 in run_tests
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 380 in main
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 421 in main
  File "/home/haypo/prog/python/default/Lib/test/libregrtest/main.py", line 443 in main_in_temp_cwd
  File "/home/haypo/prog/python/default/Lib/test/__main__.py", line 3 in <module>
  File "/home/haypo/prog/python/default/Lib/runpy.py", line 85 in _run_code
  File "/home/haypo/prog/python/default/Lib/runpy.py", line 170 in _run_module_as_main
Abandon (core dumped)

--




[issue24848] Warts in UTF-7 error handling

2015-09-27 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The updated patch also fixes a bug in _PyUnicodeWriter. Another affected 
encoding is "unicode-escape":

>>> br'\u;'.decode('unicode-escape', 'replace')
'ý;'

--
Added file: http://bugs.python.org/file40604/utf7_error_handling-2.patch




[issue24848] Warts in UTF-7 error handling

2015-09-27 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
nosy: +pitrou




[issue24848] Warts in UTF-7 error handling

2015-08-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

There is a reason for the behavior in case 2. This is likely truncated data, 
and it is safer to raise an exception than to silently produce a lone 
surrogate. The current UTF-7 encoder always adds '-' after an ending shift 
sequence. I suppose this is not a bug.

However, there are three more bugs.

4. The decoder can emit a lone high surrogate before the replacement character 
in case of an error.

>>> b'+2DTdI-'.decode('utf-7', 'replace')
'\ud834�'

A high surrogate is part of an incomplete astral character and shouldn't be 
emitted when there is an error in the encoded astral character.

5. According to RFC 2152, a "+" character followed immediately by any 
character other than members of set B or "-" is an ill-formed sequence. But 
the current decoder accepts this as an empty shift sequence that is decoded to 
an empty string.

>>> b'a+,b'.decode('utf-7')
'a,b'
>>> b'a+'.decode('utf-7')
'a'

6. The replacement character '\ufffd' can be replaced with the character 'ý' 
('\xfd'):

>>> b'\xff'.decode('utf-7', 'replace')
'�'
>>> b'a\xff'.decode('utf-7', 'replace')
'a�'
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'
>>> b'\xffb'.decode('utf-7', 'replace')
'ýb'

This bug is reproducible only in 3.4+.
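
The inputs from bugs 4 and 6 can be turned into a small check. This is only a sketch: the helper names decode_with_replace and has_surrogate are hypothetical, and the results noted in the comments assume an interpreter that already includes the fixes discussed in this issue.

```python
def decode_with_replace(data: bytes) -> str:
    """Decode UTF-7 data, substituting U+FFFD for undecodable input."""
    return data.decode("utf-7", "replace")


def has_surrogate(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


bug4 = decode_with_replace(b"+2DTdI-")  # bug 4: no lone surrogate should leak out
bug6 = decode_with_replace(b"\xffb")    # bug 6: U+FFFD must not turn into U+00FD

print(ascii(bug4), has_surrogate(bug4))
print(ascii(bug6))
```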

The following patch fixes bugs 1 and 4 and adds more tests.

Corner cases 2 and 3 are likely not bugs.

I have doubts about fixing bug 5. iconv accepts such ill-formed sequences. In 
any case, I think a fix for this bug can be applied only to the default branch.
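
To make the rule from bug 5 concrete, here is a rough pure-Python scanner that reports where a "+" opens an ill-formed sequence. It is a sketch, not the decoder's logic: the name ill_formed_shift_starts is hypothetical, and treating a trailing "+" at the end of input as ill-formed is an assumption made here, not something the RFC wording settles.

```python
from typing import List

# Set B of RFC 2152: the BASE64 alphabet, stored as byte values.
BASE64_CHARS = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
)


def ill_formed_shift_starts(data: bytes) -> List[int]:
    """Return offsets of '+' bytes that open an ill-formed shift sequence."""
    offsets = []
    i, n = 0, len(data)
    while i < n:
        if data[i] != ord("+"):
            i += 1
            continue
        # data[i] is a '+' seen in direct (unshifted) mode.
        nxt = data[i + 1] if i + 1 < n else None
        if nxt == ord("-"):
            i += 2  # '+-' encodes a literal '+'
        elif nxt is None or nxt not in BASE64_CHARS:
            offsets.append(i)  # '+' followed by a non-B byte (or by nothing)
            i += 1
        else:
            # Skip the BASE64 run; a '+' inside it is data, not a new shift.
            i += 1
            while i < n and data[i] in BASE64_CHARS:
                i += 1
            if i < n and data[i] == ord("-"):
                i += 1  # absorb the optional terminating '-'
    return offsets


print(ill_formed_shift_starts(b"a+,b"))
print(ill_formed_shift_starts(b"a+IKw-b"))
```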

I have no idea how to fix bug 6. I'm afraid it may be a bug in _PyUnicodeWriter 
and could therefore affect other decoders.

--
keywords: +patch
stage:  -> patch review
Added file: http://bugs.python.org/file40223/utf7_error_handling.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24848
___



[issue24848] Warts in UTF-7 error handling

2015-08-12 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

While trying to implement a UTF-7 codec in Python, I found some warts in error 
handling.

1. Non-ASCII bytes.

No errors:
>>> 'a€b'.encode('utf-7')
b'a+IKw-b'
>>> b'a+IKw-b'.decode('utf-7')
'a€b'

The terminating '-' at the end of the string is optional.
>>> b'a+IKw'.decode('utf-7')
'a€'

And sometimes it is optional in the middle of the string (if the following char 
is not used in BASE64).
>>> b'a+IKw;b'.decode('utf-7')
'a€;b'

But if the following char is not ASCII, it is accepted as well, and this looks 
like a bug.
>>> b'a+IKw\xffb'.decode('utf-7')
'a€ÿb'

In all other cases a non-ASCII byte causes an error:
>>> b'a\xffb'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode byte 0xff in position 1: unexpected special character
>>> b'a\xffb'.decode('utf-7', 'replace')
'a�b'

2. Ending lone high surrogate.

Lone surrogates are silently accepted by the utf-7 codec.

>>> '\ud8e4\U0001d121'.encode('utf-7')
b'+2OTYNN0h-'
>>> '\U0001d121\ud8e4'.encode('utf-7')
b'+2DTdIdjk-'
>>> b'+2OTYNN0h-'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2OTYNN0h'.decode('utf-7')
'\ud8e4𝄡'
>>> b'+2DTdIdjk-'.decode('utf-7')
'𝄡\ud8e4'

Except at the end of an unterminated shift sequence:
>>> b'+2DTdIdjk'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-8: unterminated shift sequence

3. Incorrect shift sequence.

Strange behavior happens when a shift sequence ends with wrong bits.
>>> b'a+IKx-b'.decode('utf-7', 'ignore')
'a€b'
>>> b'a+IKx-b'.decode('utf-7', 'replace')
'a€�b'
>>> b'a+IKx-b'.decode('utf-7', 'backslashreplace')
'a€\\x2b\\x49\\x4b\\x78\\x2db'

The decoder first decodes as many characters as it can, and then passes the 
whole shift sequence (including the already decoded bytes) to the error 
handler. I'm not sure this is a bug, but it differs from the common behavior of 
other decoders.
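
To show what "wrong bits" means here, the following sketch decodes the BASE64 part of a shift sequence by hand and reports the leftover padding bits. The helper names decode_shift_payload and leftover_bits are hypothetical, and this is not how the C decoder is written. The payload is BASE64 over UTF-16BE code units, so "IKw" carries 18 bits: 16 for U+20AC plus 2 padding bits that should be zero.

```python
import base64

B64_ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_shift_payload(payload: bytes) -> str:
    """Decode the BASE64 part of a shift sequence (without '+' and '-')."""
    # Standard BASE64 needs '=' padding; UTF-7 omits it.
    padded = payload + b"=" * (-len(payload) % 4)
    raw = base64.b64decode(padded)
    raw = raw[: len(raw) - len(raw) % 2]  # drop an incomplete trailing code unit
    return raw.decode("utf-16-be", "surrogatepass")


def leftover_bits(payload: bytes):
    """Return (count, value) of the bits left after the last 16-bit unit."""
    value = 0
    for byte in payload:
        value = (value << 6) | B64_ALPHABET.index(byte)
    count = (6 * len(payload)) % 16
    return count, value & ((1 << count) - 1)


print(ascii(decode_shift_payload(b"IKw")), leftover_bits(b"IKw"))
print(ascii(decode_shift_payload(b"IKx")), leftover_bits(b"IKx"))
```

Both payloads yield the same character, but "IKx" leaves a non-zero padding bit, which is the condition the decoder reports as an error in case 3. Note that base64.b64decode raises an error for payloads whose length is one more than a multiple of 4.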

--
components: Unicode
messages: 248450
nosy: ezio.melotti, haypo, lemburg, loewis, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Warts in UTF-7 error handling
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24848