[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Here is a patch.

--
keywords: +patch
stage:  -> patch review
Added file: http://bugs.python.org/file23686/utf7.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue1
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Can you please regenerate the patch against default's head?




[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

It's a patch for 3.2.




[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Please don't use git-style diffs then, since otherwise the review can't figure 
out what the patch applies to (and neither could I figure that out).




[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

Here is a non-git diff then :)

--
Added file: http://bugs.python.org/file23688/utf7-nogit.patch




[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

LGTM.




[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Roundup Robot

Roundup Robot devn...@psf.upfronthosting.co.za added the comment:

New changeset ddfcb0de564f by Antoine Pitrou in branch '3.2':
Issue #1: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/ddfcb0de564f

New changeset 250091e60f28 by Antoine Pitrou in branch 'default':
Issue #1: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/250091e60f28

New changeset 050772822bde by Antoine Pitrou in branch '2.7':
Issue #1: The UTF-7 decoder now accepts lone surrogates
http://hg.python.org/cpython/rev/050772822bde

--
nosy: +python-dev




[issue13333] utf-7 inconsistent with surrogates

2011-11-14 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

I made a little fix to the patch for wide unicode builds and then committed it. 
Thank you!

--
resolution:  -> fixed
stage: patch review -> committed/rejected
status: open -> closed




[issue13333] utf-7 inconsistent with surrogates

2011-11-11 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

FWIW Wikipedia says "Other characters must be encoded in UTF-16 (hence U+10000 
and higher would be encoded into surrogates) and then in modified Base64."

So one possible interpretation is that while encoding a non-BMP char, it should 
first be converted into a surrogate pair and then each of the surrogates should 
be encoded just like any other 16-bit code unit.
While decoding, it seems reasonable to do the opposite, i.e. recombine the 
surrogate pair.
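
For illustration, the split and recombination described above follow the standard UTF-16 surrogate arithmetic. This is only a sketch: the helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and are not taken from the patch or from CPython.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("expected a code point in U+10000..U+10FFFF")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))                  # 0xd801 0xdc00
print(hex(from_surrogate_pair(high, low)))  # 0x10400
```

A lone surrogate is simply a code unit in one of these two ranges that is not followed (or preceded) by its partner.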

The RFC doesn't say anything about lone surrogates, but I think that the fact 
that surrogates are used internally doesn't necessarily mean that the codec 
should be able to encode/decode them when they are not paired.  The other UTF-* 
codecs reject them, but that's because it is explicitly forbidden by their 
respective standards.

So I'm +1 about recombining them while decoding, and ±0 about allowing lone 
surrogates.




[issue13333] utf-7 inconsistent with surrogates

2011-11-03 Thread Antoine Pitrou

New submission from Antoine Pitrou pit...@free.fr:

The utf-7 codec happily encodes lone surrogates, but it won't decode them:

>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing


I don't know which behaviour is better but round-tripping is certainly a 
desirable property of any codec.
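
The round-trip property can be checked with a small helper. This is a sketch only: `utf7_round_trips` is a hypothetical name, and the result for lone surrogates depends on the interpreter. On a build that behaves like the session above it would report False for them; on an interpreter whose decoder accepts lone surrogates it should report True.

```python
def utf7_round_trips(text):
    """Return True if text survives encode-then-decode through utf-7."""
    try:
        return text.encode("utf-7").decode("utf-7") == text
    except UnicodeError:
        # Either direction refused the input, so it does not round-trip.
        return False


for sample in ["abc", "\ud801", "\ud801\ud801", "\U00010400"]:
    print(ascii(sample), utf7_round_trips(sample))
```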

--
components: Interpreter Core, Unicode
messages: 146919
nosy: ezio.melotti, loewis, pitrou
priority: normal
severity: normal
status: open
title: utf-7 inconsistent with surrogates
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3




[issue13333] utf-7 inconsistent with surrogates

2011-11-03 Thread Petri Lehtinen

Changes by Petri Lehtinen pe...@digip.org:


--
nosy: +petri.lehtinen




[issue13333] utf-7 inconsistent with surrogates

2011-11-03 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

RFC 2152 talks about encoding 16-bit unicode, and clarifies

 Surrogate pairs (UTF-16) are converted by treating each half 
 of the pair as a separate 16 bit quantity (i.e., no special
 treatment).

So lone surrogates clearly should be supported.

This text could be interpreted as saying that decoding surrogate pairs should 
also keep them (rather than combining them). However, the RFC also assumes that 
the decoded form will use 16-bit code units; for Python, I think we should 
continue combining surrogate pairs on decoding UTF-7 when we find them.
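
As a rough illustration of "no special treatment", a shift sequence can be built by hand from a list of 16-bit code units, whether or not they happen to be surrogates. This is a sketch under stated assumptions: `utf7_shift_sequence` is a hypothetical helper, it always emits the closing "-", and it does not handle directly encoded ASCII characters the way a real codec does.

```python
import base64
import struct


def utf7_shift_sequence(code_units):
    """Encode 16-bit code units as one UTF-7 shift sequence (+...-)."""
    # Each unit is written as a big-endian 16-bit quantity, surrogate or not.
    raw = struct.pack(">%dH" % len(code_units), *code_units)
    # RFC 2152 uses the Base64 alphabet without "=" padding.
    b64 = base64.b64encode(raw).rstrip(b"=")
    return b"+" + b64 + b"-"


print(utf7_shift_sequence([0xD801]))          # b'+2AE-'
print(utf7_shift_sequence([0xD801, 0xD801]))  # b'+2AHYAQ-'
```

Both outputs match what the encoder produced in the original report, which is consistent with the encoder already treating each code unit independently.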
