New submission from Philip Jenvey:

surrogateescape claims to be "implemented by all standard Python codecs"

http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails with the multibyte codecs on encode:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple

The problem is that multibytecodec.c requires error handler results to always 
be unicode, whereas surrogateescape returns bytes on encode.
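
The return type can be checked without involving any codec by calling the 
handler directly. This is only a sketch: the exception is built by hand for 
illustration, and the reason string is an arbitrary placeholder rather than 
what the codec would actually report.

```python
import codecs

# Look up the built-in handler by its registered name.
handler = codecs.lookup_error("surrogateescape")

text = "\u30fb\udc80"
# Hand-built exception pointing at the lone surrogate at index 1.
# The reason string is a placeholder chosen for this example.
exc = UnicodeEncodeError("gb18030", text, 1, 2, "illegal multibyte sequence")

replacement, resume_at = handler(exc)

# On encode, the replacement is a bytes object, not str.
print(type(replacement).__name__, replacement, resume_at)
```

The handler hands back the escaped byte as bytes along with the position at 
which encoding should resume, so a codec that only accepts a str replacement 
cannot use the result.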

(surrogatepass similarly returns bytes, but it claims to be UTF-8 only.)

The error handler spec seems to imply that error handlers should always return 
unicode, because "The encoder will encode the replacement"

http://docs.python.org/3/library/codecs.html#codecs.register_error

but that's not really the case: some codecs special-case bytes results and 
copy them directly to the output, e.g.:

http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305
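
The two behaviours can be contrasted in a short sketch. The handler name 
"example-question-mark" and the function question_mark are hypothetical, 
introduced only for this illustration. The first half uses a handler that 
returns str, which is the form the register_error documentation describes; 
the second half uses surrogateescape with utf-8, a codec that accepts a 
bytes result.

```python
import codecs


def question_mark(exc):
    """Hypothetical handler: replace each unencodable span with '?'."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # Return a str replacement; the codec encodes it itself.
    return ("?", exc.end)


codecs.register_error("example-question-mark", question_mark)

# A str-returning handler is accepted by the multibyte codec.
encoded = "\u30fb\udc80".encode("gb18030", "example-question-mark")
print(encoded)

# utf-8 accepts the bytes result from surrogateescape, so an undecodable
# byte survives a decode/encode round trip unchanged.
raw = b"\xe3\x83\xbb\x80"
decoded = raw.decode("utf-8", "surrogateescape")
roundtrip = decoded.encode("utf-8", "surrogateescape")
print(decoded == "\u30fb\udc80", roundtrip == raw)
```

The str-returning handler works with gb18030 because the codec encodes the 
replacement itself, while utf-8 copies the bytes returned by surrogateescape 
straight into the output.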

----------
components: Interpreter Core
messages: 176711
nosy: pjenvey
priority: normal
severity: normal
status: open
title: surrogateescape broken w/ multibytecodecs' encode
versions: Python 3.2, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue16585>
_______________________________________