New submission from Philip Jenvey:

surrogateescape claims to be "implemented by all standard Python codecs":
http://docs.python.org/3/library/codecs.html#codec-base-classes

However, it fails with the multibyte codecs on encode:

Python 3.2.3+ (3.2:eb999002916c, Oct 26 2012, 16:11:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\u30fb".encode('gb18030')
b'\x819\xa79'
>>> "\u30fb\udc80".encode('gb18030', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding error handler must return (unicode, int) tuple

The problem is that multibytecodec.c forces error handler results to always be unicode, and surrogateescape returns bytes here. (surrogatepass similarly returns bytes, but it claims to be utf-8 only.)

The error handler spec seems to imply that error handlers should always return unicode, because "The encoder will encode the replacement":
http://docs.python.org/3/library/codecs.html#codecs.register_error

But that's not really the case: some codecs special-case bytes results and copy them directly to the output, e.g.:
http://hg.python.org/cpython/file/ce3f0399ea33/Objects/unicodeobject.c#l6305

----------
components: Interpreter Core
messages: 176711
nosy: pjenvey
priority: normal
severity: normal
status: open
title: surrogateescape broken w/ multibytecodecs' encode
versions: Python 3.2, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue16585>
_______________________________________
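
For illustration, the two return types discussed in the report can be compared with a pair of custom error handlers. This is only a sketch: the handler names "demo_bytes" and "demo_str" and the helper functions are hypothetical, and the ascii codec is used because it is one of the codecs that accepts a bytes result. The final gb18030 call repeats the failing case from the report; whether it raises TypeError or returns bytes depends on the interpreter version, so no result is assumed for it.

```python
import codecs


def _surrogate_bytes(exc):
    """Hypothetical handler: return raw bytes, the way surrogateescape does."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            raise exc  # not a lone surrogate in the escape range
        out.append(code - 0xDC00)  # recover the original byte value
    return bytes(out), exc.end


def _question_mark(exc):
    """Hypothetical handler: return a str, which the encoder must encode itself."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return "?" * (exc.end - exc.start), exc.end


codecs.register_error("demo_bytes", _surrogate_bytes)
codecs.register_error("demo_str", _question_mark)

# The ascii codec copies a bytes replacement straight to the output...
bytes_result = "a\udc80".encode("ascii", "demo_bytes")  # b'a\x80'
# ...and encodes a str replacement before writing it.
str_result = "a\udc80".encode("ascii", "demo_str")  # b'a?'

# The case from the report: outcome depends on the Python version in use.
try:
    multibyte_result = "\u30fb\udc80".encode("gb18030", "surrogateescape")
except TypeError as err:
    multibyte_result = err

print(bytes_result, str_result, multibyte_result)
```

On an affected interpreter the last call fails with the TypeError shown in the report, because the multibyte encoder rejects the bytes object that the handler returns.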