Re: Why are some unicode error handlers "encode only"?

Walter Dörwald Sun, 11 Mar 2012 09:46:21 -0700

On 11.03.12 15:37, Steven D'Aprano wrote:

At least two standard error handlers are documented as working for
encoding only:


xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers.
However, the byte sequence b'&#12345;' does *not* contain any bytes that
are not decodable for e.g. the ASCII codec. So there are no errors to
handle.

Consider this example using Python 3.2:

b"aaa--\xe9z--\xe9!--bbb".decode("cp932")

Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
illegal multibyte sequence

The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
or can't be supported?

The byte sequence b'\xe9!' however is not something that would have been
produced by the backslashreplace error handler. b'\\xe9!' (a sequence
containing 5 bytes) would have been (and this probably would decode
without any problems with the cp932 codec).
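
As a rough illustration (assuming Python 3 and the cp932 codec that ships
with the standard library), this is what backslashreplace produces on the
encoding side, and a check that its output decodes without an error. The
decoded value is deliberately not shown as a comment; only the fact that
the call succeeds is relied on here.

```python
# U+00E9 cannot be encoded as ASCII, so backslashreplace substitutes an escape.
produced = "\xe9!".encode("ascii", "backslashreplace")
print(produced)       # b'\\xe9!'
print(len(produced))  # 5  (backslash, 'x', 'e', '9', '!')

# The escaped form consists of plain ASCII bytes, so decoding it with
# cp932 in strict mode raises no UnicodeDecodeError.
as_text = produced.decode("cp932", "strict")
```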

# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=>  r'aaa--騷--\xe9\x21--bbb'

and similarly for xmlcharrefreplace.
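
The behaviour asked for here can be approximated with a custom handler
registered through codecs.register_error(). This is only a sketch: the
handler name "backslashreplace_decode" is made up for this example, and
which bytes get escaped depends on how many bytes the codec reports as
invalid, which may differ between Python versions. (Python 3.5 and later
also accept the built-in 'backslashreplace' when decoding; the sketch does
not rely on that.)

```python
import codecs

def backslashreplace_decode(exc):
    """Hypothetical decode-side handler: escape each undecodable byte as \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only deals with decoding errors
    bad = exc.object[exc.start:exc.end]  # the bytes the codec rejected
    replacement = "".join("\\x%02x" % byte for byte in bad)
    return replacement, exc.end          # resume decoding after the bad bytes

codecs.register_error("backslashreplace_decode", backslashreplace_decode)

text = b"abc\xffdef".decode("ascii", "backslashreplace_decode")
print(text)  # abc\xffdef

# The example from the message; the exact escapes depend on the codec.
sample = b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace_decode")
print(sample)
```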

This would require a postprocess step *after* the bytes have been
decoded. This is IMHO out of scope for Python's codec machinery.
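
A small sketch of what such a postprocess step could look like, assuming
the goal is to turn literal backslash escapes back into characters. It
uses the standard unicode_escape codec as the second step; that choice is
an assumption made for this example, not something the codec machinery
does by itself.

```python
raw = b"\\xe9!"  # 5 bytes: backslash, 'x', 'e', '9', '!'

# Step 1: ordinary decoding. Nothing is undecodable, so the escape
# sequence simply stays in the text as literal characters.
decoded = raw.decode("ascii")
print(decoded)  # \xe9!

# Step 2: a separate pass that interprets the escapes.
restored = decoded.encode("ascii").decode("unicode_escape")
print(restored == "\xe9!")  # True
```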


Servus,
   Walter

--
http://mail.python.org/mailman/listinfo/python-list
