On 11.03.12 15:37, Steven D'Aprano wrote:

At least two standard error handlers are documented as working for
encoding only:

xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers. However, the byte sequence b'&#12345;' does *not* contain any bytes that are undecodable for e.g. the ASCII codec. So there are no errors to handle.
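
To illustrate the asymmetry, here is a minimal sketch (the variable names are just for illustration): encoding a non-ASCII character triggers the error handler, while decoding its output does not raise anything, because every byte in it is plain ASCII.

```python
# U+3039 has no ASCII representation, so *encoding* hits an error and
# xmlcharrefreplace substitutes a numeric character reference.
encoded = "\u3039".encode("ascii", "xmlcharrefreplace")
print(encoded)

# Every byte of that result is plain ASCII, so *decoding* in strict mode
# succeeds and no error handler is ever consulted.
decoded = encoded.decode("ascii", "strict")
print(decoded)
```

The decoded text still contains the literal character reference; the codec does not turn it back into the original character.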

Consider this example using Python 3.2:

>>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10: illegal multibyte sequence

The two bytes b'\xe9!' are an illegal multibyte sequence for CP-932 (also
known as MS-KANJI or Shift-JIS). Is there some reason why this shouldn't
or can't be supported?

However, the byte sequence b'\xe9!' is not something that the backslashreplace error handler would have produced. It would have produced b'\\xe9!' (a sequence of 5 bytes), and that would probably decode without any problems with the cp932 codec.
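
A small sketch of that point, assuming the text being encoded is the two characters 'é!' and the target codec is ASCII (both chosen only for illustration):

```python
# Encoding: 'é' (U+00E9) is not ASCII, so backslashreplace emits the
# four ASCII characters \xe9 in its place; '!' passes through unchanged.
encoded = "\xe9!".encode("ascii", "backslashreplace")
print(encoded, len(encoded))

# Decoding that output with cp932: all five bytes are below 0x80,
# so there is no multibyte sequence that could be illegal.
roundtrip = encoded.decode("cp932")
print(roundtrip)
```

The result of decoding is the escape sequence as literal text, not the original 'é'.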

# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=>  r'aaa--騷--\xe9\x21--bbb'

and similarly for xmlcharrefreplace.

This would require a postprocessing step *after* the bytes have been decoded. This is IMHO out of scope for Python's codec machinery.
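
For anyone who wants output in the spirit of the example above without relying on the standard handlers, one option is a custom error handler registered with codecs.register_error. The following is only a sketch: the handler name "backslash_decode" and the function name are made up, and exactly which bytes end up escaped depends on the error range the codec reports.

```python
import codecs

def backslash_decode_errors(exc):
    # Hypothetical handler: only decoding errors are handled here.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.object is the input bytes; start/end delimit the undecodable part.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("\\x%02x" % byte for byte in bad)
    # Return the replacement text and the position to resume decoding at.
    return replacement, exc.end

# "backslash_decode" is an illustrative name, not a standard handler.
codecs.register_error("backslash_decode", backslash_decode_errors)

text = b"abc\xffdef".decode("ascii", "backslash_decode")
print(text)
```

The same handler name can be passed to b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslash_decode").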

Servus,
   Walter

