On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote:
> There's a problem with these encodings: they are mostly meant
> for decoding (broken) data, but as soon as we have them in the stdlib,
> people will also start using them for encoding data, producing more
> corrupted data.

Is it really corrupted?

> Do you really think it's a good idea to support this natively
> in Python?

The problem is that this ignores the very real fact that this is, and has
always been,* the behavior of the native encodings built into Windows. My
opinion is that Microsoft, for whatever reason, misrepresented their encodings
when they submitted them to Unicode. The native APIs for text conversion have
mechanisms for error reporting, and these supposedly undefined characters do
not trigger them the way that, e.g., CP932 0xA0 does.

Without the MB_ERR_INVALID_CHARS flag, cp932 0xA0 maps to U+F8F0 (private use)
as a best-fit mapping, and cp1252 0x81 maps to U+0081 (one of the mappings
being discussed here). If you do set the MB_ERR_INVALID_CHARS flag, however,
cp932 0xA0 returns error 1113** (ERROR_NO_UNICODE_TRANSLATION), whereas cp1252
0x81 still returns U+0081.

As far as the actual encoding implemented in Windows is concerned, CP1252's
0x81 -> U+0081 mapping is a wholly valid one (though undocumented), and not in
any way a fallback, a "best fit", or an invalid character.

*Except for the addition of the Euro sign to each encoding (typically at 0x80)
circa 1998.
**It's worth mentioning that our cp932 codec returns U+F8F0, even with
errors='strict', despite this mapping not being present in the published
Unicode mapping table. It has done this at least since the CJKCodecs change in
2004. I can't determine where (or whether) it was implemented at all before
that.
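For reference, the footnote's claim is easy to confirm from the stdlib itself
('strict' is the default error handler, so nothing is being suppressed here):

    >>> b'\xa0'.decode('cp932')
    '\uf8f0'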