Re: [Python-ideas] Support WHATWG versions of legacy encodings

Serhiy Storchaka Thu, 11 Jan 2018 01:57:10 -0800

09.01.18 23:15, Rob Speer wrote:

There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.
You can see the character table for this encoding at:
https://encoding.spec.whatwg.org/index-windows-1252.txt
For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes that are undefined:
>>> b'\x90'.decode('windows-1252')
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:
- It's compatible with windows-1252
- Any sequence of bytes can be round-tripped through it without losing information
It's not just this one encoding. WHATWG's encoding standard (https://encoding.spec.whatwg.org/) contains modified versions of windows-1250 through windows-1258 and windows-874.

The way to solve this issue in Python is to use an error handler. The "surrogateescape" error handler is specially designed for lossless, reversible decoding. It maps every unassigned byte in the range 0x80-0xff to a single character in the range U+dc80-U+dcff. This allows you to distinguish correctly decoded characters from the escaped bytes, perform character-by-character processing of the decoded text, and encode the result back with the same encoding.


>>> b'\x90\x91\x92\x93'.decode('windows-1252', 'surrogateescape')
'\udc90‘’“'
>>> '\udc90‘’“'.encode('windows-1252', 'surrogateescape')
b'\x90\x91\x92\x93'
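
As a rough illustration of the "distinguish correctly decoded characters from the escaped bytes" point, here is a minimal sketch. The sample bytes and the variable names are made up for this example; only the "surrogateescape" handler itself comes from Python.

```python
# Hypothetical sample: 0x90 is unassigned in Python's windows-1252,
# while 0x93 and 0x94 are the curly double quotes.
raw = b'a\x90 \x93ok\x94'

text = raw.decode('windows-1252', 'surrogateescape')

# Escaped bytes land in U+DC80-U+DCFF, so they can be told apart
# from characters that were decoded normally.
escaped = [ch for ch in text if '\udc80' <= ch <= '\udcff']

# Recover the original byte value of each escaped character.
escaped_bytes = [ord(ch) - 0xdc00 for ch in escaped]

# Encoding with the same handler restores the original bytes.
restored = text.encode('windows-1252', 'surrogateescape')
```

Because the escaped characters are lone surrogates, they never collide with real text, which is what makes the round trip safe.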

If you want to map unassigned bytes to other characters, you should just create a new error handler. There are caveats, since such characters cannot be distinguished from correctly decoded characters.
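
A minimal sketch of such a handler, assuming the desired behaviour is the WHATWG one (each unassigned byte becomes the code point with the same number). The handler name "c1fallback" and the function name are hypothetical; codecs.register_error is the standard registration hook.

```python
import codecs

def c1_fallback(exc):
    """Hypothetical handler: pass unassigned bytes through as same-numbered code points."""
    if isinstance(exc, UnicodeDecodeError):
        # exc.object is the input bytes; map each bad byte to chr(byte).
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # exc.object is the input str; only U+0080-U+009F are mapped back.
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= ord(ch) <= 0x9f for ch in bad):
            return bytes(ord(ch) for ch in bad), exc.end
    # Anything else is a genuine error.
    raise exc

codecs.register_error('c1fallback', c1_fallback)

decoded = b'\x90\x91'.decode('windows-1252', 'c1fallback')
encoded = decoded.encode('windows-1252', 'c1fallback')
```

With this handler b'\x90' decodes to '\x90', so the caveat above applies: nothing in the resulting string marks that character as having come from an unassigned byte.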

The same problem exists with the UTF-8 encoding. WHATWG allows encoding and decoding surrogate characters in the range U+d800-U+dfff. This is contrary to the Unicode Standard and raises an error by default in Python. But you can allow encoding and decoding of surrogate characters by explicitly specifying the "surrogatepass" error handler.
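
For illustration, a short sketch of the difference between the default behaviour and "surrogatepass"; the variable names are just for this example.

```python
lone = '\ud800'  # a lone surrogate code point

# By default, Python refuses to encode surrogates as UTF-8.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With "surrogatepass", the surrogate is encoded like any other
# code point in that range (a three-byte sequence).
data = lone.encode('utf-8', 'surrogatepass')

# The same handler is needed to decode those bytes again.
back = data.decode('utf-8', 'surrogatepass')
```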


