Oh that's interesting. So it seems to be Python that's the exception here. Would we really be able to add entries to character mappings that haven't changed since Python 2.0?
On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <python-ideas@python.org> wrote:

> First of all, many thanks for such an excellently written letter. It was a real pleasure to read.
>
> On 10.01.2018 0:15, Rob Speer wrote:
>
> Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings.
>
> There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably the second or third most common text encoding in the world. And Python doesn't quite support it.
>
> You can see the character table for this encoding at:
> https://encoding.spec.whatwg.org/index-windows-1252.txt
>
> For the sake of discussion, let's call this encoding "web-1252". WHATWG calls it "windows-1252", but notice that it's subtly different from Python's "windows-1252" encoding. Python's windows-1252 has bytes that are undefined:
>
> >>> b'\x90'.decode('windows-1252')
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0: character maps to <undefined>
>
> In web-1252, the bytes that are undefined according to windows-1252 map to the control characters in those positions in iso-8859-1 -- that is, the Unicode codepoints with the same number as the byte. In web-1252, b'\x90' would decode as '\u0090'.
>
> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does the same:
>
> "According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx> maps these to the corresponding C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."
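For what it's worth, the WHATWG fallback described above can already be approximated in today's Python with a custom decoding error handler -- a sketch, not an existing codec (the handler name "web1252" is made up for illustration):

```python
import codecs

# Sketch: an error handler that resolves bytes left undefined by Python's
# windows-1252 to the Unicode codepoint with the same number -- the WHATWG
# behaviour described above. The name "web1252" is hypothetical.
def web1252_fallback(exc):
    if isinstance(exc, UnicodeDecodeError):
        bad_byte = exc.object[exc.start]
        # Replace the undefined byte with the same-numbered codepoint
        # and resume decoding after it.
        return chr(bad_byte), exc.start + 1
    raise exc

codecs.register_error("web1252", web1252_fallback)

# 0x90 is undefined in Python's windows-1252; with the handler it decodes
# to U+0090, as a Web browser would.
print(repr(b'\x93hi\x94 \x90'.decode('windows-1252', errors='web1252')))
```

This only covers decoding, of course; a real codec would need the encoding direction as well.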
> And in ISO-8859-1, the same handling is done for unused code points even by the standard (https://en.wikipedia.org/wiki/ISO/IEC_8859-1):
>
> "*ISO-8859-1* is the IANA <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority> preferred name for this standard when supplemented with the C0 and C1 control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"
>
> And what would you think -- these "C1 control codes" are also the corresponding Unicode points! (https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block))
>
> Since Windows is pretty much the reference implementation for "windows-xxxx" encodings, it even makes sense to alter the existing encodings rather than add new ones.
>
> This may seem like a silly encoding that encourages doing horrible things with text. That's pretty much the case. But there's a reason every Web browser implements it:
>
> - It's compatible with windows-1252
> - Any sequence of bytes can be round-tripped through it without losing information
>
> It's not just this one encoding. WHATWG's encoding standard (https://encoding.spec.whatwg.org/) contains modified versions of windows-1250 through windows-1258 and windows-874.
>
> Support for these encodings matters to me, in part, because I maintain a Unicode data-cleaning library, "ftfy". One thing it does is to detect and undo encoding/decoding errors that cause mojibake, as long as they're detectable and reversible. Looking at real-world examples of text that has been damaged by mojibake, it's clear that lots of text is transferred through what I'm calling the "web-1252" encoding, in a way that's incompatible with Python's "windows-1252".
>
> In order to be able to work with and fix this kind of text, ftfy registers new codecs -- and I implemented this even before I knew that they were standardized in Web browsers.
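The round-trip property in the list above already holds for Python's 'latin-1' codec, since it maps every byte 0x00-0xFF to the same-numbered codepoint; web-1252 would keep that guarantee while adding the windows-1252 glyphs. A quick check:

```python
# ISO-8859-1 ('latin-1') maps every byte to the codepoint with the same
# number, so arbitrary bytes round-trip losslessly -- the property the
# proposed web-1252 preserves on top of the windows-1252 repertoire.
data = bytes(range(256))
text = data.decode('latin-1')
assert text.encode('latin-1') == data
assert text[0x90] == '\u0090'  # C1 control code, same number as the byte
print("round-trip OK")
```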
> When ftfy is imported, you can decode text as "sloppy-windows-1252" (the name I chose for this encoding), for example.
>
> ftfy can tell people a sequence of steps that they can use in the future to fix text that's like the text they provided. Very often, these steps require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which means the steps only work with ftfy imported, even for people who are not using the features of ftfy.
>
> Support for these encodings also seems highly relevant to people who use Python for web scraping, as it would be desirable to maximize compatibility with what a Web browser would do.
>
> This really seems like it belongs in the standard library instead of being an incidental feature of my library. I know that code in the standard library has "one foot in the grave". I _want_ these legacy encodings to have one foot in the grave. But some of them are extremely common, and Python code should be able to deal with them.
>
> Adding these encodings to Python would be straightforward to implement. Does this require a PEP, a pull request, or further discussion?
>
> --
> Regards,
> Ivan
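As the quoted mail says, the implementation would be straightforward. One possible approach, sketched below under stated assumptions (the codec name "web-1252" and the table-patching strategy are illustrations, not an existing API), is to start from the stdlib cp1252 decoding table and fill its undefined slots with the same-numbered codepoints:

```python
import codecs
from encodings import cp1252

# Sketch: patch the stdlib cp1252 decoding table so the five undefined
# positions (81, 8D, 8F, 90, 9D, marked with U+FFFE in the table) decode
# to the same-numbered codepoints, then register the result under a
# hypothetical "web-1252" name.
decoding_table = ''.join(
    chr(i) if ch == '\ufffe' else ch
    for i, ch in enumerate(cp1252.decoding_table)
)
encoding_table = codecs.charmap_build(decoding_table)

def _search(name):
    if name == 'web_1252':  # codec lookup normalizes '-' to '_'
        return codecs.CodecInfo(
            name='web-1252',
            encode=lambda s, errors='strict':
                codecs.charmap_encode(s, errors, encoding_table),
            decode=lambda b, errors='strict':
                codecs.charmap_decode(b, errors, decoding_table),
        )
    return None

codecs.register(_search)

print(repr(b'\x90'.decode('web-1252')))   # the formerly undefined byte now decodes
print('\u20ac'.encode('web-1252'))        # the euro sign still encodes to 0x80
```

This mirrors how the stdlib's own charmap codecs are built, though a patch to CPython would presumably generate the tables from the WHATWG index files instead.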
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/