Oh that's interesting. So it seems to be Python that's the exception here.

Would we really be able to add entries to character mappings that haven't
changed since Python 2.0?

On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas <
python-ideas@python.org> wrote:

> First of all, many thanks for such an excellently written letter. It was a
> real pleasure to read.
> On 10.01.2018 0:15, Rob Speer wrote:
>
> Hi! I joined this list because I'm interested in filling a gap in Python's
> standard library, relating to text encodings.
>
> There is an encoding with no name of its own. It's supported by every
> current web browser and standardized by WHATWG. It's so prevalent that if
> you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will
> get this encoding _instead_. It is probably the second or third most common
> text encoding in the world. And Python doesn't quite support it.
>
> You can see the character table for this encoding at:
> https://encoding.spec.whatwg.org/index-windows-1252.txt
>
> For the sake of discussion, let's call this encoding "web-1252". WHATWG
> calls it "windows-1252", but notice that it's subtly different from
> Python's "windows-1252" encoding. Python's windows-1252 has bytes that are
> undefined:
>
> >>> b'\x90'.decode('windows-1252')
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 0:
> character maps to <undefined>
>
> In web-1252, the bytes that are undefined according to windows-1252 map to
> the control characters in those positions in iso-8859-1 -- that is, the
> Unicode codepoints with the same number as the byte. In web-1252, b'\x90'
> would decode as '\u0090'.
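
To make the difference concrete, here is a minimal sketch of the decoding rule described above. The helper name decode_web_1252 is hypothetical (it is not part of Python or of the WHATWG standard), and the byte-at-a-time loop favors clarity over speed:

```python
def decode_web_1252(data: bytes) -> str:
    """Decode like windows-1252, but map bytes that Python's codec leaves
    undefined to the code point with the same number (hypothetical helper)."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode("windows-1252"))
        except UnicodeDecodeError:
            # Undefined in Python's windows-1252: fall back to the C1 control.
            chars.append(chr(byte))
    return "".join(chars)


# Bytes in 0x80-0x9F that only decode thanks to the fallback above.
undefined = [b for b in range(0x80, 0xA0) if decode_web_1252(bytes([b])) == chr(b)]
print([hex(b) for b in undefined])
```

The list printed at the end should match the unused positions named in the Wikipedia quote further down (81, 8D, 8F, 90, and 9D).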
>
> According to https://en.wikipedia.org/wiki/Windows-1252 , Windows does
> the same:
>
>     "According to the information on Microsoft's and the Unicode
> Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused;
> however, the Windows API MultiByteToWideChar
> <http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx>
> maps these to the corresponding C1 control codes
> <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>."
> And in ISO-8859-1, even the standard handles unused code points the same
> way ( https://en.wikipedia.org/wiki/ISO/IEC_8859-1 ):
>
>     "*ISO-8859-1* is the IANA
> <https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority>
> preferred name for this standard when supplemented with the C0 and C1
> control codes <https://en.wikipedia.org/wiki/C0_and_C1_control_codes>
> from ISO/IEC 6429 <https://en.wikipedia.org/wiki/ISO/IEC_6429>"
> And, as it turns out, these "C1 control codes" are also the corresponding
> Unicode code points! (
> https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block) )
>
> Since Windows is pretty much the reference implementation for
> "windows-xxxx" encodings, it even makes sense to alter the existing
> encodings rather than add new ones.
>
>
> This may seem like a silly encoding that encourages doing horrible things
> with text. That's pretty much the case. But there's a reason every Web
> browser implements it:
>
> - It's compatible with windows-1252
> - Any sequence of bytes can be round-tripped through it without losing
> information
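
Both properties in the list above can be checked with a short sketch. It assumes the "web-1252" table is simply Python's windows-1252 with each undefined byte filled in by the same-numbered code point; the names DECODING_TABLE and ENCODING_TABLE are illustrative only:

```python
import codecs


def _char_for(byte: int) -> str:
    try:
        return bytes([byte]).decode("windows-1252")
    except UnicodeDecodeError:
        return chr(byte)  # fill the gap with the same-numbered code point


# 256-character decoding table, and the matching encoding map built from it.
DECODING_TABLE = "".join(_char_for(b) for b in range(256))
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)

# Property 2: every possible byte survives decode followed by encode.
all_bytes = bytes(range(256))
text, _ = codecs.charmap_decode(all_bytes, "strict", DECODING_TABLE)
round_tripped, _ = codecs.charmap_encode(text, "strict", ENCODING_TABLE)
assert round_tripped == all_bytes

# Property 1: where Python's windows-1252 can decode, the results agree.
sample = "café €5".encode("windows-1252")
decoded, _ = codecs.charmap_decode(sample, "strict", DECODING_TABLE)
assert decoded == sample.decode("windows-1252")
```

By contrast, bytes(range(256)).decode("windows-1252") raises UnicodeDecodeError in Python, so the same round trip is not possible with the built-in codec.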
>
> It's not just this one encoding. WHATWG's encoding standard (
> https://encoding.spec.whatwg.org/) contains modified versions of
> windows-1250 through windows-1258 and windows-874.
>
> Support for these encodings matters to me, in part, because I maintain a
> Unicode data-cleaning library, "ftfy". One thing it does is to detect and
> undo encoding/decoding errors that cause mojibake, as long as they're
> detectible and reversible. Looking at real-world examples of text that has
> been damaged by mojibake, it's clear that lots of text is transferred
> through what I'm calling the "web-1252" encoding, in a way that's
> incompatible with Python's "windows-1252".
>
> In order to be able to work with and fix this kind of text, ftfy registers
> new codecs -- and I implemented this even before I knew that they were
> standardized in Web browsers. When ftfy is imported, you can decode text as
> "sloppy-windows-1252" (the name I chose for this encoding), for example.
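
For readers unfamiliar with codec registration, the following is a rough sketch of how such a codec could be registered. It is not ftfy's actual code. The codec name "web-1252" and the helper names are made up for illustration, and the sketch relies on a CPython implementation detail: the encodings.cp1252 module exposes a decoding_table in which undefined bytes are marked with U+FFFE.

```python
import codecs
import encodings.cp1252

# Assumption: undefined bytes appear as U+FFFE in CPython's cp1252 table.
DECODING_TABLE = "".join(
    chr(i) if ch == "\ufffe" else ch
    for i, ch in enumerate(encodings.cp1252.decoding_table)
)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _search(name):
    # Codec names are normalized before lookup; whether hyphens become
    # underscores depends on the Python version, so accept both spellings.
    if name.replace("-", "_") == "web_1252":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name="web-1252")
    return None


codecs.register(_search)

decoded = b"\x90".decode("web-1252")
```

Once the search function is registered, the name works anywhere Python accepts an encoding name, which is why such codecs only exist after the registering module has been imported.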
>
> ftfy can tell people a sequence of steps that they can use in the future
> to fix text that's like the text they provided. Very often, these steps
> require the sloppy-windows-1252 or sloppy-windows-1251 encoding, which
> means the steps only work with ftfy imported, even for people who are not
> using the features of ftfy.
>
> Support for these encodings also seems highly relevant to people who use
> Python for web scraping, as it would be desirable to maximize compatibility
> with what a Web browser would do.
>
> This really seems like it belongs in the standard library instead of being
> an incidental feature of my library. I know that code in the standard
> library has "one foot in the grave". I _want_ these legacy encodings to
> have one foot in the grave. But some of them are extremely common, and
> Python code should be able to deal with them.
>
> Adding these encodings to Python would be straightforward to implement.
> Does this require a PEP, a pull request, or further discussion?
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
>
> --
> Regards,
> Ivan
>
