Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-06 Thread Rob Speer
By now, it sounds right to me that I should implement these codecs in a package. I accept that I've established the use case, but not sufficiently established why it belongs in Python. The package can easily be ftfy -- although I should point out that what's in ftfy at the moment isn't quite right

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-06 Thread Stephen J. Turnbull
Nick Coghlan writes: > Personally, I think a See Also note pointing to ftfy in the "codecs" > module documentation would be quite a reasonable outcome of the thread Yes please. The more I hear about purported use cases (with the exception of Nathaniel's "don't crash when I manipulate the DOM"

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-05 Thread M.-A. Lemburg
On 05.02.2018 12:39, Serhiy Storchaka wrote: > 05.02.18 12:52, M.-A. Lemburg wrote: >> Let's leave things as they are and perhaps add a section to the codecs >> documentation, as you suggest, where to find other encodings which >> a user might want to use and tools to help with fixing encoding or >> dec

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-05 Thread Serhiy Storchaka
05.02.18 12:52, M.-A. Lemburg wrote: Let's leave things as they are and perhaps add a section to the codecs documentation, as you suggest, where to find other encodings which a user might want to use and tools to help with fixing encoding or decoding errors. Here's a random list from PyPI with some p

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-05 Thread M.-A. Lemburg
On 05.02.2018 04:01, Nick Coghlan wrote: > On 2 February 2018 at 16:52, Steven D'Aprano wrote: >> If it were my decision, I'd have these codecs raise a warning (not an >> error) when used for encoding. But I guess some people will consider >> that either going too far or not far enough :-) > > Ro

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-05 Thread Paul Moore
On 5 February 2018 at 06:40, Serhiy Storchaka wrote: > 05.02.18 05:01, Nick Coghlan wrote: >> >> On 2 February 2018 at 16:52, Steven D'Aprano wrote: >>> >>> If it were my decision, I'd have these codecs raise a warning (not an >>> error) when used for encoding. But I guess some people will conside

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-04 Thread Serhiy Storchaka
05.02.18 05:01, Nick Coghlan wrote: On 2 February 2018 at 16:52, Steven D'Aprano wrote: If it were my decision, I'd have these codecs raise a warning (not an error) when used for encoding. But I guess some people will consider that either going too far or not far enough :-) Rob pointed out tha

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-04 Thread Nick Coghlan
On 2 February 2018 at 16:52, Steven D'Aprano wrote: > If it were my decision, I'd have these codecs raise a warning (not an > error) when used for encoding. But I guess some people will consider > that either going too far or not far enough :-) Rob pointed out that one of the main use cases for t

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-02 Thread Chris Barker
On Thu, Feb 1, 2018 at 1:34 PM, Terry Reedy wrote: > On 1/31/2018 6:15 PM, Chris Barker wrote: > > I still have no idea why there is such resistance to this [spelling >> corrected] >> > > M.-A. Lemburg already summarized his view of the specifics for this > issue. And see below. Thanks for t

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-01 Thread Steven D'Aprano
On Thu, Feb 01, 2018 at 10:20:00AM +0100, M.-A. Lemburg wrote: > In general, we have only added new encodings when there was an encoding > missing which a lot of people were actively using. We asked for > official documentation defining the mappings, references showing > usage and IANA or similar

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-01 Thread Mark Lawrence
On 01/02/18 21:34, Terry Reedy wrote: On 1/31/2018 6:15 PM, Chris Barker wrote: I still have no idea why there is such resistance to this [spelling corrected] Every proposal should be resisted to the extent of requiring clarity, consideration of alternatives, and sufficient justification.

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-01 Thread Terry Reedy
On 1/31/2018 6:15 PM, Chris Barker wrote: I still have no idea why there is such resistance to this [spelling corrected] Every proposal should be resisted to the extent of requiring clarity, consideration of alternatives, and sufficient justification. yes, it's a fairly small benefit over

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-02-01 Thread M.-A. Lemburg
On 01.02.2018 00:40, Chris Angelico wrote: > On Thu, Feb 1, 2018 at 10:15 AM, Chris Barker wrote: >> I still have no ide4a why there is such resistance to this -- yes, it's a >> fairly small benefit over a package no PyPi, but there is also virtually no >> downside. > > I don't understand it eith

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Chris Angelico
On Thu, Feb 1, 2018 at 10:15 AM, Chris Barker wrote: > I still have no ide4a why there is such resistance to this -- yes, it's a > fairly small benefit over a package no PyPi, but there is also virtually no > downside. I don't understand it either. Aside from maybe bikeshedding the *name* of the

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Chris Barker
On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka wrote: > Hm. As a user, unless I run into problems with a specific encoding, I >> never care about how many encodings we have, so I don't see how adding >> extra encodings bothers those users who have no need for them. >> > > The codecs module doc

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Rob Speer
On Wed, 31 Jan 2018 at 12:50 Serhiy Storchaka wrote: > The passed encoding differs from the name of new Python encoding. It is > just 'windows-1252', not 'windows-1252-whatwg'. If just change the > existing encoding, this can break other code that expects the standard > 'windows-1252'. Thus every

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Guido van Rossum
OK, I am no longer interested in this topic. If you can't reach agreement, so be it, and then the status quo prevails. I am going to mute this thread. There's no need to explain to me why I am wrong. On Wed, Jan 31, 2018 at 9:48 AM, Serhiy Storchaka wrote: > 31.01.18 18:36, Guido van Rossum wrote

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Serhiy Storchaka
31.01.18 18:36, Guido van Rossum wrote: On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka wrote: 19.01.18 05:51, Guido van Rossum wrote: Can someone explain to me why this is such a controversial issue? It seems reasonable to me to add new encod

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread M.-A. Lemburg
On 31.01.2018 17:36, Guido van Rossum wrote: > On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka > wrote: > > 19.01.18 05:51, Guido van Rossum wrote: > > Can someone explain to me why this is such a controversial issue? > > It seems reasonable to m

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Guido van Rossum
On Wed, Jan 31, 2018 at 3:03 AM, Serhiy Storchaka wrote: > 19.01.18 05:51, Guido van Rossum wrote: > >> Can someone explain to me why this is such a controversial issue? >> >> It seems reasonable to me to add new encodings to the stdlib that do the >> roundtripping requested in the first message o

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-31 Thread Serhiy Storchaka
19.01.18 05:51, Guido van Rossum wrote: Can someone explain to me why this is such a controversial issue? It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-28 Thread Stephen J. Turnbull
Sorry for the long delay. I had a lot on my plate at work, and was spending 14 hours a day sleeping because of the flu. "It got better." Rob Speer writes: > I don't really understand what you're doing when you take a > fragment of my sentence where I explain a wrong understanding of > WHATWG

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-22 Thread Rob Speer
I don't really understand what you're doing when you take a fragment of my sentence where I explain a wrong understanding of WHATWG encodings, and say "that's wrong, as you explain". I know it's wrong. That's what I was saying. You quoted the part where I said "Filling in all the gaps with Latin-1

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-21 Thread Stephen J. Turnbull
Random832 writes: > I think his point is that the WHATWG standard is the one that > governs HTML and therefore HTML that uses these encodings > (including the C1 characters) are conformant to *that* standard, I don't think that is a tenable interpretation of this standard. The WHAT-WG standard

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-21 Thread Stephen J. Turnbull
I don't expect to change your mind about the "right" way to deal with this, but this is a more explicit description of what those of us who advocate error handlers are thinking about. It may be useful in writing your PEP (PEPs describe rejected counterproposals and amendments along with adopted pr

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-21 Thread Guido van Rossum
On Sun, Jan 21, 2018 at 2:43 AM, Steven D'Aprano wrote: > On Fri, Jan 19, 2018 at 06:35:30PM, Rob Speer wrote: > > Guido had some very sensible feedback just a moment ago. I am wondering now > > if we lost Guido because I broke python-ideas etiquette (is a pull request > > not the next

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-21 Thread Chris Angelico
On Mon, Jan 22, 2018 at 3:36 AM, Rob Speer wrote: > Thanks for the recommendation there, and I'd like a little extra information > -- I don't know _mechanically_ how to write a PEP. (Where do I submit it to, > for example?) I can help you with that side of things. Start by checking out PEP 1: ht

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-21 Thread Rob Speer
> The question to my mind is whether or not this "latin1replace" handler, > in conjunction with existing codecs, will do the same thing as the > WHATWG codecs. If I have understood you correctly, I think it will. Have > I missed something? It won't do the same thing, and neither will the "chaining

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-21 Thread Steven D'Aprano
On Fri, Jan 19, 2018 at 06:35:30PM, Rob Speer wrote: > > It depends on what you want to achieve. You may want to fail, assign a > code point from a private area or use a surrogate escape approach. > > And the way to express that is with errors='replace', > errors='surrogateescape', or whatev

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
Rob: I think I was very clear very early in the thread that I'm opposed to adding a complete set of new encodings to the stdlib which only slightly alter many existing ones. Ever since I've been trying to give you suggestions on how we can solve the issue you're trying to address with the encodin

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Rob Speer
> It depends on what you want to achieve. You may want to fail, assign a code point from a private area or use a surrogate escape approach. And the way to express that is with errors='replace', errors='surrogateescape', or whatever, which Python already does. We do not need an explosion of error h

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 18:12, Rob Speer wrote: > Error handlers are quite orthogonal to this problem. If you try to solve > this problem with an error handler, you will have a different problem. > > Suppose you made "c1-control-passthrough" or whatever into an error > handler, similar to "replace" or "igno

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Guido van Rossum
OK, I will tune out this conversation. It is clearly not going anywhere. On Fri, Jan 19, 2018 at 9:12 AM, Rob Speer wrote: > Error handlers are quite orthogonal to this problem. If you try to solve > this problem with an error handler, you will have a different problem. > > Suppose you made "c1-

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Rob Speer
Error handlers are quite orthogonal to this problem. If you try to solve this problem with an error handler, you will have a different problem. Suppose you made "c1-control-passthrough" or whatever into an error handler, similar to "replace" or "ignore", and then you encounter an unassigned charac

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 17:20, Guido van Rossum wrote: > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > wrote: > > On 19.01.2018 05:38, Nathaniel Smith wrote: > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > wrote: > >> Can someone expl

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Random832
On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote: > > Someone did discover that Microsoft's current implementations of the > > windows-* encodings match the WHAT-WG spec, rather than the Unicode > > spec that Microsoft originally wrote. > > No, MS implements something called "best fit encodi

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Guido van Rossum
On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg wrote: > On 19.01.2018 05:38, Nathaniel Smith wrote: > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum > wrote: > >> Can someone explain to me why this is such a controversial issue? > > > > I guess practicality versus purity is always controver

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 05:38, Nathaniel Smith wrote: > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum wrote: >> Can someone explain to me why this is such a controversial issue? > > I guess practicality versus purity is always controversial :-) > >> It seems reasonable to me to add new encodings to th

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Nathaniel Smith
On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum wrote: > Can someone explain to me why this is such a controversial issue? I guess practicality versus purity is always controversial :-) > It seems reasonable to me to add new encodings to the stdlib that do the > roundtripping requested in the

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Guido van Rossum
Can someone explain to me why this is such a controversial issue? It seems reasonable to me to add new encodings to the stdlib that do the roundtripping requested in the first message of the thread. As long as they have new names that seems to fall under "practicality beats purity". (Modifying exi

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Steven D'Aprano
On Wed, Jan 10, 2018 at 07:13:39PM, Rob Speer wrote: [...] > Having a pip installable library as the _only_ way to use these encodings > is the status quo that I am very familiar with. It's awkward. To use a > package that registers new codecs, you have to import something from that > packag
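
The awkwardness described here comes from how codec registration works: a third-party codec only becomes visible once some module has called codecs.register(). The following is a minimal sketch of that mechanism, not the code of ftfy or any other package. The codec name "web-1252" is the placeholder used in this thread, and the table is built on the assumption that every byte Python's cp1252 leaves undefined falls through to the code point with the same number.

```python
import codecs


def _build_decoding_table(base="cp1252"):
    # One character per byte value; bytes the base codec leaves
    # undefined fall through to the same-numbered code point.
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode(base))
        except UnicodeDecodeError:
            chars.append(chr(b))
    return "".join(chars)


_DECODING_TABLE = _build_decoding_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _search(name):
    # The registry passes a normalized name; depending on the Python
    # version, hyphens may already have been turned into underscores.
    if name.replace("-", "_") == "web_1252":
        return codecs.CodecInfo(name="web-1252", encode=_encode, decode=_decode)
    return None


# Until this line has run somewhere in the process, "web-1252" is unknown.
codecs.register(_search)

print(repr(b"\x93ok\x94\x81".decode("web-1252")))
```

The import-time side effect is the point of the complaint: code that merely receives the string "web-1252" as an encoding name cannot use it unless the registering module has been imported first.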

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Soni L.
On 2018-01-18 04:12 PM, Stephen J. Turnbull wrote: Soni L. writes: > ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, > IIRC. You recall incorrectly. You're probably thinking of RFC 1345. But I've never seen that cited except in the IANA registry. All of ISO 202

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Stephen J. Turnbull
Soni L. writes: > ISO-8859-1 explicitly defines control characters in the \x80-\x9F range, > IIRC. You recall incorrectly. You're probably thinking of RFC 1345. But I've never seen that cited except in the IANA registry. All of ISO 2022, ISO 4873, ISO 8859, and Unicode suggest the ISO 6429

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Random832
On Thu, Jan 18, 2018, at 11:04, Stephen J. Turnbull wrote: > Nathaniel Smith writes: > > > It's also nice to be able to parse some HTML data, make a few changes > > in memory, and then serialize it back to HTML. Having this crash on > > random documents is rather irritating, esp. if these docum

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-18 Thread Stephen J. Turnbull
Nathaniel Smith writes: > It's also nice to be able to parse some HTML data, make a few changes > in memory, and then serialize it back to HTML. Having this crash on > random documents is rather irritating, esp. if these documents are > standards-compliant HTML as in this case. This example d

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-17 Thread Nathaniel Smith
On Wed, Jan 17, 2018 at 10:13 AM, Rob Speer wrote: > I'm going to push back on the idea that this should only be used for > decoding, not encoding. > > The use case I started with -- showing people how to fix mojibake using > Python -- would *only* use these codecs in the encoding direction. To fi

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-17 Thread Rob Speer
I'm going to push back on the idea that this should only be used for decoding, not encoding. The use case I started with -- showing people how to fix mojibake using Python -- would *only* use these codecs in the encoding direction. To fix the most common case of mojibake, you encode it as web-1252
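
A rough sketch of the encoding-direction use case follows. It is not ftfy's implementation; the helper names are hypothetical, and the fall-through rule (only bytes that Python's cp1252 leaves undefined map to the same-numbered C1 code point) is an assumption about how a "web-1252" encoder would behave.

```python
def _defined_in_cp1252(byte):
    # True if Python's cp1252 assigns a character to this byte value.
    try:
        bytes([byte]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


def encode_web1252(text):
    # Hypothetical helper: cp1252, plus the C1 fall-throughs for the
    # byte values that cp1252 leaves undefined.
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            code = ord(ch)
            if not (0x80 <= code <= 0x9F) or _defined_in_cp1252(code):
                raise
            out.append(code)
    return bytes(out)


def fix_mojibake(text):
    # Undo "UTF-8 bytes that were decoded as web-1252".
    return encode_web1252(text).decode("utf-8")


# U+2019 is E2 80 99 in UTF-8; all three bytes are defined in cp1252.
print(fix_mojibake("\u00e2\u20ac\u2122"))
# U+201D is E2 80 9D in UTF-8; byte 0x9D is undefined in cp1252,
# so plain cp1252 cannot encode the mojibake form of this character.
print(fix_mojibake("\u00e2\u20ac\x9d"))
```

With the standard codec, "\u00e2\u20ac\x9d".encode("cp1252") raises UnicodeEncodeError, which is why the fix needs the fall-through mapping in the encoding direction.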

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-17 Thread Chris Barker
On Tue, Jan 16, 2018 at 9:30 PM, Stephen J. Turnbull < turnbull.stephen...@u.tsukuba.ac.jp> wrote: > In what context? WHAT-WG's encoding standard is *all about browsers*. > If a codec is feeding text into a process that renders them all as > glyphs for a human to look at, that's one thing. The c

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-17 Thread Soni L.
On 2018-01-17 03:30 AM, Stephen J. Turnbull wrote: Soni L. writes: > This is surprising to me because I always took those encodings to > have those fallbacks [to raw control characters]. ISO-8859-1 implementations do, for historical reasons AFAICT. And they frequently produce mojibake an

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-16 Thread Stephen J. Turnbull
Random832 writes: > There are plenty of standard encodings that do have actual > representations of the control characters. My complaint was not about coded character sets that don't conform to ISO 2022's conventions about control vs. graphic blocks, especially in the C1 block. It was about pr

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-16 Thread Stephen J. Turnbull
Soni L. writes: > This is surprising to me because I always took those encodings to > have those fallbacks [to raw control characters]. ISO-8859-1 implementations do, for historical reasons AFAICT. And they frequently produce mojibake and occasionally wilder behavior. Most legacy encodings don

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-12 Thread Random832
On Fri, Jan 12, 2018, at 03:10, Stephen J. Turnbull wrote: > > Other than that, all the differences are adding the fall-throughs in the > > range U+0080 to U+009F. For example, elsewhere in windows-1255, the byte > > b'\xff' is undefined, and it remains undefined in WHATWG's mapping. > > I real

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-12 Thread Soni L.
On 2018-01-12 06:10 AM, Stephen J. Turnbull wrote: Rob Speer writes: > There is one more difference I have found between Python's encodings and > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracke

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-12 Thread Stephen J. Turnbull
Rob Speer writes: > There is one more difference I have found between Python's encodings and > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down > what the Unicode Consortium has to say about thi

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Nick Coghlan
On 12 January 2018 at 14:55, Steve Dower wrote: > On 12Jan2018 0342, Random832 wrote: >> >> On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: >>> >>> The way of solving this issue in Python is using an error handler. The >>> "surrogateescape" error handler is specially designed for lossless

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Stephen J. Turnbull
Executive summary: we already do. Nathaniel suggests we should conform to the WHAT-WG standard. But AFAGCT[1], there is no such thing as "WHATWG versions of legacy encodings". The document at https://encoding.spec.whatwg.org/ has the following normative specifications (capitalized words are pres

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Steve Dower
On 12Jan2018 0342, Random832 wrote: On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: The way of solving this issue in Python is using an error handler. The "surrogateescape" error handler is specially designed for lossless reversible decoding. It maps every unassigned byte in the range 0x

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread MRAB
On 2018-01-11 19:42, Rob Speer wrote: > The question is rather: how often does web-XXX mojibake happen? Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is w

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Random832
On Thu, Jan 11, 2018, at 14:55, Rob Speer wrote: > There is one more difference I have found between Python's encodings and > WHATWG's. In Python's codepage 1255, b'\xca' is undefined. In WHATWG's, it > maps to U+05BA HEBREW POINT HOLAM HASER FOR VAV. I haven't tracked down > what the Unicode Conso

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Rob Speer
On Thu, 11 Jan 2018 at 11:43 Random832 wrote: > Maybe we need a new error handler that maps unassigned bytes in the range > 0x80-0x9f to a single character in the range U+0080-U+009F. Do any of the > encodings being discussed have behavior other than the "normal" version of > the encoding plus wh

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Rob Speer
> The question is rather: how often does web-XXX mojibake happen? Very often. Particularly web-1252 mixed up with UTF-8. My ftfy library is tested on data from Twitter and the Common Crawl, both prime sources of mojibake. One common mojibake sequence is when a right curly quote is encoded as UTF-

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Random832
On Thu, Jan 11, 2018, at 03:58, M.-A. Lemburg wrote: > There's a problem with these encodings: they are mostly meant > for decoding (broken) data, but as soon as we have them in the stdlib, > people will also start using them for encoding data, producing more > corrupted data. Is it really corrupt

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Random832
On Thu, Jan 11, 2018, at 04:55, Serhiy Storchaka wrote: > The way of solving this issue in Python is using an error handler. The > "surrogateescape" error handler is specially designed for lossless > reversible decoding. It maps every unassigned byte in the range > 0x80-0xff to a single characte
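
For comparison, here is a short sketch of what the error-handler approach looks like with the existing cp1252 codec. It uses only the standard "surrogateescape" handler; the sample bytes are made up for illustration.

```python
data = b"caf\xe9 \x81"  # 0x81 is undefined in Python's cp1252

# Strict decoding rejects the undefined byte.
try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    print("strict:", exc.reason)

# surrogateescape keeps it as a lone surrogate (U+DC81) instead.
text = data.decode("cp1252", errors="surrogateescape")
print(ascii(text))

# The round trip is lossless as long as the same handler is used to encode.
assert text.encode("cp1252", errors="surrogateescape") == data
```

Note that the escaped byte becomes U+DC81 rather than U+0081, so the resulting string differs from what a browser following the WHATWG mapping would produce, even though the bytes round-trip.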

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Antoine Pitrou
On Thu, 11 Jan 2018 05:18:43 -0800 Nathaniel Smith wrote: > I'm not an expert here or anything, but from what we've been hearing it > sounds like it must be used by all standard-compliant HTML parsers. I don't > *like* the standard much, but I don't think that the stdlib should refuse > to handle

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Nathaniel Smith
On Jan 11, 2018 4:05 AM, "Antoine Pitrou" wrote: Define "widely used". If web-XXX is a superset of windows-XXX, then perhaps web-XXX is "used" in the sense of "used to decode valid windows-XXX data" (but windows-XXX could be used just as well to decode the same data). The question is rather: ho

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Antoine Pitrou
On Wed, 10 Jan 2018 16:24:33 -0800 Chris Barker wrote: > On Wed, Jan 10, 2018 at 11:04 AM, M.-A. Lemburg wrote: > > > I don't believe it's a good strategy to create the confusion that > > WHATWG is introducing by using the same names for non-standard > > encodings. > > > > agreed. > > > > P

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Stephan Houben
On 11 Jan 2018 at 10:56, "Serhiy Storchaka" wrote: 09.01.18 23:15, Rob Speer wrote: > > > For the sake of discussion, let's call this encoding "web-1252". WHATWG > calls it "windows-1252", I'd suggest naming it "whatwg-windows-1252", and in general "whatwg-" + whatwgs_name_of_encoding S

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Serhiy Storchaka
09.01.18 23:15, Rob Speer wrote: There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "iso-8859-1" or "windows-1252", you will get this encoding _instead_. It is probably th

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread M.-A. Lemburg
On 11.01.2018 10:01, Chris Angelico wrote: > On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg wrote: >> On 11.01.2018 01:22, Nick Coghlan wrote: >>> On 11 January 2018 at 05:04, M.-A. Lemburg wrote: For the stdlib, I think we should stick to standards and not go for spreading non-standard

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread Chris Angelico
On Thu, Jan 11, 2018 at 7:58 PM, M.-A. Lemburg wrote: > On 11.01.2018 01:22, Nick Coghlan wrote: >> On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >>> For the stdlib, I think we should stick to standards and >>> not go for spreading non-standard ones. >>> >>> So -1 on adding WHATWG encodings t

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-11 Thread M.-A. Lemburg
On 11.01.2018 01:22, Nick Coghlan wrote: > On 11 January 2018 at 05:04, M.-A. Lemburg wrote: >> For the stdlib, I think we should stick to standards and >> not go for spreading non-standard ones. >> >> So -1 on adding WHATWG encodings to the stdlib. > > We already support HTML5 in the standard li

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Random832
On Wed, Jan 10, 2018, at 14:44, Steve Barnes wrote: > I am somewhat confused because according to > https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the > original examples) is undefined as the table only runs to 127 i.e. 0x7F. The spec referenced in the comments says "Let co

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Chris Angelico
On Thu, Jan 11, 2018 at 6:44 AM, Steve Barnes wrote: > > I am somewhat confused because according to > https://encoding.spec.whatwg.org/index-windows-1252.txt 0x90 (one of the > original examples) is undefined as the table only runs to 127 i.e. 0x7F. AIUI the table in that file assumes that the f
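
The confusion about the table "only running to 127" can be illustrated with a small parser. In the WHATWG single-byte indexes the first column is a pointer, and the byte it describes is the pointer plus 0x80. The sketch below assumes the tab-separated "pointer, code point" layout of those index files; the three sample rows are an illustrative excerpt typed in by hand, not a download of the real file.

```python
# Illustrative excerpt in the assumed "pointer<TAB>0xCODEPOINT" layout.
SAMPLE_INDEX = "0\t0x20AC\n1\t0x0081\n16\t0x0090\n"


def parse_single_byte_index(text):
    # Map byte values (0x80..0xFF) to characters; pointer N describes
    # byte N + 0x80, so the pointers themselves only run from 0 to 127.
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pointer, code_point = line.split("\t")[:2]
        table[int(pointer) + 0x80] = chr(int(code_point, 16))
    return table


table = parse_single_byte_index(SAMPLE_INDEX)
for byte, char in sorted(table.items()):
    print(hex(byte), "->", "U+%04X" % ord(char))
```

Read this way, pointer 16 is byte 0x90, which is how the byte from the original example ends up with a mapping.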

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Steve Barnes
On 10/01/2018 19:13, Rob Speer wrote: > I was originally proposing these encodings under different names, and > that's what I think they should have. Indeed, that helps because a pip > installable library can backport the new encodings to previous versions > of Python. > > Having a pip instal

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Chris Barker
On Wed, Jan 10, 2018 at 11:04 AM, M.-A. Lemburg wrote: > I don't believe it's a good strategy to create the confusion that > WHATWG is introducing by using the same names for non-standard > encodings. > agreed. > Python uses the Unicode Consortium standard encodings or > otherwise internationa

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Nick Coghlan
On 11 January 2018 at 05:04, M.-A. Lemburg wrote: > For the stdlib, I think we should stick to standards and > not go for spreading non-standard ones. > > So -1 on adding WHATWG encodings to the stdlib. We already support HTML5 in the standard library, and saying "We'll accept WHATWG's definition

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Rob Speer
> Well, one of your main arguments was that the Windows API follows these best fit encodings. No, that wasn't me, that was Ivan. My argument has been based on compatibility with Web technologies; I wanted these encodings before I knew what Windows did (and now what Windows does kind of horrifies m

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread M.-A. Lemburg
On 10.01.2018 20:13, Rob Speer wrote: > I was originally proposing these encodings under different names, and > that's what I think they should have. Indeed, that helps because a pip > installable library can backport the new encodings to previous versions of > Python. > > Having a pip installable

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Rob Speer
I was originally proposing these encodings under different names, and that's what I think they should have. Indeed, that helps because a pip installable library can backport the new encodings to previous versions of Python. Having a pip installable library as the _only_ way to use these encodings

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread M.-A. Lemburg
On 10.01.2018 19:36, Rob Speer wrote: > I'm looking at the documentation of "best fit" mappings, and that seems to > be a different matter. It appears that best-fit mappings are designed to be > many-to-one mappings used only for encoding. "Best fit" is what the Windows API is implementing. I don

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Rob Speer
I'm looking at the documentation of "best fit" mappings, and that seems to be a different matter. It appears that best-fit mappings are designed to be many-to-one mappings used only for encoding. "Examples of best fit are converting fullwidth letters to their counterparts when converting to single

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread M.-A. Lemburg
On 10.01.2018 00:56, Rob Speer wrote: > Oh that's interesting. So it seems to be Python that's the exception here. > > Would we really be able to add entries to character mappings that haven't > changed since Python 2.0? The Windows mappings in Python come directly from the Unicode Consortium map

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-10 Thread Paul Moore
On 10 January 2018 at 04:16, Nick Coghlan wrote: > On 10 January 2018 at 13:56, Rob Speer wrote: >> One other thing I've noticed that's related to the WHATWG encoding list: in >> Python, the encoding name "windows-874" seems to be missing. The _encoding_ >> is there, as "cp874", but "windows-874"

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-09 Thread Nick Coghlan
On 10 January 2018 at 13:56, Rob Speer wrote: > One other thing I've noticed that's related to the WHATWG encoding list: in > Python, the encoding name "windows-874" seems to be missing. The _encoding_ > is there, as "cp874", but "windows-874" doesn't work as an alias for it the > way that "window

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-09 Thread Rob Speer
One other thing I've noticed that's related to the WHATWG encoding list: in Python, the encoding name "windows-874" seems to be missing. The _encoding_ is there, as "cp874", but "windows-874" doesn't work as an alias for it the way that "windows-1252" works as an alias for "cp1252". That alias shou
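
Whether a given label resolves can be checked with codecs.lookup(). The helper name below is hypothetical, and no output is shown for "windows-874" because the result depends on the Python version in use.

```python
import codecs


def canonical_name(label):
    # Return the codec's canonical name, or None if the label is unknown.
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


for label in ("windows-1252", "cp874", "windows-874"):
    print(label, "->", canonical_name(label))
```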

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-09 Thread Nick Coghlan
On 10 January 2018 at 09:56, Rob Speer wrote: > Oh that's interesting. So it seems to be Python that's the exception here. > > Would we really be able to add entries to character mappings that haven't > changed since Python 2.0? Changing things that used to cause an exception into operations that

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-09 Thread Rob Speer
Oh that's interesting. So it seems to be Python that's the exception here. Would we really be able to add entries to character mappings that haven't changed since Python 2.0? On Tue, 9 Jan 2018 at 16:53 Ivan Pozdeev via Python-ideas < python-ideas@python.org> wrote: > First of all, many thanks f

Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-09 Thread Ivan Pozdeev via Python-ideas
First of all, many thanks for such an excellently written letter. It was a real pleasure to read. On 10.01.2018 0:15, Rob Speer wrote: Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings. There is an encoding with no name of it

[Python-ideas] Support WHATWG versions of legacy encodings

2018-01-09 Thread Rob Speer
Hi! I joined this list because I'm interested in filling a gap in Python's standard library, relating to text encodings. There is an encoding with no name of its own. It's supported by every current web browser and standardized by WHATWG. It's so prevalent that if you ask a Web browser to decode "
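
The gap under discussion can be seen directly from Python. The sketch below lists the byte values in the 0x80-0x9F range that a codec refuses to decode; the helper name is hypothetical, and only standard codecs are used.

```python
def undefined_c1_bytes(codec):
    # Byte values in 0x80..0x9F that the given codec cannot decode.
    missing = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


print([hex(b) for b in undefined_c1_bytes("cp1252")])
print([hex(b) for b in undefined_c1_bytes("latin-1")])
```

Python's cp1252 rejects five bytes in this range, while latin-1 decodes every byte; the encoding proposed in this thread behaves like cp1252 for the defined bytes and like latin-1 for those five.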