Re: Magic UTF-8/Windows-1252 encodings

2016-08-30 Thread Chris Angelico
On Tue, Aug 30, 2016 at 7:36 PM, Johannes Bauer wrote:
> On 29.08.2016 17:59, Chris Angelico wrote:
>
>> Fair enough. If this were something that a lot of programs wanted,
>> then yeah, there'd be good value in stdlibbing it. Character encodings
>> ARE hard to get right, and this kind of thing does warrant some help.
>> But I think it's best not done in core - at least, not until we see a
>> lot more people doing the same :)
>
> I hope this kind of botchery never makes it into the stdlib. It directly
> contradicts "In the face of ambiguity, refuse the temptation to guess."
>
> If you don't know what the charset is, don't guess. It'll introduce
> subtle ambiguities and ugly corner cases and will make life for the
> rest of us -- who are trying to get their charsets straight and correct
> -- a living hell.
>
> Having such silly "magic" guessing stuff is actually detrimental to the
> whole concept of properly identifying and using character sets.
> Everything about the thought makes me shiver.

In the clinical purity of theoretical work, I absolutely agree with
you, and for that reason, this definitely doesn't belong in the
stdlib. But designers need to leave their wonderlands - the real world
is not so wonderful. (Nan Sharpe, to Alice Liddell.) If every program
in the world understood character encodings and correctly decoded
bytes using a known encoding and encoded text using the same encoding
(preferably UTF-8), then sure, it'd be easy. But when your program has
to cope with other people's bytes-that-ought-to-represent-text,
sometimes guessing IS better than choking. This example is a perfect
one; a naive byte-oriented server accepts ASCII-compatible text from a
variety of clients, and sends it out to all clients. (Since all the
parts that the server actually parses are ASCII, this works.) Very
commonly, naive Windows clients send text in the native encoding, e.g.
CP-1252, but smarter clients generally send UTF-8. I want my client to
interoperate perfectly with other UTF-8 clients, which is generally
easy (the only breakage is if the server attempts to letter-wrap a
massively long word, and ends up breaking a UTF-8 sequence across
lines), but I also want to have a decent fallback for the eight-bit
clients. Obviously I can't *know* the encoding used - if they were
smart enough to send encoding info, they'd most likely use UTF-8 - so
it's either guess, or choke on any non-ASCII bytes.

Another place where guessing is VERY useful is when I'm leafing
through 300 subtitles files for "Tangled" and want to know whether
they're accurate transcriptions or not. (Not hypothetical. Been doing
exactly that for a lot of this weekend. It seemed logical, since I've
done the same for "Frozen", and both movies are excellent.) All I have
is a file - a sequence of bytes. I know it's an ASCII-compatible
encoding because the numeric positioning info looks correct. If my
program "avoided the temptation to guess", I would have to manually
test a dozen encodings until one of them looked right to me, the
human; but instead, I use chardet plus some other heuristics, and
generally the program's right on either the first or second guess.
That means just two encodings for me to look at, often just one, and
only going to the full dozen or so if it gets it completely wrong.
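As a rough illustration of narrowing down the candidates before a human looks at them, here is a minimal stdlib-only sketch. It is not the chardet-based approach described above: it merely filters a list of encodings down to those that decode the bytes without error. The function name plausible_decodings and the default candidate list are assumptions made up for this example.

```python
from typing import List, Sequence, Tuple


def plausible_decodings(
    data: bytes,
    candidates: Sequence[str] = ("utf-8", "cp1252", "cp1250", "shift_jis"),
) -> List[Tuple[str, str]]:
    """Return (encoding, text) pairs for every candidate that decodes cleanly.

    This only rules encodings out; it cannot say which survivor is right.
    A statistical detector or a human still has to pick among them.
    """
    results = []
    for name in candidates:
        try:
            results.append((name, data.decode(name)))
        except UnicodeDecodeError:
            # Strict decoding failed, so this encoding is not a candidate.
            continue
    return results


if __name__ == "__main__":
    sample = "se\u00f1ora".encode("cp1252")
    for name, text in plausible_decodings(sample):
        print(name, repr(text))
```

Strict UTF-8 decoding rejects most legacy eight-bit text, so it makes a good first filter; eight-bit codepages accept almost anything, which is why the final choice among them still needs heuristics or a human eye.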

The principle "refuse the temptation to guess" applies to core data
types and such (and not even universally there), but NOT to
applications, where you need domain knowledge to make that kind of
call.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Magic UTF-8/Windows-1252 encodings

2016-08-30 Thread Johannes Bauer
On 29.08.2016 17:59, Chris Angelico wrote:

> Fair enough. If this were something that a lot of programs wanted,
> then yeah, there'd be good value in stdlibbing it. Character encodings
> ARE hard to get right, and this kind of thing does warrant some help.
> But I think it's best not done in core - at least, not until we see a
> lot more people doing the same :)

I hope this kind of botchery never makes it into the stdlib. It directly
contradicts "In the face of ambiguity, refuse the temptation to guess."

If you don't know what the charset is, don't guess. It'll introduce
subtle ambiguities and ugly corner cases and will make life for the
rest of us -- who are trying to get their charsets straight and correct
-- a living hell.

Having such silly "magic" guessing stuff is actually detrimental to the
whole concept of properly identifying and using character sets.
Everything about the thought makes me shiver.

Cheers,
Johannes

-- 
>> Where EXACTLY did you predict the quake again?
> At least not publicly!
Ah, the latest and to this day most ingenious stunt of our great
cosmologist: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa


Re: Magic UTF-8/Windows-1252 encodings

2016-08-29 Thread Chris Angelico
On Tue, Aug 30, 2016 at 1:28 AM, Random832 wrote:
> On Mon, Aug 29, 2016, at 11:14, Chris Angelico wrote:
>> Please don't. :) This is something that belongs in the application;
>> it's somewhat hacky, and I don't see any benefit to it going into the
>> language. For one thing, I could well imagine making the fallback
>> encoding configurable (it isn't currently, but it could easily be),
>> and that doesn't really fit into the Python notion of error handler.
>
> Well, yeah, if anything, implementing it as an error handler is a hack. I
> just meant it's the least hacky way I can think of that fits in the
> size "half a dozen lines".
>
>> For another, this is a fairly rare concept - I don't see dozens of
>> programs out there using the exact same strange logic, and even if
>> there were, there'd be small differences
>
> That is actually an argument in favor of putting it in the stdlib,
> assuming few of those small differences are truly considered and
> intentional. The main thrust of my post was that this is one of the
> things that's harder than it sounds to get right due to edge cases, just
> like the clip/clamp function being discussed last month.

Fair enough. If this were something that a lot of programs wanted,
then yeah, there'd be good value in stdlibbing it. Character encodings
ARE hard to get right, and this kind of thing does warrant some help.
But I think it's best not done in core - at least, not until we see a
lot more people doing the same :)

ChrisA


Re: Magic UTF-8/Windows-1252 encodings

2016-08-29 Thread Random832
On Mon, Aug 29, 2016, at 11:14, Chris Angelico wrote:
> Please don't. :) This is something that belongs in the application;
> it's somewhat hacky, and I don't see any benefit to it going into the
> language. For one thing, I could well imagine making the fallback
> encoding configurable (it isn't currently, but it could easily be),
> and that doesn't really fit into the Python notion of error handler.

Well, yeah, if anything, implementing it as an error handler is a hack. I
just meant it's the least hacky way I can think of that fits in the
size "half a dozen lines".

> For another, this is a fairly rare concept - I don't see dozens of
> programs out there using the exact same strange logic, and even if
> there were, there'd be small differences

That is actually an argument in favor of putting it in the stdlib,
assuming few of those small differences are truly considered and
intentional. The main thrust of my post was that this is one of the
things that's harder than it sounds to get right due to edge cases, just
like the clip/clamp function being discussed last month.

> (e.g. whether or not the
> fallback is applied line-by-line). This was intended as an example of
> something that does NOT belong in the core language, and while I
> appreciate the offer of help, it's not something I'd support polluting
> the language with :)
> 
> (Plus, my server's not written in Python. Nor is the client that this
> started in, although I have considered writing a version of it in
> Python, which would in theory benefit from this.)


Re: Magic UTF-8/Windows-1252 encodings

2016-08-29 Thread Chris Angelico
On Tue, Aug 30, 2016 at 12:38 AM, Random832 wrote:
> Directing this to python-list because it's really not on the topic of
> the idea being discussed.
>
> On Mon, Aug 29, 2016, at 05:37, Chris Angelico wrote:
>> Suppose I come to python-ideas and say "Hey, the MUD community would
>> really benefit from a magic decoder that would use UTF-8 where
>> possible, ISO-8859-1 as fall-back, and Windows-1252 for characters not
>> in 8859-1". Apart from responding that 8859-1 is a complete subset of
>> 1252,
>
> ISO-8859-1, with a dash in between "ISO" and "8859", is not a complete
> subset of 1252. In fact, ISO-8859-1-with-a-dash incorporates ISO 6429
> for 0x80-0x9F, and thereby has no bytes that do not map to characters.
> Incidentally, many Windows encodings, including 1252, as they are
> actually used, do use ISO 6429 for bytes that do not map to characters,
> even when best fit mappings are not accepted. It is unclear why they
> published tables that define these bytes as undefined, which have been
> picked up by independent implementations of these encodings such as the
> ones in Python. The only reason I can think of is to reserve the ability
> to add new mappings later, as they did for 0x80 to U+20AC.

Huh, okay. Anyway, point is that it's a magical decoder that tries
UTF-8, and if that fails, uses an eight-bit encoding.

>> there's not really a lot that you could discuss about that
>> proposal, unless I were to show you some of my code. I can tell you
>> about the number of MUDs that I play, the number of MUD clients that
>> I've written, and some stats from my MUD server, and say "The MUD
>> community needs this support", but it's of little value compared to
>> actual code.
>>
>> (For the record, a two-step decode of "UTF-8, fall back on 1252" is
>> exactly what I do... in half a dozen lines of code. So this does NOT
>> need to be implemented.)
>
> And what level is the fallback done at? Per line? Per character? Per
> read result? Does encountering an invalid-for-UTF-8 byte put it
> permanently in Windows-1252 mode? Does it "retroactively" affect earlier
> bytes? Can it be used as a stream encoding, or does it require you to
> use bytes-based I/O and a separate .decode step?

Currently? UTF-8 is attempted on an entire read result, and if it
fails, the data is cracked into individual lines and retried, using
the fallback as per the above. So in effect, it's per line. I
basically assume that a naive byte-oriented server is usually going to
be spitting out data from one client at a time, and each client is
either emitting UTF-8 or its native encoding. (Since I have no way of
knowing what native encoding a given client was using, I just pick
Western Europe as the most likely codepage and run with it. The
algorithm would work just the same if I picked, say, Windows-1250 as
the eight-bit encoding.)
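The algorithm described above can be sketched in Python roughly as follows. This is a minimal sketch and not the actual client or server code (which, as noted elsewhere in the thread, is not written in Python). The name decode_magic and the use of errors="replace" on the fallback are assumptions for illustration; the latter is there because Python's cp1252 codec leaves a few bytes undefined.

```python
def decode_magic(data: bytes, fallback: str = "cp1252") -> str:
    """Decode a read result as UTF-8, falling back per line if that fails."""
    try:
        # Fast path: the entire read result is valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    lines = []
    # Splitting on 0x0A is safe for ASCII-compatible encodings.
    for line in data.split(b"\n"):
        try:
            lines.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Assumption: replace bytes the fallback codec does not define
            # rather than raising a second time.
            lines.append(line.decode(fallback, errors="replace"))
    return "\n".join(lines)


if __name__ == "__main__":
    mixed = "na\u00efve\n".encode("utf-8") + "caf\u00e9\n".encode("cp1252")
    print(repr(decode_magic(mixed)))
```

Passing fallback="cp1250" instead would give the same behaviour with a Central European codepage, which is the configurability mentioned above.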

> I assume a MUD server isn't blocking on each client socket waiting for a
> newline character, so how does such a decoding step mesh with whatever
> such a server does to handle I/O asynchronously? Are there any
> frameworks that you could be using that you can't if it's not an
> encoding?

This magic started out in my MUD client, where it's connecting to a
naive server that echoes whatever it's given. The same logic is now in
my MUD server, too. It's pretty simple in both cases; the client is
built around asynchronous I/O, the server is threaded, but both of
them have a single point in the code where new bytes come in. There's
one function that converts bytes to text, and it operates on the above
algorithm.

> What happens if it's being used as an incremental decoder, encounters a
> valid UTF-8 lead byte on a buffer boundary, and then must "reject" (i.e.
> decode as the fallback encoding) it afterwards because an invalid trail
> byte follows it in the next buffer? What happens if a buffer consists
> only of a valid partial UTF-8 character?

Hmm, I don't remember if there's any actual handling of this. If
there's a problem, my solution is simple: split on 0x0A first, and
then decode, which means I'm decoding one line at a time. Both server
and client already are fundamentally line-based anyway, and depending
on byte value 0x0A always and only representing U+000A is valid in all
of the encodings that I'm willing to accept.
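A minimal sketch of that "split on 0x0A first, then decode" approach, assuming a hypothetical LineDecoder class that buffers incoming bytes until a full line has arrived. Because only complete lines are ever decoded, a UTF-8 sequence that straddles two reads is never seen in halves. The class name, method names, and errors="replace" choice are all assumptions for illustration.

```python
from typing import List


class LineDecoder:
    """Buffer raw bytes and decode only complete, newline-terminated lines."""

    def __init__(self, fallback: str = "cp1252") -> None:
        self._buffer = b""
        self._fallback = fallback

    def feed(self, chunk: bytes) -> List[str]:
        """Add a chunk of bytes; return any lines completed by it."""
        self._buffer += chunk
        # Everything before the last 0x0A is complete; the tail is kept
        # in the buffer until a later chunk finishes that line.
        *complete, self._buffer = self._buffer.split(b"\n")
        return [self._decode_line(line) for line in complete]

    def _decode_line(self, line: bytes) -> str:
        try:
            return line.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: replace bytes the fallback codec leaves undefined.
            return line.decode(self._fallback, errors="replace")


if __name__ == "__main__":
    decoder = LineDecoder()
    print(decoder.feed(b"caf\xc3"))           # lead byte only, no line yet
    print(decoder.feed(b"\xa9\ncaf\xe9\n"))   # completes two lines
```

The cost is that a partial line is held back until its newline arrives, which is acceptable for a protocol that is already line-based.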

> I can probably implement the fallback as an error handler in half a
> dozen lines, but it's not obvious and I suspect it's not what a lot of
> people do. It would probably take a bit more than half a dozen lines to
> implement it as an encoding.

Please don't. :) This is something that belongs in the application;
it's somewhat hacky, and I don't see any benefit to it going into the
language. For one thing, I could well imagine making the fallback
encoding configurable (it isn't currently, but it could easily be),
and that doesn't really fit into the Python notion of error handler.
For another, this is a fairly rare concept - I don't see dozens of
programs out there using the exact same strange lo

Magic UTF-8/Windows-1252 encodings

2016-08-29 Thread Random832
Directing this to python-list because it's really not on the topic of
the idea being discussed.

On Mon, Aug 29, 2016, at 05:37, Chris Angelico wrote:
> Suppose I come to python-ideas and say "Hey, the MUD community would
> really benefit from a magic decoder that would use UTF-8 where
> possible, ISO-8859-1 as fall-back, and Windows-1252 for characters not
> in 8859-1". Apart from responding that 8859-1 is a complete subset of
> 1252,

ISO-8859-1, with a dash in between "ISO" and "8859", is not a complete
subset of 1252. In fact, ISO-8859-1-with-a-dash incorporates ISO 6429
for 0x80-0x9F, and thereby has no bytes that do not map to characters.
The magic encoding that people often ask for or use is to use UTF-8
first, Windows-1252 as a fallback, and ISO 6429 as the final fallback
(and may or may not involve a "side trip" through Windows-1252 for UTF-8
encodings purportedly of code points between U+0080 and U+009F).

Incidentally, many Windows encodings, including 1252, as they are
actually used, do use ISO 6429 for bytes that do not map to characters,
even when best fit mappings are not accepted. It is unclear why they
published tables that define these bytes as undefined, which have been
picked up by independent implementations of these encodings such as the
ones in Python. The only reason I can think of is to reserve the ability
to add new mappings later, as they did for 0x80 to U+20AC.
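To make the undefined-bytes point concrete, here is a small sketch that asks Python's own cp1252 codec which single bytes it refuses to decode, then layers the three-step fallback (UTF-8, then Windows-1252, then the C1 controls via latin-1) on top. The function names are hypothetical, and the per-byte granularity of the second step is just one possible answer to the "what level" question.

```python
def undefined_in(encoding: str) -> list:
    """List the byte values that the given codec cannot decode on their own."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def decode_1252_with_c1(data: bytes) -> str:
    """Decode as cp1252, mapping undefined bytes to C1 controls via latin-1."""
    out = []
    for value in data:
        single = bytes([value])
        try:
            out.append(single.decode("cp1252"))
        except UnicodeDecodeError:
            # latin-1 maps every byte 0xNN to U+00NN, including 0x80-0x9F.
            out.append(single.decode("latin-1"))
    return "".join(out)


def decode_three_step(data: bytes) -> str:
    """UTF-8 first, then cp1252 with latin-1 covering the gaps."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return decode_1252_with_c1(data)


if __name__ == "__main__":
    print([hex(v) for v in undefined_in("cp1252")])
    print(repr(decode_three_step(b"\x80\x81")))
```

With this chain no input can fail to decode, since latin-1 accepts every byte value.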

> there's not really a lot that you could discuss about that
> proposal, unless I were to show you some of my code. I can tell you
> about the number of MUDs that I play, the number of MUD clients that
> I've written, and some stats from my MUD server, and say "The MUD
> community needs this support", but it's of little value compared to
> actual code.
> 
> (For the record, a two-step decode of "UTF-8, fall back on 1252" is
> exactly what I do... in half a dozen lines of code. So this does NOT
> need to be implemented.)

And what level is the fallback done at? Per line? Per character? Per
read result? Does encountering an invalid-for-UTF-8 byte put it
permanently in Windows-1252 mode? Does it "retroactively" affect earlier
bytes? Can it be used as a stream encoding, or does it require you to
use bytes-based I/O and a separate .decode step?

I assume a MUD server isn't blocking on each client socket waiting for a
newline character, so how does such a decoding step mesh with whatever
such a server does to handle I/O asynchronously? Are there any
frameworks that you could be using that you can't if it's not an
encoding?

What happens if it's being used as an incremental decoder, encounters a
valid UTF-8 lead byte on a buffer boundary, and then must "reject" (i.e.
decode as the fallback encoding) it afterwards because an invalid trail
byte follows it in the next buffer? What happens if a buffer consists
only of a valid partial UTF-8 character?

I can probably implement the fallback as an error handler in half a
dozen lines, but it's not obvious and I suspect it's not what a lot of
people do. It would probably take a bit more than half a dozen lines to
implement it as an encoding.
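For reference, a fallback of this kind can be sketched as an error handler using codecs.register_error. This is an illustration under stated assumptions, not a proposal: the handler name "cp1252_fallback" is made up, and errors="replace" is used inside the handler because Python's cp1252 codec leaves a few bytes undefined. Note that a handler like this falls back per offending byte sequence, not per line.

```python
import codecs


def cp1252_fallback(exc: UnicodeError):
    """Error handler: decode the offending bytes as Windows-1252 instead."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position to resume decoding at.
        return bad.decode("cp1252", errors="replace"), exc.end
    # Encoding and translating errors are not handled here.
    raise exc


# The handler name is an assumption chosen for this example.
codecs.register_error("cp1252_fallback", cp1252_fallback)


if __name__ == "__main__":
    data = b"\xc3\xa9 and \xe9"
    print(data.decode("utf-8", errors="cp1252_fallback"))
```

Because the handler is consulted only where UTF-8 decoding fails, valid UTF-8 and stray eight-bit bytes can be mixed within a single line, which is a different behaviour from a per-line fallback.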