On Tue, Aug 30, 2016 at 12:38 AM, Random832 <random...@fastmail.com> wrote:
> Directing this to python-list because it's really not on the topic of
> the idea being discussed.
>
> On Mon, Aug 29, 2016, at 05:37, Chris Angelico wrote:
>> Suppose I come to python-ideas and say "Hey, the MUD community would
>> really benefit from a magic decoder that would use UTF-8 where
>> possible, ISO-8859-1 as fall-back, and Windows-1252 for characters not
>> in 8859-1". Apart from responding that 8859-1 is a complete subset of
>> 1252,
>
> ISO-8859-1, with a dash in between "ISO" and "8859" is not a complete
> subset of 1252. In fact, ISO-8859-1-with-a-dash incorporates ISO 6429
> for 0x80-0x9F, and thereby has no bytes that do not map to characters.
> Incidentally, many Windows encodings, including 1252, as they are
> actually used do use ISO 6429 for bytes that do not map to characters,
> even when best fit mappings are not accepted. It is unclear why they
> published tables that define these bytes as undefined, which have been
> picked up by independent implementations of these encodings such as the
> ones in Python. The only reason I can think of is to reserve the ability
> to add new mappings later, as they did for 0x80 to U+20AC.

Huh, okay. Anyway, point is that it's a magical decoder that tries
UTF-8, and if that fails, uses an eight-bit encoding.

>> there's not really a lot that you could discuss about that
>> proposal, unless I were to show you some of my code. I can tell you
>> about the number of MUDs that I play, the number of MUD clients that
>> I've written, and some stats from my MUD server, and say "The MUD
>> community needs this support", but it's of little value compared to
>> actual code.
>>
>> (For the record, a two-step decode of "UTF-8, fall back on 1252" is
>> exactly what I do... in half a dozen lines of code. So this does NOT
>> need to be implemented.)
>
> And what level is the fallback done at? Per line? Per character? Per
> read result? Does encountering an invalid-for-UTF-8 byte put it
> permanently in Windows-1252 mode? Does it "retroactively" affect earlier
> bytes? Can it be used as a stream encoding, or does it require you to
> use bytes-based I/O and a separate .decode step?

Currently? UTF-8 is attempted on an entire read result, and if it
fails, the data is cracked into individual lines and retried, using
the fallback as per the above. So in effect, it's per line. I
basically assume that a naive byte-oriented server is usually going to
be spitting out data from one client at a time, and each client is
either emitting UTF-8 or its native encoding. (Since I have no way of
knowing what native encoding a given client was using, I just pick
Western Europe as the most likely codepage and run with it. The
algorithm would work just the same if I picked, say, Windows-1250 as
the eight-bit encoding.)

> I assume a MUD server isn't blocking on each client socket waiting for a
> newline character, so how does such a decoding step mesh with whatever
> such a server does to handle I/O asynchronously? Are there any
> frameworks that you could be using that you can't if it's not an
> encoding?

This magic started out in my MUD client, where it's connecting to a
naive server that echoes whatever it's given. The same logic is now in
my MUD server, too. It's pretty simple in both cases; the client is
built around asynchronous I/O, the server is threaded, but both of
them have a single point in the code where new bytes come in. There's
one function that converts bytes to text, and it operates on the above
algorithm.

> What happens if it's being used as an incremental decoder, encounters a
> valid UTF-8 lead byte on a buffer boundary, and then must "reject" (i.e.
> decode as the fallback encoding) it afterwards because an invalid trail
> byte follows it in the next buffer? What happens if a buffer consists
> only of a valid partial UTF-8 character?

Hmm, I don't remember if there's any actual handling of this. If
there's a problem, my solution is simple: split on 0x0A first, and
then decode, which means I'm decoding one line at a time. Both server
and client already are fundamentally line-based anyway, and depending
on byte value 0x0A always and only representing U+000A is valid in all
of the encodings that I'm willing to accept.

> I can probably implement the fallback as an error handler in half a
> dozen lines, but it's not obvious and I suspect it's not what a lot of
> people do. It would probably take a bit more than half a dozen lines to
> implement it as an encoding.

Please don't. :) This is something that belongs in the application;
it's somewhat hacky, and I don't see any benefit to it going into the
language. For one thing, I could well imagine making the fallback
encoding configurable (it isn't currently, but it could easily be),
and that doesn't really fit into the Python notion of error handler.
For another, this is a fairly rare concept - I don't see dozens of
programs out there using the exact same strange logic, and even if
there were, there'd be small differences (eg whether or not the
fallback is applied line-by-line).

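To make the application-level logic concrete, here is a rough Python
sketch of the two behaviours described above: "try UTF-8 on the whole
read result, then retry per line", and the split-on-0x0A-first variant
that sidesteps the buffer-boundary question. It is only an
illustration under assumptions: the real code isn't in Python, the
names (FALLBACK, decode_line, decode_chunk, LineDecoder) are made up
here, and errors="replace" is an added choice because Python's cp1252
codec leaves a few bytes undefined.

```python
from typing import List

# Assumed eight-bit fallback; any single-byte codec would work the same way.
FALLBACK = "cp1252"


def decode_line(line: bytes, fallback: str = FALLBACK) -> str:
    """Decode one line as UTF-8, falling back to the eight-bit encoding."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined,
        # so "replace" keeps the fallback itself from raising on them.
        return line.decode(fallback, errors="replace")


def decode_chunk(data: bytes, fallback: str = FALLBACK) -> str:
    """Try UTF-8 on a whole read result; on failure, retry line by line."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        lines = data.split(b"\n")
        return "\n".join(decode_line(line, fallback) for line in lines)


class LineDecoder:
    """Split-first variant: buffer bytes and decode only complete lines."""

    def __init__(self, fallback: str = FALLBACK) -> None:
        self._pending = b""  # bytes after the last 0x0A seen so far
        self._fallback = fallback

    def feed(self, data: bytes) -> List[str]:
        """Add new bytes; return the lines completed by them (no newline)."""
        *complete, self._pending = (self._pending + data).split(b"\n")
        return [decode_line(line, self._fallback) for line in complete]
```

With the split-first variant, a UTF-8 character cut in half by a
buffer boundary just sits in the pending bytes until its line is
complete, so the decoder never has to change its mind about bytes it
has already handed back.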
This was intended as an example of something that does NOT belong in
the core language, and while I appreciate the offer of help, it's not
something I'd support polluting the language with :)

(Plus, my server's not written in Python. Nor is the client that this
started in, although I have considered writing a version of it in
Python, which would in theory benefit from this.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list