On Wed, Jun 29, 2016 at 1:52 AM, Random832 <random...@fastmail.com> wrote:
> On Tue, Jun 28, 2016, at 06:25, Chris Angelico wrote:
>> For the OP's situation, frankly, I doubt there'll be anything other
>> than UTF-8, Latin-1, and CP-1252. The chances that someone casually
>> mixes CP-1252 with (say) CP-1254 would be vanishingly small. So the
>> simple decode of "UTF-8, or failing that, 1252" is probably going to
>> give correct results for most of the content. The trick is figuring
>> out a correct boundary for the check; line-by-line may be sufficient,
>> or it may not.
>
> For completeness, this can be done character-by-character (i.e. try to
> decode a UTF-8 character, if it fails decode the offending byte as 1252)
> with an error handler:
>
> import codecs
>
> def cp1252_errors(exception):
>     input, idx = exception.object, exception.start
>     byte = input[idx:idx+1]
>     try:
>         return byte.decode('windows-1252'), idx+1
>     except UnicodeDecodeError:
>         # python's cp1252 doesn't accept 0x81, etc
>         return byte.decode('latin1'), idx+1

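For anyone wanting to try the handler quoted above, here is a minimal sketch of how it might be registered and used. The handler name 'cp1252_fallback' and the helper decode_mixed() are made up for this example, and the isinstance guard is an addition of mine, not part of the quoted code.

```python
import codecs

def cp1252_errors(exception):
    # Only decoding errors carry the raw bytes we need; re-raise anything else.
    if not isinstance(exception, UnicodeDecodeError):
        raise exception
    data, idx = exception.object, exception.start
    byte = data[idx:idx + 1]
    try:
        return byte.decode('windows-1252'), idx + 1
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
        # undefined, so fall back to Latin-1 for those bytes.
        return byte.decode('latin1'), idx + 1

# 'cp1252_fallback' is a hypothetical name chosen for this sketch.
codecs.register_error('cp1252_fallback', cp1252_errors)

def decode_mixed(data):
    # Decode as UTF-8; each byte that is not valid UTF-8 is handed to the
    # handler above, and decoding resumes at the following byte.
    return data.decode('utf-8', errors='cp1252_fallback')
```

With this in place, decode_mixed(b'caf\xc3\xa9') and decode_mixed(b'caf\xe9') should both come out as 'café', even if the two forms are mixed within one byte string.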
Yeah, and the decision as to where that boundary should be placed is
thus completely up to the application. I don't know of any situation
where you'd need the byte-by-byte version, but it's there if you want
it.

The reason I chose line-by-line in my MUD client is because of the
nature of MUDs. The server I primarily use is a naive eight-bit one -
whatever bytes it gets, it retransmits. (All the commands that it
responds to are ASCII, so we can assume that all encodings are
ASCII-compatible or the user will have major difficulties.) Some
clients (including mine) send UTF-8. If I send the command "trivia This
is a piece of text\n", the text will be encoded UTF-8, the server
receives those bytes, and then transmits (to everyone who's tuned to
the [trivia] channel) this text: "MyName [trivia] This is a piece of
text\r\n".

Simplistic clients (usually on Windows) will do the same thing, only
they'll use their default encoding - usually 1252 - rather than UTF-8.
But the entire command will be encoded the same way, which means the
server will send an entire line (or several lines, if it wraps) in the
same encoding. It's safe to assume that any given line will be in one
single encoding, but consecutive lines could be in different encodings.

For emails, it might be possible to use a larger section, but
line-by-line would be safe there too, most likely.

ChrisA
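PS: A rough sketch of the line-by-line "UTF-8, or failing that, 1252" decode described above. The function names decode_line() and decode_lines() are invented for this example, and using errors='replace' for the few bytes that Python's cp1252 codec leaves undefined is an assumption on my part; a real client may want different handling.

```python
def decode_line(raw):
    # Try UTF-8 first; a line from a UTF-8 client decodes cleanly.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Otherwise assume the whole line is CP-1252. errors='replace'
        # turns the handful of undefined bytes into U+FFFD rather than
        # raising again.
        return raw.decode('cp1252', errors='replace')

def decode_lines(data):
    # Each line is decoded independently, so consecutive lines may use
    # different encodings. splitlines() drops the \r\n terminators.
    return [decode_line(line) for line in data.splitlines()]
```

For instance, decode_lines(b'Alice [trivia] caf\xc3\xa9\r\nBob [trivia] caf\xe9\r\n') should yield two strings that both end in 'café', one decoded as UTF-8 and the other as CP-1252.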