On Tue, 28 Jun 2016 08:17 pm, Michael Welle wrote:

> After a bit more 'fiddling' I found out that all test cases work if I
> use .decode('utf-8') on the incoming bytes. In my first approach I tried
> to find out what I was looking at and then used a specific .decode, e.g.
> .decode('windows-1252'). Those were the trouble makers.
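As I read it, your first approach was something roughly like this -- an
untested sketch, where the details are my guess at your code, not your
actual code:

    import sys
    import chardet

    data = sys.stdin.buffer.read()    # raw bytes from the pipe
    guess = chardet.detect(data)      # e.g. {'encoding': 'windows-1252',
                                      #       'confidence': 0.73, ...}
    text = data.decode(guess['encoding'])

versus the second approach of simply calling data.decode('utf-8').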
Remember that chardet's detection is based on statistics and heuristics
and cannot be considered 100% reliable. Normally I would expect that
chardet would guess two or three encodings. If the first fails, you might
want to check the others.

Also remember that chardet works best with large amounts of text, like an
entire webpage. If you pass it a single byte, or even a few bytes, the
results will likely be no better than whatever encoding the chardet
developer decided to use as the default: "If there's not enough data to
guess, just return Windows-1252, because that's pretty common..."

> With your help, I fixed logging. Somehow I had in mind that the
> logging module would do the right thing if I don't specify the encoding.
> Well, setting the encoding explicitly to utf-8 changes the behaviour.

I would expect that logging will do the right thing if you pass it text
strings and have set the encoding to UTF-8.

> If I use decode('windows-1252') on a bit of text

You cannot decode text. Text is ENCODED to bytes, and bytes are DECODED to
text.

> I still have trouble understanding what's happening.
> For instance, there is a u umlaut in the
> 1252 encoded portion of the input text.

You don't know that. If the input is *bytes*, then all you know is the
byte values. What they mean is anyone's guess unless the name of the
encoding is transmitted separately.

You can be reasonably sure that the bytes are mostly ASCII, because it's
email and nobody sends email in EBCDIC, so if you see a byte 0x41, you can
be sure it represents an 'A'. But outside of the ASCII range, you're on
shaky ground.

If the specified encoding is correct, then everything works well: the
email says it is UTF-8, and sure enough it is UTF-8. But if the specified
encoding is wrong, you're in trouble.

You only think the encoding is Windows-1252 because chardet has guessed
that. But it's not infallible and maybe it has got it wrong. Especially if
your input is made up of a lot of bytes from all sorts of different
encodings, that may be confusing chardet.

> That character is 0xfc in hex.

No. Byte 0xFC represents ü only if your guess about the encoding is
correct. If the encoding truly is Windows-1252, or Latin-1, then byte 0xFC
will mean ü. (And in some other encodings too.) If the source is Western
European, that might be a good guess. But if the encoding actually is
(let's say):

- ISO-8859-5 (Cyrillic), then the byte represents ќ
- ISO-8859-7 (Greek), then the byte represents ό
- MacRoman (Apple Macintosh), then the byte represents ¸

(That last one is not a comma, but a cedilla.)

> After applying .decode('windows-1252') and logging it, the log contains
> a mangled character with hex codes 0xc3 0x20.

I think you are misunderstanding what you are looking at. How are you
seeing that? 0x20 will be a space in most encodings.

(1) How are you opening the log file in Python? Do you specify an
encoding?

(2) How are you writing to the log file?

(3) What are you using to read the log file outside of Python? How do you
know the hex codes?

I don't know any way you can start with the character ü and write it to a
file and get bytes 0xc3 0x20. Maybe somebody else will think of something,
but to me, that seems impossible.

> If I do the same with
> .decode('utf-8'), the result is a working u umlaut with 0xfc in the log.

That suggests that you have opened the log file using Latin-1 or
Windows-1252 as the encoding. You shouldn't do that. Unless you have a
good reason to do otherwise (in other words, for experts only) you should
always use UTF-8 for writing.
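Something along these lines is what I mean -- a minimal sketch, where the
file name and logger name are just placeholders, not anything from your
script:

    import logging

    # Be explicit about the encoding of the log file.
    handler = logging.FileHandler('mail.log', encoding='utf-8')
    handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))

    log = logging.getLogger('mailfilter')
    log.addHandler(handler)
    log.setLevel(logging.DEBUG)

    # Pass text (str), not bytes.
    log.debug('text with a u umlaut: ü')

Then make sure whatever you use to inspect the log also treats the file as
UTF-8. A perfectly good UTF-8 ü is the two bytes 0xC3 0xBC, and a viewer
that assumes Latin-1 or Windows-1252 will happily show that as mojibake.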
> On the other hand, if I try the following in the interactive
> interpreter:
>
> Here I have a few bytes that can be interpreted as a 1252 encoded string
> and I command the interpreter to show me the string, right?
>
>>>> e=b'\xe4'

That's ONE byte, not a few.

>>>> e.decode('1252')
> 'ä'

Right -- that means that byte 0xE4 represents ä in Windows-1252, and also
in Latin-1 and some others. But:

py> e.decode('iso-8859-7')  # Greek
'δ'
py> e.decode('iso-8859-8')  # Hebrew
'ה'
py> e.decode('iso-8859-6')  # Arabic
'ل'
py> e.decode('MacRoman')
'‰'
py> e.decode('iso-8859-5')  # Cyrillic
'ф'

So if you find a byte 0xE4 in a file, and don't know where it came from,
you don't know what it means. If you can guess it came from Russia, then
it might be a ф. If you think it came from a Macintosh prior to OS X, then
it probably means a per-mille sign ‰.

> Now, I can't do this, because 0xe4 isn't valid utf-8:
>
>>>> e.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0:
> unexpected end of data

Correct.

> But why is it different in my actual script?

Without seeing your script, it's hard to say what you are actually doing.

> I guess the assumption that
> what I am reading from sys.stdin.buffer is the same as what is in the
> file, that I pipe into the script, is wrong?

I wouldn't rule that out, but more likely the issue lies elsewhere, in
your own code.


-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list