Michael Welle wrote:

> With your help, I fixed logging. Somehow I had in mind that the
> logging module would do the right thing if I don't specify the
> encoding.
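Only as far as the locale takes it: without an explicit encoding,
logging.FileHandler opens its log file like any other open() call,
with the locale's preferred encoding. Passing the encoding explicitly
makes it independent of the environment; a minimal sketch (file and
logger names here are made up):

import logging

handler = logging.FileHandler("app.log", encoding="utf-8")  # explicit, locale-independent
logging.getLogger("myapp").addHandler(handler)
logging.getLogger("myapp").warning("u umlaut: ü")  # always written as UTF-8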
The default encoding depends on the environment (and platform):

$ touch tmp.txt
$ python3 -c 'print(open("tmp.txt").encoding)'
UTF-8
$ LANG=C python3 -c 'print(open("tmp.txt").encoding)'
ANSI_X3.4-1968

> Well, setting the encoding explicitly to utf-8 changes the behaviour.
>
> If I use decode('windows-1252') on a bit of text I still have trouble
> understanding what's happening. For instance, there is a u umlaut in
> the 1252-encoded portion of the input text. That character is 0xfc in
> hex. After applying .decode('windows-1252') and logging it, the log
> contains a mangled character with hex codes 0xc3 0x20. If I do the
> same with .decode('utf-8'), the result is a working u umlaut with
> 0xfc in the log.
>
> On the other hand, if I try the following in the interactive
> interpreter:
>
> Here I have a few bytes that can be interpreted as a 1252-encoded
> string, and I command the interpreter to show me the string, right?
>
>>>> e=b'\xe4'
>>>> e.decode('1252')
> 'ä'
>
> Now, I can't do this, because 0xe4 isn't valid utf-8:
>
>>>> e.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0:
> unexpected end of data
>
> But why is it different in my actual script? I guess the assumption
> that what I am reading from sys.stdin.buffer is the same as what is
> in the file that I pipe into the script is wrong?

The situation is simple: a string consists of code points, but a file
can only contain bytes. When reading a string from a file the bytes
read need decoding, and before writing a string to a file it must be
encoded. Which byte sequence denotes a specific code point depends on
the encoding.

This is always the case. E.g. if you look at a UTF-8-encoded file with
an editor that expects cp1252, you will see mojibake:

>>> in_the_file = "ä".encode("utf-8")
>>> in_the_file
b'\xc3\xa4'
>>> what_the_editor_shows = in_the_file.decode("cp1252")
>>> print(what_the_editor_shows)
Ã¤

On the other hand, if you look at a cp1252-encoded file decoding the
data as UTF-8, you will likely get an error, because the byte

>>> "ä".encode("cp1252")
b'\xe4'

alone is not valid UTF-8. As part of a longer sequence the data may
still be ambiguous: if you were to write an a-umlaut followed by two
euro signs using cp1252

>>> in_the_file = 'ä€€'.encode("cp1252")

an editor expecting UTF-8 would show a single CJK character:

>>> in_the_file.decode("utf-8")
'䀀'
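By the same token, mojibake that came from decoding UTF-8 bytes as
cp1252 can often be undone mechanically, as long as no byte was lost
on the way: re-encode with the codec that was wrongly applied, then
decode with the right one. A quick check in the interpreter:

>>> "Ã¤".encode("cp1252").decode("utf-8")
'ä'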
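As for sys.stdin.buffer: it really does hand you the bytes exactly as
they sit in the file you pipe in; it is the text layer sys.stdin that
decodes behind your back, using the locale default shown above. A
minimal sketch of doing the decoding yourself (substitute whatever
encoding your input actually uses for "windows-1252"):

import sys

raw = sys.stdin.buffer.read()        # undecoded bytes, exactly as piped in
text = raw.decode("windows-1252")    # decode with the input's real encoding
print(ascii(text))                   # show the code points unambiguously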