On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:

> I changed the code from my initial mail to:
>
> LOGGER = logging.getLogger()
> LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="utf-8"))
>
> for l in sys.stdin.buffer:
>     l = l.decode('utf-8')
>     LOGGER.critical(l)

I imagine you're running this over input known to contain UTF-8 text?
Because if you run it over your emails with non-UTF8 content, you'll get an
exception. I would try this:

for l in sys.stdin.buffer:
    l = l.decode('utf-8', errors='surrogateescape')
    print(repr(l))  # or log it, whichever you prefer

If I try simulating that, you'll see the output:

py> buffer = []
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer.append('abüd\n'.encode('latin-1'))
py> buffer.append('abüd\n'.encode('utf-8'))
py> buffer
[b'ab\xc3\xbcd\n', b'ab\xc3\xbcd\n', b'ab\xfcd\n', b'ab\xc3\xbcd\n']
py> for l in buffer:  #sys.stdin.buffer:
...     l = l.decode('utf-8', errors='surrogateescape')
...     print(repr(l))
...
'abüd\n'
'abüd\n'
'ab\udcfcd\n'
'abüd\n'

See the second last line? The \udcfc code point is a surrogate, encoding the
"bad byte" \xfc. See the docs for further details.

Alternatively, you could try:

for l in sys.stdin.buffer:
    try:
        l = l.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        l = l.decode('latin1')  # May generate mojibake.
    print(repr(l))  # or log it, whichever you prefer

This version should give satisfactory results if the email actually does
contain lines of Latin-1 (or Windows-1252 if you prefer) mixed in with the
UTF-8. If not, it will generate mojibake, which may be acceptable to your
users.

--
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
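
For anyone who wants to try both approaches side by side, here is a minimal, self-contained sketch. The helper names (decode_lenient, decode_with_fallback, restore_bytes) are made up for illustration and are not from the thread. It also shows one property that may matter when the result is later written to a file opened with encoding="utf-8": a string holding an escaped surrogate cannot be encoded with the default strict handler, but encoding it with errors='surrogateescape' gives back the original bytes.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, keeping each undecodable byte as a lone surrogate."""
    return raw.decode('utf-8', errors='surrogateescape')


def decode_with_fallback(raw: bytes) -> str:
    """Try strict UTF-8 first; on failure, assume the line is Latin-1."""
    try:
        return raw.decode('utf-8', errors='strict')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises,
        # but it yields mojibake if the line was in some other encoding.
        return raw.decode('latin-1')


def restore_bytes(text: str) -> bytes:
    """Reverse decode_lenient: escaped surrogates become the original bytes."""
    return text.encode('utf-8', errors='surrogateescape')


# Stand-in for sys.stdin.buffer: one UTF-8 line and one Latin-1 line.
lines = ['abüd\n'.encode('utf-8'), 'abüd\n'.encode('latin-1')]

lenient = [decode_lenient(raw) for raw in lines]
fallback = [decode_with_fallback(raw) for raw in lines]
restored = [restore_bytes(text) for text in lenient]

# Encoding the escaped line strictly fails, since lone surrogates are
# not valid in UTF-8.
try:
    lenient[1].encode('utf-8')
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

for text in lenient + fallback:
    print(repr(text))
```

Which helper fits depends on what happens to the text afterwards: the surrogateescape form is lossless but needs the same error handler on output, while the Latin-1 fallback always produces writable text at the risk of guessing the wrong encoding.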