Michael Welle wrote:

> Hello,
>
> I want to use Python 3 to process data, that unfortunately come with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data via sys.stdin and the idea is to read a line, detect the
> current encoding, hit it until it looks like utf-8 and then go on with
> the next line of input:
>
>
> import cchardet
>
> for line in sys.stdin.buffer:
>
>     encoding = cchardet.detect(line)['encoding']
>     line = line.decode(encoding, 'ignore')\
>                .encode('UTF-8').decode('UTF-8', 'ignore')

Here the last decode('UTF-8', 'ignore') undoes the preceding
encode('UTF-8'); therefore

    line = line.decode(encoding, 'ignore')

should suffice. Does chardet ever return an encoding that fails to decode
the line? Only in that case would the "ignore" error handler make sense.
I expect that

    for line in sys.stdin.buffer:
        encoding = cchardet.detect(line)['encoding']
        line = line.decode(encoding)

will work if you don't want to use the alternative suggested by Chris.

> After that line should be a string. The logging module and some others
> choke on line: UnicodeEncodeError: 'charmap' codec can't encode
> character. What would be a right approach to tackle that problem
> (assuming that I can't change the input data)?

It looks like you are trying to write the unicode you have generated above
into a file using iso-8859-1 or similar:

$ cat log_unicode.py
import logging

LOGGER = logging.getLogger()
LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="ISO-8859-1"))
LOGGER.critical("\N{PILE OF POO}")

$ python3 log_unicode.py
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.4/logging/__init__.py", line 980, in emit
    stream.write(msg)
UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f4a9' in position 0: ordinal not in range(256)
Call stack:
  File "log_unicode.py", line 5, in <module>
    LOGGER.critical("\N{PILE OF POO}")
Message: '💩'
Arguments: ()

If my assumption is correct you can either change the target file's
encoding to UTF-8 or change the error handling strategy to ignore or
something else.
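
The first of those two options, a UTF-8 target file, could look roughly like the sketch below. The function name log_to_utf8_file and the logger name "utf8_demo" are made up for illustration; the only real API used is the encoding argument of logging.FileHandler.

```python
import logging


def log_to_utf8_file(path, message):
    """Log one message to a UTF-8 encoded file and return the file's text."""
    # "utf8_demo" is a hypothetical logger name used only in this sketch.
    logger = logging.getLogger("utf8_demo")
    logger.propagate = False  # keep the demo away from the root logger's handlers
    # UTF-8 can encode every str Python can hold (lone surrogates aside),
    # so the 'charmap' / 'latin-1' UnicodeEncodeError cannot occur here.
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    logger.addHandler(handler)
    try:
        logger.critical(message)
    finally:
        logger.removeHandler(handler)
        handler.close()
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Whether this is acceptable depends on whatever reads the log file afterwards; if that consumer insists on ISO-8859-1, only the error handling route remains.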

I didn't find an official way to change the error handler, so here's a
minimal example:

$ rm tmp.txt
$ cat log_unicode.py
import logging

class FileHandler(logging.FileHandler):
    def _open(self):
        return open(
            self.baseFilename, self.mode, encoding=self.encoding,
            errors="xmlcharrefreplace")

LOGGER = logging.getLogger()
LOGGER.addHandler(FileHandler("tmp.txt", encoding="ISO-8859-1"))
LOGGER.critical("\N{PILE OF POO}")

$ python3 log_unicode.py
$ cat tmp.txt
&#128169;

A real program would of course override the initializer...
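
Overriding the initializer might look like the following sketch. The class name LenientFileHandler, the attribute _encode_errors, the helper demo and the logger name "lenient_demo" are all invented for this example, not part of the logging module. Note that from Python 3.9 on, logging.FileHandler accepts an errors argument itself, so there the subclass should not be needed.

```python
import logging


class LenientFileHandler(logging.FileHandler):
    """FileHandler whose stream is opened with a configurable error handler."""

    def __init__(self, filename, mode="a", encoding=None, delay=False,
                 errors="xmlcharrefreplace"):
        # Store the policy first: unless delay is true, the base
        # initializer calls _open() immediately.
        self._encode_errors = errors
        super().__init__(filename, mode=mode, encoding=encoding, delay=delay)

    def _open(self):
        # Same as the minimal example, but the error handler is a parameter.
        return open(self.baseFilename, self.mode,
                    encoding=self.encoding, errors=self._encode_errors)


def demo(path):
    """Log a non-Latin-1 character to an ISO-8859-1 file; return its text."""
    logger = logging.getLogger("lenient_demo")  # hypothetical logger name
    logger.propagate = False
    handler = LenientFileHandler(path, mode="w", encoding="ISO-8859-1")
    logger.addHandler(handler)
    try:
        logger.critical("\N{PILE OF POO}")
    finally:
        logger.removeHandler(handler)
        handler.close()
    with open(path, encoding="ISO-8859-1") as f:
        return f.read()
```

Passing errors="ignore" or errors="replace" instead would drop the character or substitute a question mark, whichever suits the log's readers better.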