Re: Processing text data with different encodings

2016-06-28 Thread Chris Angelico
On Wed, Jun 29, 2016 at 1:52 AM, Random832 wrote:
> On Tue, Jun 28, 2016, at 06:25, Chris Angelico wrote:
>> For the OP's situation, frankly, I doubt there'll be anything other
>> than UTF-8, Latin-1, and CP-1252. The chances that someone casually
>> mixes CP-1252 with

Re: Processing text data with different encodings

2016-06-28 Thread Random832
On Tue, Jun 28, 2016, at 06:25, Chris Angelico wrote:
> For the OP's situation, frankly, I doubt there'll be anything other
> than UTF-8, Latin-1, and CP-1252. The chances that someone casually
> mixes CP-1252 with (say) CP-1254 would be vanishingly small. So the
> simple decode of "UTF-8, or
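The quoted text is cut off before the fallback is named, so the following is only a sketch of the general idea: try a strict UTF-8 decode first and fall back to a single-byte encoding when it fails. The fallback order (CP-1252, then Latin-1) and the helper name `decode_line` are assumptions made for illustration, not something stated in the thread.

```python
def decode_line(raw: bytes) -> str:
    """Decode one line, trying strict UTF-8 before single-byte fallbacks."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # Not valid in this encoding; try the next candidate.
            continue
    # Latin-1 maps every byte value to a code point, so this cannot fail.
    return raw.decode("latin-1")


print(decode_line("café".encode("utf-8")))
print(decode_line(b"\x93quoted\x94"))
```

Strict UTF-8 goes first because text in a single-byte encoding that contains non-ASCII characters rarely happens to be valid UTF-8, while the reverse mistake (decoding UTF-8 as CP-1252) usually succeeds silently and produces mojibake.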

Re: Processing text data with different encodings

2016-06-28 Thread Steven D'Aprano
On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:
> I changed the code from my initial mail to:
>
> LOGGER = logging.getLogger()
> LOGGER.addHandler(logging.FileHandler("tmp.txt", encoding="utf-8"))
>
> for l in sys.stdin.buffer:
>     l = l.decode('utf-8')
>     LOGGER.critical(l)

I imagine

Re: Processing text data with different encodings

2016-06-28 Thread Random832
On Tue, Jun 28, 2016, at 10:52, Steven D'Aprano wrote:
> "you will find THREE OR FOUR different encodings in one email.
> I think at the sending side they just glue different text
> fragments from different sources together without thinking
> about the encoding"
>
> But I'm not

Re: Processing text data with different encodings

2016-06-28 Thread Steven D'Aprano
On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:
> I look at the hex values of the bytes, get the win-1252 table and
> translate the bytes to chars. If the result makes sense, it's win-1252
> (and maybe others, if the tables overlap). So in that sense I know what
> I have. At least for this
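The manual check described in the quote (look at the hex values, translate them through a code page table) can be reproduced with the standard codecs. This is a minimal sketch; the sample bytes are made up for illustration and do not come from the original data.

```python
# Hypothetical sample: curly quotes and an e-acute as single bytes.
raw = b"\x93quoted\x94 and caf\xe9"

# Hex dump of the bytes, as one would read them off a hex viewer.
hex_dump = " ".join(f"{byte:02x}" for byte in raw)
print(hex_dump)

# Translate the same bytes through two code page tables.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# 0x93/0x94 are curly quotes in CP-1252 but C1 control characters in
# Latin-1, while 0xe9 is the same letter in both tables (the overlap).
print(repr(as_cp1252))
print(repr(as_latin1))
```

Because the two tables agree on 0xa0 through 0xff, only bytes in the 0x80 to 0x9f range can distinguish CP-1252 from Latin-1.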

Re: Processing text data with different encodings

2016-06-28 Thread Peter Otten
Michael Welle wrote:
> With your help, I fixed logging. Somehow I had in mind that the
> logging module would do the right thing if I don't specify the encoding.

The default encoding depends on the environment (and platform):

$ touch tmp.txt
$ python3 -c 'print(open("tmp.txt").encoding)'
UTF-8
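The same check can be made from within a script. The sketch below prints the locale-dependent default and then opens a file with an explicit encoding, so the result no longer depends on the environment. The temporary file and its contents are assumptions for illustration only.

```python
import locale
import os
import tempfile

# What text-mode open() falls back to when no encoding is given;
# the value varies with platform and locale settings.
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly makes the behaviour independent of the locale.
path = os.path.join(tempfile.mkdtemp(), "tmp.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("caf\u00e9\n")

with open(path, encoding="utf-8") as handle:
    explicit_encoding = handle.encoding
    text = handle.read()

print(explicit_encoding, repr(text))
```

`logging.FileHandler` accepts the same `encoding` argument, which is what the fixed logging setup in this thread relies on.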

Re: Processing text data with different encodings

2016-06-28 Thread Steven D'Aprano
On Tue, 28 Jun 2016 08:17 pm, Michael Welle wrote:
> After a bit more 'fiddling' I found out that all test cases work if I
> use .decode('utf-8') on the incoming bytes. In my first approach I tried
> to find out at what I was looking and then used a specific .decode, e.g.
>

Re: Processing text data with different encodings

2016-06-28 Thread Chris Angelico
On Tue, Jun 28, 2016 at 8:37 PM, Michael Welle wrote:
> Steven D'Aprano writes:
>
>> On Tue, 28 Jun 2016 06:35 pm, Michael Welle wrote:
>>
>>> my original data is email. The mail header says it's utf-8, but you will
>>> find three or four different

Re: Processing text data with different encodings

2016-06-28 Thread Steven D'Aprano
On Tue, 28 Jun 2016 06:35 pm, Michael Welle wrote:
> my original data is email. The mail header says it's utf-8, but you will
> find three or four different encodings in one email. I think at the
> sending side they just glue different text fragments from different
> sources together without

Re: Processing text data with different encodings

2016-06-28 Thread Chris Angelico
On Tue, Jun 28, 2016 at 6:30 PM, Peter Otten <__pete...@web.de> wrote:
> Does chardet ever return an encoding that fails to decode
> the line? Only in that case the "ignore" error handler would make sense.

Assuming the module the OP is using is functionally identical to the one I use from the
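To make the role of the error handler concrete, here is a small sketch using only the built-in `bytes.decode`. chardet itself is not used, and the sample bytes are invented. It shows what happens when the chosen encoding does not fit the data.

```python
# Hypothetical line: "café" encoded as Latin-1, then wrongly treated as UTF-8.
raw = b"caf\xe9 ok"

try:
    raw.decode("utf-8")  # the default handler is "strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "ignore" silently drops the undecodable byte; "replace" marks the spot
# with U+FFFD so that the data loss stays visible.
ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

An error handler only changes the result when decoding would otherwise fail, which is the point of the question quoted above.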

Re: Processing text data with different encodings

2016-06-28 Thread Peter Otten
Michael Welle wrote:
> Hello,
>
> I want to use Python 3 to process data, that unfortunately come with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data via sys.stdin and the idea is to read a

Re: Processing text data with different encodings

2016-06-28 Thread Chris Angelico
On Tue, Jun 28, 2016 at 5:25 PM, Michael Welle wrote:
> I want to use Python 3 to process data, that unfortunately come with
> different encodings. So far I have found ascii, iso-8859, utf-8,
> windows-1252 and maybe some more in the same file (don't ask...). I read
> the data