Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread Terry Reedy
John Machin wrote: Here's the scoop: It's a bug in the newline handling (in io.py, class IncrementalNewlineDecoder, method decode). It reads text files in 128-byte chunks. Converting CR LF to \n requires special-case handling when '\r' is detected at the end of the decoded chunk n in case
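The chunk-boundary case described above can be illustrated with a small, hand-written translator. This is only a sketch of the general technique (holding back a trailing '\r' until the next chunk arrives), not the io.py code; the function name translate_newlines is made up for illustration.

```python
def translate_newlines(chunks):
    """Yield chunks with CR LF and lone CR converted to LF.

    A '\r' at the end of a chunk is held back, because the next
    chunk may start with the '\n' that completes a CR LF pair.
    """
    pending_cr = False
    for chunk in chunks:
        if pending_cr:
            # Re-attach the CR held back from the previous chunk.
            chunk = "\r" + chunk
            pending_cr = False
        if chunk.endswith("\r"):
            # Cannot decide yet: lone CR, or first half of CR LF?
            chunk = chunk[:-1]
            pending_cr = True
        yield chunk.replace("\r\n", "\n").replace("\r", "\n")
    if pending_cr:
        # Input ended on a lone CR.
        yield "\n"


# A CR LF pair split across two chunks still becomes one '\n'.
split_pair = "".join(translate_newlines(["line one\r", "\nline two\r\n"]))
```

Without the held-back '\r', the split pair would be translated twice and produce a spurious blank line.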

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread John Machin
On Dec 7, 8:15 pm, Terry Reedy wrote: John Machin wrote: Here's the scoop: It's a bug in the newline handling (in io.py, class IncrementalNewlineDecoder, method decode). It reads text files in 128-byte chunks. Converting CR LF to \n requires special-case handling when

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread Johannes Bauer
John Machin wrote: He did. Ugly stuff using readline() :-) Should still work, though. Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f)) kinda loops :-) But, seriously - I find that whole while True: and if line == construct ugly as hell, too. How can reading a file line

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread D'Arcy J.M. Cain
On Sun, 07 Dec 2008 16:05:53 +0100 Johannes Bauer wrote: But, seriously - I find that whole while True: and if line == construct ugly as hell, too. How can reading a file line by line be achieved in a more pythonic kind of way? for line in open(filename): do stuff with
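A slightly fuller version of the loop suggested above, as a sketch: it iterates over the file object directly and passes the encoding explicitly. The temporary file and its contents are made up so the example is self-contained.

```python
import os
import tempfile

# Create a small UTF-16 file with CR LF line endings to read back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-16", newline="\r\n") as f:
    f.write("first\nsecond\n")

# Iterating over the file object yields one line at a time;
# no "while True" loop or empty-string check is needed.
lines = []
with open(path, "r", encoding="utf-16") as f:
    for line in f:
        lines.append(line.rstrip("\n"))

os.remove(path)
```

The with statement closes the file even if the loop body raises.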

Re: Python 3.0 automatic decoding of UTF16

2008-12-07 Thread John Machin
On Dec 8, 2:05 am, Johannes Bauer wrote: John Machin wrote: He did. Ugly stuff using readline() :-) Should still work, though. Well, well, I'm a C kinda guy used to while (fgets(b, sizeof(b), f)) kinda loops :-) But, seriously - I find that whole while True: and if

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread Johannes Bauer
[EMAIL PROTECTED] wrote: 2 problems: endianness and trailing zero byte. This works for me: This is very strange - when using utf16, endianness should be detected automatically. When I simply truncate the trailing zero byte, I receive: Traceback (most recent call last): File ./modify.py,

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread Johannes Bauer
John Machin wrote: On Dec 6, 5:36 am, Johannes Bauer wrote: So UTF-16 has an explicit EOF marker within the text? I cannot find one in the original file, only some kind of starting sequence I suppose (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a, simple

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread MRAB
Johannes Bauer wrote: [EMAIL PROTECTED] wrote: 2 problems: endianness and trailing zero byte. This works for me: This is very strange - when using utf16, endianness should be detected automatically. When I simply truncate the trailing zero byte, I receive: Traceback (most recent call

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread Mark Tolonen
Johannes Bauer wrote: John Machin wrote: On Dec 6, 5:36 am, Johannes Bauer wrote: So UTF-16 has an explicit EOF marker within the text? I cannot find one in the original file, only some kind of starting sequence I suppose

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread John Machin
On Dec 7, 6:20 am, Mark Tolonen wrote: Johannes Bauer wrote: John Machin wrote: On Dec 6, 5:36 am, Johannes Bauer wrote: So UTF-16 has an explicit EOF marker within the text? I cannot find one in

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread David Bolen
Johannes Bauer writes: This is very strange - when using utf16, endianness should be detected automatically. When I simply truncate the trailing zero byte, I receive: Any chance that whatever you used to simply truncate the trailing zero byte also removed the BOM at the start
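The role of the BOM in automatic endianness detection can be checked on byte strings directly, without involving file I/O. The following is a small sketch; the sample text is invented.

```python
import codecs

text = "hi\r\n"

# With a BOM, the generic "utf-16" codec detects the byte order itself.
with_bom_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
with_bom_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# Without a BOM, name the byte order explicitly instead of
# relying on whatever default the decoder falls back to.
no_bom = text.encode("utf-16-be")
decoded_explicit = no_bom.decode("utf-16-be")

# An odd number of bytes after the BOM cannot be valid UTF-16.
try:
    with_bom_be[:-1].decode("utf-16")
    odd_length_rejected = False
except UnicodeDecodeError:
    odd_length_rejected = True
```

If a tool strips the BOM while editing the file, the first two variants no longer apply and the byte order has to be supplied by the caller.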

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread John Machin
On Dec 7, 9:01 am, David Bolen wrote: Johannes Bauer writes: This is very strange - when using utf16, endianness should be detected automatically. When I simply truncate the trailing zero byte, I receive: Any chance that whatever you used to simply

Re: Python 3.0 automatic decoding of UTF16

2008-12-06 Thread John Machin
On Dec 7, 9:34 am, John Machin wrote: On Dec 7, 9:01 am, David Bolen wrote: Johannes Bauer writes: This is very strange - when using utf16, endianness should be detected automatically. When I simply truncate the trailing zero byte, I

Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Johannes Bauer
Hello group, I'm having trouble reading a utf-16 encoded file with Python 3.0. This is my (complete) code:
    #!/usr/bin/python3.0
    class AddressBook():
        def __init__(self, filename):
            f = open(filename, "r", encoding="utf16")
            while True:

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread J Kenneth King
Johannes Bauer writes:
    Traceback (most recent call last):
      File "./modify.py", line 12, in <module>
        a = AddressBook("2008_11_05_Handy_Backup.txt")
      File "./modify.py", line 7, in __init__
        line = f.readline()
      File "/usr/local/lib/python3.0/io.py", line 1807, in readline

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Johannes Bauer
J Kenneth King wrote: It probably means what it says: that the input file contains characters it cannot read using the specified encoding. No, it doesn't. The file is just fine, just like the example. Are you generating the file from Python using a file object with the same encoding? If

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Richard Brodie
J Kenneth King wrote: It probably means what it says: that the input file contains characters it cannot read using the specified encoding. That was my first thought. However, it appears that there is an off-by-one error somewhere in the

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Terry Reedy
Johannes Bauer wrote: Hello group, I'm having trouble reading a utf-16 encoded file with Python 3.0. This is my (complete) code:
What OS? This is often critical when you have a problem interacting with the OS.
    #!/usr/bin/python3.0
    class AddressBook():
        def __init__(self,

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Johannes Bauer
Terry Reedy wrote: Johannes Bauer wrote: Hello group, I'm having trouble reading a utf-16 encoded file with Python 3.0. This is my (complete) code: What OS? This is often critical when you have a problem interacting with the OS. It's a 64-bit Linux, currently running: Linux joeserver

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Joe Strout
On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote: I suspect that '?' after \n (\u0a00) indicates not 'question-mark' but 'uninterpretable as a utf16 character'. The traceback below confirms that. It should be an end-of-file marker and should not be passed to Python. I strongly suspect

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread info
On Dec 5, 3:25 pm, Johannes Bauer wrote: Hello group, I'm having trouble reading a utf-16 encoded file with Python 3.0. This is my (complete) code:
    #!/usr/bin/python3.0
    class AddressBook():
        def __init__(self, filename):
            f = open(filename, "r",

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread MRAB
Joe Strout wrote: On Dec 5, 2008, at 11:36 AM, Johannes Bauer wrote: I suspect that '?' after \n (\u0a00) indicates not 'question-mark' but 'uninterpretable as a utf16 character'. The traceback below confirms that. It should be an end-of-file marker and should not be passed to Python. I

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread John Machin
On Dec 6, 5:36 am, Johannes Bauer wrote: So UTF-16 has an explicit EOF marker within the text? I cannot find one in the original file, only some kind of starting sequence I suppose (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a, simple \r\n line ending. Sorry,
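The byte values quoted above match what a big-endian UTF-16 file would be expected to contain: the "starting sequence" 0xfeff is the byte-order mark, and 0x00 0x0d 0x00 0x0a is CR LF in UTF-16-BE. A quick sketch to confirm the layout, using an invented one-line payload:

```python
import codecs

bom = codecs.BOM_UTF16_BE            # the two bytes FE FF
crlf = "\r\n".encode("utf-16-be")    # the four bytes 00 0D 00 0A

# A hypothetical one-line file: BOM, text, then the final CR LF.
data = bom + "abc\r\n".encode("utf-16-be")
head = data[:2]
tail = data[-4:]

# Every UTF-16 code unit is two bytes, so the total length is even.
length_is_even = len(data) % 2 == 0
```

Nothing follows the final line ending, so an extra single byte at the end of such a file is not part of the encoding.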

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread Steven D'Aprano
On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote: So UTF-16 has an explicit EOF marker within the text? No, it does not. I don't know what Terry's thinking of there, but text files do not have any EOF marker. They start at the beginning (sometimes including a byte-order mark), and go

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread John Machin
On Dec 6, 10:35 am, Steven D'Aprano wrote: On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote: So UTF-16 has an explicit EOF marker within the text? No, it does not. I don't know what Terry's thinking of there, but text files do not have any EOF

Re: Python 3.0 automatic decoding of UTF16

2008-12-05 Thread MRAB
John Machin wrote: On Dec 6, 10:35 am, Steven D'Aprano wrote: On Fri, 05 Dec 2008 12:00:59 -0700, Joe Strout wrote: So UTF-16 has an explicit EOF marker within the text? No, it does not. I don't know what Terry's thinking of there, but text files do not