Terry Reedy wrote:
> Johannes Bauer wrote:
>> Hello group,
>>
>> I'm having trouble reading a utf-16 encoded file with Python3.0. This is
>> my (complete) code:
>
> What OS? This is often critical when you have a problem interacting
> with the OS.

It's a 64-bit Linux, currently running:

Linux joeserver 2.6.20-skas3-v9-pre9 #4 SMP PREEMPT Wed Dec 3 18:34:49 CET
2008 x86_64 Intel(R) Core(TM)2 CPU 6400 @ 2.13GHz GenuineIntel GNU/Linux

Kernel 2.6.26.1, however, yields the same problem.

>> Entry00Text = "ADAC Verkehrsinfo"\r\n
>
> From \r\n I guess Windows. Correct?

Well, not really. The file was created with gammu, a Linux open-source tool
to extract a phonebook off cell phones. However, gammu seems to generate
those Windows CRLF line endings.

> I suspect that '?' after \n (\u0a00) indicates not 'question-mark'
> but 'uninterpretable as a utf16 character'. The traceback below
> confirms that. It should be an end-of-file marker and should not be
> passed to Python. I strongly suspect that whatever wrote the file
> screwed up the (OS-specific) end-of-file marker. I have seen this
> occasionally on Dos/Windows with ascii byte files, with the same symptom
> of reading random garbage past the end of the file. Or perhaps
> end-of-file does not work right with utf16.

So UTF-16 has an explicit EOF marker within the text? I cannot find one in
the original file, only some kind of starting sequence I suppose (0xfeff).
The last characters of the file are 0x00 0x0d 0x00 0x0a, a simple \r\n line
ending.

>> is actually the only thing the line contains, Python makes the rest up.
>
> No it does not. It echoes what the OS gives it with system calls, which
> is random garbage to the end of the disk block.

Could it not be, as Richard suggested, that there's an off-by-one?

> Try open with explicit 'rt' and 'rb' modes and see what happens. Text
> mode should be default, but then \r should be deleted.

rt:

[...]
['[', 'P', 'h', 'o', 'n', 'e', 'P', 'B', 'K', '0', '0', '3', ']', '\n']
['L', 'o', 'c', 'a', 't', 'i', 'o', 'n', ' ', '=', ' ', '0', '0', '3', '\n']
['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'y', 'p', 'e', ' ', '=', ' ', 'N', 'a', 'm', 'e', '\n']
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_11_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: illegal encoding

rb works, as it doesn't take an encoding parameter.

> Malformed EOF more likely.

Could you please elaborate?

Kind regards,
Johannes

-- 
"My countersuit against you will then be for deliberate mendacity,
slander of God, the Bible and me, and deliberate blasphemy."
 -- Prophet and visionary Hans Joss aka HJP in de.sci.physik <[EMAIL PROTECTED]>
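
For anyone who wants to check the raw bytes themselves, here is a minimal sketch of the 'rb' approach: read everything as bytes, look for the 0xfeff starting sequence (the UTF-16 byte order mark), check whether the byte count is odd (one way an off-by-one could show up), and only then decode in one go. The helper name `inspect_utf16` and the sample bytes are made up for illustration; they stand in for the real phonebook file, and this does not claim to pin down the actual cause of the traceback above.

```python
import codecs


def inspect_utf16(raw):
    """Return (text, has_bom, odd_length) for a byte string assumed to be UTF-16.

    text is None when the bytes cannot be decoded as UTF-16.
    """
    # A UTF-16 stream may start with a byte order mark (U+FEFF) in either order.
    has_bom = raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    # Every UTF-16 code unit is two bytes, so an odd length means a stray byte.
    odd_length = len(raw) % 2 == 1
    try:
        # The "utf-16" codec consumes the BOM and picks the byte order from it.
        text = raw.decode("utf-16")
    except UnicodeDecodeError:
        text = None
    return text, has_bom, odd_length


# Hypothetical sample: big-endian UTF-16 with BOM and a CRLF line ending,
# so the last four bytes are 0x00 0x0d 0x00 0x0a.
sample = codecs.BOM_UTF16_BE + 'Entry00Text = "ADAC Verkehrsinfo"\r\n'.encode("utf-16-be")

text, has_bom, odd_length = inspect_utf16(sample)
```

To try it on a real file, pass `open(path, "rb").read()` as `raw`. Decoding the whole byte string at once sidesteps the chunked `readline()` path that appears in the traceback, which may help narrow down whether the file or the reader is at fault.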