On Dec 5, 3:25 pm, Johannes Bauer <[EMAIL PROTECTED]> wrote: > Hello group, > > I'm having trouble reading a utf-16 encoded file with Python3.0. This is > my (complete) code: > > #!/usr/bin/python3.0 > > class AddressBook(): > def __init__(self, filename): > f = open(filename, "r", encoding="utf16") > while True: > line = f.readline() > if line == "": break > print([line[x] for x in range(len(line))]) > f.close() > > a = AddressBook("2008_11_05_Handy_Backup.txt") > > This is the file (only 1 kB, if hosting doesn't work please tell me and > I'll see if I can put it someplace else): > > http://www.file-upload.net/download-1297291/2008_11_05_Handy_Backup.t... > > What I get: The file reads file the first few lines. Then, in the last > line, I get lots of garbage (looking like uninitialized memory): > > ['E', 'n', 't', 'r', 'y', '0', '0', 'T', 'e', 'x', 't', ' ', '=', ' ', > '"', 'A', 'D', 'A', 'C', ' ', 'V', 'e', 'r', 'k', 'e', 'h', 'r', 's', > 'i', 'n', 'f', 'o', '"', '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀 > ', '\u3000', '\u3100', '吀', '礀', '瀀', '攀', '\u2000', '㴀', '\u2000', > '一', '甀', '洀', '戀', '攀', '爀', '䴀', '漀', '戀', '椀', '氀', '攀', > '\u0d00', '\u0a00', '䔀', '渀', '琀', '爀', '礀', '\u3000', '\u3100', ' > 吀', '攀', '砀', '琀', '\u2000', '㴀', '\u2000', '∀', '⬀', '㐀', '㤀', > '\u3100', '㜀', '㤀', '㈀', '㈀', '㐀', '㤀', '㤀', '∀', '\u0d00', > '\u0a00', '\u0d00', '\u0a00', '嬀', '倀', '栀', '漀', '渀', '攀', '倀', > '䈀', '䬀', '\u3000', '\u3000', '㐀', '崀', '\u0d00', '\u0a00'] > > Where the line > > Entry00Text = "ADAC Verkehrsinfo"\r\n > > is actually the only thing the line contains, Python makes the rest up. > > The actual file is much longer and contains private numbers, so I > truncated them away. When I let python process the original file, it > dies with another error: > > Traceback (most recent call last): > File "./modify.py", line 12, in <module> > a = AddressBook("2008_11_05_Handy_Backup.txt") > File "./modify.py", line 7, in __init__ > line = f.readline() > File "/usr/local/lib/python3.0/io.py", line 1807, in readline > while self._read_chunk(): > File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk > self._set_decoded_chars(self._decoder.decode(input_chunk, eof)) > File "/usr/local/lib/python3.0/io.py", line 1293, in decode > output = self.decoder.decode(input, final=final) > File "/usr/local/lib/python3.0/codecs.py", line 300, in decode > (result, consumed) = self._buffer_decode(data, self.errors, final) > File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in > _buffer_decode > return self.decoder(input, self.errors, final) > UnicodeDecodeError: 'utf16' codec can't decode bytes in position 74-75: > illegal encoding > > With the place where it dies being exactly the place where it outputs > the weird garbage in the shortened file. I guess it runs over some page > boundary here or something? > > Kind regards, > Johannes > > -- > "Meine Gegenklage gegen dich lautet dann auf bewusste Verlogenheit, > verlästerung von Gott, Bibel und mir und bewusster Blasphemie." > -- Prophet und Visionär Hans Joss aka HJP in de.sci.physik > <[EMAIL PROTECTED]>
2 problems: endianness and trailing zer byte. This works for me: class AddressBook(): def __init__(self, filename): f = open(filename, "r", encoding="utf_16_be", newline="\r\n") while True: line = f.readline() if len(line) == 0: break print (line.replace("\r\n","")) f.close() a = AddressBook("2008_11_05_Handy_Backup2.txt") Please note the filename: I modified your file by dropping the trailing zer byte -- http://mail.python.org/mailman/listinfo/python-list