Re: Problem reading file with umlauts
Thanks a lot. I will try that on the weekend.

Claus

> Claus Hausberger wrote:
>> Thanks a lot. Now I am one step further but I get another strange error:
>>
>>     Traceback (most recent call last):
>>       File "./read.py", line 12, in <module>
>>         of.write(text)
>>     UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
>>
>> According to Google, \ufeff has something to do with byte order. I use a Linux system; maybe this helps to find the error.
>
> 'text' contains Unicode, but you're writing it to a file that's not opened for Unicode. Either open the output file for Unicode:
>
>     of = codecs.open("umlaut-out.txt", "w", encoding="latin1")
>
> or encode the text before writing:
>
>     text = text.encode("latin1")
>
> (I'm assuming you want the output file to be in Latin1.)

--
http://mail.python.org/mailman/listinfo/python-list
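The two options in the quoted reply (open the output file with an encoding, or encode the text before writing) can be sketched side by side. This is only an illustration under assumptions: it uses io.open, which needs Python 2.6 or later (on 2.5, codecs.open accepts the same encoding argument), and the function names and file names are made up for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import shutil
import tempfile


def write_text_mode(path, text):
    # Option 1: the file object encodes Unicode text to Latin-1 on write.
    # newline="" turns off newline translation, so "\n" stays "\n".
    with io.open(path, "w", encoding="latin1", newline="") as out:
        out.write(text)


def write_pre_encoded(path, text):
    # Option 2: encode to a byte string first, then write in binary mode.
    with io.open(path, "wb") as out:
        out.write(text.encode("latin1"))


def demo():
    # Write the same text both ways and return the raw bytes of each file.
    text = u"Viele R\xf6hre"  # \xf6 is o-umlaut
    tmpdir = tempfile.mkdtemp()
    try:
        results = []
        writers = (("a.txt", write_text_mode), ("b.txt", write_pre_encoded))
        for name, writer in writers:
            path = os.path.join(tmpdir, name)
            writer(path, text)
            with io.open(path, "rb") as f:
                results.append(f.read())
        return results
    finally:
        shutil.rmtree(tmpdir)
```

Both variants should produce identical bytes, with the umlaut stored as the single Latin-1 byte F6. Either way, the encoding step raises UnicodeEncodeError if the text contains a character that Latin-1 cannot represent.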
Problem reading file with umlauts
Hello

I have a text file which is encoded in Latin1 (ISO-8859-1). I can't change that as I do not create those files myself. I have to read those files and convert the umlauts like ö to stuff like &ouml; as the text files should become html files.

I have this code:

    #!/usr/bin/python
    # -*- coding: latin1 -*-

    import codecs

    f = codecs.open('abc.txt', encoding='latin1')

    for line in f:
        print line
        for c in line:
            if c == "ö":
                print "oe"
            else:
                print c

and I get this error message:

    $ ./read.py
    Abc
    ./read.py:11: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
      if c == "ö":
    A
    b
    c
    Traceback (most recent call last):
      File "./read.py", line 9, in <module>
        print line
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

I checked the web and tried several approaches but I also get some strange encoding errors. Has anyone ever done this before?

I am currently using Python 2.5 and may be able to use 2.6, but I cannot yet move to 3.1 as many libs we use don't yet work with Python 3.

Any help is more than welcome. This has been driving me crazy for two days now.

best wishes

Claus
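For the stated goal (Latin1 text in, HTML with entities like &ouml; out), one possible approach is sketched below. It is not taken from the thread. The ENTITIES table and both function names are hypothetical, the table covers only the German characters, and io.open assumes Python 2.6 or later. Anything outside the table falls back to numeric character references through the standard "xmlcharrefreplace" error handler.

```python
# -*- coding: utf-8 -*-
import io

# Hypothetical lookup table, limited to the German special characters.
ENTITIES = {
    u"\xe4": u"&auml;",   # a-umlaut
    u"\xf6": u"&ouml;",   # o-umlaut
    u"\xfc": u"&uuml;",   # u-umlaut
    u"\xc4": u"&Auml;",   # A-umlaut
    u"\xd6": u"&Ouml;",   # O-umlaut
    u"\xdc": u"&Uuml;",   # U-umlaut
    u"\xdf": u"&szlig;",  # sharp s
}


def to_html_entities(text):
    # Replace the characters from the table with named entities.
    for char, entity in ENTITIES.items():
        text = text.replace(char, entity)
    # Any remaining non-ASCII character becomes a numeric reference,
    # so the result is pure ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def convert_file(src, dst):
    # Decode the input as Latin-1, convert, and write plain ASCII.
    with io.open(src, encoding="latin1") as f:
        text = f.read()
    with io.open(dst, "w", encoding="ascii") as out:
        out.write(to_html_entities(text))
```

A call such as convert_file("abc.txt", "abc.html") would then do the whole conversion; the output name is only an example. The sketch does not escape &, < or >, which real HTML output would also need.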
Re: Problem reading file with umlauts
Thanks a lot. Now I am one step further but I get another strange error:

    Traceback (most recent call last):
      File "./read.py", line 12, in <module>
        of.write(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

According to Google, \ufeff has something to do with byte order. I use a Linux system; maybe this helps to find the error.

Claus

> Claus Hausberger wrote:
>> I have a text file which is encoded in Latin1 (ISO-8859-1). I can't change that as I do not create those files myself. I have to read those files and convert the umlauts like ö to stuff like &ouml; as the text files should become html files.
>
> umlaut-in.txt:
>
>     This file is contains data in the unicode
>     character set and is encoded with utf-8.
>     Viele Röhre. Macht spaß!  Tsüsch!
>
> umlaut-in.txt hexdump:
>
>     00: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E  This file is con
>     10: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68  tains data in th
>     20: 65 20 75 6E 69 63 6F 64 65 0D 0A 63 68 61 72 61  e unicode..chara
>     30: 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73 20  cter set and is 
>     40: 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74 66  encoded with utf
>     50: 2D 38 2E 0D 0A 56 69 65 6C 65 20 52 C3 B6 68 72  -8...Viele R..hr
>     60: 65 2E 20 4D 61 63 68 74 20 73 70 61 C3 9F 21 20  e. Macht spa..! 
>     70: 20 54 73 C3 BC 73 63 68 21 0D 0A 00 00 00 00 00   Ts..sch!.......
>
> umlaut.py:
>
>     # -*- coding: utf-8 -*-
>     import codecs
>     text = codecs.open("umlaut-in.txt", encoding="utf-8").read()
>     text = text.replace(u"ö", u"oe")
>     text = text.replace(u"ß", u"ss")
>     text = text.replace(u"ü", u"ue")
>     of = open("umlaut-out.txt", "w")
>     of.write(text)
>     of.close()
>
> umlaut-out.txt:
>
>     This file is contains data in the unicode
>     character set and is encoded with utf-8.
>     Viele Roehre. Macht spass!  Tsuesch!
>
> umlaut-out.txt hexdump:
>
>     00: 54 68 69 73 20 66 69 6C 65 20 69 73 20 63 6F 6E  This file is con
>     10: 74 61 69 6E 73 20 64 61 74 61 20 69 6E 20 74 68  tains data in th
>     20: 65 20 75 6E 69 63 6F 64 65 0D 0D 0A 63 68 61 72  e unicode...char
>     30: 61 63 74 65 72 20 73 65 74 20 61 6E 64 20 69 73  acter set and is
>     40: 20 65 6E 63 6F 64 65 64 20 77 69 74 68 20 75 74   encoded with ut
>     50: 66 2D 38 2E 0D 0D 0A 56 69 65 6C 65 20 52 6F 65  f-8....Viele Roe
>     60: 68 72 65 2E 20 4D 61 63 68 74 20 73 70 61 73 73  hre. Macht spass
>     70: 21 20 20 54 73 75 65 73 63 68 21 0D 0D 0A 00 00  !  Tsuesch!.....
>
> --
> The ability of the OSS process to collect and harness
> the collective IQ of thousands of individuals across
> the Internet is simply amazing. - Vinod Valloppillil
> http://www.catb.org/~esr/halloween/halloween4.html
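A note on the u'\ufeff' in the traceback: U+FEFF is the byte order mark. It shows up as the first character when a file that starts with the bytes EF BB BF is decoded as plain "utf-8". Latin-1 has no such character, so encoding the text to Latin-1 would still fail on it. The sketch below assumes that the input really does begin with a UTF-8 BOM, which the thread does not confirm. It relies on the standard "utf-8-sig" codec (available since Python 2.5) and on io.open (Python 2.6 or later). The helper names are made up.

```python
# -*- coding: utf-8 -*-
import codecs
import io
import os
import tempfile


def read_text_skipping_bom(path):
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
    with io.open(path, encoding="utf-8-sig") as f:
        return f.read()


def demo():
    # Create a small UTF-8 file that starts with a BOM, then read it twice.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(codecs.BOM_UTF8 + u"Ts\xfcsch".encode("utf-8"))
        with io.open(path, encoding="utf-8") as f:
            plain = f.read()  # keeps u"\ufeff" as the first character
        return plain, read_text_skipping_bom(path)
    finally:
        os.remove(path)
```

If the files are in fact Latin1, as the original question says, decoding them with encoding="latin1" would be the matching choice, and no BOM character should appear in the decoded text.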