Re: Problem reading file with umlauts

2009-07-09 Thread Claus Hausberger
Thanks a lot. I will try that on the weekend.

Claus

 Claus Hausberger wrote:
  Thanks a lot. Now I am one step further but I get another strange error:
  
  Traceback (most recent call last):
    File "./read.py", line 12, in <module>
      of.write(text)
  UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in
  position 0: ordinal not in range(128)
  
  According to Google, \ufeff has something to do with byte order.
  
  I use a Linux system; maybe this helps to find the error.
  
 'text' contains Unicode, but you're writing it to a file that's not
 opened for Unicode. Either open the output file for Unicode:
 
  of = codecs.open("umlaut-out.txt", "w", encoding="latin1")
 
 or encode the text before writing:
 
  text = text.encode("latin1")
 
 (I'm assuming you want the output file to be in Latin1.)
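
The two options above can be sketched side by side. This is only an illustration, assuming Latin-1 output as stated: the sample string and the scratch file names are made up, and `io.open` (Python 2.6 and later) stands in for `codecs.open`, which takes the same `encoding` argument on Python 2.5.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"Viele R\u00f6hre. Macht spa\u00df!"  # made-up sample Unicode text

out_dir = tempfile.mkdtemp()  # scratch directory so the sketch is self-contained
path_a = os.path.join(out_dir, "umlaut-out-a.txt")  # hypothetical file names
path_b = os.path.join(out_dir, "umlaut-out-b.txt")

# Option 1: open the output file with an encoding and write Unicode directly.
with io.open(path_a, "w", encoding="latin1") as of:
    of.write(text)

# Option 2: encode to bytes yourself and write to a file opened in binary mode.
with open(path_b, "wb") as of:
    of.write(text.encode("latin1"))

# Read both files back as raw bytes to compare what was written.
with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
    bytes_a, bytes_b = fa.read(), fb.read()
```

Both files end up with the same bytes; the difference is only where the encoding step happens.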
 
  

-- 
http://mail.python.org/mailman/listinfo/python-list


Problem reading file with umlauts

2009-07-07 Thread Claus Hausberger
Hello

I have a text file which is encoded in Latin1 (ISO-8859-1). I can't change that 
as I do not create those files myself.

I have to read those files and convert the umlauts like ö to stuff like &ouml; 
as the text files should become HTML files.
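
A minimal sketch of that conversion, assuming named HTML entities are wanted: it decodes Latin-1 bytes to Unicode and looks each non-ASCII character up in the standard `codepoint2name` table (`htmlentitydefs` on Python 2, `html.entities` on Python 3). The helper name `to_html_entities` and the sample bytes are made up for illustration, and the `b"..."` literal needs Python 2.6 or later.

```python
# -*- coding: utf-8 -*-
try:
    from html.entities import codepoint2name  # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def to_html_entities(text):
    """Replace each non-ASCII character that has a named HTML entity."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code > 127 and code in codepoint2name:
            pieces.append(u"&%s;" % codepoint2name[code])
        else:
            pieces.append(ch)  # ASCII and unknown characters pass through
    return u"".join(pieces)


# Made-up Latin-1 bytes, decoded to Unicode before any processing.
sample = b"Viele R\xf6hre. Macht spa\xdf!".decode("latin1")
converted = to_html_entities(sample)
print(converted)
```

The result contains only ASCII characters for this sample, so it can be written out without further encoding trouble. Characters such as `<` and `&` are deliberately left alone here; escaping them would be a separate step.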

I have this code:


#!/usr/bin/python
# -*- coding: latin1 -*-

import codecs

f = codecs.open('abc.txt', encoding='latin1')

for line in f:
    print line
    for c in line:
        if c == "ö":
            print "oe"
        else:
            print c


and I get this error message:

$ ./read.py
Abc

./read.py:11: UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
  if c == "ö":
A
b
c



Traceback (most recent call last):
  File "./read.py", line 9, in <module>
    print line
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
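
On Python 2, messages like these two typically come from comparing a Unicode character with a byte-string literal (the warning) and from printing Unicode to a stdout whose encoding is ASCII (the traceback). A sketch that avoids both, under some assumptions: the input bytes are made up to stand in for abc.txt, the comparison uses a Unicode literal, and UTF-8 is chosen arbitrarily as the output encoding.

```python
# -*- coding: utf-8 -*-
import io

raw = b"Abc\n\xf6\xe4\xfc\xdf\n"  # made-up Latin-1 bytes standing in for abc.txt

out = []
for line in io.BytesIO(raw):
    line = line.decode("latin1")   # bytes -> Unicode, as codecs.open would do
    for c in line:
        if c == u"\u00f6":         # compare Unicode with Unicode, not with bytes
            out.append(u"oe")
        else:
            out.append(c)

result = u"".join(out)
encoded = result.encode("utf-8")   # encode explicitly before writing to a byte stream
```

Only the ö is replaced here, mirroring the loop above; the other non-ASCII characters stay in `result` and are handled by the explicit encode.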




I checked the web and tried several approaches but I also get some strange 
encoding errors.
Has anyone ever done this before? 
I am currently using Python 2.5 and may be able to use 2.6 but I cannot yet 
move to 3.1 as many libs we use don't yet work with Python 3.

Any help is more than welcome. This has been driving me crazy for two days now.

best wishes

Claus


Re: Problem reading file with umlauts

2009-07-07 Thread Claus Hausberger
Thanks a lot. Now I am one step further but I get another strange error:

Traceback (most recent call last):
  File "./read.py", line 12, in <module>
    of.write(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

According to Google, \ufeff has something to do with byte order.

I use a Linux system; maybe this helps to find the error.
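
u'\ufeff' is the byte order mark (BOM). A minimal sketch, assuming the input file starts with a UTF-8 encoded BOM (bytes EF BB BF), which nothing in this thread confirms: decoding with the `utf-8-sig` codec drops the mark, while plain `utf-8` keeps it as the first character. The sample bytes are made up.

```python
# -*- coding: utf-8 -*-
import codecs

raw = codecs.BOM_UTF8 + b"Viele R\xc3\xb6hre."  # made-up UTF-8 data with a leading BOM

with_bom = raw.decode("utf-8")         # BOM survives as the first character
without_bom = raw.decode("utf-8-sig")  # BOM is stripped during decoding

# Alternative when the text is already decoded: drop a leading BOM by hand.
stripped = with_bom.lstrip(u"\ufeff")
```

Removing the mark matters for any narrow output encoding: neither ASCII nor Latin-1 has a code for U+FEFF, so encoding a string that still starts with it raises UnicodeEncodeError.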

Claus

 Claus Hausberger wrote:
 
  I have a text file which is encoded in Latin1 (ISO-8859-1). I can't
  change that as I do not create those files myself. I have to read
  those files and convert the umlauts like ö to stuff like &ouml; as
  the text files should become HTML files.
 
 umlaut-in.txt:
 
 This file is contains data in the unicode
 character set and is encoded with utf-8.
 Viele Röhre. Macht spaß!  Tsüsch!
 
 
 umlaut-in.txt hexdump:
 
 00: 54 68 69 73 20 66 69 6C  65 20 69 73 20 63 6F 6E This file is con
 10: 74 61 69 6E 73 20 64 61  74 61 20 69 6E 20 74 68 tains data in th
 20: 65 20 75 6E 69 63 6F 64  65 0D 0A 63 68 61 72 61 e unicode..chara
 30: 63 74 65 72 20 73 65 74  20 61 6E 64 20 69 73 20 cter set and is
 40: 65 6E 63 6F 64 65 64 20  77 69 74 68 20 75 74 66 encoded with utf
 50: 2D 38 2E 0D 0A 56 69 65  6C 65 20 52 C3 B6 68 72 -8...Viele R..hr
 60: 65 2E 20 4D 61 63 68 74  20 73 70 61 C3 9F 21 20 e. Macht spa..!
 70: 20 54 73 C3 BC 73 63 68  21 0D 0A 00 00 00 00 00  Ts..sch!...
 
 
 umlaut.py:
 
 # -*- coding: utf-8 -*-
 import codecs
 text=codecs.open("umlaut-in.txt",encoding="utf-8").read()
 text=text.replace(u"ö",u"oe")
 text=text.replace(u"ß",u"ss")
 text=text.replace(u"ü",u"ue")
 of=open("umlaut-out.txt","w")
 of.write(text)
 of.close()
 
 
 umlaut-out.txt:
 
 This file is contains data in the unicode
 character set and is encoded with utf-8.
 Viele Roehre. Macht spass!  Tsuesch!
 
 
 umlaut-out.txt hexdump:
 
 00: 54 68 69 73 20 66 69 6C  65 20 69 73 20 63 6F 6E This file is con
 10: 74 61 69 6E 73 20 64 61  74 61 20 69 6E 20 74 68 tains data in th
 20: 65 20 75 6E 69 63 6F 64  65 0D 0D 0A 63 68 61 72 e unicode...char
 30: 61 63 74 65 72 20 73 65  74 20 61 6E 64 20 69 73 acter set and is
 40: 20 65 6E 63 6F 64 65 64  20 77 69 74 68 20 75 74  encoded with ut
 50: 66 2D 38 2E 0D 0D 0A 56  69 65 6C 65 20 52 6F 65 f-8....Viele Roe
 60: 68 72 65 2E 20 4D 61 63  68 74 20 73 70 61 73 73 hre. Macht spass
 70: 21 20 20 54 73 75 65 73  63 68 21 0D 0D 0A 00 00 !  Tsuesch!.
 
 
 
 
 
 -- 
 The ability of the OSS process to collect and harness
 the collective IQ of thousands of individuals across
 the Internet is simply amazing. - Vinod Valloppillil
 http://www.catb.org/~esr/halloween/halloween4.html
