Python and encodings drives me crazy

2005-06-20 Thread Oliver Andrich
Hi everybody,

I have to write a little script that reads some nasty XML-formatted
files. Nasty XML-formatted means: an XML-like syntax, no DTD, HTML
entities used without declaration, and so on. A task as I like it.
My task looks like this...

1. read the data from the file.
2. get rid of the html entities
3. parse the stuff to extract the content of two tags.
4. create a new string from the extracted content
5. write it to a cp850 or even better macroman encoded file

Well, step 1 is easy and obvious. Step 2 is solved for me by

= code =

from htmlentitydefs import entitydefs
import string

html2text = []
for k, v in entitydefs.items():
    if v[0] != "&":
        html2text.append(["&" + k + ";", v])
    else:
        html2text.append(["&" + k + ";", ""])

def remove_html_entities(data):
    for html, char in html2text:
        data = apply(string.replace, [data, html, char])
    return data

= code =

Steps 3 + 4 also work fine so far. But step 5 drives me completely
crazy, because I get a lot of nice exceptions from the codecs module.

Hopefully someone can help me with that. 

If my code for processing the file looks like this:

def process_file(file_name):
    data = codecs.open(file_name, "r", "latin1").read()
    data = remove_html_entities(data)
    dom = parseString(data)
    print data

I get

Traceback (most recent call last):
  File "ag2blvd.py", line 46, in ?
    process_file(file_name)
  File "ag2blvd.py", line 33, in process_file
    data = remove_html_entities(data)
  File "ag2blvd.py", line 39, in remove_html_entities
    data = apply(string.replace, [data, html, char])
  File "/usr/lib/python2.4/string.py", line 519, in replace
    return s.replace(old, new, maxsplit)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position
0: ordinal not in range(128)

I am pretty sure that I have iso-latin-1 files, but after running
through my code everything looks pretty broken. If I remove the call
to remove_html_entities I get

Traceback (most recent call last):
  File "ag2blvd.py", line 46, in ?
    process_file(file_name)
  File "ag2blvd.py", line 35, in process_file
    print data
UnicodeEncodeError: 'ascii' codec can't encode character u'\x96' in
position 2482: ordinal not in range(128)

And this continues, when I try to write to a file in macroman encoding. 

As I am pretty sure that I am doing something completely wrong, and I
also haven't found a trace in the fantastic cookbook, I'd like to ask
for help here. :)

I am also pretty sure that I am doing something wrong, as writing a
unicode string with German umlauts to a MacRoman file opened via the
codecs module works fine.
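That observation checks out: German umlauts all exist in the MacRoman charset, so encoding a genuine unicode string succeeds; the crashes above involve byte strings and the default ASCII codec, not the target encoding. A quick check, in Python 3 syntax where every string is unicode:

```python
s = "Grüße aus München"
data = s.encode("mac-roman")           # succeeds: ü and ß both exist in MacRoman
assert data.decode("mac-roman") == s   # and the text round-trips cleanly

# A character genuinely outside MacRoman is what actually raises:
try:
    "\u0160".encode("mac-roman")       # LATIN CAPITAL LETTER S WITH CARON
except UnicodeEncodeError as e:
    print("not in MacRoman:", e.reason)
```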

Hopefully someone can help me. :)

Best regards,
Oliver

-- 
Oliver Andrich [EMAIL PROTECTED] --- http://fitheach.de/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python and encodings drives me crazy

2005-06-20 Thread Steven Bethard
Oliver Andrich wrote:
 def remove_html_entities(data):
   for html, char in html2text:
 data = apply(string.replace, [data, html, char])
   return data

I know this isn't your question, but why write:

  data = apply(string.replace, [data, html, char])

when you could write

 data = data.replace(html, char)

??

STeVe


Re: Python and encodings drives me crazy

2005-06-20 Thread Oliver Andrich
 I know this isn't your question, but why write:
 
   data = apply(string.replace, [data, html, char])
 
 when you could write
 
  data = data.replace(html, char)
 
 ??

Cause I guess, that I am already blind. Thanks.

Oliver



Re: Python and encodings drives me crazy

2005-06-20 Thread Oliver Andrich
Well, I narrowed my problem down to writing a MacRoman or cp850 file
using the codecs module. The rest was basically a misunderstanding
about the codecs module and the wrong assumption that my input data is
iso-latin-1 encoded. It is UTF-8 encoded. So, currently I am at the
point where I have my data ready for writing.

Does the following code write headline and caption in MacRoman
encoding to the disk? Or better yet, is this the way to do it?
headline and caption are both unicode strings.

f = codecs.open(outfilename, "w", "macroman")
f.write(headline)
f.write("\n\n")
f.write(caption)
f.close()

Best regards,
Oliver



Re: Python and encodings drives me crazy

2005-06-20 Thread Diez B. Roggisch
Oliver Andrich wrote:
 Well, I narrowed my problem down to writing a MacRoman or cp850 file
 using the codecs module. The rest was basically a misunderstanding
 about the codecs module and the wrong assumption that my input data is
 iso-latin-1 encoded. It is UTF-8 encoded. So, currently I am at the
 point where I have my data ready for writing.
 
 Does the following code write headline and caption in MacRoman
 encoding to the disk? Or better yet, is this the way to do it?
 headline and caption are both unicode strings.
 
 f = codecs.open(outfilename, "w", "macroman")
 f.write(headline)
 f.write("\n\n")
 f.write(caption)
 f.close()

looks ok - but you should use u"\n\n" in general. If that line for some
reason changes to "öäü" (German umlauts), you'll get the error you
already observed. On the other hand, when you write u"äöü", the parser
pukes at you if the declared source encoding of the file can't decode
those bytes to a unicode object.

Most problems occur when one confuses unicode objects with byte
strings - mixing them forces a coercion that is done with the default
(ASCII) encoding, which produces exactly the error you already observed.
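Diez's rule of thumb, restated: decode bytes to unicode once at the input boundary, work in unicode throughout, and encode once at the output boundary. Python 3 later made this discipline mandatory, so mixing the two types fails immediately instead of coercing through ASCII. A sketch in modern syntax:

```python
raw = "Škoda".encode("utf-8")   # bytes, as they would arrive from a file
text = raw.decode("utf-8")      # decode once, at the input boundary

# Mixing bytes and text no longer coerces silently -- it fails loudly,
# which is the modern spelling of the UnicodeDecodeError seen earlier
try:
    text.replace(raw, "")
except TypeError as e:
    print("cannot mix:", e)

# Encode once, at the output boundary
out = text.encode("mac-roman", "replace")
print(out)  # b'?koda' -- the unmappable S-caron became '?'
```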


Diez


Re: Python and encodings drives me crazy

2005-06-20 Thread Konstantin Veretennicov
On 6/20/05, Oliver Andrich [EMAIL PROTECTED] wrote:
 Does the following code write headline and caption in
 MacRoman encoding to the disk?
 
 f = codecs.open(outfilename, "w", "macroman")
 f.write(headline)

It does, as long as headline and caption *can* actually be encoded as
macroman. After you decode headline from utf-8 it will be unicode and
not all unicode characters can be mapped to macroman:

>>> u'\u0160'.encode('utf8')
'\xc5\xa0'
>>> u'\u0160'.encode('latin2')
'\xa9'
>>> u'\u0160'.encode('macroman')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in position
 0: character maps to <undefined>

- kv


Re: Python and encodings drives me crazy

2005-06-20 Thread Oliver Andrich
2005/6/21, Konstantin Veretennicov [EMAIL PROTECTED]:
 It does, as long as headline and caption *can* actually be encoded as
 macroman. After you decode headline from utf-8 it will be unicode and
 not all unicode characters can be mapped to macroman:
 
 >>> u'\u0160'.encode('utf8')
 '\xc5\xa0'
 >>> u'\u0160'.encode('latin2')
 '\xa9'
 >>> u'\u0160'.encode('macroman')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
     return codecs.charmap_encode(input,errors,encoding_map)
 UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in
 position 0: character maps to <undefined>

Yes, this and the coercion problems Diez mentioned were the problems I
faced. Now I have written a little cleanup method that removes the
bad characters from the input, and finally I guess I have MacRoman-
encoded files. But we will see, as soon as I try to open them on the
Mac. For now I am more or less satisfied, as only 3 obvious files
aren't converted correctly and the other 1000 files are.

Thanks for your hints, tips and so on. Good Night.

Oliver



Re: Python and encodings drives me crazy

2005-06-20 Thread John Machin
Oliver Andrich wrote:
 2005/6/21, Konstantin Veretennicov [EMAIL PROTECTED]:
 
[snip: Konstantin's MacRoman example, quoted in full above]
 
 
 Yes, this and the coersion problems Diez mentioned were the problems I
 faced. Now I have written a little cleanup method, that removes the
 bad characters from the input

By "bad characters", do you mean characters that are in Unicode but not
in MacRoman?

By "removes the bad characters", do you mean deletes, or do you mean
substitutes one or more MacRoman characters?

If all you want to do is torch the bad guys, you don't have to write a 
little cleanup method.

To leave a tombstone for the bad guys:

>>> u'abc\u0160def'.encode('macroman', 'replace')
'abc?def'

To leave no memorial, only a cognitive gap:

>>> u'The Good Soldier \u0160vejk'.encode('macroman', 'ignore')
'The Good Soldier vejk'
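A third error handler worth knowing, available since Python 2.3 though not mentioned in the thread: 'xmlcharrefreplace' preserves the information as a numeric character reference instead of destroying it, which suits output that will be parsed as HTML/XML again:

```python
# '\u0160' is code point 352, so it survives as the reference &#352;
s = u'The Good Soldier \u0160vejk'.encode('macroman', 'xmlcharrefreplace')
print(s)  # b'The Good Soldier &#352;vejk' (Python 3 shows a bytes literal)
```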

Do you *really* need to encode it as MacRoman? Can't the Mac app 
understand utf8?

You mentioned cp850 in an earlier post. What would you be feeding 
cp850-encoded data that doesn't understand cp1252, and isn't in a museum?

Cheers,
John