"Adam Tauno Williams" <awill...@whitemice.org> wrote in message news:1273932760.3929.18.ca...@linux-yu4c.site...
On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:
On 05/15/10 10:27, Adam Tauno Williams wrote:
[snip]

Yep.  But in the interpreter both unicode() and repr() produce the same
output.  Nothing displays the accented character.

h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
data = h.read()
h.close()
str(data)

Here you are correctly reading an iso8859-2-encoded file and converting it to Unicode.

Try "print data" instead. "str(data)" converts a Unicode string to a byte string, but it only uses the default encoding, which is 'ascii', so it fails on any accented character. "print" uses the stdout encoding of your terminal, if known. Try these commands on your system (mine is Windows XP):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp437'

You should only attempt to "print" Unicode strings or byte strings encoded in the stdout encoding. Printing byte strings in any other encoding will often print garbage.
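
If you want to see exactly which bytes would reach the terminal, you can do the encoding step yourself. This is only a rough sketch: the helper name to_terminal_bytes is made up for illustration, and falling back to 'ascii' when the stdout encoding is unknown is my assumption, not something Python guarantees.

```python
import sys

def to_terminal_bytes(text, encoding=None):
    """Encode a Unicode string for a terminal (hypothetical helper)."""
    # sys.stdout.encoding can be None, e.g. when output is piped,
    # so fall back to ASCII in that case (an assumption of this sketch).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "replace" substitutes "?" for characters the encoding cannot
    # represent instead of raising UnicodeEncodeError.
    return text.encode(encoding, "replace")
```

With 'cp437', for example, an e-acute has a byte of its own, but a Polish l-with-stroke does not and comes out as "?".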

[snip]
I think I'm getting close.  Parsing the file seems to work, and while
writing it out does not error, rereading my own output fails. :(
Possibly I'm 'accidentally' writing the output as UTF-8 and not
ISO8859-2.  I need the internal data to be UTF-8 but read as ISO8859-2
and rewritten back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].

The "internal data" is Unicode, not UTF-8. A Unicode string is a sequence of characters with no byte encoding attached (Python uses UTF-16 or UTF-32 internally, but that is an implementation detail). UTF-8 is a byte encoding.
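
A small illustration of the difference, using a sample word I picked myself (it is not from your files): the same Unicode string becomes two different byte strings depending on the encoding.

```python
# A sample Unicode string (the Polish city name "Lodz" with its
# accented letters), written with escapes so the source stays ASCII.
text = u"\u0142\u00f3d\u017a"

latin2 = text.encode("iso8859-2")  # one byte per character
utf8 = text.encode("utf-8")        # two bytes per accented character

# Both byte strings decode back to the same Unicode string,
# provided each is decoded with the encoding that produced it.
same = latin2.decode("iso8859-2") == utf8.decode("utf-8") == text
```

Decoding either byte string with the wrong codec is what produces garbage or a UnicodeDecodeError.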

If you actually need the internal data as UTF-8 (maybe you are working with a library that works with UTF-8 strings), then:

import codecs

f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
s = f.read()  # s is a Unicode string.
f.close()
s = s.encode('utf-8')  # now s is a UTF-8 byte string

(process data as UTF-8 here).

s = s.decode('utf-8') # s is Unicode again.
f2 = codecs.open("out.txt", 'wb', encoding="iso8859-2")
f2.write(s)
f2.close()

Note you *decode* byte strings to Unicode and *encode* Unicode into byte strings.
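
To check whether the output really ends up as ISO8859-2 rather than UTF-8, you could wrap the steps above in one function and compare the bytes on disk. This is a sketch under a few assumptions: the function name roundtrip_latin2 and the process hook are hypothetical, and I have used io.open in place of codecs.open (it takes the same encoding argument); newline="" is there so line endings pass through unchanged.

```python
import io

def roundtrip_latin2(in_path, out_path, process=lambda b: b):
    """Read ISO8859-2, process as UTF-8 bytes, write ISO8859-2 again."""
    # Decode: bytes on disk -> Unicode string.
    with io.open(in_path, "r", encoding="iso8859-2", newline="") as f:
        text = f.read()

    # Encode: Unicode -> UTF-8 byte string for the processing step.
    utf8_bytes = process(text.encode("utf-8"))

    # Decode the processed UTF-8 bytes back to Unicode, then let the
    # file object encode to ISO8859-2 on the way out.
    text = utf8_bytes.decode("utf-8")
    with io.open(out_path, "w", encoding="iso8859-2", newline="") as f:
        f.write(text)
```

If process changes nothing, the output file should be byte-for-byte identical to the input. If it is not, something in between wrote the wrong encoding.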

-Mark


--
http://mail.python.org/mailman/listinfo/python-list