Re: Puzzled by code pages

Mark Tolonen Sat, 15 May 2010 12:34:29 -0700

"Adam Tauno Williams" <[email protected]> wrote in message news:[email protected]...

On Sat, 2010-05-15 at 20:30 +1000, Lie Ryan wrote:

On 05/15/10 10:27, Adam Tauno Williams wrote:

[snip]

Yep.  But in the interpreter both unicode() and repr() produce the same
output.  Nothing displays the accented character.

h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
data = h.read()
h.close()
str(data)

Here you are correctly reading an iso8859-2-encoded file and converting it to Unicode.

Try "print data". "str(data)" converts a Unicode string to a byte string, but it only uses the default encoding, which is 'ascii', so it cannot represent the accented characters. print will use the stdout encoding of your terminal, if known. Try these commands on your system (mine is Windows XP):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp437'

You should only attempt to "print" Unicode strings or byte strings encoded in the stdout encoding. Printing byte strings in any other encoding will often print garbage.
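To make the point about mismatched encodings concrete, here is a minimal sketch. It is written so that it also runs on Python 3, where the unicode/str split discussed in this thread becomes str/bytes; the sample text is my own choice and not taken from the original file.

```python
# Sample text (an assumption for illustration): "k" followed by U+0151,
# LATIN SMALL LETTER O WITH DOUBLE ACUTE, which exists in ISO 8859-2.
text = u"k\u0151"

# The 'ascii' codec cannot represent U+0151, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding with the correct codec gives one byte per character.
latin2_bytes = text.encode("iso8859-2")

# Interpreting those bytes with a different code page (here cp437, the
# console encoding mentioned above) yields a different character: garbage.
garbled = latin2_bytes.decode("cp437")

print(ascii_ok)
print(repr(latin2_bytes))
print(garbled == text)
```

The bytes themselves are fine; they only look wrong when they are interpreted in an encoding other than the one used to produce them.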


[snip]

I think I'm getting close.  Parsing the file seems to work, and while
writing it out does not error, rereading my own output fails. :(
Possibly I'm 'accidentally' writing the output as UTF-8 and not
ISO8859-2.  I need the internal data to be UTF-8 but read as ISO8859-2
and rewritten back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].

"internal data" is Unicode, not UTF-8. Unicode is the absence of an encoding (Python uses UTF-16 or UTF-32 internally, but that is an implementation detail). UTF-8 is a byte encoding.
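As a small illustration of that distinction (a sketch with a sample word of my own choosing, written to run on Python 3 as well), one Unicode string maps to different byte sequences depending on the encoding chosen, and each of them decodes back to the same text:

```python
# Sample text (an assumption for illustration): the Polish city name
# written with escapes so the source file itself stays pure ASCII.
text = u"\u0142\u00f3d\u017a"

latin2 = text.encode("iso8859-2")   # one byte per character
utf8 = text.encode("utf-8")         # two bytes for each accented letter
utf16 = text.encode("utf-16-le")    # two bytes per character, no BOM

# Character count of the Unicode string, then byte counts per encoding.
lengths = (len(text), len(latin2), len(utf8), len(utf16))
print(lengths)

# Decoding with the matching codec always recovers the same Unicode string.
same = (latin2.decode("iso8859-2") == utf8.decode("utf-8")
        == utf16.decode("utf-16-le") == text)
print(same)
```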

If you actually need the internal data as UTF-8 (maybe you are working with a library that works with UTF-8 strings), then:

import codecs

f = codecs.open("in.txt", 'rb', encoding="iso8859-2")
s = f.read()  # s is a Unicode string.
s = s.encode('utf-8') # now s is a UTF-8 byte string
f.close()


(process data as UTF-8 here).

s = s.decode('utf-8') # s is Unicode again.
f2 = codecs.open("out.txt", 'wb', encoding="iso8859-2")
f2.write(s)
f2.close()

Note you *decode* byte strings to Unicode and *encode* Unicode into byte strings.
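One way to check that the output really is ISO 8859-2 and not UTF-8 is to compare the raw bytes of the input and output files. The following is only a sketch: the helper name roundtrip_latin2, its process parameter, and the temporary file contents are hypothetical, and it uses io.open rather than codecs.open so that it behaves the same on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile


def roundtrip_latin2(src_path, dst_path, process=lambda b: b):
    """Read ISO 8859-2 text, pass UTF-8 bytes to `process`, write ISO 8859-2.

    `process` is a hypothetical hook standing in for a library that works
    on UTF-8 byte strings; by default it returns its input unchanged.
    """
    # newline="" disables newline translation so bytes round-trip exactly.
    with io.open(src_path, "r", encoding="iso8859-2", newline="") as f:
        text = f.read()                      # decode: bytes -> Unicode
    utf8_bytes = process(text.encode("utf-8"))   # encode: Unicode -> UTF-8
    text = utf8_bytes.decode("utf-8")            # decode: UTF-8 -> Unicode
    with io.open(dst_path, "w", encoding="iso8859-2", newline="") as f:
        f.write(text)                        # encode: Unicode -> ISO 8859-2


# Build a small ISO 8859-2 input file in a temporary directory.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")
dst = os.path.join(tmpdir, "out.txt")
original = u"\u0151r\n".encode("iso8859-2")
with open(src, "wb") as f:
    f.write(original)

roundtrip_latin2(src, dst)

# Compare raw bytes: identical output means nothing was written as UTF-8.
with open(dst, "rb") as f:
    result = f.read()
print(result == original)
```

If the output had been written as UTF-8 by accident, the byte comparison would fail, because each accented character would occupy two bytes instead of one.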


-Mark


--
http://mail.python.org/mailman/listinfo/python-list
