Hi Steven,

I read through the article you referenced.  I understand Unicode better now.
I wasn't completely ignorant of the subject.  My confusion is more about how
Python handles Unicode than about Unicode itself.  I guess I'm fighting my own
misconceptions. I do that a lot.  It's hard for me to understand how things
work when they don't function the way I *think* they should.

Here's the main source of my confusion.  In my original sample, I had read a
line in from the file and used the unicode function to create a
unicodestring object:

        unicodestring = unicode(line, 'latin1')

What I thought this step would do is translate the line to an internal
Unicode representation.  The problem character \xe1 would have been
translated into a correct Unicode representation for the accented "a"
character. 
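
To make this concrete, here's roughly what I had in mind, using a made-up
byte string in place of the line read from the file (Python 2):

        # A latin-1 byte string containing the accented "a" as byte \xe1.
        line = 'Se\xe1n\n'

        # Decoding maps each byte to a Unicode code point; for latin-1 the
        # byte values and code points line up, so \xe1 becomes U+00E1.
        unicodestring = unicode(line, 'latin1')

        print repr(unicodestring)    # u'Se\xe1n\n'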

Next I tried to write the unicodestring object to a file like this:

        output.write(unicodestring)

I would have expected the write function to ask the unicodestring object
for its byte string and simply write that byte string to the file.  I
thought that at this point I should have had a valid latin-1 encoded file.
Instead I got an error saying that the character \xe1 is invalid.
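
Boiled down, this is essentially what I was doing, with a made-up output
filename standing in for my real one (Python 2):

        unicodestring = unicode('Se\xe1n\n', 'latin1')
        output = open('out.txt', 'w')

        # This is the line that blows up: instead of just writing bytes,
        # Python raises a UnicodeEncodeError complaining about \xe1.
        output.write(unicodestring)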

The fact that the \xe1 character is still in the unicodestring object tells
me it wasn't translated into whatever Python uses for its internal Unicode
representation.  Either that, or the unicodestring object returns the
original string when it's asked for a byte stream representation.

Instead of just writing the unicodestring object, I had to do this:

        output.write(unicodestring.encode('utf-8'))

This is doing what I thought the other steps were doing.  It's translating
the unicodestring object's internal representation to UTF-8 and writing it
out.  It still seems strange, and I'm still not completely clear on what is
going on at the byte stream level for each of these steps.
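
Putting the pieces together, this is the version that actually works for
me, with made-up filenames (Python 2):

        infile = open('input.txt', 'r')
        output = open('output.txt', 'w')

        for line in infile:
            # bytes read from the file -> Unicode code points
            unicodestring = unicode(line, 'latin1')
            # Unicode code points -> UTF-8 bytes; the single byte \xe1
            # comes out as the two bytes \xc3\xa1
            output.write(unicodestring.encode('utf-8'))

        infile.close()
        output.close()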


