William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
>
>> encode() really wants a unicode string not a byte string. If you call
>> encode() on a byte string, the string is first converted to unicode
>> using the default encoding (usually ascii), then converted with the
>> given encoding.
>
> Aha! That helps. Something else that helps is that my Python code is
> generating output that is received by several other tools. Interesting
> facts:
>
> Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.

Yikes! Are you sure it isn't a problem with your XML?

> I am indeed seeing filenames in cp1252, even though the Microsoft docs
> say that filenames are in UTF-8.
>
> Filenames in Arabic are in UTF-8.

Not on my computer (Win XP) in os.listdir(). With files named Tést.txt
and ق.txt (that's \u0642, an Arabic character), os.listdir() gives me:

>>> os.listdir('.')
['Administrator', 'All Users', 'Default User', 'LocalService',
'NetworkService', 'T\xe9st.txt', '?.txt']
>>> os.listdir(u'.')
[u'Administrator', u'All Users', u'Default User', u'LocalService',
u'NetworkService', u'T\xe9st.txt', u'\u0642.txt']

So with a byte-string directory the Arabic name is lost (it comes back
as '?.txt'), and with a unicode directory you get unicode strings, not
UTF-8 bytes.
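
Here is a minimal sketch of the same comparison, assuming Python 3, where
str is the Unicode type and bytes is the byte string (the session above
is Python 2, where those are unicode and str). The helper name
list_names is made up for illustration. The type of the path you pass
to os.listdir() decides the type of the names you get back:

```python
import os


def list_names(directory):
    """Return (text_names, byte_names) for the entries in `directory`.

    Hypothetical helper for illustration, assuming Python 3.
    """
    # A text (Unicode) path makes os.listdir() return text names.
    text_names = sorted(os.listdir(os.fsdecode(directory)))
    # A bytes path makes os.listdir() return the names as bytes instead.
    byte_names = sorted(os.listdir(os.fsencode(directory)))
    return text_names, byte_names
```

Calling list_names('.') returns both views of the same directory, so you
can see which names only survive in the text version.
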
> What I have to do is to check the encoding of the filename as received
> by os.walk (and thus os.listdir) and convert them to Unicode, continue
> to process them, and then encode them as UTF-8 for output to XML.

How do you do that? AFAIK there is no completely reliable way to
determine the encoding of a byte string by looking at it. The most
common approach is to try candidate encodings until one decodes the
string successfully; more sophisticated variations look at the
distribution of character codes.
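
The "try encodings until one works" approach could look roughly like
this. It is only a sketch, assuming Python 3 bytes objects; the function
name decode_with_fallback and the default candidate list are my own
choices, not anything from the original code. The order matters: UTF-8
is strict and rejects most non-UTF-8 input, while cp1252 accepts almost
any byte sequence, so cp1252 has to come last.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes `raw`.

    Hypothetical helper; a successful decode is a guess, not proof that
    the encoding is the right one.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of %r could decode %r" % (encodings, raw))
```

With the two file names from above, b'T\xe9st.txt' is not valid UTF-8
and falls through to cp1252, while the UTF-8 bytes of the Arabic name
decode on the first try. Note that a cp1252 result only means the bytes
were not UTF-8; it could just as well be some other 8-bit encoding.
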

Anyway, if you use the Unicode file names you shouldn't have to worry
about this.
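
As a rough sketch of that workflow (assuming Python 3 and the standard
xml.etree.ElementTree module; the function name filenames_to_xml and the
<files>/<file name="..."> layout are invented for the example, not your
actual format): walk the tree with a text path, keep the names as
Unicode throughout, and encode to UTF-8 only once, when the XML is
written out.

```python
import os
import xml.etree.ElementTree as ET


def filenames_to_xml(top):
    """Walk `top` with a text path and return a UTF-8 encoded XML listing.

    Hypothetical helper; the element and attribute names are made up.
    """
    root = ET.Element("files")
    for dirpath, dirnames, filenames in os.walk(os.fsdecode(top)):
        for name in sorted(filenames):
            # `name` is already Unicode text, so no decoding or guessing
            # of encodings is needed here.
            ET.SubElement(root, "file", name=name)
    # The only encoding step happens at output time.
    return ET.tostring(root, encoding="utf-8")
```

The return value is a bytes object that can be written to a file opened
in binary mode.
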
Kent