William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
>
>> encode() really wants a unicode string not a byte string. If you call
>> encode() on a byte string, the string is first converted to unicode
>> using the default encoding (usually ascii), then converted with the
>> given encoding.
>
> Aha! That helps. Something else that helps is that my Python code is
> generating output that is received by several other tools. Interesting
> facts:
>
> Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.
Yikes! Are you sure it isn't a problem with your XML?

> I am indeed seeing filenames in cp1252, even though the Microsoft docs
> say that filenames are in UTF-8.
>
> Filenames in Arabic are in UTF-8.

Not on my computer (Win XP) in os.listdir(). With filenames of Tést.txt
and ق.txt (that's \u0642, an Arabic character), os.listdir() gives me

 >>> os.listdir('.')
['Administrator', 'All Users', 'Default User', 'LocalService', 'NetworkService', 'T\xe9st.txt', '?.txt']
 >>> os.listdir(u'.')
[u'Administrator', u'All Users', u'Default User', u'LocalService', u'NetworkService', u'T\xe9st.txt', u'\u0642.txt']

So with a byte string directory it fails, with a unicode directory it
gives unicode, not utf-8.

> What I have to do is to check the encoding of the filename as received
> by os.walk (and thus os.listdir) and convert them to Unicode, continue
> to process them, and then encode them as UTF-8 for output to XML.

How do you do that? AFAIK there is no completely reliable way to
determine the encoding of a byte string by looking at it; the most
common approach is to try to find one that successfully decodes the
string; more sophisticated variations look at the distribution of
character codes.

Anyway if you use the Unicode file names you shouldn't have to worry
about this.

Kent
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
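
For what it's worth, the "try to find one that successfully decodes" approach mentioned above might look roughly like the sketch below. It is written for Python 3, where byte strings are `bytes` (the session above is Python 2). The names guess_decode and to_utf8, the candidate list, and the latin-1 fallback are all assumptions made for illustration, not anything from the thread. It only guesses; it cannot prove which encoding was really used.

```python
# Python 3 sketch of "try encodings until one decodes".
# guess_decode, to_utf8 and the candidate list are hypothetical names
# chosen for this example.

def guess_decode(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to some character, so this never raises,
    # but the resulting text may be wrong.
    return raw.decode("latin-1"), "latin-1"


def to_utf8(raw):
    """Decode a byte-string name by guessing, then re-encode it as UTF-8."""
    text, _ = guess_decode(raw)
    return text.encode("utf-8")


if __name__ == "__main__":
    # The two file names from the listing above, as raw bytes:
    # cp1252 bytes for Tést.txt and UTF-8 bytes for \u0642.txt.
    for raw in (b"T\xe9st.txt", b"\xd9\x82.txt"):
        text, encoding = guess_decode(raw)
        print(repr(raw), "->", repr(text), "via", encoding)
```

The order of the candidates matters. UTF-8 is strict, so a cp1252 name with accented characters will usually fail to decode as UTF-8 and fall through to the next candidate. cp1252 accepts almost any byte sequence, so trying it first would turn the UTF-8 bytes for \u0642 into two unrelated characters without raising an error.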