Re: [Tutor] UTF-8 filenames encountered in os.walk
On Tue, Jul 03, 2007 at 06:04:16PM -0700, Terry Carroll wrote:
>> Has anyone found a silver bullet for ensuring that all the filenames encountered by os.walk are treated as UTF-8? Thanks.
>
> What happens if you specify the starting directory as a Unicode string, rather than an ascii string, e.g., if you're walking the current directory:
>
>     for thing in os.walk(u'.'):
>
> instead of:
>
>     for thing in os.walk('.'):

This is a good thought, and the crux of the problem. I pull the starting directories from an XML file which is UTF-8, but by the time it hits my program, because there are no extended characters in the starting path, os.walk assumes ascii. So, I recast the string as UTF-8, and I get UTF-8 output. The problem happens further down the line.

I get a list of paths from the results of os.walk, all in UTF-8, but not identified as such. If I just pass my list to other parts of the program it seems to assume either ascii or UTF-8, based on the individual list elements. If I try to cast the whole list as UTF-8, I get an exception because it is assuming ascii and receiving UTF-8 for some list elements.

I suspect that my program will have to make sure to recast all equivalent-to-ascii strings as UTF-8 while leaving the ones that are already extended alone.
--
yours,
William
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] UTF-8 filenames encountered in os.walk
William O'Higgins Witteman wrote:
>>     for thing in os.walk(u'.'):
>>
>> instead of:
>>
>>     for thing in os.walk('.'):
>
> This is a good thought, and the crux of the problem. I pull the starting directories from an XML file which is UTF-8, but by the time it hits my program, because there are no extended characters in the starting path, os.walk assumes ascii. So, I recast the string as UTF-8, and I get UTF-8 output. The problem happens further down the line.
>
> I get a list of paths from the results of os.walk, all in UTF-8, but not identified as such. If I just pass my list to other parts of the program it seems to assume either ascii or UTF-8, based on the individual list elements. If I try to cast the whole list as UTF-8, I get an exception because it is assuming ascii and receiving UTF-8 for some list elements.

FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8 strings; they are not the same thing. A Unicode string uses 16 bits to represent each character. It is a distinct data type from a 'regular' string. Regular Python strings are byte strings with an implicit encoding. One possible encoding is UTF-8, which uses one or more bytes to represent each character.

Some good reading on Unicode and utf-8:
http://www.joelonsoftware.com/articles/Unicode.html
http://effbot.org/zone/unicode-objects.htm

If you pass a unicode string (not utf-8) to os.walk(), the resulting lists will also be unicode.

Again, it would be helpful to see the code that is getting the error.

> I suspect that my program will have to make sure to recast all equivalent-to-ascii strings as UTF-8 while leaving the ones that are already extended alone.

It is nonsense to talk about 'recasting' an ascii string as UTF-8; an ascii string is *already* UTF-8 because the representation of the characters is identical. OTOH it makes sense to talk about converting an ascii string to a unicode string.

Kent
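A minimal sketch of the distinction drawn above between Unicode strings and encoded byte strings. It is written for Python 3 as an assumption: there `str` is the Unicode type and `bytes` is the byte string, whereas the thread's Python 2 calls them `unicode` and `str`. The sample text `Tést` is only an illustration, and the sketch deliberately looks at encoded byte lengths rather than at how the interpreter stores text internally, which varies by build and version.

```python
# Python 3 sketch: str is Unicode text, bytes is an encoded byte string.
text = "T\u00e9st"                     # four characters; \u00e9 is é

utf8_bytes = text.encode("utf-8")      # é becomes the two bytes C3 A9
latin1_bytes = text.encode("latin-1")  # é becomes the single byte E9

print(type(text).__name__, len(text))
print(type(utf8_bytes).__name__, len(utf8_bytes), utf8_bytes)
print(type(latin1_bytes).__name__, len(latin1_bytes), latin1_bytes)

# Decoding with the matching codec recovers the same text from either form.
assert utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1") == text
```

The same four characters have different lengths and different bytes depending on the encoding, which is why a byte string on its own does not say which text it stands for.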
Re: [Tutor] UTF-8 filenames encountered in os.walk
On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote:
> FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8 strings, they are not the same thing. A Unicode string uses 16 bits to represent each character. It is a distinct data type from a 'regular' string. Regular Python strings are byte strings with an implicit encoding. One possible encoding is UTF-8 which uses one or more bytes to represent each character.
>
> Some good reading on Unicode and utf-8:
> http://www.joelonsoftware.com/articles/Unicode.html
> http://effbot.org/zone/unicode-objects.htm

The problem is that the Windows filesystem uses UTF-8 as the encoding for filenames, but os doesn't seem to have a UTF-8 mode, just an ascii mode and a Unicode mode.

> If you pass a unicode string (not utf-8) to os.walk(), the resulting lists will also be unicode.
>
> Again, it would be helpful to see the code that is getting the error.

The code is quite complex for not-relevant-to-this-problem reasons. The gist is that I walk the FS, get filenames, some of which get written to an XML file. If I leave the output alone I get errors on reading the XML file. If I try to change the output so that it is all Unicode, I get errors because my UTF-8 data sometimes looks like ascii, and I don't see a UTF-8-to-Unicode converter in the docs.

>> I suspect that my program will have to make sure to recast all equivalent-to-ascii strings as UTF-8 while leaving the ones that are already extended alone.
>
> It is nonsense to talk about 'recasting' an ascii string as UTF-8; an ascii string is *already* UTF-8 because the representation of the characters is identical. OTOH it makes sense to talk about converting an ascii string to a unicode string.

Then what does mystring.encode('UTF-8') do?
--
yours,
William
Re: [Tutor] UTF-8 filenames encountered in os.walk
On Wed, 2007-07-04 at 12:00 -0400, William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote:
>> FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8 strings, they are not the same thing. A Unicode string uses 16 bits to represent each character. It is a distinct data type from a 'regular' string. Regular Python strings are byte strings with an implicit encoding. One possible encoding is UTF-8 which uses one or more bytes to represent each character.
>>
>> Some good reading on Unicode and utf-8:
>> http://www.joelonsoftware.com/articles/Unicode.html
>> http://effbot.org/zone/unicode-objects.htm
>
> The problem is that the Windows filesystem uses UTF-8 as the encoding for filenames, but os doesn't seem to have a UTF-8 mode, just an ascii mode and a Unicode mode.

Are you converting your utf-8 strings to unicode?

    unicode_file_name = utf8_file_name.decode('UTF-8')

>> If you pass a unicode string (not utf-8) to os.walk(), the resulting lists will also be unicode.
>>
>> Again, it would be helpful to see the code that is getting the error.
>
> The code is quite complex for not-relevant-to-this-problem reasons. The gist is that I walk the FS, get filenames, some of which get written to an XML file. If I leave the output alone I get errors on reading the XML file. If I try to change the output so that it is all Unicode, I get errors because my UTF-8 data sometimes looks like ascii, and I don't see a UTF-8-to-Unicode converter in the docs.

It is probably worth the effort to put together a simpler piece of code that can illustrate the problem.

>>> I suspect that my program will have to make sure to recast all equivalent-to-ascii strings as UTF-8 while leaving the ones that are already extended alone.
>>
>> It is nonsense to talk about 'recasting' an ascii string as UTF-8; an ascii string is *already* UTF-8 because the representation of the characters is identical. OTOH it makes sense to talk about converting an ascii string to a unicode string.
>
> Then what does mystring.encode('UTF-8') do?

It uses utf8 encoding rules to convert mystring FROM unicode to a string. If mystring is *NOT* unicode but simply a string, it appears to do a round trip decode and encode of the string. This allows you to find encoding errors, but if there are no errors the result is the same as what you started with.

The data in a file (a stream of bytes) is encoded to represent unicode characters. The stream must be decoded to recover the underlying unicode. The unicode must be encoded when written to files. utf-8 is just one of many possible encoding schemes.
--
Lloyd Kvam
Venix Corp
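The encode-on-write, decode-on-read cycle described above can be sketched as follows. This is an illustration in Python 3 (an assumption; the thread uses Python 2), and the sample text `café` and the temporary file are hypothetical, not part of the original posts.

```python
import os
import tempfile

text = "caf\u00e9"  # hypothetical sample text containing é

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Writing: the text is encoded to bytes on the way out.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading in binary mode shows the raw encoded bytes stored in the file.
    with open(path, "rb") as f:
        raw = f.read()

    # Decoding those bytes with the same codec recovers the text.
    recovered = raw.decode("utf-8")
finally:
    os.remove(path)

print(raw, recovered)
```

Decoding `raw` with a different codec than the one used for writing would either fail or give different characters, which is the kind of mismatch the thread is chasing.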
Re: [Tutor] UTF-8 filenames encountered in os.walk
On Wed, 4 Jul 2007, William O'Higgins Witteman wrote:
>> It is nonsense to talk about 'recasting' an ascii string as UTF-8; an ascii string is *already* UTF-8 because the representation of the characters is identical. OTOH it makes sense to talk about converting an ascii string to a unicode string.
>
> Then what does mystring.encode('UTF-8') do?

I'm pretty iffy on this stuff myself, but as I see it, you basically have three kinds of things here.

First, an ascii string:

    s = 'abc'

In hex, this is 616263; 61 for 'a'; 62 for 'b', 63 for 'c'.

Second, a unicode string:

    u = u'abc'

I can't say what this is in hex because that's not meaningful. A Unicode character is a code point, which can be represented in a variety of ways, depending on the encoding used. So, moving on.

Finally, you can have a sequence of bytes, which are stored in a string as a buffer, that shows the particular encoding of a particular string:

    e8 = s.encode('UTF-8')
    e16 = s.encode('UTF-16')

Now, e8 and e16 are each strings (of bytes), the content of which tells you how the string of characters that was encoded is represented in that particular encoding.

In hex, these look like this.

    e8:  616263 (61 for 'a'; 62 for 'b', 63 for 'c')
    e16: FFFE6100 62006300 (FFFE for the BOM, 6100 for 'a', 6200 for 'b', 6300 for 'c')

Now, superficially, s and e8 are equal, because for plain old ascii characters (which is all I've used in this example), UTF-8 is equivalent to ascii. And they compare the same:

    >>> s == e8
    True

But that's not true of the UTF-16:

    >>> s == e16
    False
    >>> e8 == e16
    False

So (and I'm open to correction on this), I think of the encode() method as returning a string of bytes that represents the particular encoding of a string value -- and it can't be used as the string value itself.

But you can get that string value back (assuming all the characters map to ascii):

    >>> s8 = e8.decode('UTF-8')
    >>> s16 = e16.decode('UTF-16')
    >>> s == s8 == s16
    True
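The hex values quoted above can be checked with a short sketch. It is written for Python 3 as an assumption (the thread is Python 2), where `bytes.hex()` is available; `utf-16-le` is used so the byte order is explicit, since the plain `utf-16` codec writes a BOM followed by the platform's native byte order.

```python
import codecs

s = "abc"

e8 = s.encode("utf-8")
e16_le = s.encode("utf-16-le")    # explicit little-endian, no BOM added
e16 = s.encode("utf-16")          # BOM first, then the platform's byte order

print(e8.hex())                   # 616263
print(e16_le.hex())               # 610062006300
print(codecs.BOM_UTF16_LE.hex())  # fffe

# The BOM lets the decoder work out the byte order on its own.
assert e16.decode("utf-16") == e8.decode("utf-8") == s
```

On a little-endian machine `e16.hex()` is the BOM followed by the `utf-16-le` bytes, which matches the FFFE6100 62006300 shown in the post.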
Re: [Tutor] UTF-8 filenames encountered in os.walk
William O'Higgins Witteman wrote:
> The problem is that the Windows filesystem uses UTF-8 as the encoding for filenames,

That's not what I get. For example, I made a file called Tést.txt and looked at what os.listdir() gives me. (os.listdir() is what os.walk() uses to get the file and directory names.) If I pass a byte string as the directory name, I get byte strings back, not in utf-8, but apparently in cp1252 (or latin-1, but this is Windows so it's probably cp1252):

    >>> os.listdir('C:\Documents and Settings')
    ['Administrator', 'All Users', 'Default User', 'LocalService', 'NetworkService', 'T\xe9st.txt']

Note the \xe9 which is the cp1252 representation of é. If I give the directory as a unicode string, the results are all unicode strings as well:

    >>> os.listdir(u'C:\Documents and Settings')
    [u'Administrator', u'All Users', u'Default User', u'LocalService', u'NetworkService', u'T\xe9st.txt']

In neither case does it give me utf-8.

> but os doesn't seem to have a UTF-8 mode, just an ascii mode and a Unicode mode.

It has a unicode string mode and a byte string mode.

> The code is quite complex for not-relevant-to-this-problem reasons. The gist is that I walk the FS, get filenames, some of which get written to an XML file. If I leave the output alone I get errors on reading the XML file.

What kind of errors? Be specific! Show the code that generates the error.

I'll hazard a guess that you are writing the cp1252 characters to the XML file but not specifying the charset of the file, or specifying it as utf-8, and the reader croaks on the cp1252.

> If I try to change the output so that it is all Unicode, I get errors because my UTF-8 data sometimes looks like ascii,

How do you change the output? What do you mean, the utf-8 data looks like ascii? Ascii data *is* utf-8, they should look the same.

> I don't see a UTF-8-to-Unicode converter in the docs.

If s is a byte string containing utf-8, then s.decode('utf-8') is the equivalent unicode string.

>>> I suspect that my program will have to make sure to recast all equivalent-to-ascii strings as UTF-8 while leaving the ones that are already extended alone.
>>
>> It is nonsense to talk about 'recasting' an ascii string as UTF-8; an ascii string is *already* UTF-8 because the representation of the characters is identical. OTOH it makes sense to talk about converting an ascii string to a unicode string.
>
> Then what does mystring.encode('UTF-8') do?

It depends on what mystring is. If it is a unicode string, it converts it to a plain (byte) string containing the utf-8 representation of mystring. For example,

    In [8]: s=u'\xe9' # Note the leading u - this is a unicode string
    In [9]: s.encode('utf-8')
    Out[9]: '\xc3\xa9'

If mystring is a string, it is converted to a unicode string using the default encoding (ascii unless you have changed it), then that string is converted to utf-8. This can work out two ways:

- if mystring originally contained only ascii characters, the result is identical to the original:

    In [1]: s='abc'
    In [2]: s.encode('utf-8')
    Out[2]: 'abc'
    In [4]: s.encode('utf-8') == s
    Out[4]: True

- if mystring contains non-ascii characters, then the implicit *decode* using the ascii codec will fail with an exception:

    In [5]: s = '\303\251'
    In [6]: s.encode('utf-8')
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
    <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Note this is exactly the same error you would get if you explicitly tried to convert to unicode using the ascii codec, because that is what is happening under the hood:

    In [11]: s.decode('ascii')
    Traceback (most recent call last):
      File "<ipython console>", line 1, in <module>
    <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Again, it would really help if you would
- show some code
- show some data
- learn more about unicode, utf-8, character encodings and python strings.

Kent
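The hidden decode step described above can be made explicit. Python 3 `bytes` objects have no `encode()` method at all, so this sketch imitates the Python 2 behaviour with a helper; `py2_style_encode` is a hypothetical name used only for illustration, and the `ascii` default stands in for Python 2's default encoding.

```python
def py2_style_encode(byte_string, encoding, default_encoding="ascii"):
    """Imitate Python 2's str.encode(): implicit decode, then encode."""
    as_text = byte_string.decode(default_encoding)   # the hidden step
    return as_text.encode(encoding)


# Pure ASCII input survives the round trip unchanged.
print(py2_style_encode(b"abc", "utf-8"))

# Bytes that are already UTF-8 fail in the hidden ascii decode,
# before any encoding to UTF-8 is attempted.
try:
    py2_style_encode(b"\xc3\xa9", "utf-8")
except UnicodeDecodeError as exc:
    print("implicit decode failed:", exc)
```

Seen this way, the UnicodeDecodeError raised by a call named encode() is less surprising: the failure happens in the decode that precedes it.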
Re: [Tutor] UTF-8 filenames encountered in os.walk
Terry Carroll wrote:
> I'm pretty iffy on this stuff myself, but as I see it, you basically have three kinds of things here.
>
> First, an ascii string:
>
>     s = 'abc'
>
> In hex, this is 616263; 61 for 'a'; 62 for 'b', 63 for 'c'.
>
> Second, a unicode string:
>
>     u = u'abc'
>
> I can't say what this is in hex because that's not meaningful. A Unicode character is a code point, which can be represented in a variety of ways, depending on the encoding used. So, moving on.
>
> Finally, you can have a sequence of bytes, which are stored in a string as a buffer, that shows the particular encoding of a particular string:
>
>     e8 = s.encode('UTF-8')
>     e16 = s.encode('UTF-16')
>
> Now, e8 and e16 are each strings (of bytes), the content of which tells you how the string of characters that was encoded is represented in that particular encoding.

I would say that there are two kinds of strings, byte strings and unicode strings. Byte strings have an implicit encoding. If the contents of the byte string are all ascii characters, you can generally get away with ignoring that they are in an encoding, because most of the common 8-bit character encodings include plain ascii as a subset (all the latin-x encodings, all the Windows cp12xx encodings, and utf-8 all have ascii as a subset), so an ascii string can be interpreted as any of those encodings without error. As soon as you get away from ascii, you have to be aware of the encoding of the string.

encode() really wants a unicode string not a byte string. If you call encode() on a byte string, the string is first converted to unicode using the default encoding (usually ascii), then converted with the given encoding.

> In hex, these look like this.
>
>     e8:  616263 (61 for 'a'; 62 for 'b', 63 for 'c')
>     e16: FFFE6100 62006300 (FFFE for the BOM, 6100 for 'a', 6200 for 'b', 6300 for 'c')
>
> Now, superficially, s and e8 are equal, because for plain old ascii characters (which is all I've used in this example), UTF-8 is equivalent to ascii. And they compare the same:
>
>     >>> s == e8
>     True

They are equal in every sense; I don't know why you consider this superficial. And if your original string was not ascii the encode() would fail with a UnicodeDecodeError.

> But that's not true of the UTF-16:
>
>     >>> s == e16
>     False
>     >>> e8 == e16
>     False
>
> So (and I'm open to correction on this), I think of the encode() method as returning a string of bytes that represents the particular encoding of a string value -- and it can't be used as the string value itself.

The idea that there is somehow some kind of string value that doesn't have an encoding will bring you a world of hurt as soon as you venture out of the realm of pure ascii. Every string is a particular encoding of character values. It's not any different from the string value itself.

> But you can get that string value back (assuming all the characters map to ascii):
>
>     >>> s8 = e8.decode('UTF-8')
>     >>> s16 = e16.decode('UTF-16')
>     >>> s == s8 == s16
>     True

You can get back to the ascii-encoded representation of the string. Though here you are hiding something - s8 and s16 are unicode strings while s is a byte string.

    In [13]: s = 'abc'
    In [14]: e8 = s.encode('UTF-8')
    In [15]: e16 = s.encode('UTF-16')
    In [16]: s8 = e8.decode('UTF-8')
    In [17]: s16 = e16.decode('UTF-16')
    In [18]: s8
    Out[18]: u'abc'
    In [19]: s16
    Out[19]: u'abc'
    In [20]: s
    Out[20]: 'abc'
    In [21]: type(s8) == type(s)
    Out[21]: False

The way I think of it is, unicode is the pure representation of the string. (This is nonsense, I know, but I find it a convenient mnemonic.) encode() converts from the pure representation to an encoded representation. The encoding can be ascii, latin-1, utf-8... decode() converts from the coded representation back to the pure one.

Kent
Re: [Tutor] UTF-8 filenames encountered in os.walk
On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
> encode() really wants a unicode string not a byte string. If you call encode() on a byte string, the string is first converted to unicode using the default encoding (usually ascii), then converted with the given encoding.

Aha! That helps. Something else that helps is that my Python code is generating output that is received by several other tools. Interesting facts:

- Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.
- I am indeed seeing filenames in cp1252, even though the Microsoft docs say that filenames are in UTF-8.
- Filenames in Arabic are in UTF-8.

What I have to do is to check the encoding of the filename as received by os.walk (and thus os.listdir) and convert them to Unicode, continue to process them, and then encode them as UTF-8 for output to XML.

In trying to work around bad 3rd party tools and inconsistent data I introduced errors in my Python code. The problem was in treating all filenames the same way, when they were not being created the same way by the filesystem.

Thanks for all the help and suggestions.
--
yours,
William
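The plan outlined above (keep names as Unicode while processing, encode to UTF-8 only when writing the XML) might look roughly like this. It is a sketch for Python 3 under stated assumptions, not William's actual code: the helper names `collect_names` and `write_names_as_xml` and the element names `files` and `file` are hypothetical. In Python 3, passing a text path to `os.walk` yields text names, so no per-name decoding is needed.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def collect_names(start_dir):
    """Walk start_dir (a text path) and return every file path as text."""
    names = []
    for dirpath, _dirnames, filenames in os.walk(start_dir):
        for name in filenames:
            names.append(os.path.join(dirpath, name))
    return sorted(names)


def write_names_as_xml(names, xml_path):
    """Write text names to XML, encoding once at the output boundary."""
    root = ET.Element("files")  # hypothetical element names
    for name in names:
        ET.SubElement(root, "file").text = name
    ET.ElementTree(root).write(xml_path, encoding="utf-8", xml_declaration=True)


# Demo with in-memory names so it does not depend on the local filesystem.
sample_names = ["T\u00e9st.txt", "\u0642.txt"]
with tempfile.TemporaryDirectory() as tmp:
    xml_path = os.path.join(tmp, "files.xml")
    write_names_as_xml(sample_names, xml_path)
    with open(xml_path, "rb") as f:
        xml_bytes = f.read()
    parsed = [e.text for e in ET.parse(xml_path).getroot()]
    walked = collect_names(tmp)
```

Declaring the encoding in the XML declaration and writing bytes in that same encoding is the part that matters for downstream parsers; everything before the write stays in one text type.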
Re: [Tutor] UTF-8 filenames encountered in os.walk
William O'Higgins Witteman wrote:
> On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote:
>> encode() really wants a unicode string not a byte string. If you call encode() on a byte string, the string is first converted to unicode using the default encoding (usually ascii), then converted with the given encoding.
>
> Aha! That helps. Something else that helps is that my Python code is generating output that is received by several other tools. Interesting facts:
>
> Not all .NET XML parsers (nor IE6) accept valid UTF-8 XML.

Yikes! Are you sure it isn't a problem with your XML?

> I am indeed seeing filenames in cp1252, even though the Microsoft docs say that filenames are in UTF-8.
>
> Filenames in Arabic are in UTF-8.

Not on my computer (Win XP) in os.listdir(). With filenames of Tést.txt and ق.txt (that's \u0642, an Arabic character), os.listdir() gives me

    >>> os.listdir('.')
    ['Administrator', 'All Users', 'Default User', 'LocalService', 'NetworkService', 'T\xe9st.txt', '?.txt']
    >>> os.listdir(u'.')
    [u'Administrator', u'All Users', u'Default User', u'LocalService', u'NetworkService', u'T\xe9st.txt', u'\u0642.txt']

So with a byte string directory it fails, with a unicode directory it gives unicode, not utf-8.

> What I have to do is to check the encoding of the filename as received by os.walk (and thus os.listdir) and convert them to Unicode, continue to process them, and then encode them as UTF-8 for output to XML.

How do you do that? AFAIK there is no completely reliable way to determine the encoding of a byte string by looking at it; the most common approach is to try to find one that successfully decodes the string; more sophisticated variations look at the distribution of character codes. Anyway if you use the Unicode file names you shouldn't have to worry about this.

Kent
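The "try codecs until one decodes" approach mentioned above can be sketched as follows, in Python 3 (an assumption; the thread is Python 2). The function name `decode_with_fallback` and the candidate list are hypothetical, and as the post says, this is a heuristic rather than a reliable detector.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate codec that decodes raw.

    Stricter codecs should come first: UTF-8 rejects most byte sequences
    that are not UTF-8, while cp1252 accepts almost anything.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %r" % (encodings, raw))


print(decode_with_fallback(b"T\xc3\xa9st.txt"))  # valid UTF-8
print(decode_with_fallback(b"T\xe9st.txt"))      # not UTF-8, falls back to cp1252
```

A successful decode only shows that the bytes are valid in that codec, not that the codec is the one the data was written in, which is why asking the OS for Unicode names in the first place avoids the guesswork.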
Re: [Tutor] UTF-8 filenames encountered in os.walk
On Wed, 4 Jul 2007, Kent Johnson wrote:
> Terry Carroll wrote:
>> Now, superficially, s and e8 are equal, because for plain old ascii characters (which is all I've used in this example), UTF-8 is equivalent to ascii. And they compare the same:
>>
>>     >>> s == e8
>>     True
>
> They are equal in every sense, I don't know why you consider this superficial. And if your original string was not ascii the encode() would fail with a UnicodeDecodeError.

Superficial in the sense that I was using only characters in the ascii character set, so that they have the same byte encoding in UTF-8.

so:

    >>> 'abc'.decode('UTF-8')
    u'abc'

works. But UTF-8 can hold other characters, too; for example

    >>> '\xe4\xba\xba'.decode('UTF-8')
    u'\u4eba'

(Chinese character for person)

I'm just saying that UTF-8 encodes ascii characters to themselves; but UTF-8 is not the same as ascii.

I think we're ultimately saying the same thing; to merge both our ways of putting it, I think, is that ascii will map to UTF-8 identically; but UTF-8 may map back or it will raise UnicodeDecodeError.

I just didn't want to leave the impression "Yeah, UTF-8 ascii, they're the same thing."
Re: [Tutor] UTF-8 filenames encountered in os.walk
Terry Carroll wrote:
> I'm just saying that UTF-8 encodes ascii characters to themselves; but UTF-8 is not the same as ascii.
>
> I think we're ultimately saying the same thing; to merge both our ways of putting it, I think, is that ascii will map to UTF-8 identically; but UTF-8 may map back or it will raise UnicodeDecodeError.
>
> I just didn't want to leave the impression "Yeah, UTF-8 ascii, they're the same thing."

I hope neither of us gave that impression! I think you are right, we just have different ways of thinking about it. Any ascii string is also a valid utf-8 string (and latin-1, and many other encodings), but the opposite is not true.

Kent
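The one-way relationship summed up above (every ascii string is valid utf-8, but not the reverse) can be checked directly. This is a Python 3 sketch (an assumption; the thread is Python 2), and `decodes_as` is a hypothetical helper name. The non-ascii sample is the thread's own `'\xe4\xba\xba'` example.

```python
def decodes_as(raw, encoding):
    """Return True if the bytes in raw are valid under the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


ascii_bytes = b"abc"
utf8_bytes = "\u4eba".encode("utf-8")  # b'\xe4\xba\xba', as in the thread

for encoding in ("ascii", "utf-8", "latin-1", "cp1252"):
    print(encoding, decodes_as(ascii_bytes, encoding), decodes_as(utf8_bytes, encoding))
```

Note that latin-1 assigns a character to every possible byte, so it accepts the UTF-8 bytes too and simply yields the wrong characters; a decode that succeeds is not evidence that the encoding was guessed correctly.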
Re: [Tutor] UTF-8 filenames encountered in os.walk
William O'Higgins Witteman [EMAIL PROTECTED] wrote
> I have several programs which traverse a Windows filesystem with French characters in the filenames.

I suspect you need to set the Locale at the top of your file. Do a search for locale in this list's archive, where we had a thread on this a few months ago.

HTH,

Alan G
Re: [Tutor] UTF-8 filenames encountered in os.walk
Alan Gauld wrote:
> William O'Higgins Witteman [EMAIL PROTECTED] wrote
>> I have several programs which traverse a Windows filesystem with French characters in the filenames.
>
> I suspect you need to set the Locale at the top of your file.

Do you mean the

    # -*- coding: encoding-name -*-

comment? That only affects the encoding of the source file itself.

Kent
Re: [Tutor] UTF-8 filenames encountered in os.walk
Kent Johnson [EMAIL PROTECTED] wrote
>> I suspect you need to set the Locale at the top of your file.
>
> Do you mean the
>     # -*- coding: encoding-name -*-
> comment? That only affects the encoding of the source file itself.

No, I meant the Locale but I got it mixed up with the encoding in how it is set. Oops!

Alan G.