Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Kent Johnson
Terry Carroll wrote: > I'm just saying that UTF-8 encodes ascii characters to themselves; but > UTF-8 is not the same as ascii. > > I think we're ultimately saying the same thing; to merge both our ways of > putting it, I think, is that ascii will map to UTF-8 identically; but > UTF-8 may map bac

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Terry Carroll
On Wed, 4 Jul 2007, Kent Johnson wrote: > Terry Carroll wrote: > > Now, superficially, s and e8 are equal, because for plain old ascii > > characters (which is all I've used in this example), UTF-8 is equivalent > > to ascii. And they compare the same: > > > s == e8 > > True > > They are

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Kent Johnson
William O'Higgins Witteman wrote: > On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote: > >> encode() really wants a unicode string not a byte string. If you call >> encode() on a byte string, the string is first converted to unicode >> using the default encoding (usually ascii), then

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread William O'Higgins Witteman
On Wed, Jul 04, 2007 at 02:47:45PM -0400, Kent Johnson wrote: >encode() really wants a unicode string not a byte string. If you call >encode() on a byte string, the string is first converted to unicode >using the default encoding (usually ascii), then converted with the >given encoding. Aha!
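
The two-step behaviour Kent describes belongs to Python 2, where a byte string has its own encode() method. As a rough sketch only, here is the same sequence in Python 3 syntax with the hidden decoding step written out. The helper name py2_style_encode is hypothetical and only imitates that behaviour, and treating the failing byte as Latin-1 is an assumption about the data, not something stated in the thread.

```python
def py2_style_encode(byte_string, encoding, default_encoding="ascii"):
    """Imitate Python 2's str.encode(): decode with the default codec, then encode."""
    text = byte_string.decode(default_encoding)  # the implicit first step
    return text.encode(encoding)


# Pure ASCII bytes pass through both steps unchanged.
ascii_result = py2_style_encode(b"Test.txt", "utf-8")

# 0xe9 is 'é' in Latin-1; the ascii codec cannot decode it.
try:
    py2_style_encode(b"T\xe9st.txt", "utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes are actually in (assumed Latin-1 here)
# avoids the error.
fixed = b"T\xe9st.txt".decode("latin-1").encode("utf-8")
```

The failure comes from the implicit ascii decode, not from the UTF-8 encode that was requested, which is why the traceback mentions the 'ascii' codec.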

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Kent Johnson
Terry Carroll wrote: > I'm pretty iffy on this stuff myself, but as I see it, you basically have > three kinds of things here. > > First, an ascii string: > > s = 'abc' > > In hex, this is 616263; 61 for 'a'; 62 for 'b', 63 for 'c'. > > Second, a unicode string: > > u = u'abc' > > I can
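
The kinds of string Terry lists can be compared side by side. This is a small sketch in Python 3 syntax, where b'abc' plays the role of the Python 2 byte string and 'abc' the role of u'abc'. The accented character is an added illustration, not part of the quoted message.

```python
s = b"abc"               # byte string holding ASCII text
u = "abc"                # text (Unicode) string; written u'abc' in Python 2
e8 = u.encode("utf-8")   # the same text encoded as UTF-8 bytes

ascii_equals_utf8 = (s == e8)   # for pure ASCII text the bytes are identical
hex_of_s = s.hex()              # hex digits of the three bytes

# Outside ASCII the encodings diverge (illustrative character, an assumption).
accented = "\u00e9"                          # 'é', a single code point
utf8_bytes = accented.encode("utf-8")        # two bytes
latin1_bytes = accented.encode("latin-1")    # one byte
```

For ASCII text the byte values coincide, so the difference only becomes visible once a character such as 'é' is involved.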

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Kent Johnson
William O'Higgins Witteman wrote: > The problem is that the Windows filesystem uses UTF-8 as the encoding > for filenames, That's not what I get. For example, I made a file called "Tést.txt" and looked at what os.listdir() gives me. (os.listdir() is what os.walk() uses to get the file and direc
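
A way to repeat this kind of check is to list the same directory twice and look at the type of the names that come back. The sketch below uses Python 3, where the distinction is str versus bytes rather than unicode versus str. The helper name listdir_both_ways is hypothetical, and the file is created through a bytes path so that the example does not depend on the locale.

```python
import os
import tempfile


def listdir_both_ways(directory):
    """Return (names as str, names as bytes) for the same directory."""
    text_names = os.listdir(directory)                # str in, str out
    byte_names = os.listdir(os.fsencode(directory))   # bytes in, bytes out
    return text_names, byte_names


with tempfile.TemporaryDirectory() as tmp:
    # Create a file named 'Tést.txt' using its UTF-8 bytes.
    raw_name = "T\u00e9st.txt".encode("utf-8")
    with open(os.path.join(os.fsencode(tmp), raw_name), "wb"):
        pass
    text_names, byte_names = listdir_both_ways(tmp)
```

The type of the argument decides the type of the results, so passing a text path is what yields decoded names.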

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Terry Carroll
On Wed, 4 Jul 2007, William O'Higgins Witteman wrote: > >It is nonsense to talk about 'recasting' an ascii string as UTF-8; an > >ascii string is *already* UTF-8 because the representation of the > >characters is identical. OTOH it makes sense to talk about converting an > >ascii string to a un

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Lloyd Kvam
On Wed, 2007-07-04 at 12:00 -0400, William O'Higgins Witteman wrote: > On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote: > > >FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8 > >strings, they are not the same thing. A Unicode string uses 16 bits to > >represent each ch

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread William O'Higgins Witteman
On Wed, Jul 04, 2007 at 11:28:53AM -0400, Kent Johnson wrote: >FWIW, I'm pretty sure you are confusing Unicode strings and UTF-8 >strings, they are not the same thing. A Unicode string uses 16 bits to >represent each character. It is a distinct data type from a 'regular' >string. Regular Python st

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread Kent Johnson
William O'Higgins Witteman wrote: >> for thing in os.walk(u'.'): >> >> instead of: >> >> for thing in os.walk('.'): > > This is a good thought, and the crux of the problem. I pull the > starting directories from an XML file which is UTF-8, but by the time it > hits my program, because there are
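
One possible shape for this, assuming the starting directory reaches the program as UTF-8 encoded bytes (the message is cut off before it says so), is to decode it once and hand os.walk a text path. This is a Python 3 sketch, and the function name walk_as_text is hypothetical.

```python
import os
import tempfile


def walk_as_text(start_dir_utf8):
    """Decode a UTF-8 encoded start directory, then yield file paths as text."""
    start_dir = start_dir_utf8.decode("utf-8")   # bytes to text, explicit codec
    for dirpath, dirnames, filenames in os.walk(start_dir):
        for name in filenames:
            yield os.path.join(dirpath, name)


# Usage: walk a temporary directory whose name is passed in as UTF-8 bytes.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    paths = list(walk_as_text(tmp.encode("utf-8")))
```

Decoding at the boundary keeps every path inside the program as text, so no implicit conversion happens later when the paths are written out.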

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-04 Thread William O'Higgins Witteman
On Tue, Jul 03, 2007 at 06:04:16PM -0700, Terry Carroll wrote: > >> Has anyone found a silver bullet for ensuring that all the filenames >> encountered by os.walk are treated as UTF-8? Thanks. > >What happens if you specify the starting directory as a Unicode string, >rather than an ascii string,

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-03 Thread Terry Carroll
On Tue, 3 Jul 2007, William O'Higgins Witteman wrote: > Has anyone found a silver bullet for ensuring that all the filenames > encountered by os.walk are treated as UTF-8? Thanks. What happens if you specify the starting directory as a Unicode string, rather than an ascii string, e.g., if you'r

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-03 Thread Alan Gauld
"Kent Johnson" <[EMAIL PROTECTED]> wrote >> I suspect you need to set the Locale at the top of your file. > > Do you mean the > # -*- coding: -*- > comment? That only affects the encoding of the source file itself. No, I meant the Locale but I got it mixed up with the encoding in how it is set

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-03 Thread Kent Johnson
William O'Higgins Witteman wrote: > I have several programs which traverse a Windows filesystem with French > characters in the filenames. > > I am having trouble dealing with these filenames when outputting these > paths to an XML file - I get UnicodeDecodeError: 'ascii' codec can't > decode by

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-03 Thread Kent Johnson
Alan Gauld wrote: > "William O'Higgins Witteman" <[EMAIL PROTECTED]> wrote > >> I have several programs which traverse a Windows filesystem with >> French >> characters in the filenames. > > I suspect you need to set the Locale at the top of your file. Do you mean the # -*- coding: -*- comment

Re: [Tutor] UTF-8 filenames encountered in os.walk

2007-07-03 Thread Alan Gauld
"William O'Higgins Witteman" <[EMAIL PROTECTED]> wrote >I have several programs which traverse a Windows filesystem with >French > characters in the filenames. I suspect you need to set the Locale at the top of your file. Do a search for locale in this lists archive where we had a thread on th

[Tutor] UTF-8 filenames encountered in os.walk

2007-07-03 Thread William O'Higgins Witteman
I have several programs which traverse a Windows filesystem with French characters in the filenames. I am having trouble dealing with these filenames when outputting these paths to an XML file - I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 ... etc. That happens when I try to c