On Thu, Aug 18, 2016 at 4:07 PM, Steve Dower <steve.do...@python.org> wrote:
> On 18Aug2016 0900, Chris Angelico wrote:
>>
>> On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower <steve.do...@python.org>
>> wrote:
>>>
>>> On 18Aug2016 0829, Chris Angelico wrote:
>>>>
>>>> The second call to glob doesn't have any Unicode characters at all,
>>>> the way I see it - it's all bytes. Am I completely misunderstanding
>>>> this?
>>>
>>> You're not the only one - I think this has been the most common
>>> misunderstanding.
>>>
>>> On Windows, the paths as stored in the filesystem are actually all text -
>>> more precisely, utf-16-le encoded bytes, represented as 16-bit characters
>>> strings.
>>>
>>> Converting to an 8-bit character representation only exists for
>>> compatibility with code written for other platforms (either Linux, or much
>>> older versions of Windows). The operating system has one way to do the
>>> conversion to bytes, which Python currently uses, but since we control that
>>> transformation I'm proposing an alternative conversion that is more reliable
>>> than compatible (with Windows 3.1... shouldn't affect compatibility with
>>> code that properly handles multibyte encodings, which should include
>>> anything developed for Linux in the last decade or two).
>>>
>>> Does that help? I tried to keep the explanation short and focused :)
>>
>> Ah, I think I see what you mean. There's a slight ambiguity in the
>> word "missing" here.
>>
>> 1) The Unicode character in the result lacks some of the information
>> it should have
>>
>> 2) The Unicode character in the file name is information that has now been
>> lost.
>>
>> My reading was the first, but AIUI you actually meant the second. If
>> so, I'd be inclined to reword it very slightly, eg:
>>
>> "The Unicode character in the second call to glob is now lost
>> information."
>>
>> Is that a correct interpretation?
>
> I think so, though I find the wording a little awkward (and on rereading, my
> original wording was pretty bad). How about:
>
> "The second call to glob has replaced the Unicode character with '?', which
> means the actual filename cannot be recovered and the path is no longer
> valid."

They're all just characters in the context of Unicode, so I think it's
clearest to use the character code, e.g.:

    The second call to glob has replaced the U+AB00 character with '?',
    which means ...
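
For anyone who wants to see the "replaced with '?'" effect without a Windows machine, here is a rough, portable sketch. It is only an illustration under stated assumptions: the filename is made up, and cp1252 is used as a stand-in for a typical Windows "ANSI" code page. It does not call the actual Windows conversion that glob goes through; it just shows why a lossy 8-bit encoding cannot be reversed while UTF-8 can.

```python
# Hypothetical filename containing U+AB00, the character discussed in this thread.
name = "test_\uab00.txt"

# Assumption: cp1252 stands in for a typical Windows "ANSI" code page.
# It has no mapping for U+AB00, so errors="replace" substitutes b"?".
lossy = name.encode("cp1252", errors="replace")
recovered_lossy = lossy.decode("cp1252")

# UTF-8 can represent every code point, so the round trip is exact.
utf8 = name.encode("utf-8")
recovered_utf8 = utf8.decode("utf-8")

print(lossy)                    # the bytes now contain a literal '?'
print(recovered_lossy == name)  # the original name cannot be recovered
print(recovered_utf8 == name)   # the UTF-8 round trip preserves the name
```

Once the '?' is in the bytes, nothing in them says which character used to be there, which is the "lost information" reading from earlier in the thread.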