On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull
<turnbull.stephen...@u.tsukuba.ac.jp> wrote:
>
> So it's not just invalid surrogate *pairs*, it's invalid surrogates of
> all kinds.  This means that it's theoretically possible (though I
> gather that it's unlikely in the extreme) for a real Windows filename
> to be indistinguishable from one generated by Python's surrogateescape
> handler.

Absolutely, if the filesystem is one of Microsoft's, such as NTFS, FAT32,
exFAT, ReFS, NPFS (named pipes), MSFS (mailslots) -- and I'm pretty sure
it's also possible with CDFS and UDFS. UDF allows any Unicode character
except NUL.

> What happens when Python's directory manipulation functions on Windows
> encounter such a filename?  Do they try to write it to the disk
> directory?  Do they succeed?  Does that depend on surrogateescape?

Python allows these 'Unicode' (but not strictly UTF-compatible) strings,
so it doesn't have a problem with such filenames, as long as it's calling
the Windows wide-character APIs.

> Is there a reason in practice to allow surrogateescape at all on names
> in Windows filesystems, at least when using the *W API?  You mention
> non-Microsoft filesystems; are they common enough to matter?

Previously I gave an example with a VirtualBox shared folder, which
rejects names with invalid surrogates. I don't know how common that is in
general. I typically switch between 2 guests on a Linux host and share
folders between systems. In Windows I mount shared folders as directory
symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd is a
free ext2/ext3 filesystem driver for Windows. I mounted an ext2 disk in
Windows 10. Next, in Python I created a file named
"\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to using
UTF-8 as the drive codepage, so I expected it to reject this filename,
just like VBoxSF does. But it worked:

    >>> os.listdir('.')[-1]
    '\udc00b\udc00a\udc00d'

As expected, the ANSI API substitutes question marks for the surrogate
codes:

    >>> os.listdir(b'.')[-1]
    b'?b?a?d'

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I mounted
the disk in Linux to check:

    >>> os.listdir(b'.')[-1]
    b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I think
it's called WTF-8 (Wobbly Transformation Format).
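
The bytes seen from Linux can be reproduced in Python alone. This is only a
minimal sketch: it assumes the driver applies the generic 3-byte UTF-8 bit
layout to each lone surrogate, which matches the observed bytes but has not
been checked against the Ext2Fsd source. The helper name `to_wtf8` is made up
for illustration.

```python
name = '\udc00b\udc00a\udc00d'

def to_wtf8(s):
    # Hypothetical helper: 'surrogatepass' tells the UTF-8 codec to encode
    # lone surrogates with the ordinary 3-byte pattern instead of raising.
    return s.encode('utf-8', errors='surrogatepass')

# Strict UTF-8 refuses the name, which is what VBoxSF effectively does.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

wtf8_name = to_wtf8(name)

# U+DC00 is 1101 110000 000000 in binary, so the three bytes are
# 1110_1101, 10_110000, 10_000000, i.e. ED B0 80.
manual = bytes([0xE0 | (0xDC00 >> 12),
                0x80 | ((0xDC00 >> 6) & 0x3F),
                0x80 | (0xDC00 & 0x3F)])

print(strict_ok)
print(wtf8_name)
print(manual)
```

The `surrogatepass` handler is a documented error handler for Python's UTF-8
codec, so nothing here relies on platform behavior.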

The file manager in Linux displays this file as "���b���a���d (invalid
encoding)", and ls prints "???b???a???d". Python uses its surrogateescape
error handler:

    >>> os.listdir('.')[-1]
    '\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

    >>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
    '\udc00b\udc00a\udc00d'
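
The two decodings can be compared without a mounted disk. The following is a
rough sketch that starts from the raw bytes reported by `os.listdir(b'.')`
on Linux; `recover_windows_name` is a hypothetical helper, and it assumes
the filesystem encoding is UTF-8.

```python
raw = b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

# What os.listdir('.') yields on Linux: every undecodable byte 0xNN
# becomes the lone surrogate U+DCNN.
escaped = raw.decode('utf-8', errors='surrogateescape')

# What the name looked like on Windows: the 3-byte sequences are decoded
# back into the lone surrogates they were generated from.
passed = raw.decode('utf-8', errors='surrogatepass')

def recover_windows_name(escaped_name):
    # Hypothetical helper: undo surrogateescape to get the raw bytes,
    # then reinterpret those bytes with surrogatepass.
    data = escaped_name.encode('utf-8', errors='surrogateescape')
    return data.decode('utf-8', errors='surrogatepass')

print(ascii(escaped))
print(ascii(passed))
print(ascii(recover_windows_name(escaped)))
```

Both handlers round-trip to the same bytes, so no information is lost on the
Linux side. Note, though, that the `escaped` string consists only of lone
surrogates and ASCII letters, so it would itself be an acceptable name on a
Microsoft filesystem, which is the ambiguity raised at the top of this
message.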