Re: PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 23:41, Barry Scott wrote:
> On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Forgive me if this has been covered. I've been reading this thread for a
> long time and still have a 100 odd replies to go...
>
> How do I get a printable unicode version of these path strings if they
> contain non-unicode data?

Personally, I'd use repr(). One might ask, what would you expect to see
if you were printing such a string?

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply return
> a string if it's valid utf-8 and otherwise return bytes? Then in the app
> you check the type of the object, string or bytes, and deal with
> reporting errors appropriately.

Because it complicates the app enormously, for every app.

It would be _nice_ to just call os.listdir() et al with strings, get
strings, and not worry.

With strings becoming unicode in Python3, on POSIX you have an issue of
deciding how to get its filenames-are-bytes into a string and the
reverse. One could naively map the byte values to the same Unicode code
points, but that results in strings that do not contain the same
characters as the user/app expects for byte values above 127.

Since POSIX does not really have a filesystem level character encoding,
just a user environment setting that says how the current user encodes
characters into bytes (UTF-8 is increasingly common and useful, but it
is not universal), it is more useful to decode filenames on the
assumption that they represent characters in the user's (current)
encoding convention; that way when things are displayed they are
meaningful, and they interoperate well with strings made by the
user/app.
If all the filenames were actually encoded that way when made, that
works. But different users may adopt different conventions, and indeed a
user may have used ASCII or an ISO8859-* coding in the past and be
transitioning to something else now, so they will have a bunch of files
in different encodings.

The PEP uses the user's current encoding with a handler for byte
sequences that don't decode to valid Unicode scalar values in a fashion
that is reversible. That is, you get "strings" out of listdir() and
those strings will go back in (eg to open()) perfectly robustly.

Previous approaches would either silently hide non-decodable names in
listdir() results or throw exceptions when the decode failed or mangle
things non-reversibly. I believe Python3 went with the first option
there.

The PEP at least lets programs naively access all files that exist, and
create a filename from any well-formed unicode string provided that the
filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

  - Glenn points out that strings that came _not_ from listdir, and that
    are _not_ well-formed unicode (== "have bare surrogates in them")
    but that were intended for use as filenames will conflict with the
    PEP's scheme - programs must know that these strings came from
    outside and must be translated into the PEP's funny-encoding before
    use in the os.* functions. Previous to the PEP they would get used
    directly and encode differently after the PEP, thus producing
    different POSIX filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
    using a rare-in-filenames character. That would avoid the issue with
    "outside" strings that contain surrogates. To my mind it just moves
    the punning from rare illegal strings to merely uncommon but legal
    characters.

  - Some parties think it would be better to not return strings from
    os.listdir but a subclass of string (or at least a duck-type of
    string) that knows where it came from and is also handily
    recognisable as not-really-a-string for purposes of deciding whether
    it is PEP-funny-encoded by direct inspection.

Cheers,
--
Cameron Simpson DoD#743
http://www.cskk.ezoshosting.com/cs/

The peever can look at the best day in his life and sneer at it.
- Jim Hill, JennyGfest '95
--
http://mail.python.org/mailman/listinfo/python-list
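The reversibility claim and the "outside string" pun discussed above can be sketched in a few lines. This is only an illustration, and it rests on one assumption that is not in the thread: the handler is spelled "surrogateescape", which is the name PEP 383's error handler carries in released Python 3, whereas the messages here call it "python-escape" or "utf-8b". The sample file name is made up.

```python
# Assumption: "surrogateescape" is the released name of the handler the
# thread refers to as "python-escape" / "utf-8b".
raw = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
print(repr(name))  # repr() gives a printable form of the odd string

# Encoding with the same handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")

# A strict encode refuses the lone surrogate instead of guessing.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The pun: a str that never came from listdir() but happens to contain
# the same bare surrogate encodes to the very same bytes.
outside = "caf\udce9.txt"
punned = outside.encode("utf-8", "surrogateescape")
```

The last two lines are the case Glenn worries about: nothing in `outside` records where it came from, so the handler cannot tell it apart from a name produced by decoding.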
Re: PEP 383: Non-decodable Bytes in System Character Interfaces
On 25Apr2009 14:07, "Martin v. Löwis" wrote:
| Cameron Simpson wrote:
| > On 22Apr2009 08:50, Martin v. Löwis wrote:
| > | File names, environment variables, and command line arguments are
| > | defined as being character data in POSIX;
| >
| > Specific citation please? I'd like to check the specifics of this.
|
| For example, on environment variables:
| http://opengroup.org/onlinepubs/007908799/xbd/envvar.html
[...]
| http://opengroup.org/onlinepubs/007908799/xsh/execve.html
[...]

Thanks.

| > So you're proposing that all POSIX OS interfaces (which use byte strings)
| > interpret those byte strings into Python3 str objects, with a codec
| > that will accept arbitrary byte sequences losslessly and is totally
| > reversible, yes?
|
| Correct.
|
| > And, I hope, that the os.* interfaces silently use it by default.
|
| Correct.

Ok, then I'm probably good with the PEP. Though I have a quite strong
desire to be able to work in bytes at need without doing multiple
encode/decode steps.

| > | Applications that need to process the original byte
| > | strings can obtain them by encoding the character strings with the
| > | file system encoding, passing "python-escape" as the error handler
| > | name.
| >
| > -1
| > This last sentence kills the idea for me, unless I'm missing something.
| > Which I may be, of course.
| > POSIX filesystems _do_not_ have a file system encoding.
|
| Why is that a problem for the PEP?

Because you said above "by encoding the character strings with the file
system encoding", which is a fiction.

| > If I'm writing a general purpose UNIX tool like chmod or find, I expect
| > it to work reliably on _any_ UNIX pathname. It must be totally encoding
| > blind. If I speak to the os.* interface to open a file, I expect to hand
| > it bytes and have it behave.
|
| See the other messages. If you want to do that, you can continue to.
|
| > I'm very much in favour of being able to work in strings for most
| > purposes, but if I use the os.* interfaces on a UNIX system it is
| > necessary to be _able_ to work in bytes, because UNIX file pathnames
| > are bytes.
|
| Please re-read the PEP. It provides a way of being able to access any
| POSIX file name correctly, and still pass strings.
|
| > If there isn't a byte-safe os.* facility in Python3, it will simply be
| > unsuitable for writing low level UNIX tools.
|
| Why is that? The mechanism in the PEP is precisely defined to allow
| writing low level UNIX tools.

Then implicitly it's byte safe. Clearly I'm being unclear; I mean
original OS-level byte strings must be obtainable undamaged, and it must
be possible to create/work on OS objects starting with a byte string as
the pathname.

| > Finally, I have a small python program whose whole purpose in life
| > is to transcode UNIX filenames before transfer to a MacOSX HFS
| > directory, because of HFS's enforced particular encoding. What approach
| > should a Python app take to transcode UNIX pathnames under your scheme?
|
| Compute the corresponding character strings, and use them.

In Python2 I've been going (ignoring checks for unchanged names):

  - Obtain the old name and interpret it into a str() "correctly".
    I mean here that I go:

      unicode_name = unicode(name, srcencoding)

    in old Python2 speak. name is a bytes string obtained from listdir()
    and srcencoding is the encoding known to have been used when the old
    name was constructed. Eg iso8859-1.

  - Compute the new name in the desired encoding. For MacOSX HFS, that's:

      utf8_name = unicodedata.normalize('NFD',unicode_name).encode('utf8')

    Still in Python2 speak, that's a byte string.

  - os.rename(name, utf8_name)

Under your scheme I imagine this is amended.
I would change your listdir_b() function as follows:

  import os
  import sys

  def listdir_b(bytestring, fse=None):
      if fse is None:
          fse = sys.getfilesystemencoding()
      string = bytestring.decode(fse, "python-escape")
      for fn in os.listdir(string):
          yield fn.encode(fse, "python-escape")

So, internally, os.listdir() takes a string and encodes it to an
_unspecified_ encoding in bytes, and opens the directory with that byte
string using POSIX opendir(3).

How does listdir() ensure that the byte string it passes to the
underlying opendir(3) is identical to 'bytestring' as passed to
listdir_b()?

It seems from the PEP that "On POSIX systems, Python currently applies
the locale's encoding to convert the byte data to Unicode". Your
extension is to augment that by expressing the non-decodable byte
sequences in a non-conflicting way for reversal later, yes?

That seems to double the complexity of my example application, since it
wants to interpret the original bytes in a caller-specified fashion, not
using the locale defaults.

So I must go:

  def macify(dirname, srcencoding):
      # I need this to reverse your encoding scheme
      fse = sys.getfilesystemencoding()
      # I'll pretend dirname is ready for use
      # it possibly has had to un
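For comparison, the three Python2 steps listed earlier in this message (interpret with the known source encoding, normalise to NFD, encode as UTF-8) can be sketched in Python 3 while staying in bytes at the edges. This is a minimal sketch and not the truncated macify() above: `hfs_name` and `rename_plan` are hypothetical helper names, and the sample names are invented.

```python
import unicodedata

def hfs_name(raw_name: bytes, srcencoding: str) -> bytes:
    """Hypothetical helper: re-encode one file name as NFD UTF-8."""
    # Interpret the bytes with the encoding the caller knows was used.
    text = raw_name.decode(srcencoding)
    # Decompose accented characters, then encode the result as UTF-8.
    return unicodedata.normalize("NFD", text).encode("utf-8")

def rename_plan(raw_names, srcencoding):
    """Yield (old, new) byte-string pairs for names that would change."""
    for old in raw_names:
        new = hfs_name(old, srcencoding)
        if new != old:
            yield old, new

# Stand-in for names as a bytes-mode directory listing would return them.
plan = list(rename_plan([b"caf\xe9", b"plain"], "iso-8859-1"))
```

In a real run the names would come from os.listdir() called with a bytes directory name, which returns bytes, and each pair would be handed to os.rename(old, new); both calls accept byte strings, so the locale's encoding never enters the picture.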
Re: PEP 383: Non-decodable Bytes in System Character Interfaces
Thanks for writing this PEP 383, MvL. I recently ran into this problem
in Python 2.x in the Tahoe project [1]. The Tahoe project should be
considered a good use case showing what some people need. For example,
the assumption that a file will later be written back into the same
local filesystem (and thus luckily use the same encoding) from which it
originally came doesn't hold for us, because Tahoe is used for
file-sharing as well as for backup-and-restore.

One of my first conclusions in pursuing this issue is that we can never
use the Python 2.x unicode APIs on Linux, just as we can never use the
Python 2.x str APIs on Windows [2]. (You mentioned this ugliness in your
PEP.)

My next conclusion was that the Linux way of doing encoding of filenames
really sucks compared to, for example, the Mac OS X way. I'm heartened
to see that David Wheeler is trying to persuade the maintainers of Linux
filesystems to improve some of this: [3].

My final conclusion was that we needed to have two kinds of workaround
for the Linux suckage: first, if decoding using the suggested filesystem
encoding fails, then we fall back to mojibake [4] by decoding with
iso-8859-1 (or else with windows-1252 -- I'm not sure if it matters and
I haven't yet understood if utf-8b offers another alternative for this
case). Second, if decoding succeeds using the suggested filesystem
encoding on Linux, then write down the encoding that we used and include
that with the filename. This expands the size of our filenames
significantly, but it is the only way to allow some future programmer to
undo the damage of a falsely-successful decoding.

Here's our whole plan: [5].
Regards,

Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html
    # see the footnote of this message
[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47
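The two workarounds described in this message (fall back to iso-8859-1 when the suggested encoding fails, and record which encoding was actually used) might look roughly like the following. It is a sketch of the idea only, not Tahoe's code: `decode_filename` is a hypothetical name, and returning the encoding alongside the text is one assumed way of "writing it down".

```python
def decode_filename(raw: bytes, suggested: str = "utf-8"):
    """Hypothetical helper: return (text, encoding_used) for a raw name."""
    try:
        # First choice: the encoding the environment suggests.
        return raw.decode(suggested), suggested
    except UnicodeDecodeError:
        # Fallback: iso-8859-1 maps every byte to a code point, so this
        # cannot fail, though the result may be mojibake.
        return raw.decode("iso-8859-1"), "iso-8859-1"

good = decode_filename(b"caf\xc3\xa9")   # valid UTF-8
fallback = decode_filename(b"caf\xe9")   # not valid UTF-8
```

Because the encoding travels with the text, `text.encode(encoding_used)` always reproduces the original bytes, which is what lets a later reader undo a decoding that only appeared to succeed.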
Re: PEP 383: Non-decodable Bytes in System Character Interfaces
On 24Apr2009 09:27, I wrote:
| If I'm writing a general purpose UNIX tool like chmod or find, I expect
| it to work reliably on _any_ UNIX pathname. It must be totally encoding
| blind. If I speak to the os.* interface to open a file, I expect to hand
| it bytes and have it behave. As an explicit example, I would be just fine
| with python's open(filename, "w") to take a string and encode it for use,
| but _not_ ok for os.open() to require me to supply a string and cross
| my fingers and hope something sane happens when it is turned into bytes
| for the UNIX system call.
|
| I'm very much in favour of being able to work in strings for most
| purposes, but if I use the os.* interfaces on a UNIX system it is
| necessary to be _able_ to work in bytes, because UNIX file pathnames
| are bytes.

Just to follow up to my own words here, I would be ok for all the
pure-byte stuff to be off in the "posix" module if os.* goes pure
character instead of bytes or bytes+strings.
--
Cameron Simpson DoD#743
http://www.cskk.ezoshosting.com/cs/

... that, in a few years, all great physical constants will have been
approximately estimated, and that the only occupation which will be left
to men of science will be to carry these measurements to another place
of decimals.
- James Clerk Maxwell (1813-1879)
  Scientific Papers 2, 244, October 1871
Re: PEP 383: Non-decodable Bytes in System Character Interfaces
On 22Apr2009 08:50, Martin v. Löwis wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

| the C APIs however allow
| passing arbitrary bytes - whether these conform to a certain encoding
| or not.

Indeed.

| This PEP proposes a means of dealing with such irregularities
| by embedding the bytes in character strings in such a way that allows
| recreation of the original byte string.
[...]

So you're proposing that all POSIX OS interfaces (which use byte
strings) interpret those byte strings into Python3 str objects, with a
codec that will accept arbitrary byte sequences losslessly and is
totally reversible, yes?

And, I hope, that the os.* interfaces silently use it by default.

| For most applications, we assume that they eventually pass data
| received from a system interface back into the same system
| interfaces. For example, an application invoking os.listdir() will
| likely pass the result strings back into APIs like os.stat() or
| open(), which then encodes them back into their original byte
| representation. Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing "python-escape" as the error handler
| name.

-1

This last sentence kills the idea for me, unless I'm missing something.
Which I may be, of course.

POSIX filesystems _do_not_ have a file system encoding.

The user's environment suggests a preferred encoding via the locale
stuff, and apps honouring that will make nice looking byte strings as
filenames for that user. (Some platforms, like MacOSX' HFS filesystems,
_do_ enforce an encoding, and a quite specific variety of UTF-8 it is;
I would say they're not a full UNIX filesystem _precisely_ because they
reject certain byte strings that are valid on other UNIX filesystems.
What will your proposal do here? I can imagine it might cope with
existing names, but what happens when the user creates a new name?)

Further, different users can use different locales and encodings. If
they do it in different work areas they'll be perfectly happy; if they
do it in a shared area doubtless confusion will reign, but only in the
users' minds, not in the filesystem.

If I'm writing a general purpose UNIX tool like chmod or find, I expect
it to work reliably on _any_ UNIX pathname. It must be totally encoding
blind. If I speak to the os.* interface to open a file, I expect to hand
it bytes and have it behave. As an explicit example, I would be just
fine with python's open(filename, "w") to take a string and encode it
for use, but _not_ ok for os.open() to require me to supply a string and
cross my fingers and hope something sane happens when it is turned into
bytes for the UNIX system call.

I'm very much in favour of being able to work in strings for most
purposes, but if I use the os.* interfaces on a UNIX system it is
necessary to be _able_ to work in bytes, because UNIX file pathnames are
bytes.

If there isn't a byte-safe os.* facility in Python3, it will simply be
unsuitable for writing low level UNIX tools. And I very much like using
Python2 for that.

Finally, I have a small python program whose whole purpose in life is to
transcode UNIX filenames before transfer to a MacOSX HFS directory,
because of HFS's enforced particular encoding. What approach should a
Python app take to transcode UNIX pathnames under your scheme?

Cheers,
--
Cameron Simpson DoD#743
http://www.cskk.ezoshosting.com/cs/

The nice thing about standards is that you have so many to choose from;
furthermore, if you do not like any of them, you can just wait for next
year's model.
- Andrew S. Tanenbaum
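The byte-safe access asked for in this message can be illustrated with the os module as it exists in Python 3, where the path-taking functions accept bytes and answer in bytes. This is a small sketch under assumptions: the temporary directory and the file name `example.dat` are invented for the demonstration, and a deliberately plain ASCII name is used so the snippet also runs on filesystems that reject undecodable names.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Convert the directory path to bytes once, then stay in bytes.
    dir_b = os.fsencode(tmp)
    path_b = os.path.join(dir_b, b"example.dat")

    # os.open() takes the byte string as the pathname.
    fd = os.open(path_b, os.O_WRONLY | os.O_CREAT, 0o600)
    os.close(fd)

    # A bytes argument to listdir() yields bytes names in return.
    names = os.listdir(dir_b)
```

On a typical POSIX filesystem the same calls work with names such as b"caf\xe9" that are not valid in the locale's encoding, since no decoding step is involved.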
Re: PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 21, 11:50 pm, "Martin v. Löwis" wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.

Basically the scheme doesn't work. Aside from that, it is very close.

There are tons of encoding schemes that could work... they don't have to
include half-surrogates or bytes. What they have to do, is make sure
that they are uniformly applied to all appropriate strings.

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was
funny-decoded from a bytes API... and thus, there is no means of
reliably ascertaining whether a particular filename str should be passed
to a str API, or funny-encoded back to bytes.

The assumption in the 2nd Discussion paragraph may hold for a large
percentage of cases, maybe even including some number of 9s, but it is
not guaranteed, and cannot be enforced, therefore there are cases that
could fail. Whether those failure cases are a concern or not is an open
question. Picking a character (I don't find U+F01xx in the Unicode
standard, so I don't know what it is) that is obscure, and unlikely to
be used in "real" file names, might help the heuristic nature of the
encoding and decoding avoid most conflicts, but provides no guarantee
that data puns will not occur in practice. Today's obscure character is
tomorrow's commonly used character, perhaps. Someone not on this list
may be happily using that character for their own nefarious,
incompatible purpose.

As I realized in the email-sig, in talking about decoding corrupted
headers, there is only one way to guarantee this... to encode _all_
character sequences, from _all_ interfaces. Basically it requires
reserving an escape character (I'll use ? in these examples -- yes, an
ASCII question mark -- happens to be illegal in Windows filenames so all
the better on that platform, but the specific character doesn't
matter... avoiding / \ and . is probably good, though).

So the rules would be, when obtaining a file name from the bytes OS
interface, that doesn't properly decode according to UTF-8, decode it by
placing a ? at the beginning, then for each decodable UTF-8 sequence,
add a Unicode character -- unless the character is ?, in which case you
add two ??, and for each non-decodable byte sequence, place a ? and two
hex digits, or a ? and a half surrogate code, or a ? and whatever
gibberish you like. Two hex digits are fine by me, and will serve for
this discussion.

ALSO, when obtaining a file name from the str OS interfaces, encode it
too... if it contains a ? at the front, it must be replaced by ??? and
then any other ? in the name doubled.

Then you have a string that can/must be encoded to be used on either str
or bytes OS interfaces... or any other interfaces that want str or
bytes... but whichever they want, you can do a decode, or determine that
you can't, into that form. The encode and decode functions should be
available for coders to use, that code to external interfaces, either OS
or 3rd party packages, that do not use this encoding scheme. This
encoding scheme would be used throughout all Python APIs (most of which
would need very little change to accommodate it). However, programs
would have to keep track of whether they were dealing with encoded or
unencoded strings, if they use both types in their program (an example,
is hard-coded file names or file name parts).

The initial ? is not strictly necessary for this scheme to work, but I
think it would be a good flag to the user that this name has been
altered.

This scheme does not depend on assumptions about the use of file names.

This scheme would be enhanced if the file name APIs returned a subtype
of str for the encoded names, but that should be considered only a hint,
not a requirement.

When encoding file name strings to pass to bytes APIs, the ? followed by
two hex digits would be converted to a byte. Leading ? would be dropped,
and ?? would convert to ?. I don't believe failures are possible when
encoding to bytes.

When encoding file name strings to pass to str APIs, the discovery of ?
followed by two hex digits would raise an exception, the file name is
not acceptable to a str API. However, leading ? would be dropped, and ??
would convert to ?, and if no ? followed by two hex digits were found,
the file name would be successfully converted for use on the str API.

Note that not even on Unix/Posix is it particularly easy nor useful to
place a ? into file names from command lines due to shell escapes, etc.
The use of ? in file names also interferes with easy ability to
specifically match them in globs, etc.

Anything short of such an encoding of both types of interfaces, such
that it is known that all python-manipulated filenames will be encoded,
will have data puns that provide a potential for failure in edge cases.

Note that in this scheme, no file names that are fully Unicode and d
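One possible reading of the bytes-interface half of the ?-escape rules above is sketched below. Several things here are assumptions rather than part of the proposal as written: the function names `funny_from_bytes` and `funny_to_bytes` are hypothetical, two uppercase hex digits are used for each undecodable byte, names that decode cleanly and do not start with ? are passed through unchanged, and Python's "surrogateescape" handler is borrowed purely as a convenient way to locate the undecodable bytes.

```python
def funny_from_bytes(raw: bytes) -> str:
    """Hypothetical: bytes file name -> ?-escaped str (one reading)."""
    # surrogateescape marks each undecodable byte as U+DC80..U+DCFF.
    text = raw.decode("utf-8", "surrogateescape")
    clean = not any(0xDC80 <= ord(c) <= 0xDCFF for c in text)
    if clean and not text.startswith("?"):
        return text                      # assumed: no escaping needed
    parts = ["?"]                        # leading flag character
    for c in text:
        if 0xDC80 <= ord(c) <= 0xDCFF:
            parts.append("?%02X" % (ord(c) - 0xDC00))  # ? + two hex digits
        elif c == "?":
            parts.append("??")           # literal ? is doubled
        else:
            parts.append(c)
    return "".join(parts)

def funny_to_bytes(name: str) -> bytes:
    """Hypothetical inverse: ?-escaped str -> original bytes."""
    if not name.startswith("?"):
        return name.encode("utf-8")
    out = bytearray()
    i = 1                                # drop the leading flag
    while i < len(name):
        if name[i] != "?":
            out += name[i].encode("utf-8")
            i += 1
        elif name[i + 1:i + 2] == "?":
            out += b"?"                  # ?? converts back to ?
            i += 2
        else:
            out.append(int(name[i + 1:i + 3], 16))  # ?XX -> one byte
            i += 3
    return bytes(out)

escaped = funny_from_bytes(b"caf\xe9?.txt")
```

Under these assumptions a name such as b"?x?y" becomes "???x??y", matching the "replaced by ??? and then any other ? doubled" rule, and the str-API direction described in the message would be the same parser raising an exception when it meets a ? followed by two hex digits.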
Re: PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
[snip]
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.
>
> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.
>
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

If the byte stream happens to include a sequence which decodes to
U+F01xx, shouldn't that raise an exception?
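The same collision question can be asked of the half-surrogate variant quoted above: what if the bytes themselves spell out one of the escape code points? The sketch below pokes at that case. It uses the "surrogateescape" and "surrogatepass" handlers of released Python 3, which is an assumption relative to this thread (the draft names here are "python-escape" and "utf-8b"), and it says nothing about how the U+F01xx draft would have behaved.

```python
# Bytes that would be the UTF-8 form of the lone surrogate U+DCE9.
# "surrogatepass" is used only to build this awkward input.
encoded_surrogate = "\udce9".encode("utf-8", "surrogatepass")

# A strict UTF-8 decode rejects such bytes outright.
try:
    encoded_surrogate.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# With the escape handler each of the bytes is escaped on its own, so
# the result differs from what a single raw 0xE9 byte decodes to.
escaped = encoded_surrogate.decode("utf-8", "surrogateescape")
single = b"\xe9".decode("utf-8", "surrogateescape")

# Both still encode back to exactly the bytes they came from.
back_escaped = escaped.encode("utf-8", "surrogateescape")
back_single = single.encode("utf-8", "surrogateescape")
```

In other words, with the surrogate-based handler the pun is avoided on the decoding side without raising: the offending bytes are treated as undecodable and escaped byte by byte.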