Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Ned Deily (ND) wrote:

ND> In article , Piet van Oostrum wrote:
>>> > Ronald Oussoren (RO) wrote:
>>> RO> For what it's worth, the OSX API's seem to behave as follows:
>>> RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the
>>> RO> system automaticly encodes the name.
>>>
>>> RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
>>> RO> instead of the name you'd expect on a unix system.
>>>
>>> Not for me (I am using Python 2.6.2).
>>>
>>> >>> f = open(chr(255), 'w')
>>> Traceback (most recent call last):
>>>   File "", line 1, in
>>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>>>
ND> What version of OSX are you using? On Tiger 10.4.11 I see the failure
ND> you see but on Leopard 10.5.6 the behavior Ronald reports.

Yes, I am using Tiger (10.4.11). Interesting that it has changed on
Leopard.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight writes:

 > in python. It seems like the most common reason why people want to use
 > SJIS is to make old pre-unicode apps work right in WINE -- in which
 > case it doesn't actually affect unix python at all.

Mounting external drives, especially USB memory sticks which tend to be
FAT-initialized by the manufacturers, is another common case. But I
don't understand why PEP 383 needs to care at all.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:

> Ronald Oussoren (RO) wrote:
RO> For what it's worth, the OSX API's seem to behave as follows:
RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the
RO> system automaticly encodes the name.

RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
RO> instead of the name you'd expect on a unix system.

Not for me (I am using Python 2.6.2).

>>> f = open(chr(255), 'w')
Traceback (most recent call last):
  File "", line 1, in
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

That's odd. Which version of OSX do you use?

[~/testdir] ron...@rivendell-2[0]$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.5.6
BuildVersion:   9G55
[~/testdir] ron...@rivendell-2[0]$ /usr/bin/python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']
>>>

And likewise with python 2.6.1+ (after cleaning the directory):

[~/testdir] ron...@rivendell-2[0]$ python2.6
Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']
>>>

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be
unpacked on a HFS+ filesystem.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote:
> You can get the same error on Linux:
>
> $ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more
> information.
>
> >>> f=open(chr(255),'w')
>
> Traceback (most recent call last):
>   File "", line 1, in
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

Works for me under Fedora using ext3 as the file system.

$ python2.6
Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13)
[GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
>>> f.close()
>>> import os
>>> os.remove(chr(255))
>>>

Given that chr(255) is a valid filename on my file system, I would
consider it a bug if Python couldn't deal with a file with that name.

--
Steven D'Aprano
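[For reference, this byte-255 filename is exactly the case PEP 383 addresses in Python 3. A sketch, assuming a POSIX filesystem such as ext3/ext4 that accepts arbitrary bytes in names, of how such a file comes back from os.listdir():]

```python
import os
import tempfile

# Create a file whose name is the single byte 0xFF, as in the
# chr(255) examples above (bytes paths bypass any text decoding).
d = tempfile.mkdtemp()
with open(os.path.join(d.encode(), b"\xff"), "wb") as f:
    f.write(b"x")

# Listing the directory with a str path returns str names; the
# undecodable byte 0xFF surfaces as the lone surrogate U+DCFF.
names = os.listdir(d)
assert names == ["\udcff"]

# The surrogate round-trips back to the original byte.
assert os.fsencode(names[0]) == b"\xff"
```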
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote:
>> Not for me (I am using Python 2.6.2).
>>
>> >>> f = open(chr(255), 'w')
>> Traceback (most recent call last):
>>   File "", line 1, in
>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>
> You can get the same error on Linux:
>
> $ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> f=open(chr(255),'w')
> Traceback (most recent call last):
>   File "", line 1, in
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>
> (Some file system drivers do not enforce valid utf8 yet, but I suspect
> they will in the future.)

Do you suspect that from discussing the issue with kernel developers or
reading a thread on lkml? If not, then your suspicion seems to be pretty
groundless.

The fact that VFAT enforces an encoding does not lend itself to your
argument, for two reasons:

1) VFAT is not a Unix filesystem. It's a filesystem that's compatible
with Windows/DOS. If Windows and DOS have filesystem encodings, then it
makes sense for that driver to enforce that as well. Filesystems
intended to be used natively on Linux/Unix do not necessarily make this
design decision.

2) The encoding is specified when mounting the filesystem. This means
that you can still mix encodings in a number of ways. If you mount with
an encoding that has full byte coverage, for instance, each user can put
filenames from different encodings on there. If you mount with utf8 on a
system which uses euc-jp as the default encoding, you can have full
paths that contain a mix of utf-8 and euc-jp. Etc.

-Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote:
> On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:
>> I think you are right. I have now excluded ASCII bytes from being
>> mapped, effectively not supporting any encodings that are not ASCII
>> compatible. Does that sound ok?
>
> Yes. The practical upshot of this is that users who brokenly use
> "ja_JP.SJIS" as their locale (which, note, first requires editing some
> files in /var/lib/locales manually to enable its use..) may still have
> python not work with invalid-in-shift-jis filenames. Since that locale
> is widely recognized as a bad idea to use, and is not supported by any
> distros, it certainly doesn't bother me that it isn't 100% supported
> in python.
>
> It seems like the most common reason why people want to use SJIS is to
> make old pre-unicode apps work right in WINE -- in which case it
> doesn't actually affect unix python at all.
>
> I'd personally be fine with python just declaring that the
> filesystem-encoding will *always* be utf-8b and ignore the
> locale...but I expect some other people might complain about that. Of
> course, application authors can decide to do that themselves by
> calling sys.setfilesystemencoding('utf-8b') at the start of their
> program.

It seems to me that the 3.1+ doc set (or wiki) could be usefully
extended with a How-to on working with filenames. I am not sure that
everything useful fits anywhere in particular in the ref manuals.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:

>>> How do get a printable unicode version of these path strings if
>>> they contain none unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid unicode
> that I can print to a UTF-8 console or place as text in a UTF-8 web
> page.
>
> I think your PEP gives me a string that will not encode to valid
> UTF-8 that the outside of python world likes. Did I get this point
> wrong?

You are right. However, if your *only* requirement is that it should be
printable, then this is fairly underspecified. One way to get a
printable string would be this function

    def printable_string(unprintable):
        return ""

Ha ha! Indeed this works, but I would have to try to turn enough of the
string into a reasonable hint at the name of the file, so the user has
some chance of knowing what is being reported.

This will always return a printable version of the input string...

>> In our application we are running fedora with the assumption that
>> the filenames are UTF-8. When Windows systems FTP files to our
>> system the files are in CP-1251(?) and not valid UTF-8.
>
> That would be a bug in your FTP server, no? If you want all file
> names to be UTF-8, then your FTP server should arrange for that.

Not a bug, it's the lack of a feature. We use ProFTPd, which has just
implemented what is required. I forget the exact details - they are at
work - but when the ftp client asks for the FEAT of the ftp server, the
server can say "use UTF-8". Supporting that in the server was
apparently non-trivial.

>> Having an algorithm that says if it's a string no problem, if it's
>> bytes deal with the exceptions seems simple.
>>
>> How do I do this detection with the PEP proposal? Do I end up using
>> the byte interface and doing the utf-8 decode myself?
>
> No, you should encode using the "strict" error handler, with the
> locale encoding. If the file name encodes successfully, it's correct,
> otherwise, it's broken.

O.k. I understand.

Barry
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> Not for me (I am using Python 2.6.2).
>>
>> >>> f = open(chr(255), 'w')
>> Traceback (most recent call last):
>>   File "", line 1, in
>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>> >>>

You can get the same error on Linux:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
Traceback (most recent call last):
  File "", line 1, in
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>>>

(Some file system drivers do not enforce valid utf8 yet, but I suspect
they will in the future.)

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:

> I think you are right. I have now excluded ASCII bytes from being
> mapped, effectively not supporting any encodings that are not ASCII
> compatible. Does that sound ok?

Yes. The practical upshot of this is that users who brokenly use
"ja_JP.SJIS" as their locale (which, note, first requires editing some
files in /var/lib/locales manually to enable its use..) may still have
python not work with invalid-in-shift-jis filenames. Since that locale
is widely recognized as a bad idea to use, and is not supported by any
distros, it certainly doesn't bother me that it isn't 100% supported in
python.

It seems like the most common reason why people want to use SJIS is to
make old pre-unicode apps work right in WINE -- in which case it
doesn't actually affect unix python at all.

I'd personally be fine with python just declaring that the
filesystem-encoding will *always* be utf-8b and ignore the locale...but
I expect some other people might complain about that. Of course,
application authors can decide to do that themselves by calling
sys.setfilesystemencoding('utf-8b') at the start of their program.

James
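[A sketch, in today's Python 3 terms, of what the PEP's "utf-8b" amounts to: the filesystem encoding combined with the error handler that was eventually named "surrogateescape", later wrapped by the os.fsencode()/os.fsdecode() helpers. The byte value here is just an illustrative example:]

```python
import os

raw = b"caf\xe9"  # Latin-1 "café"; the 0xE9 byte is invalid UTF-8

# Explicitly, with the surrogateescape handler ("utf-8b" in the PEP):
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9"                      # 0xE9 -> U+DCE9
assert name.encode("utf-8", "surrogateescape") == raw

# os.fsdecode()/os.fsencode() apply the same treatment using the
# interpreter's filesystem encoding; the round trip is lossless.
assert os.fsencode(os.fsdecode(raw)) == raw
```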
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Barry Scott wrote:
> On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:
>>> How do get a printable unicode version of these path strings if
>>> they contain none unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid unicode
> that I can print to a UTF-8 console or place as text in a UTF-8 web
> page.
>
> I think your PEP gives me a string that will not encode to valid
> UTF-8 that the outside of python world likes. Did I get this point
> wrong?
>
> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply
> return a string if it's valid utf-8 and otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.

> In our application we are running fedora with the assumption that the
> filenames are UTF-8. When Windows systems FTP files to our system the
> files are in CP-1251(?) and not valid UTF-8.
>
> What we have to do is detect these non UTF-8 filenames and get the
> users to rename them.
>
> Having an algorithm that says if it's a string no problem, if it's
> bytes deal with the exceptions seems simple.
>
> How do I do this detection with the PEP proposal? Do I end up using
> the byte interface and doing the utf-8 decode myself?

What do you do currently? The PEP just offers a way of reading all
filenames as Unicode, if that's what you want. So what if the strings
can't be encoded to normal UTF-8! The filenames aren't valid UTF-8
anyway! :-)
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> How do get a printable unicode version of these path strings if
>>> they contain none unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid unicode
> that I can print to a UTF-8 console or place as text in a UTF-8 web
> page.
>
> I think your PEP gives me a string that will not encode to valid
> UTF-8 that the outside of python world likes. Did I get this point
> wrong?

You are right. However, if your *only* requirement is that it should be
printable, then this is fairly underspecified. One way to get a
printable string would be this function

    def printable_string(unprintable):
        return ""

This will always return a printable version of the input string...

> In our application we are running fedora with the assumption that the
> filenames are UTF-8. When Windows systems FTP files to our system the
> files are in CP-1251(?) and not valid UTF-8.

That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.

> Having an algorithm that says if its a string no problem, if its a
> byte deal with the exceptions seems simple.
>
> How do I do this detection with the PEP proposal? Do I end up using
> the byte interface and doing the utf-8 decode myself?

No, you should encode using the "strict" error handler, with the locale
encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.

Regards,
Martin
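[Martin's detection recipe can be written down directly. A sketch (the function name is hypothetical) that flags names returned by os.listdir() which carry PEP 383 escape surrogates:]

```python
def is_broken_filename(name, encoding="utf-8"):
    """Return True if `name` (a str from os.listdir()) contains bytes
    that could not be decoded with the locale encoding, i.e. it
    carries PEP 383 escape surrogates."""
    try:
        name.encode(encoding, "strict")
        return False
    except UnicodeEncodeError:
        return True

assert not is_broken_filename("héllo.txt")     # clean UTF-8 name
assert is_broken_filename("caf\udce9")         # escaped Latin-1 byte
```

This is exactly the "encode with strict" test: well-formed names encode successfully, while names containing escape characters raise UnicodeEncodeError.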
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
In article , Piet van Oostrum wrote:
>> Ronald Oussoren (RO) wrote:
> RO> For what it's worth, the OSX API's seem to behave as follows:
> RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem
> RO> the system automaticly encodes the name.
>
> RO> That is, open(chr(255), 'w') will silently create a file named
> RO> '%FF' instead of the name you'd expect on a unix system.
>
> Not for me (I am using Python 2.6.2).
>
> >>> f = open(chr(255), 'w')
> Traceback (most recent call last):
>   File "", line 1, in
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
> >>>

What version of OSX are you using? On Tiger 10.4.11 I see the failure
you see but on Leopard 10.5.6 the behavior Ronald reports.

--
Ned Deily, n...@acm.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:

>> How do get a printable unicode version of these path strings if they
>> contain none unicode data?
>
> Define "printable". One way would be to use a regular expression,
> replacing all codes in a certain range with a question mark.

What I mean by printable is that the string must be valid unicode that
I can print to a UTF-8 console or place as text in a UTF-8 web page.

I think your PEP gives me a string that will not encode to valid UTF-8
that the outside of python world likes. Did I get this point wrong?

I'm guessing that an app has to understand that filenames come in two
forms, unicode and bytes, if it's not utf-8 data. Why not simply return
a string if it's valid utf-8 and otherwise return bytes?

In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system the
files are in CP-1251(?) and not valid UTF-8.

What we have to do is detect these non UTF-8 filenames and get the
users to rename them.

Having an algorithm that says if it's a string no problem, if it's
bytes deal with the exceptions seems simple.

How do I do this detection with the PEP proposal? Do I end up using the
byte interface and doing the utf-8 decode myself?

Barry
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Ronald Oussoren (RO) wrote:

RO> For what it's worth, the OSX API's seem to behave as follows:
RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the
RO> system automaticly encodes the name.

RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
RO> instead of the name you'd expect on a unix system.

Not for me (I am using Python 2.6.2).

>>> f = open(chr(255), 'w')
Traceback (most recent call last):
  File "", line 1, in
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>>>

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be
unpacked on a HFS+ filesystem.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB wrote:
> One further question: should the encoder accept a string like
> u'\uDCC2\uDC80'? That would encode to b'\xC2\x80'

Indeed so.

> which, when decoded, would give u'\x80'.

Assuming the encoding is UTF-8, yes.

> Does the PEP only guarantee that strings decoded from the filesystem
> are reversible, but not check what might be de novo strings?

Exactly so.

Regards,
Martin
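[In today's Python 3 terms, where the PEP's handler is spelled "surrogateescape", MRAB's example plays out exactly as Martin describes: the encoding is only guaranteed reversible for strings that were produced by decoding, not for de novo strings:]

```python
# A string decoded from real bytes round-trips exactly:
b = b"\xc2\x80"                      # valid UTF-8 for U+0080
s = b.decode("utf-8", "surrogateescape")
assert s == "\x80"
assert s.encode("utf-8", "surrogateescape") == b

# A "de novo" string built from escape surrogates is NOT checked:
de_novo = "\udcc2\udc80"             # never came from decoding
encoded = de_novo.encode("utf-8", "surrogateescape")
assert encoded == b"\xc2\x80"        # same bytes as above...
assert encoded.decode("utf-8", "surrogateescape") == "\x80"
# ...so decoding gives back u'\x80', not the original de novo string.
```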
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
One further question: should the encoder accept a string like
u'\uDCC2\uDC80'? That would encode to b'\xC2\x80', which, when decoded,
would give u'\x80'. Does the PEP only guarantee that strings decoded
from the filesystem are reversible, but not check what might be de novo
strings?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Cameron Simpson writes:

 > On 29Apr2009 22:14, Stephen J. Turnbull wrote:
 > | Baptiste Carvello writes:
 > | > By contrast, if the new utf-8b codec would *supercede* the old
 > | > one, \udcxx would always mean raw bytes (at least on UCS-4
 > | > builds, where surrogates are unused). Thus ambiguity could be
 > | > avoided.
 > |
 > | Unfortunately, that's false. [Because Python strings are intended
 > | to be used as containers for widechars which are to be interpreted
 > | as Unicode when that makes sense, but there's no restriction
 > | against nonsense code points, including in UCS-4 Python.]
[...]
 > Wouldn't you then be bypassing the implicit encoding anyway, at
 > least to some extent, and thus not trip over the PEP?

Sure. I'm not really arguing the PEP here; the point is that under the
current definition of Python strings, ambiguity is unavoidable. The
best we can ask for is fewer exceptions, and an attempt to reduce
ambiguity to a bare minimum in the code paths that we open up when we
make a definition that allows a formerly erroneous computation to
succeed.

Martin is well aware of this; the PEP is clear enough about that (to
me, but I'm a mail and multilingual editor internals kinda guy). I'd
rather have more validation of strings, but *shrug* Martin's doing the
work. OTOH, the Unicode fans need to understand that the past policy of
Python is not to validate; Python is intended to provide all the tools
needed to write validating apps, but it isn't one itself.

Martin's PEP is quite narrow in that sense. All it is about is an
invertible encoding of broken encodings. It does have the downside that
it guarantees that Python itself can produce non-conforming strings,
but that's not the end of the world, and an app can keep track of them
or even refuse them by setting the error handler, if it wants to.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
[top-posting for once to preserve full quoting]

Glenn,

Could you please reduce your suggestions into sample text for the PEP?
We seem to be now at the stage where nobody is objecting to the PEP, so
the focus should be on making the PEP clearer. If you still want to
create an alternative PEP implementation, please provide step-by-step
walkthroughs, preferably in a new thread -- if you did previously
provide that, it's gotten lost in the flood of messages.

On Thu, Apr 30, 2009, Glenn Linderman wrote:
> On approximately 4/29/2009 8:46 PM, came the following characters
> from the keyboard of Terry Reedy:
>> Glenn Linderman wrote:
>>> On approximately 4/29/2009 1:28 PM, came the following characters
>>> from
>>>> So where is the ambiguity here?
>>>
>>> None. But not everyone can read all the Python source code to try
>>> to understand it; they expect the documentation to help them avoid
>>> that. Because the documentation is lacking in this area, it makes
>>> your concisely stated PEP rather hard to understand.
>>
>> If you think a section of the doc is grossly inadequate, and there
>> is no existing issue on the tracker, feel free to add one.
>>
>>> Thanks for clarifying the Windows behavior, here. A little more
>>> clarification in the PEP could have avoided lots of discussion. It
>>> would seem that a PEP, proposed to modify a poorly documented (and
>>> therefore likely poorly understood) area, should be educational
>>> about the status quo, as well as presenting the suggested change.
>>
>> Where the PEP proposes to change, it should start with the status
>> quo. But Martin's somewhat reasonable position is that since he is
>> not proposing to change behavior on Windows, it is not his
>> responsibility to document what he is not proposing to change more
>> adequately. This means, of course, that any observed change on
>> Windows would then be a bug, or at least a break of the promise. On
>> the other hand, I can see that this is enough related to what he is
>> proposing to change that better doc would help.
>
> Yes; the very fact that the PEP discusses Windows, speaks about
> cross-platform code, and doesn't explicitly state that no Windows
> functionality will change, is confusing.
>
> An example of how to initialize things within a sample cross-platform
> application might help, especially if that initialization only
> happens if the platform is POSIX, or is commented to the effect that
> it has no effect on Windows, but makes POSIX happy. Or maybe it is
> all buried within the initialization of Python itself, and is not
> exposed to the application at all. I still haven't figured that out,
> but was not (and am still not) as concerned about that as ensuring
> that the overall algorithms are functional and useful and
> user-friendly. Showing it might have been helpful in making it clear
> that no Windows functionality would change, however.
>
> A statement that additional features are being added to allow
> cross-platform programs to deal with non-decodable bytes obtained
> from POSIX APIs using the same code that already works on Windows
> would have made things much clearer. The present Abstract does, in
> fact, talk only about POSIX, but later statements about Windows muddy
> the water.
>
> Rationale paragraph 3 explicitly talks about cross-platform programs
> needing to work one way on Windows and another way on POSIX to deal
> with all the cases. It calls that a proposal, which I guess it is for
> command line and environment, but it is already implemented in both
> bytes and str forms for file names... so that further muddies the
> water.
>
> It is, of course, easier to point out deficiencies in a document than
> to write a better document; however, it is incumbent upon the PEP
> author to write a PEP that is good enough to get approved, and that
> means making it understandable enough that people are in favor... or
> to respond to the plethora of comments until people are in favor. I'm
> not sure which one is more time-consuming.
>
> I've reached the point, based on PEP and comment responses, where I
> now believe that the PEP is a solution to the problem it is trying to
> solve, and doesn't create ambiguities in the naming. I don't believe
> it is the best solution.
>
> The basic problem is the overuse of fake characters... normalizing
> them for display results in large data loss -- many characters would
> be translated to the same replacement characters.
>
> Solutions exist that would allow the use of fewer different fake
> characters in the strings, while still having a fake character as the
> escape character, to preserve the invariant that all the strings
> manipulated by python-escape from the PEP were, and become, strings
> containing fake characters (from a strict Unicode perspective), which
> is a nice invariant*. There even exist solutions that would use only
> one fake char
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I think it has to be excluded from mapping in order to not introduce
> security issues.

I think you are right. I have now excluded ASCII bytes from being
mapped, effectively not supporting any encodings that are not ASCII
compatible. Does that sound ok?

Regards,
Martin
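[The exclusion is visible in the behaviour of the handler that eventually shipped as "surrogateescape": only bytes >= 0x80 are ever mapped into the U+DC80..U+DCFF escape range, so ASCII bytes in a filename always mean themselves, and no surrogate can smuggle in an escaped '/' or other ASCII character. A quick sketch:]

```python
# Bytes 0x00-0x7F always decode as themselves; only bytes >= 0x80 can
# be escaped into U+DC80..U+DCFF (the security concern above).
name = b"a/b\xff".decode("utf-8", "surrogateescape")
assert name == "a/b\udcff"           # '/' stays a real '/'

# On the encode side the handler only accepts U+DC80..U+DCFF; a
# surrogate that would map to an ASCII byte is rejected outright.
try:
    "\udc41".encode("utf-8", "surrogateescape")   # would be 0x41, 'A'
    escaped_ascii = True
except UnicodeEncodeError:
    escaped_ascii = False
assert not escaped_ascii
```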
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.

Done!

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 7:50 PM, came the following characters from
the keyboard of Aahz:
> On Thu, Apr 30, 2009, Cameron Simpson wrote:
>> The lengthy discussion mostly revolves around:
>>
>> - Glenn points out that strings that came _not_ from listdir, and
>>   that are _not_ well-formed unicode (== "have bare surrogates in
>>   them") but that were intended for use as filenames will conflict
>>   with the PEP's scheme - programs must know that these strings came
>>   from outside and must be translated into the PEP's funny-encoding
>>   before use in the os.* functions. Previous to the PEP they would
>>   get used directly and encode differently after the PEP, thus
>>   producing different POSIX filenames. Breakage.
>>
>> - Glenn would like the encoding to use Unicode scalar values only,
>>   using a rare-in-filenames character. That would avoid the issue
>>   with "outside" strings that contain surrogates. To my mind it just
>>   moves the punning from rare illegal strings to merely uncommon but
>>   legal characters.
>>
>> - Some parties think it would be better to not return strings from
>>   os.listdir but a subclass of string (or at least a duck-type of
>>   string) that knows where it came from and is also handily
>>   recognisable as not-really-a-string for purposes of deciding
>>   whether it is PEP-funny-encoded by direct inspection.
>
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.

I'll agree that once other misconceptions were explained away, the
remaining issues are those Cameron summarized. Thanks for the summary!

Point two could be modified because I've changed my opinion; I like the
invariant Cameron first (I think) explicitly stated about the PEP as it
stands, and that I just reworded in another message: the strings that
are altered by the PEP in either direction are in the subset of strings
that contain fake (from a strict Unicode viewpoint) characters.

I still think an encoding that uses mostly real characters that have
assigned glyphs would be better than the encoding in the PEP, but would
now suggest that the escape character be a fake character.

I'll note here that while the PEP encoding causes illegal bytes to be
translated to one fake character, the 3-byte sequence that looks like
the range of fake characters would also be translated to a sequence of
3 fake characters. This is 512 combinations that must be translated,
and understood by the user (or at least by the programmer). The "escape
sequence" approach requires changing only 257 combinations, and each
altered combination would result in exactly 2 characters. Hence, this
seems simpler to understand, and to manually encode and decode for
debugging purposes.

--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration
Networking
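[Glenn's point about the 3-byte sequences can be checked concretely with today's "surrogateescape" handler: the UTF-8-style encoding of an escape character such as U+DC80 is the byte sequence ED B2 80, which strict UTF-8 rejects, so a filename literally containing those bytes decodes to three escape characters rather than one:]

```python
# b'\xed\xb2\x80' is the CESU-8-style encoding of U+DC80 itself.
# Strict UTF-8 rejects encoded surrogates, so each of the three bytes
# is individually escaped -- one odd name becomes three fake chars.
raw = b"\xed\xb2\x80"
s = raw.decode("utf-8", "surrogateescape")
assert s == "\udced\udcb2\udc80"

# The round trip is still lossless:
assert s.encode("utf-8", "surrogateescape") == raw
```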
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy:
> Glenn Linderman wrote:
>>> So where is the ambiguity here?
>>
>> None. But not everyone can read all the Python source code to try to
>> understand it; they expect the documentation to help them avoid that.
>> Because the documentation is lacking in this area, it makes your
>> concisely stated PEP rather hard to understand.
>
> If you think a section of the doc is grossly inadequate, and there is
> no existing issue on the tracker, feel free to add one.
>
>> Thanks for clarifying the Windows behavior, here. A little more
>> clarification in the PEP could have avoided lots of discussion. It
>> would seem that a PEP, proposed to modify a poorly documented (and
>> therefore likely poorly understood) area, should be educational about
>> the status quo, as well as presenting the suggested change.
>
> Where the PEP proposes to change, it should start with the status quo.
> But Martin's somewhat reasonable position is that since he is not
> proposing to change behavior on Windows, it is not his responsibility
> to document what he is not proposing to change more adequately. This
> means, of course, that any observed change on Windows would then be a
> bug, or at least a break of the promise. On the other hand, I can see
> that this is enough related to what he is proposing to change that
> better doc would help.

Yes; the very fact that the PEP discusses Windows, speaks about cross-platform code, and doesn't explicitly state that no Windows functionality will change, is confusing.

An example of how to initialize things within a sample cross-platform application might help, especially if that initialization only happens when the platform is POSIX, or is commented to the effect that it has no effect on Windows, but makes POSIX happy. Or maybe it is all buried within the initialization of Python itself, and is not exposed to the application at all. I still haven't figured that out, but was not (and am still not) as concerned about that as about ensuring that the overall algorithms are functional, useful, and user-friendly. Showing it might have been helpful in making it clear that no Windows functionality would change, however.

A statement that additional features are being added to allow cross-platform programs to deal with non-decodable bytes obtained from POSIX APIs, using the same code that already works on Windows, would have made things much clearer. The present Abstract does, in fact, talk only about POSIX, but later statements about Windows muddy the water. Rationale paragraph 3 explicitly talks about cross-platform programs needing to work one way on Windows and another way on POSIX to deal with all the cases. It calls that a proposal, which I guess it is for the command line and environment, but it is already implemented in both bytes and str forms for file names... so that further muddies the water.

It is, of course, easier to point out deficiencies in a document than to write a better document; however, it is incumbent upon the PEP author to write a PEP that is good enough to get approved, and that means making it understandable enough that people are in favor... or responding to the plethora of comments until people are in favor. I'm not sure which one is more time-consuming.

I've reached the point, based on PEP and comment responses, where I now believe that the PEP is a solution to the problem it is trying to solve, and doesn't create ambiguities in the naming. I don't believe it is the best solution. The basic problem is the overuse of fake characters: normalizing them for display results in large data loss -- many characters would be translated to the same replacement characters.

Solutions exist that would allow the use of fewer different fake characters in the strings, while still having a fake character as the escape character, to preserve the invariant that all the strings altered by python-escape from the PEP were, and become, strings containing fake characters (from a strict Unicode perspective), which is a nice invariant*. There even exist solutions that would use only one fake character (repeatedly if necessary), where all other characters generated would be displayable characters. This would ease the burden on the program in displaying the strings, and also on the user who might view the resulting mojibake while trying to differentiate one such string from another. Those are outlined in various emails in this thread, although some include my misconception that strings obtained via Unicode-enabled OS APIs would also need to be encoded and altered. If there is any interest in using a more readable encoding, I'd be glad to rework them to remove those misconceptions.

* It would be nice to point out that invariant in the PEP, also.

-- Glenn -- http://nevcal.com/
=== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, Apr 29, 2009 at 23:03, Terry Reedy wrote:
> Thomas Breuel wrote:
>>> Sure. However, that requires you to provide meaningful, reproducible
>>> counter-examples, rather than a steganographic formulation that might
>>> hint at some problem you apparently see (which I believe is just not
>>> there).
>>
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of
>> half surrogates.
>
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows
> that.

If we use conformance to Unicode 5.1 as the basis for our discussion, then PEP 383 is off the table anyway. I'm all for strict Unicode compliance. But apparently, the Python community doesn't care.

CESU-8 is described in Unicode Technical Report #26, so it at least has some official recognition. More importantly, it's also widely used. So, my question: what are the implications of PEP 383 for CESU-8 encodings in Python?

My meta-point is: there are probably many more such issues hidden away, and it is a really bad idea to rush something like PEP 383 out. Unicode is hard anyway, and tinkering with its semantics requires a lot of thought.

Tom
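The half-surrogate question can be checked directly in Python 3, where the strict utf-8 codec follows the Unicode 5.x rule Terry cites and rejects lone surrogates outright. The `surrogatepass` error handler (added later, in Python 3.1) is what produces the CESU-8-style three-byte sequences Tom describes; the 2.x behavior he refers to is not shown here.

```python
# Python 3's strict UTF-8 codec refuses to encode an unpaired surrogate,
# matching the Unicode 5.x definition of UTF-8.
lone = "\ud800"  # unpaired high surrogate

strict_rejects = False
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    strict_rejects = True
assert strict_rejects

# The "surrogatepass" error handler emits the CESU-8-style byte sequence
# for the surrogate, and can decode it back again.
cesu = lone.encode("utf-8", "surrogatepass")
assert cesu == b"\xed\xa0\x80"
assert cesu.decode("utf-8", "surrogatepass") == lone
```

So code that needs CESU-8-style round trips has an explicit opt-in; the default codec stays strictly conformant.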
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Thanks for clarifying the Windows behavior, here. A little more
> clarification in the PEP could have avoided lots of discussion. It
> would seem that a PEP, proposed to modify a poorly documented (and
> therefore likely poorly understood) area, should be educational about
> the status quo, as well as presenting the suggested change. Or is it
> the Python philosophy that the PEPs should be as incomprehensible as
> possible, to generate large discussions?

Certainly not. See PEP 277 for a specification of how file names are handled on Windows.

Large discussions could be reduced if readers would try to constructively comment on the PEP, rather than making counter-proposals, or making statements about the PEP without making their implied assumptions explicit.

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> How do I get a printable unicode version of these path strings if they
> contain non-unicode data?

Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply return
> str if it's valid utf-8, otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it.

Regards, Martin
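Martin's regular-expression suggestion can be sketched concretely. The range replaced here is U+DC80..U+DCFF, the half surrogates the PEP uses to smuggle non-decodable bytes; the function name is just for illustration.

```python
import re

def printable(name: str) -> str:
    """Replace PEP 383's smuggled-byte surrogates (U+DC80..U+DCFF)
    with '?' to get a safely displayable approximation of a filename."""
    return re.sub("[\udc80-\udcff]", "?", name)

# A name whose undecodable byte came back as a lone surrogate:
assert printable("caf\udce9.txt") == "caf?.txt"
# Ordinary names pass through unchanged:
assert printable("plain.txt") == "plain.txt"
```

The result is lossy (many distinct bytes collapse to '?'), which is exactly the display trade-off discussed elsewhere in this thread; the original string should be kept for passing back to os.* functions.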
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
>>> So where is the ambiguity here?
>>
>> None. But not everyone can read all the Python source code to try to
>> understand it; they expect the documentation to help them avoid that.
>> Because the documentation is lacking in this area, it makes your
>> concisely stated PEP rather hard to understand.

If you think a section of the doc is grossly inadequate, and there is no existing issue on the tracker, feel free to add one.

>> Thanks for clarifying the Windows behavior, here. A little more
>> clarification in the PEP could have avoided lots of discussion. It
>> would seem that a PEP, proposed to modify a poorly documented (and
>> therefore likely poorly understood) area, should be educational about
>> the status quo, as well as presenting the suggested change.

Where the PEP proposes to change, it should start with the status quo. But Martin's somewhat reasonable position is that since he is not proposing to change behavior on Windows, it is not his responsibility to document what he is not proposing to change more adequately. This means, of course, that any observed change on Windows would then be a bug, or at least a break of the promise. On the other hand, I can see that this is enough related to what he is proposing to change that better doc would help.

tjr
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Thu, Apr 30, 2009, Cameron Simpson wrote:
>
> The lengthy discussion mostly revolves around:
>
> - Glenn points out that strings that came _not_ from listdir, and that are
>   _not_ well-formed unicode (== "have bare surrogates in them") but that
>   were intended for use as filenames will conflict with the PEP's scheme -
>   programs must know that these strings came from outside and must be
>   translated into the PEP's funny-encoding before use in the os.*
>   functions. Previous to the PEP they would get used directly and
>   encode differently after the PEP, thus producing different POSIX
>   filenames. Breakage.
>
> - Glenn would like the encoding to use Unicode scalar values only,
>   using a rare-in-filenames character.
>   That would avoid the issue with "outside" strings that contain
>   surrogates. To my mind it just moves the punning from rare illegal
>   strings to merely uncommon but legal characters.
>
> - Some parties think it would be better to not return strings from
>   os.listdir but a subclass of string (or at least a duck-type of
>   string) that knows where it came from and is also handily
>   recognisable as not-really-a-string for purposes of deciding
>   whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be incorporated into the PEP.

-- Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/ "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur." --Red Adair
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 23:41, Barry Scott wrote:
> On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Forgive me if this has been covered. I've been reading this thread for a
> long time and still have a hundred-odd replies to go...
>
> How do I get a printable unicode version of these path strings if they
> contain non-unicode data?

Personally, I'd use repr(). One might ask, what would you expect to see if you were printing such a string?

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply return
> str if it's valid utf-8, otherwise return bytes? Then in the app you
> check the type of the object, str or bytes, and deal with reporting
> errors appropriately.

Because it complicates the app enormously, for every app. It would be _nice_ to just call os.listdir() et al with strings, get strings, and not worry.

With strings becoming unicode in Python3, on POSIX you have the issue of deciding how to get its filenames-are-bytes into a string and the reverse. One could naively map the byte values to the same Unicode code points, but that results in strings that do not contain the same characters as the user/app expects for byte values above 127.

Since POSIX does not really have a filesystem-level character encoding, just a user environment setting that says how the current user encodes characters into bytes (UTF-8 is increasingly common and useful, but it is not universal), it is more useful to decode filenames on the assumption that they represent characters in the user's (current) encoding convention; that way when things are displayed they are meaningful, and they interoperate well with strings made by the user/app.

If all the filenames were actually encoded that way when made, that works. But different users may adopt different conventions, and indeed a user may have used ASCII or an ISO8859-* coding in the past and be transitioning to something else now, so they will have a bunch of files in different encodings.

The PEP uses the user's current encoding with a handler for byte sequences that don't decode to valid Unicode scalar values, in a fashion that is reversible. That is, you get "strings" out of listdir() and those strings will go back in (eg to open()) perfectly robustly.

Previous approaches would either silently hide non-decodable names in listdir() results, or throw exceptions when the decode failed, or mangle things non-reversibly. I believe Python3 went with the first option there. The PEP at least lets programs naively access all files that exist, and create a filename from any well-formed unicode string provided that the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

- Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== "have bare surrogates in them") but that were intended for use as filenames will conflict with the PEP's scheme: programs must know that these strings came from outside and must be translated into the PEP's funny-encoding before use in the os.* functions. Previous to the PEP they would get used directly, and they encode differently after the PEP, thus producing different POSIX filenames. Breakage.

- Glenn would like the encoding to use Unicode scalar values only, using a rare-in-filenames character. That would avoid the issue with "outside" strings that contain surrogates. To my mind it just moves the punning from rare illegal strings to merely uncommon but legal characters.

- Some parties think it would be better to not return strings from os.listdir but a subclass of string (or at least a duck-type of string) that knows where it came from and is also handily recognisable as not-really-a-string for purposes of deciding whether it is PEP-funny-encoded by direct inspection.

Cheers,
-- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ The peever can look at the best day in his life and sneer at it. - Jim Hill, JennyGfest '95
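The round-trip invariant Cameron describes ("strings out of listdir() will go back in perfectly robustly") is what later shipped in Python 3 as os.fsdecode() and os.fsencode(), built on the PEP's error handler (eventually named "surrogateescape"). A quick check of the invariant, without touching the filesystem; the exact decoded string depends on the locale, so only the round trip is asserted:

```python
import os

# A POSIX-style filename that is not valid UTF-8.
raw = b"ab\xffc"

# fsdecode() applies the filesystem encoding with the PEP 383 error
# handler: any non-decodable byte is smuggled into the str as a lone
# surrogate (U+DCFF here, under a UTF-8 or ASCII locale on POSIX).
name = os.fsdecode(raw)

# fsencode() reverses the smuggling, restoring the exact original bytes.
assert os.fsencode(name) == raw
```

So an application can pass listdir() results straight back to open() and friends without ever inspecting the smuggled bytes.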
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Forgive me if this has been covered. I've been reading this thread for a long time and still have a hundred-odd replies to go...

How do I get a printable unicode version of these path strings if they contain non-unicode data?

I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not utf-8 data. Why not simply return str if it's valid utf-8, otherwise return bytes? Then in the app you check the type of the object, str or bytes, and deal with reporting errors appropriately.

Barry
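The utf-8b behavior Martin describes shipped in later Pythons as the "surrogateescape" error handler on the regular utf-8 codec, so the mapping of non-decodable bytes into U+DC80..U+DCFF can be seen directly:

```python
raw = b"ab\xff.txt"  # not valid UTF-8: 0xff cannot appear in UTF-8

# Non-decodable bytes (>= 0x80) come back as half surrogates
# U+DC80..U+DCFF, exactly as the utf-8b description says.
name = raw.decode("utf-8", "surrogateescape")
assert name == "ab\udcff.txt"

# Encoding with the same handler restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw
```

The resulting str is not well-formed Unicode (it contains a bare surrogate), which is precisely why printing it needs special handling, as discussed in the replies.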
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 22:14, Stephen J. Turnbull wrote:
| Baptiste Carvello writes:
| > By contrast, if the new utf-8b codec would *supercede* the old one,
| > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
| > surrogates are unused). Thus ambiguity could be avoided.
|
| Unfortunately, that's false. It could have come from a literal string
| (similar to the text above ;-), a C extension, or a string slice (on
| 16-bit builds), and there may be other ways to do it. The only way to
| avoid ambiguity is to change the definition of a Python string to be
| *valid* Unicode (possibly with Python extensions such as PEP 383 for
| internal use only). But Guido has rejected that in the past;
| validation is the application's problem, not Python's.
|
| Nor is a UCS-4 build exempt. IIRC Guido specifically envisioned
| Python strings being used to build up code point sequences to be
| directly output, which means that a UCS-4 string might nonetheless
| contain surrogates being added to a string intended to be sent as
| UTF-16 output simply by truncating the 32-bit code units to 16 bits.

Wouldn't you then be bypassing the implicit encoding anyway, at least to some extent, and thus not trip over the PEP?

-- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Clemson is the Harvard of cardboard packaging. - overheard by WIRED at the Intelligent Printing conference Oct2006
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 17:03, Terry Reedy wrote:
> Thomas Breuel wrote:
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a steganographic formulation that might
>> hint at some problem you apparently see (which I believe is just not
>> there).
>>
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of
>> half surrogates.
>
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

5.0 also disallows it. No surprise I guess.

-- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Out on the road, feeling the breeze, passing the cars. - Bob Seger
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 1:28 PM, came the following characters from the keyboard of Martin v. Löwis:
>>>> C. File on disk with the invalid surrogate code, accessed via the
>>>> str interface, no decoding happens, matches in memory the file on disk
>>>> with the byte that translates to the same surrogate, accessed via the
>>>> bytes interface. Ambiguity.
>>>
>>> What does that mean? What specific interface are you referring to to
>>> obtain file names?
>>
>> os.listdir("")
>>
>> os.listdir(b"")
>>
>> So I guess I'd better suggest that a specific, equivalent directory
>> name be passed in either bytes or str form.
>
> [Leaving the issue of the empty string apparently having different
> meanings aside ...]
>
> Ok. Now I understand the example. So you do
>
> os.listdir("c:/tmp")
> os.listdir(b"c:/tmp")
>
> and you have a file in c:/tmp that is named "abc\uDC10".
>
>> So what you are saying here is that Python doesn't use the "A" forms
>> of the Windows APIs for filenames, but only the "W" forms, and uses
>> lossy decoding (from MS) to the current code page (which can never be
>> UTF-8 on Windows).
>
> Actually, it does use the A form, in the second listdir example. This,
> in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
> a byte string; the listdirs should give
>
> ["abc\uDC10"]
> [b"abc?"]
>
> (not quite sure about the second - I only guess that CP_ACP will
> replace the half surrogate with a question mark). So where is the
> ambiguity here?

None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentation is lacking in this area, it makes your concisely stated PEP rather hard to understand.

Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well as presenting the suggested change. Or is it the Python philosophy that the PEPs should be as incomprehensible as possible, to generate large discussions?

-- Glenn -- http://nevcal.com/
=== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote:
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a steganographic formulation that might
>> hint at some problem you apparently see (which I believe is just not
>> there).
>
> Well, here's another one: PEP 383 would disallow UTF-8 encodings of
> half surrogates.

By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

> But such encodings are currently supported by Python, and they are used
> as part of CESU-8 coding. That's, in fact, a common way of converting
> UTF-16 to UTF-8. How are you going to deal with existing code that
> relies on being able to code half surrogates as UTF-8?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
> On approximately 4/29/2009 4:36 AM, came the following characters from
> the keyboard of Cameron Simpson:
>> On 29Apr2009 02:56, Glenn Linderman wrote:
>>> os.listdir(b"")
>>>
>>> I find that on my Windows system, with all ASCII path file names,
>>> that I get quite different results when I pass os.listdir an empty
>>> str vs an empty bytes. Rather than keep you guessing, I get the root
>>> directory contents from the empty str, and the current directory
>>> contents from an empty bytes. That is rather unexpected.
>>>
>>> So I guess I'd better suggest that a specific, equivalent directory
>>> name be passed in either bytes or str form.
>>
>> I think you may have uncovered an implementation bug rather than an
>> encoding issue (because I'd expect "" and b"" to be equivalent).
>
> Me too.

Sounds like an issue for the tracker.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> So while out of scope of the PEP, I don't think it's at all
> artificial.

Sure - but I see this as the same case as "the file got renamed". If you have an LRU list in your app, and a file gets renamed, then the LRU list breaks (unless you also store the inode number in the LRU list, and look up the file by inode number - or object UUID on NTFS, possibly using distributed link tracking).

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> C. File on disk with the invalid surrogate code, accessed via the
>>> str interface, no decoding happens, matches in memory the file on disk
>>> with the byte that translates to the same surrogate, accessed via the
>>> bytes interface. Ambiguity.
>>
>> What does that mean? What specific interface are you referring to to
>> obtain file names?
>
> os.listdir("")
>
> os.listdir(b"")
>
> So I guess I'd better suggest that a specific, equivalent directory name
> be passed in either bytes or str form.

[Leaving the issue of the empty string apparently having different meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".

> So what you are saying here is that Python doesn't use the "A" forms of
> the Windows APIs for filenames, but only the "W" forms, and uses lossy
> decoding (from MS) to the current code page (which can never be UTF-8 on
> Windows).

Actually, it does use the A form, in the second listdir example. This, in turn (inside Windows), uses the lossy CP_ACP encoding. You get back a byte string; the listdirs should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace the half surrogate with a question mark). So where is the ambiguity here?

> You are further saying that Python doesn't give the programmer control
> over the codec that is used to convert from W results to bytes, so that
> on Windows, it is impossible to obtain a bytes result containing UTF-8
> from os.listdir, even though sys.setfilesystemencoding exists, and
> sys.getfilesystemencoding is affected by it, and the latter is
> documented as returning "mbcs", and as returning the codec that should
> be used by the application to convert str to bytes for filenames.
> (Python 3.0.1).

Not exactly. You *can* do setfilesystemencoding on Windows, but it has no effect, as the Python file system encoding is never used on Windows. For a string, it passes it to the W API as-is; for bytes, it passes it to the A API as-is. Python never invokes any codec here.

> While I can hear a "that is outside the scope of the PEP" coming, this
> documentation is confusing, to say the least.

Only because you are apparently unaware of the status quo. If you would study the current Python source code, it would be all very clear.

> Things are a little clearer in the documentation for
> sys.setfilesystemencoding, which does say the encoding isn't used by
> Windows -- so why is it permitted to change it, if it has no effect?

As in many cases: because nobody contributed code to make it behave otherwise. It's not that the file system encoding is "mbcs" - the file system encoding is simply unused on Windows (but that wasn't always the case, in particular not when Windows 9x still had to be supported).

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a steganographic formulation that might
>> hint at some problem you apparently see (which I believe is just not
>> there).
>
> Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
> surrogates. But such encodings are currently supported by Python, and
> they are used as part of CESU-8 coding. That's, in fact, a common way
> of converting UTF-16 to UTF-8. How are you going to deal with existing
> code that relies on being able to code half surrogates as UTF-8?

Can you please elaborate? What code specifically are you talking about?

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis" writes:
> I find the case pretty artificial, though: if the locale encoding
> changes, all file names will look incorrect to the user, so he'll
> quickly switch back, or rename all the files.

It's not necessarily the case that the locale encoding changes, but rather the name of the file. I have a couple of directories where I have Japanese in both EUC-JP and UTF-8, for example. (The applications where I never bothered to do a conversion from EUC to UTF-8 are things like stripping MIME attachments from messages and saving them to files when I changed my default.) So I have a little Emacs Lisp function that tries EUC or UTF-8 depending on date, and falls back to the other on a decode error.

Another possible situation would be a user program in the user's locale communicating with a daemon running in some other locale (quite likely POSIX).

So while out of scope of the PEP, I don't think it's at all artificial.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Baptiste Carvello writes:
> By contrast, if the new utf-8b codec would *supercede* the old one,
> \udcxx would always mean raw bytes (at least on UCS-4 builds, where
> surrogates are unused). Thus ambiguity could be avoided.

Unfortunately, that's false. It could have come from a literal string (similar to the text above ;-), a C extension, or a string slice (on 16-bit builds), and there may be other ways to do it. The only way to avoid ambiguity is to change the definition of a Python string to be *valid* Unicode (possibly with Python extensions such as PEP 383 for internal use only). But Guido has rejected that in the past; validation is the application's problem, not Python's.

Nor is a UCS-4 build exempt. IIRC Guido specifically envisioned Python strings being used to build up code point sequences to be directly output, which means that a UCS-4 string might nonetheless contain surrogates, added to a string intended to be sent as UTF-16 output simply by truncating the 32-bit code units to 16 bits.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 4:36 AM, came the following characters from the keyboard of Cameron Simpson:
> On 29Apr2009 02:56, Glenn Linderman wrote:
>> os.listdir(b"")
>>
>> I find that on my Windows system, with all ASCII path file names, that
>> I get quite different results when I pass os.listdir an empty str vs
>> an empty bytes. Rather than keep you guessing, I get the root
>> directory contents from the empty str, and the current directory
>> contents from an empty bytes. That is rather unexpected.
>>
>> So I guess I'd better suggest that a specific, equivalent directory
>> name be passed in either bytes or str form.
>
> I think you may have uncovered an implementation bug rather than an
> encoding issue (because I'd expect "" and b"" to be equivalent).

Me too.

> In ancient times, "" was a valid UNIX name for the working directory.
> POSIX disallows that, and requires people to use ".". Maybe you're
> seeing an artifact; did python move from UNIX to Windows or the other
> way around in its porting history? I'd guess the former.
>
> Do you get differing results from listdir(".") and listdir(b".") ?

No. Both are the same as b"".

> How's python2 behave for ""? (Since there's no b"" in python2.)

Python2 os.listdir("") produces the same thing as Python3 os.listdir(b"").
Python2 os.listdir(u"") produces the same thing as Python3 os.listdir("").

Another phenomenon of note: I created a directory named ábc. (Windows XP, Python 3.0.1, Python 2.6.1, SetConsoleOutputCP(65001))

Python3 os.listdir(b".") prints it as b"\xe1bc"
Python2 os.listdir(".") prints it as b"\xe1bc"
Python2 os.listdir(u".") prints it as u"\xe1bc"
Python3 os.listdir(".") prints it as "bc"

-- Glenn -- http://nevcal.com/
=== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 4:07 AM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible. Only if you as the programmer decode it. Now, I don't understand the subtleties of Unicode enough to know if Martin has already successfully addressed this concern in another fashion, but personally I think that if you as a programmer are comparing funnydecoded-str strings gotten via a string interface with normal-decoded strings gotten via a bytes interface, that we could claim that your program has a bug. Hopefully Martin will clarify the PEP as I suggested in another branch of this thread. He has eventually convinced me that this ambiguity is not possible, via email discussion, but the PEP is certainly less than sufficiently explanatory to make that obvious. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 02:56, Glenn Linderman wrote: > os.listdir(b"") > > I find that on my Windows system, with all ASCII path file names, that I > get quite different results when I pass os.listdir an empty str vs an > empty bytes. > > Rather than keep you guessing, I get the root directory contents from > the empty str, and the current directory contents from an empty bytes. > That is rather unexpected. > > So I guess I'd better suggest that a specific, equivalent directory name > be passed in either bytes or str form. I think you may have uncovered an implementation bug rather than an encoding issue (because I'd expect "" and b"" to be equivalent). In ancient times, "" was a valid UNIX name for the working directory. POSIX disallows that, and requires people to use ".". Maybe you're seeing an artifact; did python move from UNIX to Windows or the other way around in its porting history? I'd guess the former. Do you get differing results from listdir(".") and listdir(b".") ? How's python2 behave for ""? (Since there's no b"" in python2.) Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ 'Supposing a tree fell down, Pooh, when we were underneath it?' 'Supposing it didn't,' said Pooh after careful thought.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible. Only if you as the programmer decode it. Now, I don't understand the subtleties of Unicode enough to know if Martin has already successfully addressed this concern in another fashion, but personally I think that if you as a programmer are comparing funnydecoded-str strings gotten via a string interface with normal-decoded strings gotten via a bytes interface, that we could claim that your program has a bug. --David
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 12:29 AM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is *no* ambiguity in the case you construct. No Martin, the point of reviewing the PEP is to _not_ trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards. Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. What does that mean? What specific interface are you referring to to obtain file names? Most of the time, file names are obtained by the user entering them on the keyboard. GUI applications are completely out of the scope of the PEP. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes. Ok, so perhaps you might be talking about os.listdir here. Communication would be much easier if I would not need to guess what you may mean. os.listdir("") Also, why is U+DC10 four 16-bit codes? It isn't. 
First 16-bit code is U+0061 Second 16-bit code is U+0062 Third 16-bit code is U+0063 Fourth 16-bit code is U+DC10 Then, ask for the same filename via the bytes interface, using UTF-8 encoding. How do you do that on Windows? You cannot just pick an encoding, such as UTF-8, and pass that to the byte interface, and expect it to work. If you use the byte interface, you need to encode in the file system encoding, of course. Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU USING Please use specific python code. os.listdir(b"") I find that on my Windows system, with all ASCII path file names, that I get quite different results when I pass os.listdir an empty str vs an empty bytes. Rather than keep you guessing, I get the root directory contents from the empty str, and the current directory contents from an empty bytes. That is rather unexpected. So I guess I'd better suggest that a specific, equivalent directory name be passed in either bytes or str form. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can be seen as two different names in memory. You are relying on false assumptions here, namely that the UTF-8 encoding would play any role. What would happen instead is that the "mbcs" encoding would be used. The "mbcs" encoding, by design from Microsoft, will never report an error, so the error handler will not be invoked at all. So what you are saying here is that Python doesn't use the "A" forms of the Windows APIs for filenames, but only the "W" forms, and uses lossy decoding (from MS) to the current code page (which can never be UTF-8 on Windows). 
You are further saying that Python doesn't give the programmer control over the codec that is used to convert from W results to bytes, so that on Windows, it is impossible to obtain a bytes result containing UTF-8 from os.listdir, even though sys.setfilesystemencoding exists, and sys.getfilesystemencoding is affected by it, and the latter is documented as returning "mbcs", and as returning the codec that should be used by the application to convert str to bytes for filenames. (Python 3.0.1). While I can hear a "that is outside the scope of the PEP" coming, this documentation is confusing, to say the least. Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90. Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP. You were just making false assumptions in your reasoning, assumptions that are way beyond the scope of the PEP. Absolutely correct. I was making what seemed to be reasonable assumptions about Python internals on Windows, and several of them are false, including misleading documentation for listdir (which doesn't specify that bytes and str parameters affect whether or
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 12:38 AM, came the following characters from the keyboard of Baptiste Carvello: Glenn Linderman wrote: 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP's strategy is that 1 character stays 1 character. Baptiste Except for half-surrogates that are in the file names already, which get converted to 3 characters. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Sure. However, that requires you to provide meaningful, reproducible > counter-examples, rather than a stenographic formulation that might > hint some problem you apparently see (which I believe is just not > there). Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surrogates. But such encodings are currently supported by Python, and they are used as part of CESU-8 coding. That's, in fact, a common way of converting UTF-16 to UTF-8. How are you going to deal with existing code that relies on being able to code half surrogates as UTF-8? Tom
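For reference, current Python 3 resolved this the other way: the strict UTF-8 codec rejects lone surrogates, and code that needs the CESU-8-style byte sequences Tom describes must opt in explicitly with the "surrogatepass" error handler. A sketch of that behavior:

```python
# Strict UTF-8 refuses to encode a lone half surrogate...
try:
    "\ud800".encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# ...but "surrogatepass" produces the CESU-8-style three-byte
# sequence for it, and decodes it back again losslessly.
cesu = "\ud800".encode("utf-8", "surrogatepass")
assert cesu == b"\xed\xa0\x80"
assert cesu.decode("utf-8", "surrogatepass") == "\ud800"
```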
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates? The problem with your "escape character" scheme is that the meaning is lost with slicing of the strings, which is a very common operation. I thought half-surrogates were illegal in well formed Unicode. I confess to being weak in this area. By "legitimate" above I meant things like half-surrogates which, like quarks, should not occur alone? "Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. [...] Python could as well *specify* that lone surrogates are illegal, as their meaning is undefined by Unicode. If this rule is respected language-wise, there is no ambiguity. It might be unrealistic on windows, though. This rule could even be specified only for strings that represent filesystem paths. Sure, they are the same type as other strings, but the programmer usually knows if a given string is intended to be a path or not. Baptiste
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Lino Mastrodomenico wrote: Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). This is questionable. This would have the consequence that \udcxx in a python string would sometimes mean a surrogate, and sometimes mean raw bytes, depending on the history of the string. By contrast, if the new utf-8b codec would *supersede* the old one, \udcxx would always mean raw bytes (at least on UCS-4 builds, where surrogates are unused). Thus ambiguity could be avoided. Baptiste
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote: If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here: >>> '\xff'.decode('iso-8859-15') u'\xff' >>> '\xc3\xbf'.decode('iso-8859-15') u'\xc3\xbf' Here is what I mean by "switch to iso8859-15" only in the presence of undecodable UTF-8:

def file_name_to_unicode(fn, encoding):
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')

Now, assume a UTF-8 locale and try to use it on the provided example file names. >>> file_name_to_unicode(b'\xff', 'utf-8') 'ÿ' >>> file_name_to_unicode(b'\xc3\xbf', 'utf-8') 'ÿ' That is the ambiguity I was referring to -- two different byte sequences result in the same unicode string.
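The ambiguity demonstrated above, and the PEP's answer to it, can both be checked in Python 3, where the PEP's error handler was eventually named "surrogateescape" (a sketch; the fallback function mirrors the one in the message):

```python
def file_name_to_unicode(fn, encoding):
    """Decode with the locale encoding, falling back to iso-8859-15."""
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode("iso-8859-15")

# The fallback maps two different byte names to the same string:
a = file_name_to_unicode(b"\xff", "utf-8")      # undecodable -> fallback
b = file_name_to_unicode(b"\xc3\xbf", "utf-8")  # valid UTF-8 for U+00FF
assert a == b == "\xff"                         # the ambiguity

# The PEP's handler keeps them distinct, and round-trippable:
c = b"\xff".decode("utf-8", "surrogateescape")
d = b"\xc3\xbf".decode("utf-8", "surrogateescape")
assert c == "\udcff" and d == "\xff" and c != d
assert c.encode("utf-8", "surrogateescape") == b"\xff"
```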
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP's strategy is that 1 character stays 1 character. Baptiste
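The slicing hazard can be made concrete with a toy example. Here U+203D stands in as a hypothetical escape character (both encodings below are illustrative, not anything specified by the PEP or the proposal):

```python
ESC = "\u203d"                      # hypothetical escape character
two_cp = "dir" + ESC + "\u01ff"     # byte 0xFF as ESC + U+01FF (2 codepoints)
one_cp = "dir" + "\udcff"           # byte 0xFF as one surrogate (PEP style)

# Slicing off the last character strands a bare escape marker...
assert two_cp[:-1].endswith(ESC)
# ...whereas the one-codepoint encoding drops the whole escape, or none of it.
assert one_cp[:-1] == "dir"
```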
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk > with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is that an alternative to A and B? >>> I guess it is an adjunct to case B, the current PEP. >>> >>> It is what happens when using the PEP on a system that provides both >>> bytes and str interfaces, and both get used. >> >> Your formulation is a bit too stenographic to me, but please trust me >> that there is *no* ambiguity in the case you construct. > > > No Martin, the point of reviewing the PEP is to _not_ trust you, even > though you are generally very knowledgeable and very trustworthy. It is > much easier to find problems before something is released, or even > coded, than it is afterwards. Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). > You assumed, and maybe I wasn't clear in my statement. > > By "accessed via the str interface" I mean that (on Windows) the wide > string interface would be used to obtain a file name. What does that mean? What specific interface are you referring to to obtain file names? Most of the time, file names are obtained by the user entering them on the keyboard. GUI applications are completely out of the scope of the PEP. > Now, suppose that > the file name returned contains "abc" followed by the half-surrogate > U+DC10 -- four 16-bit codes. Ok, so perhaps you might be talking about os.listdir here. Communication would be much easier if I would not need to guess what you may mean. Also, why is U+DC10 four 16-bit codes? > Then, ask for the same filename via the bytes interface, using UTF-8 > encoding. How do you do that on Windows? 
You cannot just pick an encoding, such as UTF-8, and pass that to the byte interface, and expect it to work. If you use the byte interface, you need to encode in the file system encoding, of course. Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU USING Please use specific python code. > The PEP says that the above name would get translated to > "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes > used to represent the half-surrogate that is actually in the file name, > specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can > be seen as two different names in memory. You are relying on false assumptions here, namely that the UTF-8 encoding would play any role. What would happen instead is that the "mbcs" encoding would be used. The "mbcs" encoding, by design from Microsoft, will never report an error, so the error handler will not be invoked at all. > Now posit another file which, when accessed via the str interface, has > the name "abc" followed by U+DCED U+DCB0 U+DC90. > > Looks ambiguous to me. Now if you have a scheme for handling this case, > fine, but I don't understand it from what is written in the PEP. You were just making false assumptions in your reasoning, assumptions that are way beyond the scope of the PEP. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is *no* ambiguity in the case you construct. No Martin, the point of reviewing the PEP is to _not_ trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards. By "accessed via the str interface", I assume you do something like fn = "some string" open(fn) You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??). What happens instead is that fn gets *encoded* with the file system encoding, and the python-escape handler. This will *not* produce an ambiguity. You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes. Then, ask for the same filename via the bytes interface, using UTF-8 encoding. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. 
This means that one name on disk can be seen as two different names in memory. Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90. Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP. If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. *Of course* you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that. Yes, this would be a ridiculous interpretation of "ambiguous". -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I'm more concerned with your (yours? someone else's?) mention of shift > characters. I'm unfamiliar with these encodings: to translate such a > thing into a Latin example, is it the case that there are schemes with > valid encodings that look like: > > [SHIFT] a b c > > which would produce "ABC" in unicode, which is ambiguous with: > > A B C > > which would also produce "ABC"? No: the "shift" in "shift-jis" is not really about the shift key. See http://en.wikipedia.org/wiki/Shift-JIS Regards, Martin
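To make the "not about the shift key" point concrete: Shift JIS is a stateless multibyte encoding. A lead byte changes how the immediately following byte is read, but there is no standalone mode-setting [SHIFT] character, so the ambiguity sketched in the question cannot arise. A quick check:

```python
# 0x82 0xA0 is one two-byte Shift JIS character (HIRAGANA LETTER A, U+3042),
# not a "shift" byte followed by an independent character.
assert b"\x82\xa0".decode("shift_jis") == "\u3042"

# ASCII bytes decode to themselves; no mode state is carried along.
assert b"abc".decode("shift_jis") == "abc"
```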
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> The Python UTF-8 codec will happily encode half-surrogates; people argue >> that it is a bug that it does so, however, it would help in this >> specific case. > > Can we use this encoding scheme for writing into files as well? We've > turned the filename with undecodable bytes into a string with half > surrogates. Putting that string into a file has to turn them into bytes > at some level. Can we use the python-escape error handler to achieve > that somehow? Sure: if you are aware that what you write to the stream is actually a file name, you should encode it with the file system encoding, and the python-escape handler. However, it's questionable that the same approach is right for the rest of the data that goes into the file. If you use a different encoding on the stream, yet still use the python-escape handler, you may end up with completely non-sensical bytes. In practice, it probably won't be that bad - python-escape has likely escaped all non-ASCII bytes, so that on re-encoding with a different encoding, only the ASCII characters get encoded, which likely will work fine. Regards, Martin
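Martin's advice can be sketched concretely, assuming a UTF-8 file system encoding and using "surrogateescape", the name the python-escape handler eventually received in Python 3:

```python
fs_encoding = "utf-8"  # stand-in for sys.getfilesystemencoding()

# A listdir-style name containing one undecodable byte (0xFF):
name = b"log.\xff".decode(fs_encoding, "surrogateescape")
assert name == "log.\udcff"

# Re-encoding with the same encoding and handler restores the
# original bytes exactly, undecodable byte included.
assert name.encode(fs_encoding, "surrogateescape") == b"log.\xff"

# A plain text codec without the handler refuses the surrogate:
try:
    name.encode(fs_encoding)
    failed = False
except UnicodeEncodeError:
    failed = True
assert failed
```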
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> C. File on disk with the invalid surrogate code, accessed via the str >>> interface, no decoding happens, matches in memory the file on disk with >>> the byte that translates to the same surrogate, accessed via the bytes >>> interface. Ambiguity. >> >> Is that an alternative to A and B? > > I guess it is an adjunct to case B, the current PEP. > > It is what happens when using the PEP on a system that provides both > bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is *no* ambiguity in the case you construct. By "accessed via the str interface", I assume you do something like fn = "some string" open(fn) You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??). What happens instead is that fn gets *encoded* with the file system encoding, and the python-escape handler. This will *not* produce an ambiguity. If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. *Of course* you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that. Regards, Martin
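Martin's no-ambiguity claim is mechanically checkable with the handler PEP 383 introduced ("surrogateescape" in Python 3): bytes that happen to spell the ill-formed UTF-8 encoding of a half surrogate are escaped byte-for-byte into three surrogates, so they can never collide with the decoding of a single undecodable byte:

```python
# b"\xed\xb0\x90" is the ill-formed UTF-8 byte sequence for U+DC10.
# The utf-8b decoder rejects it and escapes each byte separately:
s = b"abc\xed\xb0\x90".decode("utf-8", "surrogateescape")
assert s == "abc\udced\udcb0\udc90"

# A single undecodable byte decodes to a single, different surrogate:
t = b"abc\x90".decode("utf-8", "surrogateescape")
assert t == "abc\udc90" and s != t

# Both round-trip to their original, distinct byte strings:
assert s.encode("utf-8", "surrogateescape") == b"abc\xed\xb0\x90"
assert t.encode("utf-8", "surrogateescape") == b"abc\x90"
```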
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson: I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). Close. You at least resolved what you thought my issue was. And, you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced. Or, to use the bytes APIs directly to get file names for 3rd party libraries -- but the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked to do something different than they did before. On 27Apr2009 23:52, Glenn Linderman wrote: On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: [...] There may be puns. So what? Use the right strings for the right purpose and all will be well. I think what is missing here, and missing from Martin's PEP, is some utility functions for the os.* namespace. PROPOSAL: add to the PEP the following functions: os.fsdecode(bytes) -> funny-encoded Unicode This is what os.listdir() does to produce the strings it hands out. os.fsencode(funny-string) -> bytes This is what open(filename,..) does to turn the filename into bytes for the POSIX open. os.pathencode(your-string) -> funny-encoded-Unicode This is what you must do to a de novo string to turn it into a string suitable for use by open. Importantly, for most strings not hand crafted to have weird sequences in them, it is a no-op. But it will recode your puns for survival. [...] So assume a non-decodable sequence in a name. 
That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns. See my proposal above. Does it address your concerns? A program still must know the providence of the string, and _if_ you're working with non-decodable sequences in a names then you should transmute then into the funny encoding using the os.pathencode() function described above. In this way the punning issue can be avoided. _Lacking_ such a function, your punning concern is valid. Seems like one would also desire os.pathdecode to do the reverse. Yes. And also versions that take or produce bytes from funny-encoded strings. Isn't that the first two functions above? Yes, sorry. Then, if programs were re-coded to perform these transformations on what you call de novo strings, then the scheme would work. But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists. I agree no such scheme exists. I don't think it can, just using strings. But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such. The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged. Right. And I don't intend to generate ill-formed Unicode strings, in my programs. 
But I might well read their names from other sources. It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms described below. If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of
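Cameron's proposed os.fsdecode/os.fsencode pair can be sketched directly in terms of the PEP's error handler ("surrogateescape" in Python 3). The explicit encoding parameter here is illustrative; a real implementation would consult sys.getfilesystemencoding():

```python
def fsdecode(raw: bytes, encoding: str = "utf-8") -> str:
    """What os.listdir() would do to produce funny-encoded str names."""
    return raw.decode(encoding, "surrogateescape")

def fsencode(name: str, encoding: str = "utf-8") -> bytes:
    """What open() would do to turn a funny-encoded name back into bytes."""
    return name.encode(encoding, "surrogateescape")

# Round trip for a name with an undecodable byte; a no-op for ordinary text.
assert fsencode(fsdecode(b"caf\xe9")) == b"caf\xe9"
assert fsdecode(b"caf\xc3\xa9") == "café"
```

(Functions with exactly these names and semantics were later added to the os module.)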
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 13:37, Glenn Linderman wrote: > On approximately 4/28/2009 1:25 PM, came the following characters from > the keyboard of Martin v. Löwis: >>> The UTF-8b representation suffers from the same potential ambiguities as >>> the PUA characters... >> >> Not at all the same ambiguities. Here, again, the two choices: >> >> A. use PUA characters to represent undecodable bytes, in particular for >>UTF-8 (the PEP actually never proposed this to happen). >>This introduces an ambiguity: two different files in the same >>directory may decode to the same string name, if one has the PUA >>character, and the other has a non-decodable byte that gets decoded >>to the same PUA character. >> >> B. use UTF-8b, representing the byte will ill-formed surrogate codes. >>The same ambiguity does *NOT* exist. If a file on disk already >>contains an invalid surrogate code in its file name, then the UTF-8b >>decoder will recognize this as invalid, and decode it byte-for-byte, >>into three surrogate codes. Hence, the file names that are different >>on disk are also different in memory. No ambiguity. > > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is this a Windows example, or (now I think on it) an equivalent POSIX example of using the PEP where the locale encoding is UTF-16? In either case, I would say one could make an argument for being stricter in reading in OS-native sequences. Grant that NTFS doesn't prevent half-surrogates in filenames, and likewise that POSIX won't because to the OS they're just bytes. On decoding, require well-formed data. When you hit ill-formed data, treat the nasty half surrogate as a PAIR of bytes to be escaped in the resulting decode. Ambiguity avoided. I'm more concerned with your (yours? someone else's?) mention of shift characters. 
I'm unfamiliar with these encodings: to translate such a thing into a Latin example, is it the case that there are schemes with valid encodings that look like: [SHIFT] a b c which would produce "ABC" in unicode, which is ambiguous with: A B C which would also produce "ABC"? Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Helicopters are considerably more expensive [than fixed wing aircraft], which is only right because they don't actually fly, but just beat the air into submission.- Paul Tomblin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote: >> Since the serialization of the Unicode string is likely to use UTF-8, >> and the string for such a file will include half surrogates, the >> application may raise an exception when encoding the names for a >> configuration file. These encoding exceptions will be as rare as the >> unusual names (which the careful I18N aware developer has probably >> eradicated from his system), and thus will appear late. > > There are trade-offs to any solution; if there was a solution without > trade-offs, it would be implemented already. > > The Python UTF-8 codec will happily encode half-surrogates; people argue > that it is a bug that it does so, however, it would help in this > specific case. Can we use this encoding scheme for writing into files as well? We've turned the filename with undecodable bytes into a string with half surrogates. Putting that string into a file has to turn them into bytes at some level. Can we use the python-escape error handler to achieve that somehow? -Toshio
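[Editorial note: this is essentially how it turned out. In the released implementation the handler is named `surrogateescape`, and `open()` accepts it via the `errors` argument, so the half-surrogate string can round-trip through a text file. A sketch against modern Python:]

```python
import os
import tempfile

# A filename decoded with surrogateescape may contain lone surrogates:
name = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')  # 'caf\udce9.txt'

# Passing the same error handler to open() turns those surrogates back
# into the original bytes on the way out, and recovers them on the way in:
d = tempfile.mkdtemp()
cfg = os.path.join(d, 'names.txt')
with open(cfg, 'w', encoding='utf-8', errors='surrogateescape') as f:
    f.write(name + '\n')
with open(cfg, encoding='utf-8', errors='surrogateescape') as f:
    assert f.read().rstrip('\n') == name
```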
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 14:37, Thomas Breuel wrote: | But the biggest problem with the proposal is that it isn't needed: if you | want to be able to turn arbitrary byte sequences into unicode strings and | back, just set your encoding to iso8859-15. That already works and it | doesn't require any changes. No it doesn't. It does transcode without throwing exceptions. On POSIX. (On Windows? I doubt it - windows isn't using an 8-bit scheme. I believe.) But it utterly destroys any hope of working in any other locale nicely. The PEP lets you work losslessly in other locales. It _may_ require some app care for particular very weird strings that don't come from the filesystem, but as far as I can see only in circumstances where such care would be needed anyway i.e. you've got to do special stuff for weirdness in the first place. Weird == "ill-formed unicode string" here. Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ I just kept it wide-open thinking it would correct itself. Then I ran out of talent. - C. Fittipaldi
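[Editorial note: Cameron's "destroys any hope of working in any other locale" is easy to demonstrate; decoding everything as iso8859-15 never raises, but it mangles text from every other encoding. A quick illustration in modern Python:]

```python
# UTF-8 encoded 'é' read through an iso8859-15 lens becomes two mojibake
# characters, so every correctly encoded non-ASCII name is ruined:
raw = 'é'.encode('utf-8')                 # b'\xc3\xa9'
assert raw.decode('iso8859-15') == 'Ã©'   # readable name destroyed

# The PEP's approach (surrogateescape in released Python) leaves cleanly
# decodable names intact and only escapes the truly undecodable bytes:
assert raw.decode('utf-8', 'surrogateescape') == 'é'
```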
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. --David
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote:
> On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
>> If you switch to iso8859-15 only in the presence of undecodable UTF-8,
>> then you have the same round-trip problem as the PEP: both b'\xff' and
>> b'\xc3\xbf' will be converted to u'\u00ff' without a way to
>> unambiguously recover the original file name.
>
> Why do you say that? It seems to work as I expected here:
> >>> '\xff'.decode('iso-8859-15')
> u'\xff'
> >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'
> >>> '\xff'.decode('cp1252')
> u'\xff'
> >>> '\xc3\xbf'.decode('cp1252')
> u'\xc3\xbf'

You're not showing that this is a fallback path. What won't work is first trying a local encoding (in the following example, utf-8) and then if that doesn't work, trying a one-byte encoding like iso8859-15:

try:
    file1 = '\xff'.decode('utf-8')
except UnicodeDecodeError:
    file1 = '\xff'.decode('iso8859-15')
print repr(file1)

try:
    file2 = '\xc3\xbf'.decode('utf-8')
except UnicodeDecodeError:
    file2 = '\xc3\xbf'.decode('iso8859-15')
print repr(file2)

That prints:

u'\xff'
u'\xff'

The two encodings can map different bytes to the same unicode code point so you can't do this type of thing without recording what encoding was used in the translation. -Toshio
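[Editorial note: Toshio's Python 2 example transliterates directly to Python 3, and the same test shows how the PEP's handler avoids the collision. A sketch, assuming modern Python's `surrogateescape`:]

```python
def decode_fallback(raw: bytes) -> str:
    # Toshio's two-step decode: try the local encoding, fall back to a
    # one-byte encoding when that fails.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso8859-15')

# Both byte strings collapse to the same text -- the ambiguity:
assert decode_fallback(b'\xff') == decode_fallback(b'\xc3\xbf') == '\xff'

# surrogateescape keeps them distinct, so each round-trips unambiguously:
assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'
assert b'\xc3\xbf'.decode('utf-8', 'surrogateescape') == '\xff'
```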
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB: Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes. Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s). OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them? My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement. Anyway, I think we are communicating successfully. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP. 3. 
Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception". Yes, your rephrasing is clearer, regarding your intention. For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00. Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either. I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though... 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, this is easier for humans to comprehend. 2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it. 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. 4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF would raise an exception. 5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces. This differs from my previous proposal in three ways: A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then). B. 
Allows for a choice of escape codepoint, the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice. C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another. Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters.
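[Editorial note: a rough sketch of the scheme proposed above. The function names `to_visible`/`from_visible` and the use of `surrogateescape` as the stand-in for the underlying funny-decode are illustrative assumptions, not any shipped implementation; `'?'` (U+003F) is the escape codepoint the proposal suggests.]

```python
ESC = '\u003f'  # '?', the suggested escape codepoint (illustrative choice)

def to_visible(s: str) -> str:
    """Escape a funny-decoded string into displayable form (rules 2 and 3)."""
    out = []
    for ch in s:
        if ch == ESC:
            out.append(ESC + ESC)  # rule 2: double the escape codepoint
        elif '\udc00' <= ch <= '\udcff':  # a half surrogate for byte 0xPQ
            out.append(ESC + chr(0x0100 + ord(ch) - 0xDC00))  # rule 3: U+01PQ
        else:
            out.append(ch)
    return ''.join(out)

def from_visible(s: str) -> str:
    """Reverse the escaping (rule 4); malformed sequences raise."""
    out, i = [], 0
    while i < len(s):
        ch = s[i]
        if ch != ESC:
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(s):
            raise ValueError('dangling escape codepoint')
        nxt = s[i + 1]
        if nxt == ESC:
            out.append(ESC)
        elif 0x0100 <= ord(nxt) <= 0x01FF:
            out.append(chr(0xDC00 + ord(nxt) - 0x0100))
        else:
            raise ValueError('bad escape sequence')
        i += 2
    return ''.join(out)

# An undecodable byte shows up as '?' plus a visible Latin Extended char,
# and the transformation is reversible:
name = b'ab\xff'.decode('utf-8', 'surrogateescape')   # 'ab\udcff'
assert to_visible(name) == 'ab?' + chr(0x01FF)
assert from_visible(to_visible(name)) == name
```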
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). On 27Apr2009 23:52, Glenn Linderman wrote: > On approximately 4/27/2009 7:11 PM, came the following characters from > the keyboard of Cameron Simpson: [...] >> There may be puns. So what? Use the right strings for the right purpose >> and all will be well. >> >> I think what is missing here, and missing from Martin's PEP, is some >> utility functions for the os.* namespace. >> >> PROPOSAL: add to the PEP the following functions: >> >> os.fsdecode(bytes) -> funny-encoded Unicode >> This is what os.listdir() does to produce the strings it hands out. >> os.fsencode(funny-string) -> bytes >> This is what open(filename,..) does to turn the filename into bytes >> for the POSIX open. >> os.pathencode(your-string) -> funny-encoded-Unicode >> This is what you must do to a de novo string to turn it into a >> string suitable for use by open. >> Importantly, for most strings not hand crafted to have weird >> sequences in them, it is a no-op. But it will recode your puns >> for survival. [...] >>> So assume a non-decodable sequence in a name. That puts us into >>> Martin's funny-decode scheme. His funny-decode scheme produces a >>> bare string, indistinguishable from a bare string that would be >>> produced by a str API that happens to contain that same sequence. >>> Data puns. >>> >> >> See my proposal above. Does it address your concerns? A program still >> must know the provenance of the string, and _if_ you're working with >> non-decodable sequences in names then you should transmute them into >> the funny encoding using the os.pathencode() function described above. >> >> In this way the punning issue can be avoided. >> _Lacking_ such a function, your punning concern is valid. > Seems like one would also desire os.pathdecode to do the reverse. Yes. > And > also versions that take or produce bytes from funny-encoded strings. 
Isn't that the first two functions above? > Then, if programs were re-coded to perform these transformations on what > you call de novo strings, then the scheme would work. > But I think a large part of the incentive for the PEP is to try to > invent a scheme that intentionally allows for the puns, so that programs > do not need to be recoded in this manner, and yet still work. I don't > think such a scheme exists. I agree no such scheme exists. I don't think it can, just using strings. But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such. The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged. > If there is going to be a required transformation from de novo strings > to funny-encoded strings, then why not make one that people can actually > see and compare and decode from the displayable form, by using > displayable characters instead of lone surrogates? Because that would _not_ be a no-op for well formed Unicode strings. That reason is sufficient for me. I consider the fact that well-formed Unicode -> funny-encoded is a no-op to be an enormous feature of Martin's scheme. Unless I'm missing something, there _are_no_puns_ between funny-encoded strings and well formed unicode strings. I suppose if your program carefully constructs a unicode string riddled with half-surrogates etc and imagines something specific should happen to them on the way to being POSIX bytes then you might have a problem... >>> Right. Or someone else's program does that. 
I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a coffee, reading section 3.9 (Unicode Encoding Forms). I now do not believe your scenario makes sense. Someone can construct a Python3 string containing code points that includes surrogates. Granted. However such a string is not meaningful because it is not well-formed (D85). It's ill-formed (D84). It is not sane to expect it to translate into a POSIX byte sequence, be it UTF-8 or anything else, unless it is accompanied by some kind of explicit mapping provided by the programmer. Absent that mapping, it's nonsense in much the same way that a non-decodable UTF-8 byte sequence is nonsense. For example, Martin's funny-encoding is such an explicit mapping. >>>I only want to use >>> Unicode file names. But if those other file names exist, I want to >>> be able to access them, and not accidentally get a different file. But those other
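[Editorial note: the first two helpers Cameron proposes, `os.fsdecode` and `os.fsencode`, did eventually land (in Python 3.2) with essentially the proposed behavior: filesystem encoding plus the PEP's error handler. A quick check against modern Python:]

```python
import os

# On a POSIX system with a UTF-8 filesystem encoding, an undecodable
# byte becomes a lone half surrogate, and the round trip is lossless:
name = os.fsdecode(b'ab\xff')           # 'ab\udcff' under UTF-8
assert os.fsencode(name) == b'ab\xff'   # recodes back to the same bytes
```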
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the byte will ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte, into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. On a Windows system, perhaps the ambiguous case would be the use of the str API and bytes APIs producing different memory names for the same file that contains a (Unicode-illegal) half surrogate. The half-surrogate would seem to get decoded to 3 half surrogates if accessed via the bytes interface, but only one via the str interface. 
The version with 3 half surrogates could match another name that actually contains 3 half surrogates, that is accessed via the str interface. I can't actually tell by reading the PEP whether it affects Windows bytes interfaces or is only implemented on POSIX, so that POSIX has a str interface. If it is only implemented on POSIX, then the current scheme (now escaping the hundreds of escape codes) could work, within a single platform... but it would still suffer from displaying garbage (sequences of replacement characters) in file listings displayed or printed. There is no way, once the string is adjusted to contain replacement characters for display, to distinguish one file name from another, if they are identical except for a same-length sequence of different undecodable bytes. The concept of a function that allows the same decoding and encoding process for 3rd party interfaces is still missing from the PEP; implementation of the PEP would require that all interfaces to 3rd party software that accept file names would have to be transcoded by the interface layer. Or else such software would have to use the bytes interfaces directly, and if they do, there is no need for the PEP. So I see the PEP as a partial solution to a limited problem, that on the one hand potentially produces indistinguishable sequences of replacement characters in filenames, rather than the mojibake (which is at least distinguishable), and on the other hand, doesn't help software that also uses 3rd party libraries to avoid the use of bytes APIs for accessing file names. There are other encodings that produce more distinguishable mojibake, and would work in the same situations as the PEP. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: > On approximately 4/28/2009 1:25 PM, came the following characters from > the keyboard of Martin v. Löwis: >>> The UTF-8b representation suffers from the same potential ambiguities as >>> the PUA characters... >> >> Not at all the same ambiguities. Here, again, the two choices: >> >> A. use PUA characters to represent undecodable bytes, in particular for >>UTF-8 (the PEP actually never proposed this to happen). >>This introduces an ambiguity: two different files in the same >>directory may decode to the same string name, if one has the PUA >>character, and the other has a non-decodable byte that gets decoded >>to the same PUA character. >> >> B. use UTF-8b, representing the byte with ill-formed surrogate codes. >>The same ambiguity does *NOT* exist. If a file on disk already >>contains an invalid surrogate code in its file name, then the UTF-8b >>decoder will recognize this as invalid, and decode it byte-for-byte, >>into three surrogate codes. Hence, the file names that are different >>on disk are also different in memory. No ambiguity. > > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is that an alternative to A and B? Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes. Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s). 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception". For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00. I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though... 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, this is easier for humans to comprehend. 
2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it. 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. 4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF would raise an exception. 5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces. This differs from my previous proposal in three ways: A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then). B. Allows for a choice of escape codepoint, the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice. C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another. Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. Perhaps the escape character should be U+005C. 
;-) It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Others have made this suggestion, and it is helpful to the PEP, but not > sufficient. As implemented as an error handler, I'm not sure that the > b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 > decoder is happy with it. Which, in my testing, it is. Rest assured that the utf-8b codec will work the way it is specified. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the byte with ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte, into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico: 2009/4/28 Glenn Linderman : The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names. It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Wrong. An 8859-1 locale allows any byte sequence to placed into a POSIX filename. And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
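Glenn's doubt above turns on whether the UTF-8 decoder accepts b'\xed\xb3\xbf' (Python 3.0's decoder did). Under the strict UTF-8 definition the PEP assumes, which later Pythons implement, the sequence is rejected and so does reach the error handler; a minimal check:

```python
# Python 3.0's decoder accepted UTF-8-encoded surrogates; the PEP assumes
# a strict decoder. Under a strict decoder (Python 3.1 and later), the
# three-byte encoding of a lone surrogate is invalid and is handed to the
# error handler rather than silently decoded to '\udcff'.
try:
    b'\xed\xb3\xbf'.decode('utf-8')
    reaches_handler = False
except UnicodeDecodeError:
    reaches_handler = True

assert reaches_handler
```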
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though... 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, this is easier for humans to comprehend. 2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it. 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. 4. 
When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF would raise an exception. 5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces. This differs from my previous proposal in three ways: A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then). B. Allows for a choice of escape codepoint, the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice. C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another. Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. 
This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
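Glenn's five rules can be sketched as a pair of Python functions. This is only an illustration of the proposal, not an implementation anyone has adopted: the helper names are invented here, U+003F ('?') is used as the escape codepoint per Glenn's recommendation, and the maximal-prefix decode loop is deliberately naive:

```python
ESCAPE = '\u003f'  # '?', the escape codepoint Glenn suggests (rule 1)

def escape_decode(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode file name bytes; double literal escapes (rule 2) and map an
    undecodable byte 0xPQ to ESCAPE + U+01PQ (rule 3)."""
    out, i = [], 0
    while i < len(raw):
        # Try to decode a maximal valid prefix starting at i.
        for j in range(len(raw), i, -1):
            try:
                chunk = raw[i:j].decode(encoding)
            except UnicodeDecodeError:
                continue
            out.append(chunk.replace(ESCAPE, ESCAPE * 2))
            i = j
            break
        else:
            out.append(ESCAPE + chr(0x0100 + raw[i]))  # byte 0xPQ -> U+01PQ
            i += 1
    return ''.join(out)

def escape_encode(name: str, encoding: str = 'utf-8') -> bytes:
    """Reverse the escaping (rule 4); lone escapes raise an exception."""
    out, i = bytearray(), 0
    while i < len(name):
        c = name[i]
        if c == ESCAPE:
            if i + 1 < len(name) and name[i + 1] == ESCAPE:
                out += ESCAPE.encode(encoding)        # doubled -> literal
                i += 2
            elif i + 1 < len(name) and 0x0100 <= ord(name[i + 1]) <= 0x01FF:
                out.append(ord(name[i + 1]) - 0x0100)  # U+01PQ -> byte 0xPQ
                i += 2
            else:
                raise ValueError('lone escape codepoint at index %d' % i)
        else:
            out += c.encode(encoding)
            i += 1
    return bytes(out)

# An undecodable byte round-trips, and a literal '?' survives doubling:
assert escape_encode(escape_decode(b'caf\xe9')) == b'caf\xe9'
assert escape_decode(b'a?b') == 'a??b'
assert escape_encode('a??b') == b'a?b'
```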
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The UTF-8b representation suffers from the same potential ambiguities as > the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the bytes with ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte, into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents). A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent. 
From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. 
But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify. Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to: If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified. iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? For what it is worth, what we have previously planned to do for the Tahoe project is the second of these -- decode using some 1-byte encoding such as iso-8859-1, iso-8859-15, or windows-1252 only in the case that attempting to decode the bytes using the local alleged encoding failed. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here:

>>> '\xff'.decode('iso-8859-15')
u'\xff'
>>> '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'
>>> '\xff'.decode('cp1252')
u'\xff'
>>> '\xc3\xbf'.decode('cp1252')
u'\xc3\xbf'

Regards, Zooko
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. 
RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. 
But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. I think I've covered all the possibilities. :-)
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:00 AM, came the following characters from the keyboard of Martin v. Löwis: An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... perhaps slightly less likely in practice, due to the use of Unicode-illegal characters, but exactly the same theoretical likelihood in the space of Python-acceptable character codes. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. 
RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. 
James
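James's restricted proposal (only 0x80-0xFF mapped to U+DC80-U+DCFF, with ASCII-range bytes refused for the security reason discussed above) can be sketched as a Python error handler registered under a made-up name; this is essentially the behaviour that later shipped as 'surrogateescape', and the handler name and code here are illustrative only:

```python
import codecs

def python_escape_sketch(exc):
    # Decoding: map each undecodable byte 0x80-0xFF to U+DC80-U+DCFF.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if any(b < 0x80 for b in bad):
            raise exc  # never escape ASCII-range bytes (no U+DC2F -> '/')
        return ''.join(chr(0xDC00 + b) for b in bad), exc.end
    # Encoding: map U+DC80-U+DCFF back to the original byte.
    if isinstance(exc, UnicodeEncodeError):
        chars = exc.object[exc.start:exc.end]
        if not all(0xDC80 <= ord(c) <= 0xDCFF for c in chars):
            raise exc
        return bytes(ord(c) - 0xDC00 for c in chars), exc.end
    raise exc

codecs.register_error('python-escape-sketch', python_escape_sketch)

# The byte 0xFF round-trips through the handler losslessly:
s = b'\xff'.decode('utf-8', 'python-escape-sketch')
assert s == '\udcff'
assert s.encode('utf-8', 'python-escape-sketch') == b'\xff'
```

Note the encode-side handler returns bytes directly, which the codec machinery splices into the output; this is what makes the reverse mapping possible without touching the codec itself.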
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> If the PEP depends on this being changed, it should be mentioned in the > PEP. The PEP says that the utf-8b codec decodes invalid bytes into low surrogates. I have now clarified that a strict definition of UTF-8 is assumed for utf-8b. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Since the serialization of the Unicode string is likely to use UTF-8, > and the string for such a file will include half surrogates, the > application may raise an exception when encoding the names for a > configuration file. These encoding exceptions will be as rare as the > unusual names (which the careful I18N aware developer has probably > eradicated from his system), and thus will appear late. There are trade-offs to any solution; if there was a solution without trade-offs, it would be implemented already. The Python UTF-8 codec will happily encode half-surrogates; people argue that it is a bug that it does so, however, it would help in this specific case. An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. > Or say de/serialization succeeds. Since the resulting Unicode string > differs depending on the encoding (which is a good thing; it is > supposed to make most cases mostly readable), when the filesystem > encoding changes (say from legacy to UTF-8), the "name" changes, and > deserialized references to it become stale. That problem has nothing to do with the PEP. If the encoding changes, LRU entries may get stale even if there were no encoding errors at all. Suppose the old encoding was Latin-1, and the new encoding is KOI8-R, then all file names are decodable before and afterwards, yet the string representation changes. Applications that want to protect themselves against that happening need to store byte representations of the file names, not character representations. Depending on the configuration file format, that may or may not be possible. 
I find the case pretty artificial, though: if the locale encoding changes, all file names will look incorrect to the user, so he'll quickly switch back, or rename all the files. As an application supporting an LRU list, I would remove/hide all entries that don't correlate to existing files - after all, the user may have as well deleted the file in the LRU list. Regards, Martin
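Martin's Latin-1/KOI8-R point is easy to illustrate: the same on-disk bytes decode cleanly under both encodings, yet yield different strings, so character-based references go stale on a locale switch even with no decoding errors anywhere (the sample bytes below are arbitrary):

```python
raw = b'\xc4\xc5'                    # some on-disk file name bytes
as_latin1 = raw.decode('latin-1')    # decodes fine under the old locale
as_koi8r = raw.decode('koi8-r')      # decodes fine under the new one, too
assert as_latin1 != as_koi8r         # ...but the string has changed
```

Storing the byte representation instead sidesteps the problem, which is why Martin recommends it where the configuration file format allows.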
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is > not a valid Unicode character (not a character at all, really) and the > only way you can put this in a POSIX filename is if you use a very > lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. > > Since this byte sequence doesn't represent a valid character when > decoded with UTF-8, it should simply be considered an invalid UTF-8 > sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* > '\udcff'). > > Martin: maybe the PEP should say this explicitly? Sure, will do. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore writes: > But it seems to me that there is an assumption that problems will > arise when code gets a potentially funny-decoded string and doesn't > know where it came from. > > Is that a real concern? Yes, it's a real concern. I don't think it's possible to show a small piece of code one could point at and say "without a better API I bet you can't write this correctly," though. Rather, my experience with Emacs and various mail packages is that without type information it is impossible to keep track of the myriad bits and pieces of text that are recombining like pig flu, and eventually one breaks out and causes an error. It's usually easy to fix, but so are the next hundred similar regressions, and in the meantime a hundred users have suffered more or less damage or at least annoyance. There's no question that dealing with escapes of funny-decoded strings to unprepared code paths is mission creep compared to Martin's stated purpose for PEP 383, but it is also a real problem.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull wrote: > Nobody said we were at the stage of *saving* the [attachment]! But speaking of saving files, I think that's the biggest hole in this that has been nagging at the back of my mind. This PEP intends to allow easy access to filenames and other environment strings which are not restricted to known encodings. What happens if the detected encoding changes? There may be difficulties de/serializing these names, such as for an MRU list. Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the unusual names (which the careful I18N aware developer has probably eradicated from his system), and thus will appear late. Or say de/serialization succeeds. Since the resulting Unicode string differs depending on the encoding (which is a good thing; it is supposed to make most cases mostly readable), when the filesystem encoding changes (say from legacy to UTF-8), the "name" changes, and deserialized references to it become stale. This can probably be handled through careful use of the same encoding/decoding scheme, if relevant, but that sounds like we've just moved the problem from fs/environment access to serialization. Is that good enough? For other uses the API knew whether it was environmentally aware, but serialization probably will not. Should this PEP make recommendations about how to save filenames in configuration files? -- Michael Urman
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Hrvoje Niksic : > Lino Mastrodomenico wrote: >> >> Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid >> character when >> decoded with UTF-8, it should simply be considered an invalid UTF-8 >> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* >> '\udcff'). > > "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder > happily accepts it and returns u'\udcff': > b'\xed\xb3\xbf'.decode('utf-8') > '\udcff' Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). -- Lino Mastrodomenico
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder happily accepts it and returns u'\udcff':

>>> b'\xed\xb3\xbf'.decode('utf-8')
'\udcff'

If the PEP depends on this being changed, it should be mentioned in the PEP.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman : > The switch from PUA to half-surrogates does not resolve the issues with the > encoding not being a 1-to-1 mapping, though. The very fact that you think > you can get away with use of lone surrogates means that other people might, > accidentally or intentionally, also use lone surrogates for some other > purpose. Even in file names. It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' -- Lino Mastrodomenico
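The two round-trips Lino describes can be checked directly in a later Python, where the PEP's utf-8b behaviour ships as the 'surrogateescape' error handler (an anachronism relative to this thread, where the handler did not yet exist):

```python
# A raw byte 0xFF in a file name decodes to the lone surrogate U+DCFF
# and encodes back to the same byte:
name1 = b'\xff'.decode('utf-8', 'surrogateescape')
assert name1 == '\udcff'
assert name1.encode('utf-8', 'surrogateescape') == b'\xff'

# The three bytes a lenient encoder would emit for U+DCFF are themselves
# invalid UTF-8, so they decode byte-for-byte into three different
# surrogates -- no collision with name1, and the round-trip still holds:
name2 = b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape')
assert name2 == '\udced\udcb3\udcbf'
assert name2.encode('utf-8', 'surrogateescape') == b'\xed\xb3\xbf'
assert name1 != name2
```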
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodable bytes are encountered? If you unconditionally set encoding to iso8859-15, then you are effectively reverting to treating file names as bytes, regardless of the locale. You're also angering a lot of European users who expect iso8859-2, etc. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name.
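The collision in the second case is easy to demonstrate (a sketch; 0xFF is 'ÿ' in both ISO 8859-1 and 8859-15):

```python
# Two distinct byte filenames...
raw_invalid = b'\xff'        # not valid UTF-8
raw_valid = b'\xc3\xbf'      # valid UTF-8 for U+00FF 'ÿ'

# ...collapse to the same string under a "fall back to iso8859-15
# for undecodable names" policy:
assert raw_invalid.decode('iso8859-15') == '\xff'
assert raw_valid.decode('utf-8') == '\xff'

# Re-encoding that string recovers only one of the two originals,
# so the other filename becomes unreachable.
assert '\xff'.encode('utf-8') == raw_valid
```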
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> > Yep, that's the problem. Lots of theoretical problems no one has ever > encountered > brought up against a PEP which resolves some actual problems people > encounter on > a regular basis. How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do. But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
For what it's worth, the OSX APIs seem to behave as follows: * If you create a file with a non-UTF-8 name on a HFS+ filesystem the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system. * If you mount an NFS filesystem from a linux host and that directory contains a file named chr(255) - unix-level tools will see a file with the expected name (just like on linux) - Cocoa's NSFileManager returns u"?" as the filename, that is when the filename cannot be decoded using UTF-8 the name returned by the high-level API is mangled. This is regardless of the setting of LANG. - I haven't found a way yet to access files whose names are not valid UTF-8 using the high-level Cocoa APIs. The latter two are interesting because Cocoa has a unicode filesystem API on top of a POSIX C-API, just like Python 3.x. I guess the chosen behaviour works out on OSX (where users are unlikely to run into this issue), but could be more problematic on other POSIX systems. Ronald On 28 Apr, 2009, at 14:03, Michael Foord wrote: Paul Moore wrote: 2009/4/28 Antoine Pitrou : Paul Moore gmail.com> writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1). In case it's not clear, I am also +1 on the PEP as it stands. Me 2 Michael Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore wrote: 2009/4/28 Antoine Pitrou : Paul Moore gmail.com> writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1). In case it's not clear, I am also +1 on the PEP as it stands. Me 2 Michael Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Antoine Pitrou : > Paul Moore gmail.com> writes: >> >> I've yet to hear anyone claim that they would have an actual problem >> with a specific piece of code they have written. > > Yep, that's the problem. Lots of theoretical problems no one has ever > encountered > brought up against a PEP which resolves some actual problems people encounter > on > a regular basis. > > For the record, I'm +1 on the PEP being accepted and implemented as soon as > possible (preferably before 3.1). In case it's not clear, I am also +1 on the PEP as it stands. Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman : > So assume a non-decodable sequence in a name. That puts us into Martin's > funny-decode scheme. His funny-decode scheme produces a bare string, > indistinguishable from a bare string that would be produced by a str API > that happens to contain that same sequence. Data puns. > > So when open is handed the string, should it open the file with the name > that matches the string, or the file with the name that funny-decodes to the > same string? It can't know, unless it knows that the string is a > funny-decoded string or not. Sorry for picking on Glenn's comment - it's only one of many in this thread. But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from. Is that a real concern? How many programs really don't know where their data came from? Maybe a general-purpose library routine *might* just need to document explicitly how it handles funny-encoded data (I can't actually imagine anything that would, but I'll concede it may be possible) but that's just a matter of documenting your assumptions - no better or worse than many other cases. This all sounds similar to the idea of "tainted" data in security - if you lose track of untrusted data from the environment, you expose yourself to potential security issues. So the same techniques should be relevant here (including ignoring it if your application isn't such that it's a concern!) I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. (NB, if such a claim has been made, feel free to point me to it - I admit I've been skimming this thread at times). Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Does the PEP take into consideration the normalising behaviour of Mac > OSX ? We've had some ongoing challenges related to this in bzr. No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards, Martin
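The normalisation issue Rob refers to is that HFS+ stores filenames in a decomposed form, so a name created with composed characters can come back byte-for-byte different. A minimal illustration of the underlying Unicode behaviour with the stdlib (a sketch; not tied to any filesystem):

```python
import unicodedata

nfc = 'caf\u00e9'  # 'café' with the composed character U+00E9
nfd = unicodedata.normalize('NFD', nfc)  # 'cafe' + combining acute,
                                         # roughly what HFS+ stores

# The two spellings render identically but compare unequal as str,
# which is why naive filename comparisons break on Mac OS X.
assert nfc != nfd
assert unicodedata.normalize('NFC', nfd) == nfc
```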
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: On 27Apr2009 18:15, Glenn Linderman wrote: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes. Why is it necessary that you are able to make this distinction? It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding. I would say this isn't so. It's important that programs know if they're dealing with strings-for-filenames, but not that they be able to figure that out "a priori" if handed a bare string (especially since they can't:-) So you agree they can't... that there are data puns. (OK, you may not have thought that through) I agree you can't examine a string and know if it came from the os.* munging or from someone else's munging. I totally disagree that this is a problem. There may be puns. So what? Use the right strings for the right purpose and all will be well. I think what is missing here, and missing from Martin's PEP, is some utility functions for the os.* namespace. PROPOSAL: add to the PEP the following functions: os.fsdecode(bytes) -> funny-encoded Unicode This is what os.listdir() does to produce the strings it hands out. os.fsencode(funny-string) -> bytes This is what open(filename,..) does to turn the filename into bytes for the POSIX open. os.pathencode(your-string) -> funny-encoded-Unicode This is what you must do to a de novo string to turn it into a string suitable for use by open. Importantly, for most strings not hand crafted to have weird sequences in them, it is a no-op. But it will recode your puns for survival.
and for me, I would like to see: os.setfilesystemencoding(coding) Currently os.getfilesystemencoding() returns you the encoding based on the current locale, and (I trust) the os.* stuff encodes on that basis. setfilesystemencoding() would override that, unless coding==None in which case it reverts to the former "use the user's current locale" behaviour. (We have locale "C" for what one might otherwise expect None to mean:-) The idea here is to let the program control the codec used for filenames for special purposes, without working indirectly through the locale. If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file. Hmm. I had thought that legitimate unicode strings already get transcoded to bytes via the mapping specified by sys.getfilesystemencoding() (the user's locale). That already happens I believe, and Martin's scheme doesn't change this. He's just funny-encoding non-decodable byte sequences, not the decoded stuff that surrounds them. So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns. See my proposal above. Does it address your concerns? A program still must know the provenance of the string, and _if_ you're working with non-decodable sequences in names then you should transmute them into the funny encoding using the os.pathencode() function described above. In this way the punning issue can be avoided. _Lacking_ such a function, your punning concern is valid. Seems like one would also desire os.pathdecode to do the reverse. And also versions that take or produce bytes from funny-encoded strings. Then, if programs were re-coded to perform these transformations on what you call de novo strings, then the scheme would work.
But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists. If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates? So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not. True. open() should always expect a funny-encoded name. So it is already the case that strings get decoded to bytes by cal
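Functions with exactly the names Cameron proposes, os.fsencode() and os.fsdecode(), were later added to the standard library (in Python 3.2). Their POSIX behaviour can be sketched with explicit codec calls (a simplified model: it hard-codes UTF-8, whereas the real functions consult sys.getfilesystemencoding()):

```python
FS_ENCODING = 'utf-8'  # the real os.fsdecode/os.fsencode use
                       # sys.getfilesystemencoding() instead

def fsdecode(raw: bytes) -> str:
    # What os.listdir() does to produce str names: undecodable
    # bytes become lone surrogates in U+DC80..U+DCFF.
    return raw.decode(FS_ENCODING, 'surrogateescape')

def fsencode(name: str) -> bytes:
    # What open() does to turn a str name back into bytes for the
    # POSIX call: the lone surrogates turn back into raw bytes.
    return name.encode(FS_ENCODING, 'surrogateescape')

# Round trip for a name that is not valid UTF-8:
raw = b'caf\xe9'  # 'café' in latin-1, invalid as UTF-8
assert fsdecode(raw) == 'caf\udce9'
assert fsencode(fsdecode(raw)) == raw
```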
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: > Hopefully it can be assumed that your locale encoding really is a > non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? > I'm a bit scared at the prospect that U+DCAF could turn into "/", that > just screams security vulnerability to me. So I'd like to propose that > only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be > encoded/decoded via the error handler. It would actually be U+DC2F that would turn into /. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. Regards, Martin
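The restriction James asks for is what was eventually shipped: the `surrogateescape` handler only maps bytes 0x80-0xFF to U+DC80-U+DCFF, so U+DC2F can never be smuggled back out as a '/' byte. A quick check of the behaviour on current Python 3:

```python
# Bytes >= 0x80 round-trip through the dedicated surrogate range...
assert b'\xaf'.decode('utf-8', 'surrogateescape') == '\udcaf'
assert '\udcaf'.encode('utf-8', 'surrogateescape') == b'\xaf'

# ...but U+DC2F (which would map to byte 0x2F, '/') is refused
# outright, closing the path-injection hole.
try:
    '\udc2f'.encode('utf-8', 'surrogateescape')
    escaped = True
except UnicodeEncodeError:
    escaped = False
print(escaped)  # False
```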
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/27/2009 8:39 PM, came the following characters from the keyboard of Martin v. Löwis: I'm not suggesting the PEP should solve the problem of mounting foreign file systems, although if it doesn't it should probably point that out. I'm just suggesting that if the people that write software to solve the problem of mounting foreign file systems have already solved the naming problem, then it might be a source of a good solution. On the other hand, it might be the source of a mediocre or bad solution. However, if those mounting systems have good solutions, it would be good to be compatible with them, rather than have yet another solution. It was in that sense, of thinking about possibly existing practice, and leveraging an existing solution, that caused me to bring up the topic. I think you make quite a lot of assumptions here. It would be better to research the state of the art first, and only then propose to follow it. I didn't propose to follow it. I only proposed an area that could be researched as a source of ideas and/or potential solutions. Apparently there wasn't, but there could have been someone listening that had the results of such research on the tip of their tongue, and might have piped up with the techniques used. I did, in fact, begin researching the topic after making the suggestion, and thus far haven't found any brilliant solutions from that arena. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote: > > Indeed, that was the missing piece. I'd forgotten about the > encodings > that use escape sequences, rather than UTF-8, and DBCS. I don't > think > those encodings are permitted by POSIX file systems, but I suppose > they > could sneak in via Environment variable values, and the like. This may already have been discussed, and if so I apologise for the noise. Does the PEP take into consideration the normalising behaviour of Mac OSX ? We've had some ongoing challenges related to this in bzr. -Rob
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis: It's a private use area. It will never carry an official character assignment. I know that U+F0000 - U+FFFFD is a private use area. I don't find a definition of U+F01xx to know what the notation means. Are you picking a particular character within the private use area, or a particular range, or what? It's a range. The lower-case 'x' denotes a variable half-byte, ranging from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code points. So you only need 128 code points, so there is something else unclear. (please understand that this is history now, since the PEP has stopped using PUA characters). Yes, but having found the latest PEP finally (at least I hope the one at python.org is the latest, it has quit using PUA anyway), I confirm it is history. But the same issue applies to the range of half-surrogates. No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general: py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence All bytes are below 128, yet it fails to decode. Indeed, that was the missing piece. I'd forgotten about the encodings that use escape sequences, rather than UTF-8, and DBCS. I don't think those encodings are permitted by POSIX file systems, but I suppose they could sneak in via Environment variable values, and the like. The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though.
The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names. -- Glenn
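Martin's point, that bytes below 128 do not guarantee successful decoding, still holds for escape-sequence encodings in Python 3 (a sketch; the byte string is his example rewritten in 3.x syntax):

```python
data = b"\x1b$B' \x1b(B"  # ESC $ B enters JIS X 0208 mode; the
                          # following pair 0x27 0x20 is not a
                          # valid two-byte code point.
assert all(byte < 128 for byte in data)  # every byte is ASCII-range

try:
    data.decode('iso-2022-jp')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False: ASCII-range bytes, yet not decodable
```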
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote: No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general: py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence All bytes are below 128, yet it fails to decode. Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly forbidden by POSIX, if I'm not mistaken...and I can't see how it would work, considering that it uses all the bytes from 0x20-0x7f, including 0x2f ("/"), to represent non-ascii characters. Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. James
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Tony Nelson writes: > At 16:09 + 04/27/2009, Antoine Pitrou wrote: > >Stephen J. Turnbull xemacs.org> writes: > >> > >> I hate to break it to you, but most stages of mail processing have > >> very little to do with SMTP. In particular, processing MIME > >> attachments often requires dealing with file names. > > > >AFAIK, the file name is only there as an indication for the user > >when he wants to save the file. If it's garbled a bit, no big > >deal. Nobody said we were at the stage of *saving* the file!
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Michael Foord writes: > The problem you don't address, which is still the reality for most > programmers (especially Mac OS X where filesystem encoding is UTF 8), is > that programmers *are* going to treat filenames as strings. > The proposed PEP allows that to work for them - whatever platform their > program runs on. Sure, for values of "work" == "No exception will be raised in my module, and some content will actually be returned." It doesn't say anything about what happens once those strings escape the immediate context. So it *encourages* those programmers to pass any problems downstream, but only after discarding the resources needed to deal with problems effectively. It's not that hard to overcome that problem, but it does require a slightly more complex API, and one that doesn't return a string but rather a stringlike object annotated with the information about how it was decoded. Conversion to a string *should* be trivial; I just think it should be invoked explicitly to make it clear where information is being discarded. Without an implicit conversion, the nature of the data (ie, context-dependent structure) is made explicit. There's a natural place to document the problem that context must be used to interpret the data accurately, and even add more robust processing (in a new PEP, of course!), etc. Then in the future this interface could be used as the basis of a more robust API. With good design (and luck) it might be subclassable or extensible to a path object API, for example. PEP 383 on the other hand is a dead end as it stands. AFAICS it gives the best possible treatment of conversion of OS data to plain string, but we've already got developers lining up to say "I can't use it". :-(
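Stephen's "stringlike object annotated with the information about how it was decoded" could be sketched roughly like this (a hypothetical design for illustration only, not an actual or proposed API; the class and method names are invented):

```python
class OSName(str):
    """A str subclass that remembers the bytes and encoding it came
    from, so conversion back to the OS form is explicit."""

    def __new__(cls, raw: bytes, encoding: str,
                errors: str = 'surrogateescape'):
        self = super().__new__(cls, raw.decode(encoding, errors))
        self.raw = raw            # the original OS bytes
        self.encoding = encoding  # how they were decoded
        return self

    def os_bytes(self) -> bytes:
        # Explicit, lossless conversion back to the original bytes.
        return self.raw

name = OSName(b'caf\xe9', 'utf-8')   # not valid UTF-8 input
assert name == 'caf\udce9'           # usable as a plain str
assert name.os_bytes() == b'caf\xe9' # provenance is preserved
```

The point of the design is that the annotated object behaves as a string everywhere, while the lossy step (treating it as plain text) and the lossless step (going back to bytes) are both visible in the code.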