On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis:
Glenn Linderman wrote:
On approximately 4/28/2009 1:25 PM, came the following characters from
the keyboard of Martin v. Löwis:
The UTF-8b representation suffers from the same potential ambiguities as
the PUA characters...
Not at all the same ambiguities. Here, again, the two choices:

A. use PUA characters to represent undecodable bytes, in particular for
   UTF-8 (the PEP actually never proposed this to happen).
   This introduces an ambiguity: two different files in the same
   directory may decode to the same string name, if one has the PUA
   character, and the other has a non-decodable byte that gets decoded
   to the same PUA character.

B. use UTF-8b, representing the byte with ill-formed surrogate codes.
   The same ambiguity does *NOT* exist. If a file on disk already
   contains an invalid surrogate code in its file name, then the UTF-8b
   decoder will recognize this as invalid, and decode it byte-for-byte,
   into three surrogate codes. Hence, the file names that are different
   on disk are also different in memory. No ambiguity.
C. File on disk with the invalid surrogate code, accessed via the str
interface, no decoding happens, matches in memory the file on disk with
the byte that translates to the same surrogate, accessed via the bytes
interface.  Ambiguity.
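
For concreteness, here is a minimal sketch of what case B claims. It assumes the "surrogateescape" error handler that Python 3 ships for this scheme (that name is not used in this thread) and a UTF-8 file system encoding; the file names are made up for illustration.

```python
# Two distinct on-disk names, as a bytes interface would report them.
undecodable = b"file\xff"              # 0xFF never occurs in valid UTF-8
lone_surrogate = b"file\xed\xb3\xbf"   # bytes that would spell U+DCFF; ill-formed UTF-8

# Assumption: "surrogateescape" stands in for the UTF-8b decoder discussed here.
name_a = undecodable.decode("utf-8", "surrogateescape")
name_b = lone_surrogate.decode("utf-8", "surrogateescape")

print(ascii(name_a))  # 'file\udcff'
print(ascii(name_b))  # 'file\udced\udcb3\udcbf'

# Different on disk, different in memory, and both round-trip to the original bytes.
distinct = name_a != name_b
round_trip_a = name_a.encode("utf-8", "surrogateescape")
round_trip_b = name_b.encode("utf-8", "surrogateescape")
```

As long as every name passes through the same decoder, the two files stay distinct.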

Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used.
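
A minimal sketch of that collision, again assuming the "surrogateescape" error handler of Python 3 as the decoder and made-up file names. The first name stands for a file whose name really contains a lone surrogate and is reported by a native str interface without any decoding; the second stands for a different file reported by the bytes interface.

```python
# Reported by a str interface: the name itself contains U+DCFF, no decoding happens.
from_str_api = "file\udcff"

# A different file, reported by the bytes interface with one undecodable byte,
# then decoded by the application (assumption: UTF-8 plus "surrogateescape").
from_bytes_api = b"file\xff".decode("utf-8", "surrogateescape")

# Two different files on disk, one string in memory.
collide = from_str_api == from_bytes_api
print(collide)  # True
```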

On a Windows system, perhaps the ambiguous case would be the use of the str API and bytes APIs producing different memory names for the same file that contains a (Unicode-illegal) half surrogate. The half-surrogate would seem to get decoded to 3 half surrogates if accessed via the bytes interface, but only one via the str interface. The version with 3 half surrogates could match another name that actually contains 3 half surrogates, that is accessed via the str interface.

I can't actually tell by reading the PEP whether it affects Windows bytes interfaces or is only implemented on POSIX, so that POSIX has a str interface.

If it is only implemented on POSIX, then the current scheme (now escaping the hundreds of escape codes) could work, within a single platform... but it would still suffer from displaying garbage (sequences of replacement characters) in file listings displayed or printed. There is no way, once the string is adjusted to contain replacement characters for display, to distinguish one file name from another, if they are identical except for a same-length sequence of different undecodable bytes.
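
A small sketch of the display problem. The helper for_display is hypothetical, as are the file names; it assumes that escaped bytes occupy U+DC80..U+DCFF and that a listing shows each of them as U+FFFD.

```python
def for_display(name):
    """Hypothetical helper: show each escaped byte (U+DC80..U+DCFF) as U+FFFD."""
    return "".join("\ufffd" if 0xDC80 <= ord(ch) <= 0xDCFF else ch for ch in name)

# Two names that differ only in a same-length run of undecodable bytes.
first = b"report-\xe9\xe8.txt".decode("utf-8", "surrogateescape")
second = b"report-\xe8\xe9.txt".decode("utf-8", "surrogateescape")

differ_in_memory = first != second
same_on_screen = for_display(first) == for_display(second)
print(differ_in_memory, same_on_screen)  # True True
```

The strings remain distinct in memory, but nothing in the displayed form tells the user which file is which.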

The concept of a function that allows the same decoding and encoding process for 3rd party interfaces is still missing from the PEP; implementation of the PEP would require that all interfaces to 3rd party software that accept file names would have to be transcoded by the interface layer. Or else such software would have to use the bytes interfaces directly, and if they do, there is no need for the PEP.
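
What I have in mind is something like the following pair of helpers. This is only a sketch: the names name_to_bytes and name_from_bytes are hypothetical, the default encoding is an assumption, and "surrogateescape" is the Python 3 error handler standing in for the PEP's scheme.

```python
def name_to_bytes(name, encoding="utf-8"):
    """Hypothetical helper: turn a str file name back into the original bytes."""
    return name.encode(encoding, "surrogateescape")

def name_from_bytes(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes from a 3rd party library the same way."""
    return raw.decode(encoding, "surrogateescape")

name = name_from_bytes(b"data-\xff.bin")

# An interface layer that encodes strictly cannot pass this name along.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the helper, the 3rd party code sees the bytes that are on disk.
restored = name_to_bytes(name)
print(strict_ok, restored)  # False b'data-\xff.bin'
```

Every interface layer that hands file names to 3rd party code would have to call something like this at the boundary.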

So I see the PEP as a partial solution to a limited problem, that on the one hand potentially produces indistinguishable sequences of replacement characters in filenames, rather than the mojibake (which is at least distinguishable), and on the other hand, doesn't help software that also uses 3rd party libraries to avoid the use of bytes APIs for accessing file names. There are other encodings that produce more distinguishable mojibake, and would work in the same situations as the PEP.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev