Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman Tue, 28 Apr 2009 15:53:16 -0700

On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis:

Glenn Linderman wrote:

On approximately 4/28/2009 1:25 PM, came the following characters from
the keyboard of Martin v. Löwis:

The UTF-8b representation suffers from the same potential ambiguities as
the PUA characters...

Not at all the same ambiguities. Here, again, the two choices:


A. use PUA characters to represent undecodable bytes, in particular for
   UTF-8 (the PEP actually never proposed this to happen).
   This introduces an ambiguity: two different files in the same
   directory may decode to the same string name, if one has the PUA
   character, and the other has a non-decodable byte that gets decoded
   to the same PUA character.

B. use UTF-8b, representing the byte with ill-formed surrogate codes.
   The same ambiguity does *NOT* exist. If a file on disk already
   contains an invalid surrogate code in its file name, then the UTF-8b
   decoder will recognize this as invalid, and decode it byte-for-byte,
   into three surrogate codes. Hence, the file names that are different
   on disk are also different in memory. No ambiguity.

C. A file on disk with the invalid surrogate code, accessed via the str
interface (where no decoding happens), matches in memory the file on disk
with the byte that translates to the same surrogate, accessed via the
bytes interface.  Ambiguity.
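
For illustration, cases B and C can be sketched in Python 3. The sketch assumes that the "surrogateescape" error handler behaves like the UTF-8b decoding discussed here and that the file system encoding is UTF-8. The file names and the helper decode_name are hypothetical.

```python
# Assumption: "surrogateescape" stands in for the UTF-8b decoding under discussion.

def decode_name(raw: bytes) -> str:
    """Decode a file name from the bytes interface, escaping undecodable bytes."""
    return raw.decode("utf-8", "surrogateescape")

# Case B: two different byte names seen through the bytes interface.
lone_byte = b"\x80"                    # a single undecodable byte
encoded_surrogate = b"\xed\xb2\x80"    # ill-formed UTF-8 spelling of U+DC80

name_from_lone_byte = decode_name(lone_byte)          # one escape surrogate
name_from_surrogate = decode_name(encoded_surrogate)  # three escape surrogates

# Different on disk, different in memory, and both round-trip exactly.
distinct_in_memory = name_from_lone_byte != name_from_surrogate

# Case C: a name delivered by a str interface with no decoding step at all
# (hypothetical) can equal the decoded form of a different byte name.
name_from_str_interface = "\udc80"
collides = name_from_str_interface == name_from_lone_byte

# ascii() avoids printing lone surrogates, which most consoles cannot encode.
print(ascii(name_from_lone_byte), ascii(name_from_surrogate), collides)
```

The collision in case C only arises when names from both interfaces end up in the same comparison; within the bytes interface alone, the mapping stays one-to-one.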


Is that an alternative to A and B?


I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used.

On a Windows system, perhaps the ambiguous case would be the use of the str API and bytes APIs producing different memory names for the same file that contains a (Unicode-illegal) half surrogate. The half-surrogate would seem to get decoded to 3 half surrogates if accessed via the bytes interface, but only one via the str interface. The version with 3 half surrogates could match another name that actually contains 3 half surrogates, that is accessed via the str interface.
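
A rough sketch of that Windows scenario follows. It rests on a loud assumption: that the bytes interface would hand out the name as UTF-8 with the half surrogate passed through unchanged (modelled here with Python 3's "surrogatepass" handler) and that the result is then decoded as the PEP describes (modelled with "surrogateescape"). Neither the PEP nor this message confirms that Windows behaves this way, and the file names are hypothetical.

```python
# Assumption: the bytes API spells a lone half surrogate as ill-formed UTF-8.
file_x_via_str = "\udc80"   # file X as returned by the str interface

# What the assumed bytes interface would return for the same file X.
file_x_raw = file_x_via_str.encode("utf-8", "surrogatepass")

# Decoding those bytes PEP-style escapes each byte separately.
file_x_via_bytes = file_x_raw.decode("utf-8", "surrogateescape")

# A different file Y whose real name, via the str interface, is three half surrogates.
file_y_via_str = "\udced\udcb2\udc80"

same_file_two_names = file_x_via_bytes != file_x_via_str    # X has two names
two_files_one_name = file_x_via_bytes == file_y_via_str     # X and Y collide

print(len(file_x_via_str), len(file_x_via_bytes), two_files_one_name)
```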

I can't actually tell by reading the PEP whether it affects Windows bytes interfaces or is only implemented on POSIX, so that POSIX has a str interface.

If it is only implemented on POSIX, then the current scheme (now escaping the hundreds of escape codes) could work, within a single platform... but it would still suffer from displaying garbage (sequences of replacement characters) in file listings displayed or printed. There is no way, once the string is adjusted to contain replacement characters for display, to distinguish one file name from another, if they are identical except for a same-length sequence of different undecodable bytes.
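
The display problem can be sketched as follows, assuming the escaped bytes appear as surrogates in the range U+DC80..U+DCFF and that a listing substitutes U+FFFD for each of them. The helper for_display and the file names are hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def for_display(name: str) -> str:
    """Swap each escape surrogate for U+FFFD so the name can be shown or printed."""
    return "".join(
        REPLACEMENT if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in name
    )

# Two names that differ only in a same-length run of undecodable bytes.
first = b"report-\xff\xfe.txt".decode("utf-8", "surrogateescape")
second = b"report-\x80\x81.txt".decode("utf-8", "surrogateescape")

different_in_memory = first != second
same_on_screen = for_display(first) == for_display(second)

print(different_in_memory, same_on_screen)
```

The names remain distinct in memory; the information is lost only in the form shown to the user.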

The concept of a function that allows the same decoding and encoding process for 3rd party interfaces is still missing from the PEP; implementation of the PEP would require that all interfaces to 3rd party software that accept file names would have to be transcoded by the interface layer. Or else such software would have to use the bytes interfaces directly, and if they do, there is no need for the PEP.
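
A minimal sketch of what such a pair of functions might look like, assuming a UTF-8 file system encoding and the "surrogateescape" handler. The names name_to_bytes, name_from_bytes and third_party_echo are hypothetical and do not come from the PEP text under discussion; later Python releases (3.2 onward) provide os.fsencode and os.fsdecode for this role.

```python
FS_ENCODING = "utf-8"  # assumption: the encoding the system interfaces use

def name_to_bytes(name: str) -> bytes:
    """Turn an escaped str file name back into the original bytes."""
    return name.encode(FS_ENCODING, "surrogateescape")

def name_from_bytes(raw: bytes) -> str:
    """Turn raw file-name bytes into a str, escaping undecodable bytes."""
    return raw.decode(FS_ENCODING, "surrogateescape")

def third_party_echo(raw_name: bytes) -> bytes:
    """Hypothetical stand-in for a 3rd party library that only speaks bytes."""
    return raw_name

# The interface layer transcodes on the way in and on the way out.
original = b"caf\xe9.txt"                 # Latin-1 bytes, not valid UTF-8
as_str = name_from_bytes(original)
returned = name_from_bytes(third_party_echo(name_to_bytes(as_str)))

print(returned == as_str)
```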

So I see the PEP as a partial solution to a limited problem, that on the one hand potentially produces indistinguishable sequences of replacement characters in filenames, rather than the mojibake (which is at least distinguishable), and on the other hand, doesn't help software that also uses 3rd party libraries to avoid the use of bytes APIs for accessing file names. There are other encodings that produce more distinguishable mojibake, and would work in the same situations as the PEP.


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
