Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities as
>>> the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>>    UTF-8 (the PEP actually never proposed this to happen).
>>    This introduces an ambiguity: two different files in the same
>>    directory may decode to the same string name, if one has the PUA
>>    character, and the other has a non-decodable byte that gets decoded
>>    to the same PUA character.
>>
>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>>    The same ambiguity does *NOT* exist. If a file on disk already
>>    contains an invalid surrogate code in its file name, then the UTF-8b
>>    decoder will recognize this as invalid, and decode it byte-for-byte,
>>    into three surrogate codes. Hence, the file names that are different
>>    on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.
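To make option B concrete, here is a minimal sketch of the byte-for-byte behaviour described above. It assumes a Python 3 interpreter in which UTF-8b is available as the "surrogateescape" error handler; the two byte strings and the helper names decode_utf8b/encode_utf8b are hypothetical, chosen only for illustration.

```python
# Sketch only: assumes UTF-8b is exposed as the "surrogateescape"
# error handler of the utf-8 codec (Python 3). The helper names and
# sample file names below are hypothetical.

def decode_utf8b(raw: bytes) -> str:
    """Decode bytes, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_utf8b(name: str) -> bytes:
    """Reverse the mapping: U+DCNN becomes the single byte 0xNN again."""
    return name.encode("utf-8", errors="surrogateescape")

# File name 1: a single byte that is not valid UTF-8.
undecodable = b"\xff"

# File name 2: three bytes that look like a UTF-8 encoding of the lone
# surrogate U+DCFF. Strict UTF-8 forbids encoded surrogates, so the
# decoder treats each of these bytes as undecodable.
encoded_surrogate = b"\xed\xb3\xbf"

name_1 = decode_utf8b(undecodable)        # one surrogate code
name_2 = decode_utf8b(encoded_surrogate)  # three surrogate codes

# Different on disk, different in memory, and both round-trip.
print(ascii(name_1), ascii(name_2), name_1 != name_2)
print(encode_utf8b(name_1) == undecodable)
print(encode_utf8b(name_2) == encoded_surrogate)
```

This only exercises the decoding path of option B; it says nothing about the str-interface case raised in C.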
Is that an alternative to A and B?

Regards,
Martin