On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis:
Glenn Linderman wrote:
On approximately 4/27/2009 12:42 PM, came the following characters from
the keyboard of Martin v. Löwis:
It's a private use area. It will never carry an official character
assignment.
I know that U+F0000 - U+FFFFF is a private use area.  I don't find a
definition of U+F01xx to know what the notation means.  Are you picking
a particular character within the private use area, or a particular
range, or what?
It's a range. The lower-case 'x' denotes a variable half-byte, ranging
from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code
points.
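
As a quick check of the arithmetic above, here is a minimal sketch in Python 3 syntax. The variable names are illustrative only, and the use of the `unicodedata` general category "Co" (private use) is my own way of confirming the range; it is not something from the thread.

```python
import unicodedata

# U+F01xx: each 'x' stands for one hex digit (0..F), so the notation
# covers U+F0100 through U+F01FF inclusive.
code_points = range(0xF0100, 0xF01FF + 1)
count = len(code_points)

# Every code point in the range should be in general category "Co"
# (private use), since the range lies inside U+F0000..U+FFFFD.
all_private_use = all(
    unicodedata.category(chr(cp)) == "Co" for cp in code_points
)

print(count, all_private_use)
```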

So you only need 128 code points, so there is something else unclear.

(please understand that this is history now, since the PEP has stopped
using PUA characters).


Yes. Having finally found the latest PEP (at least I hope the one at python.org is the latest; it has stopped using PUA characters, anyway), I can confirm that this is history. But the same issue applies to the range of half-surrogates.


No. You seem to assume that all bytes < 128 decode successfully always.
I believe this assumption is wrong, in general:

py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode.
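
For readers on Python 3, the same experiment can be sketched as follows. This is a rough equivalent rather than the original session: it assumes the `iso-2022-jp` codec behaves as it did in 2.x for this input, and the names `data`, `all_below_128` and `decode_failed` are illustrative.

```python
# ESC $ B switches to a two-byte character set; the pair "' " that follows
# is not a valid two-byte sequence there; ESC ( B switches back to ASCII.
data = b"\x1b$B' \x1b(B"

# Every byte value is below 128 ...
all_below_128 = all(byte < 128 for byte in data)

# ... yet the decode is still expected to fail.
try:
    data.decode("iso-2022-jp")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(all_below_128, decode_failed)
```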


Indeed, that was the missing piece. I'd forgotten about the encodings that use escape sequences, rather than UTF-8, and DBCS. I don't think those encodings are permitted on POSIX file systems, but I suppose they could sneak in via environment variable values and the like.

The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names.
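
The ambiguity can be sketched in Python 3, assuming the error handler that later shipped as `surrogateescape` (which I take to be the outcome of the PEP discussed here; the thread itself does not name it). The sample byte string and the variable names are hypothetical.

```python
# An undecodable byte is mapped to a lone surrogate in U+DC80..U+DCFF.
raw = b"caf\xff"  # not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")

# A string built by some other means that happens to contain the same
# lone surrogate, intended as a literal code point rather than a byte.
foreign = "caf\udcff"

# The two strings cannot be told apart ...
same_string = name == foreign

# ... so encoding turns the "foreign" surrogate into the raw byte 0xFF.
encoded_foreign = foreign.encode("utf-8", "surrogateescape")

# A lone surrogate outside U+DC80..U+DCFF is not handled and is rejected.
try:
    "caf\udc41".encode("utf-8", "surrogateescape")
    out_of_range_rejected = False
except UnicodeEncodeError:
    out_of_range_rejected = True

print(same_string, encoded_foreign, out_of_range_rejected)
```

Bytes always survive a decode/encode round trip in this sketch; the trouble is only with strings that already contain lone surrogates from elsewhere.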


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking