Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman Tue, 28 Apr 2009 16:09:28 -0700

On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB:

Glenn Linderman wrote:

On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB:
I've been thinking of "python-escape" only in terms of UTF-8, the only
encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
decodable.
UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent.
But if you're talking about using it with other encodings, eg
shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
half surrogates U+DC00 to U+DCFF.
This makes 256 different escape codes.

Speaking personally, I won't call them 'escape codes'. I'd use the term
'escape code' to mean a character that changes the interpretation of the
next character(s).

OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them?

My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement.


Anyway, I think we are communicating successfully.

2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
are treated as though they are undecodable bytes.

This provides escaping for the 256 different escape codes, which is lacking from the PEP.

3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
are encoded to bytes 0x00 to 0xFF.



This reverses the escaping.

4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
be produced by decoding raise an exception.



This is confusing.  Did you mean "excluding" instead of "including"?

Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception".



Yes, your rephrasing is clearer, regarding your intention.

For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce
0x00.

Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either.
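
For concreteness, MRAB's four rules can be sketched in Python. This is only an illustration, not the PEP's implementation: the names mrab_decode and mrab_encode are made up here, the sketch assumes UTF-8, and it leans on the fact that Python 3's strict UTF-8 decoder already rejects byte sequences that would decode to surrogates (which gives rule 2 for free). Rule 4 is approximated by a round-trip check, and ValueError is an arbitrary choice of exception.

```python
def mrab_decode(data: bytes, encoding: str = "utf-8") -> str:
    """Rules 1 and 2: each undecodable byte 0xNN becomes U+DCNN."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode(encoding))
            break
        except UnicodeDecodeError as err:
            # Keep the decodable prefix, escape exactly one bad byte, resume.
            out.append(data[pos:pos + err.start].decode(encoding))
            out.append(chr(0xDC00 + data[pos + err.start]))
            pos += err.start + 1
    return "".join(out)


def mrab_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Rules 3 and 4: reverse the escaping, refusing strings that decoding
    could never have produced."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)      # rule 3: U+DCNN -> byte 0xNN
        else:
            out += ch.encode(encoding)   # other lone surrogates raise here
    data = bytes(out)
    # Rule 4 (approximation): the result must decode back to the same str.
    if mrab_decode(data, encoding) != text:
        raise ValueError("str could not have been produced by decoding")
    return data
```

With this sketch, mrab_encode("\udce9") gives b"\xe9", while mrab_encode("\udc00") raises, because byte 0x00 is always decodable in UTF-8 and so U+DC00 can never come out of the decoder.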

I think I've covered all the possibilities. :-)
You might have.  Seems like there could be a simpler scheme, though...
1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed; this is easier for humans to comprehend.
2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it.
3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF, would raise an exception.
5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces.
This differs from my previous proposal in three ways:
A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then).
B. Allows for a choice of escape codepoint; the previous proposal suggested a specific one. But the final solution will only have a single one: an implementation choice, not a user choice.
C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another.
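
Rules 1 to 4 of the scheme above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names escape_decode and escape_encode are hypothetical, U+003F is used as the escape codepoint, UTF-8 is assumed as the underlying encoding, and ValueError stands in for the unspecified exception in rule 4.

```python
ESC = "?"  # U+003F; any single codepoint outside U+0100..U+01FF would do


def escape_decode(data: bytes, encoding: str = "utf-8") -> str:
    """Rules 2 and 3: double real escape codepoints, and turn each
    undecodable byte 0xPQ into ESC followed by U+01PQ."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode(encoding).replace(ESC, ESC + ESC))
            break
        except UnicodeDecodeError as err:
            good = data[pos:pos + err.start].decode(encoding)
            out.append(good.replace(ESC, ESC + ESC))
            out.append(ESC + chr(0x100 + data[pos + err.start]))
            pos += err.start + 1  # escape one byte, then resume decoding
    return "".join(out)


def escape_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Rule 4: ESC ESC -> ESC, ESC U+01PQ -> byte 0xPQ, anything else
    after ESC is an error."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode(encoding)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("escape codepoint at end of string")
        nxt = text[i + 1]
        if nxt == ESC:
            out += ESC.encode(encoding)
        elif 0x100 <= ord(nxt) <= 0x1FF:
            out.append(ord(nxt) - 0x100)
        else:
            raise ValueError("invalid codepoint after escape codepoint")
        i += 2
    return bytes(out)
```

For example, escape_decode(b"caf\xe9?.txt") yields "caf?\u01e9??.txt", and escape_encode turns that back into the original bytes. A genuine U+01E9 in the input is not preceded by the escape codepoint, so it is encoded normally and cannot be confused with an escaped byte.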
Rationale:
The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters.
Perhaps the escape character should be U+005C. ;-)



Windows users everywhere would love you for that one :)

It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters.
This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces.
This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms.



--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking