Tom Christiansen <[email protected]> added the comment:
Ezio Melotti <[email protected]> wrote
on Sun, 14 Aug 2011 07:15:09 -0000:
>> Unicode says you can't put surrogates or noncharacters in a
>> UTF-anything stream. It's a bug to do so and pretend it's a
>> UTF-whatever.
> The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> codec was following RFC 2279, it was producing valid UTF-8. With RFC
> 3629 a number of things changed in a non-backward compatible way.
> Therefore we couldn't just change the behavior of the UTF-8 codec nor
> rename it to something else in Python 2. We had to wait till Python 3
> in order to fix it.
I'm a bit confused on this. You no longer fix bugs in Python 2?
I've dug out the references that state that you are not allowed to do things the
way you are doing them. This is from the published Unicode Standard version
6.0.0, chapter 3, Conformance. It is a very important chapter.
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
Python is in violation of that published Standard by interpreting noncharacter
code points as abstract characters and tolerating them in character encoding
forms like UTF-8 or UTF-16. The text quoted below explains that conformant
processes are forbidden from doing this.
Code Points Unassigned to Abstract Characters
C1 A process shall not interpret a high-surrogate code point or a
    low-surrogate code point as an abstract character.
    · The high-surrogate and low-surrogate code points are designated for
      surrogate code units in the UTF-16 character encoding form. They are
      unassigned to any abstract character.
==> C2 A process shall not interpret a noncharacter code point as an abstract
    character.
    · The noncharacter code points may be used internally, such as for
      sentinel values or delimiters, but should not be exchanged publicly.
C3 A process shall not interpret an unassigned code point as an abstract
    character.
    · This clause does not preclude the assignment of certain generic
      semantics to unassigned code points (for example, rendering with a
      glyph to indicate the position within a character block) that allow
      for graceful behavior in the presence of code points that are outside
      a supported subset.
    · Unassigned code points may have default property values. (See D26.)
    · Code points whose use has not yet been designated may be assigned to
      abstract characters in future versions of the standard. Because of
      this fact, due care in the handling of generic semantics for such code
      points is likely to provide better robustness for implementations that
      may encounter data based on future versions of the standard.
Next we have exactly how something you call UTF-{8,16,32} must be formed.
*This* is the Standard against which these things are measured; it is not the
RFC. You are of course perfectly free to say you conform to this and that RFC,
but you must not say you conform to the Unicode Standard when you don't. These
are different things. I feel it does users a grave disservice to ignore the
Unicode Standard in this, and sheer casuistry to rely on an RFC definition
while ignoring the Unicode Standard whence it originated, because this borders
on being intentionally misleading.
Character Encoding Forms
C8 When a process interprets a code unit sequence which purports to be in
    a Unicode character encoding form, it shall interpret that code unit
    sequence according to the corresponding code point sequence.
==> · The specification of the code unit sequences for UTF-8 is given in D92.
    · The specification of the code unit sequences for UTF-16 is given in D91.
    · The specification of the code unit sequences for UTF-32 is given in D90.
C9 When a process generates a code unit sequence which purports to be in a
    Unicode character encoding form, it shall not emit ill-formed code unit
    sequences.
    · The definition of each Unicode character encoding form specifies the
      ill-formed code unit sequences in the character encoding form. For
      example, the definition of UTF-8 (D92) specifies that code unit
      sequences such as <C0 AF> are ill-formed.
==> C10 When a process interprets a code unit sequence which purports to be in
    a Unicode character encoding form, it shall treat ill-formed code unit
    sequences as an error condition and shall not interpret such sequences
    as characters.
    · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be
      followed by a code unit of the form 10xxxxxx₂. A sequence such as
      110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated. When
      faced with this ill-formed code unit sequence while transforming or
      interpreting text, a conformant process must treat the first code unit
      110xxxxx₂ as an illegally terminated code unit sequence--for example,
      by signaling an error, filtering the code unit out, or representing
      the code unit with a marker such as U+FFFD replacement character.
    · Conformant processes cannot interpret ill-formed code unit sequences.
      However, the conformance clauses do not prevent processes from
      operating on code unit sequences that do not purport to be in a
      Unicode character encoding form. For example, for performance reasons
      a low-level string operation may simply operate directly on code
      units, without interpreting them as characters. See, especially, the
      discussion under D89.
    · Utility programs are not prevented from operating on "mangled" text.
      For example, a UTF-8 file could have had CRLF sequences introduced at
      every 80 bytes by a bad mailer program. This could result in some
      UTF-8 byte sequences being interrupted by CRLFs, producing illegal
      byte sequences. This mangled text is no longer UTF-8. It is
      permissible for a conformant program to repair such text, recognizing
      that the mangled text was originally well-formed UTF-8 byte sequences.
      However, such repair of mangled data is a special case, and it must
      not be used in circumstances where it would cause security problems.
      There are important security issues associated with encoding
      conversion, especially with the conversion of malformed text. For more
      information, see Unicode Technical Report #36, "Unicode Security
      Considerations."
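As a rough illustration of C9 and C10, the sketch below feeds the ill-formed
sequences mentioned in the quoted text to the UTF-8 codec of a current
Python 3. The helper name try_decode is made up for this example, and the
behavior shown is that of Python 3, not of the Python 2 codec under discussion.

```python
def try_decode(raw, errors="strict"):
    """Decode bytes as UTF-8; return the text, or the exception on failure."""
    try:
        return raw.decode("utf-8", errors)
    except UnicodeDecodeError as exc:
        return exc


# <C0 AF> is the ill-formed sequence cited under C9.
overlong = try_decode(b"\xc0\xaf")

# A lead byte of the form 110xxxxx followed by 0xxxxxxx, the C10 example.
truncated = try_decode(b"\xc3\x41")

# The same input, using the U+FFFD replacement strategy that C10 permits.
replaced = try_decode(b"\xc3\x41", errors="replace")

# A lone surrogate is refused on the encoding side as well.
try:
    "\ud800".encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

print(type(overlong).__name__, type(truncated).__name__)
print(repr(replaced), surrogate_rejected)
```

With errors="replace" only the offending lead byte is replaced; the following
0x41 still decodes as "A", which matches the "treat the first code unit as an
illegally terminated code unit sequence" wording of C10.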
Here is the part that explains why Python narrow builds are actually UTF-16,
not UCS-2, and why its documentation needs to be updated:
D89 In a Unicode encoding form: A Unicode string is said to be in a
    particular Unicode encoding form if and only if it consists of a
    well-formed Unicode code unit sequence of that Unicode encoding form.
    · A Unicode string consisting of a well-formed UTF-8 code unit sequence
      is said to be in UTF-8. Such a Unicode string is referred to as a
      valid UTF-8 string, or a UTF-8 string for short.
    · A Unicode string consisting of a well-formed UTF-16 code unit sequence
      is said to be in UTF-16. Such a Unicode string is referred to as a
      valid UTF-16 string, or a UTF-16 string for short.
    · A Unicode string consisting of a well-formed UTF-32 code unit sequence
      is said to be in UTF-32. Such a Unicode string is referred to as a
      valid UTF-32 string, or a UTF-32 string for short.
==> Unicode strings need not contain well-formed code unit sequences under all
    conditions. This is equivalent to saying that a particular Unicode string
    need not be in a Unicode encoding form.
    · For example, it is perfectly reasonable to talk about an operation
      that takes the two Unicode 16-bit strings, <004D D800> and
      <DF02 004D>, each of which contains an ill-formed UTF-16 code unit
      sequence, and concatenates them to form another Unicode string
      <004D D800 DF02 004D>, which contains a well-formed UTF-16 code unit
      sequence. The first two Unicode strings are not in UTF-16, but the
      resultant Unicode string is.
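The concatenation example from D89 can be tried out at the code unit level.
This is only a sketch against a current Python 3, whose str type no longer has
narrow builds, so the 16-bit code units are modeled as tuples of integers; the
helper names decode_units and is_well_formed are invented for the example.

```python
def decode_units(units):
    """Pack 16-bit code units big-endian and decode them strictly as UTF-16."""
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    return raw.decode("utf-16-be")


def is_well_formed(units):
    """True if the code unit sequence is well-formed UTF-16."""
    try:
        decode_units(units)
        return True
    except UnicodeDecodeError:
        return False


first = (0x004D, 0xD800)   # ends with an unpaired high surrogate
second = (0xDF02, 0x004D)  # starts with an unpaired low surrogate
joined = first + second    # <004D D800 DF02 004D>

print(is_well_formed(first), is_well_formed(second), is_well_formed(joined))
print([f"U+{ord(ch):04X}" for ch in decode_units(joined)])
```

Neither half decodes on its own, but the joined buffer does: the pair
<D800 DF02> becomes the single code point U+10302.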
[...]
D14 Noncharacter: A code point that is permanently reserved for internal
    use and that should never be interchanged. Noncharacters consist of the
    values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values
    U+FDD0..U+FDEF.
    · For more information, see Section 16.7, Noncharacters.
    · These code points are permanently reserved as noncharacters.
D15 Reserved code point: Any code point of the Unicode Standard that is
    reserved for future assignment. Also known as an unassigned code point.
    · Surrogate code points and noncharacters are considered assigned code
      points, but not assigned characters.
    · For a summary classification of reserved and other types of code
      points, see Table 2-3.
In general, a conforming process may indicate the presence of a code point
whose use has not been designated (for example, by showing a missing glyph in
rendering or by signaling an appropriate error in a streaming protocol), even
though it is forbidden by the standard from interpreting that code point as an
abstract character.
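The ranges given in D14 translate directly into a small predicate. The
following is a minimal sketch; the function name is_noncharacter is an
assumption for illustration and is not part of Python's standard library.

```python
def is_noncharacter(cp):
    """True if cp is a noncharacter code point as defined by D14."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n in 0..0x10.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

# 32 code points in U+FDD0..U+FDEF, plus 2 for each of the 17 planes.
print(len(noncharacters))
```

Counting them this way gives the 66 noncharacters that the Standard reserves.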
Here's how I read all that.
The noncharacters and the unpaired surrogates are illegal for interchange, and
their presence in a UTF means that that UTF is not conformant to the
requirements for what a UTF shall contain. Nonetheless, internally it is
necessary that all code points, even noncharacter code points and surrogates,
be representable, and doing so does not mean that you are no longer in that
encoding form. However, you must not allow such things into a UTF stream,
because doing so means that that stream is no longer a UTF stream.
That's why I say that you are out of conformance by having encoders and
decoders of UTF streams tolerate noncharacters. You are not allowed to call
something a UTF and do non-UTF things with it, because this is in violation of
conformance requirement C2. Therefore you must either (1) change what you call
the thing that does the nonconforming thing, or (2) change it so that it no
longer does the nonconforming thing. If you do neither, then Python no longer
conforms to the formal requirements for handling such things as these are
defined by the Unicode Standard, and therefore that version of Python is no
longer conformant to the version of the Unicode Standard that it purports
conformance to. And yes, that's a long way of saying it's lying.
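To make option (2) concrete, here is a hedged sketch of an encoder wrapper that
refuses noncharacters before handing text to the built-in codec. The names
NoncharacterError and encode_for_interchange are hypothetical and exist only in
this example. It is written against a current Python 3, whose strict UTF-8
codec already rejects lone surrogates but still encodes noncharacters such as
U+FFFF.

```python
class NoncharacterError(ValueError):
    """Hypothetical error for text that contains a noncharacter."""


def encode_for_interchange(text):
    """Encode text as UTF-8, refusing noncharacters and lone surrogates."""
    for index, ch in enumerate(text):
        cp = ord(ch)
        # Noncharacter ranges as listed in D14 of the Unicode Standard.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            raise NoncharacterError(
                "noncharacter U+%04X at index %d" % (cp, index)
            )
    # The built-in strict codec raises UnicodeEncodeError on lone surrogates.
    return text.encode("utf-8")


# The stock codec accepts a noncharacter without complaint ...
print("\uffff".encode("utf-8"))

# ... while the wrapper refuses it.
try:
    encode_for_interchange("abc\uffff")
except NoncharacterError as exc:
    print(exc)
```

Whether such a check belongs in the codec itself or in a separately named one
is exactly the choice between options (1) and (2).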
It's also why having noncharacters, including surrogates, in memory does *not*
suddenly mean that they are not stored in a UTF, because you have to be able
to do that to build up buffers, per the concatenation example in definition
D89. Therefore, Python uses UTF-16 internally and should not say it uses
UCS-2, because that is inaccurate and incorrect; in short, it's wrong. That
doesn't help anybody.
At least, that's how I read the Unicode Standard. Perhaps a more careful
reading than mine would admit alternate interpretations. If you have not
reread its Chapter 3 of late in its entirety, you probably want to do so.
There is quite a bit of material there that is fundamental to any process that
claims to be conformant with the Unicode Standard.
I hope that makes sense. These things can be extremely difficult to read, for
they are subtle and quick to anger. :)
--tom
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________