Peter Kirk continued:

> >Once again, people are falling afoul of the subtle distinctions
> >that the Unicode conformance clauses are attempting to make.
> >  
> >
> In that case the distinctions are too subtle and need to be clarified. 
> C9 states that "no process can assume that another process will make a 
> distinction between two different, but canonical-equivalent character 
> sequences." 

No, C9 states:

<quote>
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
</quote>

You are quoting out of an explanatory bullet to clause C9. And
in that context, it should be perfectly clear that the "distinctions"
we are talking about are distinctions of interpretation. The
*subsection* of Section 3.2 that C9 occurs in is also labelled
"Interpretation".

Quoting statements from the standard out of context, and then
asserting that "distinction" means something other than it
clearly does when seen *in context* isn't helping to make
your case any.

> If that in fact should be "no process can assume that 
> another process will *give different interpretations to* two different, 
> but canonical-equivalent character sequences", then that is what should 
> be written. 

O.k., that kind of explicitness might help others understand
the text.

> And even then the word "interpretation" needs to be clearly 
> defined, see below.

"Interpretation" has been *deliberately* left undefined. It falls
back to its general English usage, because attempting a
technical definition of "interpretation" in the context of
the Unicode Standard runs too far afield from the intended
area of standardization. The UTC would end up bogged down
in linguistic and semiotic theory attempting to nail this
one down.

What *is* clear is that a "distinction in interpretation of
a character or character sequence" cannot be confused, by
any careful reader of the standard, with "difference in
code point or code point sequence". The latter *is* defined
and totally unambiguous in the standard.
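
To make that concrete, here is a minimal sketch in Python (the
variable names are mine, not anything from the standard) of just
how unambiguous the code-point-level distinction is:

  precomposed = "\u00E9"        # <U+00E9>          "é"
  decomposed  = "\u0065\u0301"  # <U+0065, U+0301>  "é"

  # Different lengths, different code points -- unambiguously
  # distinct as Unicode strings.
  print(len(precomposed))                    # 1
  print(len(decomposed))                     # 2
  print(precomposed == decomposed)           # False
  print([hex(ord(c)) for c in decomposed])   # ['0x65', '0x301']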

> >It is perfectly conformant with the Unicode Standard to assert
> >that <U+00E9> "é" and <U+0065, U+0301> "é" are different
> >Unicode strings. They *are* different Unicode strings. They
> >contain different encoded characters, and they have different
> >lengths. ...
> >
> But they are "two different, but canonical-equivalent character 
> sequences", and as such "no process can assume that another process will 
> make a distinction between" them.
         ^^^^^^^^^^^
         distinction in interpretation
         
You are quoting out of context again. 

> C9 does not say that certain 
> distinctions may be assumed and others may not.

If you read it right, it absolutely *does* indicate that.

> >... And any Unicode-conformant process that treated the
> >second string as if it had only one code unit and only
> >one encoded character in it would be a) stupid, and b)
> >non-conformant. A Unicode process can not only assume that
> >another Unicode-conformant process can make this distinction --
> >it should *expect* it to or will run into interoperability
> >problems.
> >  
> >
> Well, this goes entirely against how I had read and understood the 
> conformance clauses. The problem is, what does "interpretation" mean? 

"Interpretation" means..., well, it means "what it means".

If you want to bandy semiotics, be my guest, but the Unicode
Standard is not a semiotic standard. It is a character encoding
standard.


> >What canonical equivalence is about is making non-distinctions
> >in the *interpretation* of equivalent sequences. No Unicode-
> >conformant process should assume that another process will
> >systematically distinguish a meaningful interpretation
> >difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
> >they both represent the *same* abstract character, namely
> >an e-acute. And because of the conformance requirements
> >in the Unicode Standard, I am not allowed to call some
> >other process wrong if it claims to be handing me an "e-acute"
> >and delivers <U+0065, U+0301> when I was expecting to
> >see just <U+00E9>. ...
> >
> Well, the question here hangs on the meaning of "interpretation". I 
> understood "interpretation" to include such matters as determining the 
> number of characters in a string (although I carefully distinguished 
> that from determining the number of memory units required to store it, 
> which depends also on the encoding form and is at a quite different 
> level). 

Well, then please correct your interpretation of interpretation.

<U+00E9> has one code point in it. It has one encoded character in it.

<U+0065, U+0301> has two code points in it. It has two encoded
characters in it.

The two sequences are distinct and distinguished and
distinguishable -- in terms of their code point or character
sequences.

The two sequences are canonically equivalent. They are not
*interpreted* differently, since they both *mean* the same
thing -- they are both interpreted as referring to the letter of
various Latin alphabets known as "e-acute".

*That* is what the Unicode Standard "means" by canonical equivalence.
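
If it helps, the same point can be expressed with Python's
unicodedata module (one implementation of the normalization
algorithm; the example itself is mine, not text from the standard):

  import unicodedata

  precomposed = "\u00E9"        # <U+00E9>
  decomposed  = "\u0065\u0301"  # <U+0065, U+0301>

  # Distinct as code point sequences...
  print(precomposed == decomposed)                     # False

  # ...but canonically equivalent: both normalize to the same
  # sequence, because both are interpreted as the same abstract
  # character, e-acute.
  print(unicodedata.normalize("NFC", precomposed) ==
        unicodedata.normalize("NFC", decomposed))      # True
  print(unicodedata.normalize("NFD", precomposed) ==
        unicodedata.normalize("NFD", decomposed))      # True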

> I would understand a different character count to be "a 
> meaningful interpretation difference". As for the question "is this 
> string normalised?", at the interpretation level I have in mind that is 
> in fact a meaningless question because normalisation is, or should be, 
> hidden at a lower level.

Then you are still mixing levels. You are operating here in terms
of "user-perceived characters", but those are *not* a primitive
of the Unicode Standard, and are not well-defined there, precisely
because the character encoding per se cannot be and is not based
entirely on psychological memes residing in the heads of the users
of various written orthographies. It isn't arbitrarily disconnected
from meaningful units that end users think of as "letters" or
"syllables" or other useful graphological units, but those are
not the determinative factors for the encoding itself nor for its
statement of conformance requirements.

If you are operating at a level where the question "is this string
normalised" is meaningless, then you are talking about text
content and not about the level where the conformance requirements
of the Unicode Standard are relevant. No wonder you and others
are confused.

Of course, if I look on a printed page of text and see the word
"café" rendered there as a token, it is meaningless to talk about
whether the é is normalized or not. It just is a manifest token
of the letter é, rendered on the page. The whole concept of
Unicode normalization is irrelevant to a user at that level. But
you cannot infer from that that normalization distinctions cannot
be made conformantly in the encoded character stores for
digital representation of text -- which is the relevant field
where Unicode conformance issues apply.
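
And at that level, the question *is* answerable. A small sketch,
again in Python (unicodedata.is_normalized requires Python 3.8 or
later; the strings are just my example data):

  import unicodedata

  decomposed  = "cafe\u0301"   # "café" stored with <U+0065, U+0301>
  precomposed = "caf\u00E9"    # "café" stored with <U+00E9>

  # "Is this string normalized?" is a meaningful, well-defined
  # question about the encoded character store:
  print(unicodedata.is_normalized("NFC", decomposed))    # False
  print(unicodedata.is_normalized("NFC", precomposed))   # True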

> But it seems that you are viewing the whole thing from a different level 
> from me. I am looking on as a user or an application programmer. You are 
> looking at Unicode internally, as a systems programmer. At that lower 
> level, yes, of course normalisation forms have to be distinguished 
> because that is the level at which normalisation is carried out.

It isn't application programmer versus systems programmer. It is
digital representation of text in terms of encoded characters
versus end user interaction with rendered (printed, displayed)
text.

> 
> >... The whole point of normalization is
> >to make life for implementations worth living in such a

> Well, there is an interesting philosophical question here. With a normal 
> literary text, the interpretation of it intended by the author is 
> generally considered to be definitive. Humpty Dumpty was right when 
> talking about what he had written. But that is not true of laws, and I 
> suppose that it is similarly not true of standards. 

Standards are not laws. (Nor are they literary texts.) They
are technical specifications which aim at enabling
interoperable implementations of whatever they are
a standard for. (At least the kinds of IT standards we
are talking about here.)

Standards are not adjudicated by case law. They are not
interpreted by judges. If something is unclear in a standard,
that is generally simply reported back as a defect to the
standardization committee, which attempts to reach a
consensus regarding what the actual intent was and then
instructs the editor(s) to rewrite things so that the
intent (which often turns out to be what everybody is
implementing anyway) is made clearer. Or in some cases
(see IEEE standards for examples), the standards development
organization may issue a formal "clarification" spelling out
the interpretation of a point that was unclear.

> There is assumed to 
> be some objectivity to the language in which they are written. The 
> implication is that your assertion that what you have written is 
> conformant cannot be trusted a priori but must be tested against the 
> text of the standard as written and agreed. In principle any dispute 
> might have to be settled by a judge, and on the basis only of what is 
> written, not of what you claim was intended. While I certainly don't 
> intend to take this to court, I think I would have a reasonable case if 
> I did!

I don't think it is reasonable. If anything, it is approaching
harebrained here (sorry for the ad hominem), because it doesn't
reflect the reality of IT standards development.

What is often clearest to the standards development committee
is what the intended behavior is to be. Writing that into the
formal text of the standard, on the other hand, may stress the
rhetorical capabilities of the authors, and you can end up with
text that doesn't necessarily do the intent justice. Hence the
need, for example, to keep rewriting the conformance clauses
of the Unicode Standard until the character model finally
started to gel and make sense to people implementing the standard.

Trying to go legalistic, and trying to give objective
primacy to the text of the standard, especially when you
interpret the text differently than people on the originating
committee who *wrote* the text, and in the face of counteropinions
from engaged members of the responsible committee, is not,
in my opinion, doing anybody any favors here.

> 
> Of course it is possible for those conformance clauses to be rewritten 
> (they aren't fixed by the stability policy, are they?). 

Nope.

> That is probably 
> what is necessary. 

In general, yes. If people are misinterpreting some key part of
the conformance requirements of the standard, then both the
UTC and the editors are interested in ensuring that the
wording of the text is not encouraging such (mis)interpretations.

> Such a rewrite would require a change to the sentence 
> "no process can assume that another process will make a distinction 
> between two different, but canonical-equivalent character sequences" 

Could be, but note, as above, that this is already in an explanatory
bullet, and is not the normative part of C9. Certainly if it
is causing misinterpretation, the editors can address that,
but I'm hearing other people on the list who are not having
trouble coming to the correct conclusions in this particular
instance.

> and 
> a proper definition of "interpretation".

Won't happen. See above.


> Well, I had stated such things more tentatively to start with, asking 
> for contrary views and interpretations, but received none until now 
> except for Mark's very generalised implication that I had said something 
> wrong (and, incorrectly, that I hadn't read the relevant part of the 
> standard). Please, those of you who do know what is correct, keep us on 
> the right path. Otherwise the confusion will spread.

I'll try. :-)

--Ken


