Peter Kirk continued:

> >Once again, people are falling afoul of the subtle distinctions
> >that the Unicode conformance clauses are attempting to make.
>
> In that case the distinctions are too subtle and need to be clarified.
> C9 states that "no process can assume that another process will make a
> distinction between two different, but canonical-equivalent character
> sequences."
No, C9 states:

<quote>
C9 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
</quote>

You are quoting out of an explanatory bullet to clause C9. And in that
context, it should be perfectly clear that the "distinctions" we are
talking about are distinctions of interpretation. The *subsection* of
Section 3.2 that C9 occurs in is also labelled "Interpretation".

Quoting statements from the standard out of context, and then asserting
that "distinction" means something other than it clearly does when seen
*in context*, isn't helping to make your case any.

> If that in fact should be "no process can assume that
> another process will *give different interpretations to* two different,
> but canonical-equivalent character sequences", then that is what should
> be written.

O.k., that kind of explicitness might help others understand the text.

> And even then the word "interpretation" needs to be clearly
> defined, see below.

"Interpretation" has been *deliberately* left undefined. It falls back
to its general English usage, because attempting a technical definition
of "interpretation" in the context of the Unicode Standard runs too far
afield from the intended area of standardization. The UTC would end up
bogged down in linguistic and semiotic theory attempting to nail this
one down.

What *is* clear is that a "distinction in interpretation of a character
or character sequence" cannot be confused, by any careful reader of the
standard, with a "difference in code point or code point sequence". The
latter *is* defined and totally unambiguous in the standard.

> >It is perfectly conformant with the Unicode Standard to assert
> >that <U+00E9> "é" and <U+0065, U+0301> "é" are different
> >Unicode strings. They *are* different Unicode strings. They
> >contain different encoded characters, and they have different
> >lengths. ...
>
> But they are "two different, but canonical-equivalent character
> sequences", and as such "no process can assume that another process will
> make a distinction between" them.
         ^^^^^^^^^^^
         distinction in interpretation

You are quoting out of context again.

> C9 does not say that certain
> distinctions may be assumed and others may not.

If you read it right, it absolutely *does* indicate that.

> >... And any Unicode-conformant process that treated the
> >second string as if it had only one code unit and only
> >one encoded character in it would be a) stupid, and b)
> >non-conformant. A Unicode process can not only assume that
> >another Unicode-conformant process can make this distinction --
> >it should *expect* it to, or it will run into interoperability
> >problems.
>
> Well, this goes entirely against how I had read and understood the
> conformance clauses. The problem is, what does "interpretation" mean?

"Interpretation" means..., well, it means "what it means". If you want
to bandy semiotics, be my guest, but the Unicode Standard is not a
semiotic standard. It is a character encoding standard.
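To make the distinction concrete, here is a minimal sketch in Python
(the language choice and variable names are mine, not part of the
original discussion), using the standard library's unicodedata module:

    import unicodedata

    # The two canonically equivalent spellings of "é" under discussion.
    precomposed = "\u00E9"       # <U+00E9> LATIN SMALL LETTER E WITH ACUTE
    decomposed = "\u0065\u0301"  # <U+0065, U+0301> e + COMBINING ACUTE ACCENT

    # As code point sequences, the two strings are distinct and have
    # different lengths -- and a process may conformantly say so.
    assert len(precomposed) == 1
    assert len(decomposed) == 2
    assert precomposed != decomposed

    # But they are canonically equivalent: normalizing both to the
    # same form (NFC here) yields identical sequences, reflecting the
    # fact that both are *interpreted* as the same abstract character.
    assert (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

A process that reports the differing code point counts while treating
the two strings as equivalent in meaning is making exactly the kind of
conformant distinction described above: a distinction of code point
sequence, not of interpretation.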
> >What canonical equivalence is about is making non-distinctions
> >in the *interpretation* of equivalent sequences. No Unicode-
> >conformant process should assume that another process will
> >systematically distinguish a meaningful interpretation
> >difference between <U+00E9> "é" and <U+0065, U+0301> "é" --
> >they both represent the *same* abstract character, namely
> >an e-acute. And because of the conformance requirements
> >in the Unicode Standard, I am not allowed to call some
> >other process wrong if it claims to be handing me an "e-acute"
> >and delivers <U+0065, U+0301> when I was expecting to
> >see just <U+00E9>. ...
>
> Well, the question here hangs on the meaning of "interpretation". I
> understood "interpretation" to include such matters as determining the
> number of characters in a string (although I carefully distinguished
> that from determining the number of memory units required to store it,
> which depends also on the encoding form and is at a quite different
> level).

Well, then please correct your interpretation of interpretation.

<U+00E9> has one code point in it. It has one encoded character in it.
<U+0065, U+0301> has two code points in it. It has two encoded
characters in it. The two sequences are distinct and distinguished and
distinguishable -- in terms of their code point or character sequences.

The two sequences are canonically equivalent. They are not
*interpreted* differently, since they both *mean* the same thing --
they are both interpreted as referring to the letter of various Latin
alphabets known as "e-acute". *That* is what the Unicode Standard
"means" by canonical equivalence.

> I would understand a different character count to be "a
> meaningful interpretation difference". As for the question "is this
> string normalised?", at the interpretation level I have in mind that is
> in fact a meaningless question, because normalisation is, or should be,
> hidden at a lower level.

Then you are still mixing levels. You are operating here in terms of
"user-perceived characters", but those are *not* a primitive of the
Unicode Standard, and are not well-defined there, precisely because the
character encoding per se cannot be, and is not, based entirely on
psychological memes residing in the heads of the users of various
written orthographies. It isn't arbitrarily disconnected from
meaningful units that end users think of as "letters" or "syllables" or
other useful graphological units, but those are not the determinative
factors for the encoding itself, nor for its statement of conformance
requirements.

If you are operating at a level where the question "is this string
normalised?" is meaningless, then you are talking about text content,
and not about the level where the conformance requirements of the
Unicode Standard are relevant. No wonder you and others are confused.

Of course, if I look at a printed page of text and see the word "café"
rendered there as a token, it is meaningless to talk about whether the
é is normalized or not. It just is a manifest token of the letter é,
rendered on the page. The whole concept of Unicode normalization is
irrelevant to a user at that level. But you cannot infer from that that
normalization distinctions cannot be made conformantly in the encoded
character stores for digital representation of text -- which is the
relevant field where Unicode conformance issues apply.
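To illustrate the two levels, a short sketch along the same lines, this
time assuming the third-party "regex" package (my choice of tool, not
anything mandated by the standard; its \X pattern matches one extended
grapheme cluster, roughly a "user-perceived character"):

    # Assumes the third-party "regex" package (pip install regex).
    import regex

    decomposed = "cafe\u0301"   # "café" spelled with <U+0065, U+0301>
    precomposed = "caf\u00E9"   # "café" spelled with <U+00E9>

    # At the encoded-character level, the two stores are distinct:
    print(len(decomposed), len(precomposed))    # 5 4

    # At the user-perceived level, both are the same four "letters",
    # and the question "is this string normalised?" does not arise.
    print(len(regex.findall(r"\X", decomposed)),
          len(regex.findall(r"\X", precomposed)))   # 4 4

The first comparison operates at the level where the conformance
requirements of the Unicode Standard apply; the second operates at the
level of text content as a user perceives it.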
> But it seems that you are viewing the whole thing from a different level
> from me. I am looking on as a user or an application programmer. You are
> looking at Unicode internally, as a systems programmer. At that lower
> level, yes, of course normalisation forms have to be distinguished,
> because that is the level at which normalisation is carried out.

It isn't application programmer versus systems programmer. It is
digital representation of text in terms of encoded characters versus
end user interaction with rendered (printed, displayed) text.

> >... The whole point of normalization is
> >to make life for implementations worth living in such a
>
> Well, there is an interesting philosophical question here. With a normal
> literary text, the interpretation of it intended by the author is
> generally considered to be definitive. Humpty Dumpty was right when
> talking about what he had written. But that is not true of laws, and I
> suppose that it is similarly not true of standards.

Standards are not laws. (Nor are they literary texts.) They are
technical specifications which aim at enabling interoperable
implementations of whatever they are a standard for. (At least the
kinds of IT standards we are talking about here.)

Standards are not adjudicated by case law. They are not interpreted by
judges. If something is unclear in a standard, that is generally simply
reported back as a defect to the standardization committee, which
attempts to reach a consensus regarding what the actual intent was, and
then instructs the editor(s) to rewrite things so that the intent
(which often turns out to be what everybody is implementing anyway) is
made clearer. Or in some cases (see IEEE standards for examples), the
standards development organization may issue a formal "clarification"
spelling out the interpretation of a point that was unclear.

> There is assumed to
> be some objectivity to the language in which they are written. The
> implication is that your assertion that what you have written is
> conformant cannot be trusted a priori but must be tested against the
> text of the standard as written and agreed. In principle any dispute
> might have to be settled by a judge, and on the basis only of what is
> written, not of what you claim was intended. While I certainly don't
> intend to take this to court, I think I would have a reasonable case if
> I did!

I don't think it is reasonable. If anything, it is approaching
harebrained here (sorry for the ad hominem), because it doesn't reflect
the reality of IT standards development. What is often clearest to the
standards development committee is what the intended behavior is to be.
Writing that into the formal text of the standard, on the other hand,
may stress the rhetorical capabilities of the authors, and you can end
up with text that doesn't necessarily do the intent justice. Hence the
need, for example, to keep rewriting the conformance clauses of the
Unicode Standard until the character model finally started to gel and
make sense to people implementing the standard.

Trying to go legalistic, and trying to give objective primacy to the
text of the standard, especially when you interpret the text
differently than the people on the originating committee who *wrote*
the text, and in the face of counteropinions from engaged members of
the responsible committee, is not, in my opinion, doing anybody any
favors here.

> Of course it is possible for those conformance clauses to be rewritten
> (they aren't fixed by the stability policy, are they?).

Nope.

> That is probably
> what is necessary.

In general, yes. If people are misinterpreting some key part of the
conformance requirements of the standard, then both the UTC and the
editors are interested in ensuring that the wording of the text is not
encouraging such (mis)interpretations.
> Such a rewrite would require a change to the sentence
> "no process can assume that another process will make a distinction
> between two different, but canonical-equivalent character sequences"

Could be, but notice, as above, that this is already in an explanatory
bullet, and is not the normative part of C9. Certainly if it is causing
misinterpretation, the editors can address that, but I'm hearing other
people on the list who are not having trouble coming to the correct
conclusions in this particular instance.

> and
> a proper definition of "interpretation".

Won't happen. See above.

> Well, I had stated such things more tentatively to start with, asking
> for contrary views and interpretations, but received none until now
> except for Mark's very generalised implication that I had said something
> wrong (and, incorrectly, that I hadn't read the relevant part of the
> standard). Please, those of you who do know what is correct, keep us on
> the right path. Otherwise the confusion will spread.

I'll try. :-)

--Ken