Peter, in XML you really don't want to use attributes for any general text; there are too many restrictions on the content. For example, we never put translatable text into them. Attributes should really be treated more like sequences of symbols, with a constrained syntax.
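
One of those restrictions, the attribute-value normalization discussed in the quoted message below, can be seen directly. This is only a minimal sketch, assuming Python's standard xml.etree.ElementTree (an expat-based, non-validating parser); the element name "e" and attribute name "a" are made up for illustration. It puts U+0301 COMBINING ACUTE ACCENT right after a line break, once as a literal character and once as a numeric character reference.

```python
import xml.etree.ElementTree as ET

# Literal LF and TAB inside the attribute value; U+0301 follows the LF.
literal = ET.fromstring('<e a="x\n\u0301y\tz"/>').get("a")

# Same LF, but written as a numeric character reference.
escaped = ET.fromstring('<e a="x&#xA;\u0301y"/>').get("a")

# The literal LF and TAB reach the application as spaces, so the
# combining mark now follows a space instead of a line break.
print(ascii(literal))  # 'x \u0301y z'

# The character reference survives as a real line feed.
print(ascii(escaped))  # 'x\n\u0301y'
```

The parser has interpreted the line feed correctly as a line feed and then replaced it, as the XML specification tells it to; the application never sees the original character unless it was written as a reference.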

This is also not in violation of the Unicode conformance clause. A "space plus combining character" is a unit in some sense. That is, it is a combining character sequence (and grapheme cluster). However, there is no clause that says that such units cannot be changed, or that any particular sequence of characters cannot be changed; operations such as case mapping or normalization do just that: they change characters.

There are restrictions on what can be changed *if* a process purports to not modify the text (C10). But an XML parser is certainly capable of interpreting a sequence A B, and deciding that it wants to change A to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek Alpha, *that* would be a violation of C7. But interpreting a space as a space, then deciding to modify it, is perfectly legit.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "John Cowan" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 05:09
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> On 12/08/2003 20:28, John Cowan wrote:
>
> >Peter Kirk scripsit:
> >
> >>>2) In attribute values, LF, CR, and TAB characters are normalized to
> >>>spaces. Not relevant here.
> >>
> >>This would be relevant if it is legal for the character after LF, CR,
> >>and TAB to be a combining mark. Is this legal? In this case what was
> >>previously a defective (but legal) combining sequence would turn into a
> >>non-defective one, but the intended whitespace would be lost.
> >
> >The point is that there is no such thing as an *intended* line break in
> >an attribute value; it will *always* be translated to a space before
> >the application sees it. (More exactly, line-break characters can
> >be inserted into attribute values, but only with the use of a numeric
> >character reference such as "&#xA;".)
>
> Sorry, I'm confused. Are you saying that the input processing will
> translate line breaks into spaces within attribute values, unless
> inserted as &#xA; ? Well, I suppose this is fair enough as it is up to
> the user not to enter garbage.
>
> >>Not just a rendering glitch, I suspect. If the combining character is
> >>combined with the separating space, the space loses many of its
> >>separating functions, and perhaps keeps a confusing subset of them with
> >>all sorts of possibilities of error.
> >
> >The space(s) will be used to separate individual tokens at processing
> >time. No spacing diacritic (either single-character or space+combining)
> >is permitted in a NMTOKEN.
>
> OK if this is clearly illegal, but this might restrict use of some
> languages in NMTOKEN. Would NBSP + combining be allowed?
>
> >>At best tokens beginning with
> >>combining characters will be unusable. At worst they will crash the
> >>implementation (and count on someone trying deliberately to do that!).
> >
> >In effect, the combining character will constitute a defective combining
> >sequence at the beginning of the individual token.
> >
> >Stepping away from the letter of the standard for a moment, there is
> >no real reason to begin a NMTOKEN with a combining character. It is
> >only allowed as a result of the miscegenation of SGML concepts with
> >Unicode ones.
> >
> >In SGML's original design of tokens, they consisted of letters and digits
> >(and a few punctuation marks, which functioned as letters). There were
> >four kinds: a NUMBER could contain only digits, a NAME could not begin
> >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
> >restrictions. ID and IDREF had the same syntax as NAME with additional
> >semantics. Later, the categories "letter" and "digit" were generalized,
> >by redefining the concrete syntax, to be whatever you wanted, and were
> >renamed "name-start" and "name" characters (technically, a name character
> >was a letter *or* a digit).
> >
> >When SGML was simplified to produce XML, only NMTOKEN, the most general
> >type of token, was kept. However, in order to keep the semantics of
> >"letter" and "digit" in the Unicode world, "letter" was extended to be any
> >letter and "digit" to be any digit *or* combining character. That worked
> >well for ID and IDREF, since treating combining characters as part of
> >"digit" prevented them from appearing first, as was only sensible.
> >
> >Unfortunately, NMTOKENs, since there were no restrictions, became able
> >to begin with a combining character, though that made no real sense.
> >To write in a restriction would make it impossible to specify XML's
> >concrete syntax in SGML terms, which did not allow for three different
> >classes of characters within tokens. So we wound up with a basically
> >useless capability that if used will only cause trouble.
>
> There is some potential for real trouble here, if one process outputs an
> NMTOKEN starting with a combining character preceded by a separating
> space, or something else which is changed into a space, and another
> process takes the new space plus combining character as a unit and so
> doesn't recognise the separation. Any hackers and virus programmers
> reading this will soon start flooding the Internet with tokens beginning
> with combining characters in the hope of crashing implementations or
> finding back doors. Of course this wouldn't have been a problem if
> Unicode had never defined space plus combining character as legal and
> meaningful. But this is not my problem!
>
> --
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
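
The two-process scenario in the last quoted paragraph can be sketched roughly as follows. This is an illustration only, not any real parser's code: the helper names split_nmtokens and combining_sequences are hypothetical, the first splits on spaces the way an NMTOKENS-style list is tokenized, and the second is a deliberately naive grouping of each character with the combining marks that follow it (not a full grapheme cluster algorithm).

```python
import unicodedata


def is_mark(ch):
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")


def split_nmtokens(value):
    # Process 1: tokens are whatever lies between spaces.
    return [tok for tok in value.split(" ") if tok]


def combining_sequences(text):
    # Process 2 (naive): attach each mark to the character before it.
    groups = []
    for ch in text:
        if groups and is_mark(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


value = "abc \u0301def"  # second token begins with U+0301

tokens = split_nmtokens(value)
units = combining_sequences(value)

print([ascii(t) for t in tokens])
print([ascii(u) for u in units])
print(is_mark(tokens[1][0]))
```

The first process sees two tokens, the second of which starts with a defective combining sequence; the second process sees "space + U+0301" as one unit and no free-standing space at all, which is exactly the disagreement described above.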