Peter Kirk scripsit:

> >2) In attribute values, LF, CR, and TAB characters are normalized to 
> >spaces.   Not relevant here.
> 
> This would be relevant if it is legal for the character after LF, CR, 
> and TAB to be a combining mark. Is this legal? In this case what was 
> previously a defective (but legal) combining sequence would turn into a 
> non-defective one, but the intended whitespace would be lost.

The point is that there is no such thing as an *intended* line break in
an attribute value; it will *always* be translated to a space before
the application sees it.  (More exactly, line-break characters can
be inserted into attribute values, but only with the use of a numeric
character reference such as "&#xA;".)

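As a rough illustration of this normalization, here is a small Python
sketch using the standard library's xml.etree.ElementTree.  The element
name "e" and the attribute name "a" are made up for the example.  A literal
line break or tab in the attribute value should reach the application as
a space, while one written as a numeric character reference survives.

```python
import xml.etree.ElementTree as ET

def attr_value(xml_text):
    """Parse a one-element document and return its 'a' attribute (hypothetical name)."""
    return ET.fromstring(xml_text).get("a")

# Literal LF and TAB inside the attribute value are normalized to spaces.
literal = attr_value('<e a="one\ntwo\tthree"/>')

# The same characters written as numeric character references are preserved.
referenced = attr_value('<e a="one&#xA;two&#x9;three"/>')

print(repr(literal))
print(repr(referenced))
```
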
> Not just a rendering glitch, I suspect. If the combining character is 
> combined with the separating space, the space loses many of its 
> separating functions, and perhaps keeps a confusing subset of them with 
> all sorts of possibilities of error.

The space(s) will be used to separate individual tokens at processing
time.  No spacing diacritic (either single-character or space+combining)
is permitted in a NMTOKEN.

> At best tokens beginning with
> combining characters will be unusable. At worst they will crash the 
> implementation (and count on someone trying deliberately to do that!). 

In effect, the combining character will constitute a defective combining
sequence at the beginning of the individual token.
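
A minimal sketch of that situation in Python, assuming the attribute value
has already been normalized and is split on single spaces.  The helper
name defective_tokens is hypothetical, and "combining character" is
approximated here by the Unicode general categories Mn, Mc, and Me.

```python
import unicodedata

def defective_tokens(value):
    """Return the tokens of a space-separated value that start with a combining mark."""
    tokens = [t for t in value.split(" ") if t]
    # General categories Mn, Mc, and Me cover the combining marks.
    return [t for t in tokens if unicodedata.category(t[0]).startswith("M")]

# U+0301 COMBINING ACUTE ACCENT placed directly after the separating space.
value = "abc \u0301def ghi"
bad = defective_tokens(value)

# Show the offending tokens as code points, since they render confusingly.
print([[hex(ord(c)) for c in t] for t in bad])
```

The space still separates the tokens; the combining mark simply ends up as
the first character of the second token.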

Stepping away from the letter of the standard for a moment, there is
no real reason to begin a NMTOKEN with a combining character.  It is
only allowed as a result of the miscegenation of SGML concepts with
Unicode ones.

In SGML's original design of tokens, they consisted of letters and digits
(and a few punctuation marks, which functioned as letters).  There were
four kinds: a NUMBER could contain only digits, a NAME could not begin
with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
restrictions.  ID and IDREF had the same syntax as NAME with additional
semantics.  Later, the categories "letter" and "digit" were generalized,
by redefining the concrete syntax, to be whatever you wanted, and were
renamed "name-start" and "name" characters (technically, a name character
was a letter *or* a digit).
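
The four kinds can be sketched as regular expressions.  This is only an
approximation under stated assumptions: ASCII letters and digits, with "."
and "-" as the extra name characters, and the names PATTERNS and kinds are
invented for the example.

```python
import re

# Assumed character class: ASCII letters and digits plus '.' and '-'.
NAME_CHAR = r"[A-Za-z0-9.\-]"

PATTERNS = {
    "NUMBER":  re.compile(r"[0-9]+"),                   # digits only
    "NAME":    re.compile(r"[A-Za-z]" + NAME_CHAR + "*"),  # must start with a letter
    "NUTOKEN": re.compile(r"[0-9]" + NAME_CHAR + "*"),     # must start with a digit
    "NMTOKEN": re.compile(NAME_CHAR + "+"),             # no restriction on the first character
}

def kinds(token):
    """Return the names of all token kinds that the whole token matches."""
    return [name for name, pat in PATTERNS.items() if pat.fullmatch(token)]

for tok in ("42", "abc", "4you"):
    print(tok, kinds(tok))
```

Every token that matches any of the first three patterns also matches
NMTOKEN, which is what makes it the most general kind.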

When SGML was simplified to produce XML, only NMTOKEN, the most general
of the four token types, was kept (along with ID and IDREF, which retained
NAME syntax).  However, in order to keep the semantics of
"letter" and "digit" in the Unicode world, "letter" was extended to be any
letter and "digit" to be any digit *or* combining character.  That worked
well for ID and IDREF, since treating combining characters as part of
"digit" prevented them from appearing first, as was only sensible.

Unfortunately, NMTOKENs, since there were no restrictions, became able
to begin with a combining character, though that made no real sense.
To write in a restriction would make it impossible to specify XML's
concrete syntax in SGML terms, which did not allow for three different
classes of characters within tokens.  So we wound up with a basically
useless capability that if used will only cause trouble.
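
The effect of the two-class design can be sketched in Python.  This is a
simplification, not the actual XML productions: "letter" is approximated
by the Unicode L* categories, the extended "digit" class by Nd plus the
M* categories, and punctuation such as "_" and ":" is ignored.  All
function names are hypothetical.

```python
import unicodedata

def is_letter(ch):
    """Approximate the 'letter' class by the Unicode letter categories."""
    return unicodedata.category(ch).startswith("L")

def is_digit_class(ch):
    """Approximate the extended 'digit' class: decimal digits or combining marks."""
    cat = unicodedata.category(ch)
    return cat == "Nd" or cat.startswith("M")

def is_name(s):
    """ID/IDREF-style token: a letter first, then letters or 'digits'."""
    return bool(s) and is_letter(s[0]) and all(
        is_letter(c) or is_digit_class(c) for c in s[1:])

def is_nmtoken(s):
    """NMTOKEN-style token: letters or 'digits' in any position."""
    return bool(s) and all(is_letter(c) or is_digit_class(c) for c in s)

leading_mark = "\u0301e"   # COMBINING ACUTE ACCENT followed by 'e'
print(is_name(leading_mark), is_nmtoken(leading_mark))
```

With only two classes available, the rule that keeps a combining mark out
of first position in a name is the same rule that keeps a digit out, and
NMTOKEN has no such rule at all.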

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  ccil.org/~cowan
Dievas dave dantis; Dievas duos duonos          --Lithuanian proverb
Deus dedit dentes; deus dabit panem             --Latin version thereof
Deity donated dentition;
  deity'll donate doughnuts                     --English version by Muke Tever
God gave gums; God'll give granary              --Version by Mat McVeagh
