Peter, in XML you really don't want to use attributes for any general
text; there are too many restrictions on the content. For example, we
never put translatable text into them. Attributes should really be
treated more like sequences of symbols, with a constrained syntax.

This is also not in violation of the Unicode conformance clause. A
"space plus combining character" is a unit in some sense: it is a
combining character sequence (and a grapheme cluster). However, there
is no clause that says that such units cannot be changed, or that any
particular sequence of characters cannot be changed; operations such
as case mapping or normalization do just that: they change characters.
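
To make that concrete, here is a minimal Python sketch using the
standard unicodedata module. The helper name describe() and the sample
strings are my own illustration, not anything from the standard; HOLAM
is used only because it is the mark under discussion in this thread.

```python
import unicodedata

def describe(text):
    """Return (code point, name, canonical combining class) per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
            for ch in text]

# SPACE followed by HEBREW POINT HOLAM: a combining character sequence
# whose base character happens to be a space.
space_holam = " \u05B9"
info = describe(space_holam)

# Normalization and case mapping both change characters, and both are
# conformant operations.
decomposed = "e\u0301"                               # e + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # one code point, U+00E9
uppercased = decomposed.upper()                      # base letter changes, mark stays

for row in info:
    print(row)
print(len(decomposed), len(composed), [hex(ord(c)) for c in uppercased])
```

The second character reports a non-zero combining class, which is what
makes the pair a combining character sequence; nothing in that status
prevents a later process from replacing or removing the space.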

There are restrictions on what can be changed *if* a process purports
to not modify the text (C10). But an XML parser is certainly capable
of interpreting a sequence A B, and deciding that it wants to change A
to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek
Alpha, *that* would be a violation of C7. But interpreting a space as
a space, then deciding to modify it, is perfectly legit.
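
As an illustration of a parser doing exactly this, the following sketch
uses Python's xml.etree.ElementTree (which sits on top of expat) to
show the attribute-value normalization John describes below. The
element name "e", the attribute name "v", and the helper attr_value()
are hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET

def attr_value(xml_text, name="v"):
    """Parse a one-element document and return one attribute's value."""
    return ET.fromstring(xml_text).get(name)

# A literal line feed or tab inside an attribute value is normalized
# to a space before the application ever sees it.
literal_lf = attr_value('<e v="a\n\u05B9b"/>')
literal_tab = attr_value('<e v="a\tb"/>')

# A line feed written as a numeric character reference survives.
reference_lf = attr_value('<e v="a&#xA;\u05B9b"/>')

print(repr(literal_lf))    # LF replaced by SPACE; HOLAM now follows a space
print(repr(literal_tab))
print(repr(reference_lf))  # LF preserved
```

In the first case the parser has correctly interpreted the line feed as
a line feed and then replaced it, so the HOLAM ends up after a space
instead of after a line break. That is a modification, not a
misinterpretation.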

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message ----- 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "John Cowan" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 05:09
Subject: Re: Questions on ZWNBS - for line initial holam plus alef


> On 12/08/2003 20:28, John Cowan wrote:
>
> >Peter Kirk scripsit:
> >
> >
> >
> >>>2) In attribute values, LF, CR, and TAB characters are normalized to
> >>>spaces.   Not relevant here.
> >>>
> >>>
> >>This would be relevant if it is legal for the character after LF, CR,
> >>and TAB to be a combining mark. Is this legal? In this case what was
> >>previously a defective (but legal) combining sequence would turn into a
> >>non-defective one, but the intended whitespace would be lost.
> >>
> >>
> >
> >The point is that there is no such thing as an *intended* line break in
> >an attribute value; it will *always* be translated to a space before
> >the application sees it.  (More exactly, line-break characters can
> >be inserted into attribute values, but only with the use of a numeric
> >character reference such as "&#xA;".)
> >
> >
> Sorry, I'm confused. Are you saying that the input processing will
> translate line breaks into spaces within attribute values, unless
> inserted as &#xA; ? Well, I suppose this is fair enough as it is up to
> the user not to enter garbage.
>
> >
> >
> >>Not just a rendering glitch, I suspect. If the combining character is
> >>combined with the separating space, the space loses many of its
> >>separating functions, and perhaps keeps a confusing subset of them with
> >>all sorts of possibilities of error.
> >>
> >>
> >
> >The space(s) will be used to separate individual tokens at processing
> >time.  No spacing diacritic (either single-character or space+combining)
> >is permitted in a NMTOKEN.
> >
> >
> OK if this is clearly illegal, but this might restrict use of some
> languages in NMTOKEN. Would NBSP + combining be allowed?
>
> >
> >
> >>At best tokens beginning with
> >>combining characters will be unusable. At worst they will crash the
> >>implementation (and count on someone trying deliberately to do that!).
> >>
> >>
> >
> >In effect, the combining character will constitute a defective combining
> >sequence at the beginning of the individual token.
> >
> >Stepping away from the letter of the standard for a moment, there is
> >no real reason to begin a NMTOKEN with a combining character.  It is
> >only allowed as a result of the miscegenation of SGML concepts with
> >Unicode ones.
> >
> >In SGML's original design of tokens, they consisted of letters and digits
> >(and a few punctuation marks, which functioned as letters).  There were
> >four kinds: a NUMBER could contain only digits, a NAME could not begin
> >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
> >restrictions.  ID and IDREF had the same syntax as NAME with additional
> >semantics.  Later, the categories "letter" and "digit" were generalized,
> >by redefining the concrete syntax, to be whatever you wanted, and were
> >renamed "name-start" and "name" characters (technically, a name character
> >was a letter *or* a digit).
> >
> >When SGML was simplified to produce XML, only NMTOKEN, the most general
> >type of token, was kept.  However, in order to keep the semantics of
> >"letter" and "digit" in the Unicode world, "letter" was extended to be any
> >letter and "digit" to be any digit *or* combining character.  That worked
> >well for ID and IDREF, since treating combining characters as part of
> >"digit" prevented them from appearing first, as was only sensible.
> >
> >Unfortunately, NMTOKENs, since there were no restrictions, became able
> >to begin with a combining character, though that made no real sense.
> >To write in a restriction would make it impossible to specify XML's
> >concrete syntax in SGML terms, which did not allow for three different
> >classes of characters within tokens.  So we wound up with a basically
> >useless capability that if used will only cause trouble.
> >
> >
> >
> There is some potential for real trouble here, if one process outputs an
> NMTOKEN starting with a combining character preceded by a separating
> space, or something else which is changed into a space, and another
> process takes the new space plus combining character as a unit and so
> doesn't recognise the separation. Any hackers and virus programmers
> reading this will soon start flooding the Internet with tokens beginning
> with combining characters in the hope of crashing implementations or
> finding back doors. Of course this wouldn't have been a problem if
> Unicode had never defined space plus combining character as legal and
> meaningful. But this is not my problem!
>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>
>

