Re: Questions on ZWNBS - for line initial holam plus alef

Kent Karlsson said:

> I see no particular *technical* problem with using WJ, though.  In
> contrast to the suggestion of using CGJ (re. another problem) anywhere
> else but at the end of a combining sequence. CGJ has combining class 0,
> despite being invisible and not ("visually") interfering with any other
> combining mark. Using CGJ at a non-final position in a combining
> sequence puts in doubt the entire idea with combining classes and
> normal forms.

Why? There are any number of combining characters with combining
class 0, including the vast majority of Indic dependent vowels,
for instance.

A combining character sequence is a base character followed
by any number of combining characters. There is no constraint
in that definition that the combining characters have to
have non-zero combining class.

Canonical reordering is scoped to stop at combining class = 0.
It doesn't say that it applies to combining character sequences
per se. It applies to *decomposed* character sequences
(meaning, effectively, any sequence which has had the recursive 
application of the decomposition mappings done).

Take a Myanmar example: /kau/:

character sequence:   <1000, 1031, 102C, 1039, 200C>
combining?:            no    yes   yes   yes   no
combining classes:     0     0     0     9     0
comb char sequence:    |--------------------|
canon reorder scope:   |--|  |--|  |--------|  |--|

The combining character sequence here is: <1000, 1031, 102C, 1039>
The *syllable* consists of that plus the trailing ZWNJ.
But the relevant sequences for application of the
canonical reordering algorithm are each sequence starting
with combining class zero and continuing through any
sequence with combining class not zero.

I don't see how introduction of CGJ into such sequences calls
any of the definitions or algorithms into question.
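
As a rough illustration of the scoping described above (a sketch added for clarity, not part of the original message), the following Python uses the standard unicodedata module to split a string into the runs over which canonical reordering operates. The helper name reorder_scopes is made up for this example; the combining classes come from whatever Unicode version the Python build ships with.

```python
import unicodedata

def reorder_scopes(text):
    """Split text into canonical-reordering scopes: each scope starts at a
    character with combining class 0 and runs through any following
    characters whose combining class is not 0."""
    scopes = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not scopes:
            scopes.append(ch)      # class 0 (or start of string) opens a new scope
        else:
            scopes[-1] += ch       # non-zero class extends the current scope
    return scopes

# The Myanmar sequence from the message, ending in ZWNJ (U+200C).
seq = "\u1000\u1031\u102C\u1039\u200C"
for scope in reorder_scopes(seq):
    print(" ".join(f"U+{ord(c):04X}(ccc={unicodedata.combining(c)})" for c in scope))
```

Only the third scope, <102C, 1039>, contains a character with non-zero combining class, so it is the only place where reordering could ever apply in this string.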

--Ken

Re: Questions on ZWNBS - for line initial holam plus alef

Kenneth Whistler scripsit:

> D17a Defective combining character sequence: A combining character
>      sequence that does not start with a base character.
>
>  * Defective combining character sequences occur when a sequence
>    of combining characters appears at the start of a string or
>    follows a control or format character. Such sequences are
>    defective from the point of view of handling of combining
>    marks, but are not ill-formed.
>                       ^^^^^^^^^^

What, if anything, does the term "ill-formed" mean when attached to
a sequence of characters?  I understood that every sequence of
characters whatsoever is permitted.

-- 
"But the next day there came no dawn,    John Cowan
and the Grey Company passed on into the  [EMAIL PROTECTED]
darkness of the Storm of Mordor and were http://www.ccil.org/~cowan
lost to mortal sight; but the Dead       http://reutershealth.com
followed them.  --"The Passing of the Grey Company"

Re: Questions on ZWNBS - for line initial holam plus alef

On Wednesday, August 06, 2003 12:38 PM, Kent Karlsson <[EMAIL PROTECTED]> wrote:
> Since I think  should be canonically
> equivalent to , but cannot be made
> so (now), the only ways out seem to be to either formally deprecate
> CGJ, or at least confine it to very specific uses. Other occurrences
> would not be ill-formed or illegal, but would then be non-conforming.

There's a way to specify that  is
well-formed, but not :
a CGJ can be authorized in a combining sequence only if it
precedes a base character, or precedes a combining character
whose combining class is strictly lower than the combining class
of the previous character.

So, with this definition, with the combining classes indicated:

- 
  is well-formed because 220 < 230. It is distinct from:
  , whose canonical
  ordering is

- 
  is ill-formed because 230 > 220. The CGJ is superfluous
  and should be removed to create:

- 
  is ill-formed because 220 = 220. The CGJ is superfluous
  and should be removed to create:

  which is well-formed and in canonical order.

- 
  is ill-formed because 220 = 220. The CGJ is superfluous 
  and should be removed to create:

  which is well-formed and in canonical order.

This "well-formed" rule would clearly give exact semantics
to CGJ: used in the middle of a combining sequence, it is the
only way to bypass the canonical reordering of combining
characters.
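
To make the effect concrete, here is a small Python sketch (added for illustration, with assumptions labelled). It shows that normalization reorders two marks of different combining classes unless a CGJ sits between them, and it includes a hypothetical helper, cgj_allowed, that encodes the rule proposed above. The helper approximates "base character" as "anything whose general category is not a mark", which is a simplification.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

def cgj_allowed(prev, nxt):
    """Hypothetical check for the proposed rule: a CGJ between prev and nxt
    is acceptable only if nxt is a base character (approximated here as a
    non-mark), or nxt has a strictly lower combining class than prev."""
    if not unicodedata.category(nxt).startswith("M"):
        return True
    return unicodedata.combining(nxt) < unicodedata.combining(prev)

plain = "a" + ACUTE + DOT_BELOW           # 230 then 220: not in canonical order
joined = "a" + ACUTE + CGJ + DOT_BELOW    # the same marks, separated by CGJ

# Without CGJ the marks are swapped by NFD; with CGJ the order is kept.
print(unicodedata.normalize("NFD", plain) == "a" + DOT_BELOW + ACUTE)
print(unicodedata.normalize("NFD", joined) == joined)

# Under the proposed rule, only the 230-then-220 case justifies a CGJ.
print(cgj_allowed(ACUTE, DOT_BELOW), cgj_allowed(DOT_BELOW, ACUTE))
```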

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk followed up:

> On 07/08/2003 07:27, Philippe Verdy wrote:
> 
> >On Thursday, August 07, 2003 2:40 AM, Doug Ewell <[EMAIL PROTECTED]> wrote:
> >
> >>Kenneth Whistler  wrote:
> >>
> >>>But I challenge you to find anything in the standard that
> >>>*prohibits* such sequences from occurring.
> >>>  
> >>>
> >>I've learned that this question of "illegal" or "invalid" character
> >>sequences is one of the main distinguishing factors between those who
> >>truly understand Unicode and those who are still on the Road to
> >>Enlightenment.
> >>
> >>...
> >>
> >If the term "valid" cannot be changed, then I suggest defining
> >"conforming" for encoded text independently of its validity (a
> >"conforming text" would still need to use a "valid encoding").
> >
> As  a very quick thought, maybe what we need is not restrictions to the 
> Unicode standard but a set of rules for each language or group of 
> languages, defining exactly how Unicode characters should be used to 
> write the words etc of that language. Such definitions might be 
> independent of the actual Unicode standard.

I emphatically agree with Peter on this.

The impulse to get the Unicode Standard to head down the road
to becoming the "spelling standard" for all languages of the
world has to be constrained, simply because there is not the
expertise or the bandwidth in the UTC to accomplish this and
because it isn't the business of the UTC in the first place.

This is the kind of task which *must* be distributed to the
relevant stakeholders around the world, wherever they may 
be and however their relevant jurisdictions are defined and
constituted. 

The establishment of orthographic rules for a particular language in
the context of the Unicode Standard means transferring the notion
of what the printed conventions for that language are -- whatever
they may be -- into a determination of exactly which Unicode
characters are to be used to represent those conventions,
including any constraints on cooccurrence with particular
format control characters, and so on.

The scope of the task of defining rendering rules in the
Unicode Standard is generic to script behavior -- establishing
the general rules of the road, as it were, for how the
scripts behave in the encoding, so that people and implementations
have a determinate sense of what order characters should be
in, what it means for combining characters to "combine" with
base characters, how format control characters may impact
script rendering generically, and so on. But beyond that, one
is getting into the realm of orthographic rules for particular
languages or jurisdictions and the realm of typographic
conventions for particular styles and regions. Making those
determinations belongs to the stakeholders themselves: ministries,
academies, associations, type designers, whoever.

It is precisely because the developers of the Unicode Standard
cannot foresee all possible orthographic conventions and
uses to which the standard may be put in representing text
that it is deliberately permissive: essentially any sequence
of characters is "legal", and it is up to the users of
the standard to determine, for them, what is a *sensible*
sequence of characters for their multitudinous purposes.

--Ken

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Michael Everson

At 14:22 -0700 2003-08-08, Kenneth Whistler wrote:

Philippe, you are tilting at windmills, here. There is no chance 
that the UTC is going to consider such a character, in my 
assessment, let alone give it the properties you suggest.
Nor WG2 either.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Questions on ZWNBS - for line initial holam plus alef

On Friday, August 08, 2003 9:54 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> On 08/08/2003 08:54, Philippe Verdy wrote:
> 
> But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you
> are suggesting other uses in which it really has zero width. Well, it
> might have in a case like line initial holam which shifts on to a
> following silent alef, but that is a rather special case.

I picked "SYMBOL" simply to match the required property, the one shared by the other 
spacing variants of diacritics. "ZERO WIDTH" is probably confusing, but it just marks 
the fact that the character has no associated glyph and a null *minimum* width (which 
expands to that of the largest diacritic(s) with which it is combined).

Its main role would be to fill the gap for missing spacing versions of existing 
diacritics.

What about the name "INVISIBLE CARRIER SYMBOL"? (Note that I avoid any occurrence of 
the term "COMBINING" in the name, because there would be no requirement for this 
character to be followed by any diacritic(s); the character would itself be handled 
as a symbol, in a way similar to the existing spacing diacritics, which are already of 
category Sk and are conceptually a combination of the INVISIBLE CARRIER SYMBOL and 
diacritics, defined for compatibility purposes as an approximation of the sequence 
SPACE+diacritic.)

It is worth noting that for now it is quite tricky to get an isolated diacritic 
without getting deceptive results (in some cases, the only way to do it is by using 
what Unicode describes as "defective" combining sequences, which are not illegal by 
themselves but whose rendering and interpretation are not guaranteed).

Conversely, Unicode offers a standard way to force the appearance of the dotted 
circle for an isolated diacritic, a function that may not always be desirable: using a 
dotted circle symbol as the base character.

As someone on this list corrected me, SPACE+combining diacritic is admitted in the 
standard, but only as a compatibility equivalent of the spacing diacritics. The 
isolated spacing diacritic is really a symbol (gc=Sk), unlike the base SPACE character 
used in the compatibility decomposition (which has gc=Zs). This means that 
SPACE+combining diacritic does not have the same textual semantics as the already 
encoded spacing diacritics (all of them seem to have gc=Sk and are not considered 
Letters with gc=Lo, which is why I thought the name "SYMBOL" was accurate).
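
The property values mentioned above can be inspected directly. This is a minimal sketch using Python's standard unicodedata module; the five code points are just a sample of spacing diacritics chosen for the example, and the values reported depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# A sample of spacing diacritics (chosen for illustration only).
SPACING = (0x00A8, 0x00AF, 0x00B4, 0x00B8, 0x02D8)

for cp in SPACING:
    ch = chr(cp)
    # Print name, general category and decomposition mapping for each one.
    print(f"U+{cp:04X}", unicodedata.name(ch),
          unicodedata.category(ch), unicodedata.decomposition(ch))

# SPACE itself is a space separator, not a symbol.
print("U+0020", unicodedata.category(" "))

# Compatibility normalization turns the spacing accent into SPACE + mark.
print(unicodedata.normalize("NFKD", "\u00B4") == " \u0301")
```

Each sampled character reports category Sk and a `<compat>` decomposition beginning with 0020, while SPACE reports Zs, which is the mismatch in semantics described above.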

I also tried to justify a possible code point assignment at U+20CF, where it would 
group more logically, given that the U+02XX block is already full and U+20XX is used 
for both symbols (including currencies) and a set of additional combining diacritics. 
Of course U+20CF is just a suggestion, not something approved or documented.

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Re: Questions on ZWNBS - for line initial holam plus alef

On Sunday, August 10, 2003 11:53 AM, Kent Karlsson <[EMAIL PROTECTED]> wrote:

> <>
> 
> Spam from Philippe Verdy not tolerated: any unsolicited message will be
> reported to his Internet service provider.

There was no spam in the message you deleted. This was a single post to the list: no 
cross-posting, no advertising, no product sold, no money claimed, no required action, 
no identity forged, and no deceptive subject line. The message was on topic...

Reread the definition of spam: "bulk + unsolicited". Maybe you don't like my 
message, but reporting it to my ISP will not be successful for you, and in fact you 
risk more by doing so because my ISP could complain to yours.

If you don't like my message, which was on topic, don't reply to it, delete 
it, ignore it, but don't make such false claims...

Thanks.

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 12:26, Kenneth Whistler wrote:

Peter Kirk wrote:

 

I think this may be a "Peter mistake". I meant to refer to spacing 
diacritics. Sorry.

It is certainly highly inappropriate for spacing diacritics to 
be considered word boundaries.
   

Why? It is entirely dependent on the orthography and conventions
involved. ...

Well, agreed, there may be orthographic conventions in which a spacing 
diacritic is considered a word boundary or a break opportunity e.g. if 
used like a hyphen. But there are other mechanisms for forcing a word 
boundary where otherwise there would not be one. Are there any to suppress a 
word boundary? Perhaps I need to encode  to 
avoid the word boundary implication? Would this work?

... There is probably as much (or more) bad ASCII usage
of spacing diacritics like `this', where a grave accent character
is being misapplied to make a directional quotation mark, as
there is actual, linguistically appropriate use of spacing
diacritics.
 

But this is an abuse of the spacing diacritic as punctuation. Proper, 
linguistically appropriate use of spacing diacritics should not be 
broken in order to support abuse. Or, if the standard wants to support 
such abuse, we can reserve  for the abuse and define 
a  new character XXX such that  has the properties for 
the linguistically appropriate use.

Also, everyone should consider carefully the status of UAX #29,
Text Boundaries.

2 Conformance
This is informative material. There are many different ways to
divide text elements corresponding to grapheme clusters, words 
and sentences, and the Unicode Standard and this document do not
restrict the ways in which implementations can do this.

This specification is a default mechanism;
more sophisticated engines can and should tailor it for particular
locales or environments. ...

The whole UAX is informative. ...

Then let it be correctly informative and not full of misinformation. And 
let its default mechanism and recommendations be appropriate for the 
majority of uses, including such cases as lists of diacritics, which may 
occur in any orthography.

Ken, it seems to me all the more clearly from looking at the latest 
batch of postings on this list that the  mechanism 
defined by Unicode is fundamentally flawed. It works, but it creates a 
serious and needless complication for all kinds of other processes, 
including rendering and higher level processes. These processes cannot 
simply take a space as a space and process it as such. Every time they 
come across a space (which is very often!) they have to test whether it 
is followed by a combining character, and if it is they have to treat 
that space specially. This has created a serious problem for 
implementers, which is why they have produced non-conforming 
implementations - and we are not talking about small companies which 
have rushed into the market recently, we are talking about Microsoft, 
among others, which has been sponsoring Unicode from the start, I 
understand. Surely the UTC should not create difficulties for 
implementers and then just shout at them for getting things wrong. The 
UTC should try to produce a standard which is workable without 
unnecessary complications.

I agree that it works better to use NBSP here. There are fewer such 
problems, but they have not gone away entirely. And  NBSP is more likely 
to be treated by implementers (in the absence of other guidelines from 
Unicode) as fixed width, not trimmed to the width needed for the diacritic.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 08:39, Doug Ewell wrote:

Peter Kirk  wrote:

 

Thank you, Ken. Well, you make it sound as if the problems are
minimal, and that version I can just about accept. But if Philippe is
correct about what he says about UAX#29 and UAX#14, there are some
more serious problems. It is certainly highly inappropriate for
non-spacing diacritics to be considered word boundaries.
   

Non-spacing diacritics had better not be word boundaries, otherwise a
string like Québec (spelled with U+0301, as here) would be considered
two words.  I don't have time right now to look up the relevant
properties and UAX's, but I sincerely hope this is just another
"Philippe mistake" and not a general misinterpretation that anyone might
make.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/


 

I think this may be a "Peter mistake". I meant to refer to spacing 
diacritics. Sorry.

It is certainly highly inappropriate for spacing diacritics to be considered word boundaries.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Questions on ZWNBS - for line initial holam plus alef

> For me the term "difficult" is inappropriate. In fact it is invalid for
> interoperability (even though it is valid, not forbidden, for
> ISO10646/Unicode, as an string fragment for intermediate processing),
> and such sequence should not occur in actual documents, out of any
> external processing context which defines its behavior.

So the fact that you can't stick it into XML won't cause you many tears
then.
Good.

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Jon Hanna" <[EMAIL PROTECTED]>

> > If this is
> > different, then it is not XML but a derived language (for example HTML or
> > SGML which are using more "relaxed" syntaxes).
>
> XML is derived from SGML, not the other way around. Still doesn't matter.

I did not say that, although the sentence may suggest so. Of course
XML was born on the ground of SGML and its HTML application, but it
now contains enough differences that it can no longer be considered an
application of SGML, as it is both a subset and a superset of SGML (XML
allows things forbidden in SGML, and forbids things that are completely
valid in SGML).

Additionally, the DTD syntax profile used in XML is very limited compared
to SGML, and even this DTD syntax is not enough to represent in SGML such
XML features as namespaces (in XML, namespace prefixes can be freely
substituted without requiring a new DTD, and are resolved as URIs
instead of being part of the element or attribute names). Naming
conventions in XML are based on two orthogonal dimensions, unlike in
HTML and SGML, which just use a single namespace.

Finally, DTDs are being deprecated in XML, because they cannot correctly
represent the semantics of allowed attributes or even the allowed
content models for schemas (so an XML document may validate against a DTD
when it would not if the schema were defined more precisely with an XSD
schema): nearly all DTDs I have seen for XML, HTML and SGML contain
important comments that cannot be represented in a parsable way.

OK, I used the term DOM instead of InfoSet, but what I said was "DOM-like"
data representation (meaning InfoSet if this is what is used to
represent the document). I won't discuss the case of element names or
attribute names, which are by nature constrained by XML datatypes and do
not represent arbitrary Unicode text. But CDATA sections, attribute
values (in non-validating parsers), and anonymous text elements are where
the handling of initial/final whitespace, as well as of sequences of
whitespace, causes problems. This is clearly NOT markup, but plain text
data, which may or may not be constrained by datatype facets, without
even the need to specify a special xml:whitespace attribute in the markup
of the document itself.

As validation of documents against their definitions is optional in XML,
normalization of whitespace sequences occurs only if the schema is
known. In the case of standardized schemas, like XHTML, it becomes
mandatory, and there's no way to bypass this rule, as any client could
assume and load the corresponding schema and preprocess the DOM-like
data contained in the parsed document to create data which will not
expose unnormalized whitespace. So the behavior of spaces must be
assumed by authors, who cannot predict whether the XML parser will
validate the parsed document or not. It is clearly not a rendering issue
in fonts or XSLT processors or stylesheets. I see absolutely no place
where an XML author can create a valid XML schema instance that will work
with parsers if the author wants to use SPACE+diacritics sequences in
the document. The only way to bypass this behavior safely is to use
unparsed entities to represent the leading SPACE, or the whole combining
sequence.

It is really a shame that there is no "XML-safe" base character in
Unicode to represent leading spacing diacritics in actual documents
(whether in HTML, XML, SGML, or other rich-text formats, including TeX,
RTF, or proprietary text formats like MS-Doc, or PDF, which already can
and do use Unicode as their now preferred encoding). Ignoring the huge
number of applications that assign this role to spaces is then a
critical caveat, as such rules cannot be changed easily.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Mark Davis

There are a number of incorrect statements. My comments below.

- Original Message - 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, August 11, 2003 16:28
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> I was aware that there should not be a line break or word break between
> the space and the NSM, although I suspect that many implementers will
> not be aware of this, or at least will not test for it properly and so
> treat any space as a word break and a line break opportunity.

Hard to be clearer than what is written in the LineBreak UAX. (see
below).

> As I just wrote, this requirement to test all spaces for following NSMs
> is a significant inefficiency built into the standard.

This is incorrect. Characters (not just spaces) only need to be
checked for following NSMs in *those processes where that makes a
difference*. And in most of those processes, like line-break, some
lookahead is required anyway. To see, for example, whether there is a
linebreak after a character X, in almost all cases I have to look at
the character after X, and in many cases I have to look at more than
one character. Notice, for example, that in the sequence "a" I
have to look ahead to see if there is a ":", so that French
punctuation works correctly.

In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.

>
> But there is still a problem if there is considered by default to be a
> word break and a line break opportunity AFTER the NSM. I would suggest,
> as a candidate for a concrete proposal, that the default behaviour be
> adjusted so that there is no word break or line break opportunity here
> either.

It helps if "concrete proposals" were actually, well, concrete.

I see no problem with Line Break.
(http://www.unicode.org/reports/tr14/#Algorithm):

Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:

LB 7a  In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.

Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
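
As a rough sketch of the LB 7a rule as quoted above (added for illustration; it is not the UAX #14 algorithm itself), the Python below folds a space followed by combining marks into a single unit tagged ID. Two assumptions are made for brevity: general category M stands in for the line-break class CM, and every class other than SP and ID is lumped under the placeholder label "XX". The function name fold_sp_cm is invented for this example.

```python
import unicodedata

def is_mark(ch):
    # Assumption: general category M* approximates line-break class CM.
    return unicodedata.category(ch).startswith("M")

def fold_sp_cm(text):
    """Return (unit, class) pairs in which SP CM* is treated as one ID unit.
    Trailing marks are always attached to the character they follow."""
    units = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1                      # absorb the marks following text[i]
        if text[i] == " ":
            cls = "ID" if j > i + 1 else "SP"
        else:
            cls = "XX"                  # placeholder for all other classes
        units.append((text[i:j], cls))
        i = j
    return units

print(fold_sp_cm("a \u0301b"))   # space + combining acute becomes one ID unit
print(fold_sp_cm("a b"))         # a plain space stays SP
```

A single forward scan with one character of lookahead is enough here, which matches the point made earlier in this message about lookahead cost.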

I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)

None of the other rules are relevant.

So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".

The one area I do see that there may be an issue is with one that you
didn't mention,
http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
should not behave as Sp in the rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.

If we wanted to change it, the *concrete* change would be to replace
(4) by:

Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
SGC → FC (4a)
GC → FC (4b)

>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Jon Hanna" <[EMAIL PROTECTED]>

> Lots of different things happen that affect the whitespace of an XML
> document (whether a DOM tree is constructed or not, since it isn't the only
> legal way to process an XML document).

Of course one is not required to build an actual DOM tree; however XML, HTML
and the like are now defined in terms of the DOM, where the text/xml syntax is
just a serialization, which is the only place where whitespace
normalization is defined (such normalization does not occur at the DOM
level, and an XML document may be serialized with a concrete syntax other
than the one assigned to the "text/xml" MIME type, registered and documented
by the W3C).

When processing XML documents, the DOM part is the most important feature
and it is logically separated from the concrete syntax used by text XML
parsers. The W3C defines very strict rules to ensure that the DOM-equivalent
data will be preserved, and whitespace normalization in XML documents
serialized as "text/xml" is mandatory; otherwise it is not a valid "text/xml"
serialization.

Processing a "text/xml" document in a way that would be incompatible with
what a DOM tree builder would create is not conforming. If this is
different, then it is not XML but a derived language (for example HTML or
SGML which are using more "relaxed" syntaxes). In XML, whitespace
normalization can be overridden using very precise rules within the parser
only, but not in the resulting DOM tree, so it is important to understand
each step that goes from the concrete text/xml syntax to the DOM tree or
its equivalents (notably the successive steps required for parsed entities,
named entities, ...). No XML application is required to use the "text/xml"
MIME syntax, and there exist such examples (for example the serialization
and compression formats used by WAP, MMS, Nec's i-Mode, and SOAP).

If an application does not build the DOM tree, it is still required to
perform namespace resolution and to resolve named entities according to the
standard "text/xml" MIME rules formulated by the W3C reference, including
all its facets, needed for interoperability of document properties
independently of the character encoding used in the serialized document, or
its syntactic representation. In my opinion, all XML-based languages should
now be defined in terms of their DOM structure, and the XML application
should be defined by a valid DTD, or better, by a now-standard XSD schema
that can be processed by validating parsers (parsers that absolutely need to
create a DOM-like tree or flow of tokens with strictly defined properties,
value sets and behavior).

Without DOM interoperability, XML would be another imprecise language like
HTML, with very little reusability due to naming conflicts. This is the most
important benefit of XHTML (strictly based on XML) compared to HTML (4.x and
before) and SGML (all versions), notably when a schema is explicitly
specified for the document and is loaded for validation purposes (some
schemas are normative, like XHTML, and cannot be changed by authors).

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 18:46, Mark Davis wrote:

There are a number of incorrect statements. My comments below.
 

Thanks for the clarifications. Sorry about the inaccuracies. On some 
maybe Philippe misled me, on others it is just my inadequate understanding.

...

In practice, looking at a character past a space does not represent a
significant performance issue. One is typically using a mechanism
(like an augmented state machine) that maintains enough state that
that is not an issue.
 

Understood. I hope Microsoft is listening.

...

It helps if "concrete proposals" were actually, well, concrete.
 

Of course! But I need help to get rid of any inaccuracies before the 
concrete sets.

I see no problem with Line Break.
(http://www.unicode.org/reports/tr14/#Algorithm):
Space + NSM is treated as a unit, with behavior that is pretty
consistent with a stand-alone accent like "^". To quote:
LB 7a  In all of the following rules, if a space is the base character
for a combining mark, the space is changed to type ID. In other words,
break before SP CM* in the same cases as one would break before an ID.
   Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want
breaking behavior, you use SP + NSM. The algorithm does that.
 

Thank you. I have looked at this. Well, the ideal for me would be a 
mechanism whereby base + NSM was AL, rather than ID or GL. The problem 
comes, if I understand correctly, with a sequence like SP XX CM* AL, 
where I want a break opportunity after SP but not before AL. If I use 
NBSP for XX, I get no break opportunity at all. If I use SP, I may 
get a break before AL. But I suppose SP SP CM* WJ AL would do what I 
want, perhaps also SP ZWSP NBSP CM* AL as the break opportunity after 
ZWSP takes precedence over the no break before NBSP.

I also see no problem with word-break
(http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the
specific text. To quote:
Treat a grapheme cluster as if it were a single character: the first
character of the cluster.
   GC → FC (3)
...
Otherwise, break everywhere (including around ideographs).
   Any ÷ Any (14)
None of the other rules are relevant.

So what this does is that SPACE + NSM will break before the space and
after the NSM (assuming there is only one). So it will behave like a
symbol, such as "*", or ")", or "^".
 

OK, no real problem then. In some circumstances it might have been 
more appropriate for space + NSM to behave like a letter rather than a 
symbol, but I recognise that tailoring may be required for fine details.

The one area I do see that there may be an issue is with one that you
didn't mention,
http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM
should not behave as Sp in the rules (8), (10), and (11). Even there,
it will produce at most a minor oddity.
If we wanted to change it, the *concrete* change would be to replace
(4) by:
Treat a grapheme cluster as if it were a single character: the first
character of the cluster, except if that first character is a space.
In that case, change to Any.
   SGC → FC (4a)
   GC → FC (4b)
 

Do you mean: "SGC → Any (4a)"?

How should I go about making a concrete proposal for this?

Anyway, many thanks for your help. I think I am beginning to realise 
that this is a small problem which has been blown out of proportion by 
others. I still see the space + NSM choice as a rather poor initial 
design, but one which can be lived with.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk scripsit:

> So far so good, but when I get to an accent with no predefined spacing 
> variant, I have a problem!

No you don't.  If you want to say  is the diacritic used to
represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C
at the beginning of the next line.  If the seagull doesn't line up properly,
you complain to the foundry or the implementor.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?

Re: Questions on ZWNBS - for line initial holam plus alef

On 12/08/2003 04:17, Jon Hanna wrote:

Thanks for the clarification. I probably misunderstood Jon's intention.
But is there a problem if, for example, an application sees the string
<newline, combining mark> and regularises it (wrongly!) to <space,
combining mark>?
   

Yes, I was not saying that it wouldn't be sensible to begin a line of text
with a spacing diacritic (whether precomposed or created using space or
NBSP). I was saying that it wouldn't be sensible to begin a line with a
combining diacritic, since that combining diacritic would be combining with
a newline character which it's difficult to think of any possible sensible
meaning for. Attribute normalisation would change the sequence U+000A,
<combining mark> to U+0020, <combining mark>, which would arguably change
the meaning,
but changing the meaning of a meaningless construct isn't a problem to my
mind.
 

Thanks for the clarification. I think the combining mark would not 
combine with the new line mark but would be a defective combining 
sequence. I might wish to do this simply because, according to UTR #14 
this is the only way to get a combining mark to be treated as AL as I 
might wish. Probably not the best way to do this, but not illegal!

So it seems to me that this attribute normalisation is a problem. It is 
a problem for the higher level protocol, as it thinks it has created a space
but in fact it has created a combining sequence which it must not treat
as a space. A legal sequence at a lower level, even if meaningful,
should not confuse the higher level. (Indeed I don't think the higher
level ought to be confused even by illegal sequences at the lower level, 
it should be transparent as far as possible.) So the higher level 
protocol needs to know not only not to split a space, combining mark 
sequence but also not to create one where one was not present before. 
Perhaps it needs to insert a suitable separator (ZWNJ?) to ensure that 
when the space is created it is not combined with the combining mark. So 
another example of needless complication created by the long-standing 
decision to permit space as a carrier for combining marks.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk scripsit:

> Philippe or anyone else, would it be "XML-safe" to use NBSP rather than 
> SP as the base character for spacing diacritics in XML? Perhaps that's 
> the answer here. I know there are still some issues of detail concerning 
> the line breaking, but apart from that is there any other problem?

NBSP is not usable in attribute values other than those of type CDATA,
but it is usable in character content.  XML does not consider it whitespace
(the only whitespace characters are LF, SPACE, TAB and marginally CR).

IMHO, it is best practice not to use anything in attribute values,
certainly non-CDATA attribute values, that is in any way intended to
handle fully general text: attribute values should be protocol strings.

-- 
John Cowan<[EMAIL PROTECTED]> 
http://www.reutershealth.com  http://www.ccil.org/~cowan
Yakka foob mog.  Grug pubbawup zink wattoom gazork.  Chumble spuzz.
-- Calvin, giving Newton's First Law "in his own words"

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk scripsit:

> Sorry, I'm confused. Are you saying that the input processing will 
> translate line breaks into spaces within attribute values, unless 
> inserted as &#xA; ? Well, I suppose this is fair enough as it is up to
> the user not to enter garbage.

Yes, that is how attribute values work.  The idea is that when you have
a long string in an attribute value, you can introduce a line break for
readability without its having any effect on processing, thus:



The line break gets turned into a space before the application sees it.
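
As a rough illustration of this normalization, here is a minimal Python
sketch using the standard library's xml.etree.ElementTree. The element
and attribute names are made up for the example, and the behaviour shown
is that of one particular parser, not a statement about all XML tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "text" contains a literal line break and a tab,
# while "ref" contains a line feed written as a character reference.
doc = '<note text="first line\nsecond\tline" ref="a&#xA;b"/>'
elem = ET.fromstring(doc)

normalized = elem.get("text")  # literal LF and TAB each become one space
preserved = elem.get("ref")    # LF given as a character reference survives

print(repr(normalized))
print(repr(preserved))
```

The application only ever sees the normalized value of "text"; the line
feed in "ref" reaches it because it was written as a character reference.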

Additionally, if you have a long list of tokens in an attribute value,
thus:



the application does not have to deal with either the line break or the
tab character specially, but sees simply a list of tokens separated by
a single space.
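
The token-list case can be sketched in plain Python as follows. This is
only an approximation of what a parser does for a non-CDATA attribute
(it ignores the earlier line-end normalization step), and the function
name normalize_tokenized is invented for the illustration.

```python
import re

def normalize_tokenized(value: str) -> str:
    """Approximate normalization of a non-CDATA attribute value."""
    # Step 1: each TAB, LF, or CR becomes a space, as for any attribute.
    value = re.sub(r"[\t\n\r]", " ", value)
    # Step 2: collapse runs of spaces and drop leading/trailing spaces.
    return re.sub(r" +", " ", value).strip(" ")

raw = "alpha\tbeta\n   gamma  "
tokens = normalize_tokenized(raw).split(" ")
print(tokens)
```

After this step the application can split on a single space without
caring whether the separator was originally a tab or a line break.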

> OK if this is clearly illegal, but this might restrict use of some 
> languages in NMTOKEN. Would NBSP + combining be allowed?

No, it isn't.  As I say, attribute values aren't meant to handle
arbitrary natural-language human-readable text.

> There is some potential for real trouble here, if one process outputs an 
> NMTOKEN starting with a combining character preceded by a separating 
> space, or something else which is changed into a space, and another 
> process takes the new space plus combining character as a unit and so 
> doesn't recognise the separation. 

If the second processor is XML-compliant, it will treat the space as a
token separator, not as part of the token (as I say, spacing diacritics
aren't allowed in tokens).  If the XML document is printed or displayed
in its raw form (that is, treating it as plain rather than structured
text), you may see something a bit strange, but that will not affect
the processing model.

> reading this will soon start flooding the Internet with tokens beginning 
> with combining characters in the hope of crashing implementations or 
> finding back doors. 

Very, very unlikely.

-- 
Winter:  MIT,                  John Cowan
Keio, INRIA,                   [EMAIL PROTECTED]
Issue lots of Drafts.          http://www.ccil.org/~cowan
So much more to understand!    http://www.reutershealth.com
Might simplicity return?       (A "tanka", or extended haiku)

Re: Questions on ZWNBS - for line initial holam plus alef

- Original Message - 
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Peter Kirk" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 5:31 AM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> Peter Kirk scripsit:
>
> > Philippe or anyone else, would it be "XML-safe" to use NBSP rather than
> > SP as the base character for spacing diacritics in XML? Perhaps that's
> > the answer here. I know there are still some issues of detail concerning
> > the line breaking, but apart from that is there any other problem?

For XML, using NBSP would be safe. However, there is another caveat: it
introduces a non-break property, which may be an issue for rendering
but normally not for text processing. This can be corrected by saying
that NBSP+combining does not have a non-break property, and that a
"don't break here" format control can be used if needed to specify the
breaking behavior.

In that case, this change in properties of the combining sequence
(in fact something that was still not specified until now) would be
harmless (as the behavior was not clearly specified and was implementation
dependent), and we could say that SPACE+diacritics is deprecated
in favor of NBSP+diacritics (which would NOT inherit the non-breaking
behavior but would have its own properties).

> NBSP is not usable in attribute values other than those of type CDATA,
> but it is usable in character content.  XML does not consider it whitespace
> (the only whitespace characters are LF, SPACE, TAB and marginally CR).

And NEL (for compatibility with EBCDIC systems).

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Jony Rosenne

Suggested but not accepted.

I am inherently suspicious when pressure is being exerted to decide complex
and difficult questions in a hurry.

Jony

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk
> Sent: Wednesday, August 13, 2003 8:43 PM
> To: Philippe Verdy
> Cc: [EMAIL PROTECTED]
> Subject: Re: Questions on ZWNBS - for line initial holam plus alef
> 
> 
> On 13/08/2003 11:09, Philippe Verdy wrote:
> 
> >... For this reason, defective
> >combining sequences (combining characters without a leading base
> >character) should be forbidden (invalid for XML).
> >  
> >
> If there is even the remotest possibility of this happening, 
> we need to 
> know quickly! Defective combining sequences are legal Unicode and are 
> now being suggested for use in Hebrew e.g. for holam male. But such a 
> definition would be useless if XML restricts the texts it can 
> represent 
> to a subset of Unicode excluding such sequences.
> 
> 
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
> 
> 
> 
>

Re: Questions on ZWNBS - for line initial holam plus alef

On 12/08/2003 07:05, John Cowan wrote:

Very true.  But what is this whitespace normalization?

1) Throughout the document, line-end characters and sequences are normalized
  to LF.  Not relevant here.
2) In attribute values, LF, CR, and TAB characters are normalized to spaces.
  Not relevant here.
 

This would be relevant if it is legal for the character after LF, CR, 
and TAB to be a combining mark. Is this legal? In this case what was 
previously a defective (but legal) combining sequence would turn into a 
non-defective one, but the intended whitespace would be lost.

3) In attribute values that have a declared type other than CDATA, multiple
  spaces are compressed to a single space, and leading and trailing spaces
  are removed.  After this is done, there can be no spaces in attributes
  of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types.
  In the types IDREFS and ENTITIES, spaces are used to separate
  individual tokens, none of which may begin with a combining character.
  In the remaining type, NMTOKENS, individual tokens may begin
  with a combining character, so it is possible that such a token, if
  not the first in the attribute, will be rendered in a peculiar way,
  with the combining character placed over the separating space.
  But that is a mere rendering glitch and in no way affects anything.
 

Not just a rendering glitch, I suspect. If the combining character is 
combined with the separating space, the space loses many of its 
separating functions, and perhaps keeps a confusing subset of them with 
all sorts of possibilities of error. At best tokens beginning with 
combining characters will be unusable. At worst they will crash the 
implementation (and count on someone trying deliberately to do that!). 
The only safe thing to do is to specify that space followed by a 
combining mark is NEVER considered to be a space and this combination is 
NEVER generated.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

On 12/08/2003 09:00, Philippe Verdy wrote:

It is really a shame that there is no "XML-safe" base character in
Unicode to represent leading spacing diacritics in actual documents
(either in HTML, XML, SGML, or even other rich-text formats,
including TeX, RTF, or proprietary text formats like MS-Doc, or PDF,
which already can and do use Unicode as their now preferred encoding).
Ignoring the huge number of applications that assign this role to
spaces is then a critical caveat, as such rules cannot be changed
easily.


 

Philippe or anyone else, would it be "XML-safe" to use NBSP rather than 
SP as the base character for spacing diacritics in XML? Perhaps that's 
the answer here. I know there are still some issues of detail concerning 
the line breaking, but apart from that is there any other problem?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

On 13/08/2003 14:07, Philippe Verdy wrote:

I did not notice that the discussion about Hebrew holam male was
related.
In fact I don't know anything about the Hebrew alphabet so I could not
understand the semantics discussed, and so did not note that
was a "defective" encoding (in terms of combining sequences).

Well, it wasn't very related - although the subject line here "line
initial holam plus alef" reminds me that it is very near to where we
started this thread.

When I used the term "forbidden", it was only in relation to possible
security problems with XML, but the term was certainly too hasty.
However, given that possible security and parsing issues do exist, the
case of used to encode "holam-male" may be another
argument to propose a neutral/invisible base character for combining
characters. For the case of Hebrew, it then needs to have a "letter"
behavior, but for the case of other isolated diacritics in Latin, Greek,
Cyrillic, and probably also Hiragana and Katakana (voice marks), it would
be better handled as a symbol.
I suggested several semantics for this invisible character(s) in an
earlier message:
- An invisible symbol
- An invisible LTR letter
- An invisible RTL letter
all of them having a *compatibility* decomposition (or NFKD form) as
a SPACE like other existing spacing combining marks, but not being
canonical equivalent of SPACE (to keep separately the legacy semantics,
properties, behavior and known caveats unchanged and
implementation/usage-dependent, as they are now with SPACE+NSM
which could then be discouraged in Unicode and strongly deprecated
in SGML/HTML/XML)

My latest idea is to use RLM as in effect your "invisible RTL letter".
So I would encode word or line initial holam male as .
This is technically a defective combining sequence (is that correct?),
as RLM is a format control character, but the RLM has the double effect
of keeping the holam separate from any spaces which a higher level
protocol might put there and ensuring RTL directionality. And I suppose
the same technique would be legal with any combining character. But of
course it would all be spoiled if XML were to forbid defective combining
sequences, which fortunately is unlikely. Actually I suppose you could
use or for your spacing
diacritics as the RLM or LRM would protect the space from combination
with any previous space etc. Or perhaps . As RLM effectively disappears in searches etc, in effect
you have your compatibility decomposition.

I note that there is no line break opportunity in . But is
there one after the space in ? If so, has a third advantage, that it gives the right line
break opportunity when this sequence is word initial, which it wouldn't
do without the RLM.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Questions on ZWNBS - for line initial holam plus alef

> I do agree: an XML document could require the use at some place of a
> given attribute or element. If this attribute name follows the element
> name after a line break, which gets changed into a space during parsing,
> forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
> cluster acting like a letter would have the effect of creating a new
> element name which may violate the element name identity. Now suppose
> that the attribute name contains a colon: you have created a custom
> namespace name, under which you can add any element you like, even if
> this was forbidden by the content-model of the reference schema.

1. SPACE is treated "blindly" as a SPACE by XML. String + space + combining
+ string would not be treated as a single token, no matter how that space
was introduced. That's what you were complaining about in the first place
(as far as I can make out).
2. While nmtokens can begin with a combining character, names cannot, nor can
they contain spaces.
3. This would in no way change the content-model. So even if the above two
points didn't hold they would only sneak the document past something which
performed validation before parsing(!), and where the content-model was
already pretty loose (so it didn't complain about the unrecognised
attribute).

You've just discovered a way to disguise one document that isn't well-formed
as a different document that isn't well-formed. l33t!

> So this would invalidate existing documents, or create holes allowing
> insertion of arbitrary XML content, if the XML application is not
> validating extremely strictly the element names (the pair namespace+
> name) and exclude completely from processing any unrecognized
> element (including all its content and attributes).

This argument is not on friendly terms with the concept of causality.

> This would be a
> breach in the content model which may have been validated and tested
> for security in another layer of the document encoding process (notably
> when XML documents are created from templates, such as XSL
> processors, or custom C source using simple template substitution).

Testing validity without testing well-formedness is not possible.

> So for me the sequence SPACE+combining should not be acceptable
> as a valid grapheme cluster within element names or attribute names,

As it already isn't.

> and thus would need to be excluded from NMTOKEN. The correct
> way to do it is to consider it NOT A LETTER, but a symbol (Sk),
> exactly like other spacing diacritics, which are already invalid in
> NMTOKEN.

Wait a second. That was my justification for why the fact that
space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a
failure on the part of XML to allow for freedom of choice with the strings
used for NMTOKENs. Now you actually want to introduce this (already
existent) feature.

> There still remains the unresolved question of grapheme clusters
> that could span the starting "<" or ending ">" or "/>" of tags, or
> the leading "&" of an entity reference.

No there isn't. What goes before <, >, / or & isn't a problem, since those
are all non-combining characters and so start a new unit for any sort of
processing that treats more than one code point as a unit. What goes after
< or & has to be a name (not an nmtoken) and as such is already prohibited
from beginning with a combiner. What goes after > is already dealt with by
Charmod, and even if you ignore Charmod, apart from the possibility of
normalisation turning the sequence U+003E, U+0338 into U+226F (a possibility
that is well noted), it still isn't going to hurt.
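
The normalisation case mentioned here can be checked with a small Python
sketch using the standard unicodedata module. It only demonstrates the
canonical composition itself, not how any particular XML processor
reacts to it.

```python
import unicodedata

# U+003E GREATER-THAN SIGN followed by U+0338 COMBINING LONG SOLIDUS OVERLAY
sequence = ">\u0338"
composed = unicodedata.normalize("NFC", sequence)

# NFC composes the pair into the single character U+226F NOT GREATER-THAN.
print(len(sequence), len(composed), f"U+{ord(composed):04X}")
```

So a tag-closing ">" followed by U+0338 would no longer look like a ">"
to anything that normalises the raw text before parsing it.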

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk scripsit:

> >2) In attribute values, LF, CR, and TAB characters are normalized to 
> >spaces.   Not relevant here.
> 
> This would be relevant if it is legal for the character after LF, CR, 
> and TAB to be a combining mark. Is this legal? In this case what was 
> previously a defective (but legal) combining sequence would turn into a 
> non-defective one, but the intended whitespace would be lost.

The point is that there is no such thing as an *intended* line break in
an attribute value; it will *always* be translated to a space before
the application sees it.  (More exactly, line-break characters can
be inserted into attribute values, but only with the use of a numeric
character reference such as "&#xA;".)

> Not just a rendering glitch, I suspect. If the combining character is 
> combined with the separating space, the space loses many of its 
> separating functions, and perhaps keeps a confusing subset of them with 
> all sorts of possibilities of error.

The space(s) will be used to separate individual tokens at processing
time.  No spacing diacritic (either single-character or space+combining)
is permitted in a NMTOKEN.
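
A minimal Python sketch of that processing model, assuming a processor
that simply splits on the code point U+0020. The helper name
starts_with_combining_mark is invented for the example, and the test
for "combining" uses the Unicode general category rather than the XML
productions.

```python
import unicodedata

def starts_with_combining_mark(token: str) -> bool:
    """True if the first code point has a Mark general category (Mn/Mc/Me)."""
    return bool(token) and unicodedata.category(token[0]).startswith("M")

# The second token begins with U+0301 COMBINING ACUTE ACCENT.
value = "abc \u0301def"
tokens = value.split(" ")
flags = [starts_with_combining_mark(t) for t in tokens]

print(len(tokens), flags)
```

The split is done on code points, so the space still separates the two
tokens even though a renderer would draw the accent over it.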

> At best tokens beginning with
> combining characters will be unusable. At worst they will crash the 
> implementation (and count on someone trying deliberately to do that!). 

In effect, the combining character will constitute a defective combining
sequence at the beginning of the individual token.

Stepping away from the letter of the standard for a moment, there is
no real reason to begin a NMTOKEN with a combining character.  It is
only allowed as a result of the miscegenation of SGML concepts with
Unicode ones.

In SGML's original design of tokens, they consisted of letters and digits
(and a few punctuation marks, which functioned as letters).  There were
four kinds: a NUMBER could contain only digits, a NAME could not begin
with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
restrictions.  ID and IDREF had the same syntax as NAME with additional
semantics.  Later, the categories "letter" and "digit" were generalized,
by redefining the concrete syntax, to be whatever you wanted, and were
renamed "name-start" and "name" characters (technically, a name character
was a letter *or* a digit).
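
The four token kinds described above can be sketched with regular
expressions. This is a deliberately simplified model: it assumes ASCII
letters and digits with "." and "-" as the extra name characters, which
is an assumption made for the illustration and not a transcription of
any SGML concrete syntax.

```python
import re

# Assumed character classes: ASCII letters, digits, "." and "-".
NAME_CHARS = r"[A-Za-z0-9.\-]"

PATTERNS = {
    "NUMBER": re.compile(r"[0-9]+"),                     # digits only
    "NAME": re.compile(r"[A-Za-z]" + NAME_CHARS + "*"),  # must start with a letter
    "NUTOKEN": re.compile(r"[0-9]" + NAME_CHARS + "*"),  # must start with a digit
    "NMTOKEN": re.compile(NAME_CHARS + "+"),             # no restriction on the start
}

def token_kinds(token: str) -> list:
    """Return every token kind whose pattern matches the whole token."""
    return [kind for kind, pat in PATTERNS.items() if pat.fullmatch(token)]

for sample in ("42", "x1", "1x"):
    print(sample, token_kinds(sample))
```

Every token matches NMTOKEN, which is the sense in which it is the most
general type.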

When SGML was simplified to produce XML, only NMTOKEN, the most general
type of token, was kept.  However, in order to keep the semantics of
"letter" and "digit" in the Unicode world, "letter" was extended to be any
letter and "digit" to be any digit *or* combining character.  That worked
well for ID and IDREF, since treating combining characters as part of
"digit" prevented them from appearing first, as was only sensible.

Unfortunately, NMTOKENs, since there were no restrictions, became able
to begin with a combining character, though that made no real sense.
To write in a restriction would make it impossible to specify XML's
concrete syntax in SGML terms, which did not allow for three different
classes of characters within tokens.  So we wound up with a basically
useless capability that if used will only cause trouble.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  ccil.org/~cowan
Dievas dave dantis; Dievas duos duonos  --Lithuanian proverb
Deus dedit dentes; deus dabit panem --Latin version thereof
Deity donated dentition;
  deity'll donate doughnuts --English version by Muke Tever
God gave gums; God'll give granary  --Version by Mat McVeagh

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Mark Davis

Some of this seems to be in reference to an earlier contention that
Text Boundaries (inc. Lines) break between the space and the
non-spacing mark. I think this was attributed to Philippe.

[This may not be true: I don't actually read his email, because the
information content per line falls below my email threshold; not to
say that there may not be information there, but I cannot afford to
take the time to find out -- sadly, one of my character flaws.]

All of the text boundaries preserve grapheme cluster boundaries, which
never separate a base character (including space and NBSP) from a
following NSM. In addition, each of the boundary types above grapheme
clusters make some statement about the behavior of the grapheme
cluster. For example, with line boundaries a SPACE + NSM has a special
behavior. With the others, the behavior is the same as the base
character.
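
To make the grapheme cluster point concrete, here is a much simplified
Python sketch. It only attaches non-spacing and enclosing marks to the
preceding character and is not an implementation of the UAX #29 rules;
the function name simple_clusters is invented for the example.

```python
import unicodedata

def simple_clusters(text: str) -> list:
    """Group each character with any following Mn/Me marks (simplified)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # mark stays with its base, even a space
        else:
            clusters.append(ch)  # new base (or a leading, defective mark)
    return clusters

# "a", then SPACE + U+033C COMBINING SEAGULL BELOW, then "b"
sample = "a \u033Cb"
result = simple_clusters(sample)
print(len(result))
```

Here the space and the seagull form one cluster, so no boundary built on
top of clusters can fall between them.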

As Ken points out, in any event these are default boundaries, and can
be tailored. That being said, if the normal behavior of the default
can be improved, and someone has a concrete proposal for doing so,
then it can be considered.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message - 
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, August 11, 2003 12:26
Subject: Re: Questions on ZWNBS - for line initial holam plus alef


> Peter Kirk wrote:
>
> > I think this may be a "Peter mistake". I meant to refer to spacing
> > diacritics. Sorry.
> >
> > It is certainly highly inappropriate for spacing diacritics to
> > be considered word boundaries.
>
> Why? It is entirely dependent on the orthography and conventions
> involved. There is probably as much (or more) bad ASCII usage
> of spacing diacritics like `this', where a grave accent character
> is being misapplied to make a directional quotation mark, as
> there is actual, linguistically appropriate use of spacing
> diacritics.
>
> Also, everyone should consider carefully the status of UAX #29,
> Text Boundaries.
>
> 
> 2 Conformance
>
> This is informative material. There are many different ways to
> divide text elements corresponding to grapheme clusters, words
> and sentences, and the Unicode Standard and this document do not
> restrict the ways in which implementations can do this.
>
> This specification is a default mechanism;
> more sophisticated engines can and should tailor it for particular
> locales or environments. ...
> 
>
> The whole UAX is informative. It is a here's-how-you-can-approach-
> the-problem implementation guide with some suggestions for
> rules and classes.
>
> *If* you are working with an orthography that uses one or more
> spacing diacritics, and
> *If* those spacing diacritics need to be represented by
>  sequences,
>
> then you are in the situation where your implementation of
> text boundaries should take  sequences explicitly
> into account, so as to result in expected behavior for that
> orthography.
>
> Everyone has had experiences with their platform UI producing
> bad results for text boundaries. The Solaris platform I am
> writing this on right now, for example, implements a double-click
> word selection that treats the string "`this'," above, including
> the grave accent, the apostrophe, and the comma, as a "word".
> Is that right or wrong? Well, it depends on what you are trying
> to do, I expect.
>
> But even the most sophisticated platform implementers can only
> do so much with processes like default word selection. It is
> bound to be wrong for one purpose or another and for one
> orthography or another. Ultimately you need to have tailored
> processes that can be orthography-specific if you want to
> get best results.
>
> --Ken
>
>
>

Re: Questions on ZWNBS - for line initial holam plus alef

On 12/08/2003 20:28, John Cowan wrote:

Peter Kirk scripsit:

 

2) In attribute values, LF, CR, and TAB characters are normalized to 
spaces.   Not relevant here.
 

This would be relevant if it is legal for the character after LF, CR, 
and TAB to be a combining mark. Is this legal? In this case what was 
previously a defective (but legal) combining sequence would turn into a 
non-defective one, but the intended whitespace would be lost.
   

The point is that there is no such thing as an *intended* line break in
an attribute value; it will *always* be translated to a space before
the application sees it.  (More exactly, line-break characters can
be inserted into attribute values, but only with the use of a numeric
character reference such as "&#xA;".)
 

Sorry, I'm confused. Are you saying that the input processing will 
translate line breaks into spaces within attribute values, unless 
inserted as &#xA; ? Well, I suppose this is fair enough as it is up to
the user not to enter garbage.

 

Not just a rendering glitch, I suspect. If the combining character is 
combined with the separating space, the space loses many of its 
separating functions, and perhaps keeps a confusing subset of them with 
all sorts of possibilities of error.
   

The space(s) will be used to separate individual tokens at processing
time.  No spacing diacritic (either single-character or space+combining)
is permitted in a NMTOKEN.
 

OK if this is clearly illegal, but this might restrict use of some 
languages in NMTOKEN. Would NBSP + combining be allowed?

 

At best tokens beginning with
combining characters will be unusable. At worst they will crash the 
implementation (and count on someone trying deliberately to do that!). 
   

In effect, the combining character will constitute a defective combining
sequence at the beginning of the individual token.
Stepping away from the letter of the standard for a moment, there is
no real reason to begin a NMTOKEN with a combining character.  It is
only allowed as a result of the miscegenation of SGML concepts with
Unicode ones.
In SGML's original design of tokens, they consisted of letters and digits
(and a few punctuation marks, which functioned as letters).  There were
four kinds: a NUMBER could contain only digits, a NAME could not begin
with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
restrictions.  ID and IDREF had the same syntax as NAME with additional
semantics.  Later, the categories "letter" and "digit" were generalized,
by redefining the concrete syntax, to be whatever you wanted, and were
renamed "name-start" and "name" characters (technically, a name character
was a letter *or* a digit).
When SGML was simplified to produce XML, only NMTOKEN, the most general
type of token, was kept.  However, in order to keep the semantics of
"letter" and "digit" in the Unicode world, "letter" was extended to be any
letter and "digit" to be any digit *or* combining character.  That worked
well for ID and IDREF, since treating combining characters as part of
"digit" prevented them from appearing first, as was only sensible.
Unfortunately, NMTOKENs, since there were no restrictions, became able
to begin with a combining character, though that made no real sense.
To write in a restriction would make it impossible to specify XML's
concrete syntax in SGML terms, which did not allow for three different
classes of characters within tokens.  So we wound up with a basically
useless capability that if used will only cause trouble.
 

There is some potential for real trouble here, if one process outputs an 
NMTOKEN starting with a combining character preceded by a separating 
space, or something else which is changed into a space, and another 
process takes the new space plus combining character as a unit and so 
doesn't recognise the separation. Any hackers and virus programmers 
reading this will soon start flooding the Internet with tokens beginning 
with combining characters in the hope of crashing implementations or 
finding back doors. Of course this wouldn't have been a problem if 
Unicode had never defined space plus combining character as legal and
meaningful. But this is not my problem!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

On 13/08/2003 11:09, Philippe Verdy wrote:

... For this reason, defective
combining sequences (combining characters without a leading base
character) should be forbidden (invalid for XML).
 

If there is even the remotest possibility of this happening, we need to 
know quickly! Defective combining sequences are legal Unicode and are 
now being suggested for use in Hebrew e.g. for holam male. But such a 
definition would be useless if XML restricts the texts it can represent 
to a subset of Unicode excluding such sequences.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk scripsit:

> These processes cannot 
> simply take a space as a space and process it as such. Every time they 
> come across a space (which is very often!) they have to test whether it 
> is followed by a combining character, and if it is they have to treat 
> that space specially. 

This must be done for all other base characters as well.

> This has created a serious problem for
> implementers, which is why they have produced non-conforming 
> implementations - and we are not talking about small companies which 
> have rushed into the market recently, we are talking about Microsoft, 
> among others, which has been sponsoring Unicode from the start, I
> understand.

You don't have (nor do I) the vaguest idea why Microsoft produced
this particular nonconforming implementation, or whether they
consider it a bug or not.

> Surely the UTC should not create difficulties for 
> implementers and then just shout at them for getting things wrong. The 
> UTC should try to produce a standard which is workable without 
> unnecessary complications.

This is sheer conjecture.

-- 
John Cowan  www.ccil.org/~cowan  [EMAIL PROTECTED]  www.reutershealth.com
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson

RE: Questions on ZWNBS - for line initial holam plus alef

> Of course one is not required to build an actual DOM tree;
> however, XML, HTML and the like are now defined in terms of the DOM, where
> the text/xml syntax is just a serialization, which is the only place where
> whitespace normalization is defined (such normalization does not occur at
> the DOM level, and an XML document may be serialized with a concrete syntax
> other than the one assigned to the "text/xml" MIME type, registered and
> documented by the W3C).

No.

"XML documents are made up of storage units called entities, which contain
either parsed or unparsed data. Parsed data is made up of characters, some
of which form character data, and some of which form markup. Markup encodes
a description of the document's storage layout and logical structure. XML
provides a mechanism to impose constraints on the storage layout and logical
structure." (XML, Introduction. XML1.1 will not change that).

*XML applications* can be defined in terms of the DOM, but they can also be
defined in terms of the XML Information Set, XPath, by extending one of the
above, or through some other model (e.g. in terms of SAX events). Many
applications are defined in terms of the Information Set or XPath.

None of this actually matters here of course, because there is still no
problem with the use of space and NBSP with combining characters unless you
use that in names or nmtokens.
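
As a rough illustration of that point, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample document and the choice of U+033C are assumptions made for the example; it only shows that ordinary element content keeps a SPACE that serves as the base of a combining mark.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element content that starts with SPACE + U+033C
# (COMBINING SEAGULL BELOW), i.e. a space used as a base character.
doc = "<p> \u033c marks linguolabial sounds</p>"

text = ET.fromstring(doc).text

# The character data comes back unchanged, so the space is still
# available as the base character for the combining mark.
print([f"U+{ord(c):04X}" for c in text[:2]])
```

Names and nmtokens are a different matter, because a space can never be part of them.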

> and whitespace normalization in XML documents
> serialized as "text/xml" is mandatory, or it is not a valid "text/xml"
> serialization.

But it doesn't matter.

> Processing a "text/xml" document in a way that would be incompatible with
> what a DOM tree builder would create is not conforming.

Doesn't matter.

> If this is
> different, then it is not XML but a derived language (for example HTML or
> SGML which are using more "relaxed" syntaxes).

XML is derived from SGML, not the other way around. Still doesn't matter.

> If an application does not build the DOM tree, it is still required to
> perform namespace resolution

Namespace resolution, do you mean complying with Namespaces in XML? XML
parsers aren't required to do that, and it still doesn't matter.

> Without DOM interoperability, XML would be another imprecise language like
> HTML,

HTML is pretty precise; most of the imprecision is quite possible in XML as
well. Comparing HTML with XML is a pretty fruitless exercise beyond "oh look,
this one has pointy brackets as well".

Still doesn't matter.

> with very little reusability due to naming conflicts.

Naming conflicts are perfectly possible with XML applications that don't use
Namespaces. Which they are perfectly within the spec in doing, and where
combining diacritics still don't matter.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Thomas M. Widmann

Peter Kirk <[EMAIL PROTECTED]> writes:

> On 08/08/2003 08:54, Philippe Verdy wrote:
> 
> > ... Could there be another codepoint assigned that has
> >
> >these properties:
> >
> >20CF;ZERO WIDTH SYMBOL;Sk;0;ON; 0020N;
> > [...]
> But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you
> are suggesting other uses in which it really has zero width. Well, it
> might have in a case like line initial holam which shifts on to a
> following silent alef, but that is a rather special case.

What would be a better name?  ACCENT CARRIER?

/Thomas
-- 
Thomas Widmann, MA  +44  141 419 9872   Glasgow, Scotland, EU
[EMAIL PROTECTED] http://www.widmann.uklinux.net

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk scripsit:
> On 13/08/2003 11:09, Philippe Verdy wrote:
> 
> >... For this reason, defective
> >combining sequences (combining characters without a leading base
> >character) should be forbidden (invalid for XML).
> > 
> >
> If there is even the remotest possibility of this happening, we need to 
> know quickly! 

As a member of the XML Core Working Group of the W3C, I can assure you that
there is not even the remotest possibility of it.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?

Re: Questions on ZWNBS - for line initial holam plus alef

John Hudson scripsit:

> Again, you are working on the assumption that U+0020 is represented by an 
> actual painted glyph and not e.g. by a horizontal offset. In my experience, 
> the more sophisticated the application -- e.g. a professional page layout 
> application rather than a word processor -- the more likely it is that 
> white space characters will not be consistently treated as painted glyphs. 

I'm working on the assumption that applications that claim to conform to
Unicode actually do conform to it.  If they don't, and it's not the font
foundry's fault, then complain, complain, complain!  It's not Unicode
that's broken, it's the implementation.

> I've heard convincing arguments from the engineers of such applications 
> that the space character shouldn't be a glyph in the font at all, but 
> should simply be a numeric value telling applications how large an offset 
> to apply. Since most fonts do not contain glyphs for variant white space 
> characters such as thin and hair spaces, applications typically treat these 
> as offset values. Painting a glyph is only one way to represent a character.

Nothing in the Unicode Standard says those oddball spaces have to work
"correctly" with combining diacritics.

-- 
A mosquito cried out in his pain,   John Cowan
"A chemist has poisoned my brain!"  http://www.ccil.org/~cowan
The cause of his sorrow http://www.reutershealth.com
Was para-dichloro-  [EMAIL PROTECTED]
Diphenyltrichloroethane.(aka DDT)

Re: Questions on ZWNBS - for line initial holam plus alef

On Monday, August 11, 2003 12:27 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:

> A point I keep trying to make, but which often gets overlooked
> by people trying to code Unicode mechanisms for dealing with
> edge cases, is that the design goal of the Unicode Standard is,
> and always has been, to represent *plain text content*. It
> cannot, and should not, IMO, deal with requirements for
> representing arbitrarily fine distinctions of typographical
> detail in all manuscripts and other documents in all writing
> systems of the world.

Spacing diacritics are not "on the edge" of the standard: they are
already given a full block and are handled there as symbols (not as
letters, as some parts of the UAXes suggest), with their own identity
independent of their actual glyphic representation. I am not
discussing the typesetting of these grapheme clusters but the
textual semantics of combining sequences with an invisible base
character, which affect all their properties and are not fully
described in the various standard annexes. Given the huge amount of
legacy text that uses SPACE+diacritics, and the already normative
parts of some standard annexes, it will be hard to correct the
behavior or change the text of these annexes. This is where a new
base character, better suited than SPACE, could help resolve the
ambiguities cleanly.

-- 
Philippe.
Spams non tolérés: tout message non sollicité sera
rapporté à vos fournisseurs de services Internet.

Re: Questions on ZWNBS - for line initial holam plus alef

- Original Message - 
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Peter Kirk" <[EMAIL PROTECTED]>; "Kenneth Whistler"
<[EMAIL PROTECTED]>
Sent: Monday, August 11, 2003 5:39 PM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> Peter Kirk  wrote:
>
> > Thank you, Ken. Well, you make it sound as if the problems are
> > minimal, and that version I can just about accept. But if Philippe is
> > correct about what he says about UAX#29 and UAX#14, there are some
> > more serious problems. It is certainly highly inappropriate for
> > non-spacing diacritics to be considered word boundaries.
>
> Non-spacing diacritics had better not be word boundaries, otherwise a
> string like Québec (spelled with U+0301, as here) would be considered
> two words.  I don't have time right now to look up the relevant
> properties and UAX's, but I sincerely hope this is just another
> "Philippe mistake" and not a general misinterpretation that anyone might
> make.

Not a mistake from me, sorry. From you yes: Peter Kirk probably wanted
to speak about *spacing* diacritics (when coded with SPACE+NSM).
There is no such *spacing* character in "Québec".
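
For reference, the character properties behind this distinction can be inspected with Python's unicodedata module. This is only a sketch: it reports the Unicode version bundled with the interpreter, not Unicode 4.0, and the variable names are invented for the example.

```python
import unicodedata

# "Québec" spelled with the non-spacing U+0301 COMBINING ACUTE ACCENT.
word = "Que\u0301bec"

# Print code point, general category and canonical combining class.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))

# NFC folds the letter and the mark into the precomposed U+00E9.
composed = unicodedata.normalize("NFC", word)
```

U+0301 is reported as a non-spacing mark (Mn), so the decomposed spelling contains no spacing diacritic and no SPACE base.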

Don't accuse me of something I did not say. And please be more tolerant
of what is an obvious typo in the message from Peter Kirk. Instead of
just flaming, could you read the message more carefully, accept errors,
and correct them, rather than sending such unconstructive replies?

Thanks.

Re: Questions on ZWNBS - for line initial holam plus alef

On 13/08/2003 04:44, Jon Hanna wrote:

> No, the safe thing to do (and the thing that is done) is to treat the space
> as a space ignoring the fact that the NMTOKEN contains a combining
> character, this is even safer than your suggestion since it can't
> mis-identify the combining properties of a character.

OK, it's safe, but it is a misuse of Unicode. As space plus combining 
character is a unit in Unicode, it should be treated as a unit by higher 
level protocols. If higher level protocols are allowed to do arbitrary 
things within Unicode units, there is no end to the possible confusion. 
See for example, from Unicode 4.0 chapter 3:

C7 A process shall interpret a coded character representation according 
to the character
semantics established by this standard, if that process does interpret 
that coded character
representation.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 06:59, Jon Hanna wrote:

> There are only two theoretical problems that I can see here, the first is
> that a whitespace character other than space gets converted to space by
> attribute value normalisation, and that this changes the meaning of the text
> in some way. This could only occur if the combining character were the first
> character in a line of text, which is quite a nonsensical construct to begin
> with.

Not at all! Imagine a tutorial on a language, which might well list the 
accents used, in a format like this:

` (grave accent) is used with a, e and o, and indicates more open 
pronunciation
^ (circumflex accent) is used with any vowel, and indicates lengthening

So far so good, but when I get to an accent with no predefined spacing 
variant, I have a problem!
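
The fallback discussed elsewhere in this thread for such an accent is SPACE followed by the combining mark. A small Python sketch of that convention follows; the helper name spacing_form is hypothetical, and the data comes from the Unicode version shipped with the interpreter.

```python
import unicodedata

def spacing_form(mark: str) -> str:
    """Return SPACE + mark, the fallback spelling for an isolated diacritic."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    return "\u0020" + mark

# U+00B4 ACUTE ACCENT is a predefined spacing variant; its compatibility
# decomposition is SPACE + U+0301.
print(unicodedata.decomposition("\u00b4"))

# U+033C COMBINING SEAGULL BELOW has no such variant, so it is spelled out.
seagull = spacing_form("\u033c")
```

Whether the result displays well is a separate question that depends on the font and the rendering engine.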

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

- Original Message - 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Jon Hanna" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 3:05 PM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> On 13/08/2003 04:44, Jon Hanna wrote:
>
> >No, the safe thing to do (and the thing that is done) is to treat the
> >space as a space ignoring the fact that the NMTOKEN contains a combining
> >character, this is even safer than your suggestion since it can't
> >mis-identify the combining properties of a character.
> >
> >
> OK, it's safe, but it is a misuse of Unicode. As space plus combining
> character is a unit in Unicode, it should be treated as a unit by higher
> level protocols. If higher level protocols are allowed to do arbitrary
> things within Unicode units, there is no end to the possible confusion.
> See for example, from Unicode 4.0 chapter 3:
>
> C7 A process shall interpret a coded character representation according
> to the character semantics established by this standard, if that process
> does interpret that coded character representation.

OK, but XML inherits its behavior from SGML and you won't change it.
The only way to bypass this would be to use entity references to encode
the base space needed by the Unicode convention, so this is related to
what Unicode defines as a higher level protocol, needed here to bypass
the limitations of basic text. However it still creates a problem within
CDATA sections, which are not supposed to contain entity references.
One needs then to use the XML CDATA escaping mechanism with
another escaping system specific to CDATA sections (which are
formally anonymous text elements and equivalent to them).

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 11:45, Kenneth Whistler wrote:

> Peter Kirk responded:
>
> > On 11/08/2003 06:59, Jon Hanna wrote:
> >
> > > There are only two theoretical problems that I can see here, the first is
> > > that a whitespace character other than space gets converted to space by
> > > attribute value normalisation, and that this changes the meaning of the text
> > > in some way. This could only occur if the combining character were the first
> > > character in a line of text, which is quite a nonsensical construct to begin
> > > with.
> >
> > Not at all! Imagine a tutorial on a language, which might well list the
> > accents used, in a format like this:
> >
> > ` (grave accent) is used with a, e and o, and indicates more open
> > pronunciation
> > ^ (circumflex accent) is used with any vowel, and indicates lengthening
>
> We're going round and round in circles here. Those are not lines
> starting with a combining character, but lines starting with
> a *spacing diacritic*.
>
> > So far so good, but when I get to an accent with no predefined spacing
> > variant, I have a problem!
>
> Either you have the spacing diacritic encoded (as in those instances),
> or the standard indicates that you can represent one by applying the
> nonspacing, *combining* mark to SPACE. In those instances, the line
> still doesn't start with a combining mark -- it starts with a SPACE
> character serving as the base character for the combining mark.
> --Ken

Thanks for the clarification. I probably misunderstood Jon's intention. 
But is there a problem if, for example, an application sees the string 
 and regularises it (wrongly!) to ?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Kenneth Whistler" <[EMAIL PROTECTED]>
> It is perfectly reasonable, as I see it, to consider the
>  in a  sequence to be:
>   a. significant
>   b. part of the characters in a document that are not markup
>  (at least in the cases we are talking about, since the
>  problem is not about defining Nmtokens for markup in
>  Biblical Hebrew, but rather the representation of the
>  Biblical Hebrew document content itself)
>  
> So I *still* don't see the problem you are on about, and even
> if there was one, the xml:space attribute could be used to
> require preservation of a particular space.

Maybe you are forgetting that in XML and HTML, attributes
(including "special" attributes like "xml:space") can have default
values, and in fact they have such values set in DTDs or
schemas by normative XML applications like XHTML.
Authors are not supposed to modify normative schemas or DTDs,
and so use elements with their default attributes. This is the case
for XHTML as an application of XML, and HTML as an
application of SGML. (Neither HTML nor SGML parsers will
interpret the xml:space attribute, and XML parsers will handle it
only if they are validating documents with their DTD or schema.)

Re: Questions on ZWNBS - for line initial holam plus alef

Philippe Verdy scripsit:

> Of course one is not required to build an actual DOM tree, however XML, HTML
> and alike is now defined in terms of the DOM, where the text/xml syntax is
> just a serialization,

This is absolutely false.  XML is defined by the XML Recommendation, which
is entirely syntactic.  As a matter of convenience, many other XML
recommendations use the XML Infoset, which is by no means the same as the
DOM.  The DOM is an abstract API for programmatic access to the content
of XML documents.

> which is the only place where whitespaces
> normalization is defined (such normalization does not occur at the DOM
> level, and a XML document may be serialized with another concrete syntax
> than the one assigned to the "text/xml" MIME type, registered and documented
> by the W3C.

"May" be, yes.  You can serialize it in ASN.1 if you want to.  That doesn't
make ASN.1 an instance of XML.

> [W]hitespace normalization in XML documents
> serialized as "text/xml" is mandatory, or it is not a valid "text/xml"
> serialization.

Very true.  But what is this whitespace normalization?

1) Throughout the document, line-end characters and sequences are normalized
   to LF.  Not relevant here.

2) In attribute values, LF, CR, and TAB characters are normalized to spaces.
   Not relevant here.

3) In attribute values that have a declared type other than CDATA, multiple
   spaces are compressed to a single space, and leading and trailing spaces
   are removed.  After this is done, there can be no spaces in attributes
   of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types.
   In the types IDREFS and ENTITIES, spaces are used to separate
   individual tokens, none of which may begin with a combining character.
   In the remaining type, NMTOKENS, individual tokens may begin
   with a combining character, so it is possible that such a token, if
   not the first in the attribute, will be rendered in a peculiar way,
   with the combining character placed over the separating space.
   But that is a mere rendering glitch and in no way affects anything.
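
Steps 2 and 3 above can be sketched in a few lines of Python. This is a simplified illustration, not the full algorithm from the XML Recommendation (it ignores character and entity references, for instance), and the function name normalize_attribute is invented for the example.

```python
import re

def normalize_attribute(value: str, declared_type: str = "CDATA") -> str:
    # Step 2: literal TAB, LF and CR in an attribute value become spaces.
    value = re.sub(r"[\t\n\r]", " ", value)
    if declared_type != "CDATA":
        # Step 3: collapse runs of spaces and trim both ends.
        value = re.sub(r" +", " ", value).strip(" ")
    return value

# An NMTOKENS value whose second token starts with U+0301.
tokens = normalize_attribute("  alpha \t \u0301beta ", "NMTOKENS").split(" ")
print(tokens)
```

The second token still begins with the combining character. A renderer may draw the mark over the separating space, but the token boundaries themselves are unaffected.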

> No XML application is required to use the "text/xml"
> MIME syntax, and there exists such examples (for example the serialization
> and compression formats used by WAP, MMS, Nec's i-Mode, and SOAP).

That is not the definition of "XML application" given in the XML Recommendation,
which is the sole authority on the subject.  You can invent your own
definitions if you like, but you need not expect to be listened to.

> If an application does not build the DOM tree, it is still required to
> perform namespace resolution

No XML application is required to perform "namespace resolution", whatever
that may be.

> to solve named entities according to the
> standard "text/xml" MIME rules formulated by the W3C reference,

Only certain named entities *must* be resolved: specifically, internal
entities that are defined in the internal subset.

> In my opinion, all XML-based languages should
> be defined now in terms of its DOM structure, and the XML application should
> be defined by a valid DTD, or better now with a now standard XSD schema, that
> can be processed by validating parsers (parsers that absolutely need to
> create a DOM-like tree or flow of tokens with strictly defined properties,
> value sets and behavior.)

In your *opinion*.

> Without DOM interoperability, XML would be another imprecise language like
> HTML, with very little reusability due to naming conflicts. 

Nonsense.

*plonk*

-- 
There is / One art  John Cowan <[EMAIL PROTECTED]>
No more / No less   http://www.reutershealth.com
To do / All things  http://www.ccil.org/~cowan
With art- / Lessness -- Piet Hein

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 16:06, Mark Davis wrote:

> Some of this seems to be in reference to an earlier contention that
> Text Boundaries (inc. Lines) break between the space and the
> non-spacing mark. I think this was attributed to Philippe.
> [This may not be true: I don't actually read his email, because the
> information content per line falls below my email threshold; not to
> say that there may not be information there, but I cannot afford to
> take the time to find out -- sadly, one of my character flaws.]
>
> All of the text boundaries preserve grapheme cluster boundaries, which
> never separate a base character (including space and NBSP) from a
> following NSM. In addition, each of the boundary types above grapheme
> clusters make some statement about the behavior of the grapheme
> cluster. For example, with line boundaries a SPACE + NSM has a special
> behavior. With the others, the behavior is the same as the base
> character.
>
> As Ken points out, in any event these are default boundaries, and can
> be tailored. That being said, if the normal behavior of the default
> can be improved, and someone has a concrete proposal for doing so,
> then it can be considered.
>
> Mark
> __
> http://www.macchiato.com
> ►  “Eppur si muove” ◄


I was aware that there should not be a line break or word break between 
the space and the NSM, although I suspect that many implementers will 
not be aware of this, or at least will not test for it properly and so 
treat any space as a word break and a line break opportunity. As I just 
wrote, this requirement to test all spaces for following NSMs is a 
significant inefficiency built into the standard.
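
To make the cost concrete, here is a minimal Python sketch of the lookahead in question. It is not an implementation of the UAX #29 or UAX #14 boundary rules, only a naive splitter that treats SPACE + mark as a unit; the function names are made up for the illustration.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")

def split_on_plain_spaces(text: str) -> list[str]:
    """Split on U+0020, keeping a space that carries a combining mark."""
    words, current = [], ""
    for i, ch in enumerate(text):
        if ch == " ":
            # One character of lookahead: does this space act as a base?
            carries_mark = i + 1 < len(text) and is_mark(text[i + 1])
            if current:
                words.append(current)
            current = " " if carries_mark else ""
        else:
            current += ch
    if current:
        words.append(current)
    return words

print(split_on_plain_spaces("a \u0301 b"))
```

The extra work per space is a single category lookup on the following character.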

But there is still a problem if there is considered by default to be a 
word break and a line break opportunity AFTER the NSM. I would suggest, 
as a candidate for a concrete proposal, that the default behaviour be 
adjusted so that there is no word break or line break opportunity here 
either.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Kent Karlsson

Michael wrote:
> The Name Police reject this utterly. ZERO WIDTH cannot have an 
> expanding dynamic width.

Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
"can grow to have a visible width when justified"? And it has the
NamesList comment:
* nominally zero width, but may expand in justification

(But U+0082, BREAK PERMITTED HERE, which otherwise is very similar
to ZWSP according to 6429, does apparently not allow such stretching...)

/kent k

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread John Hudson

At 11:36 AM 8/11/2003, John Cowan wrote:

> > So far so good, but when I get to an accent with no predefined spacing
> > variant, I have a problem!
>
> No you don't.  If you want to say  is the diacritic used to
> represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C
> at the beginning of the next line.  If the seagull doesn't line up properly,
> you complain to the foundry or the implementor.

Again, you are working on the assumption that U+0020 is represented by an 
actual painted glyph and not e.g. by a horizontal offset. In my experience, 
the more sophisticated the application -- e.g. a professional page layout 
application rather than a word processor -- the more likely it is that 
white space characters will not be consistently treated as painted glyphs. 
I've heard convincing arguments from the engineers of such applications 
that the space character shouldn't be a glyph in the font at all, but 
should simply be a numeric value telling applications how large an offset 
to apply. Since most fonts do not contain glyphs for variant white space 
characters such as thin and hair spaces, applications typically treat these 
as offset values. Painting a glyph is only one way to represent a character.

Regards, John

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit

Re: Questions on ZWNBS - for line initial holam plus alef

On 13/08/2003 15:54, Jony Rosenne wrote:

> Suggested but not accepted.
>
> I am inherently suspicious when pressure is being exerted to decide complex
> and difficult questions in a hurry.
> Jony


Jony, I am not trying to hurry anything. I am putting a lot of time and 
effort into trying to reach proper decisions on these complex and 
difficult questions. What I am not prepared to do is to accept a quick 
answer that the lowest common denominator of printers don't bother to do 
X, therefore we need not bother to support X in Unicode although X is a 
definite requirement of a significant subset of Hebrew users.

If you have problems with this particular suggestion, let's discuss them 
on the Hebrew list.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Questions on ZWNBS - for line initial holam plus alef

> The only way to bypass this would be to use entitiy references to encode
> the base space needed by the Unicode convention, so this is related to
> what Unicode defines as a higher level protocol, needed here to bypass
> the limitations of basic text. However it still creates a problem within
> CDATA sections, which are not supposed to contain entity references.
> One needs then to use the XML CDATA escaping mechanism with
> another escaping system specific to CDATA sections (which are
> formally anonymous text elements and equivalent to them).

Wow!

You can't have a CDATA section within or containing a name or nmtoken.
You can't have an entity reference within element or attribute names, the
most common use of names.
You don't want an entity reference with any other name or within an nmtoken,
it would be very poor design to use characters that were awkward for
developers (the only people who would ever have to deal with this stuff at
that level) to type.
CDATA sections aren't affected by the part of white-space handling we are
discussing.

The idea of creating an escaping mechanism specific to (or at all applicable
to) CDATA sections is mind-hurtingly bad even in hypothetical terms.

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Kent Karlsson

Kenneth Whistler wrote:

> Kent Karlsson said:
> 
> > I see no particular *technical* problem with using WJ, though.  In
> > contrast to the suggestion of using CGJ (re. another problem) anywhere
> > else but at the end of a combining sequence. CGJ has combining class 0,
> > despite being invisible and not ("visually") interfering with any other
> > combining mark. Using CGJ at a non-final position in a combining
> > sequence puts in doubt the entire idea with combining classes and
> > normal forms.
> 
> Why? 

See above (I DID write the motivation!). Combining classes are generally
assigned according to "typographic placement". Combining characters
(except those that are really letters) that have the "same" placement,
and "interfere typographically" are assigned the same combining class,
while those that don't get different classes, and the relative order is
then considered unimportant (canonically equivalent). How is then,
e.g.  supposed to be different from
 (supposing all involved characters
are fully supported), when  is NOT
supposed to be much different from 
(them being canonically equivalent)? An invisible combining character
does not interfere typographically with anything, it being invisible!
The other invisible (per se!) combining characters with combining
class 0, the variation selectors, are ok, since their *conforming* use
is very highly constrained. Maybe I've been wrong, but I have taken
CGJ as similarly constrained as it was given a semantics only when
followed by a base character (but now it seems to have no semantics
at all).

> There are any number of combining characters with combining
> class 0, including the vast majority of Indic dependent vowels,
> for instance.

These are ok. They are not invisible, and the vowels should not
reorder amongst themselves in a single combining sequence (I know,
there is normally only one vowel per syllable, but as the Hebrew
discussion has shown, one should not generalise too much),
regardless of placement (before, above, below, after, before&after,
...).
So at least they should have the same combining class, regardless
of typographic placement. (This should have been the case also
for the Hebrew vowels...) But class 0 (which is specially treated),
I'm not sure if that was ideal.

> A combining character sequence is a base character followed
> by any number of combining characters. There is no constraint
> in that definition that the combining characters have to
> have non-zero combining class.

Well, you cannot *conformantly* place a VS anywhere in a combining
sequence! Only certain combinations of base+vs are allowed in
any given version of Unicode. (Breaking that does not make the
combining sequence ill-formed, or illegal, but would make it
non-conformant, just like using an unassigned code point.)

> Canonical reordering is scoped to stop at combining class = 0.

(I know it is. But I confess I'm not sure why.)

> It doesn't say that it applies to combining character sequences
> per se. It applies to *decomposed* character sequences
> (meaning, effectively, any sequence which has had the recursive 
> application of the decomposition mappings done).

Yes, for the definition of normalisation. But not necessary for
canonical equivalence. Your point?

> Take a Myanmar example: /kau/:
> 
> character sequence:   <1000, 1031, 102C, 1039, 200C>
> combining?:            no    yes   yes   yes   no
> combining classes:     0     0     0     9     0
> comb char sequence:    |--------------------|
> canon reorder scope:   |--|  |--|  |--------|  |--|
> 
> The combining character sequence here is: <1000, 1031, 102C, 1039>
> The *syllable* consists of that plus the trailing ZWNJ.
> But the relevant sequences for application of the
> canonical reordering algorithm are each sequence starting
> with combining class zero and continuing through any
> sequence with combining class not zero.
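
The combining classes and reordering scopes in this example can be recomputed with Python's unicodedata module. This is only a sketch: the values come from the Unicode version bundled with the interpreter, and the variable names are invented.

```python
import unicodedata

# <1000, 1031, 102C, 1039, 200C>, the Myanmar sequence quoted above.
sequence = "\u1000\u1031\u102C\u1039\u200C"
classes = [unicodedata.combining(ch) for ch in sequence]

# A reordering scope starts at each character with combining class 0
# and runs through the following characters with non-zero class.
scopes = []
for ch, ccc in zip(sequence, classes):
    if ccc == 0 or not scopes:
        scopes.append([])
    scopes[-1].append(f"{ord(ch):04X}")

print(classes)
print(scopes)
```

Only U+1039 has a non-zero class here, so it is the only character that belongs to the scope of a preceding starter.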

Formally, a character *pair* based definition is enough:
xy -> yx,  if 0 < cc(y) < cc(x)  (and apply that repeatedly);
no need to define any "canonically reordering scope", though
that may be marginally more efficient in an implementation
of normalisation  (but this is getting beside the topic of this
discussion).
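
A direct, deliberately inefficient Python sketch of that pair-based rule is shown below. It assumes the input is already fully decomposed, and the function name canonical_reorder is made up for the illustration.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Swap adjacent pairs xy to yx while 0 < cc(y) < cc(x) holds anywhere."""
    chars = list(text)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            x, y = chars[i], chars[i + 1]
            if 0 < unicodedata.combining(y) < unicodedata.combining(x):
                chars[i], chars[i + 1] = y, x
                swapped = True
    return "".join(chars)

# Acute (class 230) followed by dot below (class 220) gets swapped.
reordered = canonical_reorder("a\u0301\u0323")

# With CGJ (U+034F, class 0) in between, no pair satisfies the condition.
blocked = canonical_reorder("a\u0301\u034f\u0323")
```

Because CGJ has combining class 0, no swap can ever move a mark across it, which is the formal behaviour this exchange is about.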

> I don't see how introduction of CGJ into such sequences calls
> any of the definitions or algorithms into question.

No, not the algorithm, but the basic idea and design. The algorithm
as such has no "idea" how or why the combining class numbers
were assigned. But we humans do, or might have.

Again, why should not  be canonically
equivalent to , when  is canonically equivalent to ?
And I want a design answer, not a formal answer! (The latter I already
know, and is uninteresting.)

Since I think  should be canonically
equivalent to , but cannot be made
so (now), the only ways out seem to be to either formally deprecate
CGJ, or at least confine it to very specific uses. Other occurrences
would not be ill-formed or illegal, but would then be non-conformin

Re: Questions on ZWNBS - for line initial holam plus alef

From: "John Cowan" <[EMAIL PROTECTED]>

> Peter Kirk scripsit:
>
> > So far so good, but when I get to an accent with no predefined spacing
> > variant, I have a problem!
>
> No you don't.  If you want to say  is the diacritic used to
> represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C
> at the beginning of the next line.  If the seagull doesn't line up properly,
> you complain to the foundry or the implementor.

It's true that you can complain to a foundry about inappropriate glyph
positioning, but not to an implementor of other components that deal with
text boundaries. The inaccuracies we are speaking about are not in the
glyph representation but in text handling algorithms, which, unlike font
problems, are clearly part of the Unicode standard.

Re: Questions on ZWNBS - for line initial holam plus alef

On Saturday, August 09, 2003 3:11 PM, Kent Karlsson <[EMAIL PROTECTED]> wrote:

> Michael wrote:
> > The Name Police reject this utterly. ZERO WIDTH cannot have an
> > expanding dynamic width.
> 
> Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
> "can grow to have a visible width when justified"? And it has the
> NamesList comment:
> * nominally zero width, but may expand in justification
> 
> (But U+0082, BREAK PERMITTED HERE, which otherwise is very similar
> to ZWSP according to 6429, does apparently not allow such
> stretching...) 
> 
> /kent k

- ZERO WIDTH SPACE would be suitable only if it did not have the "Zs"
general category, which marks it as whitespace and as a word breaker. The
same problem arises with the general category of SPACE and NBSP, which is
a good reason to criticize them as base characters for word-like
sequences: even with NBSP there is still a word delimitation, which may
matter for orthographic and grammatical analysis, since the main
difference between SPACE and NBSP lies in line-breaking behavior, not
word-breaking behavior.

- BREAK PERMITTED HERE is a control and does not qualify as a base
character.
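
The general categories at issue here can be checked directly. What follows
is only a sketch using Python's standard unicodedata module; the CANDIDATES
table and the describe helper are names made up for this illustration. The
values reported depend on the Unicode version bundled with the interpreter:
ZERO WIDTH SPACE in particular was Zs in the Unicode 3.x data discussed
here, but is Cf in later versions of the character database.

```python
import unicodedata

# Hypothetical table of the candidate "blank base" characters discussed above.
CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00A0",
    "ZERO WIDTH SPACE": "\u200B",
    "BREAK PERMITTED HERE": "\u0082",
    "WORD JOINER": "\u2060",
    "COMBINING GRAPHEME JOINER": "\u034F",
}

def describe(ch: str) -> tuple:
    """Return (code point, general category, bidi class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.bidirectional(ch))

for label, ch in CANDIDATES.items():
    # Output varies with the interpreter's Unicode data, so none is quoted here.
    print(label, *describe(ch))
```

None of these candidates has a symbol (Sk) or letter (Lo) category, which is
the gap described in this message.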

In fact, the gaps to fill depend on the usage:

1) when the isolated diacritic is to be used as a spacing symbol that
should not be forcibly glued to the surrounding characters, the NBSP base
character is a problem, and in fact it also has the wrong character
properties, since the whole combining sequence normally inherits the
properties of its base character.
For this usage, we need something like an "INVISIBLE SYMBOL"
base character (with gc=Sk like for other existing spacing diacritics,
and probably with neutral directionality). The combining sequence
will have its width adjusted to the largest diacritic(s) applied to that
"INVISIBLE SYMBOL" base character. The nearest existing character
to fit this function is ZWS, but it is whitespace, not symbolic.

2) when the isolated diacritic is to be used as a regular letter within
words (e.g. in Traditional Hebrew), we need something like an "INVISIBLE
LETTER" base character (with gc=Lo and neutral directionality), whose
width is not necessarily supposed to be adjusted but may adjust depending
on the left or right context (in rendering engines). One could then
use an isolated circumflex between the two characters of the pair "oo", with
the diacritic centered on the touching edges of the surrounding spacing
base characters, or with a sufficient margin created on either side to make
the isolated diacritic fit. The resulting combining sequence with the INVISIBLE
LETTER and its non-spacing diacritics would be mostly non-spacing.
But this rendering may be tricky to implement in many cases, and the
renderer should be allowed to render it as a spacing diacritic, like for the
invisible symbol, except that it would not be a symbol but really a letter that
can fit within a word (and have applications for elided letters in the middle of
an unbreakable word). This function is partially implementable with CGJ, but
only if there's a preceding combining sequence or base letter, or with WJ (Word
Joiner), but that is a format control and not applicable as a base character.

For texts that want to present the isolated diacritic for its related normal
function as a diacritic, the current best solution is to use the existing
(spacing) dotted circle symbol as the base character. However this usage
is quite technical and too Unicode-specific, and is not appropriate
for all usages, since the dotted circle base character may conflict
with other uses of this symbol in a document. Some documents
prefer to use a gray-coloured Latin small letter o for such presentation
forms in rich text like HTML or RTF, but this still has the problem
that a rich-text format like HTML will break the plain text into separate
sequences, where the non-grayed diacritic must still be rendered on top
of this separate sequence. Which base character can be used in that
case? There's currently none, except trying ZWS (which does not always
work); it should rather be a non-spacing INVISIBLE LETTER than
a spacing INVISIBLE SYMBOL (which by itself has no defined width
but just a minimum width of 0).

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread John Hudson

At 05:27 PM 8/8/2003, Kenneth Whistler wrote:

> Because the mechanism for doing so -- application to SPACE or
> to NBSP -- has been specified by the standard for a decade now.

True enough, but I'm also a bit concerned about this mechanism because
white space characters are another pesky thing that not all applications 
paint. TEX, perhaps most famously, uses its own 'glue' instead of the space 
glyph in the font. And what happens when word spacing is expanded or 
contracted in text? The diacritic mark ends up being shoved to the left or 
right of where it should be. Of course, if the space glyph is not painted 
you have to rely on blind offsets for mark positioning, because unpainted 
glyphs can't be found for smart positioning lookups. As someone who cares 
about typography, I don't like blind offsets because they don't offer 
precise enough control: I would much rather have a mechanism that I can 
reliably and precisely use with glyph positioning lookups. I'm not 
suggesting that the use of space/nbspace for this purpose should be 
deprecated, only that an alternate mechanism would be useful for those who 
want more control of how combining marks are rendered on a blank base.

A similar but not identical issue was raised by Peter Constable when we 
were talking about Qere vs Ketiv readings in Biblical Hebrew. There are 
cases in which vowels are applied to elided consonants, which in some
texts results in marks applied to a blank base in mid-word. In this case, 
my concern about using space or nbspace is that these imply a word break 
where there is not, in fact, any break in the word: the blank base is part 
of the word.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
The sight of James Cox from the BBC's World at One,
interviewing Robin Oakley, CNN's man in Europe,
surrounded by a scrum of furiously scribbling print
journalists will stand for some time as the apogee of
media cannibalism.
- Emma Brockes, at the EU summit

RE: Questions on ZWNBS - for line initial holam plus alef

Kent asked:

> How should a freestanding double diacritic be encoded (for purposes of
> meta-discussions, and the like):  or  diacritic, SPACE>? 

It *could* be represented as , of course,
or for that matter , or other possibilities.
The combining character sequence, in either case, is the
 sequence.

But it *should* be represented by something visually more
meaningful, such as , which is
how the standard itself tends to represent it when needing
to engage in a meta-discussion. The whole point of a double
diacritic is its graphic application to two base characters,
which point is lost in the discussion if you don't show a
graphic base when displaying the character in isolation.
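
As a rough illustration of the visible-base convention, the sketch below
builds a display string for a combining mark using U+25CC DOTTED CIRCLE.
The helper name display_form is hypothetical, and the test for a double
diacritic (canonical combining class 233 or 234) is an assumption made for
this example, not something the standard prescribes for display.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
# Canonical combining classes 233 ("double below") and 234 ("double above").
DOUBLE_CLASSES = {233, 234}

def display_form(mark: str) -> str:
    """Place a combining mark on a visible generic base for meta-discussion."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    if unicodedata.combining(mark) in DOUBLE_CLASSES:
        # A double diacritic spans two bases, so show one on each side.
        return DOTTED_CIRCLE + mark + DOTTED_CIRCLE
    return DOTTED_CIRCLE + mark

print(display_form("\u0361"))  # COMBINING DOUBLE INVERTED BREVE
print(display_form("\u0301"))  # COMBINING ACUTE ACCENT
```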

> How should combining characters (spacing as well
> as non-spacing) that are not vertically centered *roughly* be displayed,
> e.g. , should that *roughly*
> be displayed with or without a typographic void to the left of it? 

It's up to the application. And again, I would say that if this
level of detail is a concern to the person originating the text,
then the better convention is to represent the combining character
on a *visible* generic base.

> So
> if I want a space (though not an overgrown one), should one use
> ? Or even  ZWSP, SPACE, right-side combining character>, to prevent "space
> collapse".
> And similarly for left-side combining characters. Likewise for defective
> combining sequences. If I want a visible pseudo-base, a dotted ring, or
> an
> underline, the answers are fairly clear, using a suitable character as a
> base.

Exactly. Which is why you should use such conventions if you
care about the placement in this detail.

Otherwise, you up-level and make use of whatever mechanisms a
typesetting application makes available for individual adjustment
of the placement of glyphs.

--Ken

> But not for the cases above. I don't think that should entirely up
> to each font (maker), without any recommendation. (A "should" rather
> than a "shall" is quite sufficient.)
> 
>   /kent k
> 
>

Re: Questions on ZWNBS - for line initial holam plus alef

On Saturday, August 09, 2003 12:49 AM, Michael Everson <[EMAIL PROTECTED]> wrote:

> At 14:22 -0700 2003-08-08, Kenneth Whistler wrote:
> 
> > Philippe, you are tilting at windmills, here. There is no chance
> > that the UTC is going to consider such a character, in my
> > assessment, let alone give it the properties you suggest.
> 
> Nor WG2 either.

Why is that? Is it because I am suggesting something that others may find
useful to fill a large gap in Unicode for spacing diacritics, but I am
not trusted enough, due to my errors or confusions here, for this
suggestion to be endorsed by more "serious" UTC or WG2
members?

I admit that the properties of such a character can be discussed, and
that it is possibly not a "Sk" symbol but a "Lo" letter, in which
case the name "INVISIBLE LETTER" may be appropriate (where
it could also fill the gap for Hebrew "Yerushala(y)im", but this is
possibly a distinct function for a missing letter in phonology).

Why do you think it is stupid to have a single carrier character that
would avoid adding new spacing diacritics, when the standard
combining diacritics could then be used without "quirks" like
"defective" sequences to produce the desired effect?

If you think that spacing diacritics are stupid, why then are they
given these properties and not deprecated (no longer recommended)
in the standard, in favor of the SPACE+diacritic sequences, which
are really not equivalent to spacing diacritics used as symbols
(sometimes also named "MODIFIER LETTER", which is
very misleading given their gc=Sk property) and as base
characters (to which other diacritics can be applied)?

-- 
Philippe.

Re: Questions on ZWNBS - for line initial holam plus alef

On 06/08/2003 03:38, Kent Karlsson wrote:

> Kenneth Whistler wrote:
>
> > Kent Karlsson said:
> >
> > > I see no particular *technical* problem with using WJ, though.  In
> > > contrast to the suggestion of using CGJ (re. another problem)
> > > anywhere else but at the end of a combining sequence. CGJ has
> > > combining class 0, despite being invisible and not ("visually")
> > > interfering with any other combining mark. Using CGJ at a non-final
> > > position in a combining sequence puts in doubt the entire idea with
> > > combining classes and normal forms.
> >
> > Why?
>
> See above (I DID write the motivation!). Combining classes are generally
> assigned according to "typographic placement". Combining characters
> (except those that are really letters) that have the "same" placement,
> and "interfere typographically" are assigned the same combining class,
> while those that don't get different classes, ...

Not true, as we have seen for Hebrew. It's supposed to be true, but
isn't, and the problems can't be fixed.

> ... and the relative order is
> then considered unimportant (canonically equivalent). How is then,
> e.g.  supposed to be different from
>  (supposing all involved characters
> are fully supported), when  is NOT
> supposed to be much different from 
> (them being canonically equivalent)? ...

There is no difference when the characters really do not interfere
typographically. But when they do, there is a real and, in some
languages, meaningful distinction.

> ...
>
> ... the only ways out seem to be to either formally deprecate
> CGJ, or at least confine it to very specific uses. Other occurrences
> would not be ill-formed or illegal, but would then be non-conforming.

OK, let's confine it to those specific uses where it is really needed, 
e.g. to get round the problem of combining characters with different 
combining classes which actually do interact typographically, and 
perhaps there was another one being suggested. I have no problem with 
that - as long as the list of permitted uses is not set in stone, so 
that new uses can be approved when they are discovered. But there is no 
good reason to object to its use in those cases where it is needed, 
simply because in many other cases it is not needed.
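
The use described here can be sketched with Python's standard unicodedata
module. The choice of lamed with patah (combining class 17) and hiriq
(combining class 14) is an assumption made for this illustration, loosely
based on the "Yerushalaim" spelling mentioned elsewhere in the thread; the
variable names are made up.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def classes(s: str) -> list:
    """Canonical combining class of each character in s."""
    return [unicodedata.combining(c) for c in s]

plain = LAMED + PATAH + HIRIQ           # intended order: patah, then hiriq
with_cgj = LAMED + PATAH + CGJ + HIRIQ  # same marks, CGJ in between

# Canonical reordering sorts the marks by class, so the intended order is lost.
print(classes(plain), nfc(plain) == LAMED + HIRIQ + PATAH)

# CGJ has class 0, so reordering stops there and the order survives.
print(classes(with_cgj), nfc(with_cgj) == with_cgj)
```

Whether a renderer then draws the two vowels in the intended positions is a
separate question that normalization does not answer.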

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/

RE: Questions on ZWNBS - for line initial holam plus alef

> >3) In attribute values that have a declared type other than CDATA,
> >   multiple spaces are compressed to a single space, and leading and
> >   trailing spaces are removed.  After this is done, there can be no
> >   spaces in attributes
> >   of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types.
> >   In the types IDREFS and ENTITIES, spaces are used to separate
> >   individual tokens, none of which may begin with a combining character.
> >   In the remaining type, NMTOKENS, individual characters may begin
> >   with a combining character, so it is possible that such a token, if
> >   not the first in the attribute, will be rendered in a peculiar way,
> >   with the combining character placed over the separating space.
> >   But that is a mere rendering glitch and in no way affects anything.
> >
> >
> Not just a rendering glitch, I suspect. If the combining character is
> combined with the separating space, the space loses many of its
> separating functions, and perhaps keeps a confusing subset of them with
> all sorts of possibilities of error. At best tokens beginning with
> combining characters will be unusable. At worst they will crash the
> implementation (and count on someone trying deliberately to do that!).
> The only safe thing to do is to specify that space followed by a
> combining mark is NEVER considered to be a space and this combination is
> NEVER generated.

No, the safe thing to do (and the thing that is done) is to treat the space
as a space, ignoring the fact that the NMTOKEN contains a combining
character. This is even safer than your suggestion, since it can't
mis-identify the combining properties of a character.

This effectively bans space+combining (and for that matter NBSP+combining
since NBSP isn't allowed in NMTOKENs) within an NMTOKEN and means that if
you attempt to begin an NMTOKEN with space+combining it will be treated as
beginning with the combining character.
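
A minimal sketch of that behaviour, assuming the attribute-value
normalization that XML 1.0 applies to attributes declared with a type other
than CDATA (white space replaced by spaces, runs collapsed, leading and
trailing spaces dropped). The function name is hypothetical, and entity
references, character references and line-end handling are left out.

```python
import re

def normalize_tokenized_attr(value: str) -> str:
    """Approximate XML normalization of a non-CDATA attribute value."""
    # Tab, line feed and carriage return each become a single space.
    value = re.sub(r"[\t\n\r]", " ", value)
    # Runs of spaces collapse to one; leading and trailing spaces are removed.
    return re.sub(r" +", " ", value).strip(" ")

# The space intended as a base for U+0301 is treated as an ordinary
# separator, so the first token ends up beginning with the combining mark.
tokens = normalize_tokenized_attr("  \u0301abc   def ").split(" ")
print([t.encode("unicode_escape").decode("ascii") for t in tokens])
```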

The resulting loss of expressive power in having this banned is negligible.
It means that you can't use what is quite a linguistic oddity
(space+combining is mainly used in meta-discussion of combining marks, as was
mentioned earlier) in a context which is human-readable (hopefully) but
not fully general text. NMTOKENs should only be given "raw" to a user by
relatively low-level tools (i.e. general purpose XML tools for developers);
in other contexts they should be represented by a more user-friendly and
application-appropriate indicator (perhaps text, perhaps not), so the
inability to use space+combining won't apply at that level.

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Doug Ewell

Peter Kirk  wrote:

> Thank you, Ken. Well, you make it sound as if the problems are
> minimal, and that version I can just about accept. But if Philippe is
> correct about what he says about UAX#29 and UAX#14, there are some
> more serious problems. It is certainly highly inappropriate for
> non-spacing diacritics to be considered word boundaries.

Non-spacing diacritics had better not be word boundaries, otherwise a
string like Québec (spelled with U+0301, as here) would be considered
two words.  I don't have time right now to look up the relevant
properties and UAX's, but I sincerely hope this is just another
"Philippe mistake" and not a general misinterpretation that anyone might
make.
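
For what it is worth, a naive tokenizer can make exactly this mistake. The
sketch below assumes CPython's re module, whose \w class (as far as I know)
does not match nonspacing marks, and contrasts it with a hypothetical words
helper that lets a combining mark extend the word it follows. Neither is an
implementation of the UAX #29 rules.

```python
import re
import unicodedata

def words(text: str) -> list:
    """Split text into words, keeping combining marks with their base."""
    result, current = [], []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalnum() or (is_mark and current):
            current.append(ch)
        elif current:
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result

text = "Que\u0301bec city"          # e + U+0301 COMBINING ACUTE ACCENT
naive = re.findall(r"\w+", text)    # the mark acts as a boundary here
print(len(naive), len(words(text)))
```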

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

RE: Questions on ZWNBS - for line initial holam plus alef

> Thanks for the clarification. I probably misunderstood Jon's intention.
> But is there a problem if, for example, an application sees the string
>  and regularises it (wrongly!) to  combining mark>?

Yes, I was not saying that it wouldn't be sensible to begin a line of text
with a spacing diacritic (whether precomposed or created using space or
NBSP). I was saying that it wouldn't be sensible to begin a line with a
combining diacritic, since that combining diacritic would be combining with
a newline character which it's difficult to think of any possible sensible
meaning for. Attribute normalisation would change the sequence U+000A,
 to U+0020,  which would arguably change the meaning,
but changing the meaning of a meaningless construct isn't a problem to my
mind.

Re: Questions on ZWNBS - for line initial holam plus alef

On Wednesday, August 06, 2003 10:19 PM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:

> Kent Karlsson responded:
> 
> > > > I see no particular *technical* problem with using WJ, though. 
> > > > In contrast
> > > > to the suggestion of using CGJ (re. another problem)
> > > anywhere else but
> > > > at the end of a combining sequence. CGJ has combining class
> > > 0, despite
> > > > being invisible and not ("visually") interfering with any other
> > > > combining
> > > > mark. Using CGJ at a non-final position in a combining sequence
> > > > puts in doubt the entire idea with combining classes and normal
> > > > forms. 
> > > 
> > > Why?
> > 
> > See above (I DID write the motivation!).
> 
> I guess that I did not (and still do not) see the motivation for
> your final statement.
> 
> > Combining classes are generally
> > assigned according to "typographic placement". Combining characters
> > (except those that are really letters) that have the "same"
> > placement, and "interfere typographically" are assigned the same
> > combining class, while those that don't get different classes, and
> > the relative order is then considered unimportant (canonically
> > equivalent). How is then, 
> > e.g.  supposed to be different from
> >  (supposing all involved characters
> > are fully supported), when  is NOT
> > supposed to be much different from 
> > (them being canonically equivalent)? An invisible combining
> > character does not interfere typographically with anything, it
> > being invisible! 
> 
> The same thing can be said about any inserted invisible character,
> combining or not.
> 
> How is:  supposed to be different from
> 
> 
> How is:  supposed to be different from
> 
> 
> In display, they might not be distinct, unless you were doing some
> kind of show-hidden display. Yet these sequences are not canonically
> equivalent, and the presence of an embedded control character or an
> embedded format control character would block canonical reordering.

I disagree with you: using an LRM in the middle of a combining
sequence conforms to the canonicalization rules but is clearly
ill-formed, as is using a NULL control in the middle, which
breaks the combining sequence.

So in your two examples above, inserting the LRM or NULL splits
a combining sequence and creates three, each with its own
properties, and the last one is ill-formed as it contains a combining
character after a control and not after a base or combining character.

The proposal to use CGJ however is legal: it does not break the
combining sequences and grapheme clusters, and thus the whole
encoded sequence encoded with CGJ will be considered by
rendering engines, where CGJ is a no-op for rendering but not for
the canonical ordering where I see its only well-formed use as a
canonical ordering fix for NF* normalized forms, or before a
base character to extend the combining sequences used by
renderers or character parsers and breakers.
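
The two properties being contrasted here can be checked with Python's
standard unicodedata module. This is only a sketch: the sequence a, ring
above, dot below is assumed for illustration (the sequences in the quoted
examples are not reproduced in this archive), and blocks_reordering is a
made-up helper.

```python
import unicodedata

A, RING_ABOVE, DOT_BELOW = "a", "\u030A", "\u0323"  # classes 0, 230, 220
LRM, NUL, CGJ = "\u200E", "\u0000", "\u034F"

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def blocks_reordering(ch: str) -> bool:
    """True if ch, placed between the two marks, keeps NFD from swapping them."""
    s = A + RING_ABOVE + ch + DOT_BELOW
    return nfd(s) == s

# With nothing in between, the marks are reordered by combining class.
reordered = nfd(A + RING_ABOVE + DOT_BELOW) == A + DOT_BELOW + RING_ABOVE

for label, ch in (("LRM", LRM), ("NULL", NUL), ("CGJ", CGJ)):
    # Prints the general category, the combining class, and the blocking test.
    print(label, unicodedata.category(ch), unicodedata.combining(ch),
          blocks_reordering(ch))
```

All three block canonical reordering, since each has combining class 0; the
difference is that only CGJ is itself a combining mark, so only CGJ leaves
the combining sequence in one piece.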

So your example with:

would in fact be rendered and parsed as three combining sequences:
   , , 
i.e. a well-formed , a control (normally invisible,
but which may be shown in an editor with a visible glyph such as the dotted
square in the Unicode charts), and an ill-formed isolated  (most
probably rendered with a dotted circle).

So it cannot be thought as equivalent and not even rendered
equivalently as:

or its canonical equivalents (not in normalized order but still
conforming and well-formed, and handled equivalently):

-- 
Philippe.

Re: Questions on ZWNBS - for line initial holam plus alef

On 11/08/2003 18:03, John Cowan wrote:

> You don't have (nor do I) the vaguest idea why Microsoft produced
> this particular nonconforming implementation, or whether they
> consider it a bug or not.

Don't make assumptions about things you don't know anything about. I 
have been working closely and personally with Microsoft's head of 
typography on support for Hebrew and other scripts in Uniscribe. While I 
don't happen to have detailed information on this particular point, I am 
aware of some of the constraints that Microsoft has been under e.g. to 
avoid the inefficiency of calling Uniscribe for rendering of plain text 
in western languages. This is why they have been slow to support use of 
arbitrary diacritics with Latin text. I think this issue may have been 
fixed with the soon to be released new version of Uniscribe, and perhaps 
the problem with spaces and diacritics has also been fixed. We'll see.

 

> > Surely the UTC should not create difficulties for
> > implementers and then just shout at them for getting things wrong. The
> > UTC should try to produce a standard which is workable without
> > unnecessary complications.
>
> This is sheer conjecture.

No, it is not. For one thing I have not said that the UTC has done 
anything bad, and certainly not that it has done so deliberately, only 
that it should not do so. But it is not just me who has pointed to the 
difficulty for implementers of the space + diacritic convention which 
the UTC defined (with inadequate forethought rather than malicious 
intention), see also John Hudson's independent opinions and the failure 
of Microsoft to implement it. I was wrong to suggest that the UTC is 
shouting at implementers for getting things wrong, though I think it
should do so if they do. But UTC members have told me to complain to
implementers for getting things wrong. As for my last statement, that is 
simply my opinion. If you wish to disagree with it, do you prefer that 
the UTC should deliberately produce an unworkable standard, or that it 
should introduce unnecessary complications?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

On 08/08/2003 13:56, Thomas M. Widmann wrote:

> Peter Kirk <[EMAIL PROTECTED]> writes:
>
> > On 08/08/2003 08:54, Philippe Verdy wrote:
> >
> > > ... Could there be another codepoint assigned that has
> > > these properties:
> > >
> > > 20CF;ZERO WIDTH SYMBOL;Sk;0;ON; 0020N;
> > > [...]
> >
> > But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you
> > are suggesting other uses in which it really has zero width. Well, it
> > might have in a case like line initial holam which shifts on to a
> > following silent alef, but that is a rather special case.
>
> What would be a better name?  ACCENT CARRIER?
>
> /Thomas

Perhaps CARRIER FOR COMBINING CHARACTERS - not COMBINING CHARACTER
CARRIER, as that gives the wrong idea that this should itself be a
combining character, which it should not.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk asked:

> Thanks for the clarification. I probably misunderstood Jon's intention. 
> But is there a problem if, for example, an application sees the string 
>  and regularises it (wrongly!) to  combining mark>?

Then you have a problem, of course.

What the Unicode Standard says about application of nonspacing
combining marks to SPACE seems clear to me.

What other standards say about space folding is clear in their
own contexts.

If someone is implementing both such standards together, then
one has to be careful how the requirements articulate.

In Unicode terms, a space folding is an example of a "knowing
modification" of the content of the text. It is perfectly o.k.
to modify Unicode text, of course, *as long as you know what
you are doing* -- i.e., you aren't converting valid text to
bit hash because you aren't conforming to the meaning of
the characters or to their encoding forms.

Now if a process is doing a space folding, but is applying
it to Unicode text as a "semi-ignorant modification", i.e.,
without being aware of the fact that nonspacing combining
marks can apply to SPACE characters (and that such sequences
are valid combining character sequences and should be treated
analogously with other grapheme clusters, viz UAX #29), then
it is modifying the text away from its intended content without
*knowing* what it is actually doing. Such mistakes are
programming errors in application of the relevant standards.

Of course a standard which mandates space folding is also
within its rights to mandate, for example, the non-use of
nonspacing marks applied to SPACE characters. It can simply
rule out such sequences as valid for its context, in which
case the problem goes away.

The important thing here is to know what you are doing when
you modify text, and, as far as possible, to accomplish
such modifications in ways that are the same as other
processes which also know what they are doing. That is the
basis for interoperability of textual data.
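
As a small illustration of a "knowing" space folding, here is a sketch in
which a SPACE that carries a combining mark is kept as a base rather than
folded away. The function name fold_spaces and the policy of always keeping
the last space of a run as the base are assumptions for this example, not
behavior required by any of the standards under discussion.

```python
import unicodedata

def fold_spaces(text: str) -> str:
    """Collapse runs of U+0020, but keep a space that is a base for a mark."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != " ":
            out.append(text[i])
            i += 1
            continue
        j = i
        while j < n and text[j] == " ":
            j += 1  # j ends up just past the run of spaces
        carries_mark = j < n and unicodedata.category(text[j]).startswith("M")
        if carries_mark and j - i > 1:
            out.append(" ")  # the folded remainder of the run
        out.append(" ")      # an ordinary space, or the base for the mark
        i = j
    return "".join(out)

folded = fold_spaces("a   \u0301b")
print(folded.encode("unicode_escape").decode("ascii"))
```

A context that rules out marks on SPACE altogether, as described above, would
not need this special case.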

--Ken

Re: Questions on ZWNBS - for line initial holam plus alef

Peter Kirk wrote:

> I think this may be a "Peter mistake". I meant to refer to spacing 
> diacritics. Sorry.
> 
> It is certainly highly inappropriate for spacing diacritics to 
> be considered word boundaries.

Why? It is entirely dependent on the orthography and conventions
involved. There is probably as much (or more) bad ASCII usage
of spacing diacritics like `this', where a grave accent character
is being misapplied to make a directional quotation mark, as
there is actual, linguistically appropriate use of spacing
diacritics.

Also, everyone should consider carefully the status of UAX #29,
Text Boundaries.

2 Conformance

This is informative material. There are many different ways to
divide text elements corresponding to grapheme clusters, words 
and sentences, and the Unicode Standard and this document do not
restrict the ways in which implementations can do this.

This specification is a default mechanism;
more sophisticated engines can and should tailor it for particular
locales or environments. ...

The whole UAX is informative. It is a here's-how-you-can-approach-
the-problem implementation guide with some suggestions for
rules and classes.

*If* you are working with an orthography that uses one or more
spacing diacritics, and
*If* those spacing diacritics need to be represented by
 sequences,

then you are in the situation where your implementation of
text boundaries should take  sequences explicitly
into account, so as to result in expected behavior for that
orthography.

Everyone has had experiences with their platform UI producing
bad results for text boundaries. The Solaris platform I am
writing this on right now, for example, implements a double-click
word selection that treats the string "`this'," above, including
the grave accent, the apostrophe, and the comma, as a "word".
Is that right or wrong? Well, it depends on what you are trying
to do, I expect.

But even the most sophisticated platform implementers can only
do so much with processes like default word selection. It is
bound to be wrong for one purpose or another and for one
orthography or another. Ultimately you need to have tailored
processes that can be orthography-specific if you want to
get best results.

--Ken

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Jim Allan

Ken Whistler posted:

> Of course a standard which mandates space folding is also
> within its rights to mandate, for example, the non-use of
> nonspacing marks applied to SPACE characters. It can simply
> rule out such sequences as valid for its context, in which
> case the problem goes away.

And for such standards or applications one can usually use U+00A0
NO-BREAK SPACE to force multiple spacings.

One can also use this followed by a non-spacing combining character to 
call for rendering of that combining character in isolation.

My feeling is that, because of the special qualities of regular SPACE,
using NBSP (U+00A0) should be the more robust way to go.

Essentially, since the Unicode specifications say that a non-spacing 
diacritic can be applied to any base character, including the spaces, it 
is up to fonts and other presentation software to support this and to 
try to make the results look good according to orthographic and cultural
expectations, just as it is with any text coded in Unicode.

Sometimes fonts don't do this. I would not at all be surprised to find 
for example that _g_ followed by U+0325 COMBINING RING BELOW would come 
out with the combining ring overlapping the tail of the _g_ unless I 
were using a font especially designed for linguistic use.

I would not be at all surprised that some fonts and display devices 
wouldn't justify NBSP + COMBINING DOT BELOW at the beginning of a line. 
But good typographical fonts should justify such combinations and should 
presumably change the width of NBSP when appropriate.

Such changes of width and shapes are what one finds with ligatures in 
fonts that support ligatures.

Jim Allan

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Peter Kirk" <[EMAIL PROTECTED]>

> On 13/08/2003 11:09, Philippe Verdy wrote:
>
> >... For this reason, defective
> >combining sequences (combining characters without a leading base
> >character) should be forbidden (invalid for XML).
> >
> >
> If there is even the remotest possibility of this happening, we need to
> know quickly! Defective combining sequences are legal Unicode and are
> now being suggested for use in Hebrew e.g. for holam male. But such a
> definition would be useless if XML restricts the texts it can represent
> to a subset of Unicode excluding such sequences.

I did not notice that the discussion about Hebrew holam male was related.
In fact I don't know anything about the Hebrew alphabet, so I could not
understand the semantics discussed, and so did not note that 
was a "defective" encoding (in terms of combining sequences).

When I used the term "forbidden", it was only in relation to possible
security problems with XML, but the term was certainly too hasty.
However, given that possible security and parsing issues do exist, the
case of  used to encode "holam-male" may be another
argument to propose a neutral/invisible base character for combining
characters. For the case of Hebrew, it then needs to have a "letter"
behavior, but for the case of other isolated diacritics in Latin, Greek,
Cyrillic, and probably also Hiragana and Katakana (voice marks), it should
better be handled as a symbol.

I suggested several semantics for such invisible character(s) in an
earlier message:
- An invisible symbol
- An invisible LTR letter
- An invisible RTL letter
all of them having a *compatibility* decomposition (or NFKD form) to
a SPACE, like the existing spacing diacritics, but not being
canonically equivalent to SPACE (to keep the legacy semantics,
properties, behavior and known caveats separate, unchanged and
implementation/usage-dependent, as they are now with SPACE+NSM,
which could then be discouraged in Unicode and strongly deprecated
in SGML/HTML/XML).
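
The existing behavior that this proposal models itself on can be seen with
Python's standard unicodedata module. This is a sketch only; the list of
spacing diacritics and the helper name compat_form are chosen for
illustration.

```python
import unicodedata

# A few existing spacing diacritics: DIAERESIS, MACRON, ACUTE ACCENT,
# CEDILLA, BREVE, RING ABOVE.
SPACING = ["\u00A8", "\u00AF", "\u00B4", "\u00B8", "\u02D8", "\u02DA"]

def compat_form(ch: str) -> str:
    """NFKD form of ch, written as space-separated code points."""
    return " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch))

for ch in SPACING:
    # Each has a <compat> mapping to SPACE plus a combining mark ...
    print(unicodedata.name(ch), unicodedata.category(ch),
          unicodedata.decomposition(ch), "->", compat_form(ch))
    # ... but no canonical one, so NFD leaves the character alone.
    assert unicodedata.normalize("NFD", ch) == ch
```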

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Jon Hanna" <[EMAIL PROTECTED]>


> I was saying that it wouldn't be sensible to begin a line with a
> combining diacritic, since that combining diacritic would be combining
> with a newline character which it's difficult to think of any possible
> sensible meaning for.

A newline is a control with a whitespace property and a line-breaking
behavior. It must not combine with a combining diacritic, according to
the UAX definition of grapheme clusters.

So +NSM is clearly defective and must be parsed as two distinct
combining sequences, the first one for the newline sequence, the second
one being "defective" as the combining character does not have a base
character to which it applies (the standard suggests using a dotted
circle to render it in editors, but suggests nothing for the rendering
of final documents, which could simply drop the defective sequence,
display it with a replacement base character, use a dotted circle, or
use an invisible glyph). So the result in this case is
implementation-dependent, and not interoperable.
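As a rough illustration of the point, the following sketch flags lines
that begin with a combining mark. It is a simplification and only an
assumption about how one might test for this: it treats any character
with a general category starting with "M" as combining, and the function
names are hypothetical.

```python
import unicodedata

def is_defective_start(text):
    """True if text begins with a combining mark, i.e. a mark with no
    base character in front of it (simplified: general category M*)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

def flag_defective_lines(text):
    """Split on LF and report, per line, whether it starts defectively."""
    return [(line, is_defective_start(line)) for line in text.split("\n")]

# LF itself is a control character (Cc), not a base for the mark.
newline_category = unicodedata.category("\n")

sample = "abc\n\u0301def"  # second line starts with COMBINING ACUTE ACCENT
print(newline_category)
print(flag_defective_lines(sample))
```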

For me the term "difficult" is inappropriate. In fact it is invalid for
interoperability (even though it is valid, not forbidden, for
ISO10646/Unicode, as a string fragment for intermediate processing),
and such a sequence should not occur in actual documents, outside of any
external processing context which defines its behavior.

RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Michael Everson

At 15:11 +0200 2003-08-09, Kent Karlsson wrote:
> Michael wrote:
> > The Name Police reject this utterly. ZERO WIDTH cannot have an
> > expanding dynamic width.
> Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238,
> "can grow to have a visible width when justified"? And it has the
> NamesList comment:
> * nominally zero width, but may expand in justification

(Rolls eyes.)

Fine.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com

Re: Questions on ZWNBS - for line initial holam plus alef

Jon Hanna scripsit:

> If this is not the case (I'm not entirely sure this bans what XML does with
> spaces) then all we would need is a change so that rather than a de facto
> ban on space+combining within names and nmtokens we would have an explicit
> ban on the same; then we'd all be happy, except possibly for some sadistic
> XML application designer that was planning to use that combination out of
> ill-will towards his or her colleagues.

Space in any case is not allowed in a token.

There are far worse conformance problems than this anyway, notably the
fact that canonical equivalence is not respected in XML names: a start-tag
that is decomposed and an end-tag that is composed (or vice versa) will not
match.
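A small sketch of the mismatch described here, using plain string
comparison as a stand-in for the code-point-by-code-point name matching
an XML parser performs. The tag name is a made-up example.

```python
import unicodedata

# The same visible name, once precomposed and once decomposed.
start_tag_name = "caf\u00E9"    # e with acute as one code point
end_tag_name = "cafe\u0301"     # e followed by COMBINING ACUTE ACCENT

# Code point comparison: the two names do not match.
raw_match = start_tag_name == end_tag_name

# After normalizing both to NFC they do.
nfc_match = (unicodedata.normalize("NFC", start_tag_name)
             == unicodedata.normalize("NFC", end_tag_name))

print(raw_match, nfc_match)
```

Normalizing before comparison is shown only to make the canonical
equivalence visible; it is not something XML 1.0 parsers are required
to do.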

-- 
The Imperials are decadent, 300 pound   John Cowan <[EMAIL PROTECTED]>
free-range chickens (except they have   http://www.reutershealth.com
teeth, arms instead of wings and        http://www.ccil.org/~cowan
dinosaurlike tails).  --Elyse Grasso

Re: Questions on ZWNBS - for line initial holam plus alef

On 08/08/2003 08:54, Philippe Verdy wrote:

... Could there be another codepoint assigned that has
these properties:

20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;

i.e. being considered symbolic, not a whitespace, with
combining class 0 (not combining), and used as an
explicit base for an isolated spacing diacritic to never show
with a dotted circle? (note U+20CF is just a suggestion, as
it fits at end of the symbolic block used for currency symbols,
just before the "extended" combining characters block, and
because the U+02XX block where other "Sk" spacing
diacritics are defined is full).
The compatibility decomposition to a space is to make it
in sync with other compatibly decomposable spacing
diacritics.
The new character would make it possible to represent diacritics that
currently don't have a spacing counterpart, and use them as if they were
letter-like. Let's look at a similar diacritic which currently has an
existing "precombined" spacing version:
00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;
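
The properties of the existing U+00B4 entry can be checked with Python's
unicodedata module. This is only a quick way to read back the fields of
the real character; it says nothing about the proposed U+20CF, which is
not encoded.

```python
import unicodedata

ACUTE = "\u00B4"  # ACUTE ACCENT, the existing spacing diacritic

props = {
    "name": unicodedata.name(ACUTE),
    "category": unicodedata.category(ACUTE),        # general category
    "combining": unicodedata.combining(ACUTE),      # canonical combining class
    "bidi": unicodedata.bidirectional(ACUTE),       # bidirectional class
    "decomposition": unicodedata.decomposition(ACUTE),
}

for key, value in props.items():
    print(f"{key}: {value}")
```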



 

Philippe, this sounds like an excellent suggestion, at least in general 
terms. There is a missing function here, which has been provided (since 
Unicode 1.0) by overloading the characters space and NBSP with an 
inappropriate second function. Of course we can't make existing practice 
illegal, but we can recommend that in future versions of the standard 
your new ZERO WIDTH SYMBOL character should be used for display of 
isolated diacritics where there is no separate spacing form. We can also 
suggest that the width of the combination should be that of the 
diacritic only.

But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are 
suggesting other uses in which it really has zero width. Well, it might 
have in a case like line initial holam which shifts on to a following 
silent alef, but that is a rather special case.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

RE: Questions on ZWNBS - for line initial holam plus alef

> the solution with
> SPACE is really tricky due to the special treatment of SPACE notably
> in HTML, SGML, XML

I disagree. There are a few different things that happen with whitespace in
such technologies. Some of these only apply to elements that do not allow
any character data apart from whitespace to appear directly within them, and
hence are not an issue here. Some happen at a relatively high level of
processing, e.g. rendering (not parsing) of HTML, and as such should
correctly process spaces combined with combining characters.

There are only two theoretical problems that I can see here. The first is
that a whitespace character other than space gets converted to space by
attribute value normalisation, and that this changes the meaning of the text
in some way. This could only occur if the combining character were the first
character in a line of text, which is quite a nonsensical construct to begin
with.
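
The first case can be reproduced with a short sketch, assuming Python's
xml.etree.ElementTree (which is backed by expat) as the parser; the
element and attribute names are made up. A literal line break inside an
attribute value reaches the application as a space, so a combining mark
that followed the line break ends up directly after a space. A line break
written as a numeric character reference is preserved.

```python
import xml.etree.ElementTree as ET

# Literal LF followed by COMBINING ACUTE ACCENT inside an attribute value.
literal = ET.fromstring('<e a="x\n\u0301y"/>').get("a")

# The same, but with the line break written as a character reference.
escaped = ET.fromstring('<e a="x&#10;\u0301y"/>').get("a")

print([f"U+{ord(ch):04X}" for ch in literal])
print([f"U+{ord(ch):04X}" for ch in escaped])
```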

The other would be with names, qnames, nmtokens and such. These are not
normal textual content; they are human-readable constructs that are based on
normal text because that makes it easier for some developers to work at a
plain-text level (if they speak the natural language that the human-readable
constructs were based on). Support for the linguistic oddity of a diacritic
divorced from the context in which it would normally exist would have little
justification in this place except for fulfilling the general goal of
"completeness". Completeness is a laudable aim of course, but extreme
edge-cases need only be brought in if they are both safe and cheap. Anyone
designing an XML application who frequently considers isolated diacritics as
the most natural choice in part of such tokens probably needs to take a
couple of weeks' holiday before continuing the design. Of course some of the
characters that could be considered to be precomposed isolated diacritics
are banned from use in nmtokens anyway.

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Peter Kirk" <[EMAIL PROTECTED]>

> There is some potential for real trouble here, if one process outputs an
> NMTOKEN starting with a combining character preceded by a separating
> space, or something else which is changed into a space, and another
> process takes the new space plus combining character as a unit and so
> doesn't recognise the separation. Any hackers and virus programmers
> reading this will soon start flooding the Internet with tokens beginning
> with combining characters in the hope of crashing implementations or
> finding back doors. Of course this wouldn't have been a problem if
> Unicode had never defined space plus combining character as legal and
> meaningful. But this is not my problem!

I do agree: an XML document could require the use at some place of a
given attribute or element. If this attribute name follows the element
name after a line break, which gets changed into a space during parsing,
forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
cluster acting like a letter would have the effect of creating a new
element name, which may violate the element name identity. Now suppose
that the attribute name contains a colon: you have created a custom
namespace name, under which you can add any element you like, even if
this was forbidden by the content-model of the reference schema.

So this would invalidate existing documents, or create holes allowing
insertion of arbitrary XML content, if the XML application does not
validate the element names (the pair namespace + name) extremely
strictly and completely exclude from processing any unrecognized
element (including all its content and attributes). This would be a
breach in the content model, which may have been validated and tested
for security in another layer of the document encoding process (notably
when XML documents are created from templates, such as XSL
processors, or custom C source using simple template substitution).

So for me the sequence SPACE+combining should not be acceptable
as a valid grapheme cluster within element names or attribute names,
and thus would need to be excluded from NMTOKEN. The correct
way to do it is to consider it NOT A LETTER, but a symbol (Sk),
exactly like other spacing diacritics, which are already invalid in
NMTOKEN.

There still remains the unresolved question of grapheme clusters
that could span the starting "<" or ending ">" or "/>" of tags, or
the leading "&" of an entity reference. For this reason, defective
combining sequences (combining characters without a leading base
character) should be forbidden (invalid for XML).

So there remains an unsolved conflict here: defective combining
sequences cause security or validity problems in XML documents,
and a non-defective SPACE+combining sequence also causes
security problems. There's no secure choice to represent
spacing diacritics which are not already encoded in a precomposed
form...

Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-14 Thread Mark Davis

Peter, in XML you really don't want to use attributes for any general
text; there are too many restrictions on the content. For example, we
never put translatable text into them. Attributes should really be
treated more like sequences of symbols, with a constrained syntax.

This is also not in violation of the Unicode conformance clause. A
"space plus combining character" is a unit in some sense. That is, it
is a combining character sequence (and grapheme cluster). However,
there is no clause that says that such units cannot be changed, or that
any particular sequence of characters cannot be changed; operations
such as case mapping or normalization do just that, they change
characters.

There are restrictions on what can be changed *if* a process purports
to not modify the text (C10). But an XML parser is certainly capable
of interpreting a sequence A B, and deciding that it wants to change A
to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek
Alpha, *that* would be a violation of C7. But interpreting a space as
a space, then deciding to modify it, is perfectly legit.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message - 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "John Cowan" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 05:09
Subject: Re: Questions on ZWNBS - for line initial holam plus alef


> On 12/08/2003 20:28, John Cowan wrote:
>
> >Peter Kirk scripsit:
> >
> >
> >
> >>>2) In attribute values, LF, CR, and TAB characters are normalized to
> >>>spaces.   Not relevant here.
> >>>
> >>>
> >>This would be relevant if it is legal for the character after LF, CR,
> >>and TAB to be a combining mark. Is this legal? In this case what was
> >>previously a defective (but legal) combining sequence would turn into a
> >>non-defective one, but the intended whitespace would be lost.
> >>
> >>
> >
> >The point is that there is no such thing as an *intended* line break in
> >an attribute value; it will *always* be translated to a space before
> >the application sees it.  (More exactly, line-break characters can
> >be inserted into attribute values, but only with the use of a numeric
> >character reference such as "&#10;".)
> >
> >
> Sorry, I'm confused. Are you saying that the input processing will
> translate line breaks into spaces within attribute values, unless
> inserted as &#10; ? Well, I suppose this is fair enough as it is up to
> the user not to enter garbage.
>
> >
> >
> >>Not just a rendering glitch, I suspect. If the combining character is
> >>combined with the separating space, the space loses many of its
> >>separating functions, and perhaps keeps a confusing subset of them with
> >>all sorts of possibilities of error.
> >>
> >>
> >
> >The space(s) will be used to separate individual tokens at processing
> >time.  No spacing diacritic (either single-character or space+combining)
> >is permitted in a NMTOKEN.
> >
> >
> OK if this is clearly illegal, but this might restrict use of some
> languages in NMTOKEN. Would NBSP + combining be allowed?
>
> >
> >
> >>At best tokens beginning with
> >>combining characters will be unusable. At worst they will crash the
> >>implementation (and count on someone trying deliberately to do that!).
> >>
> >>
> >
> >In effect, the combining character will constitute a defective combining
> >sequence at the beginning of the individual token.
> >
> >Stepping away from the letter of the standard for a moment, there is
> >no real reason to begin a NMTOKEN with a combining character.  It is
> >only allowed as a result of the miscegenation of SGML concepts with
> >Unicode ones.
> >
> >In SGML's original design of tokens, they consisted of letters and digits
> >(and a few punctuation marks, which functioned as letters).  There were
> >four kinds: a NUMBER could contain only digits, a NAME could not begin
> >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
> >restrictions.  ID and IDREF had the same syntax as NAME with additional
> >semantics.  Later, the categories "letter" and "digit" were generalized,
> >by redefining the concrete syntax, to be whatever you wanted, and were
> >renamed "name-start" and "name" characters (technically, a name character
> >was a letter *or* a digit).
> >
> >When SGML was simplified to produce XML, only NMTOKEN, the most general
> &

Re: Questions on ZWNBS - for line initial holam plus alef

From: "Peter Kirk" <[EMAIL PROTECTED]>
> I note that there is no line break opportunity in . But is
> there one after the space in ? If so,  combining character> has a
> third advantage, that it gives the right line break opportunity when
> this sequence is word initial, which it wouldn't do without the RLM.

How can we be so complicated when a new base character with
the needed properties would be much simpler and easier to support
in implementations?

What is wrong with encoding new recommended alternatives
to SPACE or NBSP, i.e. an invisible symbol, an invisible LTR letter,
an invisible RTL letter? This way we can fix some issues in the current
text of the UAXes but recommend that new writers use a new base
character which will behave correctly without those overly complex
hacks that users and implementers won't understand.

RE: Questions on ZWNBS - for line initial holam plus alef

> OK, it's safe, but it is a misuse of Unicode. As space plus combining
> character is a unit in Unicode, it should be treated as a unit by higher
> level protocols. If higher level protocols are allowed to do arbitrary
> things within Unicode units, there is no end to the possible confusion.
> See for example, from Unicode 4.0 chapter 3:
>
> C7 A process shall interpret a coded character representation according
> to the character
> semantics established by this standard, if that process does interpret
> that coded character
> representation.

If this is not the case (I'm not entirely sure this bans what XML does with
spaces) then all we would need is a change so that rather than a de facto
ban on space+combining within names and nmtokens we would have an explicit
ban on the same; then we'd all be happy, except possibly for some sadistic
XML application designer that was planning to use that combination out of
ill-will towards his or her colleagues.

Re: Questions on ZWNBS - for line initial holam plus alef


- Original Message - 
From: "Jon Hanna" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, August 14, 2003 1:49 PM
Subject: RE: Questions on ZWNBS - for line initial holam plus alef


> > I do agree: an XML document could require the use at some place of a
> > given attribute or element. If this attribute name follows the element
> > name after a line break, which gets changed into a space during parsing,
> > forcing XML parsers to treat SPACE+combining as an unbreakable grapheme
> > cluster acting like a letter would have the effect of creating a new
> > element name, which may violate the element name identity. Now suppose
> > that the attribute name contains a colon: you have created a custom
> > namespace name, under which you can add any element you like, even if
> > this was forbidden by the content-model of the reference schema.
>
> 1. SPACE is treated "blindly" as a SPACE by XML. String + space + combining
> + string would not be treated as a single token, no matter how that space
> was introduced. That's what you were complaining about in the first place
> (as far as I can make out).
> 2. While nmtokens can begin with a combining character, names cannot, nor can
> they contain spaces.
> 3. This would in no way change the content-model. So even if the above two
> points didn't hold they would only sneak the document past something which
> performed validation before parsing(!), and where the content-model was
> already pretty loose (so it didn't complain about the unrecognised
> attribute).
>
> You've just discovered a way to disguise one document that isn't well-formed
> as a different document that isn't well-formed. l33t!
>
> > So this would invalidate existing documents, or create holes allowing
> > insertion of arbitrary XML content, if the XML application is not
> > validating extremely strictly the element names (the pair namespace+
> > name) and exclude completely from processing any unrecognized
> > element (including all its content and attributes).
>
> This argument is not on friendly terms with the concept of causality.
>
> > This would be a
> > breach in the content model which may have been validated and tested
> > for security in another layer of the document encoding process (notably
> > when XML documents are created from templates, such as XSL
> > processors, or custom C source using simple template substitution).
>
> Testing validity without testing well-formedness is not possible.
>
> > So for me the sequence SPACE+combining should not be acceptable
> > as a valid grapheme cluster within element names or attribute names,
>
> As it already isn't.
>
> > and thus would need to be excluded from NMTOKEN. The correct
> > way to do it is to consider it NOT A LETTER, but a symbol (Sk),
> > exactly like other spacing diacritics, which are already invalid in
> > NMTOKEN.
>
> Wait a second. That was my justification for why the fact that
> space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a
> failure on the part of XML to allow for freedom of choice with the strings
> used for NMTOKENs. Now you actually want to introduce this (already
> existent) feature.
>
> > There still remains the unresolved question of grapheme clusters
> > that could span the starting "<" or ending ">" or "/>" of tags, or
> > the leading "&" of an entity reference.
>
> No there isn't. What goes before <, >, / or & isn't a problem since those
> are all non-combining characters and a new unit for any sort of processing
> treating more than one codepoint as a unit. What goes after < or & has to be
> a name (not an nmtoken) and as such is already prohibited from beginning
> with a combiner. What goes after > is already dealt with by the Charmod, and
> even if you ignore charmod apart from the possibility of normalisation
> turning the sequence U+003E, U+0338 into U+226F (a possibility that is well
> noted) it still isn't going to hurt.
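
The normalisation case mentioned in the quoted text can be checked
directly with Python's unicodedata module. In the Unicode Character
Database, U+003E followed by U+0338 composes to U+226F NOT GREATER-THAN
(U+226E NOT LESS-THAN is the composition of U+003C with U+0338). The
variable names are only illustrative.

```python
import unicodedata

# A tag-closing ">" immediately followed by COMBINING LONG SOLIDUS OVERLAY.
after_tag = ">\u0338"
after_open = "<\u0338"

composed_gt = unicodedata.normalize("NFC", after_tag)
composed_lt = unicodedata.normalize("NFC", after_open)

print(f"U+{ord(composed_gt):04X}", unicodedata.name(composed_gt))
print(f"U+{ord(composed_lt):04X}", unicodedata.name(composed_lt))
```

So a process that normalises raw markup to NFC before parsing would
replace the tag delimiter with a different character, which is the
possibility being discussed.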

One note: in Unicode, grapheme clusters (considered unbreakable) are more
than just combining sequences! Look at CGJ, WJ, ZWJ, ...
So what is after or *before* a base character may impact parsing
grapheme clusters!
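
For reference, a short sketch that reads back the basic properties of
the three characters named above and shows one concrete effect of CGJ.
It uses only Python's unicodedata module, which does not implement
grapheme cluster segmentation, so it illustrates general category,
combining class and canonical reordering only. The test strings are made
up.

```python
import unicodedata

JOINERS = {"CGJ": "\u034F", "WJ": "\u2060", "ZWJ": "\u200D"}

# (name, general category, canonical combining class) for each character.
joiner_props = {
    label: (unicodedata.name(ch), unicodedata.category(ch),
            unicodedata.combining(ch))
    for label, ch in JOINERS.items()
}

# Canonical reordering sorts adjacent marks by combining class:
# U+0301 (class 230) moves after U+0323 (class 220).
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")

# With CGJ (class 0) between them, reordering stops at the CGJ.
blocked = unicodedata.normalize("NFD", "a\u0301\u034F\u0323")

print(joiner_props)
print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in blocked])
```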

As the well-formedness of XML documents goes even before its validity
(which is optional, but required in some applications that need to parse
the DOM-tree or InfoSet rather than), this impacts the way Unicode can
be used (read it as "embedded") within XML. Depending on where this
encoded text is used (NMTOKENs, text elements, attribute values,...)
the em

Re: Questions on ZWNBS - for line initial holam plus alef