Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-12 Thread Philippe Verdy
From: "saqqara" <[EMAIL PROTECTED]>
> This is one example where colour support in fonts would be useful. A
useful
> addition to the OpenType specifications if any readers here have influence
> on such matters.
>
> Something I have a vested interest in with my own focus on Ancient
Egyptian
> Hieroglyphs.

No need for that; markup can be made to allow selecting primary or secondary
colors, and fonts don't need to encode color directly, but can instead encode
a set of related glyphs with their indexed color information; it's up to the
styling renderer to allow specifying these colors, or to give them defaults.

In classic text renderers we specify only one color, which applies to the
whole glyph, but a font could map this single style color onto internal color
indexes (for example, a unique "black" style color could be transformed by the
renderer into shades computed from the font's color information; classic fonts
don't carry this information and do not map a style color to effective glyph
colors through a transformation function, but this is a possible extension of
font formats).




Re: Combining across markup?

2004-08-12 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]>
> The suggestion to add a "mark-color" capability to CSS might handle a
> majority of the realistic situations where color is really understood to
> be part of the textual content.  Peter's two combining marks, a black
> one in the actual manuscript and a red one added by the editor, sounds
> less like a problem that Unicode or W3C need to worry about.

Note: this message is quite long, sorry. It exposes several ideas to solve
the problem of markup or styling of combining characters (below the Unicode
character model).

This is probably out of scope of Unicode itself, but the need to encode
isolated diacritics in XML implies the need to be able to encode defective
combining sequences in text elements or attribute values.

Unfortunately, in XML, the only way to encode this without breaking the XML
syntax when the document is normalized is to encode the combining characters
that start the defective combining sequence as (numeric or named) character
references like &#x300; within text elements or attribute values, so that
they will not collide with the preceding quote mark (which opens an attribute
value) or with the preceding closing angle bracket (which terminates the
element's start tag that necessarily comes before text content in a
well-formed XML document).

One drawback of this approach is that XML documents are subject to
transformations (through DOM, SAX or similar APIs that can generate new
documents or fragments), so not all valid plain-text sequences can be encoded
safely in the same way.

However:

- Unicode also allows delimiting defective sequences after control characters
such as end-of-line controls.
- Well-formed XML documents allow only a limited set of control characters in
the encoded XML syntax: CR, LF, TAB, NL... These control characters are
considered "whitespace" in XML and are subject to an optional white-space
normalization within text elements (but not in attribute values...).
- The solution would then be to force the insertion of such a control
character at the start of the plain-text encoding of the XML attribute value,
or at the start of the plain-text encoding of a text element (within the
content of another element).
- But then, technically, this control character becomes part of the text
element content (unless an xml:space specifier tells the document parser that
this character must be ignored as blank), or part of the attribute value.

So what can we do to allow encoding defective combining sequences in XML?
- For attribute values, there is currently nothing we can do to avoid making
this control character part of the actual value; this is a limitation of the
XML syntax itself.
- For text elements, we could have the container element specify that the
leading control is not whitespace but is only there to keep the document
well-formed after normalization. This could be a sort of xml:controldefective
attribute added to the parent element, clearly indicating to the parser that
the leading control must be removed from the effective text element content.

All this seems to indicate that XML document generators can safely encode a
document containing defective combining sequences, provided that they know
that these sequences will be defective. This requires that XML document
generators be able to detect them when they lead text elements or attribute
values.
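
For illustration, a minimal Python sketch of such a detection-and-escaping
step (standard library only; the helper name is just an example):

    import unicodedata

    def escape_leading_combining(text):
        # A defective combining sequence starts with a combining character
        # (general category Mn, Mc or Me); emit it as a numeric character
        # reference so it cannot combine with a preceding '>' or quote mark.
        if text and unicodedata.category(text[0]) in ("Mn", "Mc", "Me"):
            return "&#x{:04X};".format(ord(text[0])) + text[1:]
        return text

    print(escape_leading_combining("\u0304abc"))   # -> &#x0304;abc
    print(escape_leading_combining("abc"))         # -> abc (unchanged)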

Another problem comes with elements whose content is marked as normalizable
(in the schema definition of the container element, or explicitly with
container elements that specify whitespace normalization in their xml:space
attribute): the XML whitespace normalization must not strip this leading
control or whitespace, but must still normalize the whitespace in the rest of
the encoded text string.

Now comes the problem of creating documents with the markup necessary to
give specific styles or colors to diacritics. A first natural approach is to
surround the encoded defective combining sequence as the content of a
styling XML element. If the XML document generator does not know that the
combining sequence is defective, many problems will occur.

The consequence is that XML document generators must be able to detect
combining characters, and thus include at least a table of known combining
characters that must be encoded with character references, and not as raw
text in this situation (because they would behave badly under Unicode
normalization of the document).

(Note that CDATA sections will not help here, because the defective sequence
would appear just after the second '[' of '<![CDATA[', with which it would
combine in Unicode, possibly creating a combining sequence that breaks the
XML syntax if a Unicode normalization is applied to the document!)

If XML document generators (or editors...) are made aware of this problem,
then they will be able to emit such sequences safely.

Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-12 Thread Philippe Verdy
> This means that the rules of XML conflict with the rules of Unicode. If
> the string is a Unicode string, U+226F is canonically equivalent to
> <U+003E, U+0338> and therefore any higher level protocol should treat
> the two sequences as identical, rather than reject one of them as
> causing the document to be ill-formed.

There's no conflict here. A text element whose content starts with the NCR
&#x338; will not be *canonically equivalent* (for Unicode) to the same
element containing the raw combining solidus character (picture an
exclamation point standing in for the combining solidus), and that raw form
is in turn canonically equivalent (for Unicode) to the form where the '>'
closing the start tag and the combining solidus have merged into the
precombined U+226F character.

Internally, in the parsed XML tree, the two syntaxes "&#x338;" and the raw
combining solidus will produce the same internal U+0338 character in the DOM
tree. So the problem is purely a choice of syntax, because those two forms
would be treated identically by any compliant XML parser.
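
This can be checked with any XML parser; a small Python sketch (the element
name "e" is arbitrary):

    import xml.etree.ElementTree as ET

    a = ET.fromstring("<e>&#x338;</e>")   # NCR syntax
    b = ET.fromstring("<e>\u0338</e>")    # raw combining solidus
    print(a.text == b.text)               # True: both yield U+0338 in the tree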

When there's a conflict, use an NCR: this is completely equivalent for all
compliant XML parsers. An XML document generator can know about this
exception, and can generate an NCR in the XML document each time the U+0338
character must be coded in the first position of a text element (a text
element necessarily follows the closing '>' of a tag in any well-formed XML
document). This form will survive any Unicode normalization applied to the
whole XML document.

Note however that a Unicode normalization *modifies* the XML document: XML
ignores the Unicode canonical equivalences, so it will treat the precombined
character U+226F differently from the sequence <U+003E, U+0338>. If a
document is transcoded from Unicode to another charset with an algorithm
that does not apply a one-to-one mapping of encoded characters, the new
document will *not* be equivalent for XML (for most legacy charsets, the
transcoding from that charset to Unicode is one-to-one, so most document
parsers will parse a legacy XML document into a DOM tree containing Unicode
strings).
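
A small Python sketch of how normalization modifies the serialized document
itself (the element name is arbitrary):

    import unicodedata

    doc = "<e>\u0338</e>"                 # text node starts with U+0338
    nfc = unicodedata.normalize("NFC", doc)
    print(doc == nfc)                     # False: the document has changed
    print("\u226f" in nfc)                # True: '>' + U+0338 became U+226F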

For XML generators that use an internal DOM representation before generating
the XML document syntax, any character that cannot be mapped one-to-one into
the target charset of the document MUST be emitted as an NCR; not doing so
will create a document that will later be parsed as different from the
original DOM tree.

This is also true for all XML-related APIs (DOM, SAX, ...) when they are
used to get information from the parsed document tree, or when authenticating
XML document contents (the XML semantics of XML-ignorable whitespace apply,
and space normalization happens before the signature is computed): they
return either an exact Unicode string, or an approximation of the actual DOM
content if that information is requested in another legacy charset (because
this implies a lossy conversion), unless the request to the API specifies
that NCRs are allowed in the data it returns from the DOM tree.

As a consequence, a compliant XML parser MUST NOT apply any Unicode
normalization to the parsed entities (text elements, element names,
attribute names, attribute values, processing instructions...) without being
instructed to do so.

So there's NO conflict between XML document equivalence and Unicode
canonical equivalence: they are not the same, and they don't need to be the
same!




Re: Combining across markup?

2004-08-12 Thread Marcin 'Qrczak' Kowalczyk
In a message of Thu, 2004-08-12, 13:00 -0400, John Cowan wrote:


> > Even better yet: Have the W3C rephrase their demand that no element
> > should start with a defective sequence (when considered in isolation)
> > as saying that no *block-level* element should, and leave styling spans
> > and other in-line elements free to start with a combining character
> > (provided that the said in-line container is not the first within a
> > block-level element, of course).
> 
> The trouble with that idea is that in XML generally we don't know
> what is a block-level element: elements are just elements, and it's
> up to rendering routines whether they appear as block, inline, or
> not at all.

So if on that level of abstraction it is not known whether it would make
sense or not for the higher layers, it should be permitted in all cases.

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





Re: Combining across markup?

2004-08-12 Thread John Cowan
Anto'nio Martins-Tuva'lkin scripsit:

> Even better yet: Have the W3C rephrase their demand that no element
> should start with a defective sequence (when considered in isolation)
> as saying that no *block-level* element should, and leave styling spans
> and other in-line elements free to start with a combining character
> (provided that the said in-line container is not the first within a
> block-level element, of course).

The trouble with that idea is that in XML generally we don't know
what is a block-level element: elements are just elements, and it's
up to rendering routines whether they appear as block, inline, or
not at all.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Promises become binding when there is a meeting of the minds and consideration
is exchanged. So it was at King's Bench in common law England; so it was
under the common law in the American colonies; so it was through more than
two centuries of jurisprudence in this country; and so it is today. 
   --Specht v. Netscape



Re: Combining across markup?

2004-08-12 Thread Anto'nio Martins-Tuva'lkin
On 2004.08.11, 18:58, Mike Ayers <[EMAIL PROTECTED]> wrote:

> Better yet, have a generic mechanism which allows you to build

Even better yet: Have the W3C rephrase their demand that no element
should start with a defective sequence (when considered in isolation)
as saying that no *block-level* element should, and leave styling spans
and other in-line elements free to start with a combining character
(provided that the said in-line container is not the first within a
block-level element, of course).

That would, IIUC, address W3C's "fears" and would OTOH satisfy the
need to mark up differently any part of a *text stream*.

(Yes, of course CSS allows block elements to behave as in-line and
vice-versa, but that would be the user's responsibility...)

--.
António MARTINS-Tuválkin |  ()|
<[EMAIL PROTECTED]>||
PT-1XXX-XXX LISBOA   Não me invejo de quem tem|
+351 934 821 700 carros, parelhas e montes|
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe|
http://pagina.de/bandeiras/  a água em todas as fontes|




RE: Combining across markup?

2004-08-11 Thread Mike Ayers
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] On Behalf Of Christopher Fynn
> Sent: Tuesday, August 10, 2004 5:11 PM


> If this is important, rather than trying to overload the 
> current markup, the way to make this possible is to propose 
> to W3C a new style property for CSS so that you could have:
> Some Arabic Text


    Better yet, have a generic mechanism which allows you to build an array with a default color, then specify colors on a per-codepoint basis.  This would permit all of the suggested examples to be done, to user taste, without breaking any combining sequences.  Everybody wins!  OK, who's writing the proposal?  


/|/|ike





Re: Combining across markup?

2004-08-11 Thread Peter Kirk
On 11/08/2004 15:27, Doug Ewell wrote:
> ... Peter's two combining marks, a black
> one in the actual manuscript and a red one added by the editor, sounds
> less like a problem that Unicode or W3C need to worry about.

I agree that it is not a problem for Unicode. But I do think it is a 
potential problem for W3C if the latter aims to represent how people 
actually display text.

As another example, there was a discussion on the Unicode Hebrew list 
not long ago of two combining marks, used in certain texts for variant 
forms of Dagesh and Sheva, which are distinguished from the regular 
forms by being larger and/or more bold, although used with the regular 
base characters. It was judged that they should not be encoded as 
separate characters, partly because the use is rather idiosyncratic but 
also because it was said that these variants should be distinguished 
from the regular forms by markup, e.g. by marking as bold or as a larger 
size. I strongly suspect that similar arguments have been used e.g. 
against variant sizes of accents, shapes of umlaut etc. But this markup 
argument fails if the standard forms of markup are unable to make such 
distinctions in combining marks when there is no distinction in the base 
character. This could open the door for a lot more proposals to encode 
in Unicode variant forms of combining marks.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Combining across markup?

2004-08-11 Thread Doug Ewell
Peter Kirk wrote:

> But does that mean that this kind
> of text is to be ruled forever outside the scope of Unicode? I'm not
> saying it should be plain text. But Unicode should be able to support
> markup schemes which do allow such things.

The plain-text requirement still applies.  I can imagine wanting to use
bold and italics and different fonts and sizes in the same document; in
fact, I just did so yesterday.  But none of these features are plain
text, and so I did not expect them to be handled within Unicode.

The suggestion to add a "mark-color" capability to CSS might handle a
majority of the realistic situations where color is really understood to
be part of the textual content.  Peter's two combining marks, a black
one in the actual manuscript and a red one added by the editor, sounds
less like a problem that Unicode or W3C need to worry about.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-11 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On
> Behalf Of saqqara


> This is one example where colour support in fonts would be useful.

That doesn't solve the markup problem, though: you still have to have
some markup representation that reflects where colours are supposed to
apply. Colour support in fonts wouldn't offer any benefit wrt this
scenario that I can see.


Peter Constable





Re: Combining across markup?

2004-08-11 Thread saqqara


>
> At 10:19 +0100 2004-08-11, saqqara wrote:
> >This is one example where colour support in fonts would be useful. A
useful
> >addition to the OpenType specifications if any readers here have
influence
> >on such matters.
>
> Out of scope. You can use markup to make characters red.
>

Of course, but in the example the character has multiple colours (black and red).

> >Something I have a vested interest in with my own focus on Ancient
Egyptian
> >Hieroglyphs.
>
> Budge used to print a solid black line over red Coffin text and the like.
>
Yes, certainly the use of red for spells in Coffin texts, for instance, or
indeed most use of red ink in hieratic is properly a matter of semantics and
markup makes sense here.

However use of colour is a feature of Hieroglyphs and it would be entirely
reasonable for an Ancient Egyptian to want a full colour font for
applications such as Tomb decoration. Not that this has any ramifications
for Unicode, except to highlight the fact that such would be primarily a
font issue rather than character coding or markup. Practically, I don't
expect this to cut much ice with OpenType developments unless modern
examples in living scripts exist.

> The answer to the original question about black a with red macron is:
>
> You cheat, just like they did in the good old days of lead type. For
> that matter, just as they did when they had to put down one pen and
> pick up another.
>
> You can't expect the encoding to colour elements of precomposed glyphs.
If this form is used consistently throughout, then it is in fact a feature
of the font and in an ideal world fonts would support the bi-colour glyphs.
In an imperfect world, you must indeed expect to have to fudge the issue.

Bob Richmond

> -- 
> Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: Combining across markup?

2004-08-11 Thread Peter Kirk
On 11/08/2004 10:35, Michael Everson wrote:
> At 10:19 +0100 2004-08-11, saqqara wrote:
> > This is one example where colour support in fonts would be useful. A useful
> > addition to the OpenType specifications if any readers here have influence
> > on such matters.
>
> Out of scope. You can use markup to make characters red.

Not if the glyph is multi-colour, which is common with certain 
representations of Egyptian hieroglyphs. (More to the point I can 
imagine people wanting to put coloured icons, corporate logos etc into 
the PUA of fonts so that they can be used within text - far more likely 
to drive changes to OpenType etc than anything ancient Egyptian.) Nor if 
the colouring is needed to make a plain text distinction, which is 
necessary in at least one script which was discussed here within the 
last year - although I think we agreed that that particular script did 
not justify encoding.

> ...
> The answer to the original question about black a with red macron is:
> You cheat, just like they did in the good old days of lead type. For
> that matter, just as they did when they had to put down one pen and
> pick up another.

You can't cheat within Unicode. This is at least potentially a necessary 
distinction, although whether it is plain text or not is debatable. 
While I don't have an example to hand (although I have seen this kind of 
thing in biblical Hebrew and Greek texts), I can easily imagine a 
scholarly edition of a text which uses one colour for text, including 
base characters and some combining marks, which are in the actual 
manuscript and another colour for marks which have been supplied by the 
editor. Not even Chris' very sensible suggestion of a new style property for
colouring the marks can deal with that situation properly. But does that mean
that this kind of text is to be ruled forever outside the scope of Unicode?
I'm not saying it should be plain text. But Unicode should be able to support
markup schemes which do allow such things.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Combining across markup? (Was: RE: sign for anti-neutrino - gree k nu with diacritical line aboveworkaround ?)

2004-08-11 Thread Peter Kirk
On 10/08/2004 23:13, D. Starner wrote:
> Peter Kirk writes:
> > That one is easy: this is the closing tag followed by a combining
> > solidus. The difficult case is if the parser encounters a not greater
> > than symbol. The parser will need to know to decompose such characters
> > first, but then a good parser would always need to do that.
>
> So all existing XML emitters should be changed, to make sure that not
> less than symbols and not greater than symbols are escaped? If I were
> writing an XML document with math content, and added a not less than
> symbol, I would be sorely surprised to find it starting a tag. Being
> a Unicode geek, I could figure it out, but I bet many mathematicians
> wouldn't. Letting not less than symbols open tags would be a big
> mistake.

If existing XML emitters are not to be changed, they MUST drop any claim 
that what they are emitting is a string of Unicode characters. This is a 
conformance issue. An XML emitter which emits a not less than symbol and 
assumes that it will not be interpreted as a tag start character 
followed by a combining mark is in breach of Unicode conformance rule 
"C9 A process shall not assume that the interpretations of two 
canonical-equivalent character sequences are distinct."
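
The equivalence in question is easy to verify (a minimal Python sketch):

    import unicodedata

    # U+226E NOT LESS-THAN is canonically equivalent to '<' + U+0338,
    # so the normalization forms convert freely between the two.
    print(unicodedata.normalize("NFD", "\u226e") == "<\u0338")   # True
    print(unicodedata.normalize("NFC", "<\u0338") == "\u226e")   # True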

W3C or whoever could get around this problem in at least three ways:
1) Specify that the string is put into a particular normalisation form 
before parsing;

2) Specify that "<" followed by a combining mark, or certain combining 
marks, is not interpreted as a tag start character but as a literal, 
i.e. this is some kind of escape mechanism;

3) Specify that "not less than" etc must be escaped in one of the same 
ways as "less than" - which an intelligent editor could hide from 
mathematicians.

Or Unicode could exceptionally change its decompositions here - which 
could be justified in that the main reason for refusing to change them 
is for compatibility with W3C, so W3C can't complain if they require 
this change.

But without some such change there is a failure to conform to Unicode.
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Combining across markup?

2004-08-11 Thread Michael Everson
At 10:19 +0100 2004-08-11, saqqara wrote:
> This is one example where colour support in fonts would be useful. A useful
> addition to the OpenType specifications if any readers here have influence
> on such matters.

Out of scope. You can use markup to make characters red.

> Something I have a vested interest in with my own focus on Ancient Egyptian
> Hieroglyphs.

Budge used to print a solid black line over red Coffin text and the like.

The answer to the original question about black a with red macron is:
You cheat, just like they did in the good old days of lead type. For
that matter, just as they did when they had to put down one pen and
pick up another.

You can't expect the encoding to colour elements of precomposed glyphs.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-11 Thread saqqara
This is one example where colour support in fonts would be useful. A useful
addition to the OpenType specifications if any readers here have influence
on such matters.

Something I have a vested interest in with my own focus on Ancient Egyptian
Hieroglyphs.

Bob

- Original Message - 
From: "Philipp Reichmuth" <[EMAIL PROTECTED]>
To: "Jon Hanna" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, August 10, 2004 4:35 PM
Subject: Re: Combining across markup? (Was: RE: sign for anti-neutrino - g
ree k nu with diacritical line aboveworkaround ?)


>
> Jon Hanna schrieb:
> > The W3C Character Model does not, or will not since it's not yet a
> > Recommendation, allow text nodes or attribute values to begin with
defective
> > combining character sequences.
>
> What am I supposed to do when I need a black a with a red macron?  Or for a
> less obscure example, an Arabic text with the letters correctly ligated,
> in black, and the vowel marks in another colour, such as in practically
> *any* printed edition of the Koran?
>
> Philipp




Re: Combining across markup?

2004-08-11 Thread Christopher Fynn
Peter Constable wrote:
Of course, this approach is simply moving description of the
lowest-level portions of the hierarchy into attributes rather than
representation as elements. It does work around the problem, though.
The main thing is that it does not allow attributes other than color to
be changed separately on a dependent mark. It's all the other things that
could change, as soon as you use a separate element, that could be
problematic.

Incidentally I see that under "Tools, Options, Complex Scripts" MS Word 
now has a check-box option to enable "Different Color for Diacritics"

best regards
- Chris



RE: Combining across markup?

2004-08-10 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On
> Behalf Of Christopher Fynn


> If this is important, rather than trying to overload the current
markup,
> the way to make this possible is to propose to W3C a new style
property
> for CSS so that you could have:
> Some Arabic
> Text

Of course, this approach is simply moving description of the
lowest-level portions of the hierarchy into attributes rather than
representation as elements. It does work around the problem, though.


Peter Constable




Re: Combining across markup?

2004-08-10 Thread Christopher Fynn
Philipp Reichmuth wrote:

What am I supposed to do when I need a black a with a red macron?  Or for a
less obscure example, an Arabic text with the letters correctly ligated, 
in black, and the vowel marks in another colour, such as in practically 
*any* printed edition of the Koran?

Philipp

If this is important, rather than trying to overload the current markup, 
the way to make this possible is to propose to W3C a new style property 
for CSS so that you could have:
Some Arabic 
Text

- Chris



RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Mike Ayers
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] On Behalf Of Philipp Reichmuth
> Sent: Tuesday, August 10, 2004 8:35 AM


> What am I supposed do when I need a black a with a red 
> macron?


    Use a picture, just as you would for a letter "i" with a smiley face for a dot, or any other aberrant characters.

>  Or for a less obscure example, an Arabic text with 
> the letters correctly ligated, in black, and the vowel marks 
> in another colour, such as in practically
> *any* printed edition of the Koran?


    Wouldn't the better idea here be to use a font which had the vowels in a different color?  This would make management of the source much easier, I think, although that would be a side issue.  Are there situations where this would not work?


    Thanks,


/|/|ike





Re: Combining across markup? (Was: RE: sign for anti-neutrino - gree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread D. Starner
Peter Kirk writes:
> That one is easy: this is the closing tag followed by a combining 
> solidus. The difficult case is if the parser encounters a not greater 
> than symbol. The parser will need to know to decompose such characters 
> first, but then a good parser would always need to do that. 

So all existing XML emitters should be changed, to make sure that not
less than symbols and not greater than symbols are escaped? If I were
writing an XML document with math content, and added a not less than
symbol, I would be sorely surprised to find it starting a tag. Being
a Unicode geek, I could figure it out, but I bet many mathematicians
wouldn't. Letting not less than symbols open tags would be a big
mistake.
-- 
___
Sign-up for Ads Free at Mail.com
http://promo.mail.com/adsfreejump.htm




RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Mike Ayers
> From: Jon Hanna [mailto:[EMAIL PROTECTED]] 
> Sent: Tuesday, August 10, 2004 10:33 AM


> So even without an explicit spec saying otherwise the above 
> would be problematic.


    I *knew* that - what I want to know is whether W3C collectively knows it.



/|/|ike





Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Peter Kirk
On 10/08/2004 17:13, Jon Hanna wrote:
> ...
> Contrary to your examples, what is a parser meant to do when it encounters the
> greater-than symbol indicating the end of an element's tag followed by a
> combining solidus, meaning that the two characters together are canonically
> equivalent to the not greater-than symbol?

That one is easy: this is the closing tag followed by a combining 
solidus. The difficult case is if the parser encounters a not greater 
than symbol. The parser will need to know to decompose such characters 
first, but then a good parser would always need to do that.
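
For example (a minimal Python sketch), decomposing first exposes the hidden
tag-closing character:

    import unicodedata

    # U+226F NOT GREATER-THAN decomposes to '>' followed by U+0338.
    print(unicodedata.normalize("NFD", "\u226f") == ">\u0338")   # True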

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Peter Kirk
On 10/08/2004 18:33, Jon Hanna wrote:
> ...
> As for modern markup, consider if instead of &#x304; you had &#x338;
> By the rules of XML that is treated as if the character U+0338 was there rather
> than the escape sequence.
> By the rules of Unicode the sequence U+003E, U+0338 is treated the same as the
> character U+226F.
> By the rules of XML replacing ">&#x338;" with U+226F would mean the document was
> no longer well-formed.
> So even without an explicit spec saying otherwise the above would be
> problematic.

This means that the rules of XML conflict with the rules of Unicode. If
the string is a Unicode string, U+226F is canonically equivalent to
<U+003E, U+0338> and therefore any higher level protocol should treat
the two sequences as identical, rather than reject one of them as
causing the document to be ill-formed.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Combining across markup? (Was: RE: sign for anti-neutrino - gree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread D. Starner
Philipp Reichmuth <[EMAIL PROTECTED]> writes:

> Jon Hanna schrieb:
> > The W3C Character Model does not, or will not since it's not yet a
> > Recommendation, allow text nodes or attribute values to begin with defective
> > combining character sequences.
> 
> What am I supposed to do when I need a black a with a red macron?  Or for a
> less obscure example, an Arabic text with the letters correctly ligated, 
> in black, and the vowel marks in another colour, such as in practically 
> *any* printed edition of the Koran?

Use PDF files, or images. The W3C Character Model is not the end all and be
all of Unicode text processing, but there's a certain limit to what HTML will
do. 

-- 
___
Sign-up for Ads Free at Mail.com
http://promo.mail.com/adsfreejump.htm




RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Marcin 'Qrczak' Kowalczyk
In a message of Tue, 2004-08-10, 18:33 +0100, Jon Hanna wrote:

> By the rules of XML replacing ">&#x338;" with U+226F would mean the document was
> no longer well-formed.

Really? I don't have an XML spec handy, but character references like
&#x338; can't be processed before parsing tags, because &#60; is the
literal character "<", not the start of a tag.
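
This is easy to confirm with any XML parser; a small Python sketch (element
names are arbitrary):

    import xml.etree.ElementTree as ET

    # &#60; and &#62; are resolved to literal '<' and '>' as character data,
    # after the tag structure has already been recognised.
    e = ET.fromstring("<e>&#60;b&#62;</e>")
    print(e.text)      # '<b>' as plain text
    print(list(e))     # []  -- no nested <b> element was created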

-- 
   __("< Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/





RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Jon Hanna
>   Clarification, please.  Does this mean that the code:
> 
> ν style='font-family:Cardo'>̄ 
> 
>   will be considered illegal due to the freestanding ̄ ?

Well leaving out the quotes around tokens is old-fashioned anyway, and I don't
know if it will apply to HTML4.01 and earlier.

As for modern markup, consider if instead of &#x304; you had &#x338;
By the rules of XML that is treated as if the character U+0338 was there rather
than the escape sequence.
By the rules of Unicode the sequence U+003E, U+0338 is treated the same as the
character U+226F.
By the rules of XML replacing ">&#x338;" with U+226F would mean the document was
no longer well-formed.

So even without an explicit spec saying otherwise the above would be
problematic.
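
A minimal Python sketch of the three steps above (the element name is
arbitrary): the NCR form parses, but NFC-normalizing the serialized raw form
merges '>' with U+0338 and the result no longer parses:

    import unicodedata
    import xml.etree.ElementTree as ET

    ET.fromstring("<e>&#x338;</e>")             # well-formed; text is U+0338

    raw = "<e>\u0338</e>"                       # raw combining solidus
    broken = unicodedata.normalize("NFC", raw)  # '>' + U+0338 -> U+226F
    try:
        ET.fromstring(broken)
    except ET.ParseError as err:
        print("no longer well-formed:", err)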

-- 
Jon Hanna

"…it has been truly said that hackers have even more words for
equipment failures than Yiddish has for obnoxious people." - jargon.txt



Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Jon Hanna
Quoting Philipp Reichmuth <[EMAIL PROTECTED]>:

> Jon Hanna schrieb:
> > The W3C Character Model does not, or will not since it's not yet a
> > Recommendation, allow text nodes or attribute values to begin with
> defective
> > combining character sequences.
> 
> What am I supposed to do when I need a black a with a red macron?  Or for a
> less obscure example, an Arabic text with the letters correctly ligated, 
> in black, and the vowel marks in another colour, such as in practically 
> *any* printed edition of the Koran?

The former sounds more like drawing than writing, and is something SVG will
handle easily. The latter sounds like text the colour of which is black with
another colour for the vowel marks rather than alternating pieces of text in
different colours. Indeed the latter demonstrates well that using separate
markup for coloured diacritics would be neither an efficient task, nor one that
makes any sense with semantic markup, for a real-world use case.

Contrary to your examples, what is a parser meant to do when it encounters the
greater-than symbol indicating the end of an element's tag followed by a
combining solidus, meaning that the two characters together are canonically
equivalent to the not greater-than symbol?

-- 
Jon Hanna

"I don't like to LOOK out of the windows even - there are so many of those 
creeping women, and they creep so fast."
- Charlotte Perkins Gilman, _The Yellow Wallpaper_



RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Mike Ayers
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]] On Behalf Of Jon Hanna
> Sent: Tuesday, August 10, 2004 6:22 AM


> The W3C Character Model does not, or will not since it's not 
> yet a Recommendation, allow text nodes or attribute values to 
> begin with defective combining character sequences.


    Clarification, please.  Does this mean that the code:


ν
style='font-family:Cardo'>̄ 


    will be considered illegal due to the freestanding ̄ ?



    Thanks,


/|/|ike





Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Philipp Reichmuth
Jon Hanna schrieb:
The W3C Character Model does not, or will not since it's not yet a
Recommendation, allow text nodes or attribute values to begin with defective
combining character sequences.
What am I supposed to do when I need a black a with a red macron?  Or for a
less obscure example, an Arabic text with the letters correctly ligated, 
in black, and the vowel marks in another colour, such as in practically 
*any* printed edition of the Koran?

Philipp


Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-10 Thread Jon Hanna
The W3C Character Model does not, or will not since it's not yet a
Recommendation, allow text nodes or attribute values to begin with defective
combining character sequences.

-- 
Jon Hanna

"What's a false move? Is it very different from a real one?"



Re: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-09 Thread Peter Kirk
On 09/08/2004 22:10, Mike Ayers wrote:
> ...
> > expect any kind of break or control character to have this
> > effect by default. The correct way to do this is probably to
> > insert SPACE or NBSP, or the proposed INVISIBLE CHARACTER.
> ...except that I don't want a space in there, nor can I use a
> proposed character.

Well, the only specified way which works for all characters is to use
SPACE or NBSP. Several people including me have seen a problem with
this, and presumably this is why INVISIBLE LETTER has been proposed,
http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2822.pdf.
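
For example (a minimal Python sketch), the mark can be given a blank carrier,
or replaced by a spacing clone where one exists:

    standalone_macron = "\u00a0\u0304"   # NBSP + COMBINING MACRON
    spacing_macron    = "\u00af"         # U+00AF MACRON, the spacing clone
    print(standalone_macron, spacing_macron)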

> > And then of course there is the spacing clone of the macron, U+00AF.
> Possible.  Do such clones exist for all combiners?
No.
> > I wouldn't consider this a violation of layering. If layer A
> > has to split a string at a particular place because of its
> > own functions, that does not imply a break at layer B.
> ...if layer A is on top of layer B, then a break in layer A
> is, by definition, a break in layer B.  Could I be using the wrong
> term again?

Well, maybe, but this is certainly not true "by definition" of all kinds 
of layering. In this case layer B simply doesn't recognise breaks in 
layer A. For example, layer A might include markup within <...>, but as 
far as layer B is concerned < and > are ordinary characters which are 
part of a character stream and not recognised as markup. But that is not 
the relationship we have here, which is not a simple layered structure.

> > And it
> > is certainly a legitimate user requirement for a combining
> > mark to be in a different colour from its base character.
> Has this been accepted by Unicode as a requirement?  If so,
> I'll just run screaming away from the issue.

It is probably considered outside the scope of Unicode. Nevertheless, it 
is something which people commonly want to do.

> ... Color change is the only markup that I can think of for which it
> would be possible to preserve kerning across markup based change, ...

I deliberately mentioned underlining as another example. And then there
are all sorts of markup which do not affect rendering and so need not
interrupt kerning. Just one example: language and script tagging. And
believe me, I have seen words which change script in the middle!

> ... and I assert that "people who change font colors midword" and
> "people who care about kerning" are two sets which do not intersect.


Don't count on it. Some software makes good use of font colour changes, 
which could be mid-word. That does not imply that kerning should be broken.

I'm not suggesting that all rendering engines must have such 
capabilities, just that there might be real requirements for such things 
for certain purposes.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Combining across markup? (Was: RE: sign for anti-neutrino - g ree k nu with diacritical line aboveworkaround ?)

2004-08-09 Thread Mike Ayers
> From: Peter Kirk [mailto:[EMAIL PROTECTED]] 
> Sent: Monday, August 09, 2004 1:11 PM


>  If you really want nu followed by 
> a non-combining macron, Unicode has defined ways of doing 
> this. Inserting ZWNJ is *not* one of them, and I wouldn't 


    I didn't propose ZWNJ.  In fact, I didn't mention it, so I'm not sure what you mean here.


> expect any kind of break or control character to have this 
> effect by default. The correct way to do this is probably to 
> insert SPACE or NBSP, or the proposed INVISIBLE CHARACTER. 


    ...except that I don't want a space in there, nor can I use a proposed character.


> And then of course there is the spacing clone of the macron, U+00AF.


    Possible.  Do such clones exist for all combiners?


> I wouldn't consider this a violation of layering. If layer A 
> has to split a string at a particular place because of its 
> own functions, that does not imply a break at layer B.


    ...if layer A is on top of layer B, then a break in layer A is, by definition, a break in layer B.  Could I be using the wrong term again?

> And it 
> is certainly a legitimate user requirement for a combining 
> mark to be in a different colour from its base character. 


    Has this been accepted by Unicode as a requirement?  If so, I'll just run screaming away from the issue.


> What is the best way to represent that is debatable, but some 
> kind of markup at this point should not be ruled out in principle.


    Unicode has nothing to say about markup other than relegating certain behaviors to it.  I had thought that coloring fonts was one of those behaviors.

> In any case markup cannot be allowed to break all intelligent 
> font features. Should kerning be broken because a particular 
> letter is in a different colour or underlined?


    Underlining (and bold, and italic, etc.) breaks kerning - "break" here meaning "causes a discontinuity in", not "causes to not work", because it forces a font switch, or, in the case of underlining, a change in the rendering behavior of the font in use.  Color change is the only markup that I can think of for which it would be possible to preserve kerning across markup based change, and I assert that "people who change font colors midword" and "people who care about kerning" are two sets which do not intersect.


/|/|ike





Re: Combining across markup? (Was: RE: sign for anti-neutrino - gree k nu with diacritical line aboveworkaround ?)

2004-08-09 Thread Peter Kirk
On 09/08/2004 18:30, Mike Ayers wrote:
> ...
> > Mozilla, and lots of other software can't handle mixing
> > markup and combining marks or characters.
> Hmmm... Don't want to start a war, but I do have to ask.
> Isn't this correct behavior?  Doesn't the code above
> explicitly separate the two codepoints to prevent them from being
> rendered as a sequence?  I mean, if I wanted to display nu, then
> macron, without combining behavior, wouldn't this be one way to do it? ...

This is an interesting one in the light of discussions elsewhere on 
Hebrew Holam. If you really want nu followed by a non-combining macron, 
Unicode has defined ways of doing this. Inserting ZWNJ is *not* one of 
them, and I wouldn't expect any kind of break or control character to 
have this effect by default. The correct way to do this is probably to 
insert SPACE or NBSP, or the proposed INVISIBLE CHARACTER. And then of 
course there is the spacing clone of the macron, U+00AF.

But then this makes me realise that the problem someone has recognised 
with ZWNJ and combining characters in MS Word may be because ZWNJ is 
somehow interpreted as in a different font or otherwise having separate 
markup, perhaps because there is no ZWNJ glyph in the selected font.

> ...  Combining across markup seems counterintuitive (and a violation
> of layering) to me.


I wouldn't consider this a violation of layering. If layer A has to 
split a string at a particular place because of its own functions, that 
does not imply a break at layer B. And it is certainly a legitimate user 
requirement for a combining mark to be in a different colour from its 
base character. What is the best way to represent that is debatable, but 
some kind of markup at this point should not be ruled out in principle.

In any case markup cannot be allowed to break all intelligent font 
features. Should kerning be broken because a particular letter is in a 
different colour or underlined?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/