RE: Transcoding Tamil in the presence of markup

Philippe Verdy Sat, 06 Dec 2003 18:13:48 -0800

Christopher John Fynn writes:
> In Unicode, U+0BBE, U+0BC6 and U+0BCA are all dependent vowel signs.
> IE is probably treating a base character and any dependent vowels as
> a single unit. Since in some fonts a base character + combining vowel
> mark might be displayed by a single ligature glyph, it makes sense to
> apply the formatting of a base character to any dependent combining
> characters as well.
>
> In Mozilla you may be completely breaking the font lookups by separately
> formatting the different parts of a conjunct.
>
> In legacy glyph-based Tamil encodings there was a simple one-to-one
> correspondence between characters and glyphs, so it is straightforward
> to apply different formatting to different characters.


Still, this is an interesting problem: some texts, notably linguistic
texts about grammar or orthography, want to show a diacritic added to
a base letter in a distinct color.

So for example you might want to show the difference between the two
French words "désert" and "dessert" by coloring the accent of the first
word or the second s of the second; or, more to the point, between
"bailler" (to grant a lease) and "bâiller" (to yawn, to gape), where
the presence or absence of the circumflex on the letter 'a' marks a
difference in both meaning and pronunciation.

However, this is not a problem of Unicode itself, but of the rich-text
format used to add style to a given text. In Unicode (and even in HTML
and SGML), a letter 'a' followed by a combining circumflex is canonically
equivalent to the precomposed letter 'a' with circumflex. However, if you
add tags between a base letter and its diacritics, you create separate
texts, and the second string, which starts with the circumflex, is then
a defective combining sequence.
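
Both points can be checked with a short Python sketch. It uses only the
standard unicodedata module; the helper name starts_defective is invented
for this illustration, and "combining mark" is taken here to mean any
character whose general category starts with M.

```python
import unicodedata

decomposed = "a\u0302"   # 'a' followed by COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00e2"   # LATIN SMALL LETTER A WITH CIRCUMFLEX

# Canonical equivalence: both spellings normalize to the same string.
same = unicodedata.normalize("NFC", decomposed) == precomposed

def starts_defective(text):
    """True if text begins with a combining mark, i.e. has no base before it."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

# A tag inserted between the letter and its accent splits the text in two;
# the second piece has lost its base letter.
first_run, second_run = decomposed[0], decomposed[1:]
print(same, starts_defective(first_run), starts_defective(second_run))
```

The same test flags a Tamil dependent vowel sign such as U+0BBE when it
is cut off from its consonant.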

For Unicode, this circumflex will logically attempt to form a
combining sequence with whatever precedes it, here the closing '>' of
the HTML, SGML or XML tag. This will break many parsers that apply the
Unicode rules when handling files encoded with a Unicode encoding
scheme like UTF-8.
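
As a rough illustration in Python of what a Unicode-aware tool sees in
the raw file: the splitter below is a deliberately simplified stand-in
for real grapheme segmentation (a character plus the marks that follow
it), and simple_clusters is a name made up here. The normalization line
uses U+0338 instead of the circumflex, because '>' followed by U+0338
happens to have a precomposed form.

```python
import unicodedata

def simple_clusters(text):
    """Group each character with the combining marks (category M*) after it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

raw = "a<b>\u0302</b>"
# The circumflex lands in the same cluster as the '>' closing the tag.
glued = [c for c in simple_clusters(raw) if len(c) > 1]

# Normalizing the raw file can even rewrite the delimiter itself:
# '>' + U+0338 COMBINING LONG SOLIDUS OVERLAY composes to U+226F.
damaged = unicodedata.normalize("NFC", "<b>\u0338</b>")
print(glued, damaged)
```

After normalization the opening tag no longer ends in '>', so a markup
parser run afterwards would not see a tag there at all.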

Creating a text that uses this HTML "feature" is very hazardous, as the
interpretation and rendering of defective combining sequences is
implementation-specific: an application may choose to render the
diacritics on a dotted-circle base glyph, or display them on an empty
base glyph, or attach the defective combining sequence to the previous
combining sequence, or it may simply be unable to render the sequence,
because the previous combining sequence may not be accessible in the
current rendering context.

If you really want to add style to diacritics only, it is not in
Unicode that you must look for a solution, but in the styling or
tagging language itself (though defining such a style rule would be
extremely tricky, and doing it with intermediate tags does not conform
to the W3C recommendation on separating text from style). So that's an
interesting question to submit to the W3C for its CSS specification...
I think that Unicode will not allow you to define anything else.

For now you can use a conforming solution consisting of HTML code
like this (here rendering the circumflex above the a in red):

        a<span style="position: relative; left: -6pt; color: red;">&nbsp;&#x302;</span>

or better with a style sheet:

        <style><!--
        .diac-red {position: relative; left: -6pt; color: red;}
        --></style>
        ...
        a<span class="diac-red">&nbsp;&#x302;</span>

This code does not contain any defective sequence, and treats the
diacritic as a separate graphic unit (which it really is, if you
need a style to detach it from the regular text).
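
That the snippet is free of defective sequences can be spot-checked
with a small Python sketch. The tag splitting is a crude regular
expression rather than a real HTML parser, and text_runs and
has_defective_run are hypothetical helper names used only here.

```python
import html
import re
import unicodedata

def text_runs(markup):
    """Split markup at tags and decode character references in each run."""
    return [html.unescape(run) for run in re.split(r"<[^>]*>", markup) if run]

def has_defective_run(markup):
    """True if any text run starts with a combining mark (category M*)."""
    return any(unicodedata.category(run[0]).startswith("M")
               for run in text_runs(markup))

workaround = 'a<span class="diac-red">&nbsp;&#x302;</span>'
naive = 'a<span class="diac-red">&#x302;</span>'
print(has_defective_run(workaround), has_defective_run(naive))
```

In the first string the run inside the span starts with the no-break
space, which serves as the base for the circumflex; in the second the
run starts with the circumflex itself.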


