Re: Unicode Search Engines

Doug Ewell Mon, 18 Feb 2002 22:25:12 -0800

Stefan Probst <[EMAIL PROTECTED]> wrote:

> Well, I tried it with:
> a) the Vietnamese "tonal marks":
> ...
> b) the Vietnamese "modifier" characters:
> - breve       U+0306  combining class: 230
> - circumflex  U+0302  combining class: 230
> - horn        U+031B  combining class: 216
> ...
> I tried to combine them with the space character and with some vowels.
>
> The tonal marks went usually quite fine, but the modifier characters did
> not:
> In WinME, they did not work in MSWindows97, OpenOffice641.
> In IE5.5 they did not work with the space, and only with the "right
> combination" of vowels and modifiers:
> OK: (all vowels a,e,i,o,u) + (any of breve or circumflex)
> OK: o + horn, u + horn (which are in fact valid Vietnamese characters)
> NOT OK: a + horn, e + horn, i + horn (which actually are not valid
> Vietnamese characters)
>
> Are the described issues a problem of the OS (e.g. rendering engine),
> application (why does IE behave different from Word?), or correct Unicode
> implementation (e.g. that the horn does not combine with a,e,i)?


In theory, a fully conformant Unicode renderer is supposed to be able to
combine an arbitrary base character with arbitrary combining marks.  The
renderer is supposed to look at the glyphs and decide how to combine them
dynamically so they look reasonable together.  So you should be able to
combine "o with horn," "a with horn," or "q with horn" and get the
expected result.

In the real world, it often doesn't work like that.  Many renderers detect sequences
of base+combining characters, look for an equivalent precomposed form, and
display that instead.  For example, they detect U+006F (o) followed by
U+031B (combining horn), and instead of trying to figure out how to
combine them, simply generate U+01A1 (o with horn) instead.  This results
in a nice-looking precomposed glyph (if it's in the font) with a lot less
work.  But it means that U+0061 (a) plus U+031B (combining horn) can't be
displayed properly, since there is no precomposed code point for "a with
horn."
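
The precomposed-lookup behavior described above can be imitated with a
short Python sketch.  It uses NFC normalization from the standard
unicodedata module as a stand-in for a renderer's lookup table; that
substitution is an assumption made for illustration, not a description of
how IE or any other renderer is actually implemented, and the helper name
precomposed_form is hypothetical.

```python
import unicodedata

HORN = "\u031b"  # COMBINING HORN, canonical combining class 216


def precomposed_form(base, mark):
    """Return the single precomposed character for base+mark, or None.

    NFC composition stands in for a renderer's lookup table here; a real
    renderer may use a different mechanism (assumption for illustration).
    """
    composed = unicodedata.normalize("NFC", base + mark)
    return composed if len(composed) == 1 else None


if __name__ == "__main__":
    for base in "aeiou":
        result = precomposed_form(base, HORN)
        if result is None:
            print(f"{base} + horn: no precomposed form")
        else:
            name = unicodedata.name(result)
            print(f"{base} + horn: U+{ord(result):04X} {name}")
```

Only o and u come back with a precomposed form (U+01A1 and U+01B0), which
matches the "OK" and "NOT OK" cases reported for IE5.5.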

In the '90s, when UTC and WG2 were more open to encoding precomposed
forms, this approach was not too problematic, since any legitimate
diacriticized character in an alphabetic script probably had its own
precomposed form.  Today, because of normalization considerations, we are
probably not going to see any more precomposed characters that can already
be formed with combining sequences.  So if some language turns out to need
"a with horn" in the future, its readers will have to cross their fingers
that rendering engines become capable of displaying U+0061 U+031B
properly.

-Doug Ewell
 Fullerton, California
