Re: Unicode Search Engines

DougEwell2 Tue, 29 Jan 2002 23:29:28 -0800

In a message dated 2002-01-28 7:37:48 Pacific Standard Time, 
[EMAIL PROTECTED] writes:


> I would like to add:
> How do they handle normalization?
> In Vietnam, many characters can be represented in several different ways:
> (1) fully precomposed (NFC)
> (2) base character and modifier precomposed, tonal mark combining
> (3) base character, then modifier, then tonal mark
> (4) like (3), but modifier and tonal mark sorted (NFD)
> Do the search engines do any normalization, before indexing a page?
> Are queries normalized before running the search?

I'm not sure what sort of normalization might be performed by search engines, 
but I want to examine the Vietnamese decomposition aspect for a moment.

If you have a Vietnamese vowel with both modifier and tone mark, say LATIN 
CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in 
Unicode in at least three ways:

(1) fully precomposed (NFC) -- that is, U+1EA4
(2) base character and modifier precomposed, tonal mark combining -- that is, 
U+00C2 U+0301
(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302 
U+0301
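
The equivalence of these three spellings can be checked mechanically.  The 
following is a minimal sketch, assuming Python and its standard unicodedata 
module (neither is mentioned in the original question); the dictionary keys 
and the code_points helper are names made up for illustration.

```python
import unicodedata

# The three spellings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE
# listed above. The dictionary keys are illustrative labels only.
forms = {
    "precomposed": "\u1EA4",
    "partly_composed": "\u00C2\u0301",
    "fully_decomposed": "\u0041\u0302\u0301",
}

def code_points(s):
    """Render a string as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Normalize every spelling to NFC and to NFD.
nfc = {name: unicodedata.normalize("NFC", s) for name, s in forms.items()}
nfd = {name: unicodedata.normalize("NFD", s) for name, s in forms.items()}

for name, s in forms.items():
    print(f"{name}: {code_points(s)}"
          f" | NFC: {code_points(nfc[name])}"
          f" | NFD: {code_points(nfd[name])}")
```

All three inputs should come out as the same NFC string and the same NFD 
string, which is what makes them canonically equivalent.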

So far, so good.  But then we have:

> (4) like (3), but modifier and tonal mark sorted (NFD)

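If "sorting" the diacritical marks in NFD rearranges them, then in terms of 
Vietnamese orthography the NFD form may not really be a legitimate way of 
representing the Vietnamese letter.  That does not actually happen for this 
letter: both marks have canonical combining class 230, so NFD leaves them as 
U+0041 U+0302 U+0301.  (The swapped sequence U+0041 U+0301 U+0302 would not 
even be canonically equivalent.)  It does happen when the tone mark is the dot 
below.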

For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is, 
in Vietnamese, a circumflexed A to which a tone mark (dot below) has been 
added.  It is not a dotted-below A to which a circumflex has been added.  Yet 
because of the canonical combining classes of the two diacriticals (230 for 
COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how 
the character will be decomposed.
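
To see the reordering concretely, here is a short sketch, again assuming 
Python's standard unicodedata module; the constant names are illustrative.

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

# Canonical combining classes of the two marks.
ccc_circumflex = unicodedata.combining(CIRCUMFLEX)
ccc_dot_below = unicodedata.combining(DOT_BELOW)
print(unicodedata.name(CIRCUMFLEX), ccc_circumflex)
print(unicodedata.name(DOT_BELOW), ccc_dot_below)

# NFD of U+1EAC: the mark with the lower combining class is ordered first.
nfd_1eac = unicodedata.normalize("NFD", "\u1EAC")
print([f"U+{ord(c):04X}" for c in nfd_1eac])

# Typing the marks in orthographic order (circumflex, then dot below)
# still normalizes to the same reordered NFD sequence.
nfd_typed = unicodedata.normalize("NFD", "A" + CIRCUMFLEX + DOT_BELOW)
print(nfd_typed == nfd_1eac)
```

The dot below (class 220) sorts ahead of the circumflex (class 230), so the 
NFD sequence is U+0041 U+0323 U+0302 regardless of the order in which the 
marks were entered.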

In theory, there is actually a case 5: base character and tonal mark 
precomposed, modifier combining.  In terms of Vietnamese orthography, this is 
just as illegitimate as case 4 (NFD), but most software that processes 
Vietnamese text will probably never encounter it.  But it will have to handle 
the NFD case.
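
A sketch of case 5, assuming Python's standard unicodedata module, shows one 
subtlety: the spelling is canonically equivalent to the precomposed letter 
only when the two marks have different combining classes.  The variable names 
are illustrative.

```python
import unicodedata

# Case 5 with marks of different classes: A WITH DOT BELOW (U+1EA0)
# followed by COMBINING CIRCUMFLEX ACCENT (U+0302).
case5_dot = "\u1EA0\u0302"
nfc_dot = unicodedata.normalize("NFC", case5_dot)
print(f"U+{ord(nfc_dot[0]):04X}", len(nfc_dot))

# Case 5 with marks of the same class: A WITH ACUTE (U+00C1)
# followed by COMBINING CIRCUMFLEX ACCENT (U+0302).
# Both marks are class 230, so their order is significant and is kept.
case5_acute = "\u00C1\u0302"
nfc_acute = unicodedata.normalize("NFC", case5_acute)
print(nfc_acute == "\u1EA4")
```

The first spelling composes to U+1EAC.  The second does not compose to 
U+1EA4, so software that wants to treat it as the Vietnamese letter would 
need logic beyond standard normalization.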

If I were on some other mailing lists I could think of, I would claim that 
this is a fatal flaw in the design of Unicode Normalization Form D.  It's 
not, but it is a sticky problem that has to be dealt with when processing 
Vietnamese text.
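
On the original question of indexing and queries, one way to deal with the 
problem is to normalize both sides to the same form before comparing.  The 
following is only a sketch under that assumption, using Python's standard 
unicodedata module; normalize_key, build_index, search, and the page names 
are hypothetical, and nothing here describes what any real search engine 
does.

```python
import unicodedata

def normalize_key(text, form="NFC"):
    """Map every canonically equivalent spelling to one representative."""
    return unicodedata.normalize(form, text)

def build_index(pages):
    """Build a toy word index: normalized word -> set of page ids."""
    index = {}
    for page_id, text in pages.items():
        for word in text.split():
            index.setdefault(normalize_key(word), set()).add(page_id)
    return index

def search(index, query):
    """Normalize the query the same way as the indexed words."""
    return index.get(normalize_key(query), set())

# The same word stored in three canonically equivalent spellings.
pages = {
    "page_precomposed": "\u1EA4n",
    "page_partly_composed": "\u00C2\u0301n",
    "page_decomposed": "A\u0302\u0301n",
}
index = build_index(pages)

# A decomposed query finds all three pages.
hits = search(index, "A\u0302\u0301n")
print(sorted(hits))
```

Either NFC or NFD would work as the key form, as long as the index and the 
query use the same one.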

-Doug Ewell
 Fullerton, California
