In a message dated 2002-01-28 7:37:48 Pacific Standard Time, [EMAIL PROTECTED] writes:
> I would like to add:
> How do they handle normalization?
> In Vietnam, many characters can be represented in several different ways:
> (1) fully precomposed (NFC)
> (2) base character and modifier precomposed, tonal mark combining
> (3) base character, then modifier, then tonal mark
> (4) like (3), but modifier and tonal mark sorted (NFD)
> Do the search engines do any normalization, before indexing a page?
> Are queries normalized before running the search?

I'm not sure what sort of normalization might be performed by search engines, but I want to examine the Vietnamese decomposition aspect for a moment.

If you have a Vietnamese vowel with both modifier and tone mark, say LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in Unicode in at least three ways:

(1) fully precomposed (NFC) -- that is, U+1EA4
(2) base character and modifier precomposed, tonal mark combining -- that is, U+00C2 U+0301
(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302 U+0301

So far, so good. But then we have:

> (4) like (3), but modifier and tonal mark sorted (NFD)

If "sorting" the diacritical marks in NFD results in rearranging the two marks (for this letter, that would mean U+0041 U+0301 U+0302), then in terms of Vietnamese orthography, the NFD form may not really be a legitimate way of representing the Vietnamese letter. Circumflex and acute share canonical combining class 230, so that particular pair is not actually reordered, but the problem is real for the tone mark that sits below the letter.

For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is, in Vietnamese, a circumflexed A to which a tone mark (dot below) has been added. It is not a dotted-below A to which a circumflex has been added. Yet because of the canonical combining classes of the two diacriticals (230 for COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how the character will be decomposed.

In theory, there is actually a case 5: base character and tonal mark precomposed, modifier combining.
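
The reordering described above can be checked with a short sketch, assuming Python and its standard unicodedata module. The helper name codepoints and the dictionary labels are made up for illustration; only unicodedata.normalize and unicodedata.combining are real library calls. It spells A WITH CIRCUMFLEX AND DOT BELOW four ways (cases 1, 2, 3 and 5) and shows that NFD puts the dot below ahead of the circumflex, while NFC folds every spelling back to U+1EAC.

```python
import unicodedata


def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Four spellings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW.
# The labels follow the case numbers used in this thread.
forms = {
    "case 1 (precomposed)": "\u1EAC",
    "case 2 (A-circumflex + dot below)": "\u00C2\u0323",
    "case 3 (A + circumflex + dot below)": "\u0041\u0302\u0323",
    "case 5 (A-dot-below + circumflex)": "\u1EA0\u0302",
}

# Canonical combining classes decide the order of marks in NFD:
# the lower class (dot below) is sorted ahead of the higher one.
print("circumflex:", unicodedata.combining("\u0302"))
print("dot below: ", unicodedata.combining("\u0323"))
print("acute:     ", unicodedata.combining("\u0301"))

for label, text in forms.items():
    nfd = unicodedata.normalize("NFD", text)
    nfc = unicodedata.normalize("NFC", text)
    print(f"{label}: {codepoints(text)}")
    print(f"    NFD -> {codepoints(nfd)}")
    print(f"    NFC -> {codepoints(nfc)}")

# Collect the distinct results; canonically equivalent spellings
# should collapse to a single string under each normalization form.
nfd_forms = {unicodedata.normalize("NFD", t) for t in forms.values()}
nfc_forms = {unicodedata.normalize("NFC", t) for t in forms.values()}
print("distinct NFD results:", len(nfd_forms))
print("distinct NFC results:", len(nfc_forms))
```

If a search engine applied either form consistently to both the indexed pages and the queries, all of these spellings would compare equal. Which form, if any, a given engine uses is not something this sketch can tell you.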
In terms of Vietnamese orthography, case 5 is just as illegitimate as case 4 (NFD), but most software that processes Vietnamese text will probably never encounter it. It will, however, have to handle the NFD case.

If I were on some other mailing lists I could think of, I would claim that this is a fatal flaw in the design of Unicode Normalization Form D. It's not, but it is a sticky problem that needs to be dealt with when handling Vietnamese text.

-Doug Ewell
 Fullerton, California