On 30/01/2002 15:30:06 Mark Davis wrote:

> It is not a 'fatal flaw'. NFD makes to pretensions to represent the

I imagine that "to" -> "no".

Misha

> most 'natural' ordering for any given language. Out of all the
> possible canonically equivalent sequences, it is simply a specific,
> well-defined, unique representation that is fully decomposed.
>
> The issue of canonical equivalence itself is that the circumflex
> and dot-below can come in any order and have precisely the same
> appearance, *and* that we could not predict the 'natural' order for
> any given language.
>
> Mark
> -----
>
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα -- Ὁμήρου Μαργίτῃ
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
>
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Tuesday, January 29, 2002 22:51
> Subject: Re: Unicode Search Engines
>
>
> > In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> > [EMAIL PROTECTED] writes:
> >
> > > I would like to add:
> > > How do they handle normalization?
> > > In Vietnam, many characters can be represented in several different ways:
> > > (1) fully precomposed (NFC)
> > > (2) base character and modifier precomposed, tonal mark combining
> > > (3) base character, then modifier, then tonal mark
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > Do the search engines do any normalization, before indexing a page?
> > > Are queries normalized before running the search?
> >
> > I'm not sure what sort of normalization might be performed by search engines,
> > but I want to examine the Vietnamese decomposition aspect for a moment.
> >
> > If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
> > CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
> > Unicode in at least three ways:
> >
> > (1) fully precomposed (NFC) -- that is, U+1EA4
> > (2) base character and modifier precomposed, tonal mark combining -- that is, U+00C2 U+0301
> > (3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302 U+0301
> >
> > So far, so good. But then we have:
> >
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> >
> > If "sorting" the diacritical marks in NFD results in rearranging the two
> > diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
> > Vietnamese orthography, the NFD form may not really be a legitimate way of
> > representing the Vietnamese letter.
> >
> > For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
> > in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
> > added. It is not a dotted-below A to which a circumflex has been added. Yet
> > because of the canonical combining classes of the two diacriticals (230 for
> > COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how
> > the character will be decomposed.
> >
> > In theory, there is actually a case 5: base character and tonal mark
> > precomposed, modifier combining. In terms of Vietnamese orthography, this is
> > just as illegitimate as case 4 (NFD), but most software that processes
> > Vietnamese text will probably never encounter it. But it will have to handle
> > the NFD case.
> >
> > If I were on some other mailing lists I could think of, I would claim that
> > this is a fatal flaw in the design of Unicode Normalization Form D. It's
> > not, but it is a sticky problem that needs to be dealt with when dealing with
> > Vietnamese text.
> >
> > -Doug Ewell
> > Fullerton, California
> >
> >
> >

----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.
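
For anyone who wants to check the U+1EAC example from the thread, here is a minimal sketch using Python's standard unicodedata module. It is only an illustration, not a description of what any search engine actually does. The helper names codepoints and search_key are made up for this example, and folding to NFC is just one possible choice of search key. The sketch takes the spellings of A WITH CIRCUMFLEX AND DOT BELOW discussed above, including "case 5", and shows that they all share one NFD form, with the dot below (class 220) sorted ahead of the circumflex (class 230).

```python
import unicodedata


def codepoints(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]


def search_key(text):
    """Hypothetical helper: fold text to NFC so index and query compare equal."""
    return unicodedata.normalize("NFC", text)


# Spellings of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW.
spellings = {
    "precomposed": "\u1EAC",               # (1) fully precomposed
    "modifier precomposed": "\u00C2\u0323",  # (2) A-circumflex + dot below
    "circumflex first": "A\u0302\u0323",    # (3) base, modifier, tone mark
    "dot below first": "A\u0323\u0302",     # (4) the order NFD produces
    "tone precomposed": "\u1EA0\u0302",     # "case 5": A-dot-below + circumflex
}

# Canonical combining classes: marks with a lower class sort first in NFD.
ccc_circumflex = unicodedata.combining("\u0302")  # 230
ccc_dot_below = unicodedata.combining("\u0323")   # 220

# Every spelling decomposes to the same NFD sequence.
nfd_forms = {name: unicodedata.normalize("NFD", s)
             for name, s in spellings.items()}

for name, form in nfd_forms.items():
    print("%-22s %s" % (name, " ".join(codepoints(form))))

# Normalizing both the indexed text and the query makes them match.
keys = {search_key(s) for s in spellings.values()}
print("distinct search keys:", len(keys))
```

The same approach would work with NFD as the key. What matters for matching is that the indexed text and the query go through the same normalization form.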