> (1) Sorting
> It is said, that in sorting, all combining marks should be disregarded.
> While in Vietnamese this is OK for the (combining) tone marks, it is
> absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"

That is not the position taken in Unicode. Combining marks should be taken
into account in sorting in a tailoring that is based upon how they are
handled in the language in question. For example, particular ones may be
treated as tones and sorted on a third level, while others may be treated as
letter modifiers and sorted on the first level. Different combinations can
also be sorted differently, according to the requirements of the language.
For more information, see the UCA: http://www.unicode.org/reports/tr10/

Also, the UCA specifically requires that canonical equivalence be maintained
(unless the source domain is limited to strings that do not contain
alternates), so conformant application of the UCA will sort all of the
following the same:

> >(1) fully precomposed (NFC) -- that is, U+1EA4
> >(2) base character and modifier precomposed, tonal mark combining -- that is,
> >U+00C2 U+0301
> >(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
> >U+0301
> > (4) like (3), but modifier and tonal mark sorted (NFD)

and also, in some cases:

(5) base character and tonal mark composed, modifier combining

Mark
-----
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα -- Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Stefan Probst" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: "Martin Duerst" <[EMAIL PROTECTED]>
Sent: Wednesday, January 30, 2002 01:31
Subject: Re: Unicode Search Engines

> Hello Doug,
>
> concluding from how well you understood the issue (including your case 5),
> one could think, you were Vietnamese ;)
>
> It is exactly the "dot below" which makes the most problems, since its
> combining class (220) is lower than some of the modifiers (230).
> And unfortunately other tonal marks have the same combining class like
> modifiers (230), and therefore the sorting seems to be not even specified!
>
> To have the information together:
> The modifiers, which change the base character to form a new character:
>   breve        U+0306   combining class: 230
>   circumflex   U+0302   combining class: 230
>   horn         U+031B   combining class: 216
> The tonal marks, which have only a very loose connection with the character
> (i.e. in handwriting they are often even placed above two adjacent vowels):
>   grave        U+0300   combining class: 230
>   hook above   U+0309   combining class: 230
>   tilde        U+0303   combining class: 230
>   acute        U+0301   combining class: 230
>   dot below    U+0323   combining class: 220
>
> I made already test pages, e.g. the one at
> http://www.isoc-vn.org/www/standard/normalizationtest13.html
>
> The issue runs even a bit further:
>
> (1) Sorting
> It is said, that in sorting, all combining marks should be disregarded.
> While in Vietnamese this is OK for the (combining) tone marks, it is
> absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"
> with "circumflex" is a completely different character than an "a" alone.
> This is, why some circles in Vietnam prefer what I call "VN-combined": base
> character and modifier pre-composed, tone mark combining.
> (2) Converting
> Inside of Vietnam, in the past, there were mainly two different encodings used:
> - "TCVN-ABC": Fully pre-composed, but a separate font for some upper case
> characters
> - "VNI": Mainly using combining characters
> When converting old documents (office and web) to Unicode, the question
> will be, whether the tools will do any normalization (especially in case of
> VNI), or just only re-map [combining] character by [combining] character.
>
> And to make things worse, it seems, that MS prefers the combining way,
> saying that their sorting, spell check, word wrap etc. works that way....
>
> Vietnam plans to make Unicode compulsory for state offices by middle of 2002.
> I have been asked to advise, and volunteered to take mainly care about
> Internet issues.
>
> Right now, in Vietnam they are still discussing, whether they should
> require a specific normalization, and if so, which one of the four possible
> candidates.
>
> According to W3C's draft at http://www.w3.org/TR/charmod/#sec-Normalization
> it seems, that all Web Applications (and that might include search
> engines?) should reject (to be precise: MUST NOT handle) everything which
> is not NFC. This could mean, that search engines MUST NOT index pages in
> "not NFC" and reject queries in "not NFC". If they do: fine. If not: then
> we have probably quite some problems...
>
>
> And since we are already in Vietnamese.... (to round the things up):
> I am not sure, how e.g. in the introduction to dictionaries or Vietnamese
> language books, the tonal mark can be printed "alone". One solution might
> be to combine them with a "space", but at present, this does not work always.
> And only some of the tonal marks seem to have a "stand-alone version", e.g.
> U+02CB for the "grave".
>
> Best Regards,
> Stefan
>
>
> At 01:29 30.01.2002 -0500, [EMAIL PROTECTED] wrote:
> -------------------------
> >In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> >[EMAIL PROTECTED] writes:
> >
> > > I would like to add:
> > > How do they handle normalization?
> > > In Vietnam, many characters can be represented in several different ways:
> > > (1) fully precomposed (NFC)
> > > (2) base character and modifier precomposed, tonal mark combining
> > > (3) base character, then modifier, then tonal mark
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > Do the search engines do any normalization, before indexing a page?
> > > Are queries normalized before running the search?
> >
> >I'm not sure what sort of normalization might be performed by search engines,
> >but I want to examine the Vietnamese decomposition aspect for a moment.
> >
> >If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
> >CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
> >Unicode in at least three ways:
> >
> >(1) fully precomposed (NFC) -- that is, U+1EA4
> >(2) base character and modifier precomposed, tonal mark combining -- that is,
> >U+00C2 U+0301
> >(3) base character, then modifier, then tonal mark -- that is, U+0041 U+0302
> >U+0301
> >
> >So far, so good. But then we have:
> >
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> >
> >If "sorting" the diacritical marks in NFD results in rearranging the two
> >diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
> >Vietnamese orthography, the NFD form may not really be a legitimate way of
> >representing the Vietnamese letter.
> >
> >For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW is,
> >in Vietnamese, a circumflexed A to which a tone mark (dot below) has been
> >added. It is not a dotted-below A to which a circumflex has been added. Yet
> >because of the canonical combining classes of the two diacriticals (230 for
> >COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT BELOW), the latter is how
> >the character will be decomposed.
> >
> >In theory, there is actually a case 5: base character and tonal mark
> >precomposed, modifier combining. In terms of Vietnamese orthography, this is
> >just as illegitimate as case 4 (NFD), but most software that processes
> >Vietnamese text will probably never encounter it. But it will have to handle
> >the NFD case.
> >
> >If I were on some other mailing lists I could think of, I would claim that
> >this is a fatal flaw in the design of Unicode Normalization Form D. It's
> >not, but it is a sticky problem that needs to be dealt with when dealing with
> >Vietnamese text.
> >
> >-Doug Ewell
> > Fullerton, California
> >
>
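
For anyone who wants to check the cases in this thread directly, here is a
minimal Python sketch using only the standard-library unicodedata module. It
is an illustration under stated assumptions, not a recommended implementation:
the names MODIFIERS, TONES, code_points, canonically_equivalent and
to_vn_combined are made up for this example, and to_vn_combined is only one
possible reading of the "VN-combined" form described above (base character and
modifier precomposed, tone mark combining). The sketch looks up the canonical
combining classes, compares the alternative spellings under normalization, and
shows that NFD reorders the marks of U+1EAC (classes 220 and 230) but leaves
the marks of U+1EA4 in place, since both of those have class 230.

```python
import unicodedata

# Marks as listed in the thread; the dictionary names are labels for this
# sketch only, not part of any standard API.
MODIFIERS = {"\u0306": "breve", "\u0302": "circumflex", "\u031B": "horn"}
TONES = {
    "\u0300": "grave",
    "\u0309": "hook above",
    "\u0303": "tilde",
    "\u0301": "acute",
    "\u0323": "dot below",
}


def code_points(text):
    """Render a string as 'U+XXXX U+XXXX ...' for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def to_vn_combined(text):
    """Hypothetical helper: precompose base + modifier, keep tones combining.

    Assumption: every mark in TONES is treated as a tone mark and every other
    combining mark stays with its base character.
    """
    out = []
    cluster, tones = "", ""
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            # A new base character starts: emit the previous cluster first.
            out.append(unicodedata.normalize("NFC", cluster) + tones)
            cluster, tones = ch, ""
        elif ch in TONES:
            tones += ch
        else:
            cluster += ch
    out.append(unicodedata.normalize("NFC", cluster) + tones)
    return "".join(out)


if __name__ == "__main__":
    for mark, name in {**MODIFIERS, **TONES}.items():
        print(f"{name:<11} U+{ord(mark):04X}  ccc={unicodedata.combining(mark)}")

    # Cases (1)-(3) for LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE.
    for form in ("\u1EA4", "\u00C2\u0301", "\u0041\u0302\u0301"):
        nfc = unicodedata.normalize("NFC", form)
        nfd = unicodedata.normalize("NFD", form)
        print(code_points(form), "| NFC:", code_points(nfc), "| NFD:", code_points(nfd))

    # Circumflex (230) and dot below (220): NFD puts the dot below first.
    print(code_points(unicodedata.normalize("NFD", "\u1EAC")))
    print(code_points(to_vn_combined("\u1EAC")))
```

Note that the output of to_vn_combined is, in general, in neither NFC nor NFD,
but it stays canonically equivalent to its input, so a conformant UCA
implementation should still sort it the same way. The helper also applies NFC
to every base-plus-modifier cluster, which is fine for Vietnamese text but has
not been checked against other scripts.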