Re: Unicode Search Engines

Mark Davis Wed, 30 Jan 2002 08:48:38 -0800

> (1) Sorting
> It is said, that in sorting, all combining marks should be disregarded.
> While in Vietnamese this is OK for the (combining) tone marks, it is
> absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"


That is not the position taken in Unicode. Combining marks should be
taken into account in sorting in a tailoring that is based upon how
they are handled in the language in question. For example, particular
ones may be treated as tones and sorted on a third level, while others
may be treated as letter modifiers and sorted on the first level.
Different combinations can also be sorted differently, according to
the requirements of the language. For more information, see the UCA
(http://www.unicode.org/reports/tr10/).
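
To make the idea of levels concrete, here is a minimal sketch in Python. It
is not the UCA and not a real tailoring: the function name toy_sort_key and
both rank tables are assumptions made up for illustration, with the marks
ranked in the order in which they are listed further down in this thread.
Modifiers take part in the first level, tone marks only in a later level.

```python
import unicodedata

# Assumed ranks for illustration only (not taken from any real collation table).
MODIFIER_RANK = {"\u0306": 1, "\u0302": 2, "\u031B": 3}  # breve, circumflex, horn
TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3,      # grave, hook above, tilde
             "\u0301": 4, "\u0323": 5}                   # acute, dot below

def toy_sort_key(text):
    """Build a two-part key: letters with modifiers first, tone marks last."""
    letters = []  # first level: (base letter, modifier rank)
    tones = []    # later level: one tone rank per letter
    # Case is discarded here to keep the sketch short.
    for ch in unicodedata.normalize("NFD", text.lower()):
        if ch in MODIFIER_RANK and letters:
            base, _ = letters[-1]
            letters[-1] = (base, MODIFIER_RANK[ch])
        elif ch in TONE_RANK and letters:
            tones[-1] = TONE_RANK[ch]
        else:
            letters.append((ch, 0))
            tones.append(0)
    return (tuple(letters), tuple(tones))

words = ["b", "\u00e2", "\u00e1", "a", "\u0103"]  # b, â, á, a, ă
print(sorted(words, key=toy_sort_key))
# prints ['a', 'á', 'ă', 'â', 'b']
```

With this key, an "a" with circumflex is a different letter from a plain
"a", while an "a" with acute differs from it only at the last level. Because
the key is computed from the NFD form, canonically equivalent spellings get
the same key.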

Also, the UCA specifically requires that canonical equivalence be
maintained (unless the source domain is limited to strings that do not
contain alternates), so conformant application of the UCA will sort
all of the following the same:

> >(1) fully precomposed (NFC) -- that is, U+1EA4
> >(2) base character and modifier precomposed, tonal mark combining --
> >that is, U+00C2 U+0301
> >(3) base character, then modifier, then tonal mark -- that is,
> >U+0041 U+0302 U+0301
> > (4) like (3), but modifier and tonal mark sorted (NFD)

and, in some cases, also:

(5) base character and tonal mark composed, modifier combining
Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Stefan Probst" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: "Martin Duerst" <[EMAIL PROTECTED]>
Sent: Wednesday, January 30, 2002 01:31
Subject: Re: Unicode Search Engines


> Hello Doug,
>
> judging from how well you understood the issue (including your case 5),
> one could think you were Vietnamese ;)
>
> It is exactly the "dot below" which causes the most problems, since its
> combining class (220) is lower than that of some of the modifiers (230).
> And unfortunately other tonal marks have the same combining class as the
> modifiers (230), and therefore the sorting seems not even to be specified!
>
> To have the information together:
> The modifiers, which change the base character to form a new character:
> breve       U+0306  combining class: 230
> circumflex  U+0302  combining class: 230
> horn        U+031B  combining class: 216
> The tonal marks, which have only a very loose connection with the character
> (i.e. in handwriting they are often even placed above two adjacent vowels):
> grave       U+0300  combining class: 230
> hook above  U+0309  combining class: 230
> tilde       U+0303  combining class: 230
> acute       U+0301  combining class: 230
> dot below   U+0323  combining class: 220
>
> I have already made test pages, e.g. the one at
> http://www.isoc-vn.org/www/standard/normalizationtest13.html
>
> The issue runs even a bit further:
>
> (1) Sorting
> It is said, that in sorting, all combining marks should be disregarded.
> While in Vietnamese this is OK for the (combining) tone marks, it is
> absolutely not OK for the (combining) modifiers. In Vietnamese, e.g. an "a"
> with "circumflex" is a completely different character than an "a" alone.
> This is why some circles in Vietnam prefer what I call "VN-combined": base
> character and modifier pre-composed, tone mark combining.
> (2) Converting
> Inside of Vietnam, in the past, there were mainly two different encodings
> used:
> - "TCVN-ABC": Fully pre-composed, but a separate font for some upper case
> characters
> - "VNI": Mainly using combining characters
> When converting old documents (office and web) to Unicode, the question
> will be whether the tools will do any normalization (especially in case of
> VNI), or just re-map [combining] character by [combining] character.
>
> And to make things worse, it seems that MS prefers the combining way,
> saying that their sorting, spell check, word wrap etc. work that way....
>
> Vietnam plans to make Unicode compulsory for state offices by the middle
> of 2002. I have been asked to advise, and volunteered to take care mainly
> of Internet issues.
>
> Right now, in Vietnam they are still discussing whether they should
> require a specific normalization, and if so, which one of the four possible
> candidates.
>
> According to W3C's draft at
> http://www.w3.org/TR/charmod/#sec-Normalization
> it seems that all Web Applications (and that might include search
> engines?) should reject (to be precise: MUST NOT handle) everything which
> is not NFC. This could mean that search engines MUST NOT index pages in
> "not NFC" and must reject queries in "not NFC". If they do: fine. If not:
> then we probably have quite some problems...
>
>
> And since we are already on Vietnamese.... (to round things off):
> I am not sure how, e.g. in the introduction to dictionaries or Vietnamese
> language books, the tonal mark can be printed "alone". One solution might
> be to combine them with a "space", but at present, this does not always
> work. And only some of the tonal marks seem to have a "stand-alone
> version", e.g. U+02CB for the "grave".
>
> Best Regards,
> Stefan
>
>
> At 01:29 30.01.2002 -0500, [EMAIL PROTECTED] wrote:
> -------------------------
> >In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> >[EMAIL PROTECTED] writes:
> >
> > > I would like to add:
> > > How do they handle normalization?
> > > In Vietnam, many characters can be represented in several different
> > > ways:
> > > (1) fully precomposed (NFC)
> > > (2) base character and modifier precomposed, tonal mark combining
> > > (3) base character, then modifier, then tonal mark
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > Do the search engines do any normalization, before indexing a page?
> > > Are queries normalized before running the search?
> >
> >I'm not sure what sort of normalization might be performed by search
> >engines, but I want to examine the Vietnamese decomposition aspect for a
> >moment.
> >
> >If you have a Vietnamese vowel with both modifier and tone mark, say LATIN
> >CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can represent this in
> >Unicode in at least three ways:
> >
> >(1) fully precomposed (NFC) -- that is, U+1EA4
> >(2) base character and modifier precomposed, tonal mark combining --
> >that is, U+00C2 U+0301
> >(3) base character, then modifier, then tonal mark -- that is,
> >U+0041 U+0302 U+0301
> >
> >So far, so good.  But then we have:
> >
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> >
> >If "sorting" the diacritical marks in NFD results in rearranging the two
> >diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in terms of
> >Vietnamese orthography, the NFD form may not really be a legitimate way of
> >representing the Vietnamese letter.
> >
> >For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW
> >is, in Vietnamese, a circumflexed A to which a tone mark (dot below) has
> >been added.  It is not a dotted-below A to which a circumflex has been
> >added.  Yet because of the canonical combining classes of the two
> >diacriticals (230 for COMBINING CIRCUMFLEX ACCENT, 220 for COMBINING DOT
> >BELOW), the latter is how the character will be decomposed.
> >
> >In theory, there is actually a case 5: base character and tonal mark
> >precomposed, modifier combining.  In terms of Vietnamese orthography, this
> >is just as illegitimate as case 4 (NFD), but most software that processes
> >Vietnamese text will probably never encounter it.  But it will have to
> >handle the NFD case.
> >
> >If I were on some other mailing lists I could think of, I would claim that
> >this is a fatal flaw in the design of Unicode Normalization Form D.  It's
> >not, but it is a sticky problem that needs to be dealt with when dealing
> >with Vietnamese text.
> >
> >-Doug Ewell
> >  Fullerton, California
>
>
>
