Re: Unicode Search Engines

Misha Wolf Wed, 30 Jan 2002 08:17:46 -0800


On 30/01/2002 15:30:06 Mark Davis wrote:
> It is not a 'fatal flaw'. NFD makes to pretensions to represent the


I imagine that "to" should read "no".

Misha

> most 'natural' ordering for any given language. Out of all the
> possible canonically equivalent sequences, it is simply a specific,
> well-defined, unique representation that is fully decomposed.
>
> The issue of canonical equivalence itself is that the circumflex
> and dot-below can come in any order and have precisely the same
> appearance, *and* that we could not predict the 'natural' order for
> any given language.
>
> Mark
> —————
>
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
>
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Tuesday, January 29, 2002 22:51
> Subject: Re: Unicode Search Engines
>
>
> > In a message dated 2002-01-28 7:37:48 Pacific Standard Time,
> > [EMAIL PROTECTED] writes:
> >
> > > I would like to add:
> > > How do they handle normalization?
> > > In Vietnam, many characters can be represented in several
> > > different ways:
> > > (1) fully precomposed (NFC)
> > > (2) base character and modifier precomposed, tonal mark combining
> > > (3) base character, then modifier, then tonal mark
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> > > Do the search engines do any normalization, before indexing a
> > > page?
> > > Are queries normalized before running the search?
> >
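The two questions above amount to applying the same normalization at index time and at query time. A minimal Python sketch of that idea, assuming NFC as the common form and a toy whitespace tokenizer; `normalize_for_search`, `build_index` and `search` are hypothetical names for illustration, not any real search engine's API:

```python
import unicodedata


def normalize_for_search(text):
    # Assumption: NFC is the common form. Any single normalization form
    # works, as long as indexing and querying use the same one.
    return unicodedata.normalize("NFC", text)


def build_index(pages):
    # Toy inverted index: normalized word -> set of page ids.
    index = {}
    for page_id, text in pages.items():
        for word in normalize_for_search(text).split():
            index.setdefault(word, set()).add(page_id)
    return index


def search(index, query):
    # Normalize the query exactly as the pages were normalized.
    return index.get(normalize_for_search(query), set())


# Three canonically equivalent spellings of the same Vietnamese letter.
pages = {
    "precomposed": "\u1EA4",
    "partial": "\u00C2\u0301",
    "decomposed": "\u0041\u0302\u0301",
}
index = build_index(pages)
hits = search(index, "\u0041\u0302\u0301")
print(sorted(hits))  # ['decomposed', 'partial', 'precomposed']
```

Without the normalization step, the same query would match only the page that happens to use the same code point sequence.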
> > I'm not sure what sort of normalization might be performed by search
> > engines, but I want to examine the Vietnamese decomposition aspect for
> > a moment.
> >
> > If you have a Vietnamese vowel with both modifier and tone mark, say
> > LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE, then you can
> > represent this in Unicode in at least three ways:
> >
> > (1) fully precomposed (NFC) -- that is, U+1EA4
> > (2) base character and modifier precomposed, tonal mark combining --
> >     that is, U+00C2 U+0301
> > (3) base character, then modifier, then tonal mark -- that is,
> >     U+0041 U+0302 U+0301
> >
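The three spellings listed above can be checked for canonical equivalence with Python's standard `unicodedata` module. This is only a sketch; `code_points` is a helper name made up for the example. For this particular letter both marks have combining class 230, so NFD leaves them in the order circumflex, then acute.

```python
import unicodedata

# The three representations of LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE.
spellings = {
    "(1) fully precomposed": "\u1EA4",
    "(2) modifier precomposed": "\u00C2\u0301",
    "(3) fully decomposed": "\u0041\u0302\u0301",
}


def code_points(s):
    # Render a string as a space-separated list of U+XXXX code points.
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Collect the distinct results of normalizing every spelling.
nfc_forms = {code_points(unicodedata.normalize("NFC", s)) for s in spellings.values()}
nfd_forms = {code_points(unicodedata.normalize("NFD", s)) for s in spellings.values()}

print(nfc_forms)  # {'U+1EA4'}
print(nfd_forms)  # {'U+0041 U+0302 U+0301'}
```

Each set has exactly one member, which is what canonical equivalence means in practice: all three spellings normalize to the same sequence.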
> > So far, so good.  But then we have:
> >
> > > (4) like (3), but modifier and tonal mark sorted (NFD)
> >
> > If "sorting" the diacritical marks in NFD results in rearranging the
> > two diacritical marks -- in this case, U+0041 U+0301 U+0302 -- then in
> > terms of Vietnamese orthography, the NFD form may not really be a
> > legitimate way of representing the Vietnamese letter.
> >
> > For example, U+1EAC LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT
> > BELOW is, in Vietnamese, a circumflexed A to which a tone mark (dot
> > below) has been added.  It is not a dotted-below A to which a
> > circumflex has been added.  Yet because of the canonical combining
> > classes of the two diacriticals (230 for COMBINING CIRCUMFLEX ACCENT,
> > 220 for COMBINING DOT BELOW), the latter is how the character will be
> > decomposed.
> >
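The reordering described for U+1EAC can be reproduced with Python's `unicodedata` module. A small sketch follows; the variable names are chosen for the example only. Canonical ordering sorts adjacent combining marks by ascending combining class, so the dot below (220) ends up before the circumflex (230).

```python
import unicodedata

CIRCUMFLEX = "\u0302"  # COMBINING CIRCUMFLEX ACCENT
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW

# Canonical combining classes of the two marks.
ccc_circumflex = unicodedata.combining(CIRCUMFLEX)
ccc_dot_below = unicodedata.combining(DOT_BELOW)
print(ccc_circumflex, ccc_dot_below)  # 230 220

# NFD of U+1EAC: the lower class (dot below) comes first.
nfd = unicodedata.normalize("NFD", "\u1EAC")
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0041', 'U+0323', 'U+0302']

# The orthographically "natural" order (circumflex, then tone mark) is
# canonically equivalent: it normalizes to the same NFD and NFC strings.
natural = "A" + CIRCUMFLEX + DOT_BELOW
same_nfd = unicodedata.normalize("NFD", natural) == nfd
same_nfc = unicodedata.normalize("NFC", natural) == "\u1EAC"
print(same_nfd, same_nfc)  # True True
```

Software that compares or searches Vietnamese text after normalizing both sides never sees the difference between the two orders; only code that inspects the raw decomposed sequence has to cope with the dot below coming first.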
> > In theory, there is actually a case 5: base character and tonal mark
> > precomposed, modifier combining.  In terms of Vietnamese orthography,
> > this is just as illegitimate as case 4 (NFD), but most software that
> > processes Vietnamese text will probably never encounter it.  But it
> > will have to handle the NFD case.
> >
> > If I were on some other mailing lists I could think of, I would claim
> > that this is a fatal flaw in the design of Unicode Normalization Form
> > D.  It's not, but it is a sticky problem that needs to be dealt with
> > when dealing with Vietnamese text.
> >
> > -Doug Ewell
> >  Fullerton, California
> >
> >
>
>

----------------------------------------------------------------
        Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.
