Re: FW:transform a (UNICODE) accented character to its equivalent (UNICODE) non-accented character

Philippe Verdy Thu, 14 Aug 2003 11:40:21 -0700

On Tuesday, August 05, 2003 9:34 PM, John Cowan <[EMAIL PROTECTED]> wrote:


> Magda Danish (Unicode) scripsit:
> 
> > > I'm looking for the easiest and more stable way to transform
> > > an (UNICODE) accented character to its equivalent (UNICODE)
> > > non-accented character.
> 
> The following mapping table is an approximation to that.
> 
> 00C0;0041
(snip)
> 1D1C0;1D1BA

Why such a table? The main UCD table already contains the needed canonical 
decompositions, and removing accents is simply a matter of NFD decomposition followed 
by removal of combining characters (those with combining class > 0). You may tune the 
set of filtered diacritics, for example to keep Brahmic viramas (combining class 9) or 
the Hiragana/Katakana voicing marks (combining class 8), which are easy to identify 
from their low positive combining class value. These are not really accents: they are 
needed to correctly identify vowels and consonants, and removing them would create too 
many ambiguities.
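
As a rough sketch of that approach in Python, using the standard unicodedata module: 
the function name strip_accents and the KEEP_CLASSES set are my own assumptions for 
illustration, and the set of classes to keep is something you would tune per script.

```python
import unicodedata

# Combining classes to preserve. This set is an assumption for illustration:
# 8 = kana voicing marks (U+3099, U+309A), 9 = viramas.
KEEP_CLASSES = frozenset({8, 9})

def strip_accents(text, keep_classes=KEEP_CLASSES):
    """Decompose to NFD, drop combining marks, then recompose with NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        # Keep base characters (class 0) and marks in the preserved classes.
        if unicodedata.combining(ch) == 0
        or unicodedata.combining(ch) in keep_classes
    ]
    return unicodedata.normalize("NFC", "".join(kept))

print(strip_accents("\u00C9cole fran\u00E7aise"))  # accents and cedilla dropped
print(strip_accents("\u30AC"))                     # voiced kana left intact
```

Letters that have no canonical decomposition (such as U+00F8, o with stroke) pass 
through unchanged, so a small extra table may still be wanted for those.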

Using the NFD/NFC algorithms is certainly the best and safest option, as normalization 
is stable across Unicode versions. The NFD mappings in the UCD will also replace 
compatibility characters that have a canonical (singleton) decomposition with their 
canonical equivalents. Some tuning may be required for Arabic, which includes 
precomposed sequences defined for compatibility. These are only mapped by NFKD, because 
they sometimes include more than a base character (possibly decomposable) and a single 
undecomposable diacritic. Other tuning may also be needed for Arabic and Hebrew (accents 
and points), depending on whether one wants to preserve the traditional vowels or use a 
"modern" simplified mapping without vowels.
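
To illustrate the Arabic case, here is a small sketch in Python. The helper name 
fold_arabic and its keep_points switch are hypothetical, not anything defined by 
Unicode; it only shows that a presentation-form ligature is untouched by NFD but 
expanded by NFKD, and that dropping the points is a separate, optional step.

```python
import unicodedata

def fold_arabic(text, keep_points=True):
    """Expand compatibility forms with NFKD, optionally drop vowel points."""
    decomposed = unicodedata.normalize("NFKD", text)
    if not keep_points:
        # Arabic harakat and Hebrew points all have a combining class > 0.
        decomposed = "".join(
            ch for ch in decomposed if unicodedata.combining(ch) == 0
        )
    return unicodedata.normalize("NFC", decomposed)

lam_alef = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# NFD leaves the ligature alone; only the compatibility mapping expands it.
print(unicodedata.normalize("NFD", lam_alef) == lam_alef)
print([hex(ord(ch)) for ch in fold_arabic(lam_alef)])

# A vocalized word, with the points removed for a simplified spelling.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
print(fold_arabic(vocalized, keep_points=False))
```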

But the NFD mappings are already good for Han. Some compatibility decompositions (NFKD) 
may be useful for East Asian text (notably removing the halfwidth/fullwidth differences).
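
A minimal sketch of that width folding in Python, assuming the compatibility 
normalization is acceptable for your data (the name fold_width is mine):

```python
import unicodedata

def fold_width(text):
    """Fold fullwidth/halfwidth variants via compatibility normalization."""
    # NFKC = compatibility decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", text)

# Fullwidth Latin letters and digits become their ASCII counterparts.
print(fold_width("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"))

# Halfwidth KA plus halfwidth voicing mark composes to the ordinary voiced kana.
print(fold_width("\uFF76\uFF9E") == "\u30AC")
```

Note that NFKC folds more than width (ligatures, superscripts and so on), so a 
narrower filter may be preferable if only the width distinction should go.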

If your intent is to remove only accents in alphabetic scripts, it's probably best 
to remove only diacritics (CC>0) below U+0800, notably those in the U+03xx block, after 
the NFD decomposition, and to ensure that the resulting string is recomposed and 
reordered with NFC rules. For Japanese, one may want to remap Katakana to Hiragana (but 
still keep the Kana voicing marks), using a table currently not defined by Unicode, but 
documented in IBM's open-sourced ICU.
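
The two steps above might be sketched like this in Python. Both function names are 
hypothetical, and the Katakana mapping is a simplification I am assuming here (a fixed 
offset of 0x60 between the two kana blocks), not the table that ICU documents.

```python
import unicodedata

def strip_alphabetic_accents(text):
    """Remove only combining marks (class > 0) located below U+0800."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if not (ord(ch) < 0x0800 and unicodedata.combining(ch) > 0)
    ]
    # NFC recomposes and reorders whatever marks were kept.
    return unicodedata.normalize("NFC", "".join(kept))

# Assumption: Katakana U+30A1..U+30F6 sit exactly 0x60 above
# Hiragana U+3041..U+3096, so a fixed offset covers the basic syllables.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}

def katakana_to_hiragana(text):
    """Remap basic Katakana to Hiragana; other characters pass through."""
    return text.translate(KATAKANA_TO_HIRAGANA)

print(strip_alphabetic_accents("\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC"))
print(strip_alphabetic_accents("\u30AC") == "\u30AC")  # voicing mark is above U+0800
print(katakana_to_hiragana("\u30AB\u30BF\u30AB\u30CA"))
```

Be aware that the U+0800 cut-off also covers the Hebrew and Arabic points, so those 
are removed too unless you narrow the range further.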

See UAX #15 (Unicode Normalization Forms).

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.
