Re: removing accents

Eric Cholet Fri, 02 Jan 2004 03:58:58 -0800

On 28 Dec 03, at 04:45, SADAHIRO Tomoyuki wrote:

On Sat, 27 Dec 2003 13:30:19 +0100
Eric Cholet <[EMAIL PROTECTED]> wrote:

Here's another naive question from a unicode newbie:
Is there a way, using perl's unicode support, to remove
accents from a string? I looked at \pM but can't figure
out how it works, I wasn't able to match anything with it.

Thanks,
--
Eric Cholet


Hello.
There are some threads on this issue.
The ones I found are as follows.

* http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2003-05/msg00016.html
* http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2001-12/msg00004.html

I hope something there can help you.

==
P.S.
UTR #30, Character Foldings, has two concepts about removing accents.
[cf. http://www.unicode.org/reports/tr30/ ]

One is "accent removal", and
the other is "diacritic removal (includes stroke, hook, descender)".

The accent removal utilizes canonical decomposition, and
non-decomposable characters, including Eth ("Ð", U+00D0),
O with stroke ("Ø", U+00D8), c with curl (U+0255,
cf. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0255 ),
d with hook (U+0257,
cf. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0257 ),
will not be transformed.

Though "diacritic removal" is provisional and its definition has not
been specified yet, I suppose it will map "Ø" to "O", etc.
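
To make the distinction above concrete, here is a minimal sketch that
prints the canonical decomposition of a few characters. It assumes only
the NFD function exported by Unicode::Normalize; the helper name
decomposed_code_points is made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: list the code points of a string's canonical
# decomposition (NFD) as "U+XXXX" labels.
sub decomposed_code_points {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } split //, NFD($text);
}

# e with acute decomposes into a base letter plus a combining mark.
print join(' ', decomposed_code_points("\x{E9}")), "\n";   # U+0065 U+0301

# O with stroke and Eth have no canonical decomposition, so they stay whole.
print join(' ', decomposed_code_points("\x{D8}")), "\n";   # U+00D8
print join(' ', decomposed_code_points("\x{D0}")), "\n";   # U+00D0
```

Because the last two characters come back unchanged, an approach based
only on canonical decomposition has no separate mark to remove from them.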


Thanks for your detailed reply. I looked into this and found that I
can use Unicode::Normalize to decompose a string into NFD form and then
remove the accents with a regex that deletes \pM characters. I wonder
if I overlooked a shortcoming in this approach, since you didn't
recommend it although you are the author of Unicode::Normalize.

Thanks,
--
Eric Cholet
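
The approach described in this message can be sketched as follows. This
is only an illustration under stated assumptions: the function name
strip_accents and the sample string are invented here, and the only
library call used is NFD from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose to NFD, then delete every character
# with the Mark property (\pM), which includes the combining accents.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);    # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\pM//g;        # drop the combining marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';

# "deja vu, Angstrom", written with \x{} escapes so the source stays ASCII.
print strip_accents("d\x{E9}j\x{E0} vu, \x{C5}ngstr\x{F6}m"), "\n";

# Characters without a canonical decomposition pass through unchanged.
print strip_accents("\x{D8}") eq "\x{D8}" ? "unchanged\n" : "changed\n";
```

Two things may be worth keeping in mind with this sketch: \pM matches
every mark, not only Latin accents, so marks in other scripts are removed
as well, and letters such as O with stroke are left as they are.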
