On 28 Dec 03, at 04:45, SADAHIRO Tomoyuki wrote:
On Sat, 27 Dec 2003 13:30:19 +0100
Eric Cholet <[EMAIL PROTECTED]> wrote:
Here's another naive question from a unicode newbie:
Is there a way, using perl's unicode support, to remove
accents from a string? I looked at \pM but can't figure
out how it works, I wasn't able to match anything with it.
Thanks,
--
Eric Cholet
Hello.
There are some threads on this issue.
The ones I found are as follows.
* http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2003-05/msg00016.html
* http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2001-12/msg00004.html
I hope something there can help you.
==
P.S. UTR #30, Character Foldings, defines two concepts for removing
accents.
[cf. http://www.unicode.org/reports/tr30/ ]
One is "accent removal", and
the other is "diacritic removal (includes stroke, hook, descender)".
The accent removal utilizes canonical decomposition, and
non-decomposable characters, including Eth ("Ð", U+00D0),
O with stroke ("Ø", U+00D8), c with curl (U+0255,
cf. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0255 ),
d with hook (U+0257,
cf. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=0257 ),
will not be transformed.
Though "diacritic removal" is provisional and its definition has not
been specified yet, I suppose it will map "Ø" to "O", etc.
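
The distinction between decomposable and non-decomposable characters can be
checked directly. The following is only a minimal sketch, assuming
Unicode::Normalize is installed; `has_canonical_decomposition` is a
hypothetical helper name introduced for illustration, not part of the module.
It relies on the fact that NFD returns a character unchanged when it has no
canonical decomposition.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if NFD changes the character, i.e. the
# character has a canonical decomposition.
sub has_canonical_decomposition {
    my ($char) = @_;
    return NFD($char) ne $char;
}

# e with acute, then the four characters mentioned above.
for my $cp (0x00E9, 0x00D0, 0x00D8, 0x0255, 0x0257) {
    printf "U+%04X %s\n", $cp,
        has_canonical_decomposition(chr $cp) ? "decomposes" : "unchanged";
}
```

Only U+00E9 should be reported as decomposing; the other four have no
canonical decomposition, so accent removal based on NFD leaves them as they are.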
Thanks for your detailed reply. I looked into this and found that I
can use Unicode::Normalize to decompose a string into NFD form and then
remove the accents with a regex that deletes /\pM/. I wonder if I overlooked
a shortcoming in this approach, since you didn't recommend it although
you are the author of Unicode::Normalize.
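
The approach described above can be sketched roughly as follows. This is an
illustration under stated assumptions rather than code from the thread:
`strip_accents` is a hypothetical helper name, and the sample string is
made up. It decomposes with NFD and then deletes every character matching
\pM (the Mark category).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonical decomposition, then removal of all marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # e.g. U+00E9 becomes "e" + U+0301
    $decomposed =~ s/\pM//g;       # delete combining marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
# "\x{E9}t\x{E9}" is "été"; "\x{D8}" is O with stroke.
print strip_accents("\x{E9}t\x{E9} \x{D8}"), "\n";
```

Characters without a canonical decomposition, such as "Ø", pass through
unchanged, which matches the "accent removal" (not "diacritic removal")
behaviour described in the P.S. above. Note also that \pM removes all
combining marks, not only those usually thought of as accents.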
Thanks,
--
Eric Cholet