Dear Ende,

On Apr 19, 2006, at 5:22 PM, ende wrote:


Thanks, Nobumi,

Your solution is not only shorter but also more precise and correct than my first attempt. Still, although it works better, it doesn't find words that differ in accents or capitalization. That is, if you look for "Ángeles", it finds neither "Angeles" nor "angeles" nor "ángeles"...


Well, on my machine, if I call that script with:

perl Ende_test.pl Ángeles

it does find "Ángeles" AND "ángeles" (because it has the "i" option in the regex).
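
As a small illustration of why the "i" option handles the accented capital here: it only folds "Á" and "á" together when the strings have been decoded to characters. The following is just a sketch; the helper name matches_ci is made up for this example and is not part of the script.

```perl
use strict;
use warnings;
use utf8;                      # the string literals below are UTF-8 source text
use Encode qw(encode);

# Hypothetical helper: case-insensitive match, returns 1 or 0.
sub matches_ci {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/i ? 1 : 0;
}

# Decoded character strings: "Á" and "á" are one character each, so /i folds them.
my $decoded = matches_ci("ángeles", "Ángeles");

# Raw UTF-8 bytes: "Á" is the bytes C3 81 and "á" is C3 A1, which are not a case pair.
my $raw = matches_ci(encode("UTF-8", "ángeles"), encode("UTF-8", "Ángeles"));

print "decoded: $decoded, raw bytes: $raw\n";
```

This is why the script decodes its arguments and opens the file with an encoding layer before matching.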

But you seem to want a kind of "accent-insensitive search"...? That is not so simple.

One possible -- and rather simple -- solution would be to use "Unicode::Normalize". I just tried this script:

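#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;
use Unicode::Normalize;

binmode (STDOUT, ":utf8");

# Join all command-line arguments into one alternation and decode it to characters.
my $re = join("|", @ARGV);
$re = decode ("utf8", $re);

# Decompose the pattern, then delete combining marks (U+0300-U+036F) and U+0081.
# This accent-stripped copy is built once, outside the loop.
my $temp = NFD($re);
$temp =~ s/[\x{0300}-\x{036F}\x{0081}]+//g;

my $listin = "/Users/me/Documents/documentos/Familia/Casa/Telistin.txt";

open my $f, "<:encoding(MacRoman)", $listin or die "$listin no abre: $!";
while (<$f>) {
        chomp;
        # Print the line if it matches the pattern as typed or without its accents.
        if (/$re/i or /$temp/i) {
                print $_, "\n";
        }
}
close $f;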

I can call this script from Terminal like this:

perl Ende_test.pl Ángeles

or

perl Ende_test.pl ángeles

and get the reply:

Ángeles
Angeles
ángeles
angeles

Note, however, that the search term itself must contain the accented character if you want to match both the accented and the non-accented forms. That is, you will find only
Angeles
angeles

if you invoke the script with:

perl Ende_test.pl Angeles

or

perl Ende_test.pl angeles
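
If you want the search to work in both directions, one possible approach is to strip the accents from the line as well as from the pattern before comparing them. The following is only a sketch under that assumption, not the script I tested above; the helper names strip_accents and matches_ignoring_accents are made up for this example, and it uses \p{Mn} (all nonspacing marks) rather than the explicit range used in the script.

```perl
use strict;
use warnings;
use utf8;                              # the name list below is UTF-8 source text
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose the text, then drop all combining marks.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# Hypothetical helper: compare pattern and line after stripping accents from both.
sub matches_ignoring_accents {
    my ($pattern, $line) = @_;
    my $plain_pattern = strip_accents($pattern);
    return strip_accents($line) =~ /$plain_pattern/i ? 1 : 0;
}

binmode(STDOUT, ":utf8");
for my $name ("Ángeles", "Angeles", "ángeles", "angeles", "Angel") {
    print "$name\n" if matches_ignoring_accents("angeles", $name);
}
```

With this, the unaccented search term "angeles" prints all four spellings of the name and skips "Angel". The price is that an accented search term no longer restricts the results to accented entries.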

Best regards,

Nobumi Iyanaga
Tokyo,
Japan
