Re: Enconding, locate, etc.

Nobumi Iyanaga Wed, 19 Apr 2006 04:04:21 -0700

Dear Ende,

On Apr 19, 2006, at 5:22 PM, ende wrote:

Thanks Nobumi,
Your solution is not only shorter but also more precise and correct than my first attempt. But, anyway, although it works better it doesn't find words with different accented capitalization. That is, if you look for "Ángeles" it finds neither "Angeles" nor "angeles" nor "ángeles"...


Well, on my machine, if I call that script with:

perl Ende_test.pl Ángeles

it does find "Ángeles" AND "ángeles" (because it has the "i" option in the regex).
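
As a small illustration of why the "i" option covers the accented capital: case folding works on characters, so the argument has to be decoded from UTF-8 bytes before it is used as a pattern. This is only a minimal sketch, assuming Terminal passes the argument as UTF-8; the helper name ci_match and the sample words are made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # the string literals below are UTF-8 source text
use Encode;

binmode(STDOUT, ":utf8");

# Decode a raw UTF-8 argument (as found in @ARGV) into characters and
# test it against a word, ignoring case. Returns 1 or 0.
sub ci_match {
    my ($arg_bytes, $word) = @_;
    my $re = decode("utf8", $arg_bytes);
    return $word =~ /$re/i ? 1 : 0;
}

# Simulate what the shell hands to the script: bytes, not characters.
my $arg = encode("utf8", "Ángeles");

for my $word ("Ángeles", "ángeles", "Angeles") {
    print "$word: ", (ci_match($arg, $word) ? "match" : "no match"), "\n";
}
```

The two accented spellings should be reported as matches, while the unaccented "Angeles" is not, since "i" only ignores case and not accents.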

But you seem to want a kind of "accent-insensitive search"...? That is not so simple.

One possible -- and rather simple -- solution would be to use "Unicode::Normalize". I just tried this script:


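#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Encode;
use Unicode::Normalize;

binmode (STDOUT, ":utf8");

# Build one alternation from all arguments and decode it from UTF-8 bytes.
my $re = join("|", @ARGV);
$re = decode ("utf8", $re);

# The same pattern without its accents: decompose it (NFD), then strip
# the combining marks. Done once, outside the loop.
my $temp = NFD($re);
$temp =~ s/[\x{0300}-\x{036F}\x{0081}]+//g;

my $listin = "/Users/me/Documents/documentos/Familia/Casa/Telistin.txt";

open my $f, "<:encoding(MacRoman)", $listin or die "$listin no abre: $!";

while (<$f>) {
        chomp;
        # Print lines matching the pattern as typed, or its unaccented form.
        if (/$re/i || /$temp/i) {
                print $_, "\n";
        }
}
close $f;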

I can call this script from Terminal like this:

perl Ende_test.pl Ángeles

or

perl Ende_test.pl ángeles

and get the reply:

Ángeles
Angeles
ángeles
angeles

-- But you have to use the accented character to match non-accented characters -- that is, you will find only

Angeles
angeles

if you invoke the script with:

perl Ende_test.pl Angeles

or

perl Ende_test.pl angeles
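
If the unaccented form should find the accented entries as well, one possible approach is to remove the accents from both the pattern and each line before comparing them. What follows is only a sketch under that assumption, not a tested replacement for the script above: strip_accents and line_matches are hypothetical helper names, the sample words stand in for the lines of the file, and it strips every nonspacing mark (\p{Mn}) rather than the explicit range used above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                       # the string literals below are UTF-8 source text
use Unicode::Normalize;

binmode(STDOUT, ":utf8");

# Decompose to NFD, then drop the combining marks, so that
# "Á" becomes "A" and "á" becomes "a".
sub strip_accents {
    my ($text) = @_;
    my $nfd = NFD($text);
    $nfd =~ s/\p{Mn}+//g;
    return $nfd;
}

# True (1) if the accent-stripped line matches the accent-stripped
# pattern, ignoring case; otherwise 0.
sub line_matches {
    my ($pattern, $line) = @_;
    my $re = strip_accents($pattern);
    return strip_accents($line) =~ /$re/i ? 1 : 0;
}

# Stand-in for the lines read from the file.
for my $line ("Ángeles", "Angeles", "ángeles", "angeles", "Madrid") {
    print "$line\n" if line_matches("angeles", $line);
}
```

With this, "angeles" and "Ángeles" should both select all four spellings, and the line is still printed in its original, accented form because only the copies used for comparison are stripped.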

Best regards,

Nobumi Iyanaga
Tokyo,
Japan
