Hi, I am trying to extract the ISO code and country name from a three-column table (taken from en.wikipedia.org) and have noticed a problem with accented characters such as ô.
Below is my script and a sample of the data I am using. When I run the script, the line beginning CI for Côte d'Ivoire returns the string "CI\tC", whereas I had hoped for "CI\tCôte d'Ivoire".

Does anyone know why \w+ does not match all of Côte d'Ivoire, and how I can get around it in future?

TIA,
Dp.

==== extract.pl ========
#!/usr/bin/perl

use strict;
use warnings;

my $file = 'iso-alpha2.txt';
open(FH,$file) or die "Can't open $file: $!\n";

while (<FH>) {
    chomp;
    # Skip lines that do not start with a two-character code.
    next if ($_ !~ /^\w{2}\s+/);
    # Capture the code, then up to four space-separated words of the name.
    my ($code,$name) = ($_ =~ /^(\w{2})\s+(\w+\s\w+\s\w+\s\w+|\w+\s\w+\s\w+|\w+\s\w+|\w+)/);
    print "$code\t$name\n";
}
===============

======== sample data ========
...snip
BY Belarus Previously named "Byelorussian S.S.R."
BZ Belize
CA Canada
CC Cocos (Keeling) Islands
CD Congo, the Democratic Republic of the Previously named "Zaire" ZR
CF Central African Republic
CG Congo
CH Switzerland Code taken from "Confoederatio Helvetica", its official Latin name
CI Côte d'Ivoire
CK Cook Islands
CL Chile
CM Cameroon
===========
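
One possible explanation, offered as a sketch rather than a confirmed diagnosis: if iso-alpha2.txt is saved as UTF-8 (an assumption, the encoding is not stated above), the script reads it as raw bytes, and ô arrives as two bytes that \w does not treat as word characters, so \w+ stops after "C". The snippet below reproduces that on a hard-coded byte string and then decodes it with the core Encode module. The variable names and the widened pattern [\w']+ are illustrative only. Note that the apostrophe in d'Ivoire is not a word character either, so decoding alone would still cut the name short.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The sample line as raw UTF-8 bytes: "\xC3\xB4" is the two-byte form of ô.
# (Assumption: the data file is UTF-8 encoded.)
my $bytes = "CI C\xC3\xB4te d'Ivoire";

# On undecoded bytes, \w+ stops at the first byte of ô.
my ($raw_name) = $bytes =~ /^\w{2}\s+(\w+)/;

# After decoding, ô is a single character that \w matches.
my $chars = decode('UTF-8', $bytes);
my ($word) = $chars =~ /^\w{2}\s+(\w+)/;

# Allow apostrophes inside a word, and up to four words in the name.
my ($code, $name) = $chars =~ /^(\w{2})\s+([\w']+(?:\s[\w']+){0,3})/;

# Encode the output so the accented character prints as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "raw:     $raw_name\n";
print "decoded: $word\n";
print "$code\t$name\n";
```

In the script itself, the same effect might be had by opening the file with an encoding layer, for example open(my $fh, '<:encoding(UTF-8)', $file), again assuming the file really is UTF-8.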