Why are you using:
use encoding 'utf8';
?
So that, to keep the snippet short, Perl would know that my character constant was in UTF-8 and that the "print" statements should output UTF-8 as well. I typed the source code in a UTF-8 editor and ran it in a UTF-8 terminal. I apologize for not making this clear.
Without it, with perl 5.8.1, I see this output:
1 ß 2 ß 3 Gro
Without "use encoding", Perl just works on bytes: you lose Unicode character semantics and end up with "3 Gro", which is wrong, since Großbritannien is one word.
When I run with your use encoding 'utf8'; I get an error from perl:
Malformed UTF-8 character (unexpected non-continuation byte 0x62, immediately after start byte 0xdf) in pattern match (m//) at /tmp/w.pl line 9.
So you have 0xdf 0x62, which is ßb in latin1. My sample assumes UTF-8; in UTF-8, ßb is 0xc3 0x9f 0x62.
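To make the byte difference concrete, here is a minimal sketch that encodes the same two characters both ways and dumps the bytes. The helper name hex_bytes is hypothetical, made up for this illustration; only Encode::encode is a real library call.

```perl
use strict;
use warnings;
use Encode qw(encode);

# hex_bytes is a hypothetical helper: encode a character string in the
# given encoding and return its bytes as space-separated hex values.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    my $bytes = encode($encoding, $chars);
    return join ' ', map { sprintf('0x%02x', ord($_)) } split //, $bytes;
}

my $sz_b = "\x{df}b";    # U+00DF (sharp s) followed by "b"
print 'latin1: ', hex_bytes('iso-8859-1', $sz_b), "\n";
print 'utf-8:  ', hex_bytes('UTF-8', $sz_b), "\n";
```

The latin1 line should show the 0xdf 0x62 pair from the error message, and the utf-8 line the three bytes 0xc3 0x9f 0x62.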
In other words, you're not running the same code as I am. With latin1 source code like that, and of course with the "use encoding" line dropped, the character constant needs to be explicitly decoded to Unicode:
$x = Encode::decode("iso-8859-1", "Großbritannien");
...which yields the same results of course:
1 2 ß 3 Großbritannien
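As a rough sketch of why the decode step matters, the following compares \w+ on the raw latin1 bytes with \w+ on the decoded string. It assumes Perl's default matching rules (no locale, no "unicode_strings" feature); the variable names are mine, and the sharp s is written as \xdf so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw latin1 bytes for the word; \xdf is the sharp s.
my $bytes = "Gro\xdfbritannien";

# The same word as a Unicode character string.
my $chars = decode('iso-8859-1', $bytes);

# On the byte string, \xdf is not a word character under the default
# rules, so the match stops in front of it.
my ($w_bytes) = $bytes =~ /(\w+)/;

# On the decoded string, U+00DF is a letter, so the whole word matches.
my ($w_chars) = $chars =~ /(\w+)/;

binmode STDOUT, ':encoding(UTF-8)';
print "bytes: $w_bytes\n";
print "chars: $w_chars\n";
```

The first line should print only "Gro", the second the whole word.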
------------------------------------------
#!/usr/bin/perl -w

use strict;
use encoding 'utf8';

my $x = 'Großbritannien';
$\ = "\n";

print '1 ', $x =~ /(\W+)/;
print '2 ', $x =~ /([\W]+)/;
print '3 ', $x =~ /(\w+)/;

exit(0);
-- Eric Cholet