Re: \W and [\W]

Eric Cholet Wed, 31 Dec 2003 09:02:02 -0800

On 31 Dec 03, at 16:28, [EMAIL PROTECTED] wrote:

Why are you using:

use encoding 'utf8';

?


So that, for the sake of keeping the snippet short,
Perl would know that my character constant was in
utf-8, and that the "print" statements would output
utf-8 as well. I typed the source code in a utf-8
editor, and used a utf-8 terminal to run it.
I apologize for not making this clear.

Without it, perl 5.8.1, I see output:
1 ß
2 ß
3 Gro


Without the "use encoding" line, Perl is just working with bytes:
you lose the Unicode character semantics and end up with
"3 Gro", which is wrong, since Großbritannien is one word.

When I run with your use encoding 'utf8'; I get an error from perl: Malformed UTF-8 character (unexpected non-continuation byte 0x62, immediately after start byte 0xdf) in pattern match (m//) at /tmp/w.pl line 9.


So you have 0xdf 0x62, which is ßb in latin1. My sample
assumes utf-8, where ßb is 0xc3 0x9f 0x62.
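
To make the byte difference concrete, here is a minimal sketch using the
core Encode module. The variable names are only illustrative. It encodes
the two characters ß and b as latin1 and as utf-8, then prints the
resulting bytes in hex:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Character string: U+00DF (ß) followed by U+0062 (b).
my $chars = "\x{df}b";

# Encode the same characters both ways and format each byte as hex.
my $latin1_hex = join ' ',
    map { sprintf '0x%02x', ord($_) } split //, encode('iso-8859-1', $chars);
my $utf8_hex = join ' ',
    map { sprintf '0x%02x', ord($_) } split //, encode('UTF-8', $chars);

print "latin1: $latin1_hex\n";   # latin1: 0xdf 0x62
print "utf-8:  $utf8_hex\n";     # utf-8:  0xc3 0x9f 0x62
```

The latin1 pair 0xdf 0x62 is exactly what the error message complains
about: read as utf-8, 0xdf is a start byte that must be followed by a
continuation byte, and 0x62 is not one.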

In other words, you're not running the same code as I am.
With latin1 source code like yours, and of course dropping
the "use encoding" line, the character constant needs to
be explicitly decoded to Unicode:

$x = Encode::decode("iso-8859-1", "Großbritannien");

...which yields the same results of course:

1
2 ß
3 Großbritannien
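
As a self-contained sketch of the latin1 case, the snippet below spells
the ß as the byte escape \xdf so it does not depend on how the file
itself is saved. The variable names are illustrative, and the binmode
line assumes a utf-8 terminal. It only compares the \w+ match before and
after decoding:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as they would sit in a latin1 source file: ß is the single byte 0xdf.
my $bytes = "Gro\xdfbritannien";

# Without decoding, \w follows byte (ASCII) rules and stops at 0xdf.
my ($byte_word) = $bytes =~ /(\w+)/;

# Decode the bytes into a character string to get Unicode semantics.
my $x = decode('iso-8859-1', $bytes);
my ($char_word) = $x =~ /(\w+)/;

# Assumption: the terminal expects utf-8 output.
binmode STDOUT, ':encoding(UTF-8)';

print "bytes:   $byte_word\n";   # bytes:   Gro
print "decoded: $char_word\n";   # decoded: Großbritannien
```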


------------------------------------------
#!/usr/bin/perl -w

use strict;
use encoding 'utf8';

my $x = 'Großbritannien';
$\ = "\n";

print '1 ', $x =~ /(\W+)/;
print '2 ', $x =~ /([\W]+)/;
print '3 ', $x =~ /(\w+)/;

exit(0);

--
Eric Cholet
