Re: \W and [\W]

2004-01-02 Thread Nick Ing-Simmons
Eric Cholet [EMAIL PROTECTED] writes:
> On 1 Jan 2004, at 17:50, Rafael Garcia-Suarez wrote:
>
>> +(However, and as a limitation of the current implementation, using
>> +C<\w> or C<\W> I<inside> a C<[...]> character class will still match
>> +with byte semantics.)
>
> I don't think it applies to \w, only \W. \x{df} matches [\w] just fine,
> as shown in Andreas' bug report.

Do negated classes work at all?
What does /[^\w]/ do?

(I looked at this stuff ages ago and I thought Unicode classes, including
negated ones, worked. If that is true, then the fix may just be that the
magical \W expander is expanding to the wrong thing...)
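
One way to try the question above on a given perl is a small comparison of the three spellings. This is only a sketch: the helper name first_matches is hypothetical, the sharp s is written as the escape \x{df} so the result does not depend on the source file's encoding, and utf8::upgrade is used to force the internal UTF-8 representation. The bracketed forms may behave differently from perl to perl, so no output is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns, for each pattern, the first match against
# $str, or undef where the pattern does not match at all.
sub first_matches {
    my ($str) = @_;
    my %patterns = (
        '\W+'    => qr/(\W+)/,
        '[\W]+'  => qr/([\W]+)/,
        '[^\w]+' => qr/([^\w]+)/,
    );
    my %result;
    for my $name (sort keys %patterns) {
        $result{$name} = ($str =~ $patterns{$name}) ? $1 : undef;
    }
    return \%result;
}

# U+00DF (sharp s) given as an escape, then upgraded so that the string
# is stored internally as UTF-8.
my $word = "Gro\x{df}britannien";
utf8::upgrade($word);

my $r = first_matches($word);
for my $name (sort keys %$r) {
    my $m = $r->{$name};
    printf "%-7s %s\n", $name,
        defined($m) ? sprintf('matched, first char U+%04X', ord($m))
                    : 'no match';
}
```

If all three lines report "no match", the bracketed and negated classes agree with plain \W for this string.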


Re: \W and [\W]

2004-01-02 Thread Jarkko Hietaniemi
> Do negated classes work at all?
> What does /[^\w]/ do?
>
> (I looked at this stuff ages ago and I thought Unicode classes, including
> negated ones, worked. If that is true, then the fix may just be that the
> magical \W expander is expanding to the wrong thing...)

I think it's the evil characters in the 0x80..0xFF range that can still
bite one in the nether parts, since they can be legacy or Unicode,
depending, and one part of the regex machinery gets it wrong. As Hugo
said (quoted in Andreas' bug report), one can't really trivially fix the
problem. In this particular case it's the sharp s that's the troublemaker.
IIRC the problem lies in how the character classes are implemented, and
how they have dual brains, one eight-bit and one Unicode, and in this case
the reptile legacy brain fires its neuron(s?) too early, before the Unicode
brain can engage itself. Or something like that. If we medicated the
legacy brain not to fire, other tests involving characters in that range
started failing.
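
The "legacy or Unicode, depending" behaviour of a character in the 0x80..0xFF range can be seen without any character class at all. The following is a minimal sketch, assuming perl's default regex rules (no locale and no unicode_strings feature in effect); is_word_char is a hypothetical helper name. Under those assumptions the legacy copy of the sharp s is not a word character, while the upgraded copy of the very same character is.

```perl
use strict;
use warnings;

# Hypothetical helper: does the single character $ch match \w?
sub is_word_char {
    my ($ch) = @_;
    return ($ch =~ /\A\w\z/) ? 1 : 0;
}

my $legacy  = "\x{df}";    # one byte, no UTF-8 flag: the eight-bit reading
my $unicode = "\x{df}";
utf8::upgrade($unicode);   # same character, stored internally as UTF-8

# The two strings compare equal, yet \w treats them differently.
printf "same character: %s\n", ($legacy eq $unicode) ? 'yes' : 'no';
printf "legacy  \\w: %d\n", is_word_char($legacy);
printf "unicode \\w: %d\n", is_word_char($unicode);
```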


--
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
There is this special biologist word we use for 'stable'.  It is 'dead'. -- Jack Cohen




Re: \W and [\W]

2004-01-01 Thread Andreas J Koenig
> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet [EMAIL PROTECTED] said:
>
>   Can anyone enlighten me as to why \W behaves differently depending
>   on whether it's inside or outside of a character class, for certain
>   characters:

I have reported this as bug 18281

http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281

I don't think that it is documented by now and I cannot spot a good
place where it needs to be documented. perlre.pod and perlunicode.pod
seem the natural places.

-- 
andreas


Re: \W and [\W]

2004-01-01 Thread Rafael Garcia-Suarez
Andreas J Koenig wrote in perl.unicode :
>> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet [EMAIL PROTECTED] said:
>>
>>   Can anyone enlighten me as to why \W behaves differently depending
>>   on whether it's inside or outside of a character class, for certain
>>   characters:
>
> I have reported this as bug 18281
>
> http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281
>
> I don't think that it is documented by now and I cannot spot a good
> place where it needs to be documented. perlre.pod and perlunicode.pod
> seem the natural places.

And apparently fixing it is not trivial.
Does something like this suit you? This can at least make its way into
5.8.3.

Change 22031 by [EMAIL PROTECTED] on 2004/01/01 16:30:13

Document that /[\W]/ doesn't work, unicode-wise (see bug #18281)

Affected files ...

... //depot/perl/pod/perlunicode.pod#130 edit

Differences ...

==== //depot/perl/pod/perlunicode.pod#130 (text) ====

@@ -166,6 +166,10 @@
 Unicode properties database.  C<\w> can be used to match a Japanese
 ideograph, for instance.
 
+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> I<inside> a C<[...]> character class will still match
+with byte semantics.)
+
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like


Re: \W and [\W]

2003-12-31 Thread Eric Cholet
On 31 Dec 2003, at 16:28, [EMAIL PROTECTED] wrote:

> Why are you using:
>
> use encoding 'utf8';
>
> ?

So that, for the sake of keeping the snippet short,
Perl would know that my character constant was in
UTF-8, and that the print statements would output
UTF-8 as well. I typed the source code in a UTF-8
editor, and used a UTF-8 terminal to run it.
I apologize for not making this clear.

> Without it, perl 5.8.1, I see output:
>
> 1 ß
> 2 ß
> 3 Gro

Without the use encoding line, Perl is just doing bytes:
you lose the Unicode character semantics and end up
with "3 Gro", which is wrong. Großbritannien is one word.

> When I run with your use encoding 'utf8'; I get an error from perl:
>
> Malformed UTF-8 character (unexpected non-continuation byte 0x62,
> immediately after start byte 0xdf) in pattern match (m//) at /tmp/w.pl
> line 9.

So you have 0xdf 0x62, which is "ßb" in Latin-1. My sample
assumes UTF-8; in UTF-8, "ßb" is 0xc3 0x9f 0x62.
In other words, you're not running the same code as I am.
With such Latin-1 source code, and of course dropping
the use encoding line, the character constant needs to
be explicitly decoded to Unicode:

$x = Encode::decode('iso-8859-1', 'Großbritannien');

...which yields the same results of course:

1
2 ß
3 Großbritannien
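
The byte sequences mentioned above can be checked with the Encode module. This is a minimal sketch, not the code from the original exchange: hex_bytes and decoded_word_length are hypothetical helper names, and the sharp s is written as an escape so that the example does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '0x%02x', ord($_) } split //, $octets;
}

# Hypothetical helper: decode Latin-1 octets into a character string and
# return the length of the first \w+ match (0 if there is none).
sub decoded_word_length {
    my ($octets) = @_;
    my $chars = decode('iso-8859-1', $octets);
    return ($chars =~ /(\w+)/) ? length($1) : 0;
}

my $chars = "\x{df}b";    # U+00DF (sharp s) followed by "b"

print 'latin1: ', hex_bytes(encode('iso-8859-1', $chars)), "\n";
print 'utf-8:  ', hex_bytes(encode('UTF-8',      $chars)), "\n";

# Latin-1 octets for the whole word, decoded before matching.
print 'length of \w+ match: ',
    decoded_word_length("Gro\xdfbritannien"), "\n";
```

The first two lines show the same two characters as different octets, which is why a Latin-1 source file fed through use encoding 'utf8' produces the "Malformed UTF-8 character" error quoted above.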
--
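#!/usr/bin/perl -w
# Source file saved as UTF-8 and run in a UTF-8 terminal.
use strict;
use encoding 'utf8';         # literals and output are UTF-8
my $x = 'Großbritannien';
$\ = "\n";                   # append a newline to every print
print '1 ', $x =~ /(\W+)/;   # \W outside a character class
print '2 ', $x =~ /([\W]+)/; # \W inside a character class
print '3 ', $x =~ /(\w+)/;   # \w for comparison
exit(0);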

--
Eric Cholet