Re: \W and [\W]
Eric Cholet [EMAIL PROTECTED] writes:
> On 1 Jan 2004, at 17:50, Rafael Garcia-Suarez wrote:
>
>> +(However, and as a limitation of the current implementation, using
>> +C<\w> or C<\W> I<inside> a C<[...]> character class will still match
>> +with byte semantics.)
>
> I don't think it applies to \w, only \W. \x{df} matches [\w] just fine,
> as shown in Andreas' bug report.

Do negated classes work at all? What does /[^\w]/ do?

(I looked at this stuff ages ago and I thought Unicode classes, including negated ones, worked. If that is true, then the fix may just be that the magical \W expander is expanding to the wrong thing...)
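One way to answer the /[^\w]/ question empirically is a small probe along these lines. It is only a sketch: probe() is a hypothetical helper, and using utf8::upgrade() to stand in for properly decoded input is an assumption made to keep the script independent of the source file's encoding. What it prints will depend on the perl version it is run under, so no expected output is given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sharp s written as an escape, so the encoding of this file does not matter.
my $x = "Gro\x{df}britannien";

# Assumption: forcing the internal UTF-8 representation approximates what
# decoding UTF-8 input would give, so character semantics should apply.
utf8::upgrade($x);

# Hypothetical helper: return the first match of $re in $str,
# or the string '(no match)'.
sub probe {
    my ($re, $str) = @_;
    my ($hit) = $str =~ /($re)/;
    return defined $hit ? $hit : '(no match)';
}

binmode STDOUT, ':encoding(UTF-8)';

my @cases = (
    [ '\W+',    qr/\W+/    ],   # negated shorthand, outside a class
    [ '[\W]+',  qr/[\W]+/  ],   # same shorthand, inside a class
    [ '[^\w]+', qr/[^\w]+/ ],   # negated class around \w
    [ '\w+',    qr/\w+/    ],   # positive shorthand, for comparison
);

for my $case (@cases) {
    my ($label, $re) = @$case;
    printf "%-8s %s\n", $label, probe($re, $x);
}
```

If the three negated forms disagree with each other, that points at the class expansion rather than at \W itself.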
Re: \W and [\W]
> Do negated classes work at all? What does /[^\w]/ do?
>
> (I looked at this stuff ages ago and I thought Unicode classes, including
> negated ones, worked. If that is true, then the fix may just be that the
> magical \W expander is expanding to the wrong thing...)

I think it's the evil characters in the 0x80..0xFF range that can still bite one in the nether parts, since they can be legacy or Unicode, depending, and one part of the regex machinery gets it wrong. As the bug report by Andreas quoted Hugo, one can't really trivially fix the problem. In this particular case it's the sharp s that's the troublemaker.

IIRC the problem lies in how the character classes are implemented, and how they have dual brains, one eight-bit and one Unicode, and in this case the reptile legacy brain fires its neuron(s?) too early, before the Unicode brain can engage itself. Or something like that. If we medicated the legacy brain not to fire, other tests involving characters in that range started failing.

-- 
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
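The "legacy or Unicode, depending" point can be illustrated with a short sketch. It shows only the two internal representations of the same character, not the regex engine internals described above. The variable names are made up for the example, and the \w results are printed rather than asserted because they can vary with the perl version and with pragmas in effect.

```perl
use strict;
use warnings;

# The same character, U+00DF (sharp s), held two ways.
my $native   = "\x{df}";      # stored as the single byte 0xDF
my $upgraded = "\x{df}";
utf8::upgrade($upgraded);     # same character, stored internally as UTF-8

# As strings they are equal ...
print "equal:       ", ($native eq $upgraded ? "yes" : "no"), "\n";

# ... but they carry different internal flags,
print "is_utf8:     ", (utf8::is_utf8($native)   ? 1 : 0), " (native), ",
                       (utf8::is_utf8($upgraded) ? 1 : 0), " (upgraded)\n";

# and \w may classify them differently depending on that flag.
print "native \\w:   ", ($native   =~ /\w/ ? "match" : "no match"), "\n";
print "upgraded \\w: ", ($upgraded =~ /\w/ ? "match" : "no match"), "\n";
```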
Re: \W and [\W]
On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet [EMAIL PROTECTED] said:

> Can anyone enlighten me as to why \W behaves differently depending on
> whether it's inside or outside of a character class, for certain characters:

I have reported this as bug 18281

  http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281

I don't think that it is documented by now, and I cannot spot a good place where it needs to be documented. perlre.pod and perlunicode.pod seem the natural places.

-- 
andreas
Re: \W and [\W]
Andreas J Koenig wrote in perl.unicode:
>> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet [EMAIL PROTECTED] said:
>>
>> Can anyone enlighten me as to why \W behaves differently depending on
>> whether it's inside or outside of a character class, for certain characters:
>
> I have reported this as bug 18281
>
>   http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281
>
> I don't think that it is documented by now, and I cannot spot a good place
> where it needs to be documented. perlre.pod and perlunicode.pod seem the
> natural places.

And apparently fixing it is not trivial.

Does something like this suit you? This can at least make its way into 5.8.3.

Change 22031 by [EMAIL PROTECTED] on 2004/01/01 16:30:13

	Document that /[\W]/ doesn't work, unicode-wise
	(see bug #18281)

Affected files ...

... //depot/perl/pod/perlunicode.pod#130 edit

Differences ...

//depot/perl/pod/perlunicode.pod#130 (text)

@@ -166,6 +166,10 @@
 Unicode properties database.  C<\w> can be used to match a Japanese
 ideograph, for instance.
 
+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> I<inside> a C<[...]> character class will still match
+with byte semantics.)
+
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like
Re: \W and [\W]
On 31 Dec 2003, at 16:28, [EMAIL PROTECTED] wrote:

> Why are you using:
>
>   use encoding 'utf8';
>
> ?

So that, for the sake of keeping the snippet short, Perl would know that my character constant was in UTF-8, and that the print statements would output UTF-8 as well. I typed the source code in a UTF-8 editor, and used a UTF-8 terminal to run it. I apologize for not making this clear.

> Without it, perl 5.8.1, I see output:
>
>   1
>   2
>   3 Gro

Without the use encoding, Perl is just doing bytes: you lose the Unicode character semantics and end up with

  3 Gro

which is wrong; Großbritannien is one word.

> When I run with your use encoding 'utf8'; I get an error from perl:
>
>   Malformed UTF-8 character (unexpected non-continuation byte 0x62,
>   immediately after start byte 0xdf) in pattern match (m//) at
>   /tmp/w.pl line 9.

So you have 0xdf 0x62, which is ßb in Latin-1. My sample assumes UTF-8; in UTF-8, ßb is 0xc3 0x9f 0x62. In other words, you're not running the same code as I am. With such a Latin-1 source file, and of course dropping the use encoding line, the character constant needs to be explicitly decoded to Unicode:

  $x = Encode::decode("iso-8859-1", "Großbritannien");

...which yields the same results of course:

  1
  2
  3 Großbritannien

--
#!/usr/bin/perl -w

use strict;
use encoding 'utf8';

my $x = 'Großbritannien';

$\ = "\n";
print '1 ', $x =~ /(\W+)/;
print '2 ', $x =~ /([\W]+)/;
print '3 ', $x =~ /(\w+)/;
exit(0);

-- 
Eric Cholet
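The byte sequences quoted above can be checked with Encode directly. This is a minimal sketch; the variable names are chosen for the example, and the characters are written as escapes so the script does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sharp s (U+00DF) followed by "b".
my $chars = "\x{df}b";

# Encode the same two characters both ways and show the bytes as hex.
my $latin1 = unpack 'H*', encode('iso-8859-1', $chars);
my $utf8   = unpack 'H*', encode('UTF-8',      $chars);

print "latin1: $latin1\n";   # df62
print "utf-8:  $utf8\n";     # c39f62
```

A Latin-1 source file read as if it were UTF-8 therefore presents 0xdf followed directly by 0x62, which is the start byte without a continuation byte that the error message complains about.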