Re: \W and [\W]
On 1 Jan 2004, at 17:50, Rafael Garcia-Suarez wrote:

> +(However, and as a limitation of the current implementation, using
> +C<\w> or C<\W> I a C<[...]> character class will still match
> +with byte semantics.)

I don't think it applies to \w, only \W. \x{df} matches [\w] just fine,
as shown in Andreas' bug report.

--
Eric Cholet
Re: \W and [\W]
Andreas J Koenig wrote in perl.unicode:
>> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet <[EMAIL PROTECTED]> said:
>
> > Can anyone enlighten me as to why \W behaves differently depending
> > on whether it's inside or outside of a character class, for certain
> > characters:
>
> I have reported this as bug 18281
>
> http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281
>
> I don't think that it is documented by now and I cannot spot a good
> place where it needs to be documented. perlre.pod and perlunicode.pod
> seem the natural places.

And apparently fixing it is not trivial.

Does something like this suit you? This can at least make its way
into 5.8.3.

Change 22031 by [EMAIL PROTECTED] on 2004/01/01 16:30:13

    Document that /[\W]/ doesn't work, unicode-wise
    (see bug #18281)

Affected files ...

... //depot/perl/pod/perlunicode.pod#130 edit

Differences ...

//depot/perl/pod/perlunicode.pod#130 (text)

@@ -166,6 +166,10 @@
 Unicode properties database. C<\w> can be used to match a Japanese
 ideograph, for instance.

+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> I a C<[...]> character class will still match
+with byte semantics.)
+
 =item *

 Named Unicode properties, scripts, and block ranges may be used like
Re: \W and [\W]
> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet <[EMAIL PROTECTED]> said:

> Can anyone enlighten me as to why \W behaves differently depending
> on whether it's inside or outside of a character class, for certain
> characters:

I have reported this as bug 18281

http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281

I don't think that it is documented yet, and I cannot spot a good
place where it needs to be documented. perlre.pod and perlunicode.pod
seem the natural places.

--
andreas
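Editorial sketch: the comparison discussed in this thread can be checked with a few lines of Perl. The helper name class_report is made up for illustration and is not part of the bug report or the patch. The script tests one character against \w and \W, both bare and inside a bracketed class, after forcing the string into Perl's internal UTF-8 representation. No expected output is given because the result depends on the perl version in use.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): report whether a single
# character matches \w and \W outside and inside a [...] class.
sub class_report {
    my ($char) = @_;
    utf8::upgrade($char);    # force the internal UTF-8 representation
    return {
        'w'   => ( $char =~ /\w/   ? 1 : 0 ),
        '[w]' => ( $char =~ /[\w]/ ? 1 : 0 ),
        'W'   => ( $char =~ /\W/   ? 1 : 0 ),
        '[W]' => ( $char =~ /[\W]/ ? 1 : 0 ),
    };
}

# U+00DF is the character mentioned in the thread.
my $report = class_report("\x{df}");
printf "%-5s %d\n", $_, $report->{$_} for sort keys %$report;
```

If the bare and bracketed forms disagree for the same character, that is the inconsistency reported as bug 18281.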
Re: UTF8 behavior under -T (Taint) mode
At 22:32 04/01/01 +0900, Dan wrote:
>Aha! I see your point at last. And I found your argument was correct.

Sorry for my poor English and insufficient explanation. :)

>I am not sure how severe it is but this is a bug indeed.

Oh, indeed? I had almost believed that it was intended behavior.
I hope there will be good news about a bug fix.

>And you can't use Encode::decode("utf8", ...) in this particular case because
>Encode::decode() checks and clobbers at "Cannot decode string with wide characters".
>Hmm

I see. Thank you for your hard and good work.

Best regards,
--
Masanori HATA <[EMAIL PROTECTED]>
He's always with us!
Re: UTF8 behavior under -T (Taint) mode
On Jan 01, 2004, at 21:49, Masanori HATA wrote:
> Sorry, no. Since the case which I would like to suggest seems not to be
> fatal. Perl would not die, but it would take the tainted value as a
> Non-UTF8 string. My sample code is like below (test.pl):
>
>     utf8::decode(my $text0 = "\x{3042}"     ); # clean
>     utf8::decode(my $arg   = $ARGV[0]       ); # tainted
>     utf8::decode(my $text1 = "$arg$text0"   ); # tainted
>     utf8::decode(my $text2 = "$text0$arg"   ); # tainted
>     print length($text1), "\n";
>     print length($text2), "\n";

Aha! I see your point at last. And I found your argument was correct.

> When I run this code with 'perl -T test.pl a', the result is:

To clarify your point, I have modified your script with Devel::Peek.
Pay attention to the $text1 result.

without -T

    % perl test.pl a
    SV = PV(0x812354) at 0x80a960
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x428090 "a\343\201\202"\0 [UTF8 "a\x{3042}"]
      CUR = 4
      LEN = 5
    2
    SV = PV(0x812e10) at 0x80f2a8
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x405150 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
      CUR = 4
      LEN = 5
    2

with -T

    % perl -T test.pl a
    SV = PVMG(0x819a88) at 0x80a954
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
      IV = 0
      NV = 0
      PV = 0x428540 "a\343\201\202"\0
      CUR = 4
      LEN = 5
      MAGIC = 0x405480
        MG_VIRTUAL = &PL_vtbl_taint
        MG_TYPE = PERL_MAGIC_taint(t)
        MG_LEN = 1
    4
    SV = PVMG(0x819af4) at 0x80f69c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK,UTF8)
      IV = 0
      NV = 0
      PV = 0x4054e0 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
      CUR = 4
      LEN = 5
      MAGIC = 0x4010d0
        MG_VIRTUAL = &PL_vtbl_taint
        MG_TYPE = PERL_MAGIC_taint(t)
        MG_LEN = 1
    2

Under -T the first scalar ($text1) has lost its UTF8 flag, so its four
bytes are counted as four characters. I am not sure how severe it is
but this is a bug indeed.

> (My system is perl5.8.1 MSWin32-X86-multi-thread)

I have duplicated the result with Perl 5.8.2 on Mac OS X as well as
[EMAIL PROTECTED] on FreeBSD.

And using Encode::decode_utf8 does not help either because it simply
calls utf8::decode. And you can't use Encode::decode("utf8", ...)
in this particular case because Encode::decode() checks and clobbers at
"Cannot decode string with wide characters".

Hmm

Dan the Perl5 Porter
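Editorial sketch: the same comparison can be made without Devel::Peek by printing the UTF8 flag and the character length directly. The helper name string_report and the fallback argument 'a' are assumptions for illustration, not part of the original test.pl. Run it once as 'perl script.pl a' and once as 'perl -T script.pl a' and compare the two text1 lines. On a perl where the problem has been fixed, both runs should agree.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): describe a string by its
# UTF8 flag and its length in characters. $_[1] is used directly so that
# the scalar under inspection is not copied first.
sub string_report {
    return sprintf "%-6s utf8_flag=%d length=%d",
        $_[0], ( utf8::is_utf8( $_[1] ) ? 1 : 0 ), length $_[1];
}

my $arg = @ARGV ? $ARGV[0] : 'a';    # tainted when run with -T and an argument

# Same construction as the test.pl quoted in the thread.
utf8::decode( my $text0 = "\x{3042}" );
utf8::decode( my $text1 = "$arg$text0" );
utf8::decode( my $text2 = "$text0$arg" );

print "taint mode: ", ( ${^TAINT} ? "on" : "off" ), "\n";
print string_report( 'text1', $text1 ), "\n";
print string_report( 'text2', $text2 ), "\n";
```

A text1 line with utf8_flag=0 and length=4 under -T corresponds to the dump above where the UTF8 flag is missing.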
Re: UTF8 behavior under -T (Taint) mode
Thanks for replying, Dan-san.

At 18:09 04/01/01 +0900, Dan wrote:
>>It seems that utf8::decode() does not work for
>>any tainted variables under the -T (Taint) mode.
>What drove you to such a conclusion? It does work. Try something like
>
>    perl -T -le 'utf8::decode($ARGV[0])' something
>
>and see it for yourself. Did perl die with "Insecure ..." message?

Sorry, no. The case I would like to raise does not seem to be fatal.
Perl does not die, but it treats the tainted value as a non-UTF8
string. My sample code is below (test.pl):

    utf8::decode(my $text0 = "\x{3042}"     ); # clean
    utf8::decode(my $arg   = $ARGV[0]       ); # tainted
    utf8::decode(my $text1 = "$arg$text0"   ); # tainted
    utf8::decode(my $text2 = "$text0$arg"   ); # tainted
    print length($text1), "\n";
    print length($text2), "\n";

When I run this code with 'perl -T test.pl a', the result is:

    4
    2

and when I run this code with 'perl test.pl a', the result is:

    2
    2

So I guess $text1 was not treated as a UTF8 string under taint mode.
(My system is perl5.8.1 MSWin32-X86-multi-thread)

I would like to know the reason for this problem.

--
Masanori HATA <[EMAIL PROTECTED]>
He's always with us!
Re: UTF8 behavior under -T (Taint) mode
On Jan 01, 2004, at 12:32, Masanori HATA wrote:
> Hello, I have a simple question:
>
> It seems that utf8::decode() does not work for
> any tainted variables under the -T (Taint) mode.
> Is it right?

Wrong. What drove you to such a conclusion? It does work. Try
something like

    perl -T -le 'utf8::decode($ARGV[0])' something

and see it for yourself. Did perl die with "Insecure ..." message?

Dan the Perl5 Porter
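Editorial sketch: the one-liner above only shows that perl does not die. A slightly longer version can also report whether the argument was tainted in the first place. The helper name decode_status and the fallback value 'something' are assumptions for illustration. Scalar::Util's tainted() is used for the check.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper (not from the thread): return whether the value was
# tainted and whether utf8::decode() ran on it without dying.
sub decode_status {
    my ($value) = @_;
    my $was_tainted = tainted($value) ? 1 : 0;
    my $survived    = eval { utf8::decode($value); 1 } ? 1 : 0;
    return ( $was_tainted, $survived );
}

my ( $was_tainted, $survived ) =
    decode_status( @ARGV ? $ARGV[0] : 'something' );
print "tainted=$was_tainted decode_survived=$survived\n";
```

Run as 'perl -T script.pl something' to check a tainted command-line argument. Run without -T, nothing is tainted and the first value is 0.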