Re: \W and [\W]

2004-01-01 Thread Eric Cholet
On 1 Jan 04, at 17:50, Rafael Garcia-Suarez wrote:

+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> in a C<[...]> character class will still match
+with byte semantics.)
I don't think it applies to \w, only \W. \x{df} matches [\w] just fine,
as shown in Andreas' bug report.
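
A quick way to check whether a given perl shows this discrepancy is to test the same characters against both forms. The following is a minimal sketch, not taken from the bug report: the helper name class_report and the choice of test characters are assumptions, and U+00DF is explicitly upgraded on the assumption that the reported case involved a UTF8-flagged string.

```perl
use strict;
use warnings;

# Hypothetical helper: match one single-character string against the bare
# escapes and against the same escapes inside a bracketed character class.
sub class_report {
    my ($ch) = @_;
    return (
        'w'   => ($ch =~ /\w/   ? 1 : 0),
        '[w]' => ($ch =~ /[\w]/ ? 1 : 0),
        'W'   => ($ch =~ /\W/   ? 1 : 0),
        '[W]' => ($ch =~ /[\W]/ ? 1 : 0),
    );
}

# Assumption: store U+00DF with the UTF8 flag on, so that character
# semantics can apply to it.
my $sharp_s = "\x{df}";
utf8::upgrade($sharp_s);

for my $ch ("a", $sharp_s, "\x{3042}") {
    my %r = class_report($ch);
    # Print the code point rather than the character to avoid output
    # encoding issues.
    printf "U+%04X  \\w=%d [\\w]=%d  \\W=%d [\\W]=%d\n",
        ord($ch), @r{'w', '[w]', 'W', '[W]'};
}
```

On a perl where the bracketed class follows the same semantics as the bare escape, the two \w columns agree and the two \W columns agree on every line.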
--
Eric Cholet


Re: \W and [\W]

2004-01-01 Thread Rafael Garcia-Suarez
Andreas J Koenig wrote in perl.unicode :
>> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet <[EMAIL PROTECTED]> said:
> 
>  > Can anyone enlighten me as to why \W behaves differently depending
>  > on whether it's inside or outside of a character class, for certain
>  > characters:
> 
> I have reported this as bug 18281
> 
> http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281
> 
> I don't think that it is documented by now and I cannot spot a good
> place where it needs to be documented. perlre.pod and perlunicode.pod
> seem the natural places.

And apparently fixing it is not trivial.
Does something like this suit you ? This can at least make its way into
5.8.3.

Change 22031 by [EMAIL PROTECTED] on 2004/01/01 16:30:13

Document that /[\W]/ doesn't work, unicode-wise (see bug #18281)

Affected files ...

... //depot/perl/pod/perlunicode.pod#130 edit

Differences ...

 //depot/perl/pod/perlunicode.pod#130 (text) 

@@ -166,6 +166,10 @@
 Unicode properties database.  C<\w> can be used to match a Japanese
 ideograph, for instance.
 
+(However, and as a limitation of the current implementation, using
+C<\w> or C<\W> in a C<[...]> character class will still match
+with byte semantics.)
+
 =item *
 
 Named Unicode properties, scripts, and block ranges may be used like


Re: \W and [\W]

2004-01-01 Thread Andreas J Koenig
> On Wed, 31 Dec 2003 16:21:36 +0100, Eric Cholet <[EMAIL PROTECTED]> said:

  > Can anyone enlighten me as to why \W behaves differently depending
  > on whether it's inside or outside of a character class, for certain
  > characters:

I have reported this as bug 18281

http://guest:[EMAIL PROTECTED]/rt3/Ticket/Display.html?id=18281

I don't think that it is documented by now and I cannot spot a good
place where it needs to be documented. perlre.pod and perlunicode.pod
seem the natural places.

-- 
andreas


Re: UTF8 behavior under -T (Taint) mode

2004-01-01 Thread Masanori HATA
At 22:32 04/01/01 +0900, Dan wrote:
>Aha!  I see your point at last.  And I found your argument was correct.

Sorry for my poor English and insufficient explanation. :)

>I am not sure how severe it is but this is a bug indeed.

Oh, really? I had almost believed that it was intended behavior.

I hope there will be good news about the bug fix.

>And you can't use Encode::decode("utf8", ...) in this particular case because 
>Encode::decode() checks and clobbers at "Cannot decode string with wide characters".  
>Hmm

I see.

Thank you for your hard work.

Best regards,

-- 
Masanori HATA
<[EMAIL PROTECTED]>
He's always with us!



Re: UTF8 behavior under -T (Taint) mode

2004-01-01 Thread Dan Kogai
On Jan 01, 2004, at 21:49, Masanori HATA wrote:
> Sorry, no. Since the case which I would like to suggest
> seems not to be fatal. Perl would not die, but it would
> take the tainted value as a Non-UTF8 string.
>
> My sample code is like below (test.pl):
> -
> utf8::decode(my $text0 = "\x{3042}"  ); # clean
> utf8::decode(my $arg   = $ARGV[0]); # tainted
> utf8::decode(my $text1 = "$arg$text0"); # tainted
> utf8::decode(my $text2 = "$text0$arg"); # tainted
> print length($text1), "\n";
> print length($text2), "\n";
> -

Aha!  I see your point at last.  And I found your argument was correct.

> When I run this code with 'perl -T test.pl a', the result is:

To clarify your point, I have modified your script with Devel::Peek.  Pay 
attention to the $text1 result.

without -T
% perl test.pl a
SV = PV(0x812354) at 0x80a960
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x428090 "a\343\201\202"\0 [UTF8 "a\x{3042}"]
  CUR = 4
  LEN = 5
2
SV = PV(0x812e10) at 0x80f2a8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x405150 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
  CUR = 4
  LEN = 5
2
with -T
% perl -T test.pl a
SV = PVMG(0x819a88) at 0x80a954
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
  IV = 0
  NV = 0
  PV = 0x428540 "a\343\201\202"\0
  CUR = 4
  LEN = 5
  MAGIC = 0x405480
    MG_VIRTUAL = &PL_vtbl_taint
    MG_TYPE = PERL_MAGIC_taint(t)
    MG_LEN = 1
4
SV = PVMG(0x819af4) at 0x80f69c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x4054e0 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
  CUR = 4
  LEN = 5
  MAGIC = 0x4010d0
    MG_VIRTUAL = &PL_vtbl_taint
    MG_TYPE = PERL_MAGIC_taint(t)
    MG_LEN = 1
2
I am not sure how severe it is but this is a bug indeed.

> (My system is perl5.8.1 MSWin32-X86-multi-thread)

I have duplicated the result with Perl 5.8.2 on Mac OS X as well as 
[EMAIL PROTECTED] on FreeBSD.  And using Encode::decode_utf8 does not 
help either because it simply calls utf8::decode.  And you can't use 
Encode::decode("utf8", ...) in this particular case because 
Encode::decode() checks and clobbers at "Cannot decode string with wide 
characters".  Hmm

Dan the Perl5 Porter



Re: UTF8 behavior under -T (Taint) mode

2004-01-01 Thread Masanori HATA
Thanks for replying, Dan-san.

At 18:09 04/01/01 +0900, Dan wrote:
>>It seems that utf8::decode() does not work for
>>any tainted variables under the -T (Taint) mode.

>What drove you to such a conclusion?  It does work.  Try something like
>
>   perl -T -le 'utf8::decode($ARGV[0])' something
>
>and see it for yourself.  Did perl die with "Insecure ..." message?

Sorry, no. Since the case which I would like to suggest
seems not to be fatal. Perl would not die, but it would
take the tainted value as a Non-UTF8 string.

My sample code is like below (test.pl):
-
utf8::decode(my $text0 = "\x{3042}"  ); # clean
utf8::decode(my $arg   = $ARGV[0]); # tainted
utf8::decode(my $text1 = "$arg$text0"); # tainted
utf8::decode(my $text2 = "$text0$arg"); # tainted

print length($text1), "\n";
print length($text2), "\n";
-

When I run this code with 'perl -T test.pl a', the result is:

4
2

and when I run this code with 'perl test.pl a', the result is:

2
2

So I guess $text1 was not treated as a UTF8 string under
the taint mode.
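
One way to confirm that guess without relying on length() alone is to look at the UTF8 flag directly with utf8::is_utf8. The sketch below mirrors test.pl; the describe helper and the fallback to "a" when no argument is given are assumptions added for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Assumption: fall back to "a" so the script also runs without arguments.
my $arg = @ARGV ? $ARGV[0] : "a";

# Same sequence of calls as in test.pl.
utf8::decode(my $text0 = "\x{3042}");   # clean
utf8::decode($arg);                     # tainted under -T when taken from @ARGV
utf8::decode(my $text1 = "$arg$text0");
utf8::decode(my $text2 = "$text0$arg");

# Hypothetical helper: report the length and the internal UTF8 flag.
sub describe {
    my ($name, $str) = @_;
    return sprintf "%s: length=%d utf8_flag=%s",
        $name, length($str), (utf8::is_utf8($str) ? "on" : "off");
}

# ${^TAINT} is true when perl runs in taint mode.
print "taint mode: ", (${^TAINT} ? "on" : "off"), "\n";
print describe('$text1', $text1), "\n";
print describe('$text2', $text2), "\n";
```

Running it once with -T and once without, and comparing the two lines for $text1, shows whether the flag was lost on that perl.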

(My system is perl5.8.1 MSWin32-X86-multi-thread)

I would like to know any reasons for this problem.


-- 
Masanori HATA
<[EMAIL PROTECTED]>
He's always with us!


Re: UTF8 behavior under -T (Taint) mode

2004-01-01 Thread Dan Kogai
On Jan 01, 2004, at 12:32, Masanori HATA wrote:
> Hello,
>
> I have a simple question:
>
> It seems that utf8::decode() does not work for
> any tainted variables under the -T (Taint) mode.
> Is it right?

Wrong.

What drove you to such a conclusion?  It does work.  Try something like

  perl -T -le 'utf8::decode($ARGV[0])' something

and see it for yourself.  Did perl die with "Insecure ..." message?
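
The one-liner produces no output of its own, so as a small sketch of what "it works" means here: utf8::decode converts the octets in place and returns true on success. The variable names below are illustrative only.

```perl
use strict;
use warnings;

# Three octets: the UTF-8 encoding of U+3042 (HIRAGANA LETTER A).
my $octets = "\xE3\x81\x82";
my $before = length($octets);

# utf8::decode modifies its argument in place and returns true on success.
my $ok    = utf8::decode($octets);
my $after = length($octets);

printf "decoded=%s length %d -> %d code point U+%04X\n",
    ($ok ? "yes" : "no"), $before, $after, ord($octets);
# prints: decoded=yes length 3 -> 1 code point U+3042
```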

Dan the Perl5 Porter