Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-10 Thread Dave Mitchell
On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
 I won't have time to look into this anytime soon, but I think the fix
 for the second case shouldn't be too hard to find.  First of all, if
 either the matcher or the matchee are UTF-8, we should never ever ever
 end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
 combination of flags to indicate which or both is UTF-8.

if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced
with?


-- 
The Enterprise is captured by a vastly superior alien intelligence which
does not put them on trial.
-- Things That Never Happen in Star Trek #10


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-10 Thread Dave Mitchell
On Fri, Jul 08, 2005 at 05:07:26PM +0200, demerphq wrote:
 Attached patch fixes it in the TRIE(FL?)? code afaict. 
 
 D:\dev\perl\liveperl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
 if qq[\xe9]=~/(abc|$x)/i"
 1
 
 D:\dev\perl\liveperl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
 if $x=~/(abc|\xe9)/i"
 1
 

thanks, applied as change #25106
I also added some tests.

-- 
That he said that that that that is is is debatable, is debatable.


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-10 Thread Jarkko Hietaniemi
Dave Mitchell wrote:
 On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
 
I won't have time to look into this anytime soon, but I think the fix
for the second case shouldn't be too hard to find.  First of all, if
either the matcher or the matchee are UTF-8, we should never ever ever
end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
combination of flags to indicate which or both is UTF-8.
 
 
 if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced
 with?

It depends, and it is not a matter of "replaced with".  If it is a UTF-8
locale, ibcmp_utf8() should be used; otherwise I *guess* ibcmp_locale()
should be used as it is now.

 



Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-08 Thread Jarkko Hietaniemi
 
 it turns out perl is totally borked for

That bugs of this level are found only now is telling something about
how much people actually use the Unicode support of Perl.  Well, and
something about how suckily I tested it.

 
 The patch below fixes the first case:  the fault was in ibcmp_utf8(),

Looks good, off-hand.  (Feel free to quote me on haggis, though.)

 which was calling to_utf8_fold() without first converting the char to
 utf8.
 
 The second case fails in S_find_byclass() in regexec.c:
 
   while (s <= e) {
       if ( (*(U8*)s == c1 || *(U8*)s == c2)
            && (ln == 1 || !(OP(c) == EXACTF
                             ? ibcmp(s, m, ln)
                             : ibcmp_locale(s, m, ln)))
            && (norun || regtry(prog, s)) )
           goto got_it;
       s++;
   }
 
 where it calls ibcmp() but s is a pointer to a latin1 while m is pointer
 to a utf8 char. I don't really understand this code well enough to feel
 confident fixing it.

I won't have time to look into this anytime soon, but I think the fix
for the second case shouldn't be too hard to find.  First of all, if
either the matcher or the matchee are UTF-8, we should never ever ever
end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
combination of flags to indicate which or both is UTF-8.

You might want to extend the tests also to run-time:

my $utf8   = "\x{e9}\x{100}"; chop $utf8;
my $latin1 = "\x{e9}";

ok($utf8   =~ /\xe9/i);
ok("\xe9"  =~ /$utf8/i);

ok($utf8   =~ /$latin1/i);
ok($latin1 =~ /$utf8/i);



Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-08 Thread demerphq
On 7/8/05, Dave Mitchell [EMAIL PROTECTED] wrote:
 On Tue, Jun 07, 2005 at 04:28:08PM -, Nicholas Clark wrote:
  ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target;
  $target =~ /$term/i'
  Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
  immediately after start byte 0xe9) in pattern match (m//) at -e line 1.
 
 
 it turns out perl is totally borked for
 
 $utf8 =~ /latin1/i
 and
 
 $latin1 =~ /$utf8/i
 
 unless all the chars happen to be < 0x7f.

The case where the pattern is /(foo|bar)/ is handled by a totally
different codepath in blead, does it also fail there? I seem to recall
that I put in tests for this, but possibly im wrong. Im flying on
holiday in less than 24 hours and i doubt Ill be able to check until i
return at the end of the month.
 
cheers,
yves


-- 
perl -Mre=debug -e /just|another|perl|hacker/


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-08 Thread Nicholas Clark
On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
  
  it turns out perl is totally borked for
 
 That bugs of this level are found only now is telling something about
 how much people actually use the Unicode support of Perl.  Well, and
 something about how suckily I tested it.

The bugs are actually subtle in that they are in the mixed bytes (mumble
Latin 1)/UTF-8 code. People could be quite happily using the Unicode support,
where all their input data is already in UTF-8, or is in (say) Shift-JIS and
hence has to be converted on input, and hence never have cause to go down
these code paths.

Nicholas Clark


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-08 Thread Dave Mitchell
On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
  it turns out perl is totally borked for
  
  $utf8 =~ /latin1/i
  and
  
  $latin1 =~ /$utf8/i
  
  unless all the chars happen to be < 0x7f.
 
 The case where the pattern is /(foo|bar)/ is handled by a totally
 different codepath in blead, does it also fail there? I seem to recall
 that I put in tests for this, but possibly im wrong. Im flying on
 holiday in less than 24 hours and i doubt Ill be able to check until i
 return at the end of the month.

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
1

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
after start byte 0xe9) in pattern match (m//) at -e line 1.


-- 
Never do today what you can put off till tomorrow.


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-08 Thread demerphq
On 7/8/05, Dave Mitchell [EMAIL PROTECTED] wrote:
 On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
   it turns out perl is totally borked for
  
   $utf8 =~ /latin1/i
   and
  
   $latin1 =~ /$utf8/i
  
   unless all the chars happen to be < 0x7f.
 
  The case where the pattern is /(foo|bar)/ is handled by a totally
  different codepath in blead, does it also fail there? I seem to recall
  that I put in tests for this, but possibly im wrong. Im flying on
  holiday in less than 24 hours and i doubt Ill be able to check until i
  return at the end of the month.
 
 $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
 1
 
 $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
 Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

Well, I guess half wrong is better than all wrong, but that still
sucks. Maybe I'll get some time to put together a patch on the plane.

thanks tho.

-- 
perl -Mre=debug -e /just|another|perl|hacker/


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-08 Thread demerphq
On 7/8/05, Dave Mitchell [EMAIL PROTECTED] wrote:
 On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
   it turns out perl is totally borked for
  
   $utf8 =~ /latin1/i
   and
  
   $latin1 =~ /$utf8/i
  
   unless all the chars happen to be < 0x7f.
 
  The case where the pattern is /(foo|bar)/ is handled by a totally
  different codepath in blead, does it also fail there? I seem to recall
  that I put in tests for this, but possibly im wrong. Im flying on
  holiday in less than 24 hours and i doubt Ill be able to check until i
  return at the end of the month.
 
 $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
 1
 
 $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
 Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
 after start byte 0xe9) in pattern match (m//) at -e line 1.
 

Attached patch fixes it in the TRIE(FL?)? code afaict. 

D:\dev\perl\liveperl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
if qq[\xe9]=~/(abc|$x)/i"
1

D:\dev\perl\liveperl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1
if $x=~/(abc|\xe9)/i"
1

Yves

-- 
perl -Mre=debug -e /just|another|perl|hacker/
--- regexec.c.bak	2005-07-08 17:02:45.171875000 +0200
+++ regexec.c	2005-07-08 17:03:02.15625 +0200
@@ -2612,7 +2612,7 @@
 
 		if ( base ) {
 
-			if ( do_utf8 || UTF ) {
+			if ( do_utf8 ) {
 			if ( foldlen>0 ) {
 uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
 foldlen -= len;
@@ -2678,7 +2678,7 @@
 
 		if ( base ) {
 
-			if ( do_utf8 || UTF ) {
+			if ( do_utf8 ) {
 			uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
 			} else {
 			uvc = (U32)*uc;


Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-07-07 Thread Dave Mitchell
On Tue, Jun 07, 2005 at 04:28:08PM -, Nicholas Clark wrote:
 ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target;
 $target =~ /$term/i'
 Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
 after start byte 0xe9) in pattern match (m//) at -e line 1.
 

it turns out perl is totally borked for

$utf8 =~ /latin1/i
and

$latin1 =~ /$utf8/i

unless all the chars happen to be < 0x7f.

The patch below fixes the first case:  the fault was in ibcmp_utf8(),
which was calling to_utf8_fold() without first converting the char to
utf8.
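
[Editor's illustration, not part of the original message.] A minimal sketch of what "converting the char to utf8" amounts to for one Latin-1 byte. It assumes an ASCII-based platform, so it skips the NATIVE_TO_UNI mapping that perl applies; `latin1_to_utf8` is a hypothetical helper name, not a perl API.

```c
#include <stddef.h>

/* Encode one Latin-1 byte (code point 0x00..0xFF) as UTF-8 into out[].
 * Returns the number of bytes written: 1 for ASCII, 2 otherwise.
 * Hypothetical helper for illustration only. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        out[0] = c;                                  /* ASCII is unchanged */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));       /* 110xxxxx lead byte */
    out[1] = (unsigned char)(0x80 | (c & 0x3F));     /* 10xxxxxx continuation */
    return 2;
}
```

With this, 0xE9 becomes the two bytes 0xC3 0xA9. Handing the raw 0xE9 byte to a routine that expects UTF-8 is what made it look like the start of a longer sequence.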

The second case fails in S_find_byclass() in regexec.c:

while (s <= e) {
    if ( (*(U8*)s == c1 || *(U8*)s == c2)
         && (ln == 1 || !(OP(c) == EXACTF
                          ? ibcmp(s, m, ln)
                          : ibcmp_locale(s, m, ln)))
         && (norun || regtry(prog, s)) )
        goto got_it;
    s++;
}

where it calls ibcmp() but s is a pointer to a latin1 while m is pointer
to a utf8 char. I don't really understand this code well enough to feel
confident fixing it. In fact I'm not even totally confident in my fix for
the first case, hence I'm Ccing everyone's Favourite Finn.
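
[Editor's illustration, not part of the original message.] The sketch below shows why a bare byte comparison cannot work when one side is Latin-1 and the other UTF-8, and one way around it: decode each side according to its own encoding flag, fold, then compare code points. All names are hypothetical. The folding covers only code points up to 0xFF and the decoder handles only 1- and 2-byte sequences, so this is a toy under those assumptions, not the fix that went into perl.

```c
#include <stddef.h>

/* Read one code point from s. If is_utf8 is nonzero the buffer is UTF-8
 * (this sketch handles only 1- and 2-byte sequences); otherwise each byte
 * is one Latin-1 code point. Returns bytes consumed, or 0 on error. */
size_t next_cp(const unsigned char *s, size_t avail, int is_utf8,
               unsigned int *cp)
{
    if (avail == 0)
        return 0;
    if (!is_utf8 || s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && avail >= 2 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned int)(s[0] & 0x1F) << 6) | (unsigned int)(s[1] & 0x3F);
        return 2;
    }
    return 0;
}

/* Simplified lowercase folding for code points up to 0xFF only. */
unsigned int fold_simple(unsigned int cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)   /* skip the multiply sign */
        return cp + 0x20;
    return cp;
}

/* Case-insensitive equality of two buffers that may use different
 * encodings. Returns 1 if equal, 0 otherwise. */
int mixed_ieq(const unsigned char *a, size_t alen, int a_utf8,
              const unsigned char *b, size_t blen, int b_utf8)
{
    while (alen > 0 && blen > 0) {
        unsigned int ca, cb;
        size_t na = next_cp(a, alen, a_utf8, &ca);
        size_t nb = next_cp(b, blen, b_utf8, &cb);
        if (na == 0 || nb == 0 || fold_simple(ca) != fold_simple(cb))
            return 0;
        a += na; alen -= na;
        b += nb; blen -= nb;
    }
    return alen == 0 && blen == 0;
}
```

Comparing the Latin-1 buffer {0xE9} with the UTF-8 buffer {0xC3, 0xA9} byte by byte fails on the first byte, while `mixed_ieq` treats them as the same character.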

Dave.

-- 
But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged. -- Lady Croom - Arcadia


Change 25095 by [EMAIL PROTECTED] on 2005/07/08 01:43:24

[perl #36207] UTF8/Latin 1/i regexp Malformed character warning
$utf8 =~ /latin/i didn't match. 
Also added TODO for $latin =~ /utf8/i which also fails

Affected files ...

... //depot/perl/t/op/pat.t#222 edit
... //depot/perl/utf8.c#239 edit

Differences ...

==== //depot/perl/t/op/pat.t#222 (xtext) ====

@@ -6,7 +6,7 @@
 
 $| = 1;
 
-print "1..1178\n";
+print "1..1180\n";
 
 BEGIN {
 chdir 't' if -d 't';
@@ -3364,4 +3364,14 @@
 my $psycho=join "|",@normal,map chr $_,255..2;
 ok(('these'=~/($psycho)/) && $1 eq 'these','Pyscho');
 }
-# last test 1178
+
+# [perl #36207] mixed utf8 / latin-1 and case folding
+
+{
+my $u = "\xe9\x{100}";
+chop $u;
+ok($u =~ /\xe9/i, "utf8/latin");
+ok("\xe9" =~ /$u/i, "# TODO latin/utf8");
+}
+
+# last test 1180

==== //depot/perl/utf8.c#239 (text) ====

@@ -2037,7 +2037,7 @@
    if (u1)
         to_utf8_fold(p1, foldbuf1, &foldlen1);
    else {
-        natbuf[0] = *p1;
+        uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
         to_utf8_fold(natbuf, foldbuf1, &foldlen1);
    }
    q1 = foldbuf1;
@@ -2047,7 +2047,7 @@
    if (u2)
         to_utf8_fold(p2, foldbuf2, &foldlen2);
    else {
-        natbuf[0] = *p2;
+        uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
         to_utf8_fold(natbuf, foldbuf2, &foldlen2);
    }
    q2 = foldbuf2;


[perl #36207] UTF8/Latin 1/i regexp Malformed character warning

2005-06-07 Thread via RT
# New Ticket Created by  Nicholas Clark 
# Please include the string:  [perl #36207]
# in the subject line of all future correspondence about this issue. 
# URL: https://rt.perl.org/rt3/Ticket/Display.html?id=36207 


This is a bug report for perl from [EMAIL PROTECTED],
generated with the help of perlbug 1.35 running under perl v5.9.3.


-
[Please enter your report here]

Stig came across this:

./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target;
$target =~ /$term/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
after start byte 0xe9) in pattern match (m//) at -e line 1.

I'm not sure if this is similar or the same to a bug previously reported.
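
[Editor's illustration, not part of the original report.] For readers unfamiliar with the warning text: 0xE9 is "é" as a single Latin-1 byte, but read as UTF-8 it has the bit pattern of a three-byte start byte, so whatever follows it (here 0x00, presumably the string's trailing NUL) is reported as a non-continuation byte. The C sketch below performs only that structural check. The function names are hypothetical, and it is not perl's decoder: it ignores overlong forms and surrogates.

```c
#include <stddef.h>

/* Sequence length implied by a UTF-8 start byte, or 0 if the byte cannot
 * start a sequence (a continuation byte or an invalid lead byte). */
size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;   /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Returns 1 if the bytes at s form a structurally complete sequence within
 * avail bytes: a start byte followed by the right number of 10xxxxxx
 * continuation bytes. Returns 0 otherwise. */
int utf8_well_formed_at(const unsigned char *s, size_t avail)
{
    size_t len, i;

    if (avail == 0)
        return 0;
    len = utf8_expected_len(s[0]);
    if (len == 0 || len > avail)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                      /* non-continuation byte */
    }
    return 1;
}
```

Under these assumptions the Latin-1 bytes {0xE9, 0x00} are rejected, while the UTF-8 encoding of the same character, {0xC3, 0xA9}, passes.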

[Please do not change anything below this line]
-
---
Flags:
category=core
severity=medium
---
Site configuration information for perl v5.9.3:

Configured by nick at Tue Jun  7 15:04:19 BST 2005.

Summary of my perl5 (revision 5 version 9 subversion 3 patch 24148) 
configuration:
  Platform:
osname=linux, osvers=2.4.21-15.0.3.elsmp, archname=i686-linux
uname='linux switch 2.4.21-15.0.3.elsmp #1 smp wed jul 7 09:34:05 edt 2004 
i686 i686 i386 gnulinux '
config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 
-Uinstallusrbinperl [EMAIL PROTECTED] [EMAIL PROTECTED] -Dinc_version_list=  
-Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuse64bitint -Duseperlio 
-Dusemymalloc -Dprefix=~/Sandpit/snap5.9.x-24728 -Dinstallman1dir=none 
-Dinstallman3dir=none -de'
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef useithreads=undef usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=y, bincompat5005=undef
  Compiler:
cc='ccache gcc', ccflags ='-DDEBUGGING -DPERL_COPY_ON_WRITE 
-fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include 
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
optimize='-g',
cppflags='-DDEBUGGING -DPERL_COPY_ON_WRITE -fno-strict-aliasing -pipe 
-Wdeclaration-after-statement -I/usr/local/include'
ccversion='', gccversion='3.2.3 20030502 (Red Hat Linux 3.2.3-49)', 
gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', 
lseeksize=8
alignbytes=4, prototype=define
  Linker and Libraries:
ld='gcc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -ldl -lm -lcrypt -lutil -lc
perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
libc=/lib/libc-2.3.2.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version='2.3.2'
  Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:


---
@INC for perl v5.9.3:
lib
/home/nick/Sandpit/snap5.9.x-24728/lib/perl5/5.9.3/i686-linux
/home/nick/Sandpit/snap5.9.x-24728/lib/perl5/5.9.3
/home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl/5.9.3/i686-linux
/home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl/5.9.3
/home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl
.

---
Environment for perl v5.9.3:
HOME=/home/nick
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)

PATH=/home/nick/bin:/usr/kerberos/bin:/usr/lib/ccache/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/sbin:/sbin:/usr/sbin
PERL_BADLANG (unset)
SHELL=/bin/bash