Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
> I won't have time to look into this anytime soon, but I think the fix
> for the second case shouldn't be too hard to find. First of all, if
> either the matcher or the matchee are UTF-8, we should never ever ever
> end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
> combination of flags to indicate which or both is UTF-8.

if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced
with?

--
The Enterprise is captured by a vastly superior alien intelligence which
does not put them on trial.
    -- Things That Never Happen in "Star Trek" #10
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On Fri, Jul 08, 2005 at 05:07:26PM +0200, demerphq wrote:
> Attached patch fixes it in the TRIE(FL?)? code afaict.
>
> D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if qq[\xe9]=~/(abc|$x)/i"
> 1
>
> D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if $x=~/(abc|\xe9)/i"
> 1

thanks, applied as change #25106

I also added some tests.

--
That he said that that that that is is is debatable, is debatable.
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
Dave Mitchell wrote:
> On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
> > I won't have time to look into this anytime soon, but I think the fix
> > for the second case shouldn't be too hard to find. First of all, if
> > either the matcher or the matchee are UTF-8, we should never ever ever
> > end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
> > combination of flags to indicate which or both is UTF-8.
>
> if ibcmp is replaced with ibcmp_utf8, what should ibcmp_locale be replaced
> with?

It depends, and not replaced with. If it is a UTF-8 locale, the
ibcmp_utf8() should be used, otherwise I *guess* the ibcmp_locale()
as it is now should be used.
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
> it turns out perl is totally borked for

That bugs of this level are found only now is telling something
about how much people actually use the Unicode support of Perl.
Well, and something about how suckily I tested it.

> The patch below fixes the first case: the fault was in ibcmp_utf8(),

Looks good, off-hand. (Feel free to quote me on haggis, though.)

> which was calling to_utf8_fold() without first converting the char to
> utf8. The second case fails in S_find_byclass() in regexec.c:
>
>     while (s <= e) {
>         if ( (*(U8*)s == c1 || *(U8*)s == c2)
>              && (ln == 1 || !(OP(c) == EXACTF
>                               ? ibcmp(s, m, ln)
>                               : ibcmp_locale(s, m, ln)))
>              && (norun || regtry(prog, s)) )
>             goto got_it;
>         s++;
>     }
>
> where it calls ibcmp() but s is a pointer to a latin1 while m is pointer
> to a utf8 char. I don't really understand this code well enough to feel
> confident fixing it.

I won't have time to look into this anytime soon, but I think the fix
for the second case shouldn't be too hard to find. First of all, if
either the matcher or the matchee are UTF-8, we should never ever ever
end up calling the bare ibcmp(), but instead ibcmp_utf8() with the right
combination of flags to indicate which or both is UTF-8.

You might want to extend the tests also to run-time:

    my $utf8 = "\x{e9}\x{100}"; chop $utf8;
    my $latin1 = "\x{e9}";

    ok($utf8 =~ /\xe9/i);
    ok("\xe9" =~ /$utf8/i);
    ok($utf8 =~ /$latin1/i);
    ok($latin1 =~ /$utf8/i);
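[Illustrative note] The rule stated above (never compare bytes directly once either side is UTF-8) can be sketched in standalone C. This is not perl's ibcmp_utf8(): the names next_cp, fold_latin1 and mixed_fold_eq are hypothetical, and the sketch assumes an ASCII platform, handles only code points below 0x100, and uses simple one-to-one lowercase folding, so it ignores cases such as U+00DF folding to "ss". Unlike ibcmp(), it returns 1 on a match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read one code point below 0x100 from s.
 * Returns the number of bytes consumed, or 0 if the input is empty,
 * malformed, or outside the range this sketch supports. */
static size_t next_cp(const unsigned char *s, size_t len, int is_utf8,
                      unsigned *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (!is_utf8 || s[0] < 0x80) {      /* Latin-1 byte, or ASCII */
        *cp = s[0];
        return 1;
    }
    /* U+0080..U+00FF is encoded as C2 xx or C3 xx in UTF-8 */
    if ((s[0] == 0xC2 || s[0] == 0xC3) && len >= 2
        && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
        return 2;
    }
    return 0;
}

/* Simple lowercase mapping for Latin-1 (0xD7 is the multiplication sign). */
static unsigned fold_latin1(unsigned cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return cp + 0x20;
    return cp;
}

/* Case-insensitive equality where each side carries its own UTF-8 flag.
 * Both sides are decoded to code points before folding and comparing. */
int mixed_fold_eq(const unsigned char *a, size_t alen, int a_utf8,
                  const unsigned char *b, size_t blen, int b_utf8)
{
    while (alen > 0 && blen > 0) {
        unsigned ca, cb;
        size_t na = next_cp(a, alen, a_utf8, &ca);
        size_t nb = next_cp(b, blen, b_utf8, &cb);
        if (na == 0 || nb == 0 || fold_latin1(ca) != fold_latin1(cb))
            return 0;
        a += na; alen -= na;
        b += nb; blen -= nb;
    }
    return alen == 0 && blen == 0;
}
```

With this sketch, the Latin-1 byte 0xE9 and the UTF-8 sequence 0xC3 0x89 (U+00C9) compare equal in either order, which is the behaviour the four run-time tests above ask of the regex engine.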
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On 7/8/05, Dave Mitchell [EMAIL PROTECTED] wrote:
> On Tue, Jun 07, 2005 at 04:28:08PM -, Nicholas Clark wrote:
> > ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i'
> > Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.
>
> it turns out perl is totally borked for $utf8 =~ /latin1/i and
> $latin1 =~ /$utf8/i unless all the chars happen to be <= 0x7f.

The case where the pattern is /(foo|bar)/ is handled by a totally
different codepath in blead, does it also fail there? I seem to recall
that I put in tests for this, but possibly I'm wrong.

I'm flying on holiday in less than 24 hours and I doubt I'll be able to
check until I return at the end of the month.

cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On Fri, Jul 08, 2005 at 09:45:32AM +0300, Jarkko Hietaniemi wrote:
> > it turns out perl is totally borked for
>
> That bugs of this level are found only now is telling something
> about how much people actually use the Unicode support of Perl.
> Well, and something about how suckily I tested it.

The bugs are actually subtle in that they are in the mixed bytes (mumble
Latin 1)/UTF-8 code. People could be quite happily using the Unicode
support, where all their input data is already in UTF-8, or is in (say)
Shift-JIS and hence has to be converted on input, and hence never have
cause to go down these code paths.

Nicholas Clark
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
> > it turns out perl is totally borked for $utf8 =~ /latin1/i and
> > $latin1 =~ /$utf8/i unless all the chars happen to be <= 0x7f.
>
> The case where the pattern is /(foo|bar)/ is handled by a totally
> different codepath in blead, does it also fail there? I seem to recall
> that I put in tests for this, but possibly I'm wrong.
>
> I'm flying on holiday in less than 24 hours and I doubt I'll be able to
> check until I return at the end of the month.

$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
1
$ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

--
Never do today what you can put off till tomorrow.
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On 7/8/05, Dave Mitchell [EMAIL PROTECTED] wrote:
> On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
> > > it turns out perl is totally borked for $utf8 =~ /latin1/i and
> > > $latin1 =~ /$utf8/i unless all the chars happen to be <= 0x7f.
> >
> > The case where the pattern is /(foo|bar)/ is handled by a totally
> > different codepath in blead, does it also fail there? I seem to recall
> > that I put in tests for this, but possibly I'm wrong.
> >
> > I'm flying on holiday in less than 24 hours and I doubt I'll be able to
> > check until I return at the end of the month.
>
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
> 1
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

Well, I guess half wrong is better than all wrong but that still sucks.
Maybe I'll get some time to put together a patch on the plane.

thanks tho.

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On 7/8/05, Dave Mitchell [EMAIL PROTECTED] wrote:
> On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
> > > it turns out perl is totally borked for $utf8 =~ /latin1/i and
> > > $latin1 =~ /$utf8/i unless all the chars happen to be <= 0x7f.
> >
> > The case where the pattern is /(foo|bar)/ is handled by a totally
> > different codepath in blead, does it also fail there? I seem to recall
> > that I put in tests for this, but possibly I'm wrong.
> >
> > I'm flying on holiday in less than 24 hours and I doubt I'll be able to
> > check until I return at the end of the month.
>
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
> 1
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

Attached patch fixes it in the TRIE(FL?)? code afaict.

D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if qq[\xe9]=~/(abc|$x)/i"
1

D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if $x=~/(abc|\xe9)/i"
1

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

--- regexec.c.bak 2005-07-08 17:02:45.171875000 +0200
+++ regexec.c 2005-07-08 17:03:02.15625 +0200
@@ -2612,7 +2612,7 @@
                 if ( base ) {
 
-                    if ( do_utf8 || UTF ) {
+                    if ( do_utf8 ) {
                         if ( foldlen>0 ) {
                             uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
                             foldlen -= len;
@@ -2678,7 +2678,7 @@
                 if ( base ) {
 
-                    if ( do_utf8 || UTF ) {
+                    if ( do_utf8 ) {
                         uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
                     } else {
                         uvc = (U32)*uc;
Re: [perl #36207] UTF8/Latin 1/i regexp Malformed character warning
On Tue, Jun 07, 2005 at 04:28:08PM -, Nicholas Clark wrote:
> ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

it turns out perl is totally borked for $utf8 =~ /latin1/i and
$latin1 =~ /$utf8/i unless all the chars happen to be <= 0x7f.

The patch below fixes the first case: the fault was in ibcmp_utf8(),
which was calling to_utf8_fold() without first converting the char to
utf8. The second case fails in S_find_byclass() in regexec.c:

    while (s <= e) {
        if ( (*(U8*)s == c1 || *(U8*)s == c2)
             && (ln == 1 || !(OP(c) == EXACTF
                              ? ibcmp(s, m, ln)
                              : ibcmp_locale(s, m, ln)))
             && (norun || regtry(prog, s)) )
            goto got_it;
        s++;
    }

where it calls ibcmp() but s is a pointer to a latin1 while m is pointer
to a utf8 char. I don't really understand this code well enough to feel
confident fixing it. In fact I'm not even totally confident in my fix for
the first case, hence I'm Ccing everyone's Favourite Finn.

Dave.

--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom - "Arcadia"

Change 25095 by [EMAIL PROTECTED] on 2005/07/08 01:43:24

	[perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning
	$utf8 =~ /latin/i didn't match.
	Also added TODO for $latin =~ /utf8/i which also fails

Affected files ...

... //depot/perl/t/op/pat.t#222 edit
... //depot/perl/utf8.c#239 edit

Differences ...

//depot/perl/t/op/pat.t#222 (xtext)

@@ -6,7 +6,7 @@
 
 $| = 1;
 
-print "1..1178\n";
+print "1..1180\n";
 
 BEGIN {
     chdir 't' if -d 't';
@@ -3364,4 +3364,14 @@
     my $psycho=join "|",@normal,map chr $_,255..2;
     ok(('these'=~/($psycho)/) && $1 eq 'these','Pyscho');
 }
-# last test 1178
+
+# [perl #36207] mixed utf8 / latin-1 and case folding
+
+{
+    my $u = "\xe9\x{100}";
+    chop $u;
+    ok($u =~ /\xe9/i, "utf8/latin");
+    ok("\xe9" =~ /$u/i, "# TODO latin/utf8");
+}
+
+# last test 1180

//depot/perl/utf8.c#239 (text)

@@ -2037,7 +2037,7 @@
                if (u1)
                     to_utf8_fold(p1, foldbuf1, &foldlen1);
                else {
-                    natbuf[0] = *p1;
+                    uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
                     to_utf8_fold(natbuf, foldbuf1, &foldlen1);
                }
                q1 = foldbuf1;
@@ -2047,7 +2047,7 @@
                if (u2)
                     to_utf8_fold(p2, foldbuf2, &foldlen2);
                else {
-                    natbuf[0] = *p2;
+                    uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
                     to_utf8_fold(natbuf, foldbuf2, &foldlen2);
                }
                q2 = foldbuf2;
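
[Illustrative note] The utf8.c hunks replace a raw byte copy into natbuf with a real conversion to UTF-8 before folding. A minimal standalone sketch of why that matters follows. It is not perl's uvuni_to_utf8(): latin1_to_utf8 and utf8_expected_len are hypothetical names, an ASCII platform is assumed (so NATIVE_TO_UNI is treated as the identity), and the length table follows standard UTF-8 rather than perl's extended form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a perl API: encode one Latin-1 byte
 * (code point 0x00..0xFF) as UTF-8 into out, which must have room
 * for 2 bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(unsigned char c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {                 /* ASCII: identical in both encodings */
        out[0] = c;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte 110000xx */
    out[1] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation 10xxxxxx */
    return 2;
}

/* How many bytes a standard UTF-8 sequence claims, judged from its
 * start byte alone; 0 if the byte cannot start a sequence. */
size_t utf8_expected_len(unsigned char start)
{
    if (start < 0x80)
        return 1;
    if (start >= 0xC2 && start <= 0xDF)
        return 2;
    if (start >= 0xE0 && start <= 0xEF)
        return 3;
    if (start >= 0xF0 && start <= 0xF4)
        return 4;
    return 0;
}
```

Here latin1_to_utf8(0xE9, buf) writes 0xC3 0xA9, whereas utf8_expected_len(0xE9) is 3: a raw 0xE9 handed to a UTF-8 decoder looks like the start of a three-byte sequence, which is consistent with the "immediately after start byte 0xe9" wording of the warning in this ticket.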
[perl #36207] UTF8/Latin 1/i regexp Malformed character warning
# New Ticket Created by Nicholas Clark
# Please include the string: [perl #36207]
# in the subject line of all future correspondence about this issue.
# URL: https://rt.perl.org/rt3/Ticket/Display.html?id=36207

This is a bug report for perl from [EMAIL PROTECTED],
generated with the help of perlbug 1.35 running under perl v5.9.3.

-----------------------------------------------------------------
[Please enter your report here]

Stig came across this:

./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; $target =~ /$term/i'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xe9) in pattern match (m//) at -e line 1.

I'm not sure if this is similar or the same to a bug previously reported.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=core
    severity=medium
---
Site configuration information for perl v5.9.3:

Configured by nick at Tue Jun 7 15:04:19 BST 2005.

Summary of my perl5 (revision 5 version 9 subversion 3 patch 24148) configuration:
  Platform:
    osname=linux, osvers=2.4.21-15.0.3.elsmp, archname=i686-linux
    uname='linux switch 2.4.21-15.0.3.elsmp #1 smp wed jul 7 09:34:05 edt 2004 i686 i686 i386 gnulinux '
    config_args='-Dusedevel=y -Dcc=ccache gcc -Dld=gcc -Ubincompat5005 -Uinstallusrbinperl [EMAIL PROTECTED] [EMAIL PROTECTED] -Dinc_version_list= -Dinc_version_list_init=0 -Doptimize=-g -Uusethreads -Uuse64bitint -Duseperlio -Dusemymalloc -Dprefix=~/Sandpit/snap5.9.x-24728 -Dinstallman1dir=none -Dinstallman3dir=none -de'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='ccache gcc', ccflags ='-DDEBUGGING -DPERL_COPY_ON_WRITE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-DDEBUGGING -DPERL_COPY_ON_WRITE -fno-strict-aliasing -pipe -Wdeclaration-after-statement -I/usr/local/include'
    ccversion='', gccversion='3.2.3 20030502 (Red Hat Linux 3.2.3-49)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

---
@INC for perl v5.9.3:
    lib
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/5.9.3/i686-linux
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/5.9.3
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl/5.9.3/i686-linux
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl/5.9.3
    /home/nick/Sandpit/snap5.9.x-24728/lib/perl5/site_perl
    .

---
Environment for perl v5.9.3:
    HOME=/home/nick
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nick/bin:/usr/kerberos/bin:/usr/lib/ccache/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/usr/local/sbin:/sbin:/usr/sbin
    PERL_BADLANG (unset)
    SHELL=/bin/bash