On 7/8/05, Dave Mitchell <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
> > > it turns out perl is totally borked for
> > >
> > >     $utf8 =~ /latin1/i
> > > and
> > >
> > >     $latin1 =~ /$utf8/i
> > >
> > > unless all the chars happen to be < 0x7f.
> >
> > The case where the pattern is /(foo|bar)/ is handled by a totally
> > different codepath in blead, does it also fail there? I seem to recall
> > that I put in tests for this, but possibly im wrong. Im flying on
> > holiday in less than 24 hours and i doubt Ill be able to check until i
> > return at the end of the month.
>
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
> 1
>
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately
> after start byte 0xe9) in pattern match (m//) at -e line 1.
>

The attached patch fixes it in the TRIE(FL?)? code, as far as I can tell.
With the patch applied, both cases print 1:

D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if qq[\xe9]=~/(abc|$x)/i"
1

D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if $x=~/(abc|\xe9)/i"
1

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

--- regexec.c.bak 2005-07-08 17:02:45.171875000 +0200
+++ regexec.c 2005-07-08 17:03:02.156250000 +0200
@@ -2612,7 +2612,7 @@
         if ( base ) {
-            if ( do_utf8 || UTF ) {
+            if ( do_utf8 ) {
                 if ( foldlen>0 ) {
                     uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
                     foldlen -= len;
@@ -2678,7 +2678,7 @@
         if ( base ) {
-            if ( do_utf8 || UTF ) {
+            if ( do_utf8 ) {
                 uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
             } else {
                 uvc = (U32)*uc;
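
For anyone reading the patch without the source at hand, here is a minimal standalone sketch of why the test should depend only on the target string's encoding. It is not the code in regexec.c. As I read the patch, do_utf8 describes the target string and UTF describes the pattern, so the old condition decoded a Latin-1 target as UTF-8 whenever the pattern was UTF-8. The names decode_utf8 and next_char are hypothetical, and the decoder is deliberately simplified (sequences of up to three bytes, no overlong checks); it is not Perl's utf8n_to_uvuni().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified decoder: reads one code point of up to three
 * bytes from s. Returns the number of bytes consumed, or 0 if the bytes
 * are not a well-formed sequence. */
static size_t decode_utf8(const unsigned char *s, size_t avail, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (avail == 0)
        return 0;
    if (s[0] < 0x80) {                      /* ASCII: one byte */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* start byte of a 2-byte sequence */
        if (avail < 2 || (s[1] & 0xC0) != 0x80)
            return 0;
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {            /* start byte of a 3-byte sequence */
        if (avail < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
            return 0;
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    return 0;                               /* stray continuation byte or longer form */
}

/* Reads the next character of the target string. target_is_utf8 plays the
 * role of do_utf8 in the patch: the pattern's encoding is not consulted. */
static size_t next_char(const unsigned char *s, size_t avail,
                        int target_is_utf8, uint32_t *cp)
{
    if (target_is_utf8)
        return decode_utf8(s, avail, cp);
    if (avail == 0)
        return 0;
    *cp = s[0];                             /* Latin-1: the byte is the code point */
    return 1;
}
```

Calling next_char() on the bytes E9 00 with target_is_utf8 set fails, because 0xE9 is the start byte of a three-byte sequence and 0x00 is not a continuation byte, which is the situation the "Malformed UTF-8 character" warning describes. With target_is_utf8 clear, the same byte is read directly as U+00E9.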