On Tue, Jun 07, 2005 at 04:28:08PM -0000, Nicholas Clark wrote:
> ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target;
> $target =~ /$term/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately
> after start byte 0xe9) in pattern match (m//) at -e line 1.
>

It turns out perl is totally borked for $utf8 =~ /latin1/i and
$latin1 =~ /$utf8/i unless all the chars happen to be < 0x80.

The patch below fixes the first case: the fault was in ibcmp_utf8(), which
was calling to_utf8_fold() without first converting the char to utf8.

The second case fails in S_find_byclass() in regexec.c:

    while (s <= e) {
        if ( (*(U8*)s == c1 || *(U8*)s == c2)
             && (ln == 1 || !(OP(c) == EXACTF
                              ? ibcmp(s, m, ln)
                              : ibcmp_locale(s, m, ln)))
             && (norun || regtry(prog, s)) )
            goto got_it;
        s++;
    }

where it calls ibcmp(), but s is a pointer to a latin1 char while m is a
pointer to a utf8 char. I don't really understand this code well enough to
feel confident fixing it. In fact I'm not even totally confident in my fix
for the first case, hence I'm Ccing everyone's Favourite Finn.

Dave.

-- 
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged."
    -- Lady Croom - Arcadia


Change 25095 by [EMAIL PROTECTED] on 2005/07/08 01:43:24

    [perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning
    $utf8 =~ /latin/i didn't match.
    Also added TODO for $latin =~ /utf8/i which also fails

Affected files ...

... //depot/perl/t/op/pat.t#222 edit
... //depot/perl/utf8.c#239 edit

Differences ...

==== //depot/perl/t/op/pat.t#222 (xtext) ====

@@ -6,7 +6,7 @@
 
 $| = 1;
 
-print "1..1178\n";
+print "1..1180\n";
 
 BEGIN {
     chdir 't' if -d 't';
@@ -3364,4 +3364,14 @@
     my $psycho=join "|",@normal,map chr $_,255..20000;
     ok(('these'=~/($psycho)/) && $1 eq 'these','Pyscho');
 }
-# last test 1178
+
+# [perl #36207] mixed utf8 / latin-1 and case folding
+
+{
+    my $u = "\xe9\x{100}";
+    chop $u;
+    ok($u =~ /\xe9/i, "utf8/latin");
+    ok("\xe9" =~ /$u/i, "# TODO latin/utf8");
+}
+
+# last test 1180

==== //depot/perl/utf8.c#239 (text) ====

@@ -2037,7 +2037,7 @@
        if (u1)
            to_utf8_fold(p1, foldbuf1, &foldlen1);
        else {
-           natbuf[0] = *p1;
+           uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
            to_utf8_fold(natbuf, foldbuf1, &foldlen1);
        }
        q1 = foldbuf1;
@@ -2047,7 +2047,7 @@
        if (u2)
            to_utf8_fold(p2, foldbuf2, &foldlen2);
        else {
-           natbuf[0] = *p2;
+           uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
            to_utf8_fold(natbuf, foldbuf2, &foldlen2);
        }
        q2 = foldbuf2;
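
For anyone wondering why copying the raw byte into natbuf produced the
"Malformed UTF-8" warning: 0xe9 on its own looks like the start byte of a
three-byte UTF-8 sequence, so to_utf8_fold() went looking for continuation
bytes that weren't there. The standalone sketch below shows the conversion
the byte needs first. It is only an illustration, not the perl code:
latin1_to_utf8() is a hypothetical helper invented here, and it assumes an
ASCII platform (the real patch goes through NATIVE_TO_UNI and
uvuni_to_utf8(), which also cope with EBCDIC).

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the perl API: encode a single
 * Latin-1 byte as UTF-8. Assumes an ASCII (non-EBCDIC) platform.
 * Writes 1 or 2 bytes to out and returns how many were written. */
size_t latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        /* ASCII range: the byte is already valid UTF-8 */
        out[0] = c;
        return 1;
    }
    /* 0x80..0xff: two-byte form 110xxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xC0 | (c >> 6));
    out[1] = (unsigned char)(0x80 | (c & 0x3F));
    return 2;
}
```

With this, latin1_to_utf8(0xE9, buf) writes the two bytes 0xC3 0xA9, which
is what a UTF-8-only routine needs to see for e-acute, whereas
latin1_to_utf8('A', buf) leaves the byte unchanged. That is also why the
old code happened to work as long as every char was ASCII.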