Re: [perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning

Dave Mitchell Thu, 07 Jul 2005 19:20:53 -0700

On Tue, Jun 07, 2005 at 04:28:08PM -0000, Nicholas Clark wrote:
> ./perl -Ilib -we '$term = "\xe9"; $target = "\xe9\x{100}"; chop $target; 
> $target =~ /$term/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
> after start byte 0xe9) in pattern match (m//) at -e line 1.
>


It turns out perl is totally borked for

    $utf8 =~ /latin1/i
and

    $latin1 =~ /$utf8/i

unless all the chars happen to be <= 0x7f.

The patch below fixes the first case:  the fault was in ibcmp_utf8(),
which was calling to_utf8_fold() without first converting the char to
utf8.

The second case fails in S_find_byclass() in regexec.c:

                    while (s <= e) {
                        if ( (*(U8*)s == c1 || *(U8*)s == c2)
                             && (ln == 1 || !(OP(c) == EXACTF
                                              ? ibcmp(s, m, ln)
                                              : ibcmp_locale(s, m, ln)))
                             && (norun || regtry(prog, s)) )
                            goto got_it;
                        s++;
                    }

where it calls ibcmp(), but s is a pointer to a latin1 char while m is a
pointer to a utf8 char. I don't really understand this code well enough to
feel confident fixing it. In fact I'm not even totally confident in my fix
for the first case, hence I'm Ccing everyone's Favourite Finn.

Dave.

-- 
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom - Arcadia


Change 25095 by [EMAIL PROTECTED] on 2005/07/08 01:43:24

        [perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning
        $utf8 =~ /latin/i didn't match. 
        Also added TODO for $latin =~ /utf8/i which also fails

Affected files ...

... //depot/perl/t/op/pat.t#222 edit
... //depot/perl/utf8.c#239 edit

Differences ...

==== //depot/perl/t/op/pat.t#222 (xtext) ====

@@ -6,7 +6,7 @@
 
 $| = 1;
 
-print "1..1178\n";
+print "1..1180\n";
 
 BEGIN {
     chdir 't' if -d 't';
@@ -3364,4 +3364,14 @@
     my $psycho=join "|",@normal,map chr $_,255..20000;
     ok(('these'=~/($psycho)/) && $1 eq 'these','Pyscho');
 }
-# last test 1178
+
+# [perl #36207] mixed utf8 / latin-1 and case folding
+
+{
+    my $u = "\xe9\x{100}";
+    chop $u;
+    ok($u =~ /\xe9/i, "utf8/latin");
+    ok("\xe9" =~ /$u/i, "# TODO latin/utf8");
+}
+
+# last test 1180

==== //depot/perl/utf8.c#239 (text) ====

@@ -2037,7 +2037,7 @@
               if (u1)
                    to_utf8_fold(p1, foldbuf1, &foldlen1);
               else {
-                   natbuf[0] = *p1;
+                   uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p1)));
                    to_utf8_fold(natbuf, foldbuf1, &foldlen1);
               }
               q1 = foldbuf1;
@@ -2047,7 +2047,7 @@
               if (u2)
                    to_utf8_fold(p2, foldbuf2, &foldlen2);
               else {
-                   natbuf[0] = *p2;
+                   uvuni_to_utf8(natbuf, (UV) NATIVE_TO_UNI(((UV)*p2)));
                    to_utf8_fold(natbuf, foldbuf2, &foldlen2);
               }
               q2 = foldbuf2;
