Re: [perl #36207] UTF8/Latin 1/i regexp "Malformed character" warning

demerphq Fri, 08 Jul 2005 08:07:37 -0700

On 7/8/05, Dave Mitchell <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 08, 2005 at 09:24:42AM +0200, demerphq wrote:
> > > it turns out perl is totally borked for
> > >
> > >     $utf8 =~ /latin1/i
> > > and
> > >
> > >     $latin1 =~ /$utf8/i
> > >
> > > unless all the chars happen to be < 0x7f.
> >
> > The case where the pattern is /(foo|bar)/ is handled by a totally
> > different codepath in blead; does it also fail there? I seem to recall
> > that I put in tests for this, but possibly I'm wrong. I'm flying on
> > holiday in less than 24 hours and I doubt I'll be able to check until I
> > return at the end of the month.
> 
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if $x=~/(abc|\xe9)/i'
> 1
> 
> $ ./perl -Ilib -wle '$x="\xe9\x{100}";chop$x; print 1 if "\xe9"=~/(abc|$x)/i'
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
> after start byte 0xe9) in pattern match (m//) at -e line 1.
>


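The attached patch fixes it in the TRIE(FL?)? code, afaict. The trie
matcher was decoding the target string as UTF-8 whenever either the target
or the pattern was UTF-8 (do_utf8 || UTF); it should look only at the
target (do_utf8).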

D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if qq[\xe9]=~/(abc|$x)/i"
1

D:\dev\perl\live>perl -Ilib -wle "$x=qq[\xe9\x{100}]; chop $x; print 1 if $x=~/(abc|\xe9)/i"
1
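
For anyone following along, here is a rough standalone C sketch of why the
old condition produced the warning. It is not the regexec.c code: decode_utf8
and read_char are hypothetical helpers written for this mail, decode_utf8 is
a deliberately small stand-in for utf8n_to_uvuni (1 to 3 byte sequences only,
no overlong checks), and target_is_utf8 plays the role of do_utf8. The point
it tries to show is that how a byte of the target is read should depend only
on the target's own encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for utf8n_to_uvuni(): decodes one UTF-8 sequence of
 * 1 to 3 bytes. Returns the code point and stores the byte length in *len,
 * or returns -1 if the sequence is malformed or not covered by this sketch. */
static long decode_utf8(const unsigned char *s, size_t remaining, size_t *len)
{
    assert(remaining > 0);

    unsigned char b0 = s[0];
    size_t need;
    long cp;

    if (b0 < 0x80) {                    /* ASCII: same in both encodings */
        *len = 1;
        return b0;
    } else if ((b0 & 0xE0) == 0xC0) {   /* start byte of a 2-byte sequence */
        need = 2;
        cp = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {   /* start byte of a 3-byte sequence */
        need = 3;
        cp = b0 & 0x0F;
    } else {
        return -1;                      /* stray continuation or longer form */
    }

    if (remaining < need)
        return -1;                      /* sequence runs off the buffer */

    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;                  /* unexpected non-continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *len = need;
    return cp;
}

/* Reads one character of the target string. Only the target's own encoding
 * decides how the bytes are interpreted; the pattern's encoding is
 * deliberately not a parameter here. */
static long read_char(const unsigned char *s, size_t remaining,
                      int target_is_utf8, size_t *len)
{
    if (target_is_utf8)
        return decode_utf8(s, remaining, len);
    *len = 1;                           /* Latin-1: one byte, one character */
    return s[0];
}
```

With the two bytes E9 00 (a Latin-1 e-acute followed by the terminating
NUL), read_char with target_is_utf8 set to 0 returns 0xE9, while forcing
target_is_utf8 to 1 returns -1, because 0xE9 looks like a 3-byte start byte
and 0x00 is not a continuation byte. That second case is roughly what
"do_utf8 || UTF" did when only the pattern was UTF-8.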

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

--- regexec.c.bak	2005-07-08 17:02:45.171875000 +0200
+++ regexec.c	2005-07-08 17:03:02.156250000 +0200
@@ -2612,7 +2612,7 @@
 
 		    if ( base ) {
 
-			if ( do_utf8 || UTF ) {
+			if ( do_utf8 ) {
 			    if ( foldlen>0 ) {
 				uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
 				foldlen -= len;
@@ -2678,7 +2678,7 @@
 
 		    if ( base ) {
 
-			if ( do_utf8 || UTF ) {
+			if ( do_utf8 ) {
 			    uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
 			} else {
 			    uvc = (U32)*uc;

