Re: counting zero-length matches in sed(1)

Ingo Schwarze Thu, 21 Jul 2011 17:43:11 -0700

So, finally, here is the update of my patch to fix zero-length
matches in sed(1).  Sorry for the ridiculous delay.


Changes with respect to the first version:

 - Removed the now unused local variable re_off (noticed by otto@).
 - Make slen signed, or zero-length matches at EOF without a trailing
   newline would cause it to underflow (suspected by otto@ after
   looking at FreeBSD code, suspicion confirmed myself).
 - Advancing after zero-length matches was not quite right:
   A zero-length match before the final character of a file caused the
   zero-length match after the final character to get lost if the file
   ended without a terminating newline (noticed myself by changing
   echo "$in" to echo -n "$in" in substitute.sh of our own test suite);
   fixed by reworking the end of the do...while loop, which also
   makes it easier to understand.

This now passes our own test suite both as it is and with -n,
and otto@ checked that the remaining failures with respect to the
FreeBSD test suite look unrelated to this particular patch.

I'm leaving the original rationale at the end, for reference.


Index: process.c
===================================================================
RCS file: /cvs/src/usr.bin/sed/process.c,v
retrieving revision 1.15
diff -u -p -r1.15 process.c
--- process.c   27 Oct 2009 23:59:43 -0000      1.15
+++ process.c   22 Jul 2011 00:11:49 -0000
@@ -312,7 +312,7 @@ substitute(struct s_command *cp)
 {
        SPACE tspace;
        regex_t *re;
-       size_t re_off, slen;
+       regoff_t slen;
        int n, lastempty;
        char *s;
 
@@ -333,60 +333,54 @@ substitute(struct s_command *cp)
        n = cp->u.s->n;
        lastempty = 1;
 
-       switch (n) {
-       case 0:                                 /* Global */
-               do {
-                       if (lastempty || match[0].rm_so != match[0].rm_eo) {
-                               /* Locate start of replaced string. */
-                               re_off = match[0].rm_so;
-                               /* Copy leading retained string. */
-                               cspace(&SS, s, re_off, APPEND);
-                               /* Add in regular expression. */
-                               regsub(&SS, s, cp->u.s->new);
-                       }
+       do {
+               /* Copy the leading retained string. */
+               if (n <= 1 && match[0].rm_so)
+                       cspace(&SS, s, match[0].rm_so, APPEND);
 
-                       /* Move past this match. */
-                       if (match[0].rm_so != match[0].rm_eo) {
-                               s += match[0].rm_eo;
-                               slen -= match[0].rm_eo;
-                               lastempty = 0;
+               /* Skip zero-length matches right after other matches. */
+               if (lastempty || match[0].rm_so ||
+                   match[0].rm_so != match[0].rm_eo) {
+                       if (n <= 1) {
+                               /* Want this match: append replacement. */
+                               regsub(&SS, s, cp->u.s->new);
+                               if (n == 1)
+                                       n = -1;
                        } else {
-                               if (match[0].rm_so == 0)
-                                       cspace(&SS, s, match[0].rm_so + 1,
-                                           APPEND);
-                               else
-                                       cspace(&SS, s + match[0].rm_so, 1,
-                                           APPEND);
-                               s += match[0].rm_so + 1;
-                               slen -= match[0].rm_so + 1;
-                               lastempty = 1;
+                               /* Want a later match: append original. */
+                               if (match[0].rm_eo)
+                                       cspace(&SS, s, match[0].rm_eo, APPEND);
+                               n--;
                        }
-               } while (slen > 0 && regexec_e(re, s, REG_NOTBOL, 0, slen));
-               /* Copy trailing retained string. */
-               if (slen > 0)
-                       cspace(&SS, s, slen, APPEND);
-               break;
-       default:                                /* Nth occurrence */
-               while (--n) {
-                       s += match[0].rm_eo;
-                       slen -= match[0].rm_eo;
-                       if (!regexec_e(re, s, REG_NOTBOL, 0, slen))
-                               return (0);
                }
-               /* FALLTHROUGH */
-       case 1:                                 /* 1st occurrence */
-               /* Locate start of replaced string. */
-               re_off = match[0].rm_so + (s - ps);
-               /* Copy leading retained string. */
-               cspace(&SS, ps, re_off, APPEND);
-               /* Add in regular expression. */
-               regsub(&SS, s, cp->u.s->new);
-               /* Copy trailing retained string. */
+
+               /* Move past this match. */
                s += match[0].rm_eo;
                slen -= match[0].rm_eo;
+
+               /*
+                * After a zero-length match, advance one byte,
+                * and at the end of the line, terminate.
+                */
+               if (match[0].rm_so == match[0].rm_eo) {
+                       if (*s == '\n')
+                               slen = -1;
+                       else
+                               slen--;
+                       cspace(&SS, s++, 1, APPEND);
+                       lastempty = 1;
+               } else
+                       lastempty = 0;
+
+       } while (n >= 0 && slen >= 0 && regexec_e(re, s, REG_NOTBOL, 0, slen));
+
+       /* Did not find the requested number of matches. */
+       if (n > 1)
+               return (0);
+
+       /* Copy the trailing retained string. */
+       if (slen > 0)
                cspace(&SS, s, slen, APPEND);
-               break;
-       }
 
        /*
         * Swap the substitute space and the pattern space, and make sure


 ----- 8< ----- schnipp ----- >8 ----- 8< ----- schnapp ----- >8 -----

Otto Moerbeek wrote on Fri, Jun 24, 2011 at 09:41:00PM +0200:
> On Sat, Jun 18, 2011 at 08:16:05PM -0600, Ingo Schwarze wrote:

>> When a regular expression has zero-length matches in a string,
>> both sed(1) global replacement (/g) and replacement of numbered
>> instances (e.g. /2) are broken.  This is not even limited to sed -E.
>> Both Otto's patch and my own refactoring patch on misc@ only address
>> global replacement and leave numbered instances broken.
>> 
>> Already now, code in the three parts of the switch statement
>> is rather repetitive, and by fixing the numbered instances case,
>> code duplication would become much worse.  Thus, i propose
>> the following full rewrite of the whole switch statement.
>> 
>> This is a bit scary because...
>> Well, sed(1) is not exactly the tool we want to break.
>> Hence, a test suite is included.
>> Both my patch and GNU sed(1) pass the test suite.
>> 
>> Our -current sed fails 43 of these tests, quite a few of them
>> in spectacular ways, like
>> 
>>   $ echo | sed 's/$/x/2'   # expect ""
>>   x
>>   $ echo aaa | sed 's/a*/x/2'   # expect "aaa"
>>   aaax
>>   $ echo abc | sed -E 's/a|$/x/g'   # expect "xbcx"
>>   x
>>   $ echo abc | sed -E 's/()/x/2'    # expect "axbc"
>>   xabc
>>   $ echo abc | sed -E 's/()/x/42'   # expect "abc"
>>   xabc
>> 
>> One common source of confusion on misc@ was that Perl allows
>> empty matches right after other matches, like in:
>> 
>>   $ perl -Mstrict -we '$_ = "abc"; s/b|()/x/g; print "$_\n";'
>>   xaxxcx
>>   $ perl -Mstrict -we '$_ = "a"; s/^|a|$/x/g; print "$_\n";'
>>   xxx
>> 
>> I consider that broken behaviour in Perl.  For example, think about
>> the case of "a" =~ /^|a/.  The branch /^/ matches at 0, length 0.
>> The branch /a/ matches at 0, length 1.  So by greediness, the latter
>> ought to prevail.  But then, we have already consumed the character,
>> and there is no way to get a second match.  Hence,
>> 
>>   $ echo abc | sed -E 's/b|()/x/g'
>>   xaxcx
>>   $ echo a | sed -E 's/^|a|$/x/g'  
>>   x
>> 
>> Comments?
>>   Ingo

> Couple of comments;
> 
> - re_off is no longer used.  
> 
> - slen is unsigned. We need to be 100% sure it cannot wrap. ATM I'm
> not completely convinced the empty match case is safe. If you look at
> the FreeBSD code, it appears the case match[0].rm_eo == slen can
> happen, and in that case slen will wrap.  They fixed it by making slen
> a signed type (regoff_t). See rev 1.30 and 1.31 in their tree. 
> 
> When I run the regress test suite from FreeBSD (in
> src/tools/regression/usr.bin/sed) I see a couple of failures with our
> sed. I did not have time yet to look if these failures have anything
> with this diff and if we should incoorporate them.

Re: counting zero-length matches in sed(1)

Reply via email to