So, finally, here is the update of my patch to fix zero-length matches in sed(1). Sorry for the ridiculous delay.
Changes with respect to the first version: - Removed the now unused local variable re_off (noticed by otto@). - Make slen signed, or zero-length matches at EOF without a trailing newline would cause it to underflow (suspected by otto@ after looking at FreeBSD code, suspicion confirmed myself). - Advancing after zero-length matches was not quite right: A zero-length match before the final character of a file caused the zero-length match after the final character to get lost if the file ended without a terminating newline (noticed myself by changing echo "$in" to echo -n "$in" in substitute.sh of our own test suite); fixed by reworking the end of the do...while loop, which also makes it easier to understand. This now passes our own test suite both as it is and with -n, and otto@ checked that the remaining failures with respect to the FreeBSD test suite look unrelated to this particular patch. I'm leaving the original rationale at the end, for reference. Index: process.c =================================================================== RCS file: /cvs/src/usr.bin/sed/process.c,v retrieving revision 1.15 diff -u -p -r1.15 process.c --- process.c 27 Oct 2009 23:59:43 -0000 1.15 +++ process.c 22 Jul 2011 00:11:49 -0000 @@ -312,7 +312,7 @@ substitute(struct s_command *cp) { SPACE tspace; regex_t *re; - size_t re_off, slen; + regoff_t slen; int n, lastempty; char *s; @@ -333,60 +333,54 @@ substitute(struct s_command *cp) n = cp->u.s->n; lastempty = 1; - switch (n) { - case 0: /* Global */ - do { - if (lastempty || match[0].rm_so != match[0].rm_eo) { - /* Locate start of replaced string. */ - re_off = match[0].rm_so; - /* Copy leading retained string. */ - cspace(&SS, s, re_off, APPEND); - /* Add in regular expression. */ - regsub(&SS, s, cp->u.s->new); - } + do { + /* Copy the leading retained string. */ + if (n <= 1 && match[0].rm_so) + cspace(&SS, s, match[0].rm_so, APPEND); - /* Move past this match. */ - if (match[0].rm_so != match[0].rm_eo) { - s += match[0].rm_eo; - slen -= match[0].rm_eo; - lastempty = 0; + /* Skip zero-length matches right after other matches. */ + if (lastempty || match[0].rm_so || + match[0].rm_so != match[0].rm_eo) { + if (n <= 1) { + /* Want this match: append replacement. */ + regsub(&SS, s, cp->u.s->new); + if (n == 1) + n = -1; } else { - if (match[0].rm_so == 0) - cspace(&SS, s, match[0].rm_so + 1, - APPEND); - else - cspace(&SS, s + match[0].rm_so, 1, - APPEND); - s += match[0].rm_so + 1; - slen -= match[0].rm_so + 1; - lastempty = 1; + /* Want a later match: append original. */ + if (match[0].rm_eo) + cspace(&SS, s, match[0].rm_eo, APPEND); + n--; } - } while (slen > 0 && regexec_e(re, s, REG_NOTBOL, 0, slen)); - /* Copy trailing retained string. */ - if (slen > 0) - cspace(&SS, s, slen, APPEND); - break; - default: /* Nth occurrence */ - while (--n) { - s += match[0].rm_eo; - slen -= match[0].rm_eo; - if (!regexec_e(re, s, REG_NOTBOL, 0, slen)) - return (0); } - /* FALLTHROUGH */ - case 1: /* 1st occurrence */ - /* Locate start of replaced string. */ - re_off = match[0].rm_so + (s - ps); - /* Copy leading retained string. */ - cspace(&SS, ps, re_off, APPEND); - /* Add in regular expression. */ - regsub(&SS, s, cp->u.s->new); - /* Copy trailing retained string. */ + + /* Move past this match. */ s += match[0].rm_eo; slen -= match[0].rm_eo; + + /* + * After a zero-length match, advance one byte, + * and at the end of the line, terminate. + */ + if (match[0].rm_so == match[0].rm_eo) { + if (*s == '\n') + slen = -1; + else + slen--; + cspace(&SS, s++, 1, APPEND); + lastempty = 1; + } else + lastempty = 0; + + } while (n >= 0 && slen >= 0 && regexec_e(re, s, REG_NOTBOL, 0, slen)); + + /* Did not find the requested number of matches. */ + if (n > 1) + return (0); + + /* Copy the trailing retained string. */ + if (slen > 0) cspace(&SS, s, slen, APPEND); - break; - } /* * Swap the substitute space and the pattern space, and make sure ----- 8< ----- schnipp ----- >8 ----- 8< ----- schnapp ----- >8 ----- Otto Moerbeek wrote on Fri, Jun 24, 2011 at 09:41:00PM +0200: > On Sat, Jun 18, 2011 at 08:16:05PM -0600, Ingo Schwarze wrote: >> When a regular expression has zero-length matches in a string, >> both sed(1) global replacement (/g) and replacement of numbered >> instances (e.g. /2) are broken. This is not even limited to sed -E. >> Both Otto's patch and my own refactoring patch on misc@ only address >> global replacement and leave numbered instances broken. >> >> Already now, code in the three parts of the switch statement >> is rather repetitive, and by fixing the numbered instances case, >> code duplication would become much worse. Thus, i propose >> the following full rewrite of the whole switch statement. >> >> This is a bit scary because... >> Well, sed(1) is not exactly the tool we want to break. >> Hence, a test suite is included. >> Both my patch and GNU sed(1) pass the test suite. >> >> Our -current sed fails 43 of these tests, quite a few of them >> in spectacular ways, like >> >> $ echo | sed 's/$/x/2' # expect "" >> x >> $ echo aaa | sed 's/a*/x/2' # expect "aaa" >> aaax >> $ echo abc | sed -E 's/a|$/x/g' # expect "xbcx" >> x >> $ echo abc | sed -E 's/()/x/2' # expect "axbc" >> xabc >> $ echo abc | sed -E 's/()/x/42' # expect "abc" >> xabc >> >> One common source of confusion on misc@ was that Perl allows >> empty matches right after other matches, like in: >> >> $ perl -Mstrict -we '$_ = "abc"; s/b|()/x/g; print "$_\n";' >> xaxxcx >> $ perl -Mstrict -we '$_ = "a"; s/^|a|$/x/g; print "$_\n";' >> xxx >> >> I consider that broken behaviour in Perl. For example, think about >> the case of "a" =~ /^|a/. The branch /^/ matches at 0, length 0. >> The branch /a/ matches at 0, length 1. So by greediness, the latter >> ought to prevail. But then, we have already consumed the character, >> and there is no way to get a second match. Hence, >> >> $ echo abc | sed -E 's/b|()/x/g' >> xaxcx >> $ echo a | sed -E 's/^|a|$/x/g' >> x >> >> Comments? >> Ingo > Couple of comments; > > - re_off is no longer used. > > - slen is unsigned. We need to be 100% sure it cannot wrap. ATM I'm > not completely convinced the empty match case is safe. If you look at > the FreeBSD code, it appears the case match[0].rm_eo == slen can > happen, and in that case slen will wrap. They fixed it by making slen > a signed type (regoff_t). See rev 1.30 and 1.31 in their tree. > > When I run the regress test suite from FreeBSD (in > src/tools/regression/usr.bin/sed) I see a couple of failures with our > sed. I did not have time yet to look if these failures have anything > with this diff and if we should incoorporate them.