Hi,

There is an ongoing confusion in sed's source about the concept of EOL:
the code contradicts itself as to whether it should be determined by a
trailing NUL or by looking up the `len' field of the SPACE structure.

At least two commands (`l' and `s') assume that the pattern space is
always NUL-terminated.  However, at least two other commands (`c' and
`D') do not comply with that assumption, modifying it by merely updating
its length attribute.  As a result, using `l' or `s' after `c' or `D'
produces incorrect output.

For instance, in the following excerpt, `l' should print `bb$':

        $ printf 'a\nbb\n' | sed '${l;q;};N;D'
        bbbb$
        bb

For the same underlying reason, it also disregards the deletion of the
pattern space that `c' is supposed to entail:

        $ echo abc | sed 'c\
        > text
        > l'
        text
        abc$

The substitute command can likewise get confused and add an extra
character after a replacement:

        $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;D'
        bbxb

Note how this does not happen when the substitution pattern matches more
than the empty string:

        $ printf 'a\nbb\n' | sed '${s/.$/&x/;q;};N;D'
        bbx

After an empty match, the character that `s' adds to the pattern space
depends on how memmove(3) modified it beforehand.  Here's a very
simplified version of the original sequence of commands from which I
discovered these bugs, where `i' appear instead of `b':

        $ printf 'a\nbb\n' | sed '${s/$/x/;q;};N;s/a/ZZi/;D'
        bbxi

Similarily, when one believes the pattern space to be empty after a `c'
command, something like `s/.*//' will magically revive one character
from that emptiness:

        $ echo abc | sed 'c\
        > text
        > s/.*//'
        text
        a

As I came across these in an actual script that I was writing, I was
quite amazed to find out that they went unnoticed for so long, without
anyone ever trying to see what `l' does after `c' or `D'.  Indeed, the
`l' command is broken since revision 1.1 of process.c, making that bug
older than I am.  It has also been present in the other BSDs, but:

      o FreeBSD fixed it by accident in 2004 by adding support for
        multibyte encodings;

      o NetBSD fixed it by accident in 2014 by importing FreeBSD's sed
        (merging their changes afterwards, but that didn't reintroduce
        the bug, since they hadn't touched that part of the code).

The more serious bug regarding empty matches in the substitute command
was introduced in OpenBSD in 2011, when schwarze@ decided to rewrite its
main loop, intending to "fix multiple issues regarding the replacement
of zero-length strings", but apparently with the unfortunate side effect
of introducing a new issue regarding the replacement of zero-length
strings.  Finally, this commit was adopted by FreeBSD in 2016 when they
synchronised some of their code with OpenBSD's to ease importing other
fixes from it.

Thus, OpenBSD still has the old `l' bug as well as the more recent `s'
one, FreeBSD only has the latter, and since 2014 NetBSD is not longer
affected by any of these, although the fact that `c' and `D' omit to
NUL-terminate the pattern space could be considered a bug in itself,
since at least one piece of code present for 20 years in their tree was
making the contrary assumption.  So who knows where else it was made?
Actually, with all due respect, probably not even the original
developers: many of their choices suggest that the end of the pattern
space was not always meant to be encoded as a trailing NUL, but they
nevertheless designed the function `lputs' to only take a simple string
parameter, by which no separate length information could ever be
accessible.

I therefore have the strong feeling that this is not the last bug of
BSD's sed.  The remaining ones may be hidden in dark corners, but I do
expect them to show up at some point in the future, considering how
fragile the existing code looks.  The overall situation also prevents
any significant improvement to be successfully conducted, as changing
more than two lines will most likely lead to subtle regressions; see the
last few commits at

https://svnweb.freebsd.org/base/head/usr.bin/sed/process.c

and

http://cvsweb.netbsd.org/cgi-bin/cvsweb.cgi/src/usr.bin/sed/process.c

for recent examples of such failed attempts.

Anyway, until I digress even more, here is a patch for that one, forcing
`c' and `D' to put a NUL where `l' and `s' expect one to be.  By the
way, whether the fault is on one side or the other is actually unclear.
It does seem redundant to enforce such a convention when one already has
access to the length of the corresponding string.  As a matter of fact,
the overwhelming majority of the code does not appear to ever rely on a
NUL being there at all; for instance, `D' and `P' use memchr(3), not
strchr(3), which, besides, allows for the manipulation of NUL bytes
originally present in the input.  Hence a more complete patch could
instead try to unify the whole source to follow one convention
consistently; however, that would probably take as much time as a full
rewrite, because that is essentially what would need to be done, so, for
now, I kept it as simple as it could be, to avoid messing it up even
more than it already is.

--- a/src/usr.bin/sed/process.c Wed Feb 22 14:09:09 2017
+++ b/src/usr.bin/sed/process.c Wed Jun  7 09:56:20 2017
@@ -121,7 +121,7 @@ redirect:
                                goto redirect;
                        case 'c':
                                pd = 1;
-                               psl = 0;
+                               ps[psl = 0] = '\0';
                                if (cp->a2 == NULL || lastaddr || lastline())
                                        (void)fprintf(outfile, "%s", cp->t);
                                break;
@@ -138,6 +138,7 @@ redirect:
                                } else {
                                        psl -= (p + 1) - ps;
                                        memmove(ps, p + 1, psl);
+                                       ps[psl] = '\0';
                                        goto top;
                                }
                        case 'g':

Also note how one could have thought of fixing `D' by making its
memmove(3) go up to `psl + 1' instead of `psl', since there now should
always be a NUL at the end; however, a separate assignment makes the
intent clearer, and I read somewhere that explicit was better than
implicit.

That being said, the more I think about this, the more I am convinced
that the problematic commands were in fact `l' and `s'.  Since the code
already uses fgetln(3) to read lines, it would have been nicer (and
optimal) to never rely on NULs.  As stated above, this is already the
case almost everywhere; therefore, if any of the OpenBSD developers
wanted to elaborate on that idea after committing such an unsatisfying
patch, I would of course be very happy to help.

Kind regards,

kshe

Reply via email to