On Sat, 10 Jun 2017 10:25:27 +0000, Otto Moerbeek wrote:
> Thanks for the analysis and diff, I hope to get a chanche to think
> about this soon. At least I'll make sure this diff is not forgotten,
>     -Otto

I have seen the tests that you recently added to the tree; in the mean
time, however, while I continued experimenting and reading the source, I
discovered more bugs (as expected), as well as other trivial issues
which might be worthy of note before committing any real fixes.

First of all, my comprehension of the `c' command has slightly changed
since my last message: I now think its behaviour should be aligned with
that of `p', `P', `w' and others, all of which produce no output after
`c', treating a deleted pattern space as truly inexistent, and not
merely as an empty one.  It would hence appear more coherent to add

        if (pd)
                break;

before calling lputs(), as is done in many other places.  This would
then change the expected output of one of the newly added tests.

In general though, that `c' command is a big mess.  It is supposed to
delete the pattern space: in other implementations of sed that I have
tested, that means it will essentially behave like `i' followed by `d'.
In BSD's sed, however, it continues processing commands...  Here's one
of the many bugs that arise because of that:

        $ echo abc | sed 'c\
        > text
        > s/.*/x/
        > p
        > s/.*/y/
        > p'
        text
        x

So why isn't `y' printed too?  In comparison, using flags instead of
standalone commands:

        $ echo abc | sed 'c\
        > text
        > s/.*/x/p
        > s/.*/y/p'
        text
        x
        y

Here, we do get a `y', but only one (there should be two given that `-n'
has not been supplied to suppress normal output).  Since at least `p' as
a flag given to `s' appears to work, let us try the same with `w':

        $ echo abc | sed 'c\
        > text
        > s/.*/x/w /tmp/sed.out
        > s/.*/y/w /tmp/sed.out'
        text
        $ cat /tmp/sed.out
        x

But this time it doesn't work.  Now add a third `s' command...

        $ echo abc | sed 'c\
        > text
        > s/.*/x/w /tmp/sed.out
        > s/.*/y/w /tmp/sed.out
        > s/.*/z/w /tmp/sed.out'
        text
        z
        $ cat /tmp/sed.out
        x
        z

And it works.  Well, not really: we do see a `z', but still no `y'.
Indeed, this does not make any sense.

The reason for this strange behaviour is nevertheless easy to figure out
when looking at how `s' is implemented, keeping in mind that the
substitute space is globally defined.  In any case, even if one was to
fix that particular bug, I would probably be able to find more.  This is
why I believe that the only sane solution is that, just like `d', `c'
should simply

        goto new;

which would, at the same time, obviate the need for the `deleted' field
of the SPACE structure, subsequently allowing for substantial cleanups.

Also, aside of that `c' mess, there is still the fact that NUL bytes
originally present in the input are not well handled at all.  Of course,
`l' cannot print past them (whereas `p' can), which is a shame; but
significantly worse is the inconsistent behaviour of `s' in that case.
At first, it seems to work well (in the output, `^@' represents a NUL):

        $ printf 'x\0y' | sed 's/x/_/'
        _^@y

It can even handle some empty matches perfectly:

        $ printf 'x\0y' | sed 's/[[:<:]]/_/' # pattern space survives
        _x^@y

But not all of them:

        $ printf 'x\0y' | sed 's/[[:>:]]/_/' # pattern space is cut off
        x_
        $ printf 'x\0y' | sed 's/[[:>:]]/_/2' # likewise
        x

One might argue that NUL bytes are not supposed to happen in textual
input, so that these examples are irrelevant; the real problem however
is that it also affects newlines, thus perfectly valid text:

        $ printf 'x\ny' | sed 'N;s/[[:<:]]/_/'
        _x
        y
        $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/'
        x_
        $ printf 'x\ny' | sed 'N;s/[[:>:]]/_/2'
        x

So maybe substitute() should test for (slen == 0) instead of the
questionable (*s == '\0' || *s == '\n').  That would seem more correct
to me, but this is only a guess since I must admit that I don't fully
understand that part of the code yet.  Here is a diff, for instance,
that fixes all the above issues.

--- a/src/usr.bin/sed/process.c Wed Jun  7 09:56:20 2017
+++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017
@@ -387,14 +388,11 @@ substitute(struct s_command *cp)
                 * and at the end of the line, terminate.
                 */
                if (match[0].rm_so == match[0].rm_eo) {
-                       if (*s == '\0' || *s == '\n')
-                               slen = -1;
-                       else
-                               slen--;
-                       if (*s != '\0') {
-                               cspace(&SS, s++, 1, APPEND);
-                               le++;
-                       }
+                       if (slen == 0)
+                               break;
+                       cspace(&SS, s++, 1, APPEND);
+                       slen--;
+                       le++;
                        lastempty = 1;
                } else
                        lastempty = 0;

However, even if it passes the regression tests with the same success as
the current sed, please do not blindly commit this patch.  Although I
fail to see what it was, surely there must have been a reason to test
for that particular condition in the original code, and not simply for
(slen == 0).  This patch is therefore only intented as an example of how
a fix might look like, but definitely not as the correct and final
solution.

Lastly, I have prepared one more patch dealing with trivial issues that
I found while browsing the code.  Contrary to the previous one, I am
very confident that it will not have any undesired effects, although of
course it should be carefully reviewed, as any patch should be.

--- a/src/usr.bin/sed/defs.h    Fri Jan 20 10:26:16 2017
+++ b/src/usr.bin/sed/defs.h    Mon Jun 12 09:32:57 2017
@@ -129,7 +129,6 @@ typedef struct {
        size_t len;             /* Current length. */
        int deleted;            /* If deleted. */
        int append_newline;     /* If originally terminated by \n. */
-       char *back;             /* Backing memory. */
        size_t blen;            /* Backing memory length. */
 } SPACE;
 
--- a/src/usr.bin/sed/process.c Wed Jun  7 09:56:20 2017
+++ b/src/usr.bin/sed/process.c Tue Jun 13 07:21:59 2017
@@ -77,9 +77,9 @@ static regex_t *defpreg;
 size_t maxnsub;
 regmatch_t *match;
 
-#define OUT() do {\
-       fwrite(ps, 1, psl, outfile);\
-       if (psanl) fputc('\n', outfile);\
+#define OUT() do {                             \
+       fwrite(ps, 1, psl, outfile);            \
+       if (psanl) fputc('\n', outfile);        \
 } while (0)
 
 void
@@ -145,15 +145,15 @@ redirect:
                                cspace(&PS, hs, hsl, REPLACE);
                                break;
                        case 'G':
-                               cspace(&PS, "\n", 1, 0);
-                               cspace(&PS, hs, hsl, 0);
+                               cspace(&PS, "\n", 1, APPEND);
+                               cspace(&PS, hs, hsl, APPEND);
                                break;
                        case 'h':
                                cspace(&HS, ps, psl, REPLACE);
                                break;
                        case 'H':
-                               cspace(&HS, "\n", 1, 0);
-                               cspace(&HS, ps, psl, 0);
+                               cspace(&HS, "\n", 1, APPEND);
+                               cspace(&HS, ps, psl, APPEND);
                                break;
                        case 'i':
                                (void)fprintf(outfile, "%s", cp->t);
@@ -171,7 +171,7 @@ redirect:
                                break;
                        case 'N':
                                flush_appends();
-                               cspace(&PS, "\n", 1, 0);
+                               cspace(&PS, "\n", 1, APPEND);
                                if (!mf_fgets(&PS, 0))
                                        exit(0);
                                break;
@@ -256,8 +256,9 @@ redirect:
                        cp = cp->next;
                } /* for all cp */
 
-new:           if (!nflag && !pd)
+               if (!nflag && !pd)
                        OUT();
+new:
                flush_appends();
        } /* for all lines */
 }
@@ -266,7 +267,7 @@ new:                if (!nflag && !pd)
  * TRUE if the address passed matches the current program state
  * (lastline, linenumber, ps).
  */
-#define        MATCH(a)                                                \
+#define MATCH(a)                                                       \
        (a)->type == AT_RE ? regexec_e((a)->u.r, ps, 0, 1, 0, psl) :    \
            (a)->type == AT_LINE ? linenum == (a)->u.l : lastline()
 
@@ -361,7 +362,7 @@ substitute(struct s_command *cp)
                        cspace(&SS, s, match[0].rm_so - le, APPEND);
 
                /* Skip zero-length matches right after other matches. */
-               if (lastempty || (match[0].rm_so - le) ||
+               if (lastempty || match[0].rm_so != le ||
                    match[0].rm_so != match[0].rm_eo) {
                        if (n <= 1) {
                                /* Want this match: append replacement. */
@@ -370,7 +371,7 @@ substitute(struct s_command *cp)
                                        n = -1;
                        } else {
                                /* Want a later match: append original. */
-                               if (match[0].rm_eo - le)
+                               if (match[0].rm_eo != le)
                                        cspace(&SS, s, match[0].rm_eo - le,
                                            APPEND);
                                n--;
@@ -418,7 +416,6 @@ substitute(struct s_command *cp)
        PS = SS;
        psanl = tspace.append_newline;
        SS = tspace;
-       SS.space = SS.back;
 
        /* Handle the 'p' flag. */
        if (cp->u.s->p)
@@ -547,13 +544,14 @@ regexec_e(regex_t *preg, const char *string, int eflag
 static void
 regsub(SPACE *sp, char *string, char *src)
 {
-       int len, no;
+       int no;
+       size_t len;
        char c, *dst;
 
-#define        NEEDSP(reqlen)                                                  
\
+#define NEEDSP(reqlen)                                                 \
        if (sp->len + (reqlen) + 1 >= sp->blen) {                       \
                size_t newlen = sp->blen + (reqlen) + 1024;             \
-               sp->space = sp->back = xrealloc(sp->back, newlen);      \
+               sp->space = xrealloc(sp->space, newlen);                \
                sp->blen = newlen;                                      \
                dst = sp->space + sp->len;                              \
        }
@@ -572,8 +570,7 @@ regsub(SPACE *sp, char *string, char *src)
                        NEEDSP(1);
                        *dst++ = c;
                        ++sp->len;
-               } else if (match[no].rm_so != -1 && match[no].rm_eo != -1) {
-                       len = match[no].rm_eo - match[no].rm_so;
+               } else if ((len = match[no].rm_eo - match[no].rm_so) != 0) {
                        NEEDSP(len);
                        memmove(dst, string + match[no].rm_so, len);
                        dst += len;
@@ -594,18 +591,18 @@ cspace(SPACE *sp, const char *p, size_t len, enum e_sp
 {
        size_t tlen;
 
+       if (spflag == REPLACE)
+               sp->len = 0;
+
        /* Make sure SPACE has enough memory and ramp up quickly. */
        tlen = sp->len + len + 1;
        if (tlen > sp->blen) {
                size_t newlen = tlen + 1024;
-               sp->space = sp->back = xrealloc(sp->back, newlen);
+               sp->space = xrealloc(sp->space, newlen);
                sp->blen = newlen;
        }
 
-       if (spflag == REPLACE)
-               sp->len = 0;
-
-       memmove(sp->space + sp->len, p, len);
+       memcpy(sp->space + sp->len, p, len);
 
        sp->space[sp->len += len] = '\0';
 }

Here is the corresponding list of changes:

1.  Remove the `back' field from the SPACE structure, which was useless
since always equal to the `space' field.

2.  Use APPEND in calls to cspace() that previously used 0 directly,
which could be slightly confusing if one did not know the order of the
`e_spflag' enum, and also missed the point of it being defined in the
first place.

3.  As suggested by the outdated comment preceding its definition,
cspace() was originally only meant to append to, not to replace, the
contents of a SPACE.  The calculation of the size for the potentially
new buffer was however never adjusted, which meant that, with input
lines of approximately equal length, the function would allocate roughly
twice as much memory as necessary, half of that memory staying unused.
To fix it, conditionally set `sp->len' to zero before the relevent
assignment, not after.

4.  Since cspace() never receives arguments that overlap in memory (as
that would make little sense), and considering that it is probably one
of the most frequently called routines, make it use memcpy(3) in lieu of
memmove(3).

5.  Change conditionals of the form (a - b) to (a != b) for better
readability.

6.  Use `size_t' instead of `int' for a variable holding a length.

7.  Slightly improve the logic of regsub() to avoid a bunch of no-op
instructions when `len' is zero.

8.  Move a goto label previously redirecting into a conditional that was
always false within that context.

9.  Add missing tabs to align backslashes in macro definitions.

Kind regards,

kshe

Reply via email to