Re: [patch] add UTF-8 support to column(1)

Nicholas Marriott Sun, 28 Feb 2016 08:28:02 -0800

Hi

I don't know why it is off in xterm but oxtabs is the default in
ttydefaults.h and appears to on for mostly everything else, including ssh
and tmux (which just uses what forkpty gives it).
Hi Martijn,


Martijn van Duren wrote on Sat, Feb 27, 2016 at 01:22:53PM +0100:

> Here's my attempt to implement UTF-8 support in column(1).
> Besides the general UTF-8 conversions it does several other things
> to make it behave properly.

Two general remarks:

 1. This column(1) code seems to be relatively low quality.
    When aiming at functional changes to low-quality code,
    it is sometimes useful to keep cleanup and functional
    changes seperate, in particular if either of them alone
    already cause a large diff (mixing minor cleanup into a
    functional diff is OK when the diff remains reasonably
    easy to read, though).

 2. When you want to do several things that require large changes,
    proceeding in steps can make review easier.

> Full changelist is as follow:
> - Make separator and input full UTF-8 aware.
> - Do proper character width count.

I agree with these two goals.

> This also fixes some indentation issues where the old code assumed
> that a tab also was one column wide.

I agree with your assessment that the old code mishandles tab
characters in the input.  There are three cases:

 1. In -t mode when '\t' is in -s (which it is by default).
    There is no problem because the tabs get replaced with NULs
    by strtok(3).
 2. In -t mode when '\t' is not in -s.
 3. In non-t mode.
    What should happen in these two cases seems arguable.
    One option would be to calculate the width of a tab such
    that it advances to the next multiple of eight.
    Another option would be to just replace each tab with
    a single blank on the output side.
    While the first option is closer to what happens now, the
    second one seems more useful: When you explicitly change the
    columnation and say that tabs in the input do not delimit
    columns, it doesn't seem very useful to have them signify
    subcolumns inside the new columns.

> - Replace tabs between columns with spaces.

I'm not convinced that should be changed.  In any case, it is unrelated
to UTF-8.

> The old code worked fine, but with UTF-8 and oxtabs in stty enabled
> the column positioning can get way off, and we can't expect everyone
> to run with "stty -oxtabs". Found with the help of nicm@.

On the console, UTF-8 doesn't work anyway, at all.  In an xterm(1),
it seems to me that "stty -oxtabs" is the default, so keeping the
tabs works by default.  If somebody manually sets "stty oxtabs"
(knowing that it cannot possibly work with UTF-8), they can easily
pipe column(1) output through expand(1).  Of course, we need to fix
expand(1), but that's a seperate matter.  I'm not sure we should
incorporate expand(1) functionality into random utilities that don't
yet have special tab handling just because there are some rare
corner cases where tabs might break.


Regarding the design of the program:  There are two very different
modes, -t and non-t.  In -t mode, total line widths are irrelevant
but calculated in input() anyway.  With UTF-8, that becomes more
expensive and harder to bear.  In non-t mode, total line widths
are the only widths that matter.

I think the ideal top-level data structure is as follows:

 - A list of input lines, stored as a pointer to the first
   line, either NULL-terminated or together with a variable
   storing the number of lines.
 - Each line is a list of fields, again stored as a pointer
   to the first field, this time definitely NULL-terminated
   for simplicity.
 - Each field is a struct containing a char * and an int width.

The input() function should generate this complete data structure
using getline(3) and mbtowc(3) in a single loop and remember the
maximum field width.  For the non-t case, only one field will be
generated per input line.

That way, we only need to parse the input once, we only need to
calculate widths once, we only need heavy malloc(3) lifting in one
function, and UTF-8 handling will only infect one single function,
input().  Probably, we only need one single call to mbtowc(3), one
single call to wcwidth(3), and no calls to *mbs*() functions except
one in main() for parsing and one in input() for using the -s option.
Probably, all the other functions can get away with functions like
fputs(3) or printf("...%s...").

I think it would be reasonable to do this refactoring in a first step,
maybe together with renaming length to width, then implement UTF-8
support in a second step.  Does that make sense to you?

I'm adding some specific comments in-line.  If any of my comments
do not seem helpful to you, just say so, i may have overlooked
something.

Yours,
  Ingo


> Index: column.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/column/column.c,v
> retrieving revision 1.22
> diff -u -p -r1.22 column.c
> --- column.c  3 Nov 2015 04:55:44 -0000       1.22
> +++ column.c  27 Feb 2016 12:21:35 -0000
> @@ -36,15 +36,24 @@
>  #include <ctype.h>
>  #include <err.h>
>  #include <limits.h>
> +#include <locale.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <unistd.h>
> +#include <wchar.h>
> +#include <wctype.h>
> +
> +struct li {
> +     int      width;
> +     char    *str;
> +};

I would probably put the pointer first (bikeshed).

>
>  void  c_columnate(void);
>  void *ereallocarray(void *, size_t, size_t);
>  void *ecalloc(size_t, size_t);
>  void  input(FILE *);
> +int   isu8start(unsigned char);
>  void  maketbl(void);
>  void  print(void);
>  void  r_columnate(void);
> @@ -54,9 +63,9 @@ int termwidth = 80;         /* default terminal
>
>  int entries;                 /* number of records */
>  int eval;                    /* exit value */
> -int maxlength;                       /* longest record */
> -char **list;                 /* array of pointers to records */
> -char *separator = "\t ";     /* field separator for table option */
> +int maxwidth;                        /* longest record */
> +struct li *list;             /* array of  to records */
> +wchar_t *separator = L"\t "; /* field separator for table option */
>
>  int
>  main(int argc, char *argv[])
> @@ -66,6 +75,7 @@ main(int argc, char *argv[])
>       int ch, tflag, xflag;
>       char *p;
>       const char *errstr;
> +     int slen;

You are using that for strlen(3) and malloc(3), so it should be size_t.

>
>       if (ioctl(1, TIOCGWINSZ, &win) == -1 || !win.ws_col) {
>               if ((p = getenv("COLUMNS")) && *p != '\0') {
> @@ -79,6 +89,8 @@ main(int argc, char *argv[])
>       if (pledge("stdio rpath", NULL) == -1)
>               err(1, "pledge");
>
> +     setlocale(LC_CTYPE, "");
> +

For now, setlocate(3) should be done before pledge(2).

>       tflag = xflag = 0;
>       while ((ch = getopt(argc, argv, "c:s:tx")) != -1)
>               switch(ch) {
> @@ -88,7 +100,13 @@ main(int argc, char *argv[])
>                               errx(1, "%s: %s", errstr, optarg);
>                       break;
>               case 's':
> -                     separator = optarg;
> +                     slen = strlen(optarg)+1;
> +                     if ((separator = reallocarray(NULL, slen,
> +                         sizeof(*separator))) == NULL)
> +                             err(1, NULL);
> +                     if (mbstowcs(separator, optarg,
> +                         slen * sizeof(*separator)) == (size_t)-1)

The third argument must be just slen, not a product.  It is the
maximum number of wide characters to be stored, not bytes.

> +                             err(1, "Unable to set separator");

Given that the EILSEQ message will be appended,
i would make this shorter, something like:

  err(1, "-s");

>                       break;
>               case 't':
>                       tflag = 1;
> @@ -125,69 +143,59 @@ main(int argc, char *argv[])
>
>       if (tflag)
>               maketbl();
> -     else if (maxlength >= termwidth)
> +     else if (maxwidth >= termwidth)
>               print();
>       else if (xflag)
>               c_columnate();
>       else
>               r_columnate();
> -     exit(eval);
> +     return eval;
>  }
>
>  #define      TAB     8
>  void
>  c_columnate(void)
>  {
> -     int chcnt, col, cnt, endcol, numcols;
> -     char **lp;
> +     int col, numcols;
> +     struct li *lp;
>
> -     maxlength = (maxlength + TAB) & ~(TAB - 1);
> -     numcols = termwidth / maxlength;
> -     endcol = maxlength;
> -     for (chcnt = col = 0, lp = list;; ++lp) {
> -             chcnt += printf("%s", *lp);
> +     maxwidth = (maxwidth + TAB) & ~(TAB - 1);
> +     if ((numcols = termwidth / maxwidth) == 0)
> +             numcols = 1;

That cannot happen.  For maxwidth >= termwidth, print() is used instead.

> +     for (col = 0, lp = list;; ++lp) {
> +             printf("%s", lp->str);
>               if (!--entries)
>                       break;
>               if (++col == numcols) {
> -                     chcnt = col = 0;
> -                     endcol = maxlength;
> +                     col = 0;
>                       putchar('\n');
>               } else {
> -                     while ((cnt = ((chcnt + TAB) & ~(TAB - 1))) <=
endcol) {
> -                             (void)putchar('\t');
> -                             chcnt = cnt;
> -                     }
> -                     endcol += maxlength;
> +                     while (lp->width++ < maxwidth)
> +                             (void)putchar(' ');

Not sure that change is needed (see above).

>               }
>       }
> -     if (chcnt)
> -             putchar('\n');
> +     putchar('\n');
>  }

Doesn't that change cause a bogus blank line for empty input?

>
>  void
>  r_columnate(void)
>  {
> -     int base, chcnt, cnt, col, endcol, numcols, numrows, row;
> +     int base, col, numcols, numrows, row;
>
> -     maxlength = (maxlength + TAB) & ~(TAB - 1);
> -     numcols = termwidth / maxlength;
> -     if (numcols == 0)
> +     maxwidth = (maxwidth + TAB) & ~(TAB - 1);
> +     if ((numcols = termwidth / maxwidth) == 0)
>               numcols = 1;

Again, cannot happen.

>       numrows = entries / numcols;
>       if (entries % numcols)
>               ++numrows;
>
>       for (row = 0; row < numrows; ++row) {
> -             endcol = maxlength;
> -             for (base = row, chcnt = col = 0; col < numcols; ++col) {
> -                     chcnt += printf("%s", list[base]);
> +             for (base = row, col = 0; col < numcols; ++col) {
> +                     printf("%s", list[base].str);
> +                     while (list[base].width++ < maxwidth)
> +                             (void)putchar(' ');

The padding certainly mustn't go before the following last column
check, or you produce trailing whitespace.  But again, do we really
want to expand tabs?

>                       if ((base += numrows) >= entries)
>                               break;
> -                     while ((cnt = ((chcnt + TAB) & ~(TAB - 1))) <=
endcol) {
> -                             (void)putchar('\t');
> -                             chcnt = cnt;
> -                     }
> -                     endcol += maxlength;
>               }
>               putchar('\n');
>       }
> @@ -197,15 +205,15 @@ void
>  print(void)
>  {
>       int cnt;
> -     char **lp;
> +     struct li *lp;
>
>       for (cnt = entries, lp = list; cnt--; ++lp)
> -             (void)printf("%s\n", *lp);
> +             (void)printf("%s\n", lp->str);
>  }
>
>  typedef struct _tbl {
> -     char **list;
> -     int cols, *len;
> +     wchar_t **list;
> +     int cols, *width;
>  } TBL;

Regarding the two levels of parsing, structs, and malloc, see my
design comments at the top.

>  #define      DEFCOLS 25
>
> @@ -214,47 +222,54 @@ maketbl(void)
>  {
>       TBL *t;
>       int coloff, cnt;
> -     char *p, **lp;
> -     int *lens, maxcols = DEFCOLS;
> +     struct li *lp;
> +     int *widths, maxcols = DEFCOLS;
>       TBL *tbl;
> -     char **cols;
> +     wchar_t **cols;
> +     wchar_t *ws, *wws, *last;
>
>       t = tbl = ecalloc(entries, sizeof(TBL));
>       cols = ereallocarray(NULL, maxcols, sizeof(char *));
> -     lens = ecalloc(maxcols, sizeof(int));
> +     widths = ecalloc(maxcols, sizeof(int));
> +     if ((ws = reallocarray(NULL, maxwidth, sizeof(*ws))) == NULL)
> +             err(1, NULL);

Seems inconsistent.  We should either use ereallocarray() throughout
or not at all.  There may be more instances in your patch.

>       for (cnt = 0, lp = list; cnt < entries; ++cnt, ++lp, ++t) {
> -             for (coloff = 0, p = *lp; (cols[coloff] = strtok(p,
separator));
> -                 p = NULL)
> +             if (mbstowcs(ws, lp->str, maxwidth * sizeof(*ws)) ==
(size_t)-1)
> +                     errx(1, "Invalid char on line %d", cnt+1);

I don't like that.  So if there is a single invalid byte, you lose all
the rest of the file?  I don't think we can use mbstowcs(3) in such
elementary utilities, except maybe for option parsing.

> +             free(lp->str);
> +             lp->str = NULL;
> +             wws = ws;
> +             for (coloff = 0; (cols[coloff] = wcstok(wws, separator,
> +                 &last)); wws = NULL) {
>                       if (++coloff == maxcols) {
>                               maxcols += DEFCOLS;
>                               cols = ereallocarray(cols, maxcols,
>                                   sizeof(char *));
> -                             lens = ereallocarray(lens, maxcols,
> +                             widths = ereallocarray(widths, maxcols,
>                                   sizeof(int));
> -                             memset(lens + coloff, 0, DEFCOLS *
sizeof(int));
> +                             memset(widths + coloff, 0, DEFCOLS *
sizeof(int));
>                       }
> +             }
>               if (coloff == 0)
>                       continue;
> -             t->list = ecalloc(coloff, sizeof(char *));
> -             t->len = ecalloc(coloff, sizeof(int));
> +             t->list = ecalloc(coloff, sizeof(*(t->list)));
> +             t->width = ecalloc(coloff, sizeof(*(t->width)));
>               for (t->cols = coloff; --coloff >= 0;) {
> -                     t->list[coloff] = cols[coloff];
> -                     t->len[coloff] = strlen(cols[coloff]);
> -                     if (t->len[coloff] > lens[coloff])
> -                             lens[coloff] = t->len[coloff];
> +                     t->list[coloff] = wcsdup(cols[coloff]);
> +                     t->width[coloff] = wcswidth(cols[coloff],
> +                         wcslen(cols[coloff]));
> +                     if (t->width[coloff] > widths[coloff])
> +                             widths[coloff] = t->width[coloff];
>               }
>       }
>       for (cnt = 0, t = tbl; cnt < entries; ++cnt, ++t) {
>               if (t->cols > 0) {
>                       for (coloff = 0; coloff < t->cols - 1; ++coloff)
> -                             (void)printf("%s%*s", t->list[coloff],
> -                                 lens[coloff] - t->len[coloff] + 2, " ");
> -                     (void)printf("%s\n", t->list[coloff]);
> +                             (void)printf("%ls%*s", t->list[coloff],
> +                                 widths[coloff] - t->width[coloff] + 2,
" ");
> +                     (void)printf("%ls\n", t->list[coloff]);
>               }
>       }
> -     free(tbl);
> -     free(lens);
> -     free(cols);
>  }
>
>  #define      DEFNUM          1000
> @@ -263,31 +278,56 @@ maketbl(void)
>  void
>  input(FILE *fp)
>  {
> -     static size_t maxentry = DEFNUM;
> -     int len;
> +     static int maxentry = DEFNUM;
> +     int width = 0, cwidth = 0, size;
>       char *p, buf[MAXLINELEN];
> +     wchar_t wc;
>
>       if (!list)
>               list = ecalloc(maxentry, sizeof(char *));
>       while (fgets(buf, MAXLINELEN, fp)) {

Using fgets(3) with a fixed-size buffer, discarding initial
parts of long lines, is really unfortunate, in particular
given that this utility reads the whole file into memory anyway.
So i strongly feel this should be changed to use getline(3),
removing the arbitrary line length restriction.

Probably, this should be a cleanup step before the main UTF-8
diff, see my design comments at the top.

> -             for (p = buf; isspace((unsigned char)*p); ++p);
> +             width = 0, cwidth = 0;
> +             p = buf;
> +             do {
> +                     p += cwidth;
> +                     if ((cwidth = mbtowc(&wc, p, MB_CUR_MAX)) == -1) {
> +                             (void) mbtowc(NULL, NULL, MB_CUR_MAX);
> +                             break;
> +                     }
> +             } while (iswspace(wc));
>               if (!*p)
>                       continue;
> -             if (!(p = strchr(p, '\n'))) {
> +
> +             for (p = buf; *p != '\n' && *p != '\0'; p++) {
> +                     if (isu8start(*p)) {

It looks a bit strange to use both isu8start() and mbtowc(3).
Usually, if you need mbtowc(3) anyway, you can get away without
isu8start() when coding carefully.

> +                             if ((size = mbtowc(&wc, p, MB_CUR_MAX)) ==
-1)
> +                                     mbtowc(NULL, NULL, MB_CUR_MAX);
> +                             if ((cwidth = wcwidth(wc)) >= 0) {

That doesn't work.  After mbtowc(3) failure, wc contains garbage.

> +                                     width += cwidth;
> +                                     p += size-1;
> +                             }

That doesn't work either.  After wcwidth(3) failure, something must
be done.  Treating non-printable characters as width 0 may be
acceptable, but failing to advance p looks like an endless loop
to me.

> +                     } else {
> +                             if (*p == '\t')
> +                                     width += TAB - width%TAB;

That seems wrong.  In -t mode, tabs may be used as column seperators,
so they don't contribute to the line width like that.

> +                             else if (isprint(*p))
> +                                     width++;
> +                     }
> +             }
> +             if (*p != '\n') {
>                       warnx("line too long");
>                       eval = 1;
>                       continue;
>               }
>               *p = '\0';
> -             len = p - buf;
> -             if (maxlength < len)
> -                     maxlength = len;
> +             if (maxwidth < width)
> +                     maxwidth = width;
>               if (entries == maxentry) {
>                       maxentry += DEFNUM;
>                       list = ereallocarray(list, maxentry, sizeof(char
*));
>                       memset(list + entries, 0, DEFNUM * sizeof(char *));
>               }
> -             if (!(list[entries++] = strdup(buf)))
> +             list[entries].width = width;
> +             if (!(list[entries++].str = strdup(buf)))
>                       err(1, NULL);
>       }
>  }
> @@ -319,4 +359,10 @@ usage(void)
>       (void)fprintf(stderr,
>           "usage: column [-tx] [-c columns] [-s sep] [file ...]\n");
>       exit(1);
> +}
> +
> +int
> +isu8start(unsigned char c)
> +{
> +     return MB_CUR_MAX > 1 && (c & (0x80 | 0x40)) == (0x80 | 0x40);
>  }
>

Re: [patch] add UTF-8 support to column(1)

Reply via email to