Hi I don't know why it is off in xterm but oxtabs is the default in ttydefaults.h and appears to on for mostly everything else, including ssh and tmux (which just uses what forkpty gives it). Hi Martijn,
Martijn van Duren wrote on Sat, Feb 27, 2016 at 01:22:53PM +0100: > Here's my attempt to implement UTF-8 support in column(1). > Besides the general UTF-8 conversions it does several other things > to make it behave properly. Two general remarks: 1. This column(1) code seems to be relatively low quality. When aiming at functional changes to low-quality code, it is sometimes useful to keep cleanup and functional changes seperate, in particular if either of them alone already cause a large diff (mixing minor cleanup into a functional diff is OK when the diff remains reasonably easy to read, though). 2. When you want to do several things that require large changes, proceeding in steps can make review easier. > Full changelist is as follow: > - Make separator and input full UTF-8 aware. > - Do proper character width count. I agree with these two goals. > This also fixes some indentation issues where the old code assumed > that a tab also was one column wide. I agree with your assessment that the old code mishandles tab characters in the input. There are three cases: 1. In -t mode when '\t' is in -s (which it is by default). There is no problem because the tabs get replaced with NULs by strtok(3). 2. In -t mode when '\t' is not in -s. 3. In non-t mode. What should happen in these two cases seems arguable. One option would be to calculate the width of a tab such that it advances to the next multiple of eight. Another option would be to just replace each tab with a single blank on the output side. While the first option is closer to what happens now, the second one seems more useful: When you explicitly change the columnation and say that tabs in the input do not delimit columns, it doesn't seem very useful to have them signify subcolumns inside the new columns. > - Replace tabs between columns with spaces. I'm not convinced that should be changed. In any case, it is unrelated to UTF-8. > The old code worked fine, but with UTF-8 and oxtabs in stty enabled > the column positioning can get way off, and we can't expect everyone > to run with "stty -oxtabs". Found with the help of nicm@. On the console, UTF-8 doesn't work anyway, at all. In an xterm(1), it seems to me that "stty -oxtabs" is the default, so keeping the tabs works by default. If somebody manually sets "stty oxtabs" (knowing that it cannot possibly work with UTF-8), they can easily pipe column(1) output through expand(1). Of course, we need to fix expand(1), but that's a seperate matter. I'm not sure we should incorporate expand(1) functionality into random utilities that don't yet have special tab handling just because there are some rare corner cases where tabs might break. Regarding the design of the program: There are two very different modes, -t and non-t. In -t mode, total line widths are irrelevant but calculated in input() anyway. With UTF-8, that becomes more expensive and harder to bear. In non-t mode, total line widths are the only widths that matter. I think the ideal top-level data structure is as follows: - A list of input lines, stored as a pointer to the first line, either NULL-terminated or together with a variable storing the number of lines. - Each line is a list of fields, again stored as a pointer to the first field, this time definitely NULL-terminated for simplicity. - Each field is a struct containing a char * and an int width. The input() function should generate this complete data structure using getline(3) and mbtowc(3) in a single loop and remember the maximum field width. For the non-t case, only one field will be generated per input line. That way, we only need to parse the input once, we only need to calculate widths once, we only need heavy malloc(3) lifting in one function, and UTF-8 handling will only infect one single function, input(). Probably, we only need one single call to mbtowc(3), one single call to wcwidth(3), and no calls to *mbs*() functions except one in main() for parsing and one in input() for using the -s option. Probably, all the other functions can get away with functions like fputs(3) or printf("...%s..."). I think it would be reasonable to do this refactoring in a first step, maybe together with renaming length to width, then implement UTF-8 support in a second step. Does that make sense to you? I'm adding some specific comments in-line. If any of my comments do not seem helpful to you, just say so, i may have overlooked something. Yours, Ingo > Index: column.c > =================================================================== > RCS file: /cvs/src/usr.bin/column/column.c,v > retrieving revision 1.22 > diff -u -p -r1.22 column.c > --- column.c 3 Nov 2015 04:55:44 -0000 1.22 > +++ column.c 27 Feb 2016 12:21:35 -0000 > @@ -36,15 +36,24 @@ > #include <ctype.h> > #include <err.h> > #include <limits.h> > +#include <locale.h> > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > +#include <wchar.h> > +#include <wctype.h> > + > +struct li { > + int width; > + char *str; > +}; I would probably put the pointer first (bikeshed). > > void c_columnate(void); > void *ereallocarray(void *, size_t, size_t); > void *ecalloc(size_t, size_t); > void input(FILE *); > +int isu8start(unsigned char); > void maketbl(void); > void print(void); > void r_columnate(void); > @@ -54,9 +63,9 @@ int termwidth = 80; /* default terminal > > int entries; /* number of records */ > int eval; /* exit value */ > -int maxlength; /* longest record */ > -char **list; /* array of pointers to records */ > -char *separator = "\t "; /* field separator for table option */ > +int maxwidth; /* longest record */ > +struct li *list; /* array of to records */ > +wchar_t *separator = L"\t "; /* field separator for table option */ > > int > main(int argc, char *argv[]) > @@ -66,6 +75,7 @@ main(int argc, char *argv[]) > int ch, tflag, xflag; > char *p; > const char *errstr; > + int slen; You are using that for strlen(3) and malloc(3), so it should be size_t. > > if (ioctl(1, TIOCGWINSZ, &win) == -1 || !win.ws_col) { > if ((p = getenv("COLUMNS")) && *p != '\0') { > @@ -79,6 +89,8 @@ main(int argc, char *argv[]) > if (pledge("stdio rpath", NULL) == -1) > err(1, "pledge"); > > + setlocale(LC_CTYPE, ""); > + For now, setlocate(3) should be done before pledge(2). > tflag = xflag = 0; > while ((ch = getopt(argc, argv, "c:s:tx")) != -1) > switch(ch) { > @@ -88,7 +100,13 @@ main(int argc, char *argv[]) > errx(1, "%s: %s", errstr, optarg); > break; > case 's': > - separator = optarg; > + slen = strlen(optarg)+1; > + if ((separator = reallocarray(NULL, slen, > + sizeof(*separator))) == NULL) > + err(1, NULL); > + if (mbstowcs(separator, optarg, > + slen * sizeof(*separator)) == (size_t)-1) The third argument must be just slen, not a product. It is the maximum number of wide characters to be stored, not bytes. > + err(1, "Unable to set separator"); Given that the EILSEQ message will be appended, i would make this shorter, something like: err(1, "-s"); > break; > case 't': > tflag = 1; > @@ -125,69 +143,59 @@ main(int argc, char *argv[]) > > if (tflag) > maketbl(); > - else if (maxlength >= termwidth) > + else if (maxwidth >= termwidth) > print(); > else if (xflag) > c_columnate(); > else > r_columnate(); > - exit(eval); > + return eval; > } > > #define TAB 8 > void > c_columnate(void) > { > - int chcnt, col, cnt, endcol, numcols; > - char **lp; > + int col, numcols; > + struct li *lp; > > - maxlength = (maxlength + TAB) & ~(TAB - 1); > - numcols = termwidth / maxlength; > - endcol = maxlength; > - for (chcnt = col = 0, lp = list;; ++lp) { > - chcnt += printf("%s", *lp); > + maxwidth = (maxwidth + TAB) & ~(TAB - 1); > + if ((numcols = termwidth / maxwidth) == 0) > + numcols = 1; That cannot happen. For maxwidth >= termwidth, print() is used instead. > + for (col = 0, lp = list;; ++lp) { > + printf("%s", lp->str); > if (!--entries) > break; > if (++col == numcols) { > - chcnt = col = 0; > - endcol = maxlength; > + col = 0; > putchar('\n'); > } else { > - while ((cnt = ((chcnt + TAB) & ~(TAB - 1))) <= endcol) { > - (void)putchar('\t'); > - chcnt = cnt; > - } > - endcol += maxlength; > + while (lp->width++ < maxwidth) > + (void)putchar(' '); Not sure that change is needed (see above). > } > } > - if (chcnt) > - putchar('\n'); > + putchar('\n'); > } Doesn't that change cause a bogus blank line for empty input? > > void > r_columnate(void) > { > - int base, chcnt, cnt, col, endcol, numcols, numrows, row; > + int base, col, numcols, numrows, row; > > - maxlength = (maxlength + TAB) & ~(TAB - 1); > - numcols = termwidth / maxlength; > - if (numcols == 0) > + maxwidth = (maxwidth + TAB) & ~(TAB - 1); > + if ((numcols = termwidth / maxwidth) == 0) > numcols = 1; Again, cannot happen. > numrows = entries / numcols; > if (entries % numcols) > ++numrows; > > for (row = 0; row < numrows; ++row) { > - endcol = maxlength; > - for (base = row, chcnt = col = 0; col < numcols; ++col) { > - chcnt += printf("%s", list[base]); > + for (base = row, col = 0; col < numcols; ++col) { > + printf("%s", list[base].str); > + while (list[base].width++ < maxwidth) > + (void)putchar(' '); The padding certainly mustn't go before the following last column check, or you produce trailing whitespace. But again, do we really want to expand tabs? > if ((base += numrows) >= entries) > break; > - while ((cnt = ((chcnt + TAB) & ~(TAB - 1))) <= endcol) { > - (void)putchar('\t'); > - chcnt = cnt; > - } > - endcol += maxlength; > } > putchar('\n'); > } > @@ -197,15 +205,15 @@ void > print(void) > { > int cnt; > - char **lp; > + struct li *lp; > > for (cnt = entries, lp = list; cnt--; ++lp) > - (void)printf("%s\n", *lp); > + (void)printf("%s\n", lp->str); > } > > typedef struct _tbl { > - char **list; > - int cols, *len; > + wchar_t **list; > + int cols, *width; > } TBL; Regarding the two levels of parsing, structs, and malloc, see my design comments at the top. > #define DEFCOLS 25 > > @@ -214,47 +222,54 @@ maketbl(void) > { > TBL *t; > int coloff, cnt; > - char *p, **lp; > - int *lens, maxcols = DEFCOLS; > + struct li *lp; > + int *widths, maxcols = DEFCOLS; > TBL *tbl; > - char **cols; > + wchar_t **cols; > + wchar_t *ws, *wws, *last; > > t = tbl = ecalloc(entries, sizeof(TBL)); > cols = ereallocarray(NULL, maxcols, sizeof(char *)); > - lens = ecalloc(maxcols, sizeof(int)); > + widths = ecalloc(maxcols, sizeof(int)); > + if ((ws = reallocarray(NULL, maxwidth, sizeof(*ws))) == NULL) > + err(1, NULL); Seems inconsistent. We should either use ereallocarray() throughout or not at all. There may be more instances in your patch. > for (cnt = 0, lp = list; cnt < entries; ++cnt, ++lp, ++t) { > - for (coloff = 0, p = *lp; (cols[coloff] = strtok(p, separator)); > - p = NULL) > + if (mbstowcs(ws, lp->str, maxwidth * sizeof(*ws)) == (size_t)-1) > + errx(1, "Invalid char on line %d", cnt+1); I don't like that. So if there is a single invalid byte, you lose all the rest of the file? I don't think we can use mbstowcs(3) in such elementary utilities, except maybe for option parsing. > + free(lp->str); > + lp->str = NULL; > + wws = ws; > + for (coloff = 0; (cols[coloff] = wcstok(wws, separator, > + &last)); wws = NULL) { > if (++coloff == maxcols) { > maxcols += DEFCOLS; > cols = ereallocarray(cols, maxcols, > sizeof(char *)); > - lens = ereallocarray(lens, maxcols, > + widths = ereallocarray(widths, maxcols, > sizeof(int)); > - memset(lens + coloff, 0, DEFCOLS * sizeof(int)); > + memset(widths + coloff, 0, DEFCOLS * sizeof(int)); > } > + } > if (coloff == 0) > continue; > - t->list = ecalloc(coloff, sizeof(char *)); > - t->len = ecalloc(coloff, sizeof(int)); > + t->list = ecalloc(coloff, sizeof(*(t->list))); > + t->width = ecalloc(coloff, sizeof(*(t->width))); > for (t->cols = coloff; --coloff >= 0;) { > - t->list[coloff] = cols[coloff]; > - t->len[coloff] = strlen(cols[coloff]); > - if (t->len[coloff] > lens[coloff]) > - lens[coloff] = t->len[coloff]; > + t->list[coloff] = wcsdup(cols[coloff]); > + t->width[coloff] = wcswidth(cols[coloff], > + wcslen(cols[coloff])); > + if (t->width[coloff] > widths[coloff]) > + widths[coloff] = t->width[coloff]; > } > } > for (cnt = 0, t = tbl; cnt < entries; ++cnt, ++t) { > if (t->cols > 0) { > for (coloff = 0; coloff < t->cols - 1; ++coloff) > - (void)printf("%s%*s", t->list[coloff], > - lens[coloff] - t->len[coloff] + 2, " "); > - (void)printf("%s\n", t->list[coloff]); > + (void)printf("%ls%*s", t->list[coloff], > + widths[coloff] - t->width[coloff] + 2, " "); > + (void)printf("%ls\n", t->list[coloff]); > } > } > - free(tbl); > - free(lens); > - free(cols); > } > > #define DEFNUM 1000 > @@ -263,31 +278,56 @@ maketbl(void) > void > input(FILE *fp) > { > - static size_t maxentry = DEFNUM; > - int len; > + static int maxentry = DEFNUM; > + int width = 0, cwidth = 0, size; > char *p, buf[MAXLINELEN]; > + wchar_t wc; > > if (!list) > list = ecalloc(maxentry, sizeof(char *)); > while (fgets(buf, MAXLINELEN, fp)) { Using fgets(3) with a fixed-size buffer, discarding initial parts of long lines, is really unfortunate, in particular given that this utility reads the whole file into memory anyway. So i strongly feel this should be changed to use getline(3), removing the arbitrary line length restriction. Probably, this should be a cleanup step before the main UTF-8 diff, see my design comments at the top. > - for (p = buf; isspace((unsigned char)*p); ++p); > + width = 0, cwidth = 0; > + p = buf; > + do { > + p += cwidth; > + if ((cwidth = mbtowc(&wc, p, MB_CUR_MAX)) == -1) { > + (void) mbtowc(NULL, NULL, MB_CUR_MAX); > + break; > + } > + } while (iswspace(wc)); > if (!*p) > continue; > - if (!(p = strchr(p, '\n'))) { > + > + for (p = buf; *p != '\n' && *p != '\0'; p++) { > + if (isu8start(*p)) { It looks a bit strange to use both isu8start() and mbtowc(3). Usually, if you need mbtowc(3) anyway, you can get away without isu8start() when coding carefully. > + if ((size = mbtowc(&wc, p, MB_CUR_MAX)) == -1) > + mbtowc(NULL, NULL, MB_CUR_MAX); > + if ((cwidth = wcwidth(wc)) >= 0) { That doesn't work. After mbtowc(3) failure, wc contains garbage. > + width += cwidth; > + p += size-1; > + } That doesn't work either. After wcwidth(3) failure, something must be done. Treating non-printable characters as width 0 may be acceptable, but failing to advance p looks like an endless loop to me. > + } else { > + if (*p == '\t') > + width += TAB - width%TAB; That seems wrong. In -t mode, tabs may be used as column seperators, so they don't contribute to the line width like that. > + else if (isprint(*p)) > + width++; > + } > + } > + if (*p != '\n') { > warnx("line too long"); > eval = 1; > continue; > } > *p = '\0'; > - len = p - buf; > - if (maxlength < len) > - maxlength = len; > + if (maxwidth < width) > + maxwidth = width; > if (entries == maxentry) { > maxentry += DEFNUM; > list = ereallocarray(list, maxentry, sizeof(char *)); > memset(list + entries, 0, DEFNUM * sizeof(char *)); > } > - if (!(list[entries++] = strdup(buf))) > + list[entries].width = width; > + if (!(list[entries++].str = strdup(buf))) > err(1, NULL); > } > } > @@ -319,4 +359,10 @@ usage(void) > (void)fprintf(stderr, > "usage: column [-tx] [-c columns] [-s sep] [file ...]\n"); > exit(1); > +} > + > +int > +isu8start(unsigned char c) > +{ > + return MB_CUR_MAX > 1 && (c & (0x80 | 0x40)) == (0x80 | 0x40); > } >