Hi Ted,

Ted Unangst wrote on Fri, Oct 23, 2015 at 09:38:22AM -0400:

> so that works with the diff below.

I agree with the direction for this kind of tool, at least for now.
However, your diff has a few issues, so i improved it, see below.

Any OKs or vetoes?

Ted, in case you want to commit, the version below is obviously
OK schwarze@.

> i'm not sure how far down this road we need
> to travel, but i figure it's worth a little exploration.

I think making any valid sequence of single-codepoint characters
work is reasonable, in particular if it just takes 15 lines of
additional code in a utility of 500 lines.

Changes with respect to tedu@'s version:

 * chunk 151 and chunk 158: unchanged

 * chunk 211: new chunk
   Required for the sequence underscore, backspace, multibyte
   character: Mark all the bytes underlined, not just the first
   one, or the multibyte character will be broken.

 * chunk 237 part 1: new change
   Required such that bytes with the high bit set compare equal
   even on signed char architectures.

 * chunk 237 part 2: style tweak
   Actually use the shiny new isu8cont() function,
   do not inline a copy of its code.

Aspects not solved and other comments:

 - The new code always runs.
   In a POSIX locale, text files are not supposed to contain
   bytes with the high bit set, so it is undefined in the first
   place what ul(1) should do.  Of course, we could artificially
   add yet more code (heavy-weight code with setlocale(3) and
   nl_langinfo(3), actually) to gratuitously mess the file up,
   but i consider it more useful to treat UTF-8 gracefully even
   when the locale is not set, such that ul(1) output is
   predictable independently of the user's locale.

 - character, backspace, different character
   This is not valid backspace encoding for bold or italic,
   so ul(1) is not supposed to handle it.  But at least, it no
   longer produces invalid UTF-8 even in that case.

 - The FreeBSD change with wchar_t (+70 -44 lines) seems like
   overkill to me.

 - Nothing changes with respect to tabs.
   To ul(1), tabs just mean "add enough blanks to advance to the
   next character position that is a multiple of eight".
   A backspace will then remove the last one of them.
   The usefulness of this feature may be argued, but that's
   unrelated to UTF-8.


Index: ul.c
===================================================================
RCS file: /cvs/src/usr.bin/ul/ul.c,v
retrieving revision 1.19
diff -u -p -r1.19 ul.c
--- ul.c	10 Oct 2015 16:15:03 -0000	1.19
+++ ul.c	23 Oct 2015 20:19:17 -0000
@@ -151,6 +151,12 @@ main(int argc, char *argv[])
 	exit(0);
 }
 
+int
+isu8cont(unsigned char c)
+{
+	return (c & (0x80 | 0x40)) == 0x80;
+}
+
 void
 mfilter(FILE *f)
 {
@@ -158,8 +164,11 @@ mfilter(FILE *f)
 	while ((c = getc(f)) != EOF && col < MAXBUF) switch(c) {
 
 	case '\b':
-		if (col > 0)
+		while (col > 0) {
 			col--;
+			if (!isu8cont(obuf[col].c_char))
+				break;
+		}
 		continue;
 	case '\t':
 		col = (col+8) & ~07;
@@ -211,9 +220,13 @@ mfilter(FILE *f)
 		continue;
 
 	case '_':
-		if (obuf[col].c_char)
+		if (obuf[col].c_char != '\0') {
 			obuf[col].c_mode |= UNDERL | mode;
-		else
+			if (obuf[col].c_char & 0x80)
+				while (col < maxcol &&
+				    isu8cont(obuf[col+1].c_char))
+					obuf[++col].c_mode |= UNDERL | mode;
+		} else
 			obuf[col].c_char = '_';
 		/* FALLTHROUGH */
 	case ' ':
@@ -237,10 +250,12 @@ mfilter(FILE *f)
 		} else if (obuf[col].c_char == '_') {
 			obuf[col].c_char = c;
 			obuf[col].c_mode |= UNDERL|mode;
-		} else if (obuf[col].c_char == c)
+		} else if (obuf[col].c_char == (char)c)
 			obuf[col].c_mode |= BOLD|mode;
 		else
 			obuf[col].c_mode = mode;
+		if (col > 0 && isu8cont(c))
+			obuf[col].c_mode = obuf[col - 1].c_mode;
 		col++;
 		if (col > maxcol)
 			maxcol = col;
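
For anyone who wants to poke at the byte-level logic outside of ul(1),
here is a minimal standalone sketch.  It is not part of the patch.
isu8cont() is copied from the diff; u8_backup() and same_byte() are
hypothetical helper names made up for illustration.  They mirror the
'\b' loop in chunk 158 and the (char)c comparison in chunk 237, but
work on a plain char buffer instead of obuf[].

```c
#include <stddef.h>

/* Same test as in the diff: true for bytes of the form 10xxxxxx. */
static int
isu8cont(unsigned char c)
{
	return (c & (0x80 | 0x40)) == 0x80;
}

/*
 * Hypothetical helper, not in ul.c: step back from byte position col
 * to the first byte of the preceding character, skipping UTF-8
 * continuation bytes the way the '\b' case does.
 */
static size_t
u8_backup(const char *buf, size_t col)
{
	while (col > 0) {
		col--;
		if (!isu8cont((unsigned char)buf[col]))
			break;
	}
	return col;
}

/*
 * Hypothetical helper, not in ul.c: compare a stored char with an int
 * as returned by getc(3).  Without the cast, a stored byte with the
 * high bit set is negative where char is signed, and would never
 * equal the non-negative value in c.
 */
static int
same_byte(char stored, int c)
{
	return stored == (char)c;
}
```

With the three bytes 'a', 0xc3, 0xa9 in a buffer, u8_backup(buf, 3)
yields 1, the offset of the 0xc3 start byte, and u8_backup(buf, 1)
yields 0.  same_byte((char)0xc3, 0xc3) is true whether char is signed
or unsigned.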