Hi Ted,

Ted Unangst wrote on Fri, Oct 23, 2015 at 09:38:22AM -0400:

> so that works with the diff below.

I agree with the direction for this kind of tool, at least for now.
However, your diff has a few issues, so i improved it, see below.

Any OKs or vetos?

Ted, in case you want to commit, the version below is obviously
OK schwarze@.

> i'm not sure how far down this road we need
> to travel, but i figure it's worth a little exploration.

I think making any valid sequence of single-codepoint characters
work is reasonable, in particular if it just takes 15 lines
of additional code in a utility of 500 lines.


Changes with respect to tedu@'s version:

 * chunk 151 and chunk 158: unchanged
 * chunk 211: new chunk
   Required for the sequence underscore, backspace, multibyte character:
   Mark all the bytes underlined, not just the first one, or the
   multibyte character will be broken.
 * chunk 237 part 1: new change
   Required such that bytes with the high bit set compare equal
   even on signed char architectures.
 * chunk 237 part 2: style tweak
   Actually use the shiny new isu8cont() function,
   do not inline a copy of its code.


Aspects not solved and other comments:

 - The new code runs always.
   In a POSIX locale, text files are not supposed to contain bytes
   with the high bit set, so it is undefined in the first place
   what ul(1) should do.  Of course, we could artificially add yet
   more code (heavy-weight code with setlocale(3) and nl_langinfo(3),
   actually) to gratuitiously mess the file up, but i consider it
   more useful to treat UTF-8 gracefully even when the locale is
   not set, such that ul(1) output is predictable independently of
   the user's locale.
 - character, backspace, different character
   This is not valid backspace encoding for bold or italic,
   so ul(1) is not supposed to handle it.  But at least, it
   no longer produces invalid UTF-8 even in that case.
 - The FreeBSD change with wchar_t (+70 -44 lines) seems
   like overkill to me.
 - Nothing changes with respect to tabs.
   To ul(1), tabs just mean "add enough blanks to advance to the
   next character position that is a multiple of eight".  A backspace
   will then remove the last one of them.  The usefulness of this
   feature may be argued, but that's unrelated to UTF-8.


Index: ul.c
===================================================================
RCS file: /cvs/src/usr.bin/ul/ul.c,v
retrieving revision 1.19
diff -u -p -r1.19 ul.c
--- ul.c        10 Oct 2015 16:15:03 -0000      1.19
+++ ul.c        23 Oct 2015 20:19:17 -0000
@@ -151,6 +151,12 @@ main(int argc, char *argv[])
        exit(0);
 }
 
+int
+isu8cont(unsigned char c)
+{
+       return (c & (0x80 | 0x40)) == 0x80;
+}
+
 void
 mfilter(FILE *f)
 {
@@ -158,8 +164,11 @@ mfilter(FILE *f)
 
        while ((c = getc(f)) != EOF && col < MAXBUF) switch(c) {
        case '\b':
-               if (col > 0)
+               while (col > 0) {
                        col--;
+                       if (!isu8cont(obuf[col].c_char))
+                               break;
+               }
                continue;
        case '\t':
                col = (col+8) & ~07;
@@ -211,9 +220,13 @@ mfilter(FILE *f)
                continue;
 
        case '_':
-               if (obuf[col].c_char)
+               if (obuf[col].c_char != '\0') {
                        obuf[col].c_mode |= UNDERL | mode;
-               else
+                       if (obuf[col].c_char & 0x80)
+                               while (col < maxcol &
+                                   isu8cont(obuf[col+1].c_char))
+                                       obuf[++col].c_mode |= UNDERL | mode;
+               } else
                        obuf[col].c_char = '_';
                /* FALLTHROUGH */
        case ' ':
@@ -237,10 +250,12 @@ mfilter(FILE *f)
                } else if (obuf[col].c_char == '_') {
                        obuf[col].c_char = c;
                        obuf[col].c_mode |= UNDERL|mode;
-               } else if (obuf[col].c_char == c)
+               } else if (obuf[col].c_char == (char)c)
                        obuf[col].c_mode |= BOLD|mode;
                else
                        obuf[col].c_mode = mode;
+               if (col > 0 && isu8cont(c))
+                       obuf[col].c_mode = obuf[col - 1].c_mode;
                col++;
                if (col > maxcol)
                        maxcol = col;

Reply via email to