less(1) UTF-8 cleanup: do_append()

Ingo Schwarze Tue, 12 Mar 2019 07:57:03 -0700

Hi Nicholas,

Nicholas Marriott wrote on Tue, Mar 12, 2019 at 11:05:07AM +0000:


> Sorry, ok nicm if you haven't committed this already.

No i hadn't, but now i have, thank you very much for checking the patch.

Here is the next step, cleaning up most parts of the crucial function
do_append() which decides how to handle backspaces and overstrikes,
in particular what to print literally and what to encode.

Four bad function calls have to be replaced here:

 1. The call to the bad function get_wchar() is used to find the
    character already present at the current position.  Replacing
    it with mbtowc(3) also eliminates LWCHAR.  In case of failure -
    which may no be likely since at least usually, the linebuf[]
    will contain valid UTF-8 - setting prev_ch to NUL makes sure
    that whatever is already there will simply be replaced.
    I think linebuf[curr] cannot be NUL at this point because only
    backc() sets overstrike, and just having backed up, *something*
    will be there.  But even if linebuf[curr] *is* NUL and hence
    mbtowc(3) returns 0, the new code should do the right thing and
    simply append.

 2. The calls to the bad functions is_composing_char() and
    is_combining_char() have to be replaced with wcwidth(3).
    That also eliminates the second call to get_wchar().
    If wcwidth(3) fails (i.e. ch is not printable), we simply have
    to treat it like a width 1 character.

 3. The call to the bad function control_char() has to be eliminated.
    Start by considering two cases separately.  In utf_mode, we
    have is_ascii_char(ch), i.e. ch <= 0x7f.  In that case,
    control_char() is just iscntrl(3), which is identical to
    !isprint(3).  In !utf_mode, we know that do_append() only gets
    called with ch <= 0xff, and control_char() is iscntrl(3) ||
    !isprint(3).  In that case and expression, the iscntrl(3) is
    obviously redundant.  So we can simply use !isprint(3) for both
    cases, which is also a logical way of expressing the condition
    because this "else if" clause is supposed to handle non-printable
    single-byte characters.

 4. The call to the bad function is_ubin_char() intends to handle
    non-printable Unicode characters, so the right function to use
    is simply !iswprint(3) from <wctype.h>.

One part of do_append() still remains broken: It still calls
prutfchar() which is broken because it calls is_ubin_char() and
control_char(), because it accepts some invalid input, and because
it generates some invalid output.  But cleaning that up is a separate
matter.

Here are byte sequences i tested:

 * ASCII char backspace same ASCII char again -> bold
 * underscore backspace underscore in between -> bold
 * ASCII char backspace underscore            -> italic
 * underscore backspace underscore in between -> italic
 * underscore backspace ASCII char            -> italic
 * UTF-8 character backspace same UTF-8 again -> bold
 * UTF-8 character backspace undercore        -> italic
 * underscore backspace UTF-8 character       -> italic
 * invalid byte backspace underscore          -> <88>^H_
 * ASCII char UTF-8 combining accent backspace
   same ASCII char and combining accent again -> bold
 * underscore backspace ASCII char
   UTF-8 combining accent                     -> italic
 * UTF-8 character UTF-8 combining accent
   backspace same UTF-8 character
   and combining accent again                 -> bold
 * underscore backspace UTF-8 character
   UTF-8 combining accent                     -> italic
 * ASCII char UTF-8 combining accent backspace
   non-printable ASCII char
   -> accented char invisible, overwritten by the encoded
      representation of the non-printable char
 * ASCII char UTF-8 combining accent backspace
   non-printable UTF-8 character
   -> accented char invisible, overwritten by the encoded
      representation of the non-printable character
 * ASCII controls show as caret-letter
 * UTF-8 controls show as <U+NNNN>

I tested both with LC_CTYPE=en_US.UTF-8 and LC_CTYPE=C.

Note the following case is dubious:

 * ASCII char UTF-8 combining accent backspace underscore
   Arguably, this should underline the accented character,
   but actually, the accent disappears with both the old
   and the new code because it is overwritten by the next
   character.  I'm not trying to fix that with this patch.
 * same for UTF-8 character UTF-8 combining accent backspace underscore

OK?
  Ingo


Index: line.c
===================================================================
RCS file: /cvs/src/usr.bin/less/line.c,v
retrieving revision 1.27
diff -u -p -r1.27 line.c
--- line.c      12 Mar 2019 11:59:24 -0000      1.27
+++ line.c      12 Mar 2019 14:33:55 -0000
@@ -16,6 +16,7 @@
  */
 
 #include <wchar.h>
+#include <wctype.h>
 
 #include "charset.h"
 #include "less.h"
@@ -774,8 +775,8 @@ retry:
 static int
 do_append(LWCHAR ch, char *rep, off_t pos)
 {
+       wchar_t prev_ch;
        int a;
-       LWCHAR prev_ch;
 
        a = AT_NORMAL;
 
@@ -813,7 +814,10 @@ do_append(LWCHAR ch, char *rep, off_t po
                 */
                overstrike = utf_mode ? -1 : 0;
                /* To be correct, this must be a base character.  */
-               prev_ch = get_wchar(linebuf + curr);
+               if (mbtowc(&prev_ch, linebuf + curr, MB_CUR_MAX) == -1) {
+                       (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+                       prev_ch = L'\0';
+               }
                a = attr[curr];
                if (ch == prev_ch) {
                        /*
@@ -835,7 +839,7 @@ do_append(LWCHAR ch, char *rep, off_t po
                        } else {
                                a |= AT_BOLD;
                        }
-               } else if (ch == '_') {
+               } else if (ch == '_' && prev_ch != L'\0') {
                        a |= AT_UNDERLINE;
                        ch = prev_ch;
                        rep = linebuf + curr;
@@ -844,8 +848,7 @@ do_append(LWCHAR ch, char *rep, off_t po
                }
                /* Else we replace prev_ch, but we keep its attributes.  */
        } else if (overstrike < 0) {
-               if (is_composing_char(ch) ||
-                   is_combining_char(get_wchar(linebuf + curr), ch)) {
+               if (wcwidth(ch) == 0) {
                        /* Continuation of the same overstrike.  */
                        if (curr > 0)
                                a = attr[curr - 1] & (AT_UNDERLINE | AT_BOLD);
@@ -868,7 +871,8 @@ do_append(LWCHAR ch, char *rep, off_t po
                                return (1);
                        break;
                }
-       } else if ((!utf_mode || is_ascii_char(ch)) && control_char((char)ch)) {
+       } else if ((!utf_mode || is_ascii_char(ch)) &&
+           !isprint((unsigned char)ch)) {
 do_control_char:
                if (ctldisp == OPT_ON ||
                    (ctldisp == OPT_ONPLUS && ch == ESC)) {
@@ -881,7 +885,7 @@ do_control_char:
                        if (store_prchar(ch, pos))
                                return (1);
                }
-       } else if (utf_mode && ctldisp != OPT_ON && is_ubin_char(ch)) {
+       } else if (utf_mode && ctldisp != OPT_ON && !iswprint(ch)) {
                char *s;
 
                s = prutfchar(ch);

less(1) UTF-8 cleanup: do_append()

Reply via email to