Hi Nicholas, Nicholas Marriott wrote on Tue, Mar 12, 2019 at 11:05:07AM +0000:
> Sorry, ok nicm if you haven't committed this already. No i hadn't, but now i have, thank you very much for checking the patch. Here is the next step, cleaning up most parts of the crucial function do_append() which decides how to handle backspaces and overstrikes, in particular what to print literally and what to encode. Four bad function calls have to be replaced here: 1. The call to the bad function get_wchar() is used to find the character already present at the current position. Replacing it with mbtowc(3) also eliminates LWCHAR. In case of failure - which may no be likely since at least usually, the linebuf[] will contain valid UTF-8 - setting prev_ch to NUL makes sure that whatever is already there will simply be replaced. I think linebuf[curr] cannot be NUL at this point because only backc() sets overstrike, and just having backed up, *something* will be there. But even if linebuf[curr] *is* NUL and hence mbtowc(3) returns 0, the new code should do the right thing and simply append. 2. The calls to the bad functions is_composing_char() and is_combining_char() have to be replaced with wcwidth(3). That also eliminates the second call to get_wchar(). If wcwidth(3) fails (i.e. ch is not printable), we simply have to treat it like a width 1 character. 3. The call to the bad function control_char() has to be eliminated. Start by considering two cases separately. In utf_mode, we have is_ascii_char(ch), i.e. ch <= 0x7f. In that case, control_char() is just iscntrl(3), which is identical to !isprint(3). In !utf_mode, we know that do_append() only gets called with ch <= 0xff, and control_char() is iscntrl(3) || !isprint(3). In that case and expression, the iscntrl(3) is obviously redundant. So we can simply use !isprint(3) for both cases, which is also a logical way of expressing the condition because this "else if" clause is supposed to handle non-printable single-byte characters. 4. The call to the bad function is_ubin_char() intends to handle non-printable Unicode characters, so the right function to use is simply !iswprint(3) from <wctype.h>. One part of do_append() still remains broken: It still calls prutfchar() which is broken because it calls is_ubin_char() and control_char(), because it accepts some invalid input, and because it generates some invalid output. But cleaning that up is a separate matter. Here are byte sequences i tested: * ASCII char backspace same ASCII char again -> bold * underscore backspace underscore in between -> bold * ASCII char backspace underscore -> italic * underscore backspace underscore in between -> italic * underscore backspace ASCII char -> italic * UTF-8 character backspace same UTF-8 again -> bold * UTF-8 character backspace undercore -> italic * underscore backspace UTF-8 character -> italic * invalid byte backspace underscore -> <88>^H_ * ASCII char UTF-8 combining accent backspace same ASCII char and combining accent again -> bold * underscore backspace ASCII char UTF-8 combining accent -> italic * UTF-8 character UTF-8 combining accent backspace same UTF-8 character and combining accent again -> bold * underscore backspace UTF-8 character UTF-8 combining accent -> italic * ASCII char UTF-8 combining accent backspace non-printable ASCII char -> accented char invisible, overwritten by the encoded representation of the non-printable char * ASCII char UTF-8 combining accent backspace non-printable UTF-8 character -> accented char invisible, overwritten by the encoded representation of the non-printable character * ASCII controls show as caret-letter * UTF-8 controls show as <U+NNNN> I tested both with LC_CTYPE=en_US.UTF-8 and LC_CTYPE=C. Note the following case is dubious: * ASCII char UTF-8 combining accent backspace underscore Arguably, this should underline the accented character, but actually, the accent disappears with both the old and the new code because it is overwritten by the next character. I'm not trying to fix that with this patch. * same for UTF-8 character UTF-8 combining accent backspace underscore OK? Ingo Index: line.c =================================================================== RCS file: /cvs/src/usr.bin/less/line.c,v retrieving revision 1.27 diff -u -p -r1.27 line.c --- line.c 12 Mar 2019 11:59:24 -0000 1.27 +++ line.c 12 Mar 2019 14:33:55 -0000 @@ -16,6 +16,7 @@ */ #include <wchar.h> +#include <wctype.h> #include "charset.h" #include "less.h" @@ -774,8 +775,8 @@ retry: static int do_append(LWCHAR ch, char *rep, off_t pos) { + wchar_t prev_ch; int a; - LWCHAR prev_ch; a = AT_NORMAL; @@ -813,7 +814,10 @@ do_append(LWCHAR ch, char *rep, off_t po */ overstrike = utf_mode ? -1 : 0; /* To be correct, this must be a base character. */ - prev_ch = get_wchar(linebuf + curr); + if (mbtowc(&prev_ch, linebuf + curr, MB_CUR_MAX) == -1) { + (void)mbtowc(NULL, NULL, MB_CUR_MAX); + prev_ch = L'\0'; + } a = attr[curr]; if (ch == prev_ch) { /* @@ -835,7 +839,7 @@ do_append(LWCHAR ch, char *rep, off_t po } else { a |= AT_BOLD; } - } else if (ch == '_') { + } else if (ch == '_' && prev_ch != L'\0') { a |= AT_UNDERLINE; ch = prev_ch; rep = linebuf + curr; @@ -844,8 +848,7 @@ do_append(LWCHAR ch, char *rep, off_t po } /* Else we replace prev_ch, but we keep its attributes. */ } else if (overstrike < 0) { - if (is_composing_char(ch) || - is_combining_char(get_wchar(linebuf + curr), ch)) { + if (wcwidth(ch) == 0) { /* Continuation of the same overstrike. */ if (curr > 0) a = attr[curr - 1] & (AT_UNDERLINE | AT_BOLD); @@ -868,7 +871,8 @@ do_append(LWCHAR ch, char *rep, off_t po return (1); break; } - } else if ((!utf_mode || is_ascii_char(ch)) && control_char((char)ch)) { + } else if ((!utf_mode || is_ascii_char(ch)) && + !isprint((unsigned char)ch)) { do_control_char: if (ctldisp == OPT_ON || (ctldisp == OPT_ONPLUS && ch == ESC)) { @@ -881,7 +885,7 @@ do_control_char: if (store_prchar(ch, pos)) return (1); } - } else if (utf_mode && ctldisp != OPT_ON && is_ubin_char(ch)) { + } else if (utf_mode && ctldisp != OPT_ON && !iswprint(ch)) { char *s; s = prutfchar(ch);