Hi Nicholas,
Nicholas Marriott wrote on Tue, Mar 12, 2019 at 11:05:07AM +0000:
> Sorry, ok nicm if you haven't committed this already.
No i hadn't, but now i have, thank you very much for checking the patch.
Here is the next step, cleaning up most parts of the crucial function
do_append() which decides how to handle backspaces and overstrikes,
in particular what to print literally and what to encode.
Four bad function calls have to be replaced here:
1. The call to the bad function get_wchar() is used to find the
character already present at the current position. Replacing
it with mbtowc(3) also eliminates LWCHAR. In case of failure -
which may no be likely since at least usually, the linebuf[]
will contain valid UTF-8 - setting prev_ch to NUL makes sure
that whatever is already there will simply be replaced.
I think linebuf[curr] cannot be NUL at this point because only
backc() sets overstrike, and just having backed up, *something*
will be there. But even if linebuf[curr] *is* NUL and hence
mbtowc(3) returns 0, the new code should do the right thing and
simply append.
2. The calls to the bad functions is_composing_char() and
is_combining_char() have to be replaced with wcwidth(3).
That also eliminates the second call to get_wchar().
If wcwidth(3) fails (i.e. ch is not printable), we simply have
to treat it like a width 1 character.
3. The call to the bad function control_char() has to be eliminated.
Start by considering two cases separately. In utf_mode, we
have is_ascii_char(ch), i.e. ch <= 0x7f. In that case,
control_char() is just iscntrl(3), which is identical to
!isprint(3). In !utf_mode, we know that do_append() only gets
called with ch <= 0xff, and control_char() is iscntrl(3) ||
!isprint(3). In that case and expression, the iscntrl(3) is
obviously redundant. So we can simply use !isprint(3) for both
cases, which is also a logical way of expressing the condition
because this "else if" clause is supposed to handle non-printable
single-byte characters.
4. The call to the bad function is_ubin_char() intends to handle
non-printable Unicode characters, so the right function to use
is simply !iswprint(3) from <wctype.h>.
One part of do_append() still remains broken: It still calls
prutfchar() which is broken because it calls is_ubin_char() and
control_char(), because it accepts some invalid input, and because
it generates some invalid output. But cleaning that up is a separate
matter.
Here are byte sequences i tested:
* ASCII char backspace same ASCII char again -> bold
* underscore backspace underscore in between -> bold
* ASCII char backspace underscore -> italic
* underscore backspace underscore in between -> italic
* underscore backspace ASCII char -> italic
* UTF-8 character backspace same UTF-8 again -> bold
* UTF-8 character backspace undercore -> italic
* underscore backspace UTF-8 character -> italic
* invalid byte backspace underscore -> <88>^H_
* ASCII char UTF-8 combining accent backspace
same ASCII char and combining accent again -> bold
* underscore backspace ASCII char
UTF-8 combining accent -> italic
* UTF-8 character UTF-8 combining accent
backspace same UTF-8 character
and combining accent again -> bold
* underscore backspace UTF-8 character
UTF-8 combining accent -> italic
* ASCII char UTF-8 combining accent backspace
non-printable ASCII char
-> accented char invisible, overwritten by the encoded
representation of the non-printable char
* ASCII char UTF-8 combining accent backspace
non-printable UTF-8 character
-> accented char invisible, overwritten by the encoded
representation of the non-printable character
* ASCII controls show as caret-letter
* UTF-8 controls show as <U+NNNN>
I tested both with LC_CTYPE=en_US.UTF-8 and LC_CTYPE=C.
Note the following case is dubious:
* ASCII char UTF-8 combining accent backspace underscore
Arguably, this should underline the accented character,
but actually, the accent disappears with both the old
and the new code because it is overwritten by the next
character. I'm not trying to fix that with this patch.
* same for UTF-8 character UTF-8 combining accent backspace underscore
OK?
Ingo
Index: line.c
===================================================================
RCS file: /cvs/src/usr.bin/less/line.c,v
retrieving revision 1.27
diff -u -p -r1.27 line.c
--- line.c 12 Mar 2019 11:59:24 -0000 1.27
+++ line.c 12 Mar 2019 14:33:55 -0000
@@ -16,6 +16,7 @@
*/
#include <wchar.h>
+#include <wctype.h>
#include "charset.h"
#include "less.h"
@@ -774,8 +775,8 @@ retry:
static int
do_append(LWCHAR ch, char *rep, off_t pos)
{
+ wchar_t prev_ch;
int a;
- LWCHAR prev_ch;
a = AT_NORMAL;
@@ -813,7 +814,10 @@ do_append(LWCHAR ch, char *rep, off_t po
*/
overstrike = utf_mode ? -1 : 0;
/* To be correct, this must be a base character. */
- prev_ch = get_wchar(linebuf + curr);
+ if (mbtowc(&prev_ch, linebuf + curr, MB_CUR_MAX) == -1) {
+ (void)mbtowc(NULL, NULL, MB_CUR_MAX);
+ prev_ch = L'\0';
+ }
a = attr[curr];
if (ch == prev_ch) {
/*
@@ -835,7 +839,7 @@ do_append(LWCHAR ch, char *rep, off_t po
} else {
a |= AT_BOLD;
}
- } else if (ch == '_') {
+ } else if (ch == '_' && prev_ch != L'\0') {
a |= AT_UNDERLINE;
ch = prev_ch;
rep = linebuf + curr;
@@ -844,8 +848,7 @@ do_append(LWCHAR ch, char *rep, off_t po
}
/* Else we replace prev_ch, but we keep its attributes. */
} else if (overstrike < 0) {
- if (is_composing_char(ch) ||
- is_combining_char(get_wchar(linebuf + curr), ch)) {
+ if (wcwidth(ch) == 0) {
/* Continuation of the same overstrike. */
if (curr > 0)
a = attr[curr - 1] & (AT_UNDERLINE | AT_BOLD);
@@ -868,7 +871,8 @@ do_append(LWCHAR ch, char *rep, off_t po
return (1);
break;
}
- } else if ((!utf_mode || is_ascii_char(ch)) && control_char((char)ch)) {
+ } else if ((!utf_mode || is_ascii_char(ch)) &&
+ !isprint((unsigned char)ch)) {
do_control_char:
if (ctldisp == OPT_ON ||
(ctldisp == OPT_ONPLUS && ch == ESC)) {
@@ -881,7 +885,7 @@ do_control_char:
if (store_prchar(ch, pos))
return (1);
}
- } else if (utf_mode && ctldisp != OPT_ON && is_ubin_char(ch)) {
+ } else if (utf_mode && ctldisp != OPT_ON && !iswprint(ch)) {
char *s;
s = prutfchar(ch);