Re: ksh(1): vi mode UTF-8 bug

Anton Lindqvist Tue, 30 May 2017 00:57:05 -0700

Hi,

On Mon, May 29, 2017 at 04:16:06PM +0200, Ingo Schwarze wrote:
> Hi,
> 
> Anton Lindqvist wrote on Sun, May 28, 2017 at 06:07:00PM +0200:
> > On Sun, May 28, 2017 at 10:56:19AM +0200, Walter Alejandro Iglesias wrote:
> 
> >> There is still a similar issue when you try to "replace" a utf-8
> >> character (in command mode press 'r' to replace a single character or
> >> 'R' to replace a string).
> 
> > Thanks for the report, please try out the diff below.
> > As I understand the problem: the current code assumes that the character
> > to replace consists of a single byte, which is not true for Unicode
> > characters.
> 
> Correct.  That needs to be improved.
> 
> > When replacing such a character, delete the continuation
> > bytes and then replace the start byte with the replacement.
> > This ensures no continuation bytes are left behind.
> > I made use of putbuf() since it has the side-effect of advancing the
> > cursor.
> > Lastly, adjust the cursor to be positioned on the last replaced
> > character.
> > 
> > NUL-terminating the line buffer is necessary in order for the following
> > to work:
> > 
> > 1. Insert ö
> > 
> > 2. Press esc, h (back one char), ro (replace with o), ax (append x)
> > 
> > Note that replacing a character with a Unicode character does not work
> > either.
> > 
> > Comments? OK?
> > 
> > Index: bin/ksh/vi.c
> > ===================================================================
> > RCS file: /cvs/src/bin/ksh/vi.c,v
> > retrieving revision 1.45
> > diff -u -p -r1.45 vi.c
> > --- bin/ksh/vi.c    28 May 2017 07:27:01 -0000      1.45
> > +++ bin/ksh/vi.c    28 May 2017 15:59:59 -0000
> > @@ -926,13 +926,22 @@ vi_cmd(int argcnt, const char *cmd)
> >                     if (cmd[1] == 0)
> >                             vi_error();
> >                     else {
> > -                           int     n;
> > -
> >                             if (es->cursor + argcnt > es->linelen)
> >                                     return -1;
> 
> These two lines are no longer accurate.  They try to make sure there
> are enough characters under and to the right of the cursor to match
> the number you want to replace (for example, with "2r"), and beep
> otherwise - but they count bytes, which is wrong.


Correct, replacing 'ö' with 2ro is currently valid which is wrong. This
is fixed in the diff below and I added a test capturing this behavior.

> 
> To catch the error condition of an excessive argument, i think you
> first need to iterate to the right, using the c1 variable and isu8cont(),
> and return -1 if you hit the end prematurely.  Do not change anything
> in that case.
> 
> If so far, you succeed, you know you have to replace the range
> [es->cursor, c1].

Thanks for the pointers. I made use of c1 to count the number of
characters and cur denotes the upper limit for the range to replace
expressed in bytes. I've also added another test replacing the Euro sign
which consists of 3 bytes.

> > -                           for (n = 0; n < argcnt; ++n)
> > -                                   es->cbuf[es->cursor + n] = cmd[1];
> > -                           es->cursor += n - 1;
> > +
> > +                           while (argcnt-- > 0) {
> > +                                   for (cur = es->cursor + 1;
> > +                                       cur < es->linelen; cur++)
> > +                                           if (!isu8cont(es->cbuf[cur]))
> > +                                                   break;
> > +                                   if (cur > 1)
> > +                                           del_range(es->cursor, cur - 1);
> 
> Given that you don't know the length (in bytes) of the character
> to insert yet, i think it may be simpler to delete the byte under the
> cursor as well, even though that is slightly inefficient for the ASCII
> case.
> 
> > +                                   putbuf(&cmd[1], 1, 1);
> 
> It seems that here, you may need to measure the length of the character
> to insert in bytes and then call something like
> 
>   putbuf(cmd + 1, #bytes, 0);
> 
> 
> My impression is that the 's' command is likely also affected, but that
> can be fixed in a separate patch.

As for replacement using a UTF-8 character, I made the same conclusion
as stated in your follow-up email. Since we don't know the length of the
character, we can't do the right thing right now.

Index: bin/ksh/vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
retrieving revision 1.45
diff -u -p -r1.45 vi.c
--- bin/ksh/vi.c        28 May 2017 07:27:01 -0000      1.45
+++ bin/ksh/vi.c        30 May 2017 07:54:04 -0000
@@ -926,13 +926,24 @@ vi_cmd(int argcnt, const char *cmd)
                        if (cmd[1] == 0)
                                vi_error();
                        else {
-                               int     n;
-
-                               if (es->cursor + argcnt > es->linelen)
+                               c1 = 0;
+                               for (cur = es->cursor;
+                                   cur < es->linelen; cur++) {
+                                       if (!isu8cont(es->cbuf[cur]))
+                                               c1++;
+                                       if (c1 > argcnt)
+                                               break;
+                               }
+                               if (argcnt > c1)
                                        return -1;
-                               for (n = 0; n < argcnt; ++n)
-                                       es->cbuf[es->cursor + n] = cmd[1];
-                               es->cursor += n - 1;
+
+                               del_range(es->cursor, cur);
+                               while (argcnt-- > 0)
+                                       putbuf(&cmd[1], 1, 0);
+                               while (es->cursor > 0)
+                                       if (!isu8cont(es->cbuf[--es->cursor]))
+                                               break;
+                               es->cbuf[es->linelen] = '\0';
                        }
                        break;
 
Index: regress/bin/ksh/vi/vi.sh
===================================================================
RCS file: /cvs/src/regress/bin/ksh/vi/vi.sh,v
retrieving revision 1.2
diff -u -p -r1.2 vi.sh
--- regress/bin/ksh/vi/vi.sh    28 May 2017 07:27:01 -0000      1.2
+++ regress/bin/ksh/vi/vi.sh    30 May 2017 07:54:04 -0000
@@ -134,6 +134,9 @@ testseq "abcdef\00334h2Rxy\0033" " $ abc
 
 # r: Replace character.
 testseq "abcd\00332h2rxiy" " $ abcd\b\b\bxx\byxd\b\b\r\naxyxd"
+testseq "\0303\0266\0033ro" " $ \0303\0266\bo \b\b\r\no"
+testseq "\0342\0202\0254\0033ro" " $ \0342\0202\0254\bo  \b\b\b\r\no"
+testseq "\0303\0266\00332ro" " $ \0303\0266\b\a\r\n\0303\0266"
 
 # S: Substitute whole line.
 testseq "oldst\0033Snew" " $ oldst\b\b\b\b\b     \r $ new\r\nnew"

Re: ksh(1): vi mode UTF-8 bug

Reply via email to