On Fri, Apr 11, 2025 at 08:28:38PM +0200, Ingo Schwarze wrote:
> Hi Walter,
>
> On Sat, Apr 05, 2025 at 01:33:10PM +0200, Walter Alejandro Iglesias wrote:
>
> > I discovered another issue (in the unpatched ksh).
>
> Since i hate posting UTF-8 to mailing lists, let me paraphrase your
> finding such that i do not need to quote your (excessively verbose)
> example.
>
> Put ksh(1) into VI command line editing mode.
> Type a line containing any UTF-8 character (just a two-byte line
> consisting of a single character in enough).
> Put anything into the yank buffer (even a single ASCII character is enough).
> Put the cursor on any UTF-8 character (technically, that means the
> code implementating the shell has the cursor sitting on the first
> byte of that UTF-8 character, but that is an implementation detail
> not visible to the user).
> Issue the paste (p) command.
>
> The result is that the content of the yank buffer is inserted into the
> middle of the UTF-8 character. Fur example, if you use the UTF-8
> character 0xc3 0xa9 (e-accent aigu) and the paste buffer 0x61 (letter a),
> then you get 0xc3 0x61 0xa9, which is not a valid UTF-8 string.
>
> Consequently, i do confirm that you found a bug.
Your long and detailed explanations reflect your methodical thinking,
which is undoubtedly less prone to error, but this has its downside:
methodical and logical thinking isn't always effective in uncovering
errors and finding solutions to problems, at least not in a first stage.
In every creative endeavor, there's a chaotic first phase, a famous
Argentine writer, Jorge Luis Borges, quoted advice from his father, who
was also a writer: "First, unleash your imagination. Then, the axe."
Also, we have to take into account how lazy people are to read these
days, for example, from what you tell me below, I suspect you didn't
read the entire thread, especially the first message[1].
>
>
> Walter Alejandro Iglesias wrote on Fri, Apr 11, 2025 at 12:18:22PM +0200:
>
> > Please excuse me if I'm a bit chaotic.
>
> On the one hand, often enough bad patches trigger good ones,
> so do not panic. On the other hand, understand trying to fix stuff
> and sending bad patches as a valuable learning experience.
I fully agree with what you're saying[2], but reactions from other
developers in more than one occasion in these lists led me to think that
not everyone here agrees with this.
>
> > I don't do this every day;
>
> I don't do so any longer either; it has been a few years that i looked
> at this code.
>
> > I have trouble understanding the code at first.
>
> It is indeed mostly devoid of comments. On the other hand, much of it
> is pretty straightforward. So while a few comments here and there
> might help to clarify some of the less obvious aspects, i don't
> think we need comments all over the place.
I thought the same thing too.
>
> > This seems to work:
> >
> > Index: vi.c
> > ===================================================================
> > RCS file: /cvs/src/bin/ksh/vi.c,v
> > diff -u -p -r1.60 vi.c
> > --- vi.c 12 Mar 2021 02:10:25 -0000 1.60
> > +++ vi.c 11 Apr 2025 09:59:40 -0000
> > @@ -834,8 +834,12 @@ vi_cmd(int argcnt, const char *cmd)
> >
> > case 'p':
>
> /* What follows is the implementation of the paste (p) command. */
> (Clear enough without a comment.)
>
> > modified = 1; hnum = hlast;
> > - if (es->linelen != 0)
> > + if (es->linelen != 0) {
> > es->cursor++;
>
> /* This advances the cursor to the next byte.
> Needed because the subsequent putbuf() insertion inserts left
> of the cursor, but we want to insert to the right of it.
> You correctly recognized that the problem is that advancing
> by a single byte is not enough if we are sitting on the first
> byte of a UTF-8 character. */
>
> > + while (!isascii(es->cbuf[es->cursor]) &&
> > + es->cursor < es->linelen)
> > + es->cursor++;
>
> That, however, i totally wrong. I conclude that from mere code inspection
> and i'm not even going to test it. Imagine we are sitting at the
> beginning of a long string of UTF-8 characters. This will hurry
> past the whole long string because none of the bytes in this long
> string is ASCII.
>
> So instead of inserting after the current character and before the next
> character, you would now skip potentially many UTF-8 characters and
> insert before the next ASCII character, perhaps far to the right.
You're right on this point. This is where methodical thinking (the axe)
comes to do its job. :-)
>
>
> So what is needed instead? This is a faily standard task:
>
> If we sit on a UTF-8 start byte, advance past the whole UTF-8
> character.
>
> So what we need is
>
> while (es->cursor < es->linelen && isu8cont(es->cbuf[es->cursor]))
> es->cursor++;
Curiously, this is almost what I tried at first, except for I didn't
take in care to limit the loop to linelen. To be precise, I did this:
while (isu8cont(es->cbuf[es->cursor]))
es->cursor++;
This gave me segfaults, now it's obvious to me why but then I didn't
understand, so I looked for a solution with isspace() (aware that there
was something sloppy about this solution.)
>
> But that can be expressed in a more compact way, see below.
>
> A few lines down, the es->cursor-- that you do not change is also
> wrong. After the insertion, the curser sits after the insertion,
> but we want it on the last letter of the insertion.
>
> > + }
> > while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
> > ;
> > if (es->cursor != 0)
> > @@ -1195,8 +1199,7 @@ domove(int argcnt, const char *cmd, int
> > return -1;
> > ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
> > if (!sub)
> > - while (isu8cont((unsigned char)es->cbuf[--ncursor]))
> > - continue;
> > + --ncursor;
> > break;
>
> This change breaks the "move to end of word" (e) command.
Could you give me an example? (Read my following paragraph, please.)
> It seems unrelated because your example only uses "ye", that is,
> "e" as a subcommand, but your code change only changes the case
> where "e" is stand-alone, i.e. *not* a subcommand. Consequently,
> i do not understand what you are trying to achieve with this change
> in domove().
[1] This is why I suspect you didn't read my other messages. Thanks to
an explanation from Lucas I found the second issue (the 'p' command
one.) As I explain in my messages, my last patch was intended to fix
both issues.
About the first issue, the 'e' command one. In the following part of
domove() I still don't understand what the while() conditional is for:
case 'e':
case 'E':
if (!sub && es->cursor + 1 >= es->linelen)
return -1;
ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
if (!sub)
while (isu8cont((unsigned char)es->cbuf[--ncursor]))
continue;
break;
In any case, if you preffer not to remove that conditional the following
diff also fix the first issue:
Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.60 vi.c
--- vi.c 12 Mar 2021 02:10:25 -0000 1.60
+++ vi.c 12 Apr 2025 08:06:14 -0000
@@ -1194,9 +1194,11 @@ domove(int argcnt, const char *cmd, int
if (!sub && es->cursor + 1 >= es->linelen)
return -1;
ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
- if (!sub)
- while (isu8cont((unsigned char)es->cbuf[--ncursor]))
+ if (!sub) {
+ while (isu8cont((unsigned char)es->cbuf[ncursor]))
continue;
+ --ncursor;
+ }
break;
case 'f':
For others testing convenience I paste a diff at the bottom including
this and your diff together.
>
> The following patch is not yet complete and only survived minimal,
> but not yet sufficient testing. For example, the 'P' command
> right below obviously suffers from the same problem.
As far as I tested, I didn't notice the 'P' failing with the same error.
That's why I didn't touch anything there.
>
> Even if i would get OKs for the following patch right now,
As far as I tested it, it seems that your patch effectively solves the
second issue.
> i would
> prefer waiting until after release. We have clearly lived with this
> bug fur years, so it is not release-critical, and the risk of breaking
> the shell so close to release does not feel acceptable.
>
> As a final remark, this patch illustrates what i said a few days ago.
> The places in the code affected by these kinds of issues are not
> easy to find. There may be no more conspicious indications in the
> code than an innocuous-looking ++ or -- operator.
>
> Yours,
> Ingo
Thank you, Ingo! I'm glad to have your feedback again. I thought you'd
gotten tired of me. :-)
[2] I don't know if you remember, but a discussion I had with you
privately *many* years ago where you expressed this approach was
what encourage me to post my mail(1) patches in tech@ years later.
That's why your first reaction back then seemed contradictory to me.
Final diff solving both issues:
Index: vi.c
===================================================================
RCS file: /cvs/src/bin/ksh/vi.c,v
diff -u -p -r1.60 vi.c
--- vi.c 12 Mar 2021 02:10:25 -0000 1.60
+++ vi.c 12 Apr 2025 08:10:28 -0000
@@ -834,12 +834,16 @@ vi_cmd(int argcnt, const char *cmd)
case 'p':
modified = 1; hnum = hlast;
- if (es->linelen != 0)
- es->cursor++;
+ /* Insert after the current character. */
+ while (es->cursor < es->linelen)
+ if (!isu8cont(es->cbuf[++es->cursor]))
+ break;
while (putbuf(ybuf, yanklen, 0) == 0 && --argcnt > 0)
;
- if (es->cursor != 0)
- es->cursor--;
+ /* Back up to the last character inserted. */
+ while (es->cursor != 0)
+ if (!isu8cont(es->cbuf[--es->cursor]))
+ break;
if (argcnt != 0)
return -1;
break;
@@ -1194,9 +1198,11 @@ domove(int argcnt, const char *cmd, int
if (!sub && es->cursor + 1 >= es->linelen)
return -1;
ncursor = (*cmd == 'e' ? endword : Endword)(argcnt);
- if (!sub)
- while (isu8cont((unsigned char)es->cbuf[--ncursor]))
+ if (!sub) {
+ while (isu8cont((unsigned char)es->cbuf[ncursor]))
continue;
+ --ncursor;
+ }
break;
case 'f':
--
Walter