Hi

On Fri, May 19, 2017 at 10:23:08PM +0200, Ingo Schwarze wrote:
> Hi Nicholas,
>
> Nicholas Marriott wrote on Fri, May 19, 2017 at 07:04:53PM +0100:
>
> > Perhaps I haven't understood what you are saying correctly,
>
> What matters most is that sending an incomplete character
> followed by U+0008 (ASCII BACKSPACE) is a no-op, both in the sense
> that it doesn't change the line being edited and that it doesn't
> change the display. All terminals you mentioned seem to conform
> to that according to my testing, except tmux.

They don't all do the same thing for me. I am doing this, the same as
ksh does:

printf 'a\010\342a\010\010\342\211a\010\010\342\211\240a\010\n'

And the terminals behave differently.

> > but I don't think it is possible to send control characters or
> > any other invalid UTF-8 bytes inside UTF-8 characters and safely
> > predict what the terminal will do. How about these examples:
>
> If the invalid bytes are still present by the time the line is sent
> off for processing (like in your example printf '\343\203\n'), then
> it is indeed hard to predict what random terminals will do, though

I don't know what you mean by "sent off for processing".

I think you might be under some misapprehension. \010 (ASCII BS) is
just a cursor positioning sequence, all it does is move the cursor one
position to the left. \n is the same, it moves the cursor one line down.

The problem here is that ksh assumes a partial UTF-8 character (whether
one byte or two) will also move the cursor by one position, but that is
not always the case. Not for tmux, and - for me - not for other
terminals.

Are you thinking of ICANON? That is a kernel mechanism, and when it is
in use, the backspace will never reach tmux. But ksh doesn't appear to
use it.

> i would argue that xterm's behaviour is correct (print one substitution
> glyph for each incomplete character, or if bytes don't form even
> incomplete characters, then one for each such invalid byte).

So you think the correct behaviour is to replace invalid sequences by
U+FFFD? tmux could do that, but it won't fix ksh for other terminals.
ksh is making an assumption about how terminals will behave that is not
correct.

> urxvt is clearly broken:
>
>   $ printf '\343\203x\n'
>
> prints U+00E3 x linefeed; i have no idea what it does to garble
> 0xe383 into 0xc3a3. Maybe some naive misparsing, or spewing out
> incomplete parsing state in some inconsistent way.
>
> gnome-terminal and konsole print one replacement character for each
> invalid byte, even if bytes form an incomplete character. Maybe
> not outright wrong, but arguably a bit confusing.

This doesn't sound like all the terminals doing the same thing.

> So yeah, if lines containing incomplete sequences *when they are
> sent off* misformat with tmux and gnome-terminal or konsole, i
> wouldn't call that tmux'es fault, and i agree there is little that
> can be done about it.

There is no "sent off". What you send is what you get, invalid
characters, backspaces, everything.

>
> > Having tmux ignore the whole lot seems like a relatively sensible
> > course.
>
> Well, what tmux currently does is making sure that everything gets
> broken in the maximum possible way on every terminal, even if the
> line that is finally sent off is completely correct.
>
> > The only other alternative would be to substitute U+FFFD.
>
> Why not just pass the bytes through? I don't think it's the job
> of a terminal multiplexer to mess with individual bytes. It's the
> job of the final terminal doing the display to select glyphs and
> place them, for printable characters, for non-printable characters,
> for incomplete characters, and for invalid bytes.

No no, it is the job of tmux to interpret what it receives from the
application and make sure it shows the same no matter what terminal is
outside. If terminals outside behave differently for the same output,
tmux can't keep its own internal state correct, so the display will be
mangled.

tmux has to understand everything it receives, there is almost no case
where we can just pass through.

>
> > But that seems iffy too - U+FFFD is width 1, but what if the
> > application is expecting a width 2?
>
> By definition, incomplete characters and invalid bytes don't
> have a width, so it doesn't matter what the application wanted
> (for example, which character the user intended to type but
> didn't finish).
> What matters is how wide the replacement
> glyph will look on the final terminal. In that respect, we
> cannot help making an assumption, and "incomplete sequences
> and invalid bytes are displayed as U+FFFD (i.e. width 1)"
> seems about the best we can do. That may be slightly off
> for gnome-terminal and konsole, but i don't see how that can
> be helped.
>
> Anyway, these subtleties of invalid bytes that *remain* are not
> the main inconvenience in practice. What matters more is that
> tmux breaks even if incomplete characters are deleted again
> with backspaces and never sent off.

Backspace does not delete anything, it just moves the cursor.
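
To make the cursor arithmetic concrete, here is a rough toy model in
Python. It is not tmux, ksh or any real terminal's code: the function
name, the two-cell prompt and the rule that every complete character is
one cell wide are all assumptions made for illustration, and it does not
check for overlong or surrogate encodings. It only tracks the cursor
column while consuming the byte stream from the printf above, once
assuming an incomplete sequence advances the cursor by one cell (the
"one substitution glyph" behaviour) and once assuming it is ignored.

```python
def final_column(data: bytes, incomplete_width: int) -> int:
    """Return the cursor column after a toy terminal consumes `data`.

    incomplete_width is how many cells the toy terminal advances for an
    incomplete or invalid UTF-8 sequence (assumption: 1 models "print one
    substitution glyph", 0 models "ignore the whole lot").
    """
    col = 0
    pending = 0  # continuation bytes still expected for the current character
    for b in data:
        if pending and 0x80 <= b <= 0xBF:
            pending -= 1
            if pending == 0:
                col += 1  # complete character, assumed to be one cell wide
            continue
        if pending:
            # The sequence was cut short by a non-continuation byte.
            col += incomplete_width
            pending = 0
        if b == 0x08:
            col = max(0, col - 1)  # BS only moves the cursor left
        elif b < 0x20 or b == 0x7F:
            pass  # other control bytes (including \n) leave the column alone
        elif b < 0x80:
            col += 1  # printable ASCII
        elif 0xC2 <= b <= 0xDF:
            pending = 1
        elif 0xE0 <= b <= 0xEF:
            pending = 2
        elif 0xF0 <= b <= 0xF4:
            pending = 3
        else:
            col += incomplete_width  # stray continuation or invalid lead byte
    if pending:
        col += incomplete_width
    return col


# Hypothetical prompt, so the backspaces are not clamped at column 0.
PROMPT = b"$ "
# The bytes produced by the printf example above.
KSH_BYTES = b"a\x08\xe2a\x08\x08\xe2\x89a\x08\x08\xe2\x89\xa0a\x08\n"

print(final_column(PROMPT + KSH_BYTES, 1))
print(final_column(PROMPT + KSH_BYTES, 0))
```

In this model the same bytes leave the cursor in column 3 when partial
characters take one cell and in column 1 when they are ignored. The
backspaces are identical in both runs, only the assumption about partial
characters changes.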