Re: ksh(1): don't output invalid UTF-8 characters

Ingo Schwarze Mon, 05 Jun 2017 09:07:14 -0700

Hi Walter,

Walter Alejandro Iglesias wrote on Mon, Jun 05, 2017 at 04:50:21PM +0200:


> I'm still not skilled enough to make a proper patch or a clear bug
> report (I'm on chapter 2 of K&R :-)).  I wish with time I'll learn how
> to do it.

IIRC, you said you saw some undesirable behaviour with ksh input.

I assume you have a sequence of key presses on your keyboard that
demonstrates the undesirable behaviour.  To capture the sequence,

 1. Type

    printf '

    (without hitting enter)

 2. Now type the sequence of characters, BUT
     - before any control character, press Ctrl-V
     - before any single quote, press a backslash

 3. Type

    ' > input.txt

    and hit enter.

You can look at the sequence with

  hexdump -C input.txt

There should be at least one byte for each key pressed, sometimes
more.  If some key presses do not show up or the order is mixed up,
you likely forgot the Ctrl-V before it.  In that case, retry.

If you want to save people who look at your report some work,
you can craft a printf(1) statement that generates the same
output as the above, but using only ASCII letters, and send
that instead of the output of hexdump -C input.txt.

Here is a simple example.
Test insertion of a non-ASCII character between two ASCII characters.

What I type on my keyboard:

  x y Ctrl-B e-accent-aigu

What I type to capture the input:

  p r i n t f blank ' x y Ctrl-V Ctrl-B e-accent-aigu ' > i n p u t . t x t

This can be used to create the same input file:

   $ printf 'xy\x02\xc3\xa9' > input2.txt
   $ cmp input.txt input2.txt
   $ echo $?
  0

For testing, go to the regress directory:

   $ cd /usr/src/regress/bin/ksh
   $ cvs up -dP
   $ cd edit
   $ make obj
   $ make cleandir
   $ make regress
   $ ./obj/edit < input.txt | hexdump -C
  00000000  24 20 78 79 08 c3 a9 79  08 0a   |$ xy...y..|
  0000000a

Note that the above output is already an improved (uncommitted)
version containing a private patch.

Here is what the -current ksh does:

   $ PATH=/obin ./obj/edit < input.txt | hexdump -C
  00000000  24 20 78 79 08 c3 79 08  08 c3 a9 79 08 0a   |$ xy..y....y..|
  0000000e

Also describe in words why you are typing that exact sequence,
what you would like to happen, and how the actual events
differ from that.

Should be feasible, right?


> I came to the ksh utf8 discussion because I've been playing
> with some mail mime encoder just to learn C and recognizing
> valid utf-8 was the first challenge I encountered.

In a normal non-threaded application program, you should probably
use mbtowc(3) rather than coding your own UTF-8 parser.
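
A minimal sketch of what that could look like, not taken from any
existing program.  The names is_valid_mb() and use_utf8_locale() are
made up for illustration, and the two locale names are assumptions
about what the system provides.  mbtowc(3) only decodes UTF-8 once
LC_CTYPE has been set to a UTF-8 locale.

```c
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: try to select a UTF-8 LC_CTYPE locale.
 * The locale names are assumptions; adjust them for your system.
 * Returns 1 on success, 0 if neither locale is available.
 */
int
use_utf8_locale(void)
{
	return setlocale(LC_CTYPE, "en_US.UTF-8") != NULL ||
	    setlocale(LC_CTYPE, "C.UTF-8") != NULL;
}

/*
 * Hypothetical helper: return 1 if the n bytes at s are valid
 * multibyte text in the current LC_CTYPE locale, 0 otherwise.
 */
int
is_valid_mb(const char *s, size_t n)
{
	wchar_t wc;
	int len;

	mbtowc(NULL, NULL, 0);		/* reset the conversion state */
	while (n > 0) {
		len = mbtowc(&wc, s, n);
		if (len == -1)
			return 0;	/* invalid or truncated sequence */
		if (len == 0)
			len = 1;	/* NUL byte, step over it */
		s += len;
		n -= (size_t)len;
	}
	return 1;
}
```

After a successful use_utf8_locale(), passing the five bytes of
'xy\x02\xc3\xa9' should be accepted, while a buffer ending in a lone
0xc3 should be rejected because the sequence is incomplete.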

> The code pasted below is what I got so far in recognizing valid utf-8.

That code looks confusing.  I'm not studying it in detail right now
because we already have at least two working UTF-8 decoders in the
tree.

One is /usr/src/lib/libc/citrus/citrus_utf8.c,
function _citrus_utf8_ctype_mbrtowc().  It is relatively long,
but quite easy to understand.

Another one is /usr/src/usr.bin/mandoc/preconv.c,
function preconv_encode().  It is substantially shorter,
but admittedly harder to understand.

This is OpenBSD.  Read the fantastic source to learn something.

> I'm showing it to make my point, I realize it isn't easy; and from my
> poor C I'm not able to figure out how you can do such test byte by byte
> while the user is typing at command line.  (Don't bother in explaining
> me how, I know this is not the place to take C lessons.)

Well, it seems likely that something doing just that will be committed
to ksh/emacs.c during the next week or so - not a full parser, but
something showing the incremental nature.  Then you can inspect it.
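
To give a rough idea of the incremental approach in the meantime,
here is a sketch of a byte-at-a-time validator.  It is not the code
planned for emacs.c, and all names in it (utf8_state, utf8_feed,
UTF8_DONE and so on) are invented for this example.  The byte ranges
follow the well-formed UTF-8 table in the Unicode standard.

```c
/* Hypothetical state; start with { 0, 0x80, 0xbf }. */
struct utf8_state {
	int need;		/* continuation bytes still expected */
	unsigned char lo, hi;	/* allowed range for the next byte */
};

enum utf8_result { UTF8_DONE, UTF8_MORE, UTF8_BAD };

/*
 * Feed one byte.  UTF8_DONE means a character just ended,
 * UTF8_MORE means more bytes are needed, UTF8_BAD means the byte
 * does not fit; the state is then reset and the byte is dropped.
 */
enum utf8_result
utf8_feed(struct utf8_state *st, unsigned char c)
{
	if (st->need == 0) {
		st->lo = 0x80;
		st->hi = 0xbf;
		if (c < 0x80)
			return UTF8_DONE;	/* plain ASCII */
		if (c >= 0xc2 && c <= 0xdf) {
			st->need = 1;
		} else if (c >= 0xe0 && c <= 0xef) {
			st->need = 2;
			if (c == 0xe0)
				st->lo = 0xa0;	/* reject overlong forms */
			if (c == 0xed)
				st->hi = 0x9f;	/* reject surrogates */
		} else if (c >= 0xf0 && c <= 0xf4) {
			st->need = 3;
			if (c == 0xf0)
				st->lo = 0x90;	/* reject overlong forms */
			if (c == 0xf4)
				st->hi = 0x8f;	/* stay below U+110000 */
		} else
			return UTF8_BAD;	/* stray continuation or bad start */
		return UTF8_MORE;
	}
	if (c < st->lo || c > st->hi) {
		st->need = 0;
		return UTF8_BAD;
	}
	st->lo = 0x80;			/* later bytes use the full range */
	st->hi = 0xbf;
	return --st->need == 0 ? UTF8_DONE : UTF8_MORE;
}
```

The point is that the caller only keeps a few bytes of state between
key presses, so it can decide after every single byte whether to
wait, to output a complete character, or to discard garbage.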

If you report actual bugs or propose fixes, then developers are often
willing to help with C issues related to it as well, even though
teaching C is indeed not the point of this list.

> By the way, something in the last paragraph of the new utf8(7) man page
> isn't clear enough (I mentioned this to tedu@).

Which paragraph exactly, and what is unclear?  Maybe we can fix it
quickly.

> #define YES   1
> #define NO    0

Either just use 1 and 0 in the code or use <stdbool.h>.

[...]
>       }

Don't forget to check ferror(3) after falling out of the main loop.
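
Since your loop is elided here, the following is only a sketch of the
pattern with a made-up function copy_bytes(), not a patch to your
program.  getc(3) returns EOF both at end of file and on a read
error, so the two cases have to be told apart afterwards.

```c
#include <stdio.h>

/*
 * Hypothetical example: copy in to out byte by byte.
 * Returns 0 on success, -1 on a read or write error.
 */
int
copy_bytes(FILE *in, FILE *out)
{
	int c;

	while ((c = getc(in)) != EOF)
		if (putc(c, out) == EOF)
			return -1;	/* write error */

	/* EOF alone does not say whether the read failed. */
	if (ferror(in))
		return -1;
	return 0;
}
```

In main(), a caller would typically report the failure with err(3)
or perror(3) and exit non-zero instead of returning 0.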

>       return 0;
> }

Yours,
  Ingo
