"Valeriy E. Ushakov" wrote on 2000-12-21 16:23 UTC:
> Markus Unicode FAQ states that:
> 
> | UTF-8 still allows you to use C1 control characters such as CSI, even
> | though UTF-8 also uses bytes in the range 0x80-0x9F. It is important
> | to understand that a terminal emulator in UTF-8 mode must apply the
> | UTF-8 decoder to the incoming byte stream before interpreting any
> | control characters. C1 characters are UTF-8 decoded just like any
> | other character above U+007F.
> 
> I can see in xterm logs that this used to be controlled by
> utf8controls resource at one time, but now xterm behaves as described
> in the paragraph quoted above.  Will xterm ignore CSI *byte* when it
> arrives not in a context of building a character from UTF-8 byte
> sequence (e.g. 0x20 0x9B (a space followed by a CSI) vs. 0x20 0xC3
> 0x9B (a space followed by U+00DB in UTF-8))?

In UTF-8 mode, xterm should treat a received 8-bit coded CSI as a
malformed UTF-8 sequence, as required by ISO 10646-1 Annex R, and
display a REPLACEMENT CHARACTER (U+FFFD) for it.

It is best to understand xterm as a protocol stack, in which the ISO
6429 control sequence decoder operates on a stream of 16-bit characters
that is handed over to it by the next lower layer, the UTF-8 decoder.
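To make the layering concrete, here is a minimal sketch in C of the
two layers -- invented names, not xterm's actual source: a UTF-8
decoder turns the byte stream into code points (emitting U+FFFD for
malformed input such as a bare 0x9B byte), and only the layer above
it recognizes C1 controls such as CSI (U+009B).

    #include <stdio.h>

    #define REPLACEMENT 0xFFFDUL

    /* Lower layer: decode one UTF-8 sequence (up to 3 bytes here,
     * enough for the BMP).  Stores the code point, or REPLACEMENT for
     * malformed input, and returns the number of bytes consumed. */
    static int utf8_decode(const unsigned char *s, int len,
                           unsigned long *cp)
    {
        if (s[0] < 0x80) {                       /* ASCII */
            *cp = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0 && len >= 2 && (s[1] & 0xC0) == 0x80) {
            *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0 && len >= 3 &&
            (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
            *cp = ((unsigned long)(s[0] & 0x0F) << 12) |
                  ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        }
        *cp = REPLACEMENT;            /* malformed, e.g. a bare C1 byte */
        return 1;
    }

    /* Upper layer: the ISO 6429 interpreter sees code points, not bytes. */
    static void interpret(unsigned long cp)
    {
        if (cp == 0x9B)
            printf("U+009B -> CSI control sequence introducer\n");
        else if (cp == REPLACEMENT)
            printf("U+FFFD -> malformed byte shown as replacement char\n");
        else
            printf("U+%04lX -> ordinary character\n", cp);
    }

    int main(void)
    {
        /* 0x20 0x9B:      bare C1 byte -> U+FFFD
         * 0x20 0xC3 0x9B: well-formed UTF-8 for U+00DB
         * 0x20 0xC2 0x9B: well-formed UTF-8 for the C1 control CSI */
        const unsigned char input[] =
            { 0x20, 0x9B, 0x20, 0xC3, 0x9B, 0x20, 0xC2, 0x9B };
        int i = 0, n = sizeof input;

        while (i < n) {
            unsigned long cp;
            i += utf8_decode(input + i, n - i, &cp);
            interpret(cp);
        }
        return 0;
    }

Fed the byte sequences from your example, this sketch reports U+FFFD
for the bare 0x9B, an ordinary character for 0xC3 0x9B (U+00DB), and
CSI only for the well-formed pair 0xC2 0x9B.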

> Will xterm encode 8-bit transmitted codes in UTF-8?

Received keysyms will be converted to Unicode and then UTF-8 encoded
before being sent to the pty and the application. Received
UTF8_STRING selections will be forwarded to the pty unmodified.
Received STRING selections are by convention ISO 8859-1 encoded and
will therefore be recoded to UTF-8 before being sent to the pty. I
don't remember what other ways xterm has to send out data, but if
there are any, UTF-8 encoding should take place there as well.
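For the STRING case the recoding is trivial, because ISO 8859-1 bytes
map one-to-one onto the code points U+0000..U+00FF. A hypothetical
sketch (not xterm's code):

    #include <stdio.h>
    #include <stddef.h>

    /* Recode an ISO 8859-1 buffer to UTF-8 before it goes to the pty.
     * 'out' must provide room for 2 * n bytes; returns the number of
     * bytes written. */
    static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                                 unsigned char *out)
    {
        size_t o = 0;

        for (size_t i = 0; i < n; i++) {
            if (in[i] < 0x80) {
                out[o++] = in[i];                 /* ASCII passes through */
            } else {
                out[o++] = 0xC0 | (in[i] >> 6);   /* lead byte, 0xC2 or 0xC3 */
                out[o++] = 0x80 | (in[i] & 0x3F); /* continuation byte */
            }
        }
        return o;
    }

    int main(void)
    {
        /* "cafe" with e-acute, as an ISO 8859-1 STRING selection */
        const unsigned char sel[] = { 'c', 'a', 'f', 0xE9 };
        unsigned char buf[2 * sizeof sel];
        size_t len = latin1_to_utf8(sel, sizeof sel, buf);

        for (size_t i = 0; i < len; i++)
            printf("%02X ", buf[i]);              /* 63 61 66 C3 A9 */
        putchar('\n');
        return 0;
    }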

> Approach described in FAQ, as far as I understand, simply encodes
> *all* the communication between the terminal and the host in UTF-8.
> Thus, conceptually, this approach just widens the code unit used in
> communication between the host and the terminal and employs UTF-8 as a
> "wire" protocol to encode code units as bytes.

That's correct. We insert the UTF-8 protocol into the protocol stack
below the ISO 6429 layer to extend the character set that can appear
on the line to UCS.

> OTOH, no lead byte of UTF-8 is in C1, so in principle a different
> model is possible.  Host and terminal still talk bytes (i.e. CSI
> *byte* is interpreted as CSI and 8-bit transmitted codes are sent as
> bytes), but an input byte in the 0xA0..0xFF range triggers a UTF-8
> decoding that consumes one UTF-8 encoded character from the byte
> stream.

The disadvantage of this approach is this:

    Simple substring searches for C1 control characters would become
    impossible, because C1 bytes could now appear inside the encodings
    of other UTF-8 characters. This violates the very useful UTF-8
    property that no UTF-8 character can appear as a substring inside
    another UTF-8 character, which is why your alternative model would
    be rather undesirable in my opinion.
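A tiny demonstration of that search problem -- just a sketch with
made-up data: under your alternative model, a byte-level scan for a
single-byte CSI (0x9B) reports a hit inside perfectly harmless UTF-8
text containing U+00DB (0xC3 0x9B), while a scan for the standard
two-byte UTF-8 encoding of CSI (0xC2 0x9B) cannot be fooled that way:

    #include <stdio.h>
    #include <string.h>

    /* Report whether 'needle' occurs anywhere in 'hay'. */
    static int contains(const unsigned char *hay, size_t hn,
                        const unsigned char *needle, size_t nn)
    {
        for (size_t i = 0; i + nn <= hn; i++)
            if (memcmp(hay + i, needle, nn) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        /* "A", U+00DB, "B" in UTF-8 -- there is no CSI in this text. */
        const unsigned char text[] = { 'A', 0xC3, 0x9B, 'B' };
        const unsigned char csi_raw[]  = { 0x9B };        /* alt. model  */
        const unsigned char csi_utf8[] = { 0xC2, 0x9B };  /* UTF-8 CSI   */

        printf("search for raw 0x9B:  %s\n",
               contains(text, sizeof text, csi_raw, sizeof csi_raw)
                   ? "match (false positive)" : "no match");
        printf("search for 0xC2 0x9B: %s\n",
               contains(text, sizeof text, csi_utf8, sizeof csi_utf8)
                   ? "match" : "no match");
        return 0;
    }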

In addition, ISO 10646-1 clearly says that the UCS C1 characters
U+0080..U+009F are to be encoded as two bytes by UTF-8. So your
alternative model wouldn't be UTF-8 any more.
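For illustration only, a minimal encoder for the two-byte range (a
sketch, not normative text from the standard) shows what that means
on the wire: every C1 code point, CSI included, leaves the terminal
as a 0xC2 lead byte followed by a continuation byte:

    #include <stdio.h>

    /* Encode a code point below U+0800 as UTF-8; returns the length. */
    static int utf8_encode(unsigned long cp, unsigned char out[2])
    {
        if (cp < 0x80) {
            out[0] = (unsigned char)cp;
            return 1;
        }
        out[0] = 0xC0 | (unsigned char)(cp >> 6);
        out[1] = 0x80 | (unsigned char)(cp & 0x3F);
        return 2;
    }

    int main(void)
    {
        unsigned char buf[2];

        for (unsigned long cp = 0x80; cp <= 0x9F; cp++) {
            utf8_encode(cp, buf);      /* always two bytes in this range */
            printf("U+%04lX -> %02X %02X%s\n", cp, buf[0], buf[1],
                   cp == 0x9B ? "   (CSI)" : "");
        }
        return 0;
    }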

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
