Frank da Cruz wrote:
>
> Doug Ewell wrote:
> >
> > That last paragraph echoes what Frank said about "reversing the layers,"
> > performing the UTF-8 conversion first and then looking for escape
> > sequences. True UTF-8 support, in terminal emulators and in other
> > software as well, really should depend on UTF-8 conversion being
> > performed first.
>
> The irony is, when using ISO 2022 character-set designation and invocation,
> you have to handle the escape sequences first to know if you're in UTF-8.
> Therefore, this pushes the burden onto the end-user to preconfigure their
> emulator for UTF-8 if that is what is being used, when ideally this should
> happen automatically and transparently.

I may be misunderstanding the above, but ISO 2022 says:

    ESC 2/5 F shall mean that the other coding system uses
    ESC 2/5 4/0 to return;

    ESC 2/5 2/15 F shall mean that the other coding system
    does not use ESC 2/5 4/0 to return (it may have an alternative
    means to return or none at all).

Registration number 196 is for UTF-8 without implementation level, and
its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed
that way so that a decoder that does not know UTF-8 (or any other coding
system invoked by ESC 2/5 F) could simply "skip" the octets in that
encoding until it gets to the octets ESC 2/5 4/0.

This means that the decoder does not need to decode UTF-8 just to find
the escape sequence ESC 2/5 4/0. UTF-8 does not do anything special with
characters below U+0080 anyway (they're just single-byte ASCII), and
every octet of a multi-byte UTF-8 sequence is 128 or above, so the
return sequence cannot turn up inside an encoded character. So it
works, no?

Of course, if you wanted to include any C1 controls inside the UTF-8
segment, they would have to be encoded in UTF-8, but ESC 2/5 4/0 is
entirely in the ASCII range (less than 128), so those octets are encoded
as is.

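A rough sketch of that skipping logic in Python, for illustration only.
The names DOCS_UTF8, DOCS_RETURN and split_segments are made up here and
come from no standard or library. The sketch assumes the only coding
system switch in the stream is ESC 2/5 4/7, and it ignores every other
ISO 2022 escape sequence:

```python
ESC = 0x1B
# Hypothetical names for the two sequences discussed above.
DOCS_UTF8 = bytes([ESC, 0x25, 0x47])    # ESC 2/5 4/7: switch to UTF-8 (registration 196)
DOCS_RETURN = bytes([ESC, 0x25, 0x40])  # ESC 2/5 4/0: return to ISO 2022


def split_segments(data: bytes):
    """Yield (kind, octets) pairs, where kind is "iso2022" or "utf8".

    The UTF-8 octets are never decoded; the scan only looks for the
    three-octet return sequence, which lies entirely below 128.
    """
    pos = 0
    in_utf8 = False
    while pos < len(data):
        # Look for whichever sequence would end the current segment.
        marker = DOCS_RETURN if in_utf8 else DOCS_UTF8
        end = data.find(marker, pos)
        if end == -1:
            end = len(data)  # no further switch: segment runs to the end
        if end > pos:
            yield ("utf8" if in_utf8 else "iso2022", data[pos:end])
        if end < len(data):
            in_utf8 = not in_utf8
            pos = end + len(marker)
        else:
            pos = end


stream = b"abc" + DOCS_UTF8 + "é€".encode("utf-8") + DOCS_RETURN + b"xyz"
segments = list(split_segments(stream))
```

A decoder that does not know UTF-8 would throw away the "utf8" segments
and carry on with the rest. One that does know UTF-8 would decode them
after the split, which is the "escape sequences first" order.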
Erik