Hi,

If two programs communicating encoded character strings to each other
disagree about the encoding, that can result in problems.

One particular example of such communication is an application program
passing output text to a terminal emulator program.  If the terminal
uses a different encoding for decoding the text than the application
used for encoding it, the terminal may see control codes where the
application only intended printable characters.  This can screw up the
terminal state, spoiling display of subsequent text or even hanging
the terminal.

Actually, I assume that this problem occurs frequently in practice,
for the following reasons.  If the application program is well-behaved,
it either produces C/POSIX/US-ASCII output only, or its idea of the
encoding to use is governed by the LC_CTYPE locale(1) environment
variable, typically passed to it by the shell it was started from.
Now that locale(1) environment is completely unrelated to whatever
encoding the terminal may be set up for.  The terminal may not even
be on the same physical machine.  For example, during an SSH session, your
terminal is on the local SSH client machine, while the shell starting
your application programs is on the remote SSH server machine.
To fully appreciate the implications, try out the following scenario:
Start an xterm(1) that is not UTF-8 enabled on your local machine
by saying "xterm +lc +u8".  Unset LC_ALL, LC_CTYPE, and LANG; check
with locale(1) that your locale is "C".  Use ssh(1) to connect to
a remote machine.  Now simulate a program producing UTF-8 output
on the remote machine, for example U+00DF LATIN SMALL LETTER SHARP S:
  printf "\303\237\n"   # thanks to sobrado@ for the striking example
Now your local terminal hangs until you force a reset using the
menus of the xterm program.  If the shell startup files on the
remote machine set LC_CTYPE=en_US.UTF-8 or something similar by
default, programs on the remote machine will always do just that.
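To understand the mechanism, it helps to look at the raw bytes
without sending them to a live terminal; a sketch using od(1):

```shell
# Inspect the two bytes of U+00DF in UTF-8, piped through od(1)
# rather than sent to a live non-UTF-8 terminal:
printf '\303\237\n' | od -An -tx1
# shows the bytes c3 9f 0a
```

The second byte, 0x9f, happens to be the C1 control code APC
(Application Program Command); a terminal honouring C1 controls
discards all further input until a String Terminator arrives,
which is presumably why the terminal appears to hang.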

That shows how easy it is to inadvertently cause application-terminal
character encoding mismatches; yet I doubt that many people are aware
of the problem.  So we should try to reduce the likelihood that people
get burnt by such effects.

On an operating system supporting any third locale in addition to
C/POSIX and UTF-8, people are screwed beyond rescue because even
if one side of the connection assumes US-ASCII, communication is
still unsafe in both directions.  Reinterpreting US-ASCII in an
arbitrary encoding and reinterpreting an arbitrary encoding as
US-ASCII may both turn innocuous printable characters into dangerous
terminal control codes.  That is particularly bitter because some
programs will always output US-ASCII, which is not safe to display
in a terminal set up for an arbitrary locale.
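As a sketch of the first direction, consider ISO-2022-JP (my example,
not one specific to OpenBSD): perfectly ordinary printable text in
that encoding contains raw ESC bytes, which a terminal reading it as
US-ASCII interprets as the start of escape sequences.  Assuming an
iconv(1) that knows ISO-2022-JP:

```shell
# Encode U+65E5 U+672C ("Japan") from UTF-8 into ISO-2022-JP and
# dump the bytes: the charset-switching sequences embed raw ESC
# (0x1b) bytes in otherwise printable text.
printf '\346\227\245\346\234\254' | iconv -f UTF-8 -t ISO-2022-JP |
    od -An -tx1
```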

Fortunately, in OpenBSD, we decided to support exactly two locales,
C/POSIX and UTF-8, and this combination has the following
properties:

 1. Printing unsanitized strings to the terminal is never safe,
    no matter the locale and terminal setup (think of "cat /bsd").
 2. Printing sanitized US-ASCII to a US-ASCII terminal is safe.
 3. Printing sanitized UTF-8 to a UTF-8 terminal is safe.
 4. Printing sanitized US-ASCII to a UTF-8 terminal is safe.
    That is important because there are some programs that we may
    never want to add UTF-8 support to.

However:

 5. Printing sanitized UTF-8 to a US-ASCII terminal is *NOT* safe.
    Remember the example above that hung a US-ASCII terminal by
    printing U+00DF LATIN SMALL LETTER SHARP S in UTF-8 to it.
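As an aside, one minimal way to sanitize before printing is to map
non-printing bytes to printable notation, for example with cat -v
(OpenBSD also provides vis(1) for this purpose); a sketch with a
hypothetical hostile string:

```shell
# Map control bytes to printable notation before they reach the
# terminal; here an embedded ESC and BEL are made visible instead
# of being interpreted:
printf 'name\033]0;pwned\a\n' | cat -v
# ESC appears as ^[ and BEL as ^G
```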

By default, our xterm(1) runs in US-ASCII mode.  In view of the
above, that's a terrible idea, even if the user doesn't intend to
ever use UTF-8.  A UTF-8 terminal handles the US-ASCII the user
wants just fine and, in addition, mostly for free, it is more
resilient against stray UTF-8 sneaking in.

Actually, even when fed garbage or unsupported encodings, a UTF-8
xterm(1) is more robust than a US-ASCII xterm(1) because the UTF-8
xterm(1) honours *fewer* terminal escape codes than the US-ASCII
xterm(1).  That may seem surprising at first because Unicode defines
*more* control characters than US-ASCII does.  But as explained on

  http://invisible-island.net/xterm/ctlseqs/ctlseqs.html

xterm(1) never treats decoded multibyte characters as terminal
control codes, so the ISO 6429 C1 control codes do not take effect
in UTF-8 mode; but they do take effect in US-ASCII mode, even though
they fall outside the scope of ASCII.
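The difference is easy to see at the byte level; a sketch, again
using od(1) rather than a live terminal:

```shell
# The C1 control CSI as a raw byte: in an 8-bit, non-UTF-8 xterm,
# this single byte 0x9b starts a control sequence.
printf '\233' | od -An -tx1
# The same control character arriving as UTF-8: U+009B is encoded
# as the two bytes 0xc2 0x9b, and since xterm never treats decoded
# multibyte characters as controls, it has no effect in UTF-8 mode.
printf '\302\233' | od -An -tx1
```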

Consequently, in the interest of safe and sane defaults, I propose
switching our xterm(1) to enable UTF-8 mode by default.  If somebody
insists on running an xterm(1) in US-ASCII mode, there are still
many ways to force that, for example with "+lc +u8".


It is rather tricky to get the switch right because the locale+encoding
user interface of xterm(1) is ridiculously complicated.  It uses
three X resources (*locale, *utf8, *wideChar) with 5+4+2 possible
values (*locale: true, medium, checkfont, false, or an encoding name;
*utf8: false, true, always, default; *wideChar: true, false) and seven
command line options (-lc +lc -en -u8 +u8 -wc +wc).  Just for
comparison: mandoc(1) uses one command line option with three
possible values (-T locale, utf8, ascii).
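For illustration, here is how the modes might be pinned down
explicitly in ~/.Xresources, using the resources and values just
listed (a sketch, not a recommendation):

```
! Force UTF-8 mode unconditionally (like the -u8 command line option):
XTerm*utf8: always
! Or force US-ASCII mode (like +u8):
XTerm*utf8: false
```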

The best place to make the switch is the setup function
VTInitialize_locale() that decides whether to enable UTF-8 mode
and which supporting flags to set: pretend to that function that
CODESET is always UTF-8, but without interfering with the actual
value of CODESET and without changing the utility function
xtermEnvUTF8().  That way, we get a completely
consistent setup of the terminal, but the terminal can still use
xtermEnvUTF8() for things like deciding whether or not system
wcwidth(3) is usable for measuring UTF-8 display widths, and the
terminal passes an unmangled environment to child processes, in
particular the shell.  All 10 resources and command line options
still work as expected.

The effect of the change is to run in UTF-8 mode whenever the
terminal would otherwise run in US-ASCII mode, except when the user
explicitly requests the opposite by using +u8, *utf8:false,
-en US-ASCII, or *locale:US-ASCII.

The main goal is better robustness.  But it also improves usability.
If you usually run xterm(1) in C/POSIX mode, there should be few
visible changes for you.  But if you stumble upon a directory
containing UTF-8 filenames, you can simply say
  $ LC_CTYPE=en_US.UTF-8 ls
which would have given you garbage output in the past, and which
just works now with the patch.

Feedback and testing are welcome.

Yours,
  Ingo


Index: charproc.c
===================================================================
RCS file: /cvs/xenocara/app/xterm/charproc.c,v
retrieving revision 1.36
diff -u -p -r1.36 charproc.c
--- charproc.c  13 Jan 2016 20:40:08 -0000      1.36
+++ charproc.c  7 Mar 2016 01:15:45 -0000
@@ -7306,7 +7306,13 @@ static void
 VTInitialize_locale(XtermWidget xw)
 {
     TScreen *screen = TScreenOf(xw);
-    Bool is_utf8 = xtermEnvUTF8();
+
+    /*
+     * OpenBSD only supports two locales: C/POSIX and UTF-8.
+     * Using UTF-8 mode for the C/POSIX locale actually is the
+     * safer choice, so make it the default.
+     */
+    const Bool is_utf8 = True;
 
     TRACE(("VTInitialize_locale\n"));
     TRACE(("... request screen.utf8_mode = %d\n", screen->utf8_mode));
