Hi,

if two programs communicating encoded character strings to each
other disagree about the encoding, that can result in problems.
One particular example of such communication is an application program
passing output text to a terminal emulator program. If the terminal
uses a different encoding for decoding the text than the application
used for encoding it, the terminal may see control codes where the
application only intended printable characters. This can screw up
the terminal state, spoiling display of subsequent text or even
hanging the terminal.

Actually, i assume that this problem occurs frequently in practice,
for the following reasons.

If the application program is well-behaved, it either produces
C/POSIX/US-ASCII output only, or its idea of the encoding to use
is governed by the LC_CTYPE locale(1) environment variable, typically
passed to it by the shell it was started from.

Now that locale(1) environment is completely unrelated to whatever
encoding the terminal may be set up for. It may not even be on the
same physical machine. For example, during an SSH session, your
terminal is on the local SSH client machine, while the shell starting
your application programs is on the remote SSH server machine.

To fully appreciate the implications, try out the following scenario:

Start an xterm(1) that is not UTF-8 enabled on your local machine
by saying "xterm +lc +u8". Unset LC_ALL, LC_CTYPE, and LANG; check
with locale(1) that your locale is "C". Use ssh(1) to connect to
a remote machine. Now simulate a program producing UTF-8 output on
the remote machine, for example U+00DF LATIN SMALL LETTER SHARP S:

  printf "\303\237\n"
  # thanks to sobrado@ for the striking example

Now your local terminal hangs until you force a reset using the menus
of the xterm program.

If the shell startup files on the remote machine set
LC_CTYPE=en_US.UTF-8 or something similar by default, programs on
the remote machine will always do just that.

That shows how easy it is to inadvertently cause application-terminal
character encoding mismatches; yet i doubt that many people are aware
of the problem.
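
To see why that single character is enough, look at its bytes: UTF-8
encodes U+00DF as 0xC3 0x9F, and a terminal in 8-bit mode reads 0x9F
as the ISO 6429 C1 control code APC (Application Program Command),
which opens a control string that only ends at a string terminator.
The following C fragment is a minimal sketch of that reading, not code
taken from xterm; the function name first_c1_byte() is made up for
illustration. It reports the first byte that falls into the C1 range
0x80 to 0x9F.

```c
#include <stddef.h>

/*
 * Return the index of the first byte in buf that a terminal in
 * 8-bit mode would take for an ISO 6429 C1 control code
 * (0x80 to 0x9f), or -1 if there is no such byte.
 * Hypothetical helper for illustration, not part of xterm.
 */
int
first_c1_byte(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] >= 0x80 && buf[i] <= 0x9f)
			return (int)i;
	return -1;
}
```

Called on the three bytes written by the printf(1) command above, it
returns 1, the position of 0x9F; called on pure US-ASCII text, it
returns -1.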
So we should try to reduce the likelihood that people get burnt by
such effects.

On an operating system supporting any third locale in addition to
C/POSIX and UTF-8, people are screwed beyond rescue because even if
one side of the connection assumes US-ASCII, communication is still
unsafe in both directions. Reinterpreting US-ASCII in an arbitrary
encoding and reinterpreting an arbitrary encoding as US-ASCII may
both turn innocuous printable characters into dangerous terminal
control codes. That is particularly bitter because some programs
will always output US-ASCII, which is not safe to display in a
terminal set up for an arbitrary locale.

Fortunately, in OpenBSD, we made the decision to only support exactly
two locales, C/POSIX and UTF-8, and this combination has the following
properties:

 1. Printing unsanitized strings to the terminal is never safe,
    no matter the locale and terminal setup (think of "cat /bsd").
 2. Printing sanitized US-ASCII to a US-ASCII terminal is safe.
 3. Printing sanitized UTF-8 to a UTF-8 terminal is safe.
 4. Printing sanitized US-ASCII to a UTF-8 terminal is safe.
    That is important because there are some programs that
    we may never want to add UTF-8 support to.

However:

 5. Printing sanitized UTF-8 to a US-ASCII terminal is *NOT* safe.
    Remember the example above that hung a US-ASCII terminal
    by printing U+00DF LATIN SMALL LETTER SHARP S in UTF-8 to it.

By default, our xterm(1) runs in US-ASCII mode. In view of the above,
that's a terrible idea, even if the user doesn't intend to ever use
UTF-8. A UTF-8 terminal handles the US-ASCII the user wants just
fine, and in addition to that, and mostly for free, it is more
resilient against stray UTF-8 sneaking in.

Actually, even when fed garbage or unsupported encodings, a UTF-8
xterm(1) is more robust than a US-ASCII xterm(1) because the UTF-8
xterm(1) honours *fewer* terminal escape codes than the US-ASCII
xterm(1).
That may seem surprising at first because Unicode defines *more*
control characters than US-ASCII does. But as explained on

  http://invisible-island.net/xterm/ctlseqs/ctlseqs.html

xterm(1) never treats decoded multibyte characters as terminal
control codes, so the ISO 6429 C1 control codes do not take effect
in UTF-8 mode; but they do take effect in US-ASCII mode, even though
they fall outside the scope of ASCII.

Consequently, in the interest of safe and sane defaults, i propose
switching our xterm(1) to enable UTF-8 mode by default. If somebody
insists on running an xterm(1) in US-ASCII mode, there are still
many ways to force that, for example with "+lc +u8".

It is rather tricky to get the switch right because the
locale+encoding user interface of xterm(1) is ridiculously
complicated. It uses three X resources (*locale, *utf8, *wideChar)
with 5+4+2 possible values

  *locale:   true, medium, checkfont, false, or an encoding name
  *utf8:     false, true, always, default
  *wideChar: true, false

and seven command line options (-lc +lc -en -u8 +u8 -wc +wc).

Just for comparison: mandoc(1) uses one command line option with
three possible values (-T locale, utf8, ascii).

The best place to switch is in the setup function
VTInitialize_locale() that decides whether to enable UTF-8 mode and
which supporting flags to set, by pretending to it that CODESET is
always UTF-8, but without interfering with the actual value of the
CODESET and without changing the utility function xtermEnvUTF8().
That way, we get a completely consistent setup of the terminal, but
the terminal can still use xtermEnvUTF8() for things like deciding
whether or not system wcwidth(3) is usable for measuring UTF-8
display widths, and the terminal passes an unmangled environment to
child processes, in particular the shell. All 10 resources and
command line options still work as expected.
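
For readers who have not looked at the xterm source: the CODESET
referred to here is presumably the value that nl_langinfo(3) reports
for the current LC_CTYPE locale. A check of that kind can be sketched
in C as follows. This is an illustration under that assumption only,
not the code of xtermEnvUTF8(); the helper name locale_is_utf8() is
hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Return 1 if the character encoding of the named LC_CTYPE locale
 * is UTF-8, 0 otherwise, including when setlocale(3) rejects the
 * name.  Pass "" to select the locale from the environment.
 * Hypothetical helper for illustration, not part of xterm.
 */
int
locale_is_utf8(const char *name)
{
	if (setlocale(LC_CTYPE, name) == NULL)
		return 0;
	return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

With the name "C" it returns 0; with "" the answer follows LC_ALL,
LC_CTYPE, and LANG from the environment. The patch does not touch any
such check. It only stops VTInitialize_locale() from consulting the
result.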
The effect of the change is to run in UTF-8 mode whenever the terminal
would otherwise run in US-ASCII mode, except when the user explicitly
requests the opposite by using +u8, *utf8:false, -en US-ASCII, or
*locale:US-ASCII.

The main goal is better robustness. But it also improves usability.
If you usually run xterm(1) in C/POSIX mode, there should be few
visible changes for you. But if you stumble upon a directory
containing UTF-8 filenames, you can simply say

  $ LC_CTYPE=en_US.UTF-8 ls

which would have given you garbage output in the past, and which
just works now with the patch.

Feedback and testing are welcome.

Yours,
  Ingo


Index: charproc.c
===================================================================
RCS file: /cvs/xenocara/app/xterm/charproc.c,v
retrieving revision 1.36
diff -u -p -r1.36 charproc.c
--- charproc.c 13 Jan 2016 20:40:08 -0000 1.36
+++ charproc.c 7 Mar 2016 01:15:45 -0000
@@ -7306,7 +7306,13 @@ static void
 VTInitialize_locale(XtermWidget xw)
 {
     TScreen *screen = TScreenOf(xw);
-    Bool is_utf8 = xtermEnvUTF8();
+
+    /*
+     * OpenBSD only supports two locales: C/POSIX and UTF-8.
+     * Using UTF-8 mode for the C/POSIX locale actually is the
+     * safer choice, so make it the default.
+     */
+    const Bool is_utf8 = True;
 
     TRACE(("VTInitialize_locale\n"));
     TRACE(("... request screen.utf8_mode = %d\n", screen->utf8_mode));