retitle 33044 Guile misbehaves in the "ja_JP.sjis" locale
thanks

Hi Tom,

Thanks for the report, analysis and patch.  I agree with your analysis,
and the patch looks good.

However, there's also a much deeper problem here.  You found and fixed
one occurrence of Guile assuming that the locale encoding is
ASCII-compatible.  In fact, this assumption is widespread in Guile, and
I would guess that it's widespread throughout the POSIX world.

I admit that before I saw your message, I believed that it was
legitimate to assume that the locale encoding was ASCII-compatible.  Now
I'm unsure, although I'll note that according to the 'localedef' utility
from GNU libc, this locale is "not ISO C compliant".  It printed the
following message when I asked it to generate the "ja_JP.sjis" locale:

  [warning] character map `SHIFT_JIS' is not ASCII compatible, locale not ISO C compliant [--no-warnings=ascii]

Shift_JIS is _mostly_ ASCII-compatible, except that code points 0x5C and
0x7E, which represent backslash (\) and tilde (~) in ASCII, are mapped
to the Yen sign (¥) and overline (‾) in Shift_JIS.  Backslash (\) and
tilde (~) are multibyte characters in Shift_JIS.

One common problem is that Guile often uses 'scm_from_locale_string' to
create Scheme strings from ASCII-only C string literals.  These should
all be changed to use either 'scm_from_latin1_string' or
'scm_from_utf8_string'.  I prefer the latter because modern C compilers
typically use UTF-8 as the default execution character set, i.e. the
character set used to encode string and character constants, regardless
of the locale settings.  GCC uses UTF-8 by default unless
-fexec-charset=CHARSET is given at compile time.  I'd prefer to promote
writing code that works for arbitrary string literals, so that code
needn't be adjusted if non-ASCII characters are later added.

A related set of problems is that Guile often applies
'scm_from_locale_string' to char* arguments passed in from the user, or
produced by third-party libraries.  These issues are more difficult to
address.

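To make the string-literal problem concrete, here is a small
self-contained C sketch.  It is not Guile code: the helper names are
made up for illustration, and it only models the two single-byte
differences between ASCII and Shift_JIS described above (0x5C and
0x7E), for literals whose bytes are all below 0x80.  It counts how many
bytes of an ASCII-only literal would turn into a different character if
the literal were decoded as Shift_JIS rather than as ASCII/UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one byte below 0x80 as ASCII (and hence
   as UTF-8).  The code point equals the byte value.  */
static uint32_t
decode_ascii_byte (unsigned char b)
{
  return b;
}

/* Hypothetical helper: decode one byte below 0x80 as Shift_JIS, using
   only the two differences from ASCII mentioned in this message.  */
static uint32_t
decode_sjis_single_byte (unsigned char b)
{
  switch (b)
    {
    case 0x5C: return 0x00A5;   /* YEN SIGN instead of backslash */
    case 0x7E: return 0x203E;   /* OVERLINE instead of tilde */
    default:   return b;
    }
}

/* Count the bytes of an ASCII-only literal that would decode to a
   different code point under the Shift_JIS interpretation.  */
static size_t
count_changed (const char *literal)
{
  size_t changed = 0;
  for (; *literal != '\0'; literal++)
    {
      unsigned char b = (unsigned char) *literal;
      if (decode_ascii_byte (b) != decode_sjis_single_byte (b))
        changed++;
    }
  return changed;
}
```

A literal such as "(+ 1 2)" is unaffected, while any literal
containing a backslash or a tilde decodes to different text, which is
why the choice between the locale decoder and the UTF-8 decoder stops
being harmless in this locale.
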
We provide several C APIs that accept C strings without specifying what
encoding is expected.  If the string ultimately derives from a C string
constant, we probably want UTF-8, whereas if the string came from I/O,
or program arguments, then we probably want the locale encoding.

For example, consider 'scm_c_eval_string'.  This has been a public API
function since 2002, but we did not specify the encoding of its C string
argument until 2011.  We chose the locale encoding in this case, which I
think is reasonable, but I also expect that code exists in the wild that
passes a C string literal to 'scm_c_eval_string'.

Until now, problems like this have been mostly harmless, since the C
string literals are typically ASCII-only.  However, if we wish to
support non-ASCII-compatible encodings such as Shift_JIS, we can no
longer consider these problems harmless.  For example, programs which
pass C string literals to 'scm_c_eval_string' will fail when using the
"ja_JP.sjis" locale, if any tildes or backslashes are present.
Backslashes are fairly common in Scheme code.

There's various other code scattered in Guile that assumes ASCII
characters can be searched for, and sometimes replaced with other ASCII
characters.  For example, several functions in load.c, including
'search_path' and 'load_thunk_from_path', scan through file names in the
locale encoding, examining the bytes for particular ASCII codes such as
'.', '/', and '\'.

On MinGW, 'scm_i_mirror_backslashes' in load.c converts backslashes into
forward slashes byte-wise, assuming ASCII-compatibility, and this
transformation is applied to file names in several places.

While looking into this, I also discovered that Guile's S-expression
reader, i.e. the 'read' procedure, assumes an ASCII-compatible port
encoding, despite the fact that it is meant to support arbitrary
encodings such as UTF-16 and UTF-32.  I just filed a related bug
<https://bug.gnu.org/33057> to track this problem.

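The hazard of byte-wise scanning can be sketched in plain C.  This is
only an illustration under stated assumptions, not the code in load.c:
the function names are hypothetical, and the lead-byte ranges
(0x81-0x9F and 0xE0-0xFC) are those of standard two-byte Shift_JIS.  The
first function replaces every 0x5C byte, as code that assumes
ASCII-compatibility would; the second skips the trail byte of each
two-byte character first.

```c
/* Hypothetical helper: true if BYTE starts a two-byte Shift_JIS
   character (lead bytes are 0x81-0x9F and 0xE0-0xFC).  */
static int
sjis_lead_byte_p (unsigned char byte)
{
  return (byte >= 0x81 && byte <= 0x9F) || (byte >= 0xE0 && byte <= 0xFC);
}

/* Byte-wise replacement of 0x5C with '/', assuming that every 0x5C
   byte is a complete character.  */
static void
mirror_bytewise (unsigned char *s)
{
  for (; *s != 0; s++)
    if (*s == 0x5C)
      *s = '/';
}

/* Encoding-aware variant: a 0x5C byte that is the trail byte of a
   two-byte character is left untouched.  */
static void
mirror_sjis_aware (unsigned char *s)
{
  while (*s != 0)
    {
      if (sjis_lead_byte_p (*s) && s[1] != 0)
        s += 2;                 /* skip lead byte and trail byte */
      else
        {
          if (*s == 0x5C)
            *s = '/';
          s++;
        }
    }
}
```

Given the bytes 61 5C 95 5C 62, where 95 5C is a single two-byte
character, the byte-wise version rewrites both 0x5C bytes and so
corrupts that character, whereas the encoding-aware version rewrites
only the first one.
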
These are some of the problems that I'm currently aware of.  I expect
that this bug report will remain open for a while.

To begin, I've started working on a patch to change many occurrences of
'scm_from_locale_string' to 'scm_from_utf8_string', in cases where the C
string clearly originates from a C string literal.

Thanks again for the detailed bug report and analysis.

     Regards,
       Mark