On Wed, 21 Oct 2015, D. Hugh Redelmeier wrote:

> The LC_CTYPE environment variable specifies character
> classification. GCC uses it to determine the character
> boundaries in a string; this is needed for some multibyte
> encodings that contain quote and escape characters that are
> otherwise interpreted as a string end or escape.

That's inaccurate.  The default source encoding is always UTF-8.  See the
comment in libcpp/charset.c.

  /* We disable this because the default codeset is 7-bit ASCII on
     most platforms, and this causes conversion failures on every
     file in GCC that happens to have one of the upper 128 characters
     in it -- most likely, as part of the name of a contributor.
     We should definitely recognize in-band markers of file encoding,
     like:
     - the appropriate Unicode byte-order mark (FE FF) to recognize
       UTF16 and UCS4 (in both big-endian and little-endian flavors)
       and UTF8
     - a "#i", "#d", "/ *", "//", " #p" or "#p" (for #pragma) to
       distinguish ASCII and EBCDIC.
     - now we can parse something like "#pragma GCC encoding <xyz>
       on the first line, or even Emacs/VIM's mode line tags (there's
       a problem here in that VIM uses the last line, and Emacs has
       its more elaborate "local variables" convention).
     - investigate whether Java has another common convention, which
       would be friendly to support.
     (Zack Weinberg and Paolo Bonzini, May 20th 2004)  */

I haven't checked whether the documentation (and the matching
documentation for -finput-charset) was once accurate in this regard (i.e.
if the documentation in question dates from a time when LC_CTYPE did
determine the source character set).

-- 
Joseph S. Myers
jos...@codesourcery.com
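
As a rough illustration of the first item in that comment's wish list,
byte-order-mark detection could look something like the sketch below.
It is not taken from libcpp: sniff_bom and the bom_kind enum are
hypothetical names made up for this example, and the comment only lists
the idea as something that should be recognized.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; nothing here comes from libcpp. */
enum bom_kind {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
};

/* Encodings of U+FEFF in each Unicode encoding form. */
static const unsigned char bom_utf32_be[] = { 0x00, 0x00, 0xFE, 0xFF };
static const unsigned char bom_utf32_le[] = { 0xFF, 0xFE, 0x00, 0x00 };
static const unsigned char bom_utf8[]     = { 0xEF, 0xBB, 0xBF };
static const unsigned char bom_utf16_be[] = { 0xFE, 0xFF };
static const unsigned char bom_utf16_le[] = { 0xFF, 0xFE };

/* Classify the first bytes of a buffer by their byte-order mark, if any. */
static enum bom_kind sniff_bom(const unsigned char *buf, size_t len)
{
    /* Check the 4-byte marks first: the UTF-32 LE mark (FF FE 00 00)
       begins with the UTF-16 LE mark (FF FE). */
    if (len >= 4 && memcmp(buf, bom_utf32_be, 4) == 0) return BOM_UTF32_BE;
    if (len >= 4 && memcmp(buf, bom_utf32_le, 4) == 0) return BOM_UTF32_LE;
    if (len >= 3 && memcmp(buf, bom_utf8, 3) == 0)     return BOM_UTF8;
    if (len >= 2 && memcmp(buf, bom_utf16_be, 2) == 0) return BOM_UTF16_BE;
    if (len >= 2 && memcmp(buf, bom_utf16_le, 2) == 0) return BOM_UTF16_LE;
    return BOM_NONE;
}
```

One known ambiguity in this ordering: a UTF-16 LE file whose first
character after the mark is U+0000 would be classified as UTF-32 LE.
A file with no mark at all returns BOM_NONE, which is where the other
heuristics in the comment (or a fixed default) would have to take over.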