On Wed, 21 Oct 2015, D. Hugh Redelmeier wrote:
> The LC_CTYPE environment variable specifies character
> classification. GCC uses it to determine the character
> boundaries in a string; this is needed for some multibyte
> encodings that contain quote and escape characters that are
> otherwise interpreted as a string end or escape.
That's inaccurate. The default source encoding is always UTF-8. See the
comment in libcpp/charset.c.
/* We disable this because the default codeset is 7-bit ASCII on
   most platforms, and this causes conversion failures on every
   file in GCC that happens to have one of the upper 128 characters
   in it -- most likely, as part of the name of a contributor.
   We should definitely recognize in-band markers of file encoding,
   like:
   - the appropriate Unicode byte-order mark (FE FF) to recognize
     UTF16 and UCS4 (in both big-endian and little-endian flavors)
     and UTF8
   - a "#i", "#d", "/ *", "//", " #p" or "#p" (for #pragma) to
     distinguish ASCII and EBCDIC.
   - now we can parse something like "#pragma GCC encoding <xyz>
     on the first line, or even Emacs/VIM's mode line tags (there's
     a problem here in that VIM uses the last line, and Emacs has
     its more elaborate "local variables" convention).
   - investigate whether Java has another common convention, which
     would be friendly to support.
   (Zack Weinberg and Paolo Bonzini, May 20th 2004)  */
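(For illustration only -- this is not code from libcpp, and the function
name and interface are made up here -- a rough sketch of the kind of
in-band BOM detection the first bullet describes:

  #include <stddef.h>
  #include <string.h>

  /* Guess a Unicode encoding from a byte-order mark at the start of
     BUF (LEN bytes).  Returns NULL if no BOM is found, in which case
     the default (UTF-8) would apply.  */
  static const char *
  guess_encoding_from_bom (const unsigned char *buf, size_t len)
  {
    if (len >= 3 && !memcmp (buf, "\xEF\xBB\xBF", 3))
      return "UTF-8";
    /* Check the 4-byte UCS4 marks before the 2-byte UTF-16 ones,
       since FF FE 00 00 also starts with FF FE.  */
    if (len >= 4 && !memcmp (buf, "\x00\x00\xFE\xFF", 4))
      return "UCS-4BE";
    if (len >= 4 && !memcmp (buf, "\xFF\xFE\x00\x00", 4))
      return "UCS-4LE";
    if (len >= 2 && !memcmp (buf, "\xFE\xFF", 2))
      return "UTF-16BE";
    if (len >= 2 && !memcmp (buf, "\xFF\xFE", 2))
      return "UTF-16LE";
    return NULL;
  }

None of that is implemented; it is just what the comment is proposing.)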
I haven't checked whether the documentation (and the matching
documentation for -finput-charset) was once accurate in this regard,
i.e. whether the documentation in question dates from a time when
LC_CTYPE did determine the source character set.
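In any case, the current behaviour is that the source character set
comes from -finput-charset (defaulting to UTF-8), not from LC_CTYPE.
A made-up example (file name and contents invented here purely for
illustration):

  /* latin1.c -- saved in ISO-8859-1, so the accented character in the
     string below is the single byte 0xE9, which is not valid UTF-8.
     It needs something like

         gcc -finput-charset=ISO-8859-1 -c latin1.c

     to be converted correctly; setting LC_CTYPE alone does not change
     how the bytes of the source file are interpreted.  */
  const char *greeting = "café";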
--
Joseph S. Myers
[email protected]