On Wed, 21 Oct 2015, D. Hugh Redelmeier wrote:
> The LC_CTYPE environment variable specifies character
> classification. GCC uses it to determine the character
> boundaries in a string; this is needed for some multibyte
> encodings that contain quote and escape characters that are
> otherwise interpreted as a string end or escape.
That's inaccurate. The default source encoding is always UTF-8. See the
comment in libcpp/charset.c.
/* We disable this because the default codeset is 7-bit ASCII on
   most platforms, and this causes conversion failures on every
   file in GCC that happens to have one of the upper 128 characters
   in it -- most likely, as part of the name of a contributor.
   We should definitely recognize in-band markers of file encoding,
   like:
   - the appropriate Unicode byte-order mark (FE FF) to recognize
     UTF16 and UCS4 (in both big-endian and little-endian flavors)
     and UTF8
   - a "#i", "#d", "/ *", "//", " #p" or "#p" (for #pragma) to
     distinguish ASCII and EBCDIC.
   - now we can parse something like "#pragma GCC encoding <xyz>
     on the first line, or even Emacs/VIM's mode line tags (there's
     a problem here in that VIM uses the last line, and Emacs has
     its more elaborate "local variables" convention).
   - investigate whether Java has another common convention, which
     would be friendly to support.
   (Zack Weinberg and Paolo Bonzini, May 20th 2004)  */
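(For illustration only -- this is not code from libcpp, and the function
name and interface are made up here -- a rough sketch of the kind of
in-band BOM detection the first bullet describes:

  #include <stddef.h>
  #include <string.h>

  /* Guess a Unicode encoding from a byte-order mark at the start of
     BUF (LEN bytes).  Returns NULL if no BOM is found, in which case
     the default (UTF-8) would apply.  */
  static const char *
  guess_encoding_from_bom (const unsigned char *buf, size_t len)
  {
    if (len >= 3 && !memcmp (buf, "\xEF\xBB\xBF", 3))
      return "UTF-8";
    /* Check the 4-byte UCS4 marks before the 2-byte UTF-16 ones,
       since FF FE 00 00 also starts with FF FE.  */
    if (len >= 4 && !memcmp (buf, "\x00\x00\xFE\xFF", 4))
      return "UCS-4BE";
    if (len >= 4 && !memcmp (buf, "\xFF\xFE\x00\x00", 4))
      return "UCS-4LE";
    if (len >= 2 && !memcmp (buf, "\xFE\xFF", 2))
      return "UTF-16BE";
    if (len >= 2 && !memcmp (buf, "\xFF\xFE", 2))
      return "UTF-16LE";
    return NULL;
  }

None of that is implemented; it is just what the comment is proposing.)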
I haven't checked whether the documentation (and the matching
documentation for -finput-charset) was once accurate in this regard,
i.e. whether the documentation in question dates from a time when
LC_CTYPE did determine the source character set.
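In any case, the current behaviour is that the source character set
comes from -finput-charset (defaulting to UTF-8), not from LC_CTYPE.
A made-up example (file name and contents invented here purely for
illustration):

  /* latin1.c -- saved in ISO-8859-1, so the accented character in the
     string below is the single byte 0xE9, which is not valid UTF-8.
     It needs something like

         gcc -finput-charset=ISO-8859-1 -c latin1.c

     to be converted correctly; setting LC_CTYPE alone does not change
     how the bytes of the source file are interpreted.  */
  const char *greeting = "café";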
--
Joseph S. Myers
[email protected]