On Thu, 22 Oct 2015, Martin Sebor wrote:

> > LC_MESSAGES determines "the language and cultural conventions in which
> > messages should be written" (not necessarily the interpretation of
> > multibyte characters in that output).
> 
> Yes, but setting LC_CTYPE shouldn't affect the format, language,
> or encoding of diagnostic messages, which is what the text quoted
> from the GCC page recommends as a mechanism to change the quote
> character in diagnostic messages.

LC_CTYPE should affect the interpretation of multibyte character sequences 
as characters, including on output.  That's the standard semantics.  
That's what all C library functions involving interpretation of multibyte 
character sequences do.  Straightforward use of POSIX library interfaces 
does not support producing output in a character set other than that 
specified with LC_CTYPE; e.g. printf expects a format string (possibly 
resulting from a message catalog) in the LC_CTYPE character set, and does 
not convert the bytes to another character set.  I'm pretty sure POSIX 
never intended semantics for messages that would rule out producing 
them with the standard printf functions (or that would mean they 
couldn't be output with wprintf, using wide characters internally and 
converting to multibyte only on output).

Thus, LC_CTYPE affects transliteration when the logically desired 
characters are not available in the LC_CTYPE character set, which is 
the case here.
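
As a concrete sketch of that point (a minimal, hypothetical example; it 
assumes a glibc-style environment in which the C/POSIX locale's 
character set is plain ASCII):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* LC_CTYPE (taken from the environment) selects the multibyte
       character set used for conversion on input and output.  */
    setlocale(LC_ALL, "");

    /* printf emits these bytes unchanged; they are only meaningful if
       they already match the LC_CTYPE character set (here they are
       the UTF-8 encoding of U+2018/U+2019).  */
    printf("narrow: \xe2\x80\x98quoted\xe2\x80\x99\n");

    /* fwprintf converts wide characters to multibyte on output using
       LC_CTYPE; in a locale whose character set cannot represent
       U+2018/U+2019 the conversion fails rather than transliterating.  */
    fwprintf(stderr, L"wide: \u2018quoted\u2019\n");
    return 0;
}

Run in a UTF-8 locale both lines come out as expected; with glibc under 
LC_ALL=C the narrow line is just raw bytes the terminal may not render, 
and the wide conversion fails with EILSEQ.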

> If it did, it would mean that programmers would have to adjust
> to the language of the source code of the program they're working
> with.  For example, if part of a program was written assuming,
> say a German locale, then Japanese programmers would need to learn
> German to understand the compiler output.  (Otherwise, if they set
> LC_CTYPE to their Japanese locale) the German quotes in the source
> code could be interpreted as something else.

German quotes are completely irrelevant to source file interpretation.  
Character sets are relevant to source file interpretation only for 
mapping sequences of octets to logical source characters - for example, 
when a source file is in a multibyte character set where \ can form 
part of another character rather than just being the \ character from 
the basic source character set.  (Such character sets are not valid for 
locales in a POSIX environment in the GNU system, where valid locale 
character sets must interpret all ASCII characters the same as in the C 
locale and must not permit them as part of a multibyte character.)  
They never cause characters to be interpreted as quotes (or digits, 
etc.) if the character set does not map them to the quotes (or digits, 
etc.) in the basic C source character set.
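
For concreteness, the classic example of such a character set (my 
illustration, not one named above) is Shift_JIS, where the byte 0x5C, 
which is '\' in ASCII, can appear as the trailing byte of a double-byte 
character:

#include <stdio.h>

int main(void)
{
    /* KATAKANA LETTER SO (U+30BD) is encoded in Shift_JIS as the two
       bytes 0x83 0x5C; the second byte is the ASCII code for '\'.
       A tool scanning such a source file byte by byte, without
       knowing the encoding, would see a spurious backslash.  */
    const unsigned char so[] = { 0x83, 0x5C };
    printf("trailing byte == '\\' ? %s\n",
           so[1] == '\\' ? "yes" : "no");   /* prints "yes" */
    return 0;
}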

Again, LC_CTYPE does *not* affect source file interpretation.  Your 
example illustrates why having it affect source file interpretation 
would be a bad idea: the correct character set for source file 
interpretation is a property of the source file, and source files are 
routinely transmitted to people using different locales (so they need 
to be treated as sequences of bytes, not of locale-dependent 
characters).  Instead, the source file encoding is specified with 
-finput-charset, which should be specified in your program's Makefiles 
if something other than UTF-8 is used in a context where it matters.
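
For example (a hypothetical quote.c; the point only arises if the file 
really is saved in something other than UTF-8, here assumed to be 
ISO-8859-1):

/* quote.c - assume this file is saved in ISO-8859-1 (Latin-1), so the
   "ü" and "ß" below are single bytes that are not valid UTF-8.  The
   encoding is a property of the file, so it is declared in the
   Makefile rather than guessed from anyone's locale:

       CFLAGS += -finput-charset=ISO-8859-1

   GCC then converts the string literal to the execution character set
   (UTF-8 by default, or whatever -fexec-charset specifies).  */
#include <stdio.h>

int main(void)
{
    printf("Grüß Gott\n");
    return 0;
}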

It appears that LC_CTYPE was intended to affect source file 
interpretation before GCC 4.0.  It turned out that the code in question 
was unintentionally dead, and, when enabled, it broke things in 
practice <https://gcc.gnu.org/ml/gcc/2004-05/msg01007.html> (though 
given a whitelist of POSIX-safe character sets - ones where conversion 
cannot affect basic source characters, and so cannot affect the 
identification of comments - and with conversion disabled inside 
comments for such character sets, it might be safer).  The relevant 
documentation text was added in r24879 ("Merge in gcc2 snapshot 
19980929."), so we can't tell the rationale for it without someone with 
access to the gcc2 list archives checking them, but it looks like it 
may have been related to an --enable-c-mbchar option, always disabled 
by default, which was removed in 2003.

You could write your "c99" program wrapper to add a -finput-charset= 
option based on the locale's character set if you so wish.  (It also 
needs to do things such as option reordering and handling -O with a 
separate argument - the "gcc" driver deliberately processes -D and -U 
options in the order they appear on the command line, not following the 
POSIX rule that -U options take precedence over -D - so you should not 
expect the "gcc" driver to be usable as "c99" without adaptation for 
such deliberate differences.)
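
A minimal sketch of just the -finput-charset part of such a wrapper 
(hypothetical; it deliberately ignores the option reordering and -D/-U 
handling mentioned above, and it passes the nl_langinfo(CODESET) name 
straight through, which a real wrapper might need to map to a name 
iconv accepts):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Pick up the user's locale; nl_langinfo(CODESET) then names the
       LC_CTYPE character set, e.g. "UTF-8" or "ISO-8859-1".  */
    setlocale(LC_ALL, "");
    const char *codeset = nl_langinfo(CODESET);

    char charset_opt[128];
    snprintf(charset_opt, sizeof charset_opt,
             "-finput-charset=%s", codeset);

    /* Run: gcc -std=c99 -finput-charset=<codeset> <original args...> */
    char **args = malloc((argc + 3) * sizeof *args);
    if (args == NULL)
        return 1;
    int n = 0;
    args[n++] = "gcc";
    args[n++] = "-std=c99";
    args[n++] = charset_opt;
    for (int i = 1; i < argc; i++)
        args[n++] = argv[i];
    args[n] = NULL;

    execvp("gcc", args);
    perror("execvp gcc");
    return 127;
}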

I think we clearly should update the documentation to reflect reality 
regarding source file encoding, and leave it strictly to wrappers such 
as "c99" to specify -finput-charset= options, rather than leaving open 
the possibility that GCC's own default might change in future.

-- 
Joseph S. Myers
jos...@codesourcery.com
