On 07/19/18 16:18, Andrew Robinson wrote:
Hi Theron,
How is a compiler that rejects bytes 128-255 in string literals
necessarily non-conforming?...Now, knowing that not all compilers
in practice do accept this cleanly.
As you yourself pointed out in an earlier message, "the compiler should not
try to reinterpret the byte sequences in any way". That is a truism
I was wrong to say that only 0-127 are "proper C".
Here is a relevant reference:
https://en.cppreference.com/w/c/language/string_literal
In particular,
"1) /character string literal/: The type of the literal is char[], each character in
the array is initialized from the next character in s-char-sequence using the execution
character set."
However,
"The encoding of character string literals (1) and wide string literals (5) is
implementation-defined. For example, gcc selects them with the commandline options
-fexec-charset and -fwide-exec-charset."
How I interpret this is that while it is not the compiler's job to do an
exact copy of the bytes ("Raw" string literals are a C++ feature), it
should get it right as long as the source encoding agrees with the
encoding targeted by the compiler. I have no reason to doubt that UTF-8
literals work just fine in Clang and GCC, or that ISO-8859-1 literals
work in Microsoft's compiler, and I would be disappointed if each did
not also support the other encoding.
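One way to see what a given compiler actually stored for a literal is to
dump its bytes. The following is only a minimal sketch; hex_dump is a
hypothetical helper name, not anything from IUP, and it assumes an
ASCII-compatible execution character set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of IUP): writes the bytes of s into out
   as uppercase hex pairs separated by single spaces, and returns the
   number of bytes that fit. */
static size_t hex_dump(const char *s, char *out, size_t outsz)
{
    size_t n, i, used = 0;

    assert(s != NULL && out != NULL);
    if (outsz == 0)
        return 0;
    out[0] = '\0';
    n = strlen(s);
    for (i = 0; i < n; i++) {
        /* worst case per byte: space + 2 hex digits + terminating NUL */
        if (used + 4 > outsz)
            break;
        used += (size_t)sprintf(out + used, i ? " %02X" : "%02X",
                                (unsigned)(unsigned char)s[i]);
    }
    return i;
}
```

Passing a literal that contains a raw non-ASCII character shows whether
it arrived as one byte (ISO-8859-1) or two (UTF-8). With GCC, the result
may change with -fexec-charset, and with -finput-charset if the source
file encoding is not what the compiler assumes.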
In the case of iup_str.c and iup_strmessage.c, this doesn't mean the
existing C sources are okay as-is: two different encodings are used
within one file. Only one might be expected to work correctly at a
time, and this depends on the compiler and on any command-line options.
and since
the problem only exists in some compilers and not others, it would make more
sense to switch compilers rather than switch code just so those certain compilers
that don't work with iup_str.c would suddenly start working.
I guess it works as-is, with no warnings, in GCC and MSVC, but it is not
best for IUP portability (part of its purpose) to be restricted to these
two compilers. Keep in mind that other compilers are not necessarily
wrong; in some cases the difference is simply greater strictness in
enforcing source validity.
The strings themselves should not ever need editing to "make it work".
As long as the strings are already correctly encoded ISO8859-1 and
UTF-8
There is no such thing as "correctly encoded" C-strings, only valid or invalid
C-strings. Here are a few examples of some valid C-strings:
I meant to refer to the strings themselves, i.e. the 8-bit integer
arrays stored at compile-time, not to the literals in the source. As is
under discussion, there are various ways in the C language and in its
various implementations to pack that byte array into the compiled
library, but the array itself needs to be a valid encoded form of the
intended text - this much didn't really need to be said, but it is all
that I meant.
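To illustrate what "a valid encoded form" can mean for the stored array,
here is a rough sketch of a UTF-8 well-formedness check. The function
name is_valid_utf8 is made up for this example and is not an IUP
function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of IUP): returns 1 if buf[0..len) is
   well-formed UTF-8, 0 otherwise. Rejects overlong forms, surrogates,
   and code points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* code point, smallest legal value */

        if (b < 0x80) { i++; continue; }   /* plain ASCII byte */
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                     /* stray continuation or bad lead */

        if (len - i - 1 < need)
            return 0;                      /* sequence is truncated */
        i++;
        while (need--) {
            if ((buf[i] & 0xC0) != 0x80)
                return 0;                  /* not a continuation byte */
            cp = (cp << 6) | (buf[i] & 0x3F);
            i++;
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

Note that this only works in one direction. Most ISO-8859-1 text with
accented letters fails the check, but any byte array can be read as
ISO-8859-1, so there is no equivalent test that proves an array was
meant to be ISO-8859-1 rather than UTF-8.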
ASCII and ANSI are so yesterday, so why are they still hanging around causing
problems?
Following this line of reasoning, why are both UTF-8 and ISO-8859-1
needed in IUP source? Shouldn't the project choose one for source, and
convert to the other at runtime only for APIs (Microsoft?) which need it?
If a single C source file absolutely must generate code containing
constants under both encodings, the options seem to be ASCII+escapes as
a lowest-common-denominator, short of simple hexadecimal integer arrays
which are entirely unreadable (except perhaps to a coder). However, I
would be in agreement that this is far from an ideal solution.
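For concreteness, here is roughly what the two options look like for a
single word. This is only a sketch: the variable names and the word_for
selector are invented for illustration and do not come from iup_str.c.
The word is the Portuguese for "action" (a, c-cedilla, a-tilde, o).

```c
#include <assert.h>
#include <string.h>

/* ASCII+escapes: the source file itself stays pure ASCII. */
static const char word_latin1[] = "a\xE7\xE3o";          /* ISO-8859-1 */
static const char word_utf8[]   = "a\xC3\xA7\xC3\xA3o";  /* UTF-8 */

/* A hex escape consumes every hex digit that follows it, so when the
   next character is itself a hex digit the literal has to be split. */
static const char split_demo[] = "\xE7" "a";  /* c-cedilla, then 'a' */

/* The integer-array form of the same UTF-8 bytes. */
static const unsigned char word_utf8_raw[] =
    { 0x61, 0xC3, 0xA7, 0xC3, 0xA3, 0x6F, 0x00 };

/* Hypothetical selector: choose the constant for the active encoding. */
static const char *word_for(int want_utf8)
{
    return want_utf8 ? word_utf8 : word_latin1;
}

/* Sanity check that both spellings hold the expected byte counts. */
static void self_check(void)
{
    assert(strlen(word_latin1) == 4);
    assert(strlen(word_utf8) == 6);
    assert(memcmp(word_utf8, word_utf8_raw, sizeof word_utf8_raw) == 0);
}
```

Both forms compile to the same bytes under any compiler, whatever its
source or execution character set options, which is the portability
argument. The cost is the readability already mentioned.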
Theron