Re: [Iup-users] Use utf-8 source encoding rather than ISO8859-1

Theron Thu, 19 Jul 2018 15:20:02 -0700

On 07/19/18 16:18, Andrew Robinson wrote:

Hi Theron,

How is a compiler that rejects bytes 128-255 in string literals
necessarily non-conforming?...Now, knowing that not all compilers
in practice do accept this cleanly.

As you yourself pointed out in an earlier message, "the compiler should not
try to reinterpret the byte sequences in any way". That is a truism

I was wrong to say that only 0-127 are "proper C".

Here is a relevant reference: https://en.cppreference.com/w/c/language/string_literal

In particular,

"1)  /character string literal/: The type of the literal is char[], each character in 
the array is initialized from the next character in s-char-sequence using the execution 
character set."

However,

"The encoding of character string literals (1) and wide string literals (5) is 
implementation-defined. For example, gcc selects them with the commandline options 
-fexec-charset and -fwide-exec-charset."

How I interpret this is that while it is not the compiler's job to do an exact copy of the bytes ("Raw" string literals are a C++ feature), it should get it right as long as the source encoding agrees with the encoding targeted by the compiler. I have no reason to doubt that UTF-8 literals work just fine in Clang and GCC, or that ISO-8859-1 literals work with Microsoft's compiler, and I would be disappointed if these do not also support each other's encoding.

In the case of iup_str.c and iup_strmessage.c, this doesn't mean the existing C sources are okay as-is: two different encodings are used within one file. Only one might be expected to work correctly at a time, and this depends on the compiler and on any command-line options.

  and since
the problem only exists in some compilers and not others, it would make more
sense to switch compilers rather switch code just so those certain compilers
that don't work with iup_str.c would suddenly start working.

I guess it works as-is, with no warnings, in GCC and MSVC, but it is not best for IUP portability (part of its purpose) to be restricted to these two compilers. Keep in mind that other compilers are not necessarily wrong; in some cases the problem is greater strictness in enforcing source validity.

The strings themselves should not ever need editing to "make it work".
As long as the strings are already correctly encoded ISO8859-1 and
UTF-8

There is no such thing as "correctly encoded" C-strings, only valid or invalid
C-strings. Here are a few examples of some valid C-strings:

I meant to refer to the strings themselves, i.e. the 8-bit integer arrays stored at compile-time, not to the literals in the source. As is under discussion, there are various ways in the C language and in its various implementations to pack that byte array into the compiled library, but the array itself needs to be a valid encoded form of the intended text - this much didn't really need to be said, but it is all that I meant.
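
To make "a valid encoded form" concrete for the UTF-8 case, here is a minimal sketch of a well-formedness check on the stored byte array. The name is_valid_utf8 is hypothetical and not part of IUP. Note that no equivalent check exists for ISO-8859-1, where every byte value maps to some character.

```c
/* Hypothetical helper, not part of IUP: returns 1 if the NUL-terminated
 * byte string is well-formed UTF-8, and 0 otherwise. */
int is_valid_utf8(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;

    while (*s) {
        unsigned long cp;   /* decoded code point */
        int extra;          /* number of continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return 0;      /* stray continuation byte or invalid lead byte */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)
                return 0;   /* also catches a premature terminating NUL */
            cp = (cp << 6) | (*s & 0x3F);
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (extra == 1 && cp < 0x80)    return 0;
        if (extra == 2 && cp < 0x800)   return 0;
        if (extra == 3 && cp < 0x10000) return 0;
        if (cp > 0x10FFFFUL)            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    }
    return 1;
}
```

A string holding ISO-8859-1 text with any byte above 127 will almost always fail this check, which is one way to notice that two encodings have been mixed.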

ASCII and ANSI are so yesterday, so why are they still hanging around causing
problems?

Following this line of reasoning, why are both UTF-8 and ISO-8859-1 needed in IUP source? Shouldn't the project choose one for source, and convert to the other at runtime only for APIs (Microsoft?) which need it?
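
As a rough illustration of how small a runtime conversion can be, here is a sketch for one direction, ISO-8859-1 to UTF-8. The function name latin1_to_utf8 and its buffer-size convention are assumptions made for this example; this is not an existing IUP function, and the opposite direction would additionally need a policy for characters that ISO-8859-1 cannot represent.

```c
#include <stddef.h>

/* Hypothetical helper, not part of IUP: convert a NUL-terminated
 * ISO-8859-1 string to UTF-8. Returns the number of bytes written
 * (excluding the terminator), or (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    for (; *s != 0; s++) {
        if (*s < 0x80) {
            /* ASCII range: identical in both encodings. */
            if (n + 1 >= dst_size) return (size_t)-1;
            dst[n++] = (char)*s;
        } else {
            /* 0x80..0xFF map to U+0080..U+00FF: two UTF-8 bytes. */
            if (n + 2 >= dst_size) return (size_t)-1;
            dst[n++] = (char)(0xC0 | (*s >> 6));
            dst[n++] = (char)(0x80 | (*s & 0x3F));
        }
    }
    if (n >= dst_size) return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

Each input byte produces at most two output bytes, so a destination buffer of twice the source length plus one is always sufficient.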

If a single C source file absolutely must generate code containing constants under both encodings, the options seem to be ASCII+escapes as a lowest-common-denominator, short of simple hexadecimal integer arrays which are entirely unreadable (except perhaps to a coder). However, I would be in agreement that this is far from an ideal solution.
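
For comparison, a small sketch of the two options, using the character "é" as an arbitrary example. The identifiers are made up for illustration and do not come from iup_str.c. The source stays pure ASCII, so the bytes stored do not depend on the compiler's source or execution character set settings.

```c
#include <string.h>

/* ASCII + escapes: "é" spelled out byte by byte. */
const char e_acute_latin1[] = "\xE9";       /* ISO-8859-1: one byte */
const char e_acute_utf8[]   = "\xC3\xA9";   /* UTF-8: two bytes     */

/* The same UTF-8 bytes as a plain integer array (terminator explicit). */
const unsigned char e_acute_utf8_raw[] = { 0xC3, 0xA9, 0x00 };

/* Returns 1 when the escaped literal and the integer array hold
 * exactly the same bytes, including the terminating NUL. */
int escapes_match_raw_array(void)
{
    return sizeof e_acute_utf8 == sizeof e_acute_utf8_raw
        && memcmp(e_acute_utf8, e_acute_utf8_raw,
                  sizeof e_acute_utf8_raw) == 0;
}
```

One readability trap with the escape form: a hexadecimal escape absorbs every hex digit that follows it, so a literal such as "\xE9" followed directly by a letter a-f has to be split into adjacent literals, e.g. "\xE9" "d".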


Theron
