>  Following this line of reasoning, why are both UTF-8 and ISO-8859-1
needed in IUP source?  Shouldn't the project choose one for source, and
convert to the other at runtime only for APIs (Microsoft?) which need it?

  UTF-8 support in IUP is only a couple of years old, so we need a way to
test everything without compromising well-established projects. And we did
not want to penalize one side or the other with a conversion that could
have been done beforehand.

  Also, text editors took some time to become UTF-8 friendly, so we decided
to use ISO-8859-1 as our source encoding, and to write UTF-8 multi-byte
sequences explicitly, so we can quickly identify encoding problems.

  The problem so far seems to be some compilers incorrectly interpreting
our source code in certain situations. Although I prefer more general
solutions, and we are going to study that approach, we also like simple
and practical ones, like setting a compiler flag that builds the IUP
library using a specified text encoding. So for now I'll ask you guys to
focus on that if possible. Please report back the flags used in each
situation so we can add support in our Makefiles. Hopefully in the future
we will come up with a more general solution.

Best,
Scuri


On Thu, Jul 19, 2018 at 7:19 PM, Theron <[email protected]>
wrote:

> On 07/19/18 16:18, Andrew Robinson wrote:
>
> Hi Theron,
>
>
> How is a compiler that rejects bytes 128-255 in string literals
> necessarily non-conforming?...Now, knowing that not all compilers
> in practice do accept this cleanly.
>
> As you yourself pointed out in an earlier message, "the compiler should not
> try to reinterpret the byte sequences in any way". That is a truism.
>
> I was wrong to say that only 0-127 are "proper C".
> Here is a relevant reference:
> https://en.cppreference.com/w/c/language/string_literal
> In particular,
>
> "1) *character string literal*: The type of the literal is char[], each 
> character in the array is initialized from the next character in 
> s-char-sequence using the execution character set."
>
> However,
>
> "The encoding of character string literals (1) and wide string literals (5) 
> is implementation-defined. For example, gcc selects them with the commandline 
> options -fexec-charset and -fwide-exec-charset."
>
> How I interpret this is that while it is not the compiler's job to do an
> exact copy of the bytes ("Raw" string literals are a C++ feature), it
> should get it right as long as the source encoding agrees with the encoding
> targeted by the compiler.  I have no reason to doubt that UTF-8 literals
> work just fine in Clang and GCC, or that ISO-8859-1 works in Microsoft's
> compiler, and I would be disappointed if these do not also support each
> other.
>
> In the case of iup_str.c and iup_strmessage.c, this doesn't mean the
> existing C sources are okay as-is: two different encodings are used within
> one file.  Only one might be expected to work correctly at a time, and this
> depends on the compiler and on any command-line options.
>
>  and since
> the problem only exists in some compilers and not others, it would make more
> sense to switch compilers rather than switch code just so those certain
> compilers that don't work with iup_str.c would suddenly start working.
>
> I guess it works as-is, with no warnings, in GCC and MSVC, but it is not
> best for IUP portability (part of its purpose) to be restricted to these
> two compilers.  Keep in mind that other compilers are not necessarily
> wrong, in some cases the problem is greater strictness in enforcing source
> validity.
>
> The strings themselves should not ever need editing to "make it work".
> As long as the strings are already correctly encoded ISO8859-1 and
> UTF-8
>
> There is no such thing as "correctly encoded" C-strings, only valid or invalid
> C-strings. Here are a few examples of some valid C-strings:
>
> I meant to refer to the strings themselves, i.e. the 8-bit integer arrays
> stored at compile-time, not to the literals in the source.  As is under
> discussion, there are various ways in the C language and in its various
> implementations to pack that byte array into the compiled library, but the
> array itself needs to be a valid encoded form of the intended text - this
> much didn't really need to be said, but it is all that I meant.
>
> ASCII and ANSI are so yesterday, so why are they still hanging around causing
> problems?
>
> Following this line of reasoning, why are both UTF-8 and ISO-8859-1 needed
> in IUP source?  Shouldn't the project choose one for source, and convert to
> the other at runtime only for APIs (Microsoft?) which need it?
>
> If a single C source file absolutely must generate code containing
> constants under both encodings, the options seem to be ASCII+escapes as a
> lowest-common-denominator, short of simple hexadecimal integer arrays which
> are entirely unreadable (except perhaps to a coder).  However, I would be
> in agreement that this is far from an ideal solution.
>
> Theron
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Iup-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/iup-users
>