Re: gcc/libcpp: non-UTF-8 source or execution encodings?

David Malcolm Wed, 20 Jul 2016 09:00:03 -0700

On Tue, 2016-07-19 at 16:10 -0700, David Edelsohn wrote:
> Hi, David
> 
> I don't believe that hardware easily is available.  We probably could
> arrange for access, if it is necessary, but it is not accessible
> through the IBM Community Development system for Linux on z Systems
> because this isn't Linux-based.  GCC on the system is not
> self-hosting -- I believe that GCC only is used as a cross-compiler.
> 
> Thanks, David


I did some more digging, and it looks like hardware isn't necessary: I
found PR 18785 ("[4.0 Regression] isdigit builtin function fails with
EBCDIC character sets"), which led me to these options in the C family
of frontends:
  -fexec-charset=
  -fwide-exec-charset=

and these get used for cpp_opts->narrow_charset and
cpp_opts->wide_charset respectively in libcpp; they ultimately get
passed to iconv (if they don't match any of the priority special-cases
in libcpp/charset.c).

It looks like
  -fexec-charset=IBM1047
is the correct command-line option for enabling EBCDIC (or rather, one
of the various EBCDIC encodings), and I was able to use this from my
x86_64 host to generate .s files with EBCDIC for the embedded strings.

I wasn't able to find an iconv code for UTF-EBCDIC:

  gcc -S ../../src/test.c -fexec-charset=UTF-EBCDIC
  cc1: error: conversion from UTF-8 to UTF-EBCDIC not supported by iconv

but "interesting" values like -fexec-charset=UTF-16 appear to satisfy
my requirement for a way to stress-test the string-literal
location-handling code.

Thanks!

> On Tue, Jul 19, 2016 at 3:39 PM, David Malcolm <[email protected]>
> wrote:
> > On Tue, 2016-07-19 at 12:24 -0400, David Edelsohn wrote:
> > > > On Tue, Jul 19, 2016 at 12:05 PM, David Malcolm <[email protected]> wrote:
> > > > > libcpp/charset.c has a helpful introductory comment describing
> > > > > character sets, including the source and execution character
> > > > > sets.
> > > > > 
> > > > > libcpp appears to attempt to support both UTF-8 and UTF-EBCDIC
> > > > > for the source character set, via:
> > > > 
> > > > #if HOST_CHARSET == HOST_CHARSET_ASCII
> > > > #define SOURCE_CHARSET "UTF-8"
> > > > #define LAST_POSSIBLY_BASIC_SOURCE_CHAR 0x7e
> > > > #elif HOST_CHARSET == HOST_CHARSET_EBCDIC
> > > > #define SOURCE_CHARSET "UTF-EBCDIC"
> > > > #define LAST_POSSIBLY_BASIC_SOURCE_CHAR 0xFF
> > > > #else
> > > > #error "Unrecognized basic host character set"
> > > > #endif
> > > > 
> > > > though libiberty's safe-ctype.c has:
> > > > 
> > > > # if HOST_CHARSET == HOST_CHARSET_EBCDIC
> > > >   #error "FIXME: write tables for EBCDIC"
> > > > 
> > > > > so presumably we only effectively support UTF-8 as the source
> > > > > char set.
> > > > > 
> > > > > Do we support any hosts for which the source character set is
> > > > > *not* UTF-8?
> > > > > 
> > > > > Similarly, do we support any targets for which the execution
> > > > > character set is *not* UTF-8?
> > > > 
> > > > > This relates to the locations-within-string-literals patch I
> > > > > posted here:
> > > > > https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00441.html
> > > > > ("[PATCH] RFC: On-demand locations within string-literals");
> > > > > that patch currently has an assumption that the source
> > > > > encoding == execution encoding, and I'd appreciate knowing a
> > > > > configuration for which this isn't the case so I can test
> > > > > accordingly.
> > > 
> > > > I believe that the GCC z/TPF configuration uses EBCDIC.  There
> > > > also is the on-again off-again i370 port.
> > > 
> > > Thanks, David
> > 
> > > Thanks.  Looks like the triple for the former is "s390x-ibm-tpf";
> > > I'm experimenting with that as the target.
> > > 
> > > Is there any accessible hardware for these?  I don't see them in
> > > the gcc compile farm.
> > 
> > Dave
