On Wed, 2015 Apr 29 15:42+0000, Thorsten Glaser wrote:
> Daniel Richard G. dixit:
>
> >I'm working from a system with a UTF-8 locale, but as I'm US-based,
> >pretty much everything is ASCII. The conversion layer, however,
>
> OK, I can see that. Though I’m using lots of UTF-8 stuff even when
> writing English… they call me Mr. WTF-8 sometimes ☻

Well, my mail user agent is up to snuff, even if my company's mainframe
system consoles aren't :]

> >explicitly uses ISO 8859-1 on the client side. If I send actual UTF-
> >8, that would probably get interpreted as so much Latin-1.
>
> OK. I can work with that assumption. Thanks.

I've recently come across some relevant information regarding IBM's
port of OpenSSH on z/OS:

    OpenSSH assumes that all text data traveling across the network is
    encoded in ISO/IEC 8859-1 (Latin-1). Specifically, OpenSSH treats
    data as text and performs conversion between the ASCII Latin-1 coded
    character set and the EBCDIC-coded character set of the current
    locale in the following scenarios:

    * ssh login session
    * ssh remote command execution
    * scp file transfers
    * sftp file transfers when the ascii subcommand is specified

    The OpenSSH daemon (sshd) can understand and handle non-Latin-1
    coded character sets on the network for interactive sessions,
    specifically sessions with a tty allocated. However, not all EBCDIC-
    coded character sets are compatible with ISO 8859-1. To determine if
    a coded character set is compatible with a particular locale, see
    the information about locales supplied with z/OS XL C/C++ in z/OS XL
    C/C++ Programming Guide.

    Warning: If there is no one-to-one mapping between the EBCDIC coded
    character set of the session data and ISO 8859-1, then nonidentical
    conversions might occur. Specifically, substitution characters (for
    example, IBM-1047 0x3F) are inserted into the data stream for those
    incompatible characters. See “Configuring the OpenSSH daemon” [...]

-- http://www-03.ibm.com/systems/resources/fot4os02.pdf
   (section "OpenSSH and globalization")

It seems like IBM has placed the EBCDIC<->ASCII conversion layer in the
OpenSSH daemon itself, rather than in a system facility >_<

> Out of curiosity: what do the various conversion layers
> (NFS, extended attributes, etc.) do with the nōn-ASCII parts
> of the mksh source? Do you get Â© for the copyright sign
> (i.e. interpreted as latin1) too?

    $ grep Copyright sh.h
     * Copyright © 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
                 ^ (proper copyright symbol)

It's possible that z/OS interprets that symbol as two Latin-1
characters, but then when that is sent back to my Linux terminal, it
gets "reassembled" as UTF-8.

Remember my grumbling about too many conversion layers? :>

> http://en.wikipedia.org/wiki/EBCDIC_1047#Code_page_translation looks
> useful. That maps 1:1 to Unicode, of course, and we can even do cp924.
> This may even make the utf_{wc,mb}to{mb,wc} code into a simple table
> lookup. Well one of them anyway. But see below under “conversion
> routines”.

So basically, iconv or a workalike...
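
(A dev box with GNU iconv can already produce such a table; the
IBM1047 converter name is a glibc-ism, so consider this illustrative:

    $ printf '\201\301\361' | iconv -f IBM1047 -t ISO8859-1
    aA1

i.e. EBCDIC 0x81/0xC1/0xF1 come out as a/A/1, and looping over all
256 octets this way would yield the complete lookup table.)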

> >Removing it [utf8-mode]? I thought off-by-default would be enough...
>
> It may turn out to be enough. I think it depends on the conversion
> layer. We’ll see. We can experiment a lot, after all. I’d prefer to
> keep the delta low, too.

Aye, I was certainly envisioning a lightweight set of changes. At most,
disable setting UTFMODE != 0.

> I was thinking of this:
> 
> $ echo '+' | tr '(-*' '*-,'
> +
> 
> This should give a ‘)’ in EBCDIC, right?

Hate to disappoint...

    $ echo '+' | tr '(-*' '*-,'
    +
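
(For reference, the arithmetic behind the expectation: in EBCDIC,
'(' is 0x4D, '+' is 0x4E and '*' is 0x5C, so '+' falls inside the
source range and would be shifted up by 0x0F to 0x5D, which is ')'.
In ASCII, '+' at 0x2B sits just outside the range 0x28-0x2A and
passes through untouched, which is exactly what happened here.)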

But tr(1) does support octal escapes, so you could do e.g.

    $ echo a | tr '\201' X
    X

> >I would just have a small platform note in the documentation that
> >calls the user's attention to xlc's -qascii and -qconvlit options,
> >with a brief discussion of the ASCII vs. EBCDIC issues, and then let
> >them decide how to deal with it.
>
> OK. Maybe we can use an additional Build.sh option to control that,
> actually.

Perhaps, though if explicit support in Build.sh counts as "hand-
holding," consider that z/OS is a platform with very few users, and
those users are more likely than not able to figure out compiler flags
themselves anyway. A Build.sh option just seems like overkill to me.

> I was thinking kill(2) not kill(1), but…
> 
> >    $ kill -40 83953851
> >    kill: FSUM7327 signal number 40 not conventional
> >    kill: 83953851: EDC5121I Invalid argument.
> >    $ kill -39 83953851
> >    kill: FSUM7327 signal number 39 not conventional
> 
> … then set NSIG to 40 (or SIGMAX to 39). Can you also send me a list
> of all SIG* defines on the system, so that Build.sh can pick them up?

NSIG=40, check. Will send some information via private mail.
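
(In case it's useful, a crude way of harvesting those defines; the
header location is a guess on my part:

    $ grep 'define.*SIG[A-Z]' /usr/include/signal.h

modulo whatever gets pulled in through nested #includes.)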

> >Yes, rlimits.gen is lacking the continuation backslashes from
> >rlimits.opt. Guess those are getting dropped somewhere.
>
> Ah. That is definitely a host shell bug; read without -r is supposed
> to drop the backslash *and* the following newline.

Couldn't you just keep the backslash-newlines in the *.gen files? Are
there preprocessors that can't deal with such multi-line macro
definitions?
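
For reference, this is what the POSIX behavior looks like under a
well-behaved shell:

    $ printf 'foo\\\nbar\n' | while read line; do echo "<$line>"; done
    <foobar>
    $ printf 'foo\\\nbar\n' | while read -r line; do echo "<$line>"; done
    <foo\>
    <bar>

A host shell that eats the backslash but keeps the newline would
produce exactly the symptom seen in rlimits.gen.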

> >I wouldn't encourage a host-side C tool here, as that was partly what
> >made a GNU Bash build unmanageable on this system...
>
> It’s inevitable though. But I don’t think it will make anything
> unmanageable. It’s mostly still Build.sh checking for things, then
> building something, then running it, which will generate a bunch of
> files, then it’d compile the shell itself.

It's possible some chtag(1) tagging might be needed, as encodings could
potentially get mixed up in certain instances. (This was the case with a
-qascii Bash build.)
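
For instance (I'm going from memory on the chtag syntax, so take this
with salt):

    $ chtag -tc ISO8859-1 rlimits.gen   # tag as ASCII text
    $ chtag -p rlimits.gen              # show the current tag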

> >I'm presuming this would be wchar_t and its related functions?
> 
> Absolutely no! These are extremely unportable.

I suspected as much, but I've never actually had to deal with them, so
was unsure. I'll keep that point in mind, however.

> >I didn't think the #ifdefs got out of hand... was there any place in
> >particular where you saw this to be the case?
>
> No, I’ve not yet looked at it in detail. I currently also think they
> can stay manageable. But I don’t want to promise something currently,
> if it may turn out I can’t hold it; especially as I’m pretty ignorant
> wrt. EBCDIC workings.

Oh, okay. I don't think the conditionals should get hairy... and if they
do, then there is probably a better way of going about it.

> >Yet I presume you would not want to integrate such a library into the
> >main source tree (e.g. under a win32/ directory), irrespective of the
> >CR+LF/fork() issue, as you wouldn't/couldn't want to maintain such a
> >library yourself...?
>
> That, and the library is used for much more than just mksh, and
> developed as such, and contains much more. It’s not in the scope.

Ah, yes, that's a solid reason to have it live a separate life, then.

> >I thought a number of older Unix environments still used ISO 8859
> >encodings, as Linux once did. You're saying, it doesn't work at all, or
> >there's just no first-class support for it? Especially if it's 8-bit
> >transparent, that sounds like high-bit characters would at least pass
> >through safely.
> 
> They pass through safely, and backspace will remove one octet from
> the input line. That’s about the extent of it. They will, for example,
> never match [[:print:]], that’s either ASCII or Unicode. (We can make
> this work magically for EBCDIC by treating it as Unicode internally.
> Possibly.)

That seems reasonable... no worse than what I've usually seen, anyway.

> [ conversion routines ]
> >You would continue to provide these routines as a fallback, however,
> >right? For systems that don't have them?
>
> Currently, *all* systems, even those who have them, use the mksh-
> supplied routines, because either there are none at the OS level, or
> they suck; besides, they all don’t support the required 8-bit
> transparentness in UTF-8 mode. (Except MirBSD’s, of course, as that’s
> what it’s designed after.)

Makes perfect sense. I certainly wouldn't put much trust in systems
getting this right, when they already get so much wrong!

> The idea here is to make an exception for this and use the OS-provided
> conversion functions for EBCDIC, if they are available widespread
> enough and handle the possible cases of several EBCDIC variants well.
> If not, we’ll use the lookup tables I mentioned above.

There's iconv, and I also see this handful of z/OS-specific routines:
__atoe(), __atoe_l(), __etoa(), __a2e_l(), and similar:

http://www-01.ibm.com/support/knowledgecenter/SSLTBW_2.1.0/com.ibm.zos.v2r1.bpxbd00/r0cate.htm

About the only advantage I see to using these routines is that they
modify the string in-place.

On the iconv side, I did notice (in the course of running the Gnulib
test suite) that iconv_open() does not recognize "ISO-8859-1"---it has
to be "ISO8859-1". I suspect that only the identifiers printed by
"iconv -l" are supported, and unlike what you see on Linux, there are
not a lot of synonyms in that list.
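
(The safe play is probably to feed iconv_open() only names that
literally appear in that list, e.g.

    $ iconv -l | grep 8859

and not count on any aliases beyond those.)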

> We can have, say, zmksh (and zlksh), for which this does not hold.

Is it convention to name the binaries differently for nonstandard
variants? (E.g. the native Win32 port would also have modified names?)

> >of testing, merging, updating/syncing, or even just letting users
> >know it exists. But if it is integrated as a compile-time option
> >with strong caveats about the non-standardness, and an
> >appropriately different KSH_VERSION, then that maintains the
> >distinction and puts scripts/users on notice about the different
> >promises that are being made.
>
> Right. I’ll try to make things fit into that scheme if at all
> possible, then.

Sounds reasonable... I think it can work :)

> >I had in mind compile-time detection of EBCDIC (just depending on
> >what CFLAGS are set)
> 
> Hm. I’d prefer to amend the CFLAGS inside Build.sh with the proper
> flags for OS/390 depending on which EBCDIC variant is used, so we
> can also make that into a CPPFLAGS entry which mksh can use, or
> something like that. You said you’d want to support at least two
> codepages, which differ in e.g. ‘[]’. – Or can you switch codepages
> at runtime? In that case, it’d become even trickier…

The code page is set at compile time, with the -qconvlit option. From
the xlc(1) man page:

         -qconvlit[=<code_page>:{wchar|nowchar|unicode}]
                    | -qnoconvlit
                Changes the assumed codepage for character and
                string literals within the compilation unit.
                The default is -qconvlit=IBM-1047:nowchar.
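
So building against a different EBCDIC variant would presumably look
something like the following (untested, and passing the flag via
CC/CFLAGS into Build.sh is an assumption on my part):

    $ CC=xlc CFLAGS=-qconvlit=IBM-037:nowchar sh Build.sh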

I would hesitate to enumerate all "supported" EBCDIC variants in
Build.sh, just because I don't think it's necessary---as long as the
code page contains all the ASCII characters we need, how the code
points are assigned doesn't really matter (at least where we can
specify characters literally; the escape character would be an
exception). I
believe the only case where this could become an issue is when you have
mismatched code pages (e.g. EBCDIC 1047 mksh + EBCDIC 037 user), and
then you pray that as many code points agree as possible. This, IMO,
falls squarely in the category of "user caveat."
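
The '[' vs. ']' difference is easy to demonstrate from the Linux side,
assuming GNU iconv's IBM1047/IBM037 converters:

    $ printf '[]' | iconv -f ISO8859-1 -t IBM1047 | od -An -tx1
     ad bd
    $ printf '[]' | iconv -f ISO8859-1 -t IBM037 | od -An -tx1
     ba bb

Same two characters, four distinct code points between the two
variants.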

This situation could change, however, once mksh is doing UTF-16
internally. Then, because it has to translate everything to and from the
outside world anyway, I see no reason why it couldn't use a 1047 table
for user A, and a 037 table for user B. Perhaps even straight UTF-8 for
user C! But then this would be a major change, because then you might
actually _not_ want the compiler to transcode your strings. I'll be
happy to revisit this when the time comes.

> >I think that ASCII or EBCDIC needs to be indicated somehow, as both
> >can exist in this environment. (ASCII may be an unusual case, but as
> >you've seen, some folks care about it ;)
>
> OK. But the ASCII variant can just be “platonic mksh”, right?

Yes, I think an "ASCII unless indicated otherwise" approach is sensible.

> Indeed. I mentioned zmksh/zlksh earlier, as an aside comment. Are
> there any other EBCDIC environments that may eventually become
> relevant as mksh target platforms? If not, we could go that way for
> shortness. “@(#) Z/OS MKSH R…” (and WIN32) then, maybe.

There are definitely other EBCDIC platforms, but will they become
relevant? That all depends on whether there's some random schmuck
messing around on those systems who takes a liking to your project :)

I'm not sure about "Z/OS MKSH", however, if the -qascii build would have
"MIRBSD MKSH". Both are z/OS, after all, and the only thing
significantly different about the EBCDIC build is, well, EBCDIC.

> Compare (especially the “finish” functions of) Perl…

I would if I'd gotten enough sleep -.-

(Couldn't get uhr to work with R50 on my Debian system, however... lots
of "no coprocess" errors...)


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.
