Daniel Richard G. dixit:

>> >> - what about \u20AC? UTF-8? UTF-EBCDIC?
>> >
>> >Many code pages have received euro-sign updates; e.g. EBCDIC 924 is
>>
>> I wasn’t actually asking about Euro support here, but deeper…
>
>I'm not sure I understand what you're getting at... U+20AC is the
>Euro sign...

Yes, but I was using that only as an example.
Use U+4DC0 HEXAGRAM FOR THE CREATIVE HEAVEN (䷀) then ☺

But we already established that we ignore Unicode here;
it reminds me somewhat of the Win16 codepage scheme.

>I'm working from a system with a UTF-8 locale, but as I'm US-based,
>pretty much everything is ASCII. The conversion layer, however,

OK, I can see that. Though I’m using lots of UTF-8 stuff even
when writing English… they call me Mr. WTF-8 sometimes ☻

>explicitly uses ISO 8859-1 on the client side. If I send actual UTF-8,
>that would probably get interpreted as so much Latin-1.

OK. I can work with that assumption. Thanks.

Out of curiosity: what do the various conversion layers
(NFS, extended attributes, etc.) do with the nōn-ASCII parts
of the mksh source? Do you get “Â©” for the copyright sign
(i.e. the UTF-8 octets interpreted as latin1) too?

http://en.wikipedia.org/wiki/EBCDIC_1047#Code_page_translation
looks useful. That maps 1:1 to Unicode, of course, and we can
even do cp924. This may even make the utf_{wc,mb}to{mb,wc} code
into a simple table lookup. Well one of them anyway. But see
below under “conversion routines”.
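
A very rough sketch of the lookup idea (entries abridged, C99
designated initialisers just for clarity; the real table would be
generated from the published cp1047 ↔ Unicode mapping):

#include <stdint.h>

/* sketch only: one EBCDIC 1047 octet to its Unicode codepoint;
 * only a few illustrative entries are filled in here */
static const uint16_t ebcdic1047_to_ucs[256] = {
	[0x40] = 0x0020,	/* space */
	[0x4B] = 0x002E,	/* full stop */
	[0x5B] = 0x0024,	/* dollar sign */
	[0xC1] = 0x0041,	/* LATIN CAPITAL LETTER A */
	/* ... the remaining entries elided ... */
};

static uint16_t
ebcdic_to_wc(unsigned char c)
{
	return (ebcdic1047_to_ucs[c]);
}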

>Of course, I see no reason why mksh couldn't use this Unicode support,
>as long as it continues talking ASCII/EBCDIC with the terminal.

Only with a translation layer (hah).

Currently, Unicode support means parsing UTF-8 input instead
of ASCII input, so when an octet with the high bit (bit 7) set
arrives, it waits for the next one if it is in range (or maps
it into U+EF80‥U+EFFF if invalid).
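
In C, the idea looks roughly like this (a much simplified sketch,
not the actual utf_mbtowc code; CESU-8 handling and the
overlong/surrogate checks are elided):

#include <stddef.h>
#include <stdint.h>

/* sketch: decode one UTF-8 character from buf (len octets available);
 * any octet that does not start a valid sequence is kept as a raw
 * octet, mapped into the U+EF80..U+EFFF range, instead of being lost */
static size_t
sketch_mbtowc(uint16_t *dst, const unsigned char *buf, size_t len)
{
	unsigned int c = buf[0];

	if (c < 0x80) {
		/* plain ASCII */
		*dst = c;
		return (1);
	}
	if (c >= 0xC2 && c <= 0xDF && len >= 2 &&
	    (buf[1] & 0xC0) == 0x80) {
		/* two-octet sequence */
		*dst = ((c & 0x1F) << 6) | (buf[1] & 0x3F);
		return (2);
	}
	if (c >= 0xE0 && c <= 0xEF && len >= 3 &&
	    (buf[1] & 0xC0) == 0x80 && (buf[2] & 0xC0) == 0x80) {
		/* three-octet sequence */
		*dst = ((c & 0x0F) << 12) | ((buf[1] & 0x3F) << 6) |
		    (buf[2] & 0x3F);
		return (3);
	}
	/* invalid or incomplete: map the raw octet to U+EF80..U+EFFF */
	*dst = 0xEF00 | c;
	return (1);
}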

>> This would mean completely removing utf8-mode from the shell. That’s a
>> more deep incision than I originally thought would be required.
>
>Removing it? I thought off-by-default would be enough...

It may turn out to be enough. I think it depends on the
conversion layer. We’ll see. We can experiment a lot,
after all. I’d prefer to keep the delta low, too.

>> z/Linux is “something like Debian/s390 and Debian/s390x”, then?
>> (In that case: mksh works perfectly well there.)
>
>Yes, exactly; z/Linux is just how I've heard it referred to in my
>company. That environment is pretty trivial to port to, as It's Just
>Linux(tm) with slightly different sysdeps.

Ah okay.

>Even if printf is unportable, the test need only succeed on EBCDIC
>platforms. Instead of checking for 'O' vs. '|', check for '|' vs.
>anything else (including error).

Hm.

>You won't get anywhere with tr(1) in EBCDIC-land, I'm afraid:
>
>    $ echo hijk | tr a-z A-Z
>    HIJK

I was thinking of this:

$ echo '+' | tr '(-*' '*-,'
+

This should give a ‘)’ in EBCDIC, right? (In ASCII, ‘+’ lies
outside the ‘(’‥‘*’ range, so it passes through unchanged; in
EBCDIC 1047, ‘+’ is 0x4E, right after ‘(’ at 0x4D, so it ought
to map to the octet right after ‘*’ at 0x5C, which is ‘)’.)

>I would just have a small platform note in the documentation that calls
>the user's attention to xlc's -qascii and -qconvlit options, with a
>brief discussion of the ASCII vs. EBCDIC issues, and then let them
>decide how to deal with it.

OK. Maybe we can use an additional Build.sh option to control that,
actually.

>Pretty sure none of those are available :(  They're certainly not in
>the headers.

OK.

>> You could experiment things at runtime. Just kill(2) something
>> with all numbers, see if high numbers give different errors,
>> maybe the OS says “signal number too high”, then we get a clue.
>
>    $ kill -120 83953851

I was thinking kill(2) not kill(1), but…

>    $ kill -40 83953851
>    kill: FSUM7327 signal number 40 not conventional
>    kill: 83953851: EDC5121I Invalid argument.
>    $ kill -39 83953851
>    kill: FSUM7327 signal number 39 not conventional

… then set NSIG to 40 (or SIGMAX to 39). Can you also send me
a list of all SIG* defines on the system, so that Build.sh can
pick them up?
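
Something like this throwaway probe (nothing mksh-specific; 99999
is merely a hopefully-nonexistent PID) is what I had in mind:

#include <sys/types.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* probe which signal numbers the kernel accepts: kill(2) a PID that
 * hopefully does not exist, then look at the errno; EINVAL means the
 * signal number itself was rejected, ESRCH means it was accepted but
 * the process was not found */
int
main(void)
{
	int sig;

	for (sig = 1; sig <= 128; ++sig)
		if (kill((pid_t)99999, sig) == -1)
			printf("%3d: %s\n", sig, strerror(errno));
		else
			printf("%3d: delivered (the PID existed?!)\n", sig);
	return (0);
}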

>Yes, rlimits.gen is lacking the continuation backslashes from
>rlimits.opt. Guess those are getting dropped somewhere.

Ah. That is definitely a host shell bug; read without -r is
supposed to drop the backslash *and* the following newline,
i.e. a “foo\” line followed by a “bar” line must come back
from a single read as “foobar”, not as two lines.

>Once I flattened each of those definitions into a single line, the build
>proceeds and completes without error, and the test suite...
>
>    Total failed: 0
>    Total passed: 498

Wow.

>I wouldn't encourage a host-side C tool here, as that was partly what
>made a GNU Bash build unmanageable on this system...

It’s inevitable, though. But I don’t think it will make anything
unmanageable. It’s mostly still Build.sh checking for things,
then building something and running it, which will generate
a bunch of files, and then it compiles the shell itself.

>> I hope to be able to make the entire of edit.c, plus good parts of
>> lex.c and syn.c and some parts of tree.c use 16-bit Unicode
>> internally.
>
>I'm presuming this would be wchar_t and its related functions?

Absolutely not! Those are extremely unportable.

It uses uint16_t and the utf_* functions from expr.c, which
are already there.

>Is the idea along the lines of filtering everything through iconv(3),
>going from UCS-2/UTF-16 internally to whatever encoding the scripts and
>terminal are in? So the code only deals with Unicode and translates
>appropriately to the outside world?

Mostly, yes. It’s inspired by some tricky things in edit.c,
which will probably be the first part to make use of it, plus the
idea of having character classes available (e.g. [[:alnum:]]).

>Well, at least MacRelix aims to be a Unix/POSIX environment, so Unix
>line endings make sense there.

Indeed.

>I wasn't aware of this project until now. Good on them!

I’d think lewellyn will be happy to hear that. But, yes.
I think it’s brave.

>I'm not sure I understand the point you're making here :>  But you're
>not saying that you'd need to resort to assembly for an efficient hash-
>table implementation... right?

I’m saying that, even if I might, it’s reasonable to write
one that performs well enough for a shell without doing that.

Not the hashtable implementation itself, actually, but the security
measures for it (we’ll XOR the hash with a per-table random
32-bit value, then rotate it by another per-table random
5-bit value, to avoid CVE-2011-4815-style attacks).
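
In C, that would look roughly like this (a sketch with invented
identifiers, not the actual mksh code):

#include <stdint.h>

/* sketch of the per-table perturbation: after computing the normal
 * hash of the key, XOR it with a random word and rotate it by a
 * random amount, both chosen once per table, so colliding keys
 * cannot be precomputed by an attacker */
struct table_sketch {
	uint32_t perturb_xor;		/* random 32-bit value */
	unsigned int perturb_rot;	/* random 5-bit value (0..31) */
	/* ... buckets etc. elided ... */
};

static uint32_t
perturb(const struct table_sketch *tp, uint32_t h)
{
	unsigned int r = tp->perturb_rot & 31;

	h ^= tp->perturb_xor;
	/* rotate left by r bits, without UB for r == 0 */
	return ((h << r) | (h >> ((32 - r) & 31)));
}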

And hashtable use is going to increase, especially once we
have multi-dimensional and/or associative arrays in
the shell.

>I didn't think the #ifdefs got out of hand... was there any place in
>particular where you saw this to be the case?

No, I’ve not yet looked at it in detail. I currently also
think they can stay manageable. But I don’t want to promise
something now if it turns out I can’t keep it,
especially as I’m pretty ignorant wrt. EBCDIC workings.

>> Yes and no. Michael managed to hide much in a library.
>
>Yet I presume you would not want to integrate such a library into the
>main source tree (e.g. under a win32/ directory), irrespective of the
>CR+LF/fork() issue, as you wouldn't/couldn't want to maintain such a
>library yourself...?

That, and the library is used for much more than just mksh,
is developed as such, and contains much more than we need.
It’s out of scope.

>> mksh has never supported latin-1 (or any 8-bit codepage/SBCS, or DBCS)
>> environments, period.
>>
>> mksh is always: ASCII, possibly UTF-8/CESU-8, but 8-bit transparent.
>
>I thought a number of older Unix environments still used ISO 8859
>encodings, as Linux once did. You're saying, it doesn't work at all, or
>there's just no first-class support for it? Especially if it's 8-bit
>transparent, that sounds like high-bit characters would at least pass
>through safely.

They pass through safely, and backspace will remove one octet from
the input line. That’s about the extent of it. They will, for example,
never match [[:print:]]; that’s either ASCII or Unicode. (We can make
this work magically for EBCDIC by treating it as Unicode internally.
Possibly.)

[ conversion routines ]
>You would continue to provide these routines as a fallback, however,
>right? For systems that don't have them?

Currently, *all* systems, even those that have them, use the
mksh-supplied routines, because either there are none at the
OS level or they suck; besides, none of them support the
required 8-bit transparency in UTF-8 mode. (Except MirBSD’s,
of course, as that’s what it’s designed after.)

The idea here is to make an exception for this and use the
OS-provided conversion functions for EBCDIC, if they are
widespread enough and handle the possible cases
of several EBCDIC variants well. If not, we’ll use the
lookup tables I mentioned above.

>> This is, however, strengthening my (tentative) resolution to make this
>> into a separate product. This removes certain promises the shell
>> offers to scripts that they can rely on, and a lot of functionality.
>
>You would want to guarantee that e.g. "printf '\101'" produces
>(ASCII) 'A'?

There is no printf in mksh, but, yes, that “print '\0101'” produces
ASCII ‘A’ is an existing guarantee I intend to
keep for “the product mksh”.

We can have, say, zmksh (and zlksh), for which this does not hold.

>I do understand the desire to draw a boundary, Venn-diagram-like, inside
>of which you have "standard" mksh, with LF line endings, O_BINARY, and
>the API and runtime environment promised to scripts and the user.
>Outside of that would live "nonstandard mksh," with variations that may
>better suit a given platform but stray from the mksh Platonic ideal.

Now you’re getting graphic and philosophical, but this is about it, yes.

>However, I'm not sure I see tha value of making the official source tree
>align exactly with the boundaries of "standard mksh." If you have a one-
>line change that doesn't fit the standard, you'll be making a lot more
>work for yourself by keeping that as a separate patch, be it in the way

Agreed.

>of testing, merging, updating/syncing, or even just letting users know
>it exists. But if it is integrated as a compile-time option with strong
>caveats about the non-standardness, and an appropriately different
>KSH_VERSION, then that maintains the distinction and puts scripts/users
>on notice about the different promises that are being made.

Right. I’ll try to make things fit into that scheme if at all
possible, then.

>I had in mind compile-time detection of EBCDIC (just depending on
>what CFLAGS are set)

Hm. I’d prefer to amend the CFLAGS inside Build.sh with the proper
flags for OS/390 depending on which EBCDIC variant is used, so we
can also make that into a CPPFLAGS entry which mksh can use, or
something like that. You said you’d want to support at least two
codepages, which differ in e.g. ‘[]’. – Or can you switch codepages
at runtime? In that case, it’d become even trickier…
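
Purely for illustration (neither the Build.sh option nor the macro
exists yet, the names are made up; only the octet values come from
the published codepage tables):

/* hypothetical: Build.sh puts -DMKSH_EBCDIC_CODEPAGE=1047 (or =37)
 * into CPPFLAGS, and mksh picks the right octets for the characters
 * that differ between the variants */
#if MKSH_EBCDIC_CODEPAGE == 1047
#define EBCDIC_LBRACKET	0xAD	/* '[' in cp1047 */
#define EBCDIC_RBRACKET	0xBD	/* ']' in cp1047 */
#elif MKSH_EBCDIC_CODEPAGE == 37
#define EBCDIC_LBRACKET	0xBA	/* '[' in cp037 */
#define EBCDIC_RBRACKET	0xBB	/* ']' in cp037 */
#endif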

>> The bikeshed question is, what to name it? mksh/EBCDIC? mksh/zOS?
>> mksh/OS390? Or what? What should its KSH_VERSION look like⁴, and do
>> you want the mksh/lksh distinction⁵ too?
>
>I think that ASCII or EBCDIC needs to be indicated somehow, as both can
>exist in this environment. (ASCII may be an unusual case, but as you've
>seen, some folks care about it ;)

OK. But the ASCII variant can just be “platonic mksh”, right?

>"OS390" has the advantage of matching "uname" output, and the Perl port
>identifies itself as this, but "zOS" reflects the current name of the
>OS. Good arguments both ways.

OK.

>I can't say what KSH_VERSION should look like, but at least I can help
>you make an informed judgment.

Indeed. I mentioned zmksh/zlksh earlier, as an aside.
Are there any other EBCDIC environments that may eventually
become relevant as mksh target platforms? If not, we could
go that way for brevity. “@(#) Z/OS MKSH R…” (and WIN32)
then, maybe.

>No reason I see not to support both mksh and lksh builds. (All I really
>know about the latter is Debian's recommendation to use it instead of
>mksh when replacing /bin/sh, so I was planning on doing as much.)

Right. (lksh is about using “long” instead of “int32_t” as the
arithmetic base type, as POSIX mandates. This also disables
the part where mksh emulates operations on those by operating
on “uint32_t” with post-processing to fake a signed calculation,
which avoids the compiler optimising code into brokenness, as
ISO C permits it to do when signed arithmetic overflows.)
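
For illustration, a sketch of that emulation trick (not the actual
mksh code):

#include <stdint.h>

/* sketch: signed overflow is undefined in ISO C, so the addition is
 * done on uint32_t (which wraps), and the result is then mapped back
 * to the value a two's-complement int32_t would hold, without ever
 * overflowing a signed type on the way */
static int32_t
add32_safe(int32_t a, int32_t b)
{
	uint32_t r = (uint32_t)a + (uint32_t)b;

	if (r <= (uint32_t)INT32_MAX)
		return ((int32_t)r);
	/* fold the upper half back into the negative range */
	return (-(int32_t)~r - 1);
}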

>> ④ btw, is dot.mkshrc usable in your environment… once we get the bugs
>>   out, that is?
>
>Wow, that's a _big_ startup file. I do see \033 hard-coded in a couple
>of places... couldn't you take advantage of mksh's interpretation of \e
>or \E there?

Will do. The file is full of hysteric raisins; R51 brings a new
feature I’d want to use anyway.

>> ⑤ mostly, lksh uses POSIX arithmetic, whereas mksh use safe arithmetic
>>   (guaranteed 32-bit, with wraparound, and the signed arithmetics in
>>   shell are actually emulated using uint32_t in C code, plus it has
>>   guarantees for e.g. shift right, mostly like the 80386 works, and it
>>   can rotate left/right)
>
>Hah, bit-twiddling in shell... that's a use case I wouldn't have thought
>of :]

I implemented hash functions in Pure mksh™, as well as arc4random.
For this, we can even do unsigned arithmetics. It’s surprisingly
straightforward to use; Perl, for example, requires converting
values to 64-bit signed integers (possibly emulated, taking two
32-bit registers each), doing a 64-bit signed operation on them,
then masking the result with & 0xFFFFFFFF. mksh has only one
size, but both signedness flavours (though the signedness of the
arithmetic operations is per “let” command, not based on the type;
variable storage is ignorant of the signedness).

Compare (especially the “finish” functions of) Perl…

sub NZATUpdate($$) {
        my ($h, $s) = @_;

        foreach my $c (unpack("C*", $s)) {
                $h = ($h + $c + 1) * 1025;
                $h %= 2**32;
                $h ^= $h >> 6;
        }
        return ($h);
}
sub NZAATFinish($) {
        my $h = shift;

        $h += $h << 10;
        $h %= 2**32;
        $h ^= $h >> 6;
        $h %= 2**32;
        $h += $h << 3;
        $h %= 2**32;
        $h ^= $h >> 11;
        $h %= 2**32;
        $h += $h << 15;
        $h %= 2**32;

        return ($h);
}

… with Pure mksh™:

typeset -Z11 -Uui16 Lnzathash_v
function Lnzathash_add {
        [[ -o utf8-mode ]]; local u=$?
        set +U  # disable UTF-8 mode, we want 8-bit mode
        local s
        # slurp all input as array of uint8_t values into s
        if (( $# )); then
                read -raN-1 s <<<"$*"
                unset s[${#s[*]}-1]
        else
                read -raN-1 s
        fi
        local -i i=0 n=${#s[*]}

        while (( i < n )); do
                ((# Lnzathash_v = (Lnzathash_v + s[i++] + 1) * 1025 ))
                ((# Lnzathash_v ^= Lnzathash_v >> 6 ))
        done

        (( u )) || set -U       # restore UTF-8 mode if it was set
}
function Lnzaathash_end {
        ((# Lnzathash_v *= 1025 ))
        ((# Lnzathash_v ^= Lnzathash_v >> 6 ))
        ((# Lnzathash_v += Lnzathash_v << 3 ))
        ((# Lnzathash_v = (Lnzathash_v ^
            (Lnzathash_v >> 11)) * 32769 ))
        print ${Lnzathash_v#16#}
}

The leading ‘#’ makes the evaluation use unsigned arithmetics.

But now we drift off. Again. :)

bye,
//mirabilos

PS: Re. the signature: “uhr” is a script displaying an analog
    clock on the terminal, using ANSI escapes for positioning
    and certain UTF-8 chars for drawing (especially making use
    of the ▄▀█ chars to double the effective height, making the
    screen almost square), in mksh and bc. Best viewed with
    uxterm and the 9x18 font (“Large” in the Ctrl-RightClick
    menu), but it also works e.g. on Android (if bc is added, and
    some of the chars like ◙, which the Android terminal font
    doesn’t have, are replaced in the m2c array in the script; mksh
    has been Android’s system shell for quite a while).
-- 
„Cool, /usr/share/doc/mksh/examples/uhr.gz ist ja ein Grund,
mksh auf jedem System zu installieren.“
        -- XTaran auf der OpenRheinRuhr, ganz begeistert
(EN: “Cool, […]uhr.gz is a reason to install mksh on every system.”
 -- XTaran at OpenRheinRuhr, quite enthusiastic)
