Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-24 Thread Thorsten Glaser
Daniel Richard G. dixit:

>I'd like to submit some changes that add support for IBM z/OS mainframe
>systems (specifically, for mksh running in the OMVS Unix environment),
>including compatibility with EBCDIC.

Woah!

I’m amazed, flattered, impressed, puzzled, wondering, etc. at the
same time. Please give me a bit to come up with a suitable reply,
this deserves some time to think about, and I love your enthusiasm.

>My ASCII-art .sig got a bad case of Times New Roman.

My condolence! Rest assured I’m using FixedMisc [MirOS]¹ here.

① https://www.mirbsd.org/MirOS/dist/mir/Foundry/FixedMisc-20130517.tgz

bye,
//mirabilos
-- 
FWIW, I'm quite impressed with mksh interactively. I thought it was much
*much* more bare bones. But it turns out it beats the living hell out of
ksh93 in that respect. I'd even consider it for my daily use if I hadn't
wasted half my life on my zsh setup. :-) -- Frank Terbeck in #!/bin/mksh


Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-24 Thread Daniel Richard G.
On Fri, 2015 Apr 24 14:29+, Thorsten Glaser wrote:
> 
> Woah!
> 
> I’m amazed, flattered, impressed, puzzled, wondering, etc. at the same
> time. Please give me a bit to come up with a suitable reply, this
> deserves some time to think about, and I love your enthusiasm.

Oh, you're very kind!

mksh already has a remarkable track record of portability, and this
would be yet another feather in its cap. The build system, though
unconventional, turned out to be a lot easier to work with in the EBCDIC
environment than GNU Bash's.

By the way, there's one addendum I'd like to put here: It turns out that
NSIG=32 isn't quite right for z/OS. The system has a few more signals
than that, some of which appear to be unique:

$ kill -l
 NULL HUP INT ABRT ILL POLL URG STOP FPE KILL BUS SEGV SYS PIPE ALRM
TERM USR1 USR2 ABND CONT CHLD TTIN TTOU IO QUIT TSTP TRAP IOERR
WINCH XCPU XFSZ VTALRM PROF DANGER TRACE DCE DUMP

$ kill -l | tr ' ' '\n' | grep . | wc -l
 37 

$ grep SIGDUMP /usr/include/signal.h
  #define SIGDUMP  39

(SIGDANGER, Will Robinson!)

Do take the time you need to chew through all those changes, of course.
I'll be happy to pick things up again at your convenience.

> >My ASCII-art .sig got a bad case of Times New Roman.
>
> My condolence! Rest assured I’m using FixedMisc [MirOS]¹ here.
>
> ① https://www.mirbsd.org/MirOS/dist/mir/Foundry/FixedMisc-20130517.tgz

Nice! If it weren't for the big Web-mail providers seeing fit to display
.signatures in variable-width fonts, there would still be a little ASCII
skunk down below ^_^


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-26 Thread Thorsten Glaser
Hi again,

a few questions back, and a few early comments on the general
idea (I’ve not yet had the time to look at the patch itself):

Assume we have mksh running on your EBCDIC environment. Let me
ask a few questions about this sort of environment, coupled with
my guesses about it.

- the scripts themselves are 'iconv'd to EBCDIC?

- stuff like print/printf \x4F is expected to output '|' not 'O'

- what about \u20AC? UTF-8? UTF-EBCDIC?

- keyboard input is in EBCDIC?

- is there anything that allows Unicode input?


Daniel Richard G. dixit:

>conditionalized in the code. Primarily, EBCDIC has the normal
>[0-9A-Za-z] characters beyond 0x80, so it is not possible to set the high
>bit for signalling purposes---which mksh seems to do a lot of.

Indeed. You probably refer to the variable substitution stuff
(where '#'|0x80 is used for ${foo##bar}) and possibly the MAGIC
stuff in extglobs @(foo|bar).

That’s all legacy. I think it can go unconditionally.

>* Added clauses for TARGET_OS == "OS/390"

Is OS/390 always an EBCDIC environment?

>* '\012\015' != '\n\r' on this platform, so use the latter

Agreed. I think I can change most of the C code to use the
char versions, i.e. '\n' and '@' instead of 0x0A or 0x40.
I will have to see about Build.sh though.

Looking at the patch (editing this eMail later), it seems
you have conditionalised those things nicely. That is good!

*BUT* some things in Build.sh – at least the $lfcr thing –
are dependent on the *host* OS, not the *target* OS.

So, as far as I see, we will require two checks:

• host OS (machine running Build.sh): ASCII-based or EBCDIC?

• target OS (machine running the mksh/lksh binary created):
  z/OS ASCII, z/OS EBCDIC, or anything else?

Remember mksh is cross-buildable. So we’ll need to come up
with (compile-time) checks for all those.

>* When compiling with -g, xlc produces a .dbg file alongside each object
>  file, so clean those up

Good.

>* NSIG is, amazingly, not #defined on this platform. Sure would be nice
>  if the fancy logic that calculates NSIG could conditionally #define
>  it, rather than a TARGET_OS conditional... :-)

No, a TARGET_OS conditional is probably good here, as you cannot
really guess NSIG – as you noticed, you seem to have 37 signals,
the highest number of which is 39.

The best way to determine NSIG is to look at libc sources, followed
by looking at libc binaries (e.g. determine the size of sys_siglist).

>* Check whether "nroff -c" is supported---the system I'm using has GNU
>  nroff 1.17, which doesn't have -c

Ah, I see. Then yes, this does make sense in the GNU case.
(MirBSD has AT&T nroff.)

>* On this platform, xlc -qflag=... takes only one suboption, not two

Hm. Is this a platform thing, a compiler version thing, etc?

>* Some special flags are needed for xlc on z/OS that are not needed on
>  AIX, like to make missing #include files an error instead of a warning
>  (!). Conversely, most of those AIX xlc flags are not recognized

Can we get this sorted out so it continues working on AIX?
I do not have access to an AIX machine any longer, unfortunately.

(Later: conditionalised, looks good.)

>* Added a note that EBCDIC has \047 as the escape character
>  rather than \033

Do EBCDIC systems use ANSI escapes (like ESC [ 0 m) still?

>+++ check.pl
>
>* I was getting a parse error with an expected-exit value of "e != 0",
>  and adding \d to this regex fixed things... this wasn't breaking for
>  other folks?

No. I don’t pretend to know enough Perl to even understand that.

But I think I see the problem, looking at the two regexps. I think
that “+-=” is expanded as a range, which includes digits in ASCII.
I’ll have to go through CVS and .tgz history and see whether this
was intentional or an accidental fuckup.

On that note, does anyone have / know of a complete set of pdsh
and pdksh history?

>+++ check.t
>
>* The "cd-pe" test fails on this system (perhaps it should be disabled?)
>  and the directories were not getting cleaned up properly

That fails on many systems. Sure, we can disable it.

What is $^O in Perl on your platform?

>* If compiling in ASCII mode, #define _ENHANCED_ASCII_EXT so that as
>  many C/system calls are switched to ASCII as possible (this is
>  something I was experimenting with, but it's not how most people would
>  be building/using mksh on this system)

So it’s possible to use ASCII on the system, but atypical?

>* Define symbols for some common character/string escape literals so we
>  can swap them out easily

OK.

>* Because EBCDIC characters like 'A' will have a negative value if
>  signed chars are being used, #define the ORD() macro so we can always
>  get an integer value in [0, 255]

Huh, Pascal anyone? :)

>+++ edit.c (back to patch order)

Here’s where we start going into Unicode land. This file is the one
that assumes UTF-8 the most.

>* I don't understand exactly what is_mfs() is used for, but I'm pretty
>  sure we can't do the & 0x80 with EBCDIC (note that e.g. 'A' == 0xC1)

Motion s

Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-27 Thread Daniel Richard G.
On Sun, 2015 Apr 26 14:47+, Thorsten Glaser wrote:
> Hi again,
> 
> a few questions back, and a few early comments on the general idea
> (I’ve not yet had the time to look at the patch itself):

Seems you did shortly after you wrote that :)

> Assume we have mksh running on your EBCDIC environment. Let me ask a
> few questions about this sort of environment, coupled with my guesses
> about it.
>
> - the scripts themselves are 'iconv'd to EBCDIC?

That is one way, but z/OS provides multiple layers of conversion that
make the process easier:

1. You can mark a particular NFS mount as "text", so that every file
   read through that mount is converted on the fly from ASCII to EBCDIC.
   (There are also "binary mounts" that do no conversion, and reading an
   ASCII file through them gives you gibberish.) Of course, reading any
   kind of binary data through a "text" mount will not go well.

   This is how I worked for the most part, with the mksh sources being
   read through such a "text"-mode NFS mount. Especially as the only
   text editor available in this z/OS installation appears to be vi :<

2. For files that are on the local filesystem, you can assign them an
   extended filesystem attribute marking them as either binary or text,
   and if text, you can specify the code page (be it ASCII or EBCDIC).
   So if the file is properly "tagged," and auto-conversion is enabled,
   then an EBCDIC application can read an ASCII file and have it work
   transparently.

   The tagging utility is called "chtag", and the auto-conversion
   parameter is "AUTOCVT", for Googling purposes.

3. I did notice that after the mksh build completed, the following files
   were tagged as EBCDIC text:

 t IBM-1047  T=on  Rebuild.sh
 t IBM-1047  T=on  conftest.c
 t IBM-1047  T=on  rlimits.gen
 t IBM-1047  T=on  sh_flags.gen
 t IBM-1047  T=on  signames.inc
 t IBM-1047  T=on  test.sh

   So there's also an element of auto-tagging, even though it
   shouldn't make a difference here conversion-wise as the files are
   already in EBCDIC.

To return to your question, while conversion with iconv(1) is available,
you can see it's far from the most convenient approach.

> - stuff like print/printf \x4F is expected to output '|' not 'O'

Yep! Just tried it in the shell:

$ printf '\x4F\n'
|

> - what about \u20AC? UTF-8? UTF-EBCDIC?

Many code pages have received euro-sign updates; e.g. EBCDIC 924 is
the euro-ified version of EBCDIC 1047. But that doesn't mean that
anyone actually _uses_ the updated versions. I haven't seen 924 pop
up anywhere.

UTF-8 is known to the system. There is an IBM code-page ID for it
(1208), iconv(1) knows about it, and you can tag files as UTF-8 text.
I don't think that necessarily indicates wider Unicode support,
however, as it would ultimately get converted to EBCDIC 1047 (or
whatever) anyway.

UTF-EBCDIC exists, but you wouldn't know it from the z/OS environment.
No code-page ID [as far as I've found], no mention in "iconv -l". When I
asked the mainframe guys at my company about it, they told me, "you
don't really want to deal with that."

I glanced at the Wikipedia article for UTF-EBCDIC, and can vouch for the
accuracy of this paragraph:

This encoding form is rarely used, even on the EBCDIC-based
mainframes for which it was designed. IBM EBCDIC-based mainframe
operating systems, such as z/OS, usually use UTF-16 for complete
Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM
XML toolkit support UTF-16 on IBM mainframes.

Locale support in z/OS is like it was in Linux over a decade ago: If
you're a U.S. user, use the default code page; if you're a Russian user,
use a Russian code page, and so on... and all code pages are 8 bits.

> - keyboard input is in EBCDIC?

I worked by way of SSH'ing in to the z/OS OMVS Unix environment.
Everything in OMVS is EBCDIC, but of course my SSH client sends and
receives everything in ASCII. There is a network translation layer in
between, apart from the file-content conversion layers previously
mentioned, that makes it all work transparently.

A "real" mainframe connection, however, would be through TN3270, using
the x3270 program or the like. Then the conversion is happening on the
client side. But this is not relevant to mksh, because you don't get the
z/OS Unix environment through TN3270; you get the old-school full-screen
menu-driven interface that mainframe operators deal with.

(You can bring up OMVS via the TN3270 menu screens, but then you get a
horrible IRC-like line-based interface that sidesteps the normal Unix
shell. IMO, still irrelevant to mksh.)

> - is there anything that allows Unicode input?

From the keyboard? I've not seen anything suggesting this is possible.
Even IBM's z/OS Unicode support via UTF-16 is, as far as I can tell, for
use by applications and not by logged-in users.

My understanding of why things like locale/encoding support on the
console/terminal aren't up to snuff on z/OS is that this would only
benefit the crusty mainframe operators, who are comparatively small in
number compared to the user base of the application(s) running on the
system. At the same time, there is z/Linux (Linux on the mainframe), and

Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-27 Thread Thorsten Glaser
Hah!

Hi again.

Your eMail requires at least three passes…

① reading through all of it, taking notes
② this answer message, with a few comments on some things,
  while ignoring some other things altogether
③ another answer message tackling those things, after I
  ponder this some more (it *is* a brave new world you opened!)

The result of #1+#2 follows.


Daniel Richard G. dixit:

>To return to your question, while conversion with iconv(1) is available,
>you can see it's far from the most convenient approach.

OK. Conversion with something like it, then, anyway.

>> - stuff like print/printf \x4F is expected to output '|' not 'O'
>
>Yep! Just tried it in the shell:
>
>$ printf '\x4F\n'
>|

OK. Just what I thought.

>> - what about \u20AC? UTF-8? UTF-EBCDIC?
>
>Many code pages have received euro-sign updates; e.g. EBCDIC 924 is

I wasn’t actually asking about Euro support here, but deeper…

>Locale support in z/OS is like it was in Linux over a decade ago: If
>you're a U.S. user, use the default code page; if you're a Russian user,
>use a Russian code page, and so on... and all code pages are 8 bits.

… *shudder* (and OK @ not using UTF-EBCDIC)…

>> - keyboard input is in EBCDIC?
>
>I worked by way of SSH'ing in to the z/OS OMVS Unix environment.
>Everything in OMVS is EBCDIC, but of course my SSH client sends and
>receives everything in ASCII. There is a network translation layer in
>between, apart from the file-content conversion layers previously
>mentioned, that makes it all work transparently.

… *UGH!* That’s the hard thing.

Actually, does your SSH client send/receive in ASCII, or in latin1
or some other ASCII-based codepage? What does this layer use?

Though, that is almost certainly irrelevant for mksh, I see from #1.

>A "real" mainframe connection, however, would be through TN3270, using
>the x3270 program or the like. Then the conversion is happening on the
>client side. But this is not relevant to mksh, because you don't get the

OK.

>> - is there anything that allows Unicode input?
>
>From the keyboard? I've not seen anything suggesting this is possible.
>Even IBM's z/OS Unicode support via UTF-16 is, as far as I can tell, for
>use by applications and not by logged-in users.

OK.

This would mean completely removing utf8-mode from the shell.
That’s a more deep incision than I originally thought would
be required.

>My understanding of why things like locale/encoding support on the
>console/terminal aren't up to snuff on z/OS is that this would only
>benefit the crusty mainframe operators, who are comparatively small in
>number compared to the user base of the application(s) running on the
>system. At the same time, there is z/Linux (Linux on the mainframe), and

I see.

z/Linux is “something like Debian/s390 and Debian/s390x”, then?
(In that case: mksh works perfectly well there.)

>return ASCII. You end up with an ASCII application, basically, even
>though the source and environment aren't.)

That makes this pretty useless for us… except (see below).

>I, too, take portability seriously :)

Glad to see! ;)

>> • host OS (machine running Build.sh): ASCII-based or EBCDIC?
>
>Perhaps the "printf '\x4F'" thing can be used to detect an EBCDIC build

No, printf is unportable, but maybe echo something | tr a-z A-Z,
which should differ. Though I recall at least one system not
supporting ranges in tr, so this is more like a “check if the
output is expected for tr on EBCDIC that does support ranges,
and everything else is ASCII” thing, I guess.

>> • target OS (machine running the mksh/lksh binary created):
>>   z/OS ASCII, z/OS EBCDIC, or anything else?
>
>There is also the matter of the EBCDIC variant. Of the EBCDIC code
>pages that contain all of ASCII, the characters are generally
>assigned consistently to the same codepoints. But one exception
>occurs between EBCDICs 1047 and 037, which assign '[', ']', and '^'
>differently---characters that are significant to the shell.
>
>(EBCDIC 037 is likely to be the second-most-popular code page after
>1047, and is in fact the x3270 default.)

Yeowch!

>I don't think it's feasible to have a single mksh binary support
>multiple EBCDIC variants, however, so IMO this matter is best left to
>the user's discretion in what CFLAGS they provide (-qconvlit option). As
>long as the code specifies these characters literally instead of
>numerically, everything should fall in line.

… sounds like a maintenance nightmare. But probably doable,
if we enumerate the set of options (to a carefully chosen,
small number).

>The Build.sh code wouldn't be able to suss out the signals any better if
>it knew about these that are unique to z/OS? IBM might add even more
>signals down the line, after all...

I don’t think so, at least NSIG should be precise, especially
if at least one of sys_siglist, sys_signame and strsignal exists.

You could experiment things at runtime. Just kill(2) something
with all numbers, see if high numbers give different errors,
maybe the OS says “signal number too high”, then we get a clue.

Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-28 Thread Daniel Richard G.
On Mon, 2015 Apr 27 22:12+, Thorsten Glaser wrote:
> Hah!
> 
> Hi again.
> 
> Your eMail requires at least three passes…
> 
> ① reading through all of it, taking notes
> ② this answer message, with a few comments on some things, while
>   ignoring some other things altogether
> ③ another answer message tackling those things, after I ponder this
>   some more (it *is* a brave new world you opened!)
> 
> The result of #1+#2 follows.

Ready for it!

> Daniel Richard G. dixit:
>
> >> - what about \u20AC? UTF-8? UTF-EBCDIC?
> >
> >Many code pages have received euro-sign updates; e.g. EBCDIC 924 is
>
> I wasn’t actually asking about Euro support here, but deeper…

I'm not sure I understand what you're getting at... U+20AC is the
Euro sign...

> >I worked by way of SSH'ing in to the z/OS OMVS Unix environment.
> >Everything in OMVS is EBCDIC, but of course my SSH client sends and
> >receives everything in ASCII. There is a network translation layer in
> >between, apart from the file-content conversion layers previously
> >mentioned, that makes it all work transparently.
>
> … *UGH!* That’s the hard thing.

It's either that, or x3270 :]

> Actually, does your SSH client send/receive in ASCII, or in latin1 or
> some other ASCII-based codepage? What does this layer use?

I'm working from a system with a UTF-8 locale, but as I'm US-based,
pretty much everything is ASCII. The conversion layer, however,
explicitly uses ISO 8859-1 on the client side. If I send actual UTF-8,
that would probably get interpreted as so much Latin-1.

> >Even IBM's z/OS Unicode support via UTF-16 is, as far as I can tell,
> >for use by applications and not by logged-in users.
>
> OK.

Of course, I see no reason why mksh couldn't use this Unicode support,
as long as it continues talking ASCII/EBCDIC with the terminal.

> This would mean completely removing utf8-mode from the shell. That’s a
> more deep incision than I originally thought would be required.

Removing it? I thought off-by-default would be enough...

> z/Linux is “something like Debian/s390 and Debian/s390x”, then?
> (In that case: mksh works perfectly well there.)

Yes, exactly; z/Linux is just how I've heard it referred to in my
company. That environment is pretty trivial to port to, as It's Just
Linux(tm) with slightly different sysdeps.

> >Perhaps the "printf '\x4F'" thing can be used to detect an
> >EBCDIC build
>
> No, printf is unportable, but maybe echo something | tr a-z A-Z, which
> should differ. Though I recall at least one system not supporting
> ranges in tr, so this is more like a “check if the output is expected
> for tr on EBCDIC that does support ranges, and everything else is
> ASCI” thing, I guess.

Even if printf is unportable, the test need only succeed on EBCDIC
platforms. Instead of checking for 'O' vs. '|', check for '|' vs.
anything else (including error).

You won't get anywhere with tr(1) in EBCDIC-land, I'm afraid:

$ echo hijk | tr a-z A-Z
HIJK

> >I don't think it's feasible to have a single mksh binary support
> >multiple EBCDIC variants, however, so IMO this matter is best left to
> >the user's discretion in what CFLAGS they provide (-qconvlit option).
> >As long as the code specifies these characters literally instead of
> >numerically, everything should fall in line.
>
> … sounds like a maintenance nightmare. But probably doable, if we
> enumerate the set of options (to a carefully chosen, small number).

I would just have a small platform note in the documentation that calls
the user's attention to xlc's -qascii and -qconvlit options, with a
brief discussion of the ASCII vs. EBCDIC issues, and then let them
decide how to deal with it.

> >The Build.sh code wouldn't be able to suss out the signals any better if
> >it knew about these that are unique to z/OS? IBM might add even more
> >signals down the line, after all...
> 
> I don’t think so, at least NSIG should be precise, especially
> if at least one of sys_siglist, sys_signame and strsignal exists.

Pretty sure none of those are available :(  They're certainly not in
the headers.

> You could experiment things at runtime. Just kill(2) something
> with all numbers, see if high numbers give different errors,
> maybe the OS says “signal number too high”, then we get a clue.

$ kill -120 83953851
kill: FSUM7327 signal number 120 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -64 83953851
kill: FSUM7327 signal number 64 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -63 83953851
kill: FSUM7327 signal number 63 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -62 83953851
kill: FSUM7327 signal number 62 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -40 83953851
kill: FSUM7327 signal number 40 not conventional
kill: 83953851: EDC5121I Invalid argument.
$ kill -39 83953851
kill: FSUM7327 signal number 39 not conventional
$ kill -38 83953

Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-29 Thread Thorsten Glaser
Daniel Richard G. dixit:

>> >> - what about \u20AC? UTF-8? UTF-EBCDIC?
>> >
>> >Many code pages have received euro-sign updates; e.g. EBCDIC 924 is
>>
>> I wasn’t actually asking about Euro support here, but deeper…
>
>I'm not sure I understand what you're getting at... U+20AC is the
>Euro sign...

Yes, but I was using that only as an example.
Use U+4DC0 HEXAGRAM FOR THE CREATIVE HEAVEN (䷀) then ☺

But we already established that we ignore Unicode here;
it reminds me somewhat of the Win16 codepage scheme.

>I'm working from a system with a UTF-8 locale, but as I'm US-based,
>pretty much everything is ASCII. The conversion layer, however,

OK, I can see that. Though I’m using lots of UTF-8 stuff even
when writing English… they call me Mr. WTF-8 sometimes ☻

>explicitly uses ISO 8859-1 on the client side. If I send actual UTF-8,
>that would probably get interpreted as so much Latin-1.

OK. I can work with that assumption. Thanks.

Out of curiosity: what do the various conversion layers
(NFS, extended attributes, etc.) do with the nōn-ASCII parts
of the mksh source? Do you get Â© for the copyright sign
(i.e. interpreted as latin1) too?

http://en.wikipedia.org/wiki/EBCDIC_1047#Code_page_translation
looks useful. That maps 1:1 to Unicode, of course, and we can
even do cp924. This may even make the utf_{wc,mb}to{mb,wc} code
into a simple table lookup. Well one of them anyway. But see
below under “conversion routines”.

>Of course, I see no reason why mksh couldn't use this Unicode support,
>as long as it continues talking ASCII/EBCDIC with the terminal.

Only with a translation layer (hah).

Currently, Unicode support means parsing UTF-8 input instead
of ASCII input, so when a hi-bit7 char arrives, it waits for
the next if in range (or maps it into EF80‥EFFF if invalid).

>> This would mean completely removing utf8-mode from the shell. That’s a
>> more deep incision than I originally thought would be required.
>
>Removing it? I thought off-by-default would be enough...

It may turn out to be enough. I think it depends on the
conversion layer. We’ll see. We can experiment a lot,
after all. I’d prefer to keep the delta low, too.

>> z/Linux is “something like Debian/s390 and Debian/s390x”, then?
>> (In that case: mksh works perfectly well there.)
>
>Yes, exactly; z/Linux is just how I've heard it referred to in my
>company. That environment is pretty trivial to port to, as It's Just
>Linux(tm) with slightly different sysdeps.

Ah okay.

>Even if printf is unportable, the test need only succeed on EBCDIC
>platforms. Instead of checking for 'O' vs. '|', check for '|' vs.
>anything else (including error).

Hm.

>You won't get anywhere with tr(1) in EBCDIC-land, I'm afraid:
>
>$ echo hijk | tr a-z A-Z
>HIJK

I was thinking of this:

$ echo '+' | tr '(-*' '*-,'
+

This should give a ‘)’ in EBCDIC, right?

>I would just have a small platform note in the documentation that calls
>the user's attention to xlc's -qascii and -qconvlit options, with a
>brief discussion of the ASCII vs. EBCDIC issues, and then let them
>decide how to deal with it.

OK. Maybe we can use an additional Build.sh option to control that,
actually.

>Pretty sure none of those are available :(  They're certainly not in
>the headers.

OK.

>> You could experiment things at runtime. Just kill(2) something
>> with all numbers, see if high numbers give different errors,
>> maybe the OS says “signal number too high”, then we get a clue.
>
>$ kill -120 83953851

I was thinking kill(2) not kill(1), but…

>$ kill -40 83953851
>kill: FSUM7327 signal number 40 not conventional
>kill: 83953851: EDC5121I Invalid argument.
>$ kill -39 83953851
>kill: FSUM7327 signal number 39 not conventional

… then set NSIG to 40 (or SIGMAX to 39). Can you also send me
a list of all SIG* defines on the system, so that Build.sh can
pick them up?

>Yes, rlimits.gen is lacking the continuation backslashes from
>rlimits.opt. Guess those are getting dropped somewhere.

Ah. That is definitely a host shell bug; read without -r is
supposed to drop the backslash *and* the following newline.

>Once I flattened each of those definitions into a single line, the build
>proceeds and completes without error, and the test suite...
>
>Total failed: 0
>Total passed: 498

Wow.

>I wouldn't encourage a host-side C tool here, as that was partly what
>made a GNU Bash build unmanageable on this system...

It’s inevitable though. But I don’t think it will make anything
unmanageable. It’s mostly still Build.sh checking for things,
then building something, then running it, which will generate
a bunch of files, then it’d compile the shell itself.

>> I hope to be able to make the entire of edit.c, plus good parts of
>> lex.c and syn.c and some parts of tree.c use 16-bit Unicode
>> internally.
>
>I'm presuming this would be wchar_t and its related functions?

Absolutely no! These are extremely unportable.

It uses uint16_t, and the utf_* functions from expr.c which
ar

Re: [PATCH] IBM z/OS + EBCDIC support

2015-04-30 Thread Daniel Richard G.
On Wed, 2015 Apr 29 15:42+, Thorsten Glaser wrote:
> Daniel Richard G. dixit:
>
> >I'm working from a system with a UTF-8 locale, but as I'm US-based,
> >pretty much everything is ASCII. The conversion layer, however,
>
> OK, I can see that. Though I’m using lots of UTF-8 stuff even when
> writing English… they call me Mr. WTF-8 sometimes ☻

Well, my mail user agent is up to snuff, even if my company's mainframe
system consoles aren't :]

> >explicitly uses ISO 8859-1 on the client side. If I send actual UTF-
> >8, that would probably get interpreted as so much Latin-1.
>
> OK. I can work with that assumption. Thanks.

I've come across some relevant information recently, regarding IBM's
ported version of OpenSSH on z/OS:

OpenSSH assumes that all text data traveling across the network is
encoded in ISO/IEC 8859-1 (Latin-1). Specifically, OpenSSH treats
data as text and performs conversion between the ASCII Latin-1 coded
character set and the EBCDIC-coded character set of the current
locale in the following scenarios:

* ssh login session
* ssh remote command execution
* scp file transfers
* sftp file transfers when the ascii subcommand is specified

The OpenSSH daemon (sshd) can understand and handle non-Latin-1
coded character sets on the network for interactive sessions,
specifically sessions with a tty allocated. However, not all EBCDIC-
coded character sets are compatible with ISO 8859-1. To determine if
a coded character set is compatible with a particular locale, see
the information about locales supplied with z/OS XL C/C++ in z/OS XL
C/C++ Programming Guide.

Warning: If there is no one-to-one mapping between the EBCDIC coded
character set of the session data and ISO 8859-1, then nonidentical
conversions might occur. Specifically, substitution characters (for
example, IBM-1047 0x3F) are inserted into the data stream for those
incompatible characters. See “Configuring the OpenSSH daemon” on p

-- http://www-03.ibm.com/systems/resources/fot4os02.pdf
   (section "OpenSSH and globalization")

It seems like IBM has placed the EBCDIC<->ASCII conversion layer in the
OpenSSH daemon itself, rather than in a system facility  >_<

> Out of curiosity: what do the various conversion layers
> (NFS, extended attributes, etc.) do with the nōn-ASCII parts
> of the mksh source? Do you get Â© for the copyright sign
> (i.e. interpreted as latin1) too?

$ grep Copyright sh.h
 * Copyright © 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
 ^ (proper copyright symbol)

It's possible that z/OS interprets that symbol as two Latin-1
characters, but then when that is sent back to my Linux terminal, it
gets "reassembled" as UTF-8.

Remember my grumbling about too many conversion layers? :>

> http://en.wikipedia.org/wiki/EBCDIC_1047#Code_page_translation looks
> useful. That maps 1:1 to Unicode, of course, and we can even do cp924.
> This may even make the utf_{wc,mb}to{mb,wc} code into a simple table
> lookup. Well one of them anyway. But see below under “conversion
> routines”.

So basically, iconv or a workalike...

> >Removing it [utf8-mode]? I thought off-by-default would be enough...
>
> It may turn out to be enough. I think it depends on the conversion
> layer. We’ll see. We can experiment a lot, after all. I’d prefer to
> keep the delta low, too.

Aye, I was certainly envisioning a lightweight set of changes. At most,
disable setting UTFMODE != 0.

> I was thinking of this:
> 
> $ echo '+' | tr '(-*' '*-,'
> +
> 
> This should give a ‘)’ in EBCDIC, right?

Hate to disappoint...

$ echo '+' | tr '(-*' '*-,'
+

But tr(1) does support octal escapes, so you could do e.g.

$ echo a | tr '\201' X
X

> >I would just have a small platform note in the documentation that
> >calls the user's attention to xlc's -qascii and -qconvlit options,
> >with a brief discussion of the ASCII vs. EBCDIC issues, and then let
> >them decide how to deal with it.
>
> OK. Maybe we can use an additional Build.sh option to control that,
> actually.

Perhaps, though if explicit support in Build.sh can be called
"hand-holding," then z/OS is a platform with few users, who are more
likely than not able to figure out compiler flags themselves anyway.
A Build.sh option just seems like overkill to me.

> I was thinking kill(2) not kill(1), but…
> 
> >$ kill -40 83953851
> >kill: FSUM7327 signal number 40 not conventional
> >kill: 83953851: EDC5121I Invalid argument.
> >$ kill -39 83953851
> >kill: FSUM7327 signal number 39 not conventional
> 
> … then set NSIG to 40 (or SIGMAX to 39). Can you also send me a list
> of all SIG* defines on the system, so that Build.sh can pick them up?

NSIG=40, check. Will send some information via private mail.

> >Yes, rlimits.gen is lacking the continuation backslashes from
> >rlimits.opt. Guess those are getting dropped somewhere.
>
> Ah. That is defini

Re: [PATCH] IBM z/OS + EBCDIC support

2015-05-01 Thread Thorsten Glaser
Daniel Richard G. dixit:

>Hate to disappoint...

oO

>But tr(1) does support octal escapes, so you could do e.g.
>
>$ echo a | tr '\201' X
>X

OK, wonderful, that should work.

[ set the compiler charset option ]
>themselves anyway. A Build.sh option just seems like overkill to me.

Right, but the idea is that *if* we need the selected charset
in mksh anyway, it’s easier and cheaper to do it like that.
If not, then, sure.

>Couldn't you just keep the backslash-newlines in the *.gen files? Are

Several shells’ read -r option is broken, so if I use that,
I get the exact same problem for a different set of ancient
shells ☹

I could write it all on one line in the source tree, but
that’s like giving up. I convert them for the releases
already anywa………but I feel stupid right now.

The files have only one user each, so I can just pull the
macros out of them (I think – without looking). Duh!

[ host C tool ]
>It's possible some chtag(1) tagging might be needed, as encodings could
>potentially get mixed up in certain instances. (This was the case with a
>-qascii Bash build)

Sure, we’ll see about this when it becomes current. I assume
that’s just a line “if OS/390 then chtag” more in Build.sh.

>> >I'm presuming this would be wchar_t and its related functions?
>>
>> Absolutely no! These are extremely unportable.
>
>I suspected as much, but I've never actually had to deal with them, so
>was unsure. I'll keep that point in mind, however.

If you target NetBSD, GNU, musl or Solaris, they can be used.
MirBSD too, though we have only one locale. Anything else, or
even older versions of these systems, will suck. Plus, the local
admin must usually install the locale you want to use.

So, this is not for mksh in the generic case. We *can* use this
in the specific z/OS case, though.

>Oh, okay. I don't think the conditionals should get hairy... and if they
>do, then there is probably a better way of going about it.

Right.

>> We can have, say, zmksh (and zlksh), for which this does not hold.
>
>Is it convention to name the binaries differently for nonstandard
>variants? (E.g. the native Win32 port would also have modified names?)

I wish for it to be convention, so they don’t accidentally get used when
#!/usr/bin/env mksh
is a script’s shebang line. Granted, you can possibly check $KSH_VERSION
but tbh that’s like enabling UTF-8 mode for scripts by default if the
current locale is Unicode: too many scripts (the majority) implicitly
assume LC_ALL=C and don’t set that. So no, I prefer to not put this
burden on the script writers.

>The code page is set at compile time, with the -qconvlit option. From
>the xlc(1) man page:

Thanks.

>believe the only case where this could become an issue is when you have
>mismatched code pages (e.g. EBCDIC 1047 mksh + EBCDIC 037 user), and
>then you pray that as many code points agree as possible. This, IMO,
>falls squarely in the category of "user caveat."

So we assume EBCDIC 1047 mksh + EBCDIC 037 user is allowed to fail,
and we only really have to support the code page used at compilation.

>This situation could change, however, once mksh is doing UTF-16
>internally. Then, because it has to translate everything to and from the
>outside world anyway, I see no reason why it couldn't use a 1047 table
>for user A, and a 037 table for user B. Perhaps even straight UTF-8 for

I think there’s two things speaking out against this:
• the compiler transcodes the strings and chars already anyway,
  and we rely on that too much
• there is a speed and simplicity advantage of having only one charset
Weak reasons, but as this is already very tricky, they should be kept in mind.

>There are definitely other EBCDIC platforms, but will they become
>relevant? That all depends on whether there's some random schmuck
>messing around on those systems who takes a liking to your project :)

;-)

>I'm not sure about "Z/OS MKSH", however, if the -qascii build would have
>"MIRBSD MKSH". Both are z/OS, after all, and the only thing
>significantly different about the EBCDIC build is, well, EBCDIC.

OK. So how about “EBCDIC MKSH” for zmksh, keeping “MIRBSD KSH” for mksh
(historic reasons, I’d use MKSH there nowadays).

>(Couldn't get uhr to work with R50 on my Debian system, however... lots
>of "no coprocess" errors...)

Huh.

tg@tglase-eee:~ $ zcat /usr/share/doc/mksh/examples/uhr.gz | mksh

This works OOTB for me. But you do have to install bc(1) first;
unlike real Unix systems, absolutely-basic-should-be-everywhere
tools like bc, ed, uudecode are not installed by default on GNU.

Now go get some sleep ;-)

bye,
//mirabilos
-- 
“The final straw, to be honest, was probably my amazement at the volume of
petty, peevish whingeing certain of your peers are prone to dish out on
d-devel, telling each other how to talk more like a pretty princess, as though
they were performing some kind of public service.” (someone to me, privately)


Re: [PATCH] IBM z/OS + EBCDIC support

2015-05-04 Thread Daniel Richard G.
On Fri, 2015 May  1 16:26+, Thorsten Glaser wrote:
> 
> [ set the compiler charset option ]
> >themselves anyway. A Build.sh option just seems like overkill to me.
>
> Right, but the idea is that *if* we need the selected charset in
> mksh anyway, it’s easier and cheaper to do it like that. If not,
> then, sure.

I don't think there would be any instance where the code needs to know
"I am building for EBCDIC 1047" vs. "for EBCDIC 037" or whatnot. The
transcoding of char/string literals should be all that's needed.

(I think the compiler throws an error if some literal character cannot
be transcoded, so we would not need to worry about EBCDIC variants that
lack basic ASCII punctuation or lowercase letters.)

> >Couldn't you just keep the backslash-newlines in the *.gen files? Are
>
> Several shells’ read -r option is broken, so if I use that, I get the
> exact same problem for a different set of ancient shells ☹
>
> I could write it all on one line in the source tree, but that’s like
> giving up. I convert them for the releases already anywa………but I
> feel stupid right now.
>
> The files have only one user each, so I can just pull the macros out
> of them (I think – without looking). Duh!

The multi-line macros don't appear to have any build-time variable
parts... I think moving them to normal source files would be a sensible
solution. Avoid the problem altogether!

> [ host C tool ]
> >It's possible some chtag(1) tagging might be needed, as encodings
> >could potentially get mixed up in certain instances. (This was the
> >case with a -qascii Bash build)
>
> Sure, we’ll see about this when it becomes current. I assume that’s
> just a line “if OS/390 then chtag” more in Build.sh.

Exactly.

> >Is it convention to name the binaries differently for nonstandard
> >variants? (E.g. the native Win32 port would also have modified
> >names?)
>
> I wish for it to be convention, so they don’t accidentally get
> used when
>   #!/usr/bin/env mksh
> is a script’s shebang line. Granted, you can possibly check
> $KSH_VERSION but tbh that’s like enabling UTF-8 mode for scripts by
> default if the current locale is Unicode: too many scripts (the
> majority) implicitly assume LC_ALL=C and don’t set that. So no, I
> prefer to not put this burden on the script writers.

I suppose you could leave it up to the user to create e.g. a mksh ->
zmksh symlink.

> >believe the only case where this could become an issue is when you
> >have mismatched code pages (e.g. EBCDIC 1047 mksh + EBCDIC 037 user),
> >and then you pray that as many code points agree as possible. This,
> >IMO, falls squarely in the category of "user caveat."
>
> So we assume EBCDIC 1047 mksh + EBCDIC 037 user is allowed to fail,
> and we only really have to support the code page used at compilation.

Yes, exactly. To do otherwise would be way too much work, for a platform
with too few users (short of the possibilities opened up by internal
Unicode representation).

> >This situation could change, however, once mksh is doing UTF-16
> >internally. Then, because it has to translate everything to and from
> >the outside world anyway, I see no reason why it couldn't use a 1047
> >table for user A, and a 037 table for user B. Perhaps even straight
> >UTF-8 for
>
> I think there’s two things speaking out against this:
> • the compiler transcodes the strings and chars already anyway,  and
> we rely on that too much

If all of mksh's input/output is being filtered via conversion tables
to/from UTF-16, then a straight ASCII build could support EBCDIC. Heck,
you could configure mksh on BSD/Linux to talk EBCDIC if you like! (It
wouldn't be very useful, but it would be a nifty proof of concept. Your
main concern there would be avoiding "ASCII leaks"---instances of ASCII
text being written to the terminal without going through the conversion
routines.)

> • there is a speed and simplicity advantage of having only one
> charset
> Weak reasons, but as this is already very tricky, they should be
> kept in mind.

If you're filtering everything through conversion tables anyway, then
using table A versus table B should have little impact on performance.
As for simplicity, well, I'd say that horse has left the UTF-16 barn :]

> >I'm not sure about "Z/OS MKSH", however, if the -qascii build would
> >have "MIRBSD MKSH". Both are z/OS, after all, and the only thing
> >significantly different about the EBCDIC build is, well, EBCDIC.
>
> OK. So how about “EBCDIC MKSH” for zmksh, keeping “MIRBSD KSH” for
> mksh (historic reasons, I’d use MKSH there nowadays).

I think that sounds right. Maybe call the binary "emksh"? As much as
IBM's marketing uses z/This and z/That, you don't see it a whole lot
inside the actual environment...

> >(Couldn't get uhr to work with R50 on my Debian system, however...
> >lots of "no coprocess" errors...)
> 
> Huh.
> 
> tg@tglase-eee:~ $ zcat /usr/share/doc/mksh/examples/uhr.gz | mksh
> 
> This works OOTB for me. But you do have to install bc(1) first;
>

Re: [PATCH] IBM z/OS + EBCDIC support

2015-05-06 Thread Thorsten Glaser
Daniel Richard G. dixit:

>I don't think there would be any instance where the code needs to know
>"I am building for EBCDIC 1047" vs. "for EBCDIC 037" or whatnot. The
>transcoding of char/string literals should be all that's needed.

Unless we convert EBCDIC to Unicode ourselves (as opposed to letting
the system do it; I’m currently convinced that we really want to do
this actually, since we don’t support them all anyway).

>(I think the compiler throws an error if some literal character cannot
>be transcoded, so we would not need to worry about EBCDIC variants that
>lack basic ASCII punctuation or lowercase letters.)

Good.

>I suppose you could leave it up to the user to create e.g. a mksh ->
>zmksh symlink.

Sure, the local admin rules supreme anyway after all.

>> So we assume EBCDIC 1047 mksh + EBCDIC 037 user is allowed to fail,
>> and we only really have to support the code page used at compilation.
>
>Yes, exactly. To do otherwise would be way too much work, for a platform

OK.

>If all of mksh's input/output is being filtered via conversion tables

It isn’t, it never is. That would just be insane ;-) Only some.

>I think that sounds right. Maybe call the binary "emksh"? As much as

Okay, emksh and EBCDIC [ML]KSH it is.

>Very nice hack! I do prefer analog clocks myself.

;-)

I don’t care as long as the clock also shows the day and month,
and preferably also the day-of-week. I tend to not know them.
Oh, and, i̲f̲ I have a clock, it better go right, so it should
do NTP or DCF77 (sadly, most DCF77 clocks only sync once a day
or month or when you trigger it manually, not constantly, so
they are off most of the time, especially when they are often
in buildings that shield radio signals well).

>> Now go get some sleep ;-)
>
>Still working on it...

Ouch. I got some myself, but not enough…

bye,
//mirabilos
-- 
If Harry Potter gets a splitting headache in his scar
when he’s near Tom Riddle (aka Voldemort),
does Tom get pain in the arse when Harry is near him?
-- me, wondering why it’s not Jerry Potter………


Re: [PATCH] IBM z/OS + EBCDIC support

2015-05-07 Thread Daniel Richard G.
On Wed, 2015 May  6 20:22+, Thorsten Glaser wrote:
> Daniel Richard G. dixit:
>
> Unless we convert EBCDIC to Unicode ourselves (as opposed to letting
> the system do it; I’m currently convinced that we really want to do
> this actually, since we don’t support them all anyway).

If you bundle a set of encoding tables with mksh (whether transformed
into C arrays or loaded as-is at runtime), you can easily support every
variant of EBCDIC that matters. Just look at "iconv -l" on Linux, for
example; it's not like this would be a hard problem.

> >If all of mksh's input/output is being filtered via conversion tables
>
> It isn’t, it never is. That would just be insane ;-) Only some.

Well, filtering everything would sure make some interesting
things possible.

(Maybe it's feasible, if you plan it that way from the start. Use
strerror()+convert instead of perror(), and so on. Food for thought ;)

> >I think that sounds right. Maybe call the binary "emksh"? As much as
>
> Okay, emksh and EBCDIC [ML]KSH it is.

Sounds good!

Down the line, if mksh ever gains features that are particular to z/OS---
like being able to interface with parts of the system that are outside
of the OMVS Unix environment---then this may be something to revisit.
But as long as it's just using the normal POSIX interface,
distinguishing it by the use of EBCDIC is the right way to go, IMO.


So, everything seems in order here. I see you've merged in many of the
changes already. Is there anything more you need from me at this time?
I'll be happy to test a pristine tree on z/OS once all the necessary
tweaks are in.

I did want to pass one thing along for now, amending my original patch.
It turns out that xlc on z/OS does in fact support -qro and -qroconst;
it's only -qroptr that is unsupported. Small oversight on my part.

Also, while this xlc doesn't have -qcheck, it does have -qrtcheck:

 -qrtcheck[=suboptions] | -qnortcheck
Generates compare-and-trap instructions that
perform certain types of runtime checking. The
available options are:

all  Automatically generates compare-and-trap
 instructions for all possible runtime checks.
bounds
 Performs runtime checking of addresses when
 subscripting within an object of known size.
divzero
 Performs runtime checking of integer division.
nullptr
 Performs runtime checking of addresses
 contained in pointer variables used to
 reference storage.

The default is -qnortcheck.

That seems to be in the same spirit, so I threw it in. Your call if
you'd like to use it, of course.


> >Very nice hack! I do prefer analog clocks myself.
>
> ;-)
>
> I don’t care as long as the clock also shows the day and month, and
> preferably also the day-of-week. I tend to not know them. Oh, and,
> i̲f̲ I have a clock, it better go right, so it should do NTP or DCF77
> (sadly, most DCF77 clocks only sync once a day or month or when you
> trigger it manually, not constantly, so they are off most of the
> time, especially when they are often in buildings that shield radio
> signals well).

It's certainly easy enough to get WWVB radio clocks here in the States,
though if you live in the edges of the continent, you'll have a hard
time getting the signal. I like date information too, but good luck
finding that in an analog model bigger than a wristwatch!


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.
Index: Build.sh
===
RCS file: /cvs/src/bin/mksh/Build.sh,v
retrieving revision 1.678
diff -u -r1.678 Build.sh
--- Build.sh	29 Apr 2015 20:44:55 -	1.678
+++ Build.sh	8 May 2015 02:34:44 -
@@ -1493,10 +1513,27 @@
 	ac_flags 1 extansi -Xa
 	;;
 xlc)
-	ac_flags 1 rodata "-qro -qroconst -qroptr"
-	ac_flags 1 rtcheck -qcheck=all
-	#ac_flags 1 rtchkc -qextchk	# reported broken
-	ac_flags 1 wformat "-qformat=all -qformat=nozln"
+	case "$TARGET_OS" in
+	OS/390)
+		# On IBM z/OS, the following are warnings by default
+		# CCN3296: #include file  not found.
+		# CCN3944: Attribute "__foo__" is not supported and is ignored.
+		# CCN3963: The attribute "foo" is not a valid variable
+		#  attribute and is ignored.
+		ac_flags 1 halton "-qhaltonmsg=CCN3296 -qhaltonmsg=CCN3944 -qhaltonmsg=CCN3963"
+		# CCN3290: Unknown macro name FOO on #undef directive.
+		# CCN4108: The use of keyword '__attribute__' is non-portable.
+		ac_flags 1 supprss "-qsuppress=CCN3290 -qsuppress=CCN4108"
+		ac_flags 1 rtcheck -qrtcheck=all
+		;;
+	*)
+		ac_flags 1 roptr "-qroptr"
+		ac_flags 1 rtcheck -qcheck=all
+		#ac_flags 1 rtchkc -qextchk	# reported broken
+		ac_flags 1 wformat "-qformat=all -qformat=nozln"
+		;;
+	esac
+	ac_flags 1 rodata "-qro -qroconst"
 	#ac_flag

Re: [PATCH] IBM z/OS + EBCDIC support

2015-05-08 Thread Thorsten Glaser
Daniel Richard G. dixit:

>So, everything seems in order here. I see you've merged in many of the
>changes already. Is there anything more you need from me at this time?

Not right now. I will revisit this after R51, also to look at
how much more Unicode we will want to use already now, internally,
to better support EBCDIC. I have two changes pending for R51,
but a concert this weekend first, and need to attend rehearsal ☺

>It's certainly easy enough to get WWVB radio clocks here in the States,
>though if you live in the edges of the continent, you'll have a hard
>time getting the signal.

Ah ok. The signal is not so much the problem here but buildings.
You know, stone (and lots thereof) and steel. Walls are easily
(hm, what’s that in inches…), well probably most are not a full
foot, but some can get close.

>I like date information too, but good luck
>finding that in an analog model bigger than a wristwatch!

Right. I’d kinda love to have a pocket watch with that information
and remotely-set accurate time, but not at the prices those watch
enthusiasts demand. It need not be golden ;-)

Oh well.

bye,
//mirabilos
-- 
08:05⎜ mika: Does grml have an tool to read Apple
 ⎜System Log (asl) files? :)
08:08⎜ yeah. /bin/rm. ;)   08:09⎜ hexdump -C
08:31⎜ ft, mrud: *g*


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-20 Thread Thorsten Glaser
Hi,

remember me? ;-)

I’m writing because I was pondering EBCDIC for an upcoming change.


Daniel Richard G. dixit:

>I don't think there would be any instance where the code needs to know
>"I am building for EBCDIC 1047" vs. "for EBCDIC 037" or whatnot. The
>transcoding of char/string literals should be all that's needed.

Unfortunately, this is not true, and this ties in with this part of
an earlier mail in this thread:

>> I was thinking of this:
>>
>> $ echo '+' | tr '(-*' '*-,'
>> +
>>
>> This should give a ‘)’ in EBCDIC, right?
>
>Hate to disappoint...
>
>$ echo '+' | tr '(-*' '*-,'
>+

In the meantime I learnt why: POSIX ranges are always in ASCII order,
so [A-Z] won’t match any nōn-letters even on EBCDIC systems.

For this to work however I require a table mapping all 256 EBCDIC octets
in the compile-time (= run-time) codepage onto their ASCII equivalents,
or -1 if there is no match (in which case they just sort higher than any
char that actually _is_ in ASCII, and probably by ordinal value).

I guess this means we’ll have to include a translation table at compile
time, and possibly multiple in the code (if we want there to be a choice
of codepage). I’m unsure if you actually want to support at least both
1047 and 037 in the future or if restricting yourself to 1047 is good
enough especially considering that OpenSSL application note you posted
and the compiler defaulting to that.


I’ll just keep this in mind but implement that feature, out of necessity
for it to be there rather soon, ASCII-only for now (it’s one that won’t
work in utf8-mode either anyway).

bye,
//mirabilos
-- 
18:47⎜ well channels… you see, I see everything in the
same window anyway  18:48⎜ i know, you have some kind of
telnet with automatic pong 18:48⎜ haha, yes :D
18:49⎜ though that's more tinyirc – sirc is more comfy


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-20 Thread Daniel Richard G.
Hello Thorsten,

On Thu, 2017 Apr 20 19:14+, Thorsten Glaser wrote:
> Hi,
> 
> remember me? ;-)

I sure do!

> In the meantime I learnt why: POSIX ranges are always in ASCII order,
> so [A-Z] won’t match any nōn-letters even on EBCDIC systems.

Interesting! So POSIX assumes ASCII, to a certain extent.

> For this to work however I require a table mapping all 256 EBCDIC
> octets in the compile-time (= run-time) codepage onto their ASCII
> equivalents, or -1 if there is no match (in which case they just sort
> higher than any char that actually _is_ in ASCII, and probably by
> ordinal value).
>
> I guess this means we’ll have to include a translation table at
> compile time, and possibly multiple in the code (if we want there to
> be a choice of codepage). I’m unsure if you actually want to support
> at least both 1047 and 037 in the future or if restricting yourself to
> 1047 is good enough especially considering that OpenSSL application
> note you posted and the compiler defaulting to that.

It wouldn't work to do the EBCDIC->ASCII conversion all at runtime? z/OS
does provide functions for this, and these will adjust to whatever the
current EBCDIC codepage is:

etoa():

https://www.ibm.com/support/knowledgecenter/SSLTBW_2.1.0/com.ibm.zos.v2r1.bpxbd00/r0ceta.htm

etoa_l():

https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.bpxbd00/r0etal.htm

Even if you really do need a table, you could populate it on startup
using these.

> I’ll just keep this in mind but implement that feature, out of
> necessity for it to be there rather soon, ASCII-only for now (it’s one
> that won’t work in utf8-mode either anyway).

I guess multi-byte is trickier... EBCDIC does have the
limitation/advantage of still being one byte per character. (UTF-EBCDIC
is definitely not a thing in IBM mainframe land)

Anyway, if you need any z/OS testing, feel free to drop me a line ;)


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-20 Thread Thorsten Glaser
Hi,

>> remember me? ;-)
>
>I sure do!

;-)

>> In the meantime I learnt why: POSIX ranges are always in ASCII order,
>> so [A-Z] won’t match any nōn-letters even on EBCDIC systems.
>
>Interesting! So POSIX assumes ASCII, to a certain extent.

Yes, it does. I think EBCDIC as charset is actually nonconformant,
but it probably pays off to stay close nevertheless. (This is
actually about the POSIX/'C' locale; other locales can pretty much
do whatever they want.)

>Even if you really do need a table, you could populate it on startup
>using these.

Indeed… but we have the compile-time translated characters all over
the source (I think we agreed earlier that not supporting changing
it at runtime was okay).

>> that won’t work in utf8-mode either anyway).
>
>I guess multi-byte is trickier... EBCDIC does have the

It definitely is, but I’ll reserve changing stuff there for later.

>Anyway, if you need any z/OS testing, feel free to drop me a line ;)

Thanks!

I hope to be able to get back to that offer eventually. Glad to
know you’re still interested after two years.

Goodnight,
//mirabilos
-- 
(gnutls can also be used, but if you are compiling lynx for your own use,
there is no reason to consider using that package)
-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-21 Thread Thorsten Glaser
Daniel Richard G. dixit:

>Anyway, if you need any z/OS testing, feel free to drop me a line ;)

main() { printf("%02X\n", '\n'); return 0; }

Out of curiosity, what does that print on your systems, 15 or 25?
Also, what line endings do the auto-converted source files, such
as dot.mkshrc, have?

Thanks,
//mirabilos
-- 
> Wish I had pine to hand :-( I'll give lynx a try, thanks.

Michael Schmitz on nntp://news.gmane.org/gmane.linux.debian.ports.68k
a.k.a. {news.gmane.org/nntp}#news.gmane.linux.debian.ports.68k in pine


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-22 Thread Daniel Richard G.
Hi Thorsten, apologies for the delay.

On Thu, 2017 Apr 20 21:49+, Thorsten Glaser wrote:
>
> >Interesting! So POSIX assumes ASCII, to a certain extent.
>
> Yes, it does. I think EBCDIC as charset is actually nonconformant, but
> it probably pays off to stay close nevertheless. (This is actually
> about the POSIX/'C' locale; other locales can pretty much do whatever
> they want.)

Ah, okay, C locale; that makes sense. I did imagine POSIX was largely
agnostic about the character set.

> >Even if you really do need a table, you could populate it on startup
> >using these.
>
> Indeed… but we have the compile-time translated characters all over
> the source (I think we agreed earlier that not supporting changing it
> at runtime was okay).

Oh, so you mean like if(c=='[') and such? That is certainly reasonable.
The program would be tied to the compile-time codepage no worse than
most other programs.

(If you could do everything in terms of character literals, without
depending on constructs like if(c>='A'&&c<='Z'), your code would be
pretty much EBCDIC-proof.)

> >Anyway, if you need any z/OS testing, feel free to drop me a line ;)
>
> Thanks!
>
> I hope to be able to get back to that offer eventually. Glad to know
> you’re still interested after two years.

Mainframes are not a platform for the impatient... at least not if one
has to deal with IBM  ^_^


On Fri, 2017 Apr 21 20:20+, Thorsten Glaser wrote:
> Daniel Richard G. dixit:
> 
> >Anyway, if you need any z/OS testing, feel free to drop me a line ;)
>
> main() { printf("%02X\n", '\n'); return 0; }
>
> Out of curiosity, what does that print on your systems, 15 or 25?

$ cat >test.c
main() { printf("%02X\n", '\n'); return 0; }

$ xlc -o test test.c

$ ./test
15

However...

$ cat >test2.c
#pragma convert("ISO8859-1")
int c = '\n';
#pragma convert(pop)
main() { printf("%02X\n", c); return 0; }

$ xlc -o test2 test2.c

$ ./test2
0A

That may or may not be useful. Of course, the pragma would need to be
protected by

#if defined(__MVS__) && defined(__IBMC__)

Gnulib uses this in its test-iconv.c program, because the string
literals therein need to be in ASCII regardless of platform.

> Also, what line endings do the auto-converted source files, such
> as dot.mkshrc, have?

$ head -2 dot.mkshrc 
# $Id$
# $MirOS: src/bin/mksh/dot.mkshrc,v 1.101 2015/07/18 23:03:24 tg Exp $

$ head -2 dot.mkshrc | od -t x1
0000000  7B  40  5B  C9  84  5B  15  7B  40  5B  D4  89  99  D6  E2  7A
0000020  40  A2  99  83  61  82  89  95  61  94  92  A2  88  61  84  96
0000040  A3  4B  94  92  A2  88  99  83  6B  A5  40  F1  4B  F1  F0  F1
0000060  40  F2  F0  F1  F5  61  F0  F7  61  F1  F8  40  F2  F3  7A  F0
0000100  F3  7A  F2  F4  40  A3  87  40  C5  A7  97  40  5B  15
0000116

(Yes, binary files do get messed up :-]  On z/OS-native filesystems,
there is a per-file type flag that enables or disables encoding auto-
conversion. For NFS mounts, you have to mount it as either "binary" or
"text." The mksh source tree above is on the latter sort of mount.)

Let me know if I can help any more!


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-22 Thread Thorsten Glaser
Hi Daniel,

>Hi Thorsten, apologies for the delay.

don’t worry about that ;)

>> >Interesting! So POSIX assumes ASCII, to a certain extent.
>>
>> Yes, it does. I think EBCDIC as charset is actually nonconformant, but
>> it probably pays off to stay close nevertheless. (This is actually
>> about the POSIX/'C' locale; other locales can pretty much do whatever
>> they want.)
>
>Ah, okay, C locale; that makes sense. I did imagine POSIX was largely
>agnostic about the character set.

It is, but it prescribes that certain operations in the POSIX locale
use ASCII ordering for codepoints no matter which bytes they actually
have in the internal representation.

>> >Even if you really do need a table, you could populate it on startup
>> >using these.
>>
>> Indeed… but we have the compile-time translated characters all over
>> the source (I think we agreed earlier that not supporting changing it
>> at runtime was okay).
>
>Oh, so you mean like if(c=='[') and such? That is certainly reasonable.
>The program would be tied to the compile-time codepage no worse than
>most other programs.

Right. So either something like -DMKSH_EBCDIC_CP=1047 or limiting
EBCDIC support to precisely one codepage.

>(If you could do everything in terms of character literals, without
>depending on constructs like if(c>='A'&&c<='Z'), your code would be
>pretty much EBCDIC-proof.)

Yesss… but…

① not all characters are in every codepage, and
② I need strictly monotonic ordering for all 256 possible octets
  for e.g. sorting strings in some cases and for [a-z] ranges

>> I hope to be able to get back to that offer eventually. Glad to know
>> you’re still interested after two years.
>
>Mainframes are not a platform for the impatient... at least not if one
>has to deal with IBM  ^_^

Oh… I see. My condolences then ;-)

>> main() { printf("%02X\n", '\n'); return 0; }
>>
>> Out of curiosity, what does that print on your systems, 15 or 25?

>$ ./test
>15

OK, I can live with that, so I just need to swap the conversion
tables I got (which map 15 to NEL and 25 to LF).

>#pragma convert("ISO8859-1")
[…]
>That may or may not be useful. Of course, the pragma would need to be

Interesting, but I can’t think of where that would be useful
at the moment. But good to know.

Hmm. Can this be used to construct the table?

Something like running this at configure time:

#include <stdio.h>

int main(void) {
	int i = 1;

	printf("#pragma convert(\"ISO8859-1\")\n");
	printf("static const unsigned char map[] = \"");
	while (i <= 255)
		printf("%c", i++);	/* NB: octets that are '"', '\\' or a line ending in the source codepage still break the literal */
	printf("\";\n");
	return 0;
}

And then feed its output into the compilation, and have
some code generating the reverse map like:

i = 0;
while (i < 255) {
	revmap[map[i]] = i + 1;
	i++;
}

But this reeks of fragility compared with supporting
a known-good hand-edited set of codepages.

(Not to say we can’t do this manually once in order to
actually _get_ those mappings.)

>> Also, what line endings do the auto-converted source files, such
>> as dot.mkshrc, have?
>
>$ head -2 dot.mkshrc
># $Id$
># $MirOS: src/bin/mksh/dot.mkshrc,v 1.101 2015/07/18 23:03:24 tg Exp $
>
>$ head -2 dot.mkshrc | od -t x1
>0000000  7B  40  5B  C9  84  5B  15  7B  40  5B  D4  89  99  D6  E2  7A
                                  ^

OK, it matches the above. That’s all I needed to know, thanks
for confirming this.

>(Yes, binary files do get messed up :-]  On z/OS-native filesystems,
>there is a per-file type flag that enables or disables encoding auto-
>conversion. For NFS mounts, you have to mount it as either "binary" or
>"text." The mksh source tree above is on the latter sort of mount.)

Yeah, I remembered something like that from the eMail thread.
That’s fine, we can work with that.

>Let me know if I can help any more!

Okay, sure, thanks. I must admit I’m not actively working on
this still but I’m considering making a separate branch on which
we can try things until they work, then merge it back.

But first, the character class changes themselves. That turned
out to be quite a bit more effort than I had estimated and will
keep me busy for another longish hacking session. Ugh. Oh well.
But on the plus side, this will make support much nicer as *all*
constructs like “(c >= '0' && c <= '9')” will go away and even
the OS/2 TEXTMODE line endings (where CR+LF is also supported)
need less cpp hackery.

Goodnight,
//mirabilos, who had a long day working for a nonprofit
-- 
 you introduced a merge commit│ % g rebase -i HEAD^^
 sorry, no idea and rebasing just fscked │ Segmentation
 should have cloned into a clean repo  │  fault (core dumped)
 if I rebase that now, it's really ugh │ wuahh


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-24 Thread Daniel Richard G.
On Sat, 2017 Apr 22 23:26+, Thorsten Glaser wrote:
>
> >Oh, so you mean like if(c=='[') and such? That is certainly
> >reasonable. The program would be tied to the compile-time codepage no
> >worse than most other programs.
>
> Right. So either something like -DMKSH_EBCDIC_CP=1047 or limiting
> EBCDIC support to precisely one codepage.

I don't think the former sort of directive should be necessary. There is
enough auto-conversion magic going on that it should be possible to
piggyback on that... where it all "just works" when you compile the code.

> >(If you could do everything in terms of character literals, without
> >depending on constructs like if(c>='A'&&c<='Z'), your code would be
> >pretty much EBCDIC-proof.)
> 
> Yesss… but…
> 
> ① not all characters are in every codepage, and

True, but ASCII should be a given. (There are some older EBCDIC
codepages that lack certain common characters, I forget which ones, but
no one will want to use those anyway.)

> ② I need strictly monotonic ordering for all 256 possible octets
>   for e.g. sorting strings in some cases and for [a-z] ranges

That sounds no worse than what is usually done for LC_COLLATE and
such...

> OK, I can live with that, so I just need to swap the conversion tables
> I got (which map 15 to NEL and 25 to LF).

Always thought it was funny that it's the weirdo mainframe platform
that has a proper "newline" character instead of pressing LF into
service as one  ^_^

> >#pragma convert("ISO8859-1")
> […]
> >That may or may not be useful. Of course, the pragma would need to be
>
> Interesting, but I can’t think of where that would be useful at the
> moment. But good to know.
>
> Hmm. Can this be used to construct the table?
>
> Something like running this at configure time:
> 
> main() {
>   int i = 1;
> 
>   printf("#pragma convert(\"ISO8859-1\")\n");
>   printf("static const unsigned char map[] = \"");
>   while (i <= 255)
>   printf("%c", i++);
>   printf("\";\n");
> }
> 
> And then feed its output into the compilation, and have
> some code generate the reverse map like:
> 
>   i = 0;
>   while (i < 255)
>   revmap[map[i]] = i + 1;
> 
> But this reeks of fragility compared with supporting a known-good hand-
> edited set of codepages.

Probably easier just to use etoa(), or atoe()?  I don't think explicit
hand-edited tables should be needed for EBCDIC, unless you're already
doing those for other encodings.

> (Not to say we can’t do this manually once in order to actually _get_
> those mappings.)

Certainly the above code would either need some tweaking, or the output
some massaging, so the odd characters (especially '"') don't throw off
the compiler.

> >Let me know if I can help any more!
>
> Okay, sure, thanks. I must admit I’m not actively working on this
> still but I’m considering making a separate branch on which we can try
> things until they work, then merge it back.

I'm happy to test iterations of this, as long as it doesn't need much
diagnosing...

> But first, the character class changes themselves. That turned out to
> be quite a bit more effort than I had estimated and will keep me busy
> for another longish hacking session. Ugh. Oh well. But on the plus
> side, this will make support much nicer as *all* constructs like “(c
> >= '0' && c <= '9')” will go away and even the OS/2 TEXTMODE line
> endings (where CR+LF is also supported) need less cpp hackery.

Sounds great! That'll certainly make EBCDIC easier to deal with.

I might suggest looking at Gnulib, specifically lib/c-ctype.h, for
inspiration. I helped them get their ctype implementation in order on
z/OS (and at one point we were even trying to deal with *signed* EBCDIC
chars, where 'A' has a negative value!), and it works solidly now.
They've got a good design for dealing with non-ASCII weirdness; they
were clearly thinking of that from the start.


Happy hacking,


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-25 Thread Thorsten Glaser
Daniel Richard G. dixit:

>It wouldn't work to do the EBCDIC->ASCII conversion all at runtime? z/OS
>does provide functions for this, and these will adjust to whatever the
>current EBCDIC codepage is:
>
>etoa():
>
> https://www.ibm.com/support/knowledgecenter/SSLTBW_2.1.0/com.ibm.zos.v2r1.bpxbd00/r0ceta.htm
>
>etoa_l():
>
> https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.1.0/com.ibm.zos.v2r1.bpxbd00/r0etal.htm

[…]

>Probably easier just to use etoa(), or atoe()?  I don't think explicit
>hand-edited tables should be needed for EBCDIC, unless you're already

Well, the hand-edited tables would be known to be stable and (somewhat)
correct, but…

>Even if you really do need a table, you could populate it on startup
>using these.

I guess I can probably work with that.

So we’re up for testing again!

#include <err.h>
#include <stdio.h>
#include <unistd.h>
int main(void) {
	int i = 256;
	char buf[256];

	while (i--)
		buf[i] = i;
	if ((i = __etoa_l(buf, 256)) != 256)
		err(1, "etoa_l: %d != 256", i);
	i = 0;
	while (i < 256) {
		printf(" %02X", (unsigned int)(unsigned char)buf[i]);
		if (!(++i & 15))
			printf("\n");
	}
	return (0);
}

Can you run this in both codepages, and possibly their Euro equivalents?

There’s no EBCDIC to Unicode function (ideal would be something that
gets a char and returns an int or something, not on buffers) though,
is there? (If there is, runs of that would also be welcome.) I don’t
find one in the IBM library reference, and I had a look at z/OS Unicode
Services but… there’s CUNLCNV, but it looks extremely… IBM. So maybe
we can or have to make do with etoa and its limitations… probably
still enough at this point.

Thanks,
//mirabilos
-- 
[16:04:33] bkix: "veni vidi violini"
[16:04:45] bkix: "ich kam, sah und vergeigte"...


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-25 Thread Daniel Richard G.
On Tue, 2017 Apr 25 10:59+, Thorsten Glaser wrote:
> Well, the hand-edited tables would be known to be stable and
> (somewhat) correct, but…
>
> >Even if you really do need a table, you could populate it on startup
> >using these.
> 
> I guess I can probably work with that.
> 
> So we’re up for testing again!
> 
[...]

I had to get rid of <err.h>, and replace err() with printf(). But
otherwise, here is the result:

 00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F
 10 11 12 13 9D 0A 08 87 18 19 92 8F 1C 1D 1E 1F
 80 81 82 83 84 85 17 1B 88 89 8A 8B 8C 05 06 07
 90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A
 20 A0 E2 E4 E0 E1 E3 E5 E7 F1 A2 2E 3C 28 2B 7C
 26 E9 EA EB E8 ED EE EF EC DF 21 24 2A 29 3B 5E
 2D 2F C2 C4 C0 C1 C3 C5 C7 D1 A6 2C 25 5F 3E 3F
 F8 C9 CA CB C8 CD CE CF CC 60 3A 23 40 27 3D 22
 D8 61 62 63 64 65 66 67 68 69 AB BB F0 FD FE B1
 B0 6A 6B 6C 6D 6E 6F 70 71 72 AA BA E6 B8 C6 A4
 B5 7E 73 74 75 76 77 78 79 7A A1 BF D0 5B DE AE
 AC A3 A5 B7 A9 A7 B6 BC BD BE DD A8 AF 5D B4 D7
 7B 41 42 43 44 45 46 47 48 49 AD F4 F6 F2 F3 F5
 7D 4A 4B 4C 4D 4E 4F 50 51 52 B9 FB FC F9 FA FF
 5C F7 53 54 55 56 57 58 59 5A B2 D4 D6 D2 D3 D5
 30 31 32 33 34 35 36 37 38 39 B3 DB DC D9 DA 9F

> Can you run this in both codepages, and possibly their Euro
> equivalents?

I'm afraid I'm not able to switch the codepage. Some searching indicates
that this can be done in a shell with e.g.

LANG=En_us.IBM-037
LC_ALL=En_us.IBM-037

but that doesn't affect the output of your program. It's possible that
this needs to be set outside the z/OS Unix environment, in the actual
mainframe UI, and that eludes even me :>

You don't have enough confidence in etoa_l() to generate the table at
build time?

> There’s no EBCDIC to Unicode function (ideal would be something that
> gets a char and returns an int or something, not on buffers) though,
> is there? (If there is, runs of that would also be welcome.) I don’t
> find one in the IBM library reference, and I had a look at z/OS
> Unicode Services but… there’s CUNLCNV, but it looks extremely… IBM. So
> maybe we can or have to make do with etoa and its limitations…
> probably still enough at this point.

Don't forget that ISO 8859-1 is equivalent to the first 256 codepoints
of Unicode ;)


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-26 Thread Thorsten Glaser
Daniel Richard G. dixit:

>On Tue, 2017 Apr 25 10:59+, Thorsten Glaser wrote:
>> Well, the hand-edited tables would be known to be stable and
>> (somewhat) correct, but…
>>
>> >Even if you really do need a table, you could populate it on startup
>> >using these.
>>
>> I guess I can probably work with that.
>>
>> So we’re up for testing again!

>I had to get rid of <err.h>, and replace err() with printf(). But

Oh okay. That’s BSD then.

>otherwise, here is the result:
>
> 00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F
> 10 11 12 13 9D 0A 08 87 18 19 92 8F 1C 1D 1E 1F
> 80 81 82 83 84 85 17 1B 88 89 8A 8B 8C 05 06 07
> 90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A
> 20 A0 E2 E4 E0 E1 E3 E5 E7 F1 A2 2E 3C 28 2B 7C
> 26 E9 EA EB E8 ED EE EF EC DF 21 24 2A 29 3B 5E
> 2D 2F C2 C4 C0 C1 C3 C5 C7 D1 A6 2C 25 5F 3E 3F
> F8 C9 CA CB C8 CD CE CF CC 60 3A 23 40 27 3D 22
> D8 61 62 63 64 65 66 67 68 69 AB BB F0 FD FE B1
> B0 6A 6B 6C 6D 6E 6F 70 71 72 AA BA E6 B8 C6 A4
> B5 7E 73 74 75 76 77 78 79 7A A1 BF D0 5B DE AE
> AC A3 A5 B7 A9 A7 B6 BC BD BE DD A8 AF 5D B4 D7
> 7B 41 42 43 44 45 46 47 48 49 AD F4 F6 F2 F3 F5
> 7D 4A 4B 4C 4D 4E 4F 50 51 52 B9 FB FC F9 FA FF
> 5C F7 53 54 55 56 57 58 59 5A B2 D4 D6 D2 D3 D5
> 30 31 32 33 34 35 36 37 38 39 B3 DB DC D9 DA 9F

Thanks, that looks promising… for cp1047 anyway.

>> Can you run this in both codepages, and possibly their Euro
>> equivalents?
>
>I'm afraid I'm not able to switch the codepage. Some searching indicates
>that this can be done in a shell with e.g.
[…]

No problem.

>You don't have enough confidence in etoa_l() to generate the table at
>build time?

I didn’t have this initially (curious about the newline setting and
the handling of control characters in general) but I think I can work
with it now.

There’s one thing though… what about codepages that do NOT completely
map to latin1?

When does it error out, too?

Oh well, we’ll cross that when it’s there.

>Don't forget that ISO 8859-1 is equivalent to the first 256 codepoints
>of Unicode ;)

True, but e.g. cp1140 maps 0x9F to U+20AC which isn’t in the first 256
codepoints of Unicode…


But judging from how difficult you describe changing the codepage is,
I think we can work with what we have for now.

Thanks,
//mirabilos
-- 
This space for rent.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-26 Thread Daniel Richard G.
On Wed, 2017 Apr 26 14:25+, Thorsten Glaser wrote:
> 
> Thanks, that looks promising… for cp1047 anyway.
>
> >> Can you run this in both codepages, and possibly their Euro
> >> equivalents?
> >
> >I'm afraid I'm not able to switch the codepage. Some searching
> >indicates that this can be done in a shell with e.g.
> […]
> 
> No problem.

Good news!

I played with this some more, and found what was missing: a call to
setlocale().

So I added <locale.h>, and experimentally, this line...

/* very much NOT Latin-1 compatible */
setlocale(LC_ALL, "Ru_RU.IBM-1025");

...and the result was

 00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F
 10 11 12 13 9D 0A 08 87 18 19 92 8F 1C 1D 1E 1F
 80 81 82 83 84 85 17 1B 88 89 8A 8B 8C 05 06 07
 90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A
 20 A0 A1 A2 A3 A4 A5 A6 A8 A9 5B 2E 3C 28 2B 21
 26 AA AB AC AE AF B0 B1 B2 B3 5D 24 2A 29 3B 5E
 2D 2F B4 B5 B6 B7 B8 B9 BA BB 7C 2C 25 5F 3E 3F
 BC BD BE AD BF C0 C1 C2 C3 60 3A 23 40 27 3D 22
 C4 61 62 63 64 65 66 67 68 69 C5 C6 C7 C8 C9 CA
 CB 6A 6B 6C 6D 6E 6F 70 71 72 CC CD CE CF D0 D1
 D2 7E 73 74 75 76 77 78 79 7A D3 D4 D5 D6 D7 D8
 D9 DA DB DC DD DE DF E0 E1 E2 E3 E4 E5 E6 E7 E8
 7B 41 42 43 44 45 46 47 48 49 E9 EA EB EC ED EE
 7D 4A 4B 4C 4D 4E 4F 50 51 52 EF F0 F1 F2 F3 F4
 5C A7 53 54 55 56 57 58 59 5A F5 F6 F7 F8 F9 FA
 30 31 32 33 34 35 36 37 38 39 FB FC FD FE FF 9F

According to this page...


https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.2.0/com.ibm.zos.v2r2.cbcpx01/locnamc.htm

...the input should have been converted to ISO 8859-5.

So it seems like maybe the IBM docs are a bit flexible in what they mean
when they say "ISO 8859-1" :-]

I think what they really meant to say is "ASCII-compatible encoding." If
you look at the chcp(1) man page, for example...


https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.2.0/com.ibm.zos.v2r2.bpxa500/ip1.htm

...it talks about an "ASCII code page" in a sense distinct from (7-bit)
ASCII itself.

Incidentally, here is the result for "En_US.IBM-037":

 00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F
 10 11 12 13 9D 0A 08 87 18 19 92 8F 1C 1D 1E 1F
 80 81 82 83 84 85 17 1B 88 89 8A 8B 8C 05 06 07
 90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A
 20 A0 E2 E4 E0 E1 E3 E5 E7 F1 A2 2E 3C 28 2B 7C
 26 E9 EA EB E8 ED EE EF EC DF 21 24 2A 29 3B AC
 2D 2F C2 C4 C0 C1 C3 C5 C7 D1 A6 2C 25 5F 3E 3F
 F8 C9 CA CB C8 CD CE CF CC 60 3A 23 40 27 3D 22
 D8 61 62 63 64 65 66 67 68 69 AB BB F0 FD FE B1
 B0 6A 6B 6C 6D 6E 6F 70 71 72 AA BA E6 B8 C6 A4
 B5 7E 73 74 75 76 77 78 79 7A A1 BF D0 DD DE AE
 5E A3 A5 B7 A9 A7 B6 BC BD BE 5B 5D AF A8 B4 D7
 7B 41 42 43 44 45 46 47 48 49 AD F4 F6 F2 F3 F5
 7D 4A 4B 4C 4D 4E 4F 50 51 52 B9 FB FC F9 FA FF
 5C F7 53 54 55 56 57 58 59 5A B2 D4 D6 D2 D3 D5
 30 31 32 33 34 35 36 37 38 39 B3 DB DC D9 DA 9F

Do you still want the other tables?

> >You don't have enough confidence in etoa_l() to generate the table at
> >build time?
> 
> I didn’t have this initially (curious about the newline setting and
> the handling of control characters in general) but I think I can work
> with it now.
> 
> There’s one thing though… what about codepages that do NOT completely
> map to latin1?

I discussed this with a colleague who is a long-time mainframer. One
thing to note is that not just any EBCDIC codepage can be used in a
POSIX environment, because if you can't encode e.g. square brackets,
then basic things like shell scripts will break.

These odd encodings should be usable in a 3270 terminal session, the
traditional mainframe UI. But the POSIX environment is a special
case of that.

> When does it error out, too?

It's in the doc. Both failure modes (non-SBCS locale, out-of-memory
condition) should be extremely rare, to the point that they don't really
need to be handled gracefully.


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-27 Thread Thorsten Glaser
Daniel Richard G. dixit:

>Good news!
>
>I played with this some more, and found what was missing: a call to
>setlocale().

Oh.

>/* very much NOT Latin-1 compatible */
>setlocale(LC_ALL, "Ru_RU.IBM-1025");
[...]
>...the input should have been converted to ISO 8859-5.
>
>So it seems like maybe the IBM docs are a bit flexible in what they mean
>when they say "ISO 8859-1" :-]

Yeah, their definition of "ASCII codepage" is also a bit... off.

>Do you still want the other tables?

No, thanks, this is information enough.

>thing to note is that not just any EBCDIC codepage can be used in a
>POSIX environment, because if you can't encode e.g. square brackets,
>then basic things like shell scripts will break.

Hmm. I could check for required characters in the output,
or just leave this to the user. (We likely only support
the codepage the shell was compiled for anyway, due to
all those embedded strings.)

>> When does it error out, too?
>
>It's in the doc. Both failure modes (non-SBCS locale, out-of-memory
>condition) should be extremely rare, to the point that they don't really
>need to be handled gracefully.

OK, thanks.

bye,
//mirabilos
-- 
 Beware of ritual lest you forget the meaning behind it.
 yeah but it means if you really care about something, don't
ritualise it, or you will lose it. don't fetishise it, don't
obsess. or you'll forget why you love it in the first place.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-05-01 Thread Daniel Richard G.
Apologies again for the delay; this was a busy weekend for me!

On Thu, 2017 Apr 27 12:01+, Thorsten Glaser wrote:
> Daniel Richard G. dixit:
>
> >I played with this some more, and found what was missing: a call to
> >setlocale().
> 
> Oh.

I often forget, the LC_* envvars don't do anything by themselves...

> >thing to note is that not just any EBCDIC codepage can be used in a
> >POSIX environment, because if you can't encode e.g. square brackets,
> >then basic things like shell scripts will break.
>
> Hmm. I could check for required characters in the output, or just
> leave this to the user. (We likely only support the codepage the shell
> was compiled for anyway, due to all those embedded strings.)

I would leave it to the user. While getting shell scripts to work with
missing punctuation characters might be an interesting challenge (sort
of like writing a novel without the letter "e"), you're allowed to
assume that the build environment is at least minimally sane :-)


On to the new thread!


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


\uXXXX on EBCDIC systems (was Re: [PATCH] IBM z/OS + EBCDIC support)

2017-05-03 Thread Thorsten Glaser
Dixi quod…

>Use U+4DC0 HEXAGRAM FOR THE CREATIVE HEAVEN (䷀) then ☺

I *do* have a follow-up question for that now.

The utf8bug-1 test fails because its output is interpreted as UTF-8,
but the UTF-8 string it should match was treated as “extended ASCII”
and is thus converted…

So, the situation as it is right now is:

print -n '0\u4DC0' outputs the following octets:
- on an ASCII system : 30 E4 B7 80
- on an EBCDIC system: F0 E4 B7 80

That is, “0” is output in the native codepage, and the Unicode
value is output as real UTF-8 octets.

Now you say UTF-8 is not really used on z/OS or EBCDIC systems
in general, so I was considering the following heresy:
- output: F0 43 B3 20

That is, convert UTF-8 output, before actually outputting it,
as if it were “extended ASCII”, to EBCDIC.

Converting F0 43 B3 20 from EBCDIC(1047) to “extended ASCII”
yields 30 E4 B7 80 by the way, see above. (Typos in the manual
conversion notwithstanding.)

This would allow more consistency doing all those conversions
(which are done automatically). If it doesn’t diminish the
usefulness of mksh on EBCDIC systems I’d say go for it.

Comments?

Thanks,
//mirabilos
-- 
(gnutls can also be used, but if you are compiling lynx for your own use,
there is no reason to consider using that package)
-- Thomas E. Dickey on the Lynx mailing list, about OpenSSL


Re: \uXXXX on EBCDIC systems (was Re: [PATCH] IBM z/OS + EBCDIC support)

2017-05-03 Thread Daniel Richard G.
Hi Thorsten,

On Wed, 2017 May  3 15:57+, Thorsten Glaser wrote:
> Dixi quod…
> 
> >Use U+4DC0 HEXAGRAM FOR THE CREATIVE HEAVEN (䷀) then ☺
> 
> I *do* have a follow-up question for that now.
> 
> The utf8bug-1 test fails because its output is interpreted as UTF-8,
> but the UTF-8 string it should match was treated as “extended ASCII”
> and is thus converted…
> 
> So, the situation as it is right now is:
> 
> print -n '0\u4DC0' outputs the following octets:
> - on an ASCII system : 30 E4 B7 80
> - on an EBCDIC system: F0 E4 B7 80
> 
> That is, “0” is output in the native codepage, and the Unicode
> value is output as real UTF-8 octets.

This kind of weirdness is but one reason why z/Linux (Linux on the
mainframe) is eating Unix System Services alive :]
eating Unix System Services alive :]

> Now you say UTF-8 is not really used on z/OS or EBCDIC systems
> in general, so I was considering the following heresy:
> - output: F0 43 B3 20
> 
> That is, convert UTF-8 output, before actually outputting it,
> as if it were “extended ASCII”, to EBCDIC.
> 
> Converting F0 43 B3 20 from EBCDIC(1047) to “extended ASCII”
> yields 30 E4 B7 80 by the way, see above. (Typos in the manual
> conversion notwithstanding.)
> 
> This would allow more consistency doing all those conversions
> (which are done automatically). If it doesn’t diminish the
> usefulness of mksh on EBCDIC systems I’d say go for it.
> 
> Comments?

While UTF-8 isn't a thing in the z/OS environment, I think there could
be value in printing something that will be converted by the existing
EBCDIC->ASCII terminal/NFS conversion into correctly-formed UTF-8
characters.

To wit: Say I have a UTF-8-encoded file in NFS, and I view it via a
text-mode NFS mount on z/OS. If I view it in less(1), then the high
characters are shown as arbitrary byte sequences (e.g. "DIVISION SIGN"
is "<66>"). But if I just "cat" the file, then it renders correctly
in the terminal. Effectively an ASCII->EBCDIC->ASCII round trip.

I don't know if there are use cases where this may yield unintuitive
results... perhaps if this "nega-UTF-8" were redirected to a file and
then processed further in z/OS, that may lead to some surprises. But in
terms of doing something sensible when using a "\u" escape in an
environment that shouldn't support it, it seems no worse than producing
actual UTF-8 bytes.


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.