Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-22 Thread Daniel Richard G.
Hi Thorsten, apologies for the delay.

On Thu, 2017 Apr 20 21:49+, Thorsten Glaser wrote:
>
> >Interesting! So POSIX assumes ASCII, to a certain extent.
>
> Yes, it does. I think EBCDIC as charset is actually nonconformant, but
> it probably pays off to stay close nevertheless. (This is actually
> about the POSIX/'C' locale; other locales can pretty much do whatever
> they want.)

Ah, okay, C locale; that makes sense. I did imagine POSIX was largely
agnostic about the character set.

> >Even if you really do need a table, you could populate it on startup
> >using these.
>
> Indeed… but we have the compile-time translated characters all over
> the source (I think we agreed earlier that not supporting changing it
> at runtime was okay).

Oh, so you mean like if(c=='[') and such? That is certainly reasonable.
The program would be tied to the compile-time codepage no worse than
most other programs.

(If you could do everything in terms of character literals, without
depending on constructs like if(c>='A'&&c<='Z'), your code would be
pretty much EBCDIC-proof.)
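
A minimal sketch of the literal-only approach, assuming a helper named is_upper_literal (hypothetical, not taken from mksh). Membership is tested against a string of literals, so the compiler's execution character set supplies the byte values and nothing depends on the letters being contiguous:

```c
#include <string.h>

/*
 * Codepage-independent uppercase test: the literals below are translated
 * by the compiler, so this works whether 'A'..'Z' are contiguous (ASCII)
 * or split into several runs (EBCDIC).
 */
static int
is_upper_literal(int c)
{
	/* strchr() also matches the terminating NUL, so reject 0 first */
	return c != 0 &&
	    strchr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", c) != NULL;
}
```

In EBCDIC the uppercase letters sit in three runs (C1..C9, D1..D9, E2..E9), so a range test from 'A' to 'Z' would also accept the bytes in the gaps between them; the lookup above does not.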

> >Anyway, if you need any z/OS testing, feel free to drop me a line ;)
>
> Thanks!
>
> I hope to be able to get back to that offer eventually. Glad to know
> you’re still interested after two years.

Mainframes are not a platform for the impatient... at least not if one
has to deal with IBM  ^_^


On Fri, 2017 Apr 21 20:20+, Thorsten Glaser wrote:
> Daniel Richard G. dixit:
> 
> >Anyway, if you need any z/OS testing, feel free to drop me a line ;)
>
> main() { printf("%02X\n", '\n'); return 0; }
>
> Out of curiosity, what does that print on your systems, 15 or 25?

$ cat >test.c
main() { printf("%02X\n", '\n'); return 0; }

$ xlc -o test test.c

$ ./test
15

However...

$ cat >test2.c
#pragma convert("ISO8859-1")
int c = '\n';
#pragma convert(pop)
main() { printf("%02X\n", c); return 0; }

$ xlc -o test2 test2.c

$ ./test2
0A

That may or may not be useful. Of course, the pragma would need to be
protected by

#if defined(__MVS__) && defined(__IBMC__)

Gnulib uses this in its test-iconv.c program, because the string
literals therein need to be in ASCII regardless of platform.
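
A sketch of how the guarded pragma might look around a single literal. The names ascii_greeting and dump_ascii_greeting are made up for illustration, and the behaviour on z/OS is assumed from the test2.c result above rather than verified separately:

```c
#include <stdio.h>

/*
 * Force ISO 8859-1 (ASCII-compatible) values for the literals in this
 * region on z/OS XL C; elsewhere the guard drops the pragma and the
 * literals are assumed to be ASCII already.
 */
#if defined(__MVS__) && defined(__IBMC__)
#pragma convert("ISO8859-1")
#endif
static const char ascii_greeting[] = "hello\n";
#if defined(__MVS__) && defined(__IBMC__)
#pragma convert(pop)
#endif

/*
 * Print the bytes of the converted literal in hex. The format strings
 * are outside the converted region, so they stay in the native charset.
 */
static void
dump_ascii_greeting(void)
{
	size_t i;

	for (i = 0; i < sizeof(ascii_greeting) - 1; ++i)
		printf("%02X ", (unsigned char)ascii_greeting[i]);
	printf("\n");
}
```

Only the data that must be ASCII goes inside the region; anything handed to printf() as a format has to remain native, which is why the pragma is popped again right after the literal.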

> Also, what line endings do the auto-converted source files, such
> as dot.mkshrc, have?

$ head -2 dot.mkshrc 
# $Id$
# $MirOS: src/bin/mksh/dot.mkshrc,v 1.101 2015/07/18 23:03:24 tg Exp $

$ head -2 dot.mkshrc | od -t x1
0000000  7B  40  5B  C9  84  5B  15  7B  40  5B  D4  89  99  D6  E2  7A
0000020  40  A2  99  83  61  82  89  95  61  94  92  A2  88  61  84  96
0000040  A3  4B  94  92  A2  88  99  83  6B  A5  40  F1  4B  F1  F0  F1
0000060  40  F2  F0  F1  F5  61  F0  F7  61  F1  F8  40  F2  F3  7A  F0
0000100  F3  7A  F2  F4  40  A3  87  40  C5  A7  97  40  5B  15
0000116

(Yes, binary files do get messed up :-]  On z/OS-native filesystems,
there is a per-file type flag that enables or disables encoding auto-
conversion. For NFS mounts, you have to mount it as either "binary" or
"text." The mksh source tree above is on the latter sort of mount.)

Let me know if I can help any more!


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.


Re: [PATCH] IBM z/OS + EBCDIC support

2017-04-22 Thread Thorsten Glaser
Hi Daniel,

>Hi Thorsten, apologies for the delay.

don’t worry about that ;)

>> >Interesting! So POSIX assumes ASCII, to a certain extent.
>>
>> Yes, it does. I think EBCDIC as charset is actually nonconformant, but
>> it probably pays off to stay close nevertheless. (This is actually
>> about the POSIX/'C' locale; other locales can pretty much do whatever
>> they want.)
>
>Ah, okay, C locale; that makes sense. I did imagine POSIX was largely
>agnostic about the character set.

It is, but it prescribes that certain operations in the POSIX locale
use ASCII ordering for codepoints no matter which bytes they actually
have in the internal representation.

>> >Even if you really do need a table, you could populate it on startup
>> >using these.
>>
>> Indeed… but we have the compile-time translated characters all over
>> the source (I think we agreed earlier that not supporting changing it
>> at runtime was okay).
>
>Oh, so you mean like if(c=='[') and such? That is certainly reasonable.
>The program would be tied to the compile-time codepage no worse than
>most other programs.

Right. So either something like -DMKSH_EBCDIC_CP=1047 or limiting
EBCDIC support to precisely one codepage.

>(If you could do everything in terms of character literals, without
>depending on constructs like if(c>='A'&&c<='Z'), your code would be
>pretty much EBCDIC-proof.)

Yesss… but…

① not all characters are in every codepage, and
② I need strictly monotonic ordering for all 256 possible octets
  for e.g. sorting strings in some cases and for [a-z] ranges
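
One way to get such an ordering, sketched under the assumption that a 256-entry rank table exists (the table and all function names here are hypothetical): every native octet is given a unique rank, and comparisons and bracket ranges go through the ranks instead of the raw byte values.

```c
/*
 * rank[] assigns each native octet its position in the desired order.
 * It should be a permutation of 0..255 so that the order is strict, and
 * rank[0] should be 0 so that a prefix sorts before its extensions.
 */

/* compare two octets by rank rather than by raw value */
static int
octet_cmp(const unsigned char rank[256], unsigned char a, unsigned char b)
{
	return (int)rank[a] - (int)rank[b];
}

/* bracket range test such as [a-z], evaluated in rank order */
static int
octet_in_range(const unsigned char rank[256], unsigned char c,
    unsigned char lo, unsigned char hi)
{
	return rank[c] >= rank[lo] && rank[c] <= rank[hi];
}

/* strcmp()-like comparison that orders strings by rank */
static int
str_cmp_ranked(const unsigned char rank[256], const char *a, const char *b)
{
	const unsigned char *x = (const unsigned char *)a;
	const unsigned char *y = (const unsigned char *)b;

	while (*x != 0 && *x == *y) {
		++x;
		++y;
	}
	return (int)rank[*x] - (int)rank[*y];
}
```

If rank[] is filled with the ASCII value of each native octet, a range like [a-z] covers exactly the octets whose ASCII equivalents lie between 'a' and 'z', whatever the internal representation is.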

>> I hope to be able to get back to that offer eventually. Glad to know
>> you’re still interested after two years.
>
>Mainframes are not a platform for the impatient... at least not if one
>has to deal with IBM  ^_^

Oh… I see. My condolences then ;-)

>> main() { printf("%02X\n", '\n'); return 0; }
>>
>> Out of curiosity, what does that print on your systems, 15 or 25?

>$ ./test
>15

OK, I can live with that, so I just need to swap the conversion
tables I got (which map 15 to NEL and 25 to LF).
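
Roughly, the swap could look like this. It is only a sketch: the table names to_ascii and from_ascii are placeholders, and NEL is taken to be 0x85 as in ISO 8859-1.

```c
/* exchange two entries of a 256-entry conversion table */
static void
swap_entries(unsigned char tbl[256], unsigned char a, unsigned char b)
{
	unsigned char t = tbl[a];

	tbl[a] = tbl[b];
	tbl[b] = t;
}

/*
 * Make EBCDIC 0x15 convert to LF (0x0A) and 0x25 to NEL (0x85), in both
 * directions, starting from tables that have the two the other way round.
 */
static void
fix_newline_mapping(unsigned char to_ascii[256], unsigned char from_ascii[256])
{
	swap_entries(to_ascii, 0x15, 0x25);
	swap_entries(from_ascii, 0x0A, 0x85);
}
```

Swapping both tables keeps them inverse to each other, so a round trip through the conversion still returns the original octet.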

>#pragma convert("ISO8859-1")
[…]
>That may or may not be useful. Of course, the pragma would need to be

Interesting, but I can’t think of where that would be useful
at the moment. But good to know.

Hmm. Can this be used to construct the table?

Something like running this at configure time:

#include <stdio.h>

int main(void) {
	int i = 1;

	printf("#pragma convert(\"ISO8859-1\")\n");
	printf("static const unsigned char map[] = \"");
	/* emit native octets 1..255 raw; the pragma converts them */
	while (i <= 255)
		printf("%c", i++);
	printf("\";\n");
	return 0;
}

And then feed its output into the compiling, and have
some code generating the reverse map like:

i = 0;
while (i < 255) {
	revmap[map[i]] = i + 1;
	++i;
}

But this reeks of fragility compared with supporting
a known-good hand-edited set of codepages.

(Not to say we can’t do this manually once in order to
actually _get_ those mappings.)
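
If the generated table were used, part of the fragility could be caught by checking that it is one-to-one while inverting it. A sketch, with build_revmap as a hypothetical name and map[] laid out as described above (entry i holds the ISO 8859-1 value of native octet i + 1):

```c
#include <string.h>

/*
 * Fill revmap[] with the inverse of map[]. Native octet 0 is not in
 * map[] because it would terminate the generated string literal; it is
 * treated as mapping to itself. Returns 0 on success, -1 if two native
 * octets map to the same ISO 8859-1 value.
 */
static int
build_revmap(const unsigned char map[255], unsigned char revmap[256])
{
	unsigned char seen[256];
	int i;

	memset(seen, 0, sizeof(seen));
	memset(revmap, 0, 256);
	seen[0] = 1;	/* NUL maps to NUL; no other octet may claim it */
	for (i = 0; i < 255; ++i) {
		if (seen[map[i]])
			return -1;
		seen[map[i]] = 1;
		revmap[map[i]] = (unsigned char)(i + 1);
	}
	return 0;
}
```

A failure here would mean the compiler's conversion is not a bijection for that codepage, which is exactly the kind of surprise a hand-checked table avoids.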

>> Also, what line endings do the auto-converted source files, such
>> as dot.mkshrc, have?
>
>$ head -2 dot.mkshrc
># $Id$
># $MirOS: src/bin/mksh/dot.mkshrc,v 1.101 2015/07/18 23:03:24 tg Exp $
>
>$ head -2 dot.mkshrc | od -t x1
>0000000  7B  40  5B  C9  84  5B  15  7B  40  5B  D4  89  99  D6  E2  7A
                                  ^

OK, it matches the above. That’s all I needed to know, thanks
for confirming this.

>(Yes, binary files do get messed up :-]  On z/OS-native filesystems,
>there is a per-file type flag that enables or disables encoding auto-
>conversion. For NFS mounts, you have to mount it as either "binary" or
>"text." The mksh source tree above is on the latter sort of mount.)

Yeah, I remembered something like that from the eMail thread.
That’s fine, we can work with that.

>Let me know if I can help any more!

Okay, sure, thanks. I must admit I’m still not actively working on
this, but I’m considering making a separate branch on which
we can try things until they work, then merge it back.

But first, the character class changes themselves. That turned
out to be quite a bit more effort than I had estimated and will
keep me busy for another longish hacking session. Ugh. Oh well.
But on the plus side, this will make support much nicer as *all*
constructs like “(c >= '0' && c <= '9')” will go away and even
the OS/2 TEXTMODE line endings (where CR+LF is also supported)
need less cpp hackery.
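
For illustration only, a table-driven character class scheme of that general kind might look as follows. The names cc_table, cc_init, cc_is and the CC_* bits are invented for this sketch and are not the actual mksh changes:

```c
#include <string.h>

#define CC_DIGIT 0x01	/* hypothetical class bits */
#define CC_UPPER 0x02
#define CC_LOWER 0x04

static unsigned char cc_table[256];

/* set one class bit for every character listed in members */
static void
cc_mark(const char *members, unsigned char bit)
{
	for (; *members != '\0'; ++members)
		cc_table[(unsigned char)*members] |= bit;
}

/* build the table from literals, so the compile-time codepage decides */
static void
cc_init(void)
{
	memset(cc_table, 0, sizeof(cc_table));
	cc_mark("0123456789", CC_DIGIT);
	cc_mark("ABCDEFGHIJKLMNOPQRSTUVWXYZ", CC_UPPER);
	cc_mark("abcdefghijklmnopqrstuvwxyz", CC_LOWER);
}

/* nonzero if c belongs to any of the classes in bits */
#define cc_is(c, bits) ((cc_table[(unsigned char)(c)] & (bits)) != 0)
```

After cc_init(), a test such as cc_is(c, CC_DIGIT) replaces the range comparison, and cc_is(c, CC_UPPER | CC_LOWER) covers letters without assuming they are contiguous.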

Goodnight,
//mirabilos, who had a long day working for a nonprofit
-- 
 you introduced a merge commit│ % g rebase -i HEAD^^
 sorry, no idea and rebasing just fscked │ Segmentation
 should have cloned into a clean repo  │  fault (core dumped)
 if I rebase that now, it's really ugh │ wuahh