On Thu, 7 Aug 2025 12:40:26 +0000, Peter Relson <[email protected]> wrote:

>It's perhaps not overly relevant, but z/OS provided no RMODE 64 for programs 
>until many years after the introduction of AMODE 64.

Very good point because we very rarely (if ever) felt any AMODE 64 pain during 
those years. Out of sight, out of mind. Thanks for the reminder.

The IBM mainframe, as the last architecture with competent computer
professionals, can be forgiven for not considering the possibility of RMODE 64
in this situation.

Before anyone blasts me for saying "competent computer professionals", may I 
remind you that if non-mainframers can't solve simple problems properly, what 
makes you think they are solving complicated problems properly? Is it 
sufficient that they get the desired result while disregarding all other 
metrics?

Of the millions of non-mainframe screwups, let's consider the simple act of
moving data. On the mainframe, we've had the same concepts since the 1960s
(MVC, MVCL, MVCLE) while changing architectures (S360, S370, S390, ESA,
zARCH). A single instruction chooses how to optimally move data.

Consider how inefficient moving data is on non-mainframe architectures (e.g.
x86, RISCV, ...). That is 65 years (since the 1960s) of ignoring the correct
implementation.

1. It's machine instruction "LANGUAGE"! The instruction cycle is fetch 
instruction, decode instruction, fetch data, execute instruction, store data. 
This is true for all CPUs.

2. Moving data is a repetition of multiple instructions (unnecessarily 
repeating instruction cycles when 1 instruction cycle would suffice). 

3. At a minimum, the instruction loop will be load register, store register, 
increment dest address, increment source address, decrement loop count and loop 
if not zero. May include additional instructions.

4. Inefficient data prefetch (getting data into CPU before it's needed). 
Currently, the best data prefetch for move using the best architecture can only 
prefetch 512 bytes (64 byte vector register times 8 vector registers equals 512 
bytes). The mainframe does not use registers to move data and move can take 
advantage of the entire data prefetch available (4K, 100K, 1M, ... , 
unrealistically up to 16 exabytes). 

5. Inefficient paging. The hardware cannot do any paging prediction because the
architectures do not know the size of the data being moved. If move were a
single instruction like the mainframe, the hardware could request source pages 
be paged in and destination page table entries be cleared instead of being 
paged in. I'm not saying it works this way but it's in the realm of 
possibilities. 

6. I believe that Peter has stated the mainframe pipeline is 6 instructions. 
Most people ignore that non-mainframe pipelines are inefficient and 
complicated. Consider the implementation that uses 8 vector registers requiring 
more than 20 instructions (8 vector load registers & 8 stores). The mainframe 
has a "vector load multiple register" instruction that eliminates 7 
instructions and the need for the pipeline to optimize these instructions. If 
this architecture does more than very basic pipeline optimization, then it must 
be more than 20 instructions.

7. Designed to be as slow as a herd of turtles. L1 and L2 cache are sized
according to the amount of data the CPU processes. With tiny L1 per core (most
less than 80KB) and tiny L2 per core (most less than 1MB), don't expect data to
move quickly. The mainframe is currently 256KB L1 and 32MB L2 per core. This is
about working smarter instead of harder.
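
As a rough illustration of item 3, here is a minimal C sketch of the loop I am
describing next to a single library call. The names copy_loop and
copy_single_call are made up for this example. An optimizing compiler may
rewrite the loop, and whether memcpy ends up as MVC, MVCLE or a vector loop
depends on the library and the machine, so treat this as a picture of the
per-byte work and not as a benchmark.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: moves n bytes one at a time, mirroring the
   load / store / increment / decrement / branch loop from item 3. */
void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n != 0) {             /* loop if count is not zero */
        unsigned char r = *src;  /* load "register" from source */
        *dst = r;                /* store "register" to destination */
        dst++;                   /* increment dest address */
        src++;                   /* increment source address */
        n--;                     /* decrement loop count */
    }
}

/* Hypothetical helper: the same move expressed as one request, leaving
   the "how" to the library (and, on some platforms, to the hardware). */
void copy_single_call(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Both helpers produce the same bytes in the destination when the areas do not
overlap. The difference is only in how many steps the programmer spells out.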

Does IBM LinuxONE use GLIBC (standard C libraries) or does Red Hat supply the C
libraries written correctly? I only ask because some parts must have been
written by people who do not understand IBM mainframe instructions.

Consider memcpy and memmove
https://github.com/bminor/glibc/blob/master/sysdeps/s390/memcpy-z900.S which
meets the requirements but is horribly written and inefficient.

1. What possessed them to use an MVC loop when moving less than 16M at
.L_Z10_13:? MEMCPY is supposed to be the efficient move with unpredictable
results when dest and source overlap (think MVC DATA+1,DATA).

2. What possessed them to think the 2 prefetch data (pfd) at .L_Z10_12: is 
making this code efficient? Are they considering the "cpu defined" mentioned in 
MVCLE? They didn't consider page flush for the destination if it's paged out. 

3. What possessed them to think it's smart to copy the vector instructions at
.L_MEMMOVE_Z13_LARGE_64B_LOOP: from other architectures, as if that's the best
that they could do? We have 32 vector registers and the vector load / store
MULTIPLE register instructions where we can load & store V0 to V15 and load &
store V16 to V31. That's a 512 byte move loop with 4 fewer instructions than
the 64 byte loop.
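
For anyone who has not seen the "MVC DATA+1,DATA" idiom from item 1, here is a
small C model of a left-to-right, byte-at-a-time move, which is how MVC appears
to behave when the fields overlap. The function name forward_byte_move is made
up for this example, and this models only the visible result, not how the
hardware or glibc actually does it. Calling memcpy on overlapping areas is
undefined in C, so the comparison uses memmove.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model: move n bytes strictly left to right, one byte at
   a time. When dst is src+1 each byte stored becomes the next byte
   loaded, so the first byte is propagated through the whole field. */
void forward_byte_move(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Starting from an 8 byte field holding ABCDEFGH, forward_byte_move(field + 1,
field, 7) leaves AAAAAAAA, while memmove(field + 1, field, 7) leaves AABCDEFG.
That is the overlap behavior a memcpy caller is not allowed to rely on.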

I could list more problems. It's sad when you realize that fewer than 100 lines
have so many problems.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN