On 14/01/2019 6:06 am, Ed Jaffe wrote:
On 1/13/2019 4:08 AM, David Crayford wrote:
On 13/01/2019 7:06 pm, Tony Thigpen wrote:
I have seen some reports that current C compilers, which understand
the z-hardware pipeline, can actually produce object code that runs
faster than hand-written assembler, mainly because no sane assembler
programmer would write heavily pipeline-scheduled code: it would be
unmaintainable.
It's well established that that's been true for well over a decade
now. Not just C but all compilers, including COBOL, which got a new
optimizer a few releases back.
Far, far less true now than it used to be.
Good to hear. The best optimization is done in hardware where you don't
have to recompile. Followed by a JIT.
Back in the old days, things ran a lot faster if you interleaved
unrelated things in an "unfriendly" way. For example, this code fragment:
| LGF R0,Field1 Increment Field1
| AGHI R0,1 (same)
| ST R0,Field1 (same)
| LGF R0,Field2 Increment Field2
| AGHI R0,1 (same)
| ST R0,Field2 (same)
| LGF R0,Field3 Increment Field3
| AGHI R0,1 (same)
| ST R0,Field3 (same)
| LGF R0,Field4 Increment Field4
| AGHI R0,1 (same)
| ST R0,Field4 (same)
ran much faster when coded this way (which is not how a programmer
would usually write things):
| LGF R0,Field1 Increment Field1
| LGF R1,Field2 Increment Field2
| LGF R2,Field3 Increment Field3
| LGF R3,Field4 Increment Field4
| AGHI R0,1 (same)
| AGHI R1,1 (same)
| AGHI R2,1 (same)
| AGHI R3,1 (same)
| ST R0,Field1 (same)
| ST R1,Field2 (same)
| ST R2,Field3 (same)
| ST R3,Field4 (same)
But once OOO execution came on the scene with z196, you could get the
same enhanced performance from this easy-to-code and easy-to-read
version:
| LGF R0,Field1 Increment Field1
| AGHI R0,1 (same)
| ST R0,Field1 (same)
| LGF R1,Field2 Increment Field2
| AGHI R1,1 (same)
| ST R1,Field2 (same)
| LGF R2,Field3 Increment Field3
| AGHI R2,1 (same)
| ST R2,Field3 (same)
| LGF R3,Field4 Increment Field4
| AGHI R3,1 (same)
| ST R3,Field4 (same)
These days, many performance improvements are realized by the compiler
using newer instructions that replace older ones. For example, on z10
and higher, this very same code can be replaced with:
| ASI Field1,1 Increment Field1
| ASI Field2,1 Increment Field2
| ASI Field3,1 Increment Field3
| ASI Field4,1 Increment Field4
IIRC, the interleaved instruction scheduling was to mitigate the AGI
problem?
In my experience, the two optimizations that make the most difference
are function inlining and loop unrolling. I've taken to defining
functions in header files to take advantage of both (we don't use IPA).
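As a rough illustration of the header-file approach, here is a minimal
sketch. The header name and both function names are made up for the
example, and whether a given compiler actually inlines the call or unrolls
the loop depends on its options and optimization level.

```c
/* counters.h (hypothetical header, for illustration only) */
#ifndef COUNTERS_H
#define COUNTERS_H

#include <stddef.h>

/* Defined in the header, so every translation unit that includes it
   sees the function body and the optimizer can inline it without IPA. */
static inline unsigned int add_clamped(unsigned int total, unsigned int value)
{
    unsigned int sum = total + value;
    /* Unsigned wraparound means overflow occurred: saturate instead. */
    return (sum < total) ? (unsigned int)-1 : sum;
}

/* A simple counted loop with a small inlinable body, which is the kind
   of loop an optimizer may choose to unroll. */
static inline unsigned int sum_clamped(const unsigned int *values, size_t count)
{
    unsigned int total = 0;
    for (size_t i = 0; i < count; i++)
        total = add_clamped(total, values[i]);
    return total;
}

#endif /* COUNTERS_H */
```

A caller just includes the header and calls sum_clamped(); no separate
object file has to be linked for these two functions.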
Of course, an HLASM programmer can do exactly the same thing. But
changing old code to use new instructions requires
relatively-expensive programmer resources whereas simply recompiling
programs targeting a new machine is a relatively-inexpensive proposition.
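For comparison, this is roughly what the four increments look like in C.
The struct and function names are hypothetical stand-ins for Field1
through Field4. The source is the same whatever machine is targeted;
whether the compiler emits a load/add/store sequence or a single
add-to-storage instruction per field is up to the compiler and the
architecture level it is told to target.

```c
/* Hypothetical layout standing in for Field1..Field4. */
struct counters {
    int field1;
    int field2;
    int field3;
    int field4;
};

/* Increment each field by one. Instruction selection and scheduling
   are left entirely to the compiler. */
void bump_all(struct counters *c)
{
    c->field1 += 1;
    c->field2 += 1;
    c->field3 += 1;
    c->field4 += 1;
}
```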
I've noticed (depending on compiler options) the C optimizer is starting
to use vector instructions. They can be a bit hairy even for experienced
assembler programmers. Best to leave that to a compiler IMO.
Instead of using an SRST instruction, strlen() generates the following (unrolled four times):
VLBB v0,str(r6,r9,0),2
LCBB r7,str(r6,r9,0),2
LR r0,r6
ALR r6,r7
VFENEB v0,v0,v0,b'0010'
VLGVB r2,v0,7
CLRJH r7,r2,@2L33
VLBB v0,str(r6,r9,0),2
LCBB r7,str(r6,r9,0),2
LR r0,r6
ALR r6,r7
VFENEB v0,v0,v0,b'0010'
VLGVB r2,v0,7
CLRJH r7,r2,@2L33
VLBB v0,str(r6,r9,0),2
LCBB r7,str(r6,r9,0),2
LR r0,r6
VFENEB v0,v0,v0,b'0010'
ALR r6,r7
VLGVB r2,v0,7
CLRJH r7,r2,@2L33
VLBB v0,str(r6,r9,0),2
LCBB r7,str(r6,r9,0),2
LR r0,r6
ALR r6,r7
VFENEB v0,v0,v0,b'0010'
VLGVB r2,v0,7
CLRJNH r7,r2,@2L25
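For anyone who doesn't want to decode the listing, here is a portable C
sketch of what I understand each unrolled step to be doing. This is my
reading of the instructions, not the compiler's actual source: the
function names are made up, the 256-byte block size for boundary code 2 is
from memory, and the scalar inner loop only stands in for the 16-byte
vector compare.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES   16u    /* width of one vector register */
#define BLOCK_BYTES 256u   /* assumed block size for boundary code 2 */

/* Bytes a boundary-limited load may touch: up to 16, but never past the
   next block boundary (the role LCBB plays in the listing). */
static size_t bytes_to_boundary(const char *p)
{
    size_t room = BLOCK_BYTES - (size_t)((uintptr_t)p % BLOCK_BYTES);
    return room < VEC_BYTES ? room : VEC_BYTES;
}

/* Index of the first zero byte among the first `count` bytes, or 16 if
   there is none (the role of VFENEB with zero search, then VLGVB). */
static size_t first_zero_index(const char *p, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (p[i] == '\0')
            return i;
    return VEC_BYTES;
}

/* Chunked strlen: one loop iteration corresponds to one unrolled step. */
size_t sketch_strlen(const char *s)
{
    const char *p = s;
    for (;;) {
        size_t count = bytes_to_boundary(p);
        size_t index = first_zero_index(p, count);
        if (count > index)                  /* CLRJH: terminator is in range */
            return (size_t)(p - s) + index;
        p += count;                         /* ALR: step to the next chunk */
    }
}
```

The point of limiting each load to the block boundary is that the code
never touches a page the string does not reach, so it can read 16 bytes at
a time without risking an access exception past the terminator.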
----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN