Interesting stuff! Thank you, Ed! I did not realize the OOO execution capabilities were as smart as your example suggests, and I try to write code that runs free of charge whenever possible. But I will have to give that one a whirl.
And if, in your samples, Fields 1-4 happened to be consecutive fullwords, then you could likely reduce the number of storage fetches to just one and the number of storage updates also to just one, but you would still need the 4 increments. I have steered clear of the ASI instruction in the past because of possible interlock overhead when I did not need things serialized, but that too may be less of an issue these days.

Mike

________________________________
From: IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU> on behalf of David Crayford <dcrayf...@gmail.com>
Sent: Sunday, January 13, 2019 11:47 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: Unreadable code (Was: Concurrent Server Task Dispatch issue multitasking issue)

On 14/01/2019 6:06 am, Ed Jaffe wrote:
> On 1/13/2019 4:08 AM, David Crayford wrote:
>> On 13/01/2019 7:06 pm, Tony Thigpen wrote:
>>> I have seen some reports that current C compilers, which understand
>>> the z-hardware pipeline, can actually produce object that is faster
>>> running than an assembler. Mainly because no sane assembler
>>> programmer would produce great pipe-line code because it would be
>>> unmaintainable.
>>>
>> It's well established that that's been true for well over a decade
>> now. Not just C but all compilers including COBOL which got a new
>> optimizer a few releases back.
>
>
> Far, far less true now than it used to be.
>

Good to hear. The best optimization is done in hardware where you don't have to recompile. Followed by a JIT.

> Back in the old days, things ran a lot faster if you interleaved
> unrelated things in an "unfriendly" way.
> For example, this code fragment:
>
> | LGF   R0,Field1          Increment Field1
> | AGHI  R0,1               (same)
> | ST    R0,Field1          (same)
> | LGF   R0,Field2          Increment Field2
> | AGHI  R0,1               (same)
> | ST    R0,Field2          (same)
> | LGF   R0,Field3          Increment Field3
> | AGHI  R0,1               (same)
> | ST    R0,Field3          (same)
> | LGF   R0,Field4          Increment Field4
> | AGHI  R0,1               (same)
> | ST    R0,Field4          (same)
>
> ran much faster when coded this way (which is not how a programmer
> would usually write things):
>
> | LGF   R0,Field1          Increment Field1
> | LGF   R1,Field2          Increment Field2
> | LGF   R2,Field3          Increment Field3
> | LGF   R3,Field4          Increment Field4
> | AGHI  R0,1               (same)
> | AGHI  R1,1               (same)
> | AGHI  R2,1               (same)
> | AGHI  R3,1               (same)
> | ST    R0,Field1          (same)
> | ST    R1,Field2          (same)
> | ST    R2,Field3          (same)
> | ST    R3,Field4          (same)
>
> But once OOO execution came on the scene with z196, you could get the
> same enhanced performance from this easy-to-code and easy-to-read
> version:
>
> | LGF   R0,Field1          Increment Field1
> | AGHI  R0,1               (same)
> | ST    R0,Field1          (same)
> | LGF   R1,Field2          Increment Field2
> | AGHI  R1,1               (same)
> | ST    R1,Field2          (same)
> | LGF   R2,Field3          Increment Field3
> | AGHI  R2,1               (same)
> | ST    R2,Field3          (same)
> | LGF   R3,Field4          Increment Field4
> | AGHI  R3,1               (same)
> | ST    R3,Field4          (same)
>
> These days, many performance improvements are realized by the compiler
> using newer instructions that replace older ones. For example, on z10
> and higher, this very same code can be replaced with:
>
> | ASI   Field1,1           Increment Field1
> | ASI   Field2,1           Increment Field2
> | ASI   Field3,1           Increment Field3
> | ASI   Field4,1           Increment Field4
>

IIRC, the interleaved instruction scheduling was to mitigate the AGI problem? In my experience, the two optimizations that make the most difference are function inlining and loop unrolling. I've taken to defining functions in header files to take advantage of both (we don't use IPA).

> Of course, an HLASM programmer can do exactly the same thing.
> But changing old code to use new instructions requires
> relatively-expensive programmer resources whereas simply recompiling
> programs targeting a new machine is a relatively-inexpensive proposition.
>
>

I've noticed (depending on compiler options) the C optimizer is starting to use vector instructions. They can be a bit hairy even for experienced assembler programmers. Best to leave that to a compiler IMO. Instead of using an SRST instruction, strlen() generates the following:

         VLBB     v0,str(r6,r9,0),2
         LCBB     r7,str(r6,r9,0),2
         LR       r0,r6
         ALR      r6,r7
         VFENEB   v0,v0,v0,b'0010'
         VLGVB    r2,v0,7
         CLRJH    r7,r2,@2L33
         VLBB     v0,str(r6,r9,0),2
         LCBB     r7,str(r6,r9,0),2
         LR       r0,r6
         ALR      r6,r7
         VFENEB   v0,v0,v0,b'0010'
         VLGVB    r2,v0,7
         CLRJH    r7,r2,@2L33
         VLBB     v0,str(r6,r9,0),2
         LCBB     r7,str(r6,r9,0),2
         LR       r0,r6
         VFENEB   v0,v0,v0,b'0010'
         ALR      r6,r7
         VLGVB    r2,v0,7
         CLRJH    r7,r2,@2L33
         VLBB     v0,str(r6,r9,0),2
         LCBB     r7,str(r6,r9,0),2
         LR       r0,r6
         ALR      r6,r7
         VFENEB   v0,v0,v0,b'0010'
         VLGVB    r2,v0,7
         CLRJNH   r7,r2,@2L25

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
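
For anyone who wants to try the recompile route discussed in this thread, here is a minimal C sketch of the same four-field increment. The struct and function names are hypothetical and do not come from any of the posts; the four counters are laid out as consecutive fullwords, as in the case mentioned at the top of this message. Which instructions a compiler actually emits for this (ASI, load/add/store, or something else) depends on the compiler level and its target options, so the generated listing would need to be checked rather than assumed.

```c
#include <stdint.h>

/* Hypothetical layout: four consecutive fullword (32-bit) counters,
   standing in for Field1..Field4 from the thread. */
struct counters {
    int32_t field1;
    int32_t field2;
    int32_t field3;
    int32_t field4;
};

/* Increment each counter once, written in the plain, readable order.
   Instruction selection and scheduling are left to the compiler and,
   on out-of-order hardware, to the processor. */
void bump_all(struct counters *c)
{
    c->field1 += 1;
    c->field2 += 1;
    c->field3 += 1;
    c->field4 += 1;
}
```

Compiling this once per target architecture level and comparing the listings is one inexpensive way to see which of the forms shown in the quoted fragments the optimizer picks.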