On 29/04/2016 10:34 PM, Joe Testa wrote:
There seems to be little point worrying about the time needed to branch past an 
eyecatcher at the start of a program, compared to the time used by the rest of 
the program.

Unfortunately that's not true. For high-frequency subroutines it can dominate the performance profile. We have customer feedback where the code has been profiled using APA and the hot spots are clearly at the branches over eyecatchers. I'm asking the question because I want to understand why. The customer suggested we were non-re-entrant and saving registers into the instruction stream. Our code is re-entrant.


From: Mike Schwab
Sent: Friday, April 29, 2016 10:27 AM
Newsgroups: bit.listserv.ibm-main
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: An explanation for branch performance?

Well, the obvious solution is to code the eyecatcher literals before
the entry point.  It will be less obvious that the eyecatcher is part
of the program (and not the end of the previous program) but as the
technique becomes more widespread it should become more trusted.
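
A minimal sketch of that layout, assuming a hypothetical CSECT named MYPROG and entry label MYENTRY (neither is from the code under discussion). The eyecatcher sits in storage ahead of the entry point, so nothing branches over it; the END operand nominates MYENTRY as the entry point. Save-area chaining is omitted to keep it short.

```
MYPROG   CSECT
         DC    C'MYPROG &SYSDATC'  eyecatcher, never executed
         DS    0H                  realign to a halfword boundary
MYENTRY  STM   14,12,12(13)        entry point: save caller's registers
         LR    12,15               R15 holds the entry address on entry
         USING MYENTRY,12          base register covers code from MYENTRY
*        ... body of the routine goes here ...
         LM    14,12,12(13)        restore caller's registers
         SR    15,15               return code 0
         BR    14                  return to caller
         ENTRY MYENTRY             make the entry name external
         END   MYENTRY             nominate MYENTRY as the entry point
```

Callers would reach the routine through MYENTRY (for example via a V-type address constant) rather than through the start of the CSECT, which is the trade-off mentioned above: a dump reader has to know that the eyecatcher precedes the entry point.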

On Fri, Apr 29, 2016 at 9:13 AM, David Crayford <dcrayf...@gmail.com> wrote:
On 29/04/2016 10:09 PM, Mike Schwab wrote:
The pipeline is optimized for running many instructions in a row.  A
branch is not recognized until through a good part of the pipeline.
Meanwhile the data to be skipped is in the instruction pipeline.

Results meet expectations.

So branching over eyecatchers is expected to be 2x slower on a z13 than a
z114? I was always led to believe that new hardware always ran old code
faster unless it was doing nasty stuff like storing into the instruction
stream.


On Fri, Apr 29, 2016 at 7:40 AM, David Crayford <dcrayf...@gmail.com> wrote:
We're doing some performance work on our assembler code and one of my
colleagues ran the following test, which was surprising. Unconditional
branching can add significant overhead. I always believed that
conditional branches were expensive because the branch predictor needed
to do more work, and that unconditional branches were easy to predict.
Does anybody have an explanation for this? Our machine is a z114. It
appears that it's even worse on a z13.

Here's the code.

I wrote a simple program - it tight loops 1 billion times


         L     R4,=A(1*1000*1000*1000)
         LTR   R4,R4
         J     LOOP
*
LOOP     DS    0D                 .LOOP START
         B     NEXT

NEXT     JCT   R4,LOOP

The loop starts with a branch ... I tested it twice - when the CC is
matched (branch happens) and when it is not matched (falls through)

1. When the CC is matched and branching happens, CPU TIME=2.94 seconds
2. When the CC is not matched the code falls through, CPU TIME=1.69 seconds
   - a reduction of 42%

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN

