On 23 December 2015 at 10:46, Jerry Callen <jcal...@narsil.org> wrote:
> I'm in the process of hand-tuning a small, performance critical algorithm on 
> a Z13, and I'm hampered by the lack of detailed information on the 
> instruction-level performance of the machine.

Just to add two thoughts to the several good comments here...

First, what is the nature of your "small, performance critical
algorithm", and why do you see the need to hand tune it? Is it small
in the sense of code size, but the code itself loops, or small but
invoked very frequently, or small and invoked rarely but it really
really has to perform when it is, or...? That's the code; now how
about the data? Lots of it? Dense or sparse? Or a small amount that is
worked on intensively?

To some extent I think answers to this will determine your best course
of action. Not that I think you haven't thought about these things,
but clearly if you have a small routine that's invoked frequently, it
is important to remove as much of the calling overhead as possible so
that it doesn't outweigh your actual code. And if the data is large or
sparse, and the code small, you need to think hard about data cache
performance vs instruction cache, and the cache level at which they
converge/interfere.

Second, in the absence of detailed documentation on the machine (which
I think you will never see for a modern implementation, and which will
in any case change in the next one), you will do well to emulate what
those best informed do: write your routine in (say) C, and look at
what the IBM compiler generates, not just for the latest and greatest
OPT(...) value, but for some lower ones to see what has changed. Of
course this has some problems. In many cases you won't know *why* the
compiler does something, and therefore how to extrapolate to what you
want to do. And the high-level languages lack a mechanism to tell the
compiler much of anything about what *you* know about the code and
data that it can't. There are occasional ways of sneaking hints
through the HLL to the optimizer, such as declaring the size of length
fields so as to limit the cases the compiler has to consider, but
these are accidental and incomplete.

As for your specific question

> * Are there any ways to bypass the L1 cache on moves of less than a page, 
> when simply moving
>  data without looking at it?

there is a cache usage hinting scheme when using the MVCL[E]
instructions: a padding byte value of X'B0' during non-padding
execution hints that you won't be referencing the target soon, and by
implication that <some> cache should be avoided. This has been
documented in the PofO for years, though how a specific model treats
it is, of course, not mentioned there.
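A hedged sketch of what that might look like in assembler (labels, lengths, and register choices are mine; check the current PofO for the exact conditions under which the hint applies):

```hlasm
* Illustrative only: move LEN bytes from SRC to DST with MVCL,
* supplying X'B0' as the padding byte. The lengths are equal, so
* no padding actually occurs and the byte serves purely as the
* cache-usage hint; what a given model does with it is not
* architected.
         LA    2,DST            R2 = destination address
         L     3,LEN            R3 = destination length (24 bits)
         LA    4,SRC            R4 = source address
         L     5,LEN            R5 = source length, same as target
         ICM   5,B'1000',=X'B0' pad byte X'B0' into bits 0-7 of R5
         MVCL  2,4              move; hint: target not needed soon
```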

There is also the more general Next Instruction Access Intent (NIAI)
instruction that may be of use, though naturally its effects are also
described in somewhat more general terms than you might like.
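For completeness, a sketch of the NIAI form (treat this mainly as syntax: the meanings of the 4-bit intent codes are spelled out in the PofO, and their actual cache effects are model-dependent):

```hlasm
* Illustrative only: NIAI immediately precedes the instruction
* whose operand accesses it hints about. I1 covers the next
* instruction's first operand, I2 its second; see the PofO for
* the defined intent codes.
         NIAI  1,0             access-intent hint for next instruction
         L     6,FIELD         the instruction the hint applies to
```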

Tony H.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
