On 23 December 2015 at 10:46, Jerry Callen <jcal...@narsil.org> wrote:

> I'm in the process of hand-tuning a small, performance critical algorithm on
> a z13, and I'm hampered by the lack of detailed information on the
> instruction-level performance of the machine.
Just to add two thoughts to the several good comments here...

First, what is the nature of your "small, performance critical algorithm", and why do you see the need to hand-tune it? Is it small in the sense of code size, but the code itself loops? Or small but invoked very frequently? Or small and invoked rarely, but it really, really has to perform when it is? Or...?

That's the code; now how about the data? Lots of it? Dense or sparse? Or a small amount that is worked on intensively? To some extent I think the answers to these questions will determine your best course of action. Not that I think you haven't considered these things, but clearly if you have a small routine that is invoked frequently, it is important to remove as much of the calling overhead as possible so that it doesn't outweigh your actual code. And if the data is large or sparse, and the code small, you need to think hard about data-cache versus instruction-cache performance, and the cache level at which the two converge or interfere.

Second, in the absence of detailed documentation on the machine (which I think you will never see for a modern implementation, and which will in any case change in the next one), you would do well to emulate what those best informed do: write your routine in (say) C, and look at what the IBM compiler generates -- not just for the latest and greatest OPT(...) value, but for some lower ones, to see what changes. Of course this approach has its problems. In many cases you won't know *why* the compiler does something, and therefore how to extrapolate to what you want to do. And the high-level languages lack a mechanism for telling the compiler much of what *you* know about the code and data that it can't deduce. There are occasional ways of sneaking hints through the HLL to the optimizer, such as declaring the size of length fields so as to limit the cases the compiler has to consider, but these are accidental and incomplete.
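As a concrete (hypothetical) illustration of sneaking a hint through the HLL: the two functions below differ only in the declared type of the length field. The function names are mine, not from the post, and which code sequence a given compiler actually emits is something you must verify from its listing.

```c
/* Hypothetical sketch: the length TYPE is the hint. With a size_t
 * length the compiler must handle arbitrary sizes (typically a call
 * to the library memcpy or a looped move); with an unsigned char
 * length it can prove len <= 255 and may emit a short inline move
 * sequence instead. Inspect the listings to see what it really does. */
#include <stddef.h>
#include <string.h>

void copy_any(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);            /* length unbounded: general case */
}

void copy_short(char *dst, const char *src, unsigned char len)
{
    memcpy(dst, src, len);            /* compiler knows len <= 255 */
}
```

To compare what the compiler does at different optimization levels, request the pseudo-assembly listing (e.g. `-qlist` with IBM XL C, or `-S` with gcc) and diff the output across OPT levels.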
As for your specific question:

> * Are there any ways to bypass the L1 cache on moves of less than a page,
>   when simply moving data without looking at it?

there is a cache-usage hinting scheme for the MVCL[E] instructions: a padding-byte value of X'B0' during a non-padding execution hints that you won't be referencing the target soon, and by implication that <some> cache should be avoided. This has been documented in the Principles of Operation for years; how a specific model treats it is, of course, not mentioned there. There is also the more general NEXT INSTRUCTION ACCESS INTENT (NIAI) instruction, which may be of use, though naturally its effects are also described in somewhat more general terms than you might like.

Tony H.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
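[Editorial sketch, not part of the original post.] The MVCL pad-byte hint described above might be exercised from GNU C roughly as follows. Everything here is an assumption to be checked against the Principles of Operation: the register-pair layout, the pad byte sitting in bits 32-39 of the second-operand length register, and the function name `move_no_cache` are all mine. On any non-s390x build the code simply falls back to memcpy, with no hint.

```c
/* Hypothetical sketch of the MVCL X'B0' pad-byte cache hint.
 * Assumed z/Architecture register layout (verify in the PoOp):
 *   MVCL R2,R4 uses two even-odd pairs --
 *   r2 = target address, r3 = target length (bits 40-63),
 *   r4 = source address, r5 = source length, with the padding byte
 *        in bits 32-39; X'B0' there is the cache-avoidance hint.
 * Non-s390x builds take the portable memcpy fallback (no hint). */
#include <stddef.h>
#include <string.h>

static void move_no_cache(void *dst, const void *src, size_t len)
{
#if defined(__s390x__) && defined(__GNUC__)
    register unsigned long r2 asm("2") = (unsigned long)dst;
    register unsigned long r3 asm("3") = len;
    register unsigned long r4 asm("4") = (unsigned long)src;
    register unsigned long r5 asm("5") =
        len | (0xB0UL << 24);      /* pad byte X'B0' in bits 32-39 */
    asm volatile("mvcl %%r2,%%r4"
                 : "+d" (r2), "+d" (r3), "+d" (r4), "+d" (r5)
                 : : "cc", "memory");
#else
    memcpy(dst, src, len);         /* portable fallback, no hint */
#endif
}
```

Since the lengths here are equal, no padding actually occurs; per the post, the X'B0' value is purely a hint during non-padding execution, so the result is an ordinary copy either way.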