Marius Hillenbrand <mhil...@linux.ibm.com> writes:

> Most notably, I changed the order so that the mlgr's are next to each
> other. The reason is that decode and dispatch happen in two "groups" of
> up to three instructions each, with each group going into one of the two
> "issue sides" of the core (both are symmetric and have the same set of
> issue ports since z13). For some instructions, grouping is restricted --
> that includes mlgr, which will be alone in a group. Thus, placing two
> mlgr's next to each other ensures that they spread across both issue
> sides and exploit both multiply units.
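For readers following along, the per-limb operation under discussion --
rp[] += up[] * v with carry propagation -- can be sketched in portable C.
Each unsigned __int128 multiply below corresponds to one mlgr, a 64x64 ->
128-bit multiply into an even/odd register pair. This is illustrative
reference code, not the tuned loop:

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference addmul_1: rp[] += up[] * v over n limbs; returns the
   carry-out limb.  Each __int128 multiply is what one mlgr computes:
   a 64x64 -> 128-bit product delivered in a register pair. */
mp_limb_t
ref_addmul_1 (mp_limb_t *rp, const mp_limb_t *up, long n, mp_limb_t v)
{
  mp_limb_t cy = 0;
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 t = (unsigned __int128) up[i] * v;  /* mlgr */
      t += rp[i];                 /* add in the existing limb */
      t += cy;                    /* propagate carry from previous limb */
      rp[i] = (mp_limb_t) t;      /* low 64 bits */
      cy = (mp_limb_t) (t >> 64); /* high 64 bits become next carry */
    }
  return cy;
}
```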
This sort of restriction is hard to deal with. What happens around
branches with these grouping restrictions? Could an incoming branch (such
as the loop branch) have one group from before the branch and one after
the branch, thus invalidating the assumption of adjacent mlgr's going to
different groups?

I've seen cases (with other pipelines) where a loop can take completely
different times depending on some parity-like issue condition upon loop
entry. It never recovers in the "bad" case.

> In my experiments, incrementing and comparing a single "idx" turned out
> beneficial over incrementing the pointers and decrementing n separately.

Doesn't brctg, with its awareness of the induction variable, help branch
prediction in such a way that not only is the branch back accurately
predicted, but also the final fall-through? OK, using brctg and whether
to use idx are perhaps orthogonal questions.

> Similarly, using 128-bit adds in vector registers performs better than
> alcgr + algr. One factor is that alcgr must be alone in a dispatch
> group, same as mlgr. Given the number of alcgrs we would need, the
> 128-bit add wins. For comparison, vacq and vacccq also have a grouping
> limitation -- only two of them can be in a group. However, that means
> we can fit a 128-bit add with carry in and out in one dispatch group,
> instead of just a 64-bit add.

I wrote a 4x unrolled addmul_1 a while back, timing it on a z196 (yes,
an old system, but that's the hardware to which I have convenient
access). It is 60% faster than the existing code; it takes 5 cycles/limb
whereas the old code takes 8 cycles/limb. The code is attached.

(I hope more recent machines get much better cycles/limb numbers. Many
machines (x86, POWER, Apple M1) today are approaching 1 cycle/limb.)
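As a note on the semantics (a C model, not the actual vector
intrinsics): vacq adds two 128-bit operands plus a carry-in bit, and
vacccq computes the carry-out of that same addition, so the pair chains
128-bit adds the way alcgr chains 64-bit ones. The helper names below
are invented for illustration, with unsigned __int128 standing in for a
vector register:

```c
/* Illustrative models of the 128-bit add-with-carry pair; function
   names are made up, and unsigned __int128 stands in for a 128-bit
   vector register. */
typedef unsigned __int128 u128;

static u128
vacq_model (u128 a, u128 b, u128 cy_in)   /* sum; cy_in is 0 or 1 */
{
  return a + b + cy_in;
}

static u128
vacccq_model (u128 a, u128 b, u128 cy_in) /* carry-out, 0 or 1 */
{
  u128 s = a + b;
  u128 t = s + cy_in;
  return (u128) ((s < a) | (t < s));      /* overflow in either step */
}
```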
> To improve performance notably (~40% over my initial patch on z15), my
> currently best-performing implementation maintains the same instruction
> sequence (mlgr + vacq for two limbs at a time) as our previous
> attempts, yet unrolls to 8 limbs per iteration with software
> pipelining of the different stages (load, multiply, add, and so on).
> Unrolling even more did not improve performance.

Did you get rid of the lgr of the carry limb? That should not be too
hard. The code attached does that.

What is the performance improvement for going from 4x to 8x unrolling?

Be careful about the lead-in times too. With deep unrolling, one needs a
table indexed by n % W, where W is the unrolling arity. I split up my
4-way code into two similar loop blocks. That makes entry into the
middle of the loop possible. For 8x, such an approach would avoid huge
feed-in code. (Code attached.)

> While this variant helped a lot in debugging and tweaking parameters
> and schedule, it is hackish and brittle (e.g., the empty asm("")s help
> define instruction scheduling, yet GCC may change how it handles them
> over time). Further, I suspect there may be performance gains left in
> hand-tweaking the assembly code.

I agree that we should use asm to avoid the performance brittleness of
the C code.

> So, for integrating this implementation into GMP, I propose adding
> both the resulting assembly variant and that C code for reference or
> future improvements. What do you think?

We might include the C code as a comment in the asm.

Two attachments: the 4-way code with possible mid-loop entry, and a
4-way addmul_1 using just plain registers.
z14-addmul_1-ur4b.asm
Description: Binary data
s390-addmul_1-ur4.asm
Description: Binary data
-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel