Marius Hillenbrand <mhil...@linux.ibm.com> writes:

> Most notably, I changed the order so that the mlgr's are next to each
> other. The reason is that decode and dispatch happen in two "groups" of
> up to three instructions each, with each group going into one of the two
> "issue sides" of the core (both are symmetric and have the same set of
> issue ports since z13). For some instructions, grouping is restricted --
> that includes mlgr, which will be alone in a group. Thus, placing two
> mlgr's next to each other ensures that they spread across both issue
> sides and exploit both multiply units.
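For readers following along, the per-limb operation under discussion --
rp[] += up[] * v with carry propagation -- can be sketched in portable C.
Each unsigned __int128 multiply below corresponds to one mlgr, a 64x64 ->
128-bit multiply into an even/odd register pair. This is illustrative
reference code, not the tuned loop:

```c
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference addmul_1: rp[] += up[] * v over n limbs; returns the
   carry-out limb.  Each __int128 multiply is what one mlgr computes:
   a 64x64 -> 128-bit product delivered in a register pair. */
mp_limb_t
ref_addmul_1 (mp_limb_t *rp, const mp_limb_t *up, long n, mp_limb_t v)
{
  mp_limb_t cy = 0;
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 t = (unsigned __int128) up[i] * v;  /* mlgr */
      t += rp[i];                 /* add in the existing limb */
      t += cy;                    /* propagate carry from previous limb */
      rp[i] = (mp_limb_t) t;      /* low 64 bits */
      cy = (mp_limb_t) (t >> 64); /* high 64 bits become next carry */
    }
  return cy;
}
```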
This sort of restriction is hard to deal with. What happens around
branches with these grouping restrictions? Could an incoming branch (such
as the loop branch) have one group from before the branch and one after
the branch, thus invalidating the assumption of adjacent mlgr's going to
different groups?

I've seen cases (with other pipelines) where a loop can take completely
different times depending on some parity-like issue condition upon loop
entry. It never recovers in the "bad" case.

> In my experiments, incrementing and comparing a single "idx" turned out
> beneficial over incrementing the pointers and decrementing n separately.

Doesn't brctg, with its awareness of the induction variable, help branch
prediction in such a way that not only is the branch back accurately
predicted, but also the final fall-through? OK, using brctg and whether
to use idx are perhaps orthogonal questions.

> Similarly, using 128-bit adds in vector registers performs better than
> alcgr + algr. One factor is that alcgr must be alone in a dispatch
> group, same as mlgr. Given the number of alcgrs we would need, the
> 128-bit add wins. For comparison, vacq and vacccq also have a grouping
> limitation -- only two of them can be in a group. However, that means
> we can fit a 128-bit add with carry in and out in one dispatch group,
> instead of just a 64-bit add.

I wrote a 4x unrolled addmul_1 a while back, timing it on a z196 (yes,
an old system, but that's the hardware to which I have convenient
access). It is 60% faster than the existing code; it takes 5 cycles/limb
whereas the old code takes 8 cycles/limb. The code is attached.

(I hope more recent machines get much better cycles/limb numbers. Many
machines (x86, POWER, Apple M1) today are approaching 1 cycle/limb.)
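As a note on the semantics (a C model, not the actual vector
intrinsics): vacq adds two 128-bit operands plus a carry-in bit, and
vacccq computes the carry-out of that same addition, so the pair chains
128-bit adds the way alcgr chains 64-bit ones. The helper names below
are invented for illustration, with unsigned __int128 standing in for a
vector register:

```c
/* Illustrative models of the 128-bit add-with-carry pair; function
   names are made up, and unsigned __int128 stands in for a 128-bit
   vector register. */
typedef unsigned __int128 u128;

static u128
vacq_model (u128 a, u128 b, u128 cy_in)   /* sum; cy_in is 0 or 1 */
{
  return a + b + cy_in;
}

static u128
vacccq_model (u128 a, u128 b, u128 cy_in) /* carry-out, 0 or 1 */
{
  u128 s = a + b;
  u128 t = s + cy_in;
  return (u128) ((s < a) | (t < s));      /* overflow in either step */
}
```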
> To improve performance notably (~40% over my initial patch on z15), my
> currently best-performing implementation maintains the same instruction
> sequence (mlgr + vacq for two limbs at a time) as our previous
> attempts, yet unrolls to 8 limbs per iteration with software
> pipelining of the different stages (load, multiply, add, and so on).
> Unrolling even more did not improve performance.

Did you get rid of the lgr of the carry limb? That should not be too
hard. The code attached does that.

What is the performance improvement for going from 4x to 8x unrolling?

Be careful about the lead-in times too. With deep unrolling, one needs a
table indexed by n % W, where W is the unrolling arity. I split up my
4-way code into two similar loop blocks. That makes entry into the
middle of the loop possible. For 8x, such an approach would avoid huge
feed-in code. (Code attached.)

> While this variant helped a lot in debugging and tweaking parameters
> and schedule, it is hackish and brittle (e.g., the empty asm("")s help
> define instruction scheduling, yet GCC may change how it handles them
> over time). Further, I suspect there may be performance gains left in
> hand-tweaking the assembly code.

I agree that we should use asm to avoid the performance brittleness of
the C code.

> So, for integrating this implementation into GMP, I propose adding
> both the resulting assembly variant and that C code for reference or
> future improvements. What do you think?

We might include the C code as a comment in the asm.

Two attachments: the 4-way code with possible mid-loop entry, and a
4-way addmul_1 using just plain registers.
z14-addmul_1-ur4b.asm
Description: Binary data
s390-addmul_1-ur4.asm
Description: Binary data
-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel