Re: div_qr_1 interface

Torbjorn Granlund Mon, 21 Oct 2013 05:50:13 -0700

I looked at the logic following this:

        sbb     U2, U2          C 7 13


You negate the U2 copy in Q2.  It seems that replacing three adc by sbb
could avoid the neg.
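
As a rough illustration of the arithmetic behind that suggestion (not the
actual div_qr_1 instruction sequence), here is a small C sketch.  It assumes
the value left by "sbb U2, U2" is either 0 or all ones, and only shows that
adding the negated mask gives the same result as subtracting the mask.  The
function names are made up for this example, and the sketch ignores the
carry flag, whose sense is inverted when an adc chain becomes an sbb chain.

```c
#include <assert.h>
#include <stdint.h>

/* Mimics "sbb r, r": 0 when carry is 0, all ones when carry is 1. */
static uint64_t carry_mask(unsigned carry)
{
    return (uint64_t)0 - (uint64_t)(carry & 1);
}

/* Variant with the neg: turn the mask into 0 or 1, then add it. */
static uint64_t add_negated_mask(uint64_t x, uint64_t mask)
{
    uint64_t bit = (uint64_t)0 - mask;  /* neg: all ones -> 1, 0 -> 0 */
    return x + bit;
}

/* Variant without the neg: subtract the mask directly. */
static uint64_t sub_mask(uint64_t x, uint64_t mask)
{
    return x - mask;
}

/* Checks that both variants agree modulo 2^64 on a few sample values. */
static void check_variants(void)
{
    const uint64_t samples[] = { 0, 1, 5, UINT64_MAX - 1, UINT64_MAX };
    for (unsigned carry = 0; carry <= 1; carry++) {
        uint64_t mask = carry_mask(carry);
        for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
            assert(add_negated_mask(samples[i], mask)
                   == sub_mask(samples[i], mask));
    }
}
```

Whether this carries over to the real code depends on how the carry out of
each step is consumed, which the sketch does not model.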

It might also be possible to replace the early loop "and" stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips.  (IIRC, only sandybridge, ivybridge, and
haswell run that well.)
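
To make the "and" versus cmov idea concrete, here is a hedged C sketch of
the two styles.  The names are hypothetical and not taken from the div_qr_1
code.  The first function applies a 0 or all-ones mask with "and"; the
second expresses the same choice as a conditional select, which compilers
often (but not always) turn into cmov.

```c
#include <stdint.h>

/* Mask style: mask is 0 or all ones, as left by "sbb r, r". */
static uint64_t select_by_mask(uint64_t mask, uint64_t value)
{
    return mask & value;
}

/* Select style: keep value when carry is set, otherwise use 0.
   A compiler may emit cmov for this, but that is not guaranteed. */
static uint64_t select_by_condition(unsigned carry, uint64_t value)
{
    return carry ? value : 0;
}

/* Returns 1 when both styles give the same result for this input. */
static int selects_agree(unsigned carry, uint64_t value)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(carry & 1);
    return select_by_mask(mask, value)
           == select_by_condition(carry & 1, value);
}
```

In assembly, the select style needs the condition to still be in the flags
when the cmov executes, which is where the point about dec preserving the
carry flag comes in.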

  But one variable must be moved out of the registers. Maybe B2md (used
  once) is the best candidate. Then
  
        lea     (U0, B2md), U1O
  
  would have to be replaced by
  
        mov     (%rsp), U1O     C Can be done very early
          ...
          add   U0, U1O
  
  We then have 26 instructions + loop overhead, or 54 instructions for 2
  iterations. Or possibly DINV, if one thinks the quotient logic is less
  critical.
  
Reading from a stack slot costs nothing under ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register.  That will help with register coalescing.
(T is rather short-lived; perhaps its register could serve two uses?)

-- 
Torbjörn