Torbjorn Granlund <t...@gmplib.org> writes:

> On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
> slower than load+op.  (I understand the register shortage problem.)

The following (untested) variant needs one register too many.

  UP, QP, UN:           Load, store, loop counter.
  DINV, B2, B2md:       Loop invariant constants.
  U2, U1I, U0, Q1I, Q0: Inputs.
  U1O, Q1O:             Outputs.
  Q2, %rax, %rdx:       Locals.

Also U1I -> U1O recurrency chain (with opteron cycle counts)

        mov     U2, Q2
        mov     U2, Q1O
        neg     Q2
        mov     DINV, %rax
        and     DINV, Q1O
        mul     U1I
        add     Q0, Q1O
        adc     $0, Q2
        mov     %rax, Q0
        add     %rdx, Q1O
        adc     $0, Q2

        mov     B2, %rax
        and     B2, U2
        mul     U1I             C 0 6
        lea     (U0, B2md), U1O
        add     U0, U2
        cmovnc  U2, U1O
        adc     U1I, Q1O
        adc     Q1I, Q2
        mov     Q2, (QP, UN, 8)
        jc      L(incr)

L(incr_done):
        mov     (UP, UN, 8), U0
        add     %rax, U0        C 4
        adc     %rdx, U1O       C 5
        sbb     U2, U2          C 6

25 instructions (27 K10 decoder slots) excluding loop overhead.

But one variable must be moved out of the registers. Maybe B2md (used
once) is the best candidate. Then

        lea     (U0, B2md), U1O

would have to be replaced by

        mov     (%rsp), U1O     C Can be done very early
        ...
        add     U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Or possibly DINV, if one thinks the quotient logic is less
critical.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to