Ciao,

On 2021-06-03 12:40, Torbjörn Granlund wrote:
> If we dare use cmov (and its presumed side-channel leakage) we could
> probably shorten the critical path by a cycle.  The "sbb" and "and"
> would go away.

Using masks does not always give the fastest code. I tried the following variation on Niels' code; on my laptop, compiled with "g++-10 -O2 -mtune=icelake-client -march=icelake-client", the resulting C code is comparable to (faster than?) the current asm.

*************** mpn_div_qr_1n_pi1 (mp_ptr qp, mp_srcptr
*** 245,266 ****
         * +      | q0|
         *   -+---+---+---+
         *    | q2| q1| q0|
         *    +---+---+---+
        */
!       umul_ppmm (p1, t, u1, dinv);
!       add_ssaaaa (q2, q1, -u2, u2 & dinv, CNST_LIMB(0), u1);
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), p1);
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), q0);
!       q0 = t;

        umul_ppmm (p1, p0, u1, B2);
-       ADDC_LIMB (cy, u0, u0, u2 & B2);
-       u0 -= (-cy) & d;

        /* Final q update */
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), cy);
        qp[j+1] = q1;
        MPN_INCR_U (qp+j+2, n-j-2, q2);

        add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);
      }
--- 245,264 ----
         * +      | q0|
         *   -+---+---+---+
         *    | q2| q1| q0|
         *    +---+---+---+
        */
!       ADDC_LIMB (q2, q1, q0, u1);
!       umul_ppmm (t, q0, u1, dinv);
!       ADDC_LIMB (cy, u0, u0, u2 ? B2 : 0);
!       u0 -= cy ? d : 0;
!       add_ssaaaa (q2, q1, q2, q1, -u2, u2 ? dinv : 0);

        umul_ppmm (p1, p0, u1, B2);

        /* Final q update */
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), t + cy);
        qp[j+1] = q1;
        MPN_INCR_U (qp+j+2, n-j-2, q2);

        add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);
      }
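
The substitution at the core of the patch above can be isolated in a small sketch. It assumes 64-bit limbs and that u2 is always either 0 or all ones, as a mask; limb_t, step_mask and step_select are hypothetical names used only for illustration, not GMP identifiers. Whether the compiler turns the second form into cmov depends on the compiler and its flags.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* stand-in for mp_limb_t (assumption: 64-bit limbs) */

/* Mask form: add B2 only when the mask u2 is all ones, then
   subtract d only when that addition carried out. */
static limb_t step_mask(limb_t u0, limb_t u2, limb_t B2, limb_t d)
{
  limb_t s  = u0 + (u2 & B2);
  limb_t cy = s < u0;                  /* carry out of the addition, 0 or 1 */
  return s - (((limb_t) 0 - cy) & d);  /* 0 - cy is 0 or all ones */
}

/* Conditional form: same arithmetic, written with ?: so that the
   compiler is free to use a conditional move instead of masking. */
static limb_t step_select(limb_t u0, limb_t u2, limb_t B2, limb_t d)
{
  limb_t s  = u0 + (u2 ? B2 : 0);
  limb_t cy = s < u0;
  return s - (cy ? d : 0);
}

/* Both forms must agree whenever u2 is a proper mask. */
static void check_step(limb_t u0, limb_t u2, limb_t B2, limb_t d)
{
  assert(u2 == 0 || u2 == ~(limb_t) 0);
  assert(step_mask(u0, u2, B2, d) == step_select(u0, u2, B2, d));
}
```

Calling check_step with a few operands, with u2 both zero and all ones, confirms that the two forms compute the same limb; only the generated instructions, and hence timing and possible leakage, differ.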

$ build/tune/speed -p10000000 -s1-100 -f1.6 -C mpn_div_qr_1n_pi1.9999999999999999999 ...
size           ASM-code       C-code
1              2.1227         3.6125
2              3.1758         3.9425
3              3.4567         3.8861
4              3.4758         3.8606
6              3.7857         3.8764
9              3.9912         3.9676
14             4.0304         4.0531
22             4.3461         4.1798
35             4.4161         4.2080
56             4.4744         4.2833
89             4.4896         4.2950

> (I am a bit fixated with side-channel leakage; our present
> implementations of these particular functions are not side-channel
> silent.)

We should write a fast version, and then a sec_ one :-)

Ĝis,
m
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel