Re: Better tabselect
David Miller writes:

  From: Torbjorn Granlund
  Date: Fri, 12 Apr 2013 10:04:35 +0200

  > I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
  > per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
  > again probably runs at 9/4.

  Hmmm, how to interpret "speed -C" results for mpn_tabselect?  Just
  curious, as that is what I was using.

It is broken; it actually passes invalid arguments, and the dimension is
not representable.  I will check in a fix which allows any geometry N x M
by using

  speed -sN mpn_tabselect.M

where N is the vector length and M is the number of vectors.

--
Torbjörn

___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
Re: Better tabselect
David Miller writes:

  > I sincerely apologise for the odd number of insns in the loop.  :-)

  Easily solved by using the pointer trick on 'tp' and making 'i' instead
  be 'i * stride'.  That'll get us down to 16 instructions.

  I'll try to find time to play with this over the weekend, thanks!

The typical use for this function is RSA encryption/signing, or crypto
operations with even smaller operand sizes.  The vector length will
typically be 512 to 2048 bits (8 to 32 limbs on 64-bit machines, 16 to 64
limbs on 32-bit machines) and the number of vectors will be up to perhaps
32, usually just 16.  I.e., we're dealing with rather small data
structures.  A large loop setup cost will not be compensated by a
slightly faster loop.

--
Torbjörn
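One way to picture the pointer trick in portable C (a sketch under my own reading of the suggestion; `select_column' and its parameters are invented for illustration, not GMP code): the entry counter and the table pointer collapse into a single offset that counts down by the stride, so the inner loop needs one update-and-test instead of two.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

/* Hypothetical sketch of the "pointer trick": 'off' counts down from
   nents*stride to 0 in steps of 'stride_limbs'; the current row is
   addressed as end-of-table minus 'off', so one variable serves as both
   loop counter and table index.  The select itself stays branch-free. */
static mp_limb_t
select_column (const mp_limb_t *tp, size_t nents, size_t stride_limbs,
               size_t which)
{
  const mp_limb_t *end = tp + nents * stride_limbs;
  mp_limb_t acc = 0;
  size_t k = 0;
  for (size_t off = nents * stride_limbs; off != 0; off -= stride_limbs, k++)
    {
      mp_limb_t mask = (mp_limb_t) 0 - (mp_limb_t) (k == which);
      acc |= end[-(ptrdiff_t) off] & mask;  /* masked accumulate */
    }
  return acc;
}
```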
Re: Better tabselect
From: Torbjorn Granlund
Date: Fri, 12 Apr 2013 20:10:13 +0200

> If we could trust movXX to be silent, we should of course not bother to
> create that mask, but replace that and the logops with 4 moveq for a
> great speedup...

BTW, even if we could do this, this would be really bad on Ultra1&2
("moveq" is single issue and 3 cycle latency) and suboptimal on Ultra3&4
(only one "moveq" dispatched per cycle).  Only T4&T5 would really benefit
from this sort of approach.
Re: Better tabselect
From: Torbjorn Granlund
Date: Fri, 12 Apr 2013 17:14:41 +0200

> I'd suggest to use the loop below for sparc64.  It limits `which' to be
> < 2^32 by creating the mask based on 32-bit comparison.  It would be
> possible to replace "subcc o1,1,o1; subc ..." by "addcc o1,-1,o1; addxc
> ..." for newer chips, but I think that's no use.

Looks good.

> I sincerely apologise for the odd number of insns in the loop.  :-)

Easily solved by using the pointer trick on 'tp' and making 'i' instead
be 'i * stride'.  That'll get us down to 16 instructions.

I'll try to find time to play with this over the weekend, thanks!
Re: Better tabselect
From: Torbjorn Granlund
Date: Fri, 12 Apr 2013 10:04:35 +0200

> I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
> per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
> again probably runs at 9/4.

Hmmm, how to interpret "speed -C" results for mpn_tabselect?  Just
curious, as that is what I was using.
Re: Better tabselect
David Miller writes:

  From: Torbjorn Granlund
  Date: Fri, 12 Apr 2013 19:30:42 +0200

  > David Miller writes:
  >
  > > It isn't really conditional execution on sparc, the resources and
  > > timing required for the "move" instruction are constant whether the
  > > condition matches or not.
  >
  > That's not enough.
  >
  > It needs to have the same data-dependency behaviour too.
  >
  > And it needs to be a documented feature, since it is easy to imagine
  > an implementation which does create different data-dependency
  > behaviour, even if all current ones do not.
  >
  > Using carry seems safer.  Of course, it is conceivable that a
  > carry-dependent instruction could run differently depending on the
  > value of the carry flag, but that'd be quite strange.

  Even on a pipeline that dynamically renames the condition codes?

  You know, something we've never seen before :-)

I am not worrying mainly about WAR and WAW dependencies, but about plain
RAW dependencies.  Does the movXX read the source register or not, for a
false condition?  Does it read the destination, or does it only write it
(for a true condition)?

If we could trust movXX to be silent, we should of course not bother to
create that mask, but replace that and the logops with 4 moveq for a
great speedup...

  > (For generic sparc64 code, we should perhaps avoid movXX for other
  > reasons too, since it interacts poorly with loads on US1-2, IIRC.)

  Now that's a more solid reason to avoid movcc.

Keeping cnd functions and tabselect side-channel silent is crucial.
(And while not specifically designed for side-channel silence, also
mul_k, addmul_k and submul_k should be silent.)

--
Torbjörn
Re: Better tabselect
From: Torbjorn Granlund
Date: Fri, 12 Apr 2013 19:30:42 +0200

> David Miller writes:
>
> > It isn't really conditional execution on sparc, the resources and
> > timing required for the "move" instruction are constant whether the
> > condition matches or not.
>
> That's not enough.
>
> It needs to have the same data-dependency behaviour too.
>
> And it needs to be a documented feature, since it is easy to imagine an
> implementation which does create different data-dependency behaviour,
> even if all current ones do not.
>
> Using carry seems safer.  Of course, it is conceivable that a
> carry-dependent instruction could run differently depending on the
> value of the carry flag, but that'd be quite strange.

Even on a pipeline that dynamically renames the condition codes?

You know, something we've never seen before :-)

> (For generic sparc64 code, we should perhaps avoid movXX for other
> reasons too, since it interacts poorly with loads on US1-2, IIRC.)

Now that's a more solid reason to avoid movcc.
Re: Better tabselect
David Miller writes:

  It isn't really conditional execution on sparc, the resources and
  timing required for the "move" instruction are constant whether the
  condition matches or not.

That's not enough.

It needs to have the same data-dependency behaviour too.

And it needs to be a documented feature, since it is easy to imagine an
implementation which does create different data-dependency behaviour,
even if all current ones do not.

Using carry seems safer.  Of course, it is conceivable that a
carry-dependent instruction could run differently depending on the value
of the carry flag, but that'd be quite strange.

(For generic sparc64 code, we should perhaps avoid movXX for other
reasons too, since it interacts poorly with loads on US1-2, IIRC.)

--
Torbjörn
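The carry idea above can be sketched in portable C (this is an illustration, not GMP's code; the function name is made up): subtracting 1 from a counter borrows exactly when the counter is 0, and a subtract-with-borrow of 0 - 0 then leaves an all-ones word, which is the "subcc %o1,1,%o1; subc %g0,%g0,mask" pattern discussed in this thread.

```c
#include <stdint.h>

/* Branch-free all-ones/all-zeros mask: returns ~0 when o1 == 0, else 0.
   Mirrors the sparc borrow trick without any conditional move, so the
   data dependencies are the same for both outcomes. */
static uint64_t
zero_mask (uint64_t o1)
{
  return (uint64_t) 0 - (uint64_t) (o1 == 0);
}
```

The comparison compiles to a flag-materializing instruction on common targets, but whether a given compiler keeps it branch-free is of course its own decision; asm remains the safe route for side-channel silence.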
Re: Better tabselect
From: Torbjorn Granlund
Date: Fri, 12 Apr 2013 10:04:35 +0200

> David Miller writes:
>
> > The existing C code approaches 6 cycles/limb on T4, the best I can do
> > without pipelining with this new approach at 4 way unrolling is ~4.5
> > cycles/limb:
>
> This function needs to leak no side-channel information of actual
> operand contents.  Conditional execution is a no-no.  Instead, we need
> to form the mask using arithmetic operations.

It isn't really conditional execution on sparc, the resources and timing
required for the "move" instruction are constant whether the condition
matches or not.

> I am quite ignorant about VIS, but doesn't that allow 128-bit
> operations?

The largest vector element on sparc VIS is 32-bits, and the largest
vectors are 64-bits.

And even if the elements were large enough, the instructions provided to
do this violate the side-channel restriction.  What you'd do is do a VIS
comparison to generate a store mask, and then do a partial store
instruction using that mask to conditionally store the elements into the
'rp' array.

	fpcmpXXX	xxx, %g1
	stda	%fN, [rp + %g1] ASI_PSTXXX

But that conditional store would change the timing between the case where
we select and the case where we do not.

> Here is a tabselect test program:

Thanks!
Re: Better tabselect
I'd suggest to use the loop below for sparc64.  It limits `which' to be
< 2^32 by creating the mask based on 32-bit comparison.  It would be
possible to replace "subcc o1,1,o1; subc ..." by "addcc o1,-1,o1; addxc
..." for newer chips, but I think that's no use.

I sincerely apologise for the odd number of insns in the loop.  :-)

(Note that we support >= 2^32 operand sizes in mpn for 64-bit chips, but
that doesn't mean that the number of vectors in tabselect need to be that
large.)

	sllx	n, 3, stride
	mov	tp, tporig
	sub	n, 4, j
	brlz	j, L(outer_end)
	 nop

L(outer_loop):
	clr	data0
	clr	data1
	clr	data2
	clr	data3
	mov	tporig, tp
	mov	nents, i
	mov	which, %o1

L(top):	subcc	%o1, 1, %o1	C set carry iff o1 = 0
	ldx	[tp + 0], t0
	subc	%g0, %g0, mask
	ldx	[tp + 8], t1
	sub	i, 1, i
	ldx	[tp + 16], t2
	ldx	[tp + 24], t3
	add	tp, stride, tp
	and	t0, mask, t0
	and	t1, mask, t1
	or	t0, data0, data0
	and	t2, mask, t2
	or	t1, data1, data1
	and	t3, mask, t3
	or	t2, data2, data2
	brnz	i, L(top)
	 or	t3, data3, data3

	stx	data0, [rp + 0]
	subcc	j, 4, j
	stx	data1, [rp + 8]
	stx	data2, [rp + 16]
	stx	data3, [rp + 24]
	add	tporig, (4 * 8), tporig
	brgez	j, L(outer_loop)
	 add	rp, (4 * 8), rp

L(outer_end):

--
Torbjörn
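For readers not fluent in sparc asm, the loop above can be rendered in portable C (a simplified sketch without the 4-way unrolling, not GMP's actual mpn_tabselect): every table entry is loaded, ANDed with a mask that is all ones only for the selected row, and ORed into the result, so the memory traffic and instruction stream are independent of `which'.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t mp_limb_t;

/* Reference tabselect: copy row 'which' of an nents x n table at tp
   into rp, touching every entry so nothing about 'which' leaks through
   timing or the access pattern. */
static void
tabselect_ref (mp_limb_t *rp, const mp_limb_t *tp, size_t n,
               size_t nents, size_t which)
{
  for (size_t j = 0; j < n; j++)
    {
      mp_limb_t acc = 0;
      for (size_t i = 0; i < nents; i++)
        {
          /* all-ones mask iff this is the selected row */
          mp_limb_t mask = (mp_limb_t) 0 - (mp_limb_t) (i == which);
          acc |= tp[i * n + j] & mask;
        }
      rp[j] = acc;
    }
}
```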
Re: mpq_get_d_nearest
Marc,

> Date: Fri, 12 Apr 2013 11:28:01 +0200 (CEST)
> From: Marc Glisse
>
> On Fri, 12 Apr 2013, Zimmermann Paul wrote:
>
> > the mpq_get_d function rounds towards zero (i.e., truncates).
> > In several applications, people usually prefer rounding to nearest.
> > Is it planned to provide a function (say mpq_get_d_nearest) for that?
> > I could contribute it, based on the code below (which does not deal
> > with subnormal numbers yet).
> >
> > We could also have rounding towards -infinity and +infinity, which
> > would be useful for people doing interval arithmetic.
>
> Preferably with a single function that returns both roundings :-)
> That would allow us to remove this slow code from CGAL (and stop
> linking with mpfr in many cases):
>
>   operator()( const mpq_class& x ) const {
>     mpfr_t y;
>     mpfr_init2 (y, 53); /* Assume IEEE-754 */
>     mpfr_set_q (y, x.get_mpq_t(), GMP_RNDD);
>     double i = mpfr_get_d (y, GMP_RNDD); /* EXACT but can overflow */
>     mpfr_set_q (y, x.get_mpq_t(), GMP_RNDU);
>     double s = mpfr_get_d (y, GMP_RNDU); /* EXACT but can overflow */
>     mpfr_clear (y);
>     return std::pair<double, double>(i, s);
>   }
>
> (I hadn't looked at that code in a while, it could at least use
> MPFR_DECL_INIT I guess, but best would be not needing mpfr)
>
> --
> Marc Glisse

note that in the above code, the return value of mpfr_set_q (which CGAL
ignores) is enough to get away with a single call to mpfr_get_d, followed
by a call to nextafter.  But this is now off-topic, sorry...

Paul
Re: mpq_get_d_nearest
On Fri, 12 Apr 2013, Zimmermann Paul wrote:

> the mpq_get_d function rounds towards zero (i.e., truncates).
> In several applications, people usually prefer rounding to nearest.
> Is it planned to provide a function (say mpq_get_d_nearest) for that?
> I could contribute it, based on the code below (which does not deal
> with subnormal numbers yet).
>
> We could also have rounding towards -infinity and +infinity, which
> would be useful for people doing interval arithmetic.

Preferably with a single function that returns both roundings :-)
That would allow us to remove this slow code from CGAL (and stop linking
with mpfr in many cases):

  operator()( const mpq_class& x ) const {
    mpfr_t y;
    mpfr_init2 (y, 53); /* Assume IEEE-754 */
    mpfr_set_q (y, x.get_mpq_t(), GMP_RNDD);
    double i = mpfr_get_d (y, GMP_RNDD); /* EXACT but can overflow */
    mpfr_set_q (y, x.get_mpq_t(), GMP_RNDU);
    double s = mpfr_get_d (y, GMP_RNDU); /* EXACT but can overflow */
    mpfr_clear (y);
    return std::pair<double, double>(i, s);
  }

(I hadn't looked at that code in a while, it could at least use
MPFR_DECL_INIT I guess, but best would be not needing mpfr)

--
Marc Glisse
mpq_get_d_nearest
Hi,

the mpq_get_d function rounds towards zero (i.e., truncates).
In several applications, people usually prefer rounding to nearest.
Is it planned to provide a function (say mpq_get_d_nearest) for that?
I could contribute it, based on the code below (which does not deal with
subnormal numbers yet).

We could also have rounding towards -infinity and +infinity, which would
be useful for people doing interval arithmetic.

Paul

double
mpq_get_d_nearest (mpq_t q)
{
  mpz_ptr a = mpq_numref (q);
  mpz_ptr b = mpq_denref (q);
  size_t sa = mpz_sizeinbase (a, 2);
  size_t sb = mpz_sizeinbase (b, 2);
  size_t na, nb;
  mpz_t aa, bb;
  double d;

  /* easy case: |a|, |b| < 2^53, no overflow nor underflow can occur */
  if (sa <= 53 && sb <= 53)
    return mpz_get_d (a) / mpz_get_d (b);

  /* same if a = m*2^e with m representable on 53 bits, idem for b, but
     beware that both a and b do not give an overflow */
  na = sa - mpz_scan1 (a, 0);
  nb = sb - mpz_scan1 (b, 0);
  if (sa <= 1024 && na <= 53 && sb <= 1024 && nb <= 53)
    return mpz_get_d (a) / mpz_get_d (b);

  /* hard case */
  mpz_init (aa);
  mpz_init (bb);
  if (sa >= sb)
    {
      mpz_set (aa, a);
      mpz_mul_2exp (bb, b, sa - sb);
    }
  else
    {
      mpz_mul_2exp (aa, a, sb - sa);
      mpz_set (bb, b);
    }

  /* q = aa/bb*2^(sa-sb) */
  if (mpz_cmpabs (aa, bb) >= 0)
    {
      mpz_mul_2exp (bb, bb, 1);
      sa ++;
    }

  mpz_mul_2exp (aa, aa, 54);
  sb += 54;
  mpz_tdiv_qr (aa, bb, aa, bb);

  /* the quotient aa should have exactly 54 bits */
  if (mpz_tstbit (aa, 0) == 0)
    {
    }
  else if (mpz_cmp_ui (bb, 0) != 0)
    {
      if (mpz_sgn (aa) > 0)
        mpz_add_ui (aa, aa, 1);
      else
        mpz_sub_ui (aa, aa, 1);
    }
  else /* mid case: round to even */
    {
      if (mpz_tstbit (aa, 1) == 0)
        {
          if (mpz_sgn (aa) > 0)
            mpz_sub_ui (aa, aa, 1);
          else
            mpz_add_ui (aa, aa, 1);
        }
      else
        {
          if (mpz_sgn (aa) > 0)
            mpz_add_ui (aa, aa, 1);
          else
            mpz_sub_ui (aa, aa, 1);
        }
    }

  d = mpz_get_d (aa); /* exact */
  mpz_clear (aa);
  mpz_clear (bb);
  return ldexp (d, (long) sa - (long) sb);
}
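The tie-breaking logic in the "hard case" above can be isolated on plain integers (a sketch for illustration; `div_nearest_even' is not part of the proposal): round n/d to the nearest integer, and break an exact half-way tie towards the even candidate, which is what IEEE round-to-nearest does for the 53-bit significand.

```c
/* Round n/d (assuming n, d > 0) to the nearest integer, ties to even. */
static long long
div_nearest_even (long long n, long long d)
{
  long long q = n / d;          /* truncated quotient */
  long long r = n % d;          /* remainder, 0 <= r < d */
  if (2 * r > d || (2 * r == d && (q & 1)))
    q++;                        /* round up past halfway, or at an exact
                                   tie when the quotient is odd */
  return q;
}
```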
Re: Better tabselect
David Miller writes:

  The existing C code approaches 6 cycles/limb on T4, the best I can do
  without pipelining with this new approach at 4 way unrolling is ~4.5
  cycles/limb:

This function needs to leak no side-channel information of actual operand
contents.  Conditional execution is a no-no.  Instead, we need to form
the mask using arithmetic operations.

The critical information not to be leaked for tabselect is which vector
is chosen, i.e., the value of the `which' parameter.

We need to use something like "sub x,y,z, negcc z, subc %g0,%g0,%mask"
rather than your mask creation code.  Just like my x86-sse code, this
will not allow table widthes (strange plural, does it exist?) greater
than 2^32, but that's OK.

I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
again probably runs at 9/4.  (Both these CPUs would benefit from sparser
ldx scheduling, of course.  I don't think it is worth the effort, at
least not at this point.)

I am quite ignorant about VIS, but doesn't that allow 128-bit operations?

Here is a tabselect test program:

(attachment: test-tabselect.c)

--
Torbjörn