Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller  writes:

  From: Torbjorn Granlund 
  Date: Fri, 12 Apr 2013 10:04:35 +0200
  
  > I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
  > per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
  > again probably runs at 9/4.
  
  Hmmm, how to interpret "speed -C" results for mpn_tabselect?
  
  Just curious, as that is what I was using.
  
It is broken: it actually passes invalid arguments, and the dimension
is not representable.

I will check in a fix which allows any geometry N x M by using
  speed -sN mpn_tabselect.M
where N is the vector length and M is the number of vectors.
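
For example, with N = 16 limbs and M = 16 vectors (an arbitrary
instantiation of the syntax above), that becomes

  speed -s16 mpn_tabselect.16

which times selecting one 16-limb vector out of a table of 16.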

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller  writes:
  
  > I sincerely apologise for the odd number of insns in the loop.  :-)
  
  Easily solved by using the pointer trick on 'tp' and making 'i'
  instead be 'i * stride'.  That'll get us down to 16 instructions.
  
  I'll try to find time to play with this over the weekend, thanks!
  
The typical use for this function is RSA encryption/signing, or crypto
operations with even smaller operand sizes.

The vector length will typically be 512 to 2048 bits (8 to 32 limbs on
64-bit machines, 16 to 64 limbs on 32-bit machines) and the number of
vectors will be up to perhaps 32, usually just 16.

I.e., we're dealing with rather small data structures.  A large loop
setup cost will not be compensated by a slightly faster loop.
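
(For a concrete sense of scale: with 2048-bit operands on a 64-bit
machine and a 16-entry table, the whole table is 16 * 32 limbs * 8 bytes
= 4 KiB, all of which is read on every selection.)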

-- 
Torbjörn


Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund 
Date: Fri, 12 Apr 2013 20:10:13 +0200

> If we could trust movXX to be silent, we should of course not bother to
> create that mask, but replace that and the logops with 4 moveq for a
> great speedup...

BTW, even if we could do this, this would be really bad on Ultra1&2
("moveq" is single issue and 3 cycle latency) and suboptimal on
Ultra3&4 (only one "moveq" dispatched per cycle).

Only T4&T5 would really benefit from this sort of approach.


Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund 
Date: Fri, 12 Apr 2013 17:14:41 +0200

> I'd suggest using the loop below for sparc64.  It limits `which' to be <
> 2^32 by creating the mask based on 32-bit comparison.  It would be
> possible to replace "subcc o1,1,o1; subc ..." by "addcc o1,-1,o1; addxc
> ..." for newer chips, but I think that's no use.

Looks good.

> I sincerely apologise for the odd number of insns in the loop.  :-)

Easily solved by using the pointer trick on 'tp' and making 'i'
instead be 'i * stride'.  That'll get us down to 16 instructions.

I'll try to find time to play with this over the weekend, thanks!


Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund 
Date: Fri, 12 Apr 2013 10:04:35 +0200

> I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
> per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
> again probably runs at 9/4.

Hmmm, how to interpret "speed -C" results for mpn_tabselect?

Just curious, as that is what I was using.


Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller  writes:

  From: Torbjorn Granlund 
  Date: Fri, 12 Apr 2013 19:30:42 +0200
  
  > David Miller  writes:
  > 
  >   It isn't really conditional execution on sparc, the resources and
  >   timing required for the "move" instruction are constant whether the
  >   condition matches or not.
  >   
  > That's not enough.
  > 
  > It needs to have the same data-dependency behaviour too.
  > 
  > And it needs to be a documented feature, since it is easy to imagine an
  > implementation which does create different data-dependency behaviour,
  > even if all current ones do not.
  > 
  > Using carry seems safer.  Of course, it is conceivable that a
  > carry-dependent instruction could run differently depending on the
  > value of the carry flag, but that'd be quite strange.
  
  Even on a pipeline that dynamically renames the condition codes?
  You know, something we've never seen before :-)
  
I am not mainly worried about WAR and WAW dependencies, but about plain
RAW dependencies.  Does the movXX read the source register or not, for a
false condition?  Does it read the destination, or does it only write it
(for a true condition)?

If we could trust movXX to be silent, we should of course not bother to
create that mask, but replace that and the logops with 4 moveq for a
great speedup...

  > (For generic sparc64 code, we should perhaps avoid movXX for other
  > reasons too, since it interacts poorly with loads on US1-2, IIRC.)
  
  Now that's a more solid reason to avoid movcc.
  
Keeping the cnd functions and tabselect side-channel silent is crucial.
(And while not specifically designed for side-channel silence, mul_k,
addmul_k and submul_k should also be silent.)

-- 
Torbjörn


Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund 
Date: Fri, 12 Apr 2013 19:30:42 +0200

> David Miller  writes:
> 
>   It isn't really conditional execution on sparc, the resources and
>   timing required for the "move" instruction are constant whether the
>   condition matches or not.
>   
> That's not enough.
> 
> It needs to have the same data-dependency behaviour too.
> 
> And it needs to be a documented feature, since it is easy to imagine an
> implementation which does create different data-dependency behaviour,
> even if all current ones do not.
> 
> Using carry seems safer.  Of course, it is conceivable that a
> carry-dependent instruction could run differently depending on the
> value of the carry flag, but that'd be quite strange.

Even on a pipeline that dynamically renames the condition codes?
You know, something we've never seen before :-)

> (For generic sparc64 code, we should perhaps avoid movXX for other
> reasons too, since it interacts poorly with loads on US1-2, IIRC.)

Now that's a more solid reason to avoid movcc.


Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller  writes:

  It isn't really conditional execution on sparc, the resources and
  timing required for the "move" instruction are constant whether the
  condition matches or not.
  
That's not enough.

It needs to have the same data-dependency behaviour too.

And it needs to be a documented feature, since it is easy to imagine an
implementation which does create different data-dependency behaviour,
even if all current ones do not.

Using carry seems safer.  Of course, it is conceivable that a
carry-dependent instruction could run differently depending on the
value of the carry flag, but that'd be quite strange.

(For generic sparc64 code, we should perhaps avoid movXX for other
reasons too, since it interacts poorly with loads on US1-2, IIRC.)

-- 
Torbjörn


Re: Better tabselect

2013-04-12 Thread David Miller
From: Torbjorn Granlund 
Date: Fri, 12 Apr 2013 10:04:35 +0200

> David Miller  writes:
> 
>   The existing C code approaches 6 cycles/limb on T4; the best I can do
>   without pipelining with this new approach at 4-way unrolling is ~4.5
>   cycles/limb:
>   
> This function needs to leak no side-channel information about the actual
> operand contents.  Conditional execution is a no-no.  Instead, we need
> to form the mask using arithmetic operations.

It isn't really conditional execution on sparc, the resources and
timing required for the "move" instruction are constant whether the
condition matches or not.

> I am quite ignorant about VIS, but doesn't that allow 128-bit
> operations?

The largest vector element on sparc VIS is 32 bits, and the largest
vectors are 64 bits.

And even if the elements were large enough, the instructions provided
to do this violate the side-channel restriction.

What you'd do is a VIS comparison to generate a store mask, followed by
a partial-store instruction that uses the mask to conditionally store
the elements into the 'rp' array.

    fpcmpXXX  xxx, %g1
    stda      %fN, [rp + %g1] ASI_PSTXXX

But that conditional store would change the timing between the case
where we select and the case where we do not.

> Here is a tabselect test program:

Thanks!


Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
I'd suggest using the loop below for sparc64.  It limits `which' to be <
2^32 by creating the mask based on 32-bit comparison.  It would be
possible to replace "subcc o1,1,o1; subc ..." by "addcc o1,-1,o1; addxc
..." for newer chips, but I think that's no use.

I sincerely apologise for the odd number of insns in the loop.  :-)

(Note that we support operand sizes >= 2^32 in mpn on 64-bit chips, but
that doesn't mean that the number of vectors in tabselect needs to be
that large.)

  
        sllx    n, 3, stride
        mov     tp, tporig
        sub     n, 4, j
        brlz    j, L(outer_end)
         nop

L(outer_loop):
        clr     data0
        clr     data1
        clr     data2
        clr     data3
        mov     tporig, tp
        mov     nents, i
        mov     which, %o1

L(top): subcc   %o1, 1, %o1     C set carry iff o1 = 0
        ldx     [tp + 0], t0
        subc    %g0, %g0, mask
        ldx     [tp + 8], t1
        sub     i, 1, i
        ldx     [tp + 16], t2
        ldx     [tp + 24], t3
        add     tp, stride, tp
        and     t0, mask, t0
        and     t1, mask, t1
        or      t0, data0, data0
        and     t2, mask, t2
        or      t1, data1, data1
        and     t3, mask, t3
        or      t2, data2, data2
        brnz    i, L(top)
         or     t3, data3, data3

        stx     data0, [rp + 0]
        subcc   j, 4, j
        stx     data1, [rp + 8]
        stx     data2, [rp + 16]
        stx     data3, [rp + 24]
        add     tporig, (4 * 8), tporig

        brgez   j, L(outer_loop)
         add    rp, (4 * 8), rp
L(outer_end):

-- 
Torbjörn


Re: mpq_get_d_nearest

2013-04-12 Thread Zimmermann Paul
   Marc,

> Date: Fri, 12 Apr 2013 11:28:01 +0200 (CEST)
> From: Marc Glisse 
> 
> On Fri, 12 Apr 2013, Zimmermann Paul wrote:
> 
> > the mpq_get_d function rounds towards zero (i.e., truncates).
> > In several applications, people usually prefer rounding to nearest.
> > Is it planned to provide a function (say mpq_get_d_nearest) for that?
> > I could contribute it, based on the code below (which does not deal with
> > subnormal numbers yet).
> >
> > We could also have rounding towards -infinity and +infinity, which would be
> > useful for people doing interval arithmetic.
> 
> Preferably with a single function that returns both roundings :-)
> That would allow us to remove this slow code from CGAL (and stop linking
> with mpfr in many cases):
> 
>operator()( const mpq_class& x ) const {
>  mpfr_t y;
>  mpfr_init2 (y, 53); /* Assume IEEE-754 */
>  mpfr_set_q (y, x.get_mpq_t(), GMP_RNDD);
>  double i = mpfr_get_d (y, GMP_RNDD); /* EXACT but can overflow */
>  mpfr_set_q (y, x.get_mpq_t(), GMP_RNDU);
>  double s = mpfr_get_d (y, GMP_RNDU); /* EXACT but can overflow */
>  mpfr_clear (y);
>  return std::pair<double, double>(i, s);
>}
> 
> (I hadn't looked at that code in a while, it could at least use 
> MPFR_DECL_INIT I guess, but best would be not needing mpfr)
> 
> -- 
> Marc Glisse

note that in the above code, the return value of mpfr_set_q (which CGAL
ignores) makes it possible to use only a single call to mpfr_get_d; a
call to nextafter then yields the other bound.
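
A minimal sketch of that idea (not from the original mail; the function
name is invented and the overflow corner cases noted above are ignored):

  #include <math.h>
  #include <gmp.h>
  #include <mpfr.h>

  /* Return both roundings of x: *lo rounded down, *hi rounded up.  */
  static void
  mpq_get_d_interval (double *lo, double *hi, const mpq_t x)
  {
    mpfr_t y;
    mpfr_init2 (y, 53);                       /* assume IEEE-754 double */
    int inex = mpfr_set_q (y, x, GMP_RNDD);   /* < 0 iff rounded down */
    *lo = mpfr_get_d (y, GMP_RNDD);           /* exact, but can overflow */
    *hi = (inex == 0) ? *lo : nextafter (*lo, INFINITY);
    mpfr_clear (y);
  }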

But this is now off-topic, sorry...

Paul


Re: mpq_get_d_nearest

2013-04-12 Thread Marc Glisse

On Fri, 12 Apr 2013, Zimmermann Paul wrote:


the mpq_get_d function rounds towards zero (i.e., truncates).
In several applications, people usually prefer rounding to nearest.
Is it planned to provide a function (say mpq_get_d_nearest) for that?
I could contribute it, based on the code below (which does not deal with
subnormal numbers yet).

We could also have rounding towards -infinity and +infinity, which would be
useful for people doing interval arithmetic.


Preferably with a single function that returns both roundings :-)
That would allow us to remove this slow code from CGAL (and stop linking
with mpfr in many cases):

  operator()( const mpq_class& x ) const {
mpfr_t y;
mpfr_init2 (y, 53); /* Assume IEEE-754 */
mpfr_set_q (y, x.get_mpq_t(), GMP_RNDD);
double i = mpfr_get_d (y, GMP_RNDD); /* EXACT but can overflow */
mpfr_set_q (y, x.get_mpq_t(), GMP_RNDU);
double s = mpfr_get_d (y, GMP_RNDU); /* EXACT but can overflow */
mpfr_clear (y);
return std::pair<double, double>(i, s);
  }

(I hadn't looked at that code in a while, it could at least use 
MPFR_DECL_INIT I guess, but best would be not needing mpfr)


--
Marc Glisse


mpq_get_d_nearest

2013-04-12 Thread Zimmermann Paul
   Hi,

the mpq_get_d function rounds towards zero (i.e., truncates).
In several applications, people usually prefer rounding to nearest.
Is it planned to provide a function (say mpq_get_d_nearest) for that?
I could contribute it, based on the code below (which does not deal with
subnormal numbers yet).

We could also have rounding towards -infinity and +infinity, which would be
useful for people doing interval arithmetic.

Paul

#include <math.h>  /* for ldexp */
#include <gmp.h>

double
mpq_get_d_nearest (mpq_t q)
{
  mpz_ptr a = mpq_numref (q);
  mpz_ptr b = mpq_denref (q);
  size_t sa = mpz_sizeinbase (a, 2);
  size_t sb = mpz_sizeinbase (b, 2);
  size_t na, nb;
  mpz_t aa, bb;
  double d;
  
  /* easy case: |a|, |b| < 2^53, no overflow nor underflow can occur */
  if (sa <= 53 && sb <= 53)
return mpz_get_d (a) / mpz_get_d (b);

  /* same if a = m*2^e with m representable on 53 bits, idem for b, but beware
 that both a and b do not give an overflow */
  na = sa - mpz_scan1 (a, 0);
  nb = sb - mpz_scan1 (b, 0);
  if (sa <= 1024 && na <= 53 && sb <= 1024 && nb <= 53)
return mpz_get_d (a) / mpz_get_d (b);

  /* hard case */
  mpz_init (aa);
  mpz_init (bb);
  
  if (sa >= sb)
{
  mpz_set (aa, a);
  mpz_mul_2exp (bb, b, sa - sb);
}
  else
{
  mpz_mul_2exp (aa, a, sb - sa);
  mpz_set (bb, b);
}

  /* q = aa/bb*2^(sa-sb) */

  if (mpz_cmpabs (aa, bb) >= 0)
{
  mpz_mul_2exp (bb, bb, 1);
  sa ++;
}

  mpz_mul_2exp (aa, aa, 54);
  sb += 54;

  mpz_tdiv_qr (aa, bb, aa, bb);

  /* the quotient aa should have exactly 54 bits */

  if (mpz_tstbit (aa, 0) == 0)
{
}
  else if (mpz_cmp_ui (bb, 0) != 0)
{
  if (mpz_sgn (aa) > 0)
mpz_add_ui (aa, aa, 1);
  else
mpz_sub_ui (aa, aa, 1);
}
  else /* mid case: round to even */
{
  if (mpz_tstbit (aa, 1) == 0)
{
  if (mpz_sgn (aa) > 0)
mpz_sub_ui (aa, aa, 1);
  else
mpz_add_ui (aa, aa, 1);
}
  else
{
  if (mpz_sgn (aa) > 0)
mpz_add_ui (aa, aa, 1);
  else
mpz_sub_ui (aa, aa, 1);
}
}
  
  d = mpz_get_d (aa); /* exact */
  mpz_clear (aa);
  mpz_clear (bb);
  return ldexp (d, (long) sa - (long) sb);
}
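
For illustration, a small driver (not part of the original mail) that
prints the nearest double to 1/3:

  #include <stdio.h>

  int
  main (void)
  {
    mpq_t q;
    mpq_init (q);
    mpq_set_ui (q, 1, 3);                        /* q = 1/3 */
    printf ("%.17g\n", mpq_get_d_nearest (q));   /* 0.33333333333333331 */
    mpq_clear (q);
    return 0;
  }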


Re: Better tabselect

2013-04-12 Thread Torbjorn Granlund
David Miller  writes:

  The existing C code approaches 6 cycles/limb on T4; the best I can do
  without pipelining with this new approach at 4-way unrolling is ~4.5
  cycles/limb:
  
This function needs to leak no side-channel information about the actual
operand contents.  Conditional execution is a no-no.  Instead, we need
to form the mask using arithmetic operations.

The critical information not to be leaked for tabselect is which vector
is chosen, i.e., the value of the `which' parameter.

We need to use something like "sub x,y,z, negcc z, subc %g0,%g0,%mask"
rather than your mask creation code.  Just like my x86-sse code, this
will not allow table widths greater than 2^32, but that's OK.
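
In C terms, the idea is roughly the following (an illustrative sketch,
not the actual GMP code; a compiler gives no constant-time guarantee
for it, which is why the mask is formed from the carry flag in the
assembly versions):

  #include <gmp.h>

  /* Select entry `which' out of `nents' vectors of n limbs each, reading
     every entry and combining it under an all-ones or all-zeros mask, so
     that memory access and timing do not depend on `which'.  */
  static void
  tabselect_sketch (mp_limb_t *rp, const mp_limb_t *tab,
                    mp_size_t n, mp_size_t nents, mp_size_t which)
  {
    mp_size_t i, j;
    mp_limb_t cnt = which;

    for (j = 0; j < n; j++)
      rp[j] = 0;

    for (i = 0; i < nents; i++)
      {
        /* mask = ~0 exactly when cnt == 0, i.e. at entry `which';
           this mirrors the subcc/subc carry trick.  */
        mp_limb_t mask = -(mp_limb_t) (cnt == 0);
        cnt--;
        for (j = 0; j < n; j++)
          rp[j] |= tab[i * n + j] & mask;
      }
  }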

I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
again probably runs at 9/4.  (Both these CPUs would benefit from sparser
ldx scheduling, of course.  I don't think it is worth the effort, at
least not at this point.)

I am quite ignorant about VIS, but doesn't that allow 128-bit
operations?

Here is a tabselect test program:


test-tabselect.c
Description: Binary data

-- 
Torbjörn