On Fri, Oct 28, 2022 at 7:34 AM Maamoun TK <maamoun...@googlemail.com>
wrote:

> I created a new merge request for poly1305 multi block implementation
> https://git.lysator.liu.se/nettle/nettle/-/merge_requests/55
> The patch is based on radix 2^44 with linear carry addition for reduction.
> I tried multiple scenarios to best performing _nettle_poly1305_blocks on
> ppc and got interesting results.
> *----------------------------------------------------------*
> |  Radix                                |      Mbyte/s       |
> |------------------------------------|---------------------|
> |  2^44  (Reduction 1)          |         1691.39   |
> |  2^44  (Reduction 2)          |         1812.81   |
> |  2^ 44  (Linear reduction)   |         2044.05   |
> |  2^64                                  |         1002.27   |
> *---------------------------------------------------------*
> Reductions 1/2 refer to two versions of 2^44 reduction you shared earlier
> on this thread. While these reductions have more speedup on C, the linear
> reduction still has a superior performance in the vectorized version on
> ppc.
> *----------------------------------------------------------------*
> |  Radix                                       |      Mbyte/s       |
> |------------------------------------------|---------------------|
> |  2^44  (C - Linear reduction)     |         1734.09   |
> |  2^44  (C - Reduction 1)            |         2035.56  |
> |  2^44  (C - Reduction 2)            |         2044.31  |
> *---------------------------------------------------------------*
> I still manage to monitor the best performance on C using radix-64 with
> deferred partial folding so having it as a default path along with 2^26
> implementation for legacy purposes in conditional preprocessor directives
> would fit well in my opinion.
>
> > ppc only offers instruction to shift 128-bit value by octet
> After trying several formulas to get around this, I found that the best
> solution is to stick with regular reduction and utilize POWER10-specific
> instruction 'vsrq' for Vector shift right quadword instead of vsro/vsrd
> instructions but I wouldn't go with that for now since it's early to make
> use of Power ISA v3.1
>

Why is it early to use POWER10 instructions?  Various BLAS libraries
have been updated to use POWER10 instructions.

Thanks, David


>
> regards,
> Mamone
>
> On Tue, Oct 25, 2022 at 8:23 PM Niels Möller <ni...@lysator.liu.se> wrote:
>
> > Maamoun TK <maamoun...@googlemail.com> writes:
> >
> > > I did the benchmark on my laptop too. I got a speed of 3964.37 GB/s on
> > > upstream and 5054.32 GB/s benchmarking poly1305 update on the new
> > branch. I
> > > wonder if the result numbers are truncated on your end because that
> would
> > > keep the improvement on context with my test (25%).
> >
> > I did round, it was something like 2990 MB/s vs 5010 MB/s. But it's
> > going to be rather machine dependent. Those numbers were from a AMD
> > Ryzen 5 something (laptop is two years old, iirc). Now I've tried on my
> > intel box too (i3-5010U, much older).
> >
> > I get
> >          MB/s  cycles/block
> > before:  1.8   70
> > after:   2.4   53
> >
> > so on this machine, appr. 30% speedup.
> >
> > > I will give multiblock radix-2^64 a try on ppc to examine the result.
> >
> > I'm curious to see how that works out, if results extend beyond x86.
> >
> > > ppc only offers instruction to shift 128-bit value by octet making
> > > something like the following sketch is more ideal
> >
> > Not sure if that helps, but the purpose of the shift operations that
> > need to operate on 128-bit values is to extract 64 bits in the middle, a
> > bit like the x86 "double precision shift" instructions shld/shrd.
> > Everything else can use 64-bit registers. (Even though if the 128-bit
> > accumualtion is done in vector registers, I guess one ought to keep all
> > data in the vector registers, also for the folding).
> >
> > And since we have highest bit always zero, it may also work to round
> > shift count down, extract low 64 bits, followed by a shift of a few
> > bits more operating only on the 64-bit value. Something like
> >
> >   unsigned __int128 p = ...;
> >   uint64_t t = p >> 40;
> >   t >>= 2;  /* or >>= 4, if we want a 44 bit shift rather than 42 */
> >
> > Should be correct provided that our bounds guarantee that
> >
> >   p < 2^{40 + 64}.
> >
> > >   l0 = p0 & 0x3ffffffffff; h0 = (p0 >> 40) ...
> > >
> > > I've thought of radix 2^48 but the fact that r^2_1' = r^2_1 * 5 << 14
> > > (where r^2_1 is of degree 48) needs 65-bit to fit in make it a little
> > > harder to apply.
> >
> > Hmm. It might work to do a 65-bit constant. To do r' * h, where r is 65
> > bits
> > and h is 48 (and possible one or two excess bits), say r_0 = r mod 2^64,
> > r_1 = - (r>>64) (all ones if and only if r >= 2^64).
> >
> > Then
> >
> >   r * h = r_0 * h + (r_1 & h) 2^64
> >
> > So first multiply by the 64 low bits, then do a conditional (via and and
> > operation) addition into the high half. If working with regular 64-bit
> > registers, it's not too bad, but I guess it may get rather messy with
> > vector registers. So perhaps not the first thing to try out.
> >
> > Regards,
> > /Niels
> >
> > --
> > Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> > Internet email is subject to wholesale government surveillance.
> >
> _______________________________________________
> nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
> To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se
>
_______________________________________________
nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se
To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se

Reply via email to