On 02/22/2013 10:20 AM, Torbjorn Granlund wrote:
> Useful.  Is there any 32+32 >> 32 -> 32?  I.e., carry-out.
Sadly, no.  Or if there is, I missed it.

Also interesting, as I'm looking around, is VEXT.  Consider

	vmull.u32	Qa01, Du00, Dv01
	vmull.u32	Qb12, Du11, Dv01

which gives us the 64-bit products a1, a0 and b2, b1.  Considered as
v4si vectors (gcc-speak for 4 x 32-bit) these are

	{ b2h, b2l, b1h, b1l }, { a1h, a1l, a0h, a0l }

Apply

	vext.32		Qc01, Qa01, Qb12, #1

and we get Qc01 = { b1l, a1h, a1l, a0h }.  Looking at the pairs, this
is exactly the input we'd like to feed into

	vpaddl.u32	Qc01, Qc01

to achieve the v2di vector { b1l + a1h, a1l + a0h }.

Now, we all know that u32 * u32 + u32 + u32 cannot overflow u64
(indeed it exactly fits), so the output of that vpaddl could be used
as the addend to a multiply round with vmlal.  Which suggests a code
structure like

.Loop:
	vmlal.u32	Qp01, Du00, Dv01	@ v2di{ p1, p0 }
	vst1.u32	{Dp0[0]}, [rp]!		@ store p0l
	vext.32		Qp01, Qp01, Qzero, #1	@ v4si{ 0, p1h, p1l, p0h }
	vpaddl.u32	Qp01, Qp01		@ v2di{ p1h, p1l+p0h }
	// bookkeeping
	bne	.Loop

I.e. we store out 32 bits each round, keeping a "48-bit" rolling carry
going between each stage.  If this works, it's significantly less
overhead than the structure I posted yesterday.

Oh, wait, this misses the addend part of addmul.  Hmm.  We have room
in the rolling carry where I shift in zero above; that slot could
instead hold the addend element for the appropriate round.  Perhaps I
should give this another go...

r~

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel