On 02/22/2013 10:20 AM, Torbjorn Granlund wrote:
> Useful.  Is there any 32+32 >> 32 -> 32?  I.e., carry-out.
Sadly, no.  Or if there is, I missed it.

Also interesting, as I'm looking around, is VEXT.  Consider

	vmull.u32	Qa01, Du00, Dv01
	vmull.u32	Qb12, Du11, Dv01

which gives us the 64-bit products a1, a0 and b2, b1.  Considered as
v4si vectors (gcc-speak for 4 x 32-bit) these are

	{ b2h, b2l, b1h, b1l }, { a1h, a1l, a0h, a0l }

Apply

	vext.32		Qc01, Qa01, Qb12, #1

and we get Qc01 = { b1l, a1h, a1l, a0h }.  Looking at the pairs, this
is exactly the input we'd like to feed into

	vpaddl.u32	Qc01, Qc01

to achieve the v2di vector { b1l + a1h, a1l + a0h }.

Now, we all know that u32 * u32 + u32 + u32 cannot overflow u64
(indeed it exactly fits), so the output of that vpaddl could be used
as the addend to a multiply round with vmlal.  Which suggests a code
structure like

.Loop:
	vmlal.u32	Qp01, Du00, Dv01	@ v2di{ p1, p0 }
	vst1.u32	{Dp0[0]}, [rp]!		@ store p0l
	vext.32		Qp01, Qp01, Qzero, #1	@ v4si{ 0, p1h, p1l, p0h }
	vpaddl.u32	Qp01, Qp01		@ v2di{ p1h, p1l+p0h }
	// bookkeeping
	bne	.Loop

I.e. we store out 32 bits each round, keeping a "48-bit" rolling carry
going between each stage.  If this works, it's significantly less
overhead than the structure I posted yesterday.

Oh, wait, this misses the addend part of addmul.  Hmm.  We have room
in the rolling carry where I shift in zero above; that slot could
instead hold the addend element for the appropriate round.  Perhaps I
should give this another go...

r~

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel