ni...@lysator.liu.se (Niels Möller) writes:

> If this works,
> FOLD would turn into something like
>
>   sldi F0, $1, 32
>   srdi F1, $1, 32
>   subfc F2, $1, F0
>   addme F3, F1

I'm looking at a different approach (experimenting on ARM64, which is
quite similar to powerpc, but I don't yet have working code).

To understand what the redc code is doing, we need to keep in mind that
one folding step computes

  <U4,U3,U2,U1,U0> + U0*p

which cancels the low limb, since p = -1 (mod 2^64). So since the low
limb always cancels, what we need is

  <U4,U3,U2,U1> + U0*((p+1)/2^64)

The x86_64 code does this by splitting U0*p into
2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and
adding in the high part later. But one doesn't have to do it that way.
One could instead use a FOLD macro that computes

  (2^{192} - 2^{160} + 2^{128} + 2^{32}) U0

I also wonder if there's some way to use the carry out from one fold
step and apply it at the right place while preparing F0,F1,F2,F3 for
the next step.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
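
As a numeric sanity check of the folding identity described in the message, here is a minimal Python sketch. It assumes p is the NIST P-256 prime 2^256 - 2^224 + 2^192 + 2^96 - 1, which is consistent with the constant (2^{192} - 2^{160} + 2^{128} + 2^{32}) quoted in the message but is not stated there. The names P, B, Q, fold and redc are illustrative only and are not taken from the nettle sources.

```python
# Assumption: p is the NIST P-256 prime (not named in the message itself).
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
B = 2**64  # limb base, one 64-bit limb

# Multiplier for U0 in one folding step; should equal (p + 1) / 2^64.
Q = 2**192 - 2**160 + 2**128 + 2**32


def fold(u):
    """One folding step: <U4,U3,U2,U1> + U0 * ((p + 1) / 2^64).

    This equals (u + U0 * p) / 2^64 exactly, because adding U0 * p
    cancels the low limb when p = -1 (mod 2^64).
    """
    u0 = u % B      # low limb U0
    high = u // B   # remaining limbs <U4,U3,U2,U1>
    return high + u0 * Q


def redc(u, steps=4):
    """Apply the folding step once per 64-bit limb of p (hypothetical helper)."""
    for _ in range(steps):
        u = fold(u)
    return u


if __name__ == "__main__":
    u = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978_8877665544332211_DEADBEEFCAFEF00D
    # One fold divides by 2^64 modulo p; four folds divide by 2^256 modulo p.
    print((fold(u) * B - u) % P == 0)
    print((redc(u) * 2**256 - u) % P == 0)
```

The sketch only checks the arithmetic with Python's big integers. It says nothing about how the carries would be scheduled in an ARM64 or powerpc FOLD macro, which is the open question in the message.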