Re: arm neon

2013-02-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:
  Found the vmlal instruction now. Makes for a cute loop,

  .Loop:
	vld1.32		l01[1], [vp]!
	vld1.32		{u00[]}, [up]!
	vaddl.u32	q1, l01, c01
	vmlal.u32	q1, u00, v01	C q1 overlaps with c01
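Why widening into a 64-bit lane with vaddl.u32 and then vmlal.u32 is safe: each lane receives a 32-bit limb, plus a 32-bit carry, plus a 32x32-bit product, and the worst case (2^32-1) + (2^32-1) + (2^32-1)^2 equals exactly 2^64-1, so the lane never overflows. The portable C sketch below models that per-lane arithmetic for a single-limb multiplier. It is an illustration under that assumption only: the name ref_addmul_1_32 is hypothetical, and this is neither GMP's code nor the NEON loop itself.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model (not GMP code): rp[0..n) += up[0..n) * v,
   using 32-bit limbs, returning the carry-out limb.
   Each step builds the same 64-bit sum that one NEON lane holds after
   vaddl.u32 (limb + carry) followed by vmlal.u32 (+ up[i] * v).
   Max value: 2*(2^32-1) + (2^32-1)^2 = 2^64-1, so t never overflows. */
static uint32_t ref_addmul_1_32(uint32_t *rp, const uint32_t *up,
                                size_t n, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)rp[i] + carry; /* like vaddl.u32: widen and add */
        t += (uint64_t)up[i] * v;             /* like vmlal.u32: widening multiply-accumulate */
        rp[i] = (uint32_t)t;                  /* low half is the result limb */
        carry = (uint32_t)(t >> 32);          /* high half carries into the next limb */
    }
    return carry;
}
```

With all inputs at 0xFFFFFFFF and n = 2, the result limbs become {0, 0xFFFFFFFF} and the returned carry is 0xFFFFFFFF, matching (2^64-1) + (2^64-1)(2^32-1) = 2^96 - 2^32.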

Re: arm neon

2013-02-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes:
  I suspect that some scheduling just might improve performance by a large factor...

One could hope so. I'm still a bit skeptical of multiply-and-accumulate, at least for addmul_2, since it seems like a good idea to schedule the multiplications far in

Re: arm neon

2013-02-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:
  One could hope so. I'm still a bit skeptical of multiply-and-accumulate, at least for addmul_2, since it seems like a good idea to schedule the multiplications far in advance.

Multiply-accumulate can have the drawback of putting multiplication in the
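The scheduling concern can be illustrated with a portable C sketch that keeps multiplication and accumulation apart: the products depend only on the inputs, so they can be issued well in advance, while only the additions form the loop-carried carry chain. A fused multiply-accumulate such as vmlal instead ties each multiply to the accumulator it updates. This is a hypothetical model, assuming 32-bit limbs and a caller-supplied scratch buffer prod; the name ref_addmul_1_split is invented for illustration and is not how GMP or the thread's code is structured.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model (not GMP code): rp[0..n) += up[0..n) * v with 32-bit
   limbs, returning the carry-out limb. prod must hold n entries. */
static uint32_t ref_addmul_1_split(uint32_t *rp, const uint32_t *up,
                                   size_t n, uint32_t v, uint64_t *prod)
{
    /* Phase 1: products have no dependency on the carry, so a scheduler
       is free to start them as early as it likes. */
    for (size_t i = 0; i < n; i++)
        prod[i] = (uint64_t)up[i] * v;

    /* Phase 2: only these additions carry a dependency from one
       iteration to the next. Sum is at most 2^64-1, so no overflow. */
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = prod[i] + rp[i] + carry;
        rp[i] = (uint32_t)t;          /* low half: result limb */
        carry = (uint32_t)(t >> 32);  /* high half: carry to next limb */
    }
    return carry;
}
```

Real code would interleave the two phases in software-pipelined fashion rather than using a full scratch array; the separation here only makes the dependency structure visible.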