ni...@lysator.liu.se (Niels Möller) writes:

  I guess it's lowest numbered first (and lowest memory address). But a
  loop with

    use r7
    ldm up!, {r4,r5,r6,r7}
    use r4

  looks like poor scheduling between the load of r4 and the use of it,
  and the ldm can't be moved earlier since it clobbers r7. But I have a
  pretty vague idea about how this really works.

I haven't explored the ARM chips enough to know this either.

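To make the load-to-use concern concrete, here is a rough toy model in Python. It is only a sketch: it assumes a single-issue, in-order pipeline, and the load latency of 4 cycles is a made-up number, not a measured figure for A15 or any other core. The names count_stalls, ldm, stm and use are hypothetical helpers for this illustration. Each "operate on" block from an interleaved schedule is modelled as one use per register it names.

```python
# Toy model only: single-issue, in-order, with invented latencies.
LOAD_LATENCY = 4  # ASSUMPTION: cycles from ldm issue until its registers are usable


def count_stalls(program):
    """Count stall cycles for a list of (name, sources, dests, latency)."""
    ready = {}   # register -> cycle at which its value becomes available
    cycle = 0    # earliest cycle at which the next instruction may issue
    stalls = 0
    for _name, sources, dests, latency in program:
        # An instruction issues once all of its source registers are ready.
        issue = max([cycle] + [ready.get(r, 0) for r in sources])
        stalls += issue - cycle
        for r in dests:
            ready[r] = issue + latency
        cycle = issue + 1
    return stalls


def ldm(regs):
    return ("ldm", [], list(regs), LOAD_LATENCY)


def stm(regs):
    return ("stm", list(regs), [], 1)


def use(reg):
    return ("use", [reg], [], 1)


LOW = ["r4", "r5", "r6", "r7"]
HIGH = ["r8", "r9", "r10", "r11"]

# The quoted loop body: r4 is used right after the ldm that loads it.
back_to_back = [use("r7"), ldm(LOW), use("r4")]

# An interleaved schedule: a stm and a second ldm fill the latency slot.
interleaved = (
    [ldm(LOW), stm(HIGH), ldm(HIGH)]
    + [use(r) for r in LOW]          # "operate on r4-r7"
    + [use(r) for r in LOW + HIGH]   # "operate on r4-r11"
)

if __name__ == "__main__":
    print("back-to-back stalls:", count_stalls(back_to_back))
    print("interleaved stalls: ", count_stalls(interleaved))
```

Under these assumptions the interleaved sequence stalls less than the back-to-back one, because independent memory instructions occupy cycles that would otherwise be spent waiting for r4. Real cores issue several instructions per cycle and may reorder them, so this shows the idea only and predicts nothing about actual timings.
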
A possible schedule is to put a stm in the ldm latency time slot:

    ldm {r4-r7}
    stm {r8-r11}
    ldm {r8-r11}
    operate on r4-r7
    operate on r4-r11

The "operate on" blocks don't need to be as disjoint as the picture
seems to suggest.

  Right, rewriting the loop with 3-way unrolling would be an interesting
  experiment. But I don't think I'll look into that soon. The current
  improvement is very good already.

It's hard to organise the ARM code. Since we have a very incomplete set
of systems, we might choose a poor asm file vector for some systems.

If your new code runs well on A15, perhaps we should assume it is good
for other systems which support umaal (>= armv6) and put it in the v6
subdir?

I'll push my new addmul_1 and submul_1 to the corea15 subdir at some
point (unless your code beats it, of course).

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel