ni...@lysator.liu.se (Niels Möller) writes:

> I had on the other hand not realised David's ones complement + pre-invert
> carry trick.

Not sure I understand what you are referring to here. I haven't been following the sparc developments very closely (and I don't remember much of sparc assembly).

The newer sparc adds 64-bit carrying adds, but they still don't have corresponding subtraction instructions. So David sets carry before entering the loop, and ones complements the subtrahend.
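To make the trick concrete, here is a minimal C sketch of the identity behind it: a - b = a + ~b + 1, so presetting the carry to 1 and complementing each subtrahend limb turns a carrying-add chain into a subtraction. This is only an illustration, not David's sparc code; the name sub_n_via_add and the use of uint64_t limbs are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not GMP's mpn_sub_n): computes
   {rp,n} = {up,n} - {vp,n} using only additions.
   Returns the borrow out of the most significant limb (0 or 1). */
static uint64_t
sub_n_via_add (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t cy = 1;              /* carry is set before entering the loop */

  assert (n > 0);
  for (size_t i = 0; i < n; i++)
    {
      uint64_t v = ~vp[i];      /* ones complement of the subtrahend limb */
      uint64_t s = up[i] + v;   /* first add, may wrap around */
      uint64_t c1 = s < v;      /* carry out of the first add */
      uint64_t r = s + cy;      /* add the incoming carry */
      uint64_t c2 = r < cy;     /* carry out of the second add */

      rp[i] = r;
      cy = c1 | c2;             /* at most one of c1, c2 can be set */
    }
  return cy ^ 1;                /* carry out 1 means "no borrow" */
}
```

With u = {0, 1} (that is, 2^64) and v = {1, 0}, the result is {UINT64_MAX, 0} and the returned borrow is 0. In assembly the two adds and the carry bookkeeping collapse into a single add-with-carry instruction per limb.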
> Cool! Looks like it is actually faster than 3.9 for some
> alignments/sizes.

It seems one iteration takes 15.5 cycles. I guess that means that even and odd iterations are executed differently?

Probably. Or that it is limping in some other odd way, such as cache line related. But your explanation is the most likely one, I think.

It would be nice to get it down to 15 cycles (3.75 c/l) (the addmul_1 iteration takes 13 cycles, there's no good reason the four additional mvn instructions should cost more than two cycles). But I find instruction scheduling both very hard and tedious.

It is hard and tedious, but it really pays off if one is persistent enough (or has some tool). The A9 pipeline can execute two mvn each cycle.

> 1. Use discrete ptr updates for up and/or rp.

Maybe. Costs additional instructions, but with more freedom on where to place the pointer increment. Loopmixer would help.

I have never seen a case where a separate insn adds execution time. I suspect the hardware executes ld with autoupdate as two insns, at least on some ARMs. (It should work to execute st with autoupdate within the usual read-two-regs, write-one-reg, unlike ld.)

A digression: I'm running Debian GNU/Linux on my pandaboard system. The Linux way to get access to the instruction counter seems to be via "perf_event_open". However, when I tried it, it seems no hardware-based events exist (I do get access to the software-based ones though, so the interface is partially working). Also, clock_gettime with CLOCK_PROCESS_CPUTIME_ID gives very poor accuracy, so maybe the entire "high res timers"-subsystem is non-working. Any clues on where to look for solving this problem are appreciated. The obvious (to me) things seem to be enabled in the kernel config.

There is an annoying Linux tradition of "implementing" things and then not making them work for years. Clocks and timing have always been a sore area for Linux. This is why almost all gmplib machines run BSD, where things actually work.
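For anyone wanting to check the clock_gettime complaint on their own board, the following C sketch compares the resolution the kernel reports for CLOCK_PROCESS_CPUTIME_ID with the step actually observed between readings. It uses only the POSIX calls clock_getres and clock_gettime; the function names are made up for this example, and it says nothing about why a particular kernel behaves badly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: resolution the kernel claims for the per-process
   CPU-time clock, in nanoseconds, or -1 if the clock is unavailable. */
static int64_t
cputime_resolution_ns (void)
{
  struct timespec res;

  if (clock_getres (CLOCK_PROCESS_CPUTIME_ID, &res) != 0)
    return -1;
  return (int64_t) res.tv_sec * 1000000000 + res.tv_nsec;
}

/* Hypothetical helper: spin until the clock reading changes and return
   the size of that first observed step in nanoseconds, or -1 on error.
   A step far larger than the claimed resolution suggests the clock is
   driven by a coarse tick rather than a high-resolution timer. */
static int64_t
cputime_first_step_ns (void)
{
  struct timespec t0, t1;

  if (clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &t0) != 0)
    return -1;
  do
    {
      if (clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &t1) != 0)
        return -1;
    }
  while (t1.tv_sec == t0.tv_sec && t1.tv_nsec == t0.tv_nsec);

  return (int64_t) (t1.tv_sec - t0.tv_sec) * 1000000000
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Printing both values side by side is enough to see whether the reported resolution is honest; on older glibc the program may need to be linked with -lrt.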
I have tried to get cycle counters to work on both my ARM systems, following various examples ("HOWTOs"). Nothing works. I will not waste more time on this, but as soon as *BSD is available for Panda or Chromebook, I will migrate to it.

And it's no use to even think of porting the loop mixer to arm without access to cycle-accurate timing.

That would indeed make it less useful. I suppose it could still be made to work by running each sequence enough times for some Linux counter to be updated.

> 3. Use ldm/stm. Often an A9 win.

If we want to schedule loads early, that seems to rule out using a single ldm to load all values used in the loop. Right? But two ldm, loading two limbs at a time, could work. stm seems easier. Any changes of this type will break the current loop setup logic, I'm afraid.

I assume that ldm loads the registers in some specific order, such as lowest numbered first. Then, it could lift the scoreboard bit for available register values while ldm executes.

Using ldm with just two registers might be pointless. Also, it will for 50% of alignments take 2 cycles. Doing three registers is (as we've discussed in the past) more appealing.

I haven't explored if ldm is a win for A9 compared to well placed discrete loads. On A15 ldm seems pretty useless, but it is not harmful either.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel