Torbjorn Granlund <t...@gmplib.org> writes:

> I sometimes get better A9 performance with *discrete* pointer updates,
> not one-out-of-four autoincrement pointer updates like used here. I
> think the code you started with had that one-out-of-four trick for str,
> already?

Right, it uses a single update of rp (which is used for both loads and
stores); I just changed it to handle the up updates in a similar way.

> I had on the other hand not realised David's ones complement + pre-invert
> carry trick.

Not sure I understand what you are referring to here. I haven't been
following the sparc developments very closely (and I don't remember much
of sparc assembly).

> Cool! Looks like it is actually faster than 3.9 for some
> alignments/sizes.

It seems one iteration takes 15.5 cycles. I guess that means that even
and odd iterations are executed differently? It would be nice to get it
down to 15 cycles (3.75 c/l): the addmul_1 iteration takes 13 cycles, and
there's no good reason the four additional mvn instructions should cost
more than two cycles. But I find instruction scheduling both very hard
and tedious.

> Did you time this on some other CPU too?

No. When I get home (I don't log in to the gmp machines from the office
network), I might get time to try it on the appropriate gmp machine.

> 1. Use discrete ptr updates for up and/or rp.

Maybe. Costs additional instructions, but with more freedom on where to
place the pointer increment. Loopmixer would help.

A digression: I'm running Debian GNU/Linux on my pandaboard system. The
Linux way to get access to the instruction counter seems to be via
"perf_event_open". However, when I tried it, it seemed that no
hardware-based events exist (I do get access to the software-based ones
though, so the interface is partially working). Also, clock_gettime with
CLOCK_PROCESS_CPUTIME_ID gives very poor accuracy, so maybe the entire
"high res timers" subsystem is non-working. Any clues on where to look
to solve this problem are appreciated.

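For anyone who wants to reproduce the symptoms, here is a minimal sketch of such a probe, assuming a Linux system with <linux/perf_event.h> available. The helper names (observed_cputime_step_ns, open_cycle_counter, read_cycles) are made up for illustration and are not part of GMP or of any tool mentioned above. It asks for the generic hardware cycle-count event and measures the smallest step that CLOCK_PROCESS_CPUTIME_ID actually takes.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Smallest nonzero step seen between successive readings of
   CLOCK_PROCESS_CPUTIME_ID, in nanoseconds, or -1 on error.
   A step in the millisecond range suggests the clock only
   advances on timer ticks. */
static long long
observed_cputime_step_ns (void)
{
  long long best = -1;
  int i;

  for (i = 0; i < 20; i++)
    {
      struct timespec a, b;
      long long d;

      if (clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &a) != 0)
        return -1;
      do
        {
          /* Spin until the reading changes; spinning burns CPU
             time, so the clock must eventually advance. */
          if (clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &b) != 0)
            return -1;
        }
      while (b.tv_sec == a.tv_sec && b.tv_nsec == a.tv_nsec);

      d = (long long) (b.tv_sec - a.tv_sec) * 1000000000LL
        + (b.tv_nsec - a.tv_nsec);
      if (best < 0 || d < best)
        best = d;
    }
  return best;
}

/* Try to open the generic hardware cycle counter for the calling
   process, user mode only. Returns a file descriptor, or -1 with
   errno set. glibc has no wrapper, hence the raw syscall. */
static int
open_cycle_counter (void)
{
  struct perf_event_attr attr;

  memset (&attr, 0, sizeof attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof attr;
  attr.config = PERF_COUNT_HW_CPU_CYCLES;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;

  /* pid = 0 (this process), cpu = -1 (any), no group, no flags. */
  return (int) syscall (SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Read the current counter value. With read_format left at 0 the
   kernel returns a single 64-bit count. Returns 0 on success. */
static int
read_cycles (int fd, uint64_t *count)
{
  return read (fd, count, sizeof *count) == (ssize_t) sizeof *count
    ? 0 : -1;
}
```

If open_cycle_counter returns -1, errno is the interesting part: as far as I can tell from the perf_event_open man page, ENOENT is what an unsupported generic event gives, which would match the "no hardware-based events" symptom, while EACCES or EPERM would point at permissions instead. If it succeeds, the difference between two read_cycles values around a loop is the cycle count one would want for timing.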
The obvious (to me) things seem to be enabled in the kernel config

  $ zgrep 'PERF_EV\|HIGH_RES' /proc/config.gz
  CONFIG_HIGH_RES_TIMERS=y
  CONFIG_HAVE_PERF_EVENTS=y
  CONFIG_PERF_EVENTS=y
  CONFIG_HW_PERF_EVENTS=y

And it's no use to even think of porting the loop mixer to arm without
access to cycle-accurate timing.

> 2. Move the one-out-of-four autoincrement updates to other ldr/str
>    insns.

Could try that.

> 3. Use ldm/stm. Often an A9 win.

If we want to schedule loads early, that seems to rule out using a
single ldm to load all values used in the loop. Right? But two ldm,
loading two limbs at a time, could work. stm seems easier.

Any changes of this type will break the current loop setup logic, I'm
afraid.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.