ni...@lysator.liu.se (Niels Möller) writes:

  Hmm, if I understand you correctly, it is preferable if the cpu can
  start doing the multiplication without any dependency on the carry
  from the previous iteration, right? At least in theory, umaal could
  be implemented in such a way.

IIRC, the read of the accumulating registers is a cycle later than the
read of the multiplicand registers on A9.
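To make the dependency discussed above concrete, here is a minimal C sketch of what umaal computes and how the carry limb threads through a multiply-accumulate loop. The names umaal_model and addmul_1_model are made up for illustration; this models only the arithmetic, not the A9 pipeline timing or any actual GMP code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic model of ARM's umaal: a*b + lo + hi as a 64-bit value.
   The sum cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
uint64_t umaal_model(uint32_t a, uint32_t b, uint32_t lo, uint32_t hi)
{
    return (uint64_t)a * b + lo + hi;
}

/* Hypothetical addmul_1-style loop: rp[0..n-1] += up[0..n-1] * v.
   Returns the carry-out limb.  Only the two accumulator inputs
   (rp[i] and cy) depend on earlier work; the product up[i] * v
   does not depend on the previous iteration's carry. */
uint32_t addmul_1_model(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    assert(n == 0 || (rp != NULL && up != NULL));
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = umaal_model(up[i], v, rp[i], cy);
        rp[i] = (uint32_t)t;          /* low word goes back to memory */
        cy = (uint32_t)(t >> 32);     /* high word is the next carry-in */
    }
    return cy;
}
```

In this model the carry enters only as an addend, which is why a late read of the accumulating registers would let the multiply itself start before the previous carry is available.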
  1. What are the calling conventions?

It is probably easiest to compile some trivial examples to assembly. I
look at gcc/config/arm/arm.h and its CALL_USED_REGISTERS to figure out
the register partitioning... (That macro exists for all machines.)

  2. What gcc flags should I use to be able to get uint64_t variables
  into neon registers?

No idea. I doubt there is one. I think armhf ("hard float") is a
separate ABI, which passes float parameters in fp regs.

  I've been looking primarily for operations useful for crypto. Like
  wide xor, shift/rotate, other data shuffling. Or just using the
  additional registers to store uint64_t variables would give a decent
  speedup over using the regular registers, I imagine.

Most (non-mul) stuff is available at data sizes of up to 64 bits. The
registers are 128 bits wide, IIRC. The load and store insns are cool,
allowing various strides and (for load) padding.

  > Using Neon in a robust way might be a bit tricky, though. I have no
  > idea how to determine if a CPU has Neon or not, and ARM has made most
  > useful meta instructions supervisor-only.

  For a start, I guess it could be a configure time option (with no
  fat-binary things). Either explicit, or automatically based on,
  e.g., linux' /proc/cpuinfo which lists available cpu extensions.

I feel that grepping in a /proc file is perhaps OK for configure, but
still not great.

-- 
Torbjörn
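For reference, the /proc/cpuinfo check discussed above could be sketched roughly as follows in C. This assumes the 32-bit ARM Linux layout, where a line starting with "Features" lists the extensions as space-separated words including "neon"; the function names line_has_feature and cpuinfo_has_feature are hypothetical, and this is not what GMP's configure does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if a cpuinfo-style "Features : ..." line lists flag as a
   whole word, 0 otherwise (including for non-Features lines). */
int line_has_feature(const char *line, const char *flag)
{
    assert(line != NULL && flag != NULL);
    size_t flen = strlen(flag);
    if (flen == 0 || strncmp(line, "Features", 8) != 0)
        return 0;
    const char *p = strchr(line, ':');
    if (p == NULL)
        return 0;
    p++;
    while (*p != '\0') {
        p += strspn(p, " \t\r\n");             /* skip separators */
        size_t wlen = strcspn(p, " \t\r\n");   /* length of next word */
        if (wlen == flen && strncmp(p, flag, flen) == 0)
            return 1;
        p += wlen;
    }
    return 0;
}

/* Scan a file such as /proc/cpuinfo.  Returns 1 if the flag is listed,
   0 if it is not, and -1 if the file cannot be opened. */
int cpuinfo_has_feature(const char *path, const char *flag)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char buf[1024];
    int found = 0;
    while (!found && fgets(buf, sizeof buf, f) != NULL)
        found = line_has_feature(buf, flag);
    fclose(f);
    return found;
}
```

A configure-time probe would call cpuinfo_has_feature("/proc/cpuinfo", "neon") and treat -1 as unknown, which also shows the weakness: the answer describes the build machine, not necessarily the machine the library will run on.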