[email protected] (Niels Möller) writes: Below is an assembly sketch, not yet working. With some luck, some variant of this should run in 54 cycles per iteration on Atom (limited by multiplication throughput, each of the three muls taking 18 cycles). Not sure how to think about scheduling for this processor (and it's little point in thinking too much about it until we have a correct instruction sequence).
The original Atom (the one which needs 18 cycles for a 64-bit mul insn) is a 2-issue in order pipeline. I am not sure the multiply runs completely unperturbed in the background on these CPUs. Some timing experiments suggest that mul stalls the pipeline, at least partially. I timed pmuludq now, and their throughput is 1 every 2nd cycle, i.e., one *should* be able to beat 54 c/l for addmul_2 quite soundly. 4 pmuludq plus some shuffling will give a 64-bit addmul_2. SSE4.1 instruction are missing from this old platform, and that will result in more shuffling/masking insn. But given that the raw multiply needs just 8 cycles, I expect 54 c/l will not be hard to beat. :-) It might be worth noticing that pmuludq-based multiply loops could be shared between x86-32 and x86-64. An 64-bit addmul_(k) will be a 32-bit addmul_(2k). At least if we stay with SSE/AVX insns and do not do some final accumulation using plain integer ops. (The same is not true for big-endian CPUs.) I like the idea of a side-channel silent toom addmul_2, but it is not obvious which platform will benefit. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list [email protected] https://gmplib.org/mailman/listinfo/gmp-devel
