Hi, I've dug a bit deeper, and it seems that there is an alignment issue within addmul_1. I've created two marginally different programs, of which one is much faster:

$ gcc -O3 addmul_1.s -o addmul_1.o -c; for i in a b; do gcc -O3 $i.cpp -o $i.out addmul_1.o; ./$i.out ; done
Time: 0.490647
Time: 0.681671

addmul_1.s is the Sandy Bridge-optimized assembly. When I change the alignment there a bit as in addmul_1.opt.s, the difference disappears:

$ gcc -O3 addmul_1.opt.s -o addmul_1.o -c; for i in a b; do gcc -O3 $i.cpp -o $i.out addmul_1.o; ./$i.out ; objdump -CSD $i.out > $i.dis ;done
Time: 0.505714
Time: 0.495640

Best regards,
Marcel

On Friday, August 11, 2017 at 7:00:10 PM UTC+1, Bill Hart wrote:
>
> We've noticed similar sorts of things. One possibility is that the loop in
> your test code is not aligned as well in one version. Or perhaps your stack
> is hitting the same location modulo 4096, which is a known issue on some
> modern processors. There might be SSE code in the linker and AVX code in
> the addmul_1 function. The kernel might pin the process to a different CPU
> which is slightly slower or faster, when the pthreads library is used. You
> might also hit some frequency scaling in the CPU due to the pthreads
> library taking longer to link in. There's so many possibilities on a modern
> CPU, it hardly bears thinking about.
>
> Also, in your code, you don't seem to set y anywhere and I wasn't aware
> you could use 1e8 as an int constant.
>
> Bill.
>
> On 11 August 2017 at 18:10, Marcel Keller <m.ke...@bristol.ac.uk> wrote:
>
>> Hi,
>>
>> I've noticed that the performance of mpn_addmul_1 can depend considerably
>> on whether I link against libpthread, which strikes me as very weird:
>>
>> $ g++ -O3 Time-addmul_1.cpp ~/src/mpir-3.0.0-ivybridge/mpn/addmul_1.o -o a.out
>>
>> $ g++ -O3 Time-addmul_1.cpp ~/src/mpir-3.0.0-ivybridge/mpn/addmul_1.o -o b.out -lpthread
>>
>> $ ./a.out
>> mpn_addmul_1: 0.506279
>>
>> $ ./b.out
>> mpn_addmul_1: 0.682086
>>
>> Disassembling the binaries shows that the mpn function in
>> Time-addmul_1.cpp is compiled exactly the same way.
>>
>> I'm running CentOS 7 and GCC 6.2. The source as well as the outputs are
>> attached.
>>
>> Does anyone have any idea why this could be?
>>
>> Best regards,
>> Marcel
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "mpir-devel" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to mpir-devel+...@googlegroups.com.
>> To post to this group, send email to mpir-...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/mpir-devel.
>> For more options, visit https://groups.google.com/d/optout.
b.out
Description: Binary data
a.out
Description: Binary data
#include "common.h"

void g()
{
}

int main()
{
    mp_limb_t x[t+1], y, z[t+1];
    mpn(x, y, z);
}
#include <mpir.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

const int t = 5;
const int n = 1e8;

void f();

void mpn(mp_limb_t* x, mp_limb_t y, mp_limb_t* z)
{
    //f();
    //mp_limb_t x[t+1], y, z[t+1];
    struct timespec start, stop;
    clock_gettime(CLOCK_REALTIME, &start);
    for (int i = 0; i < n; i++)
        z[t] = mpn_addmul_1(z, x, t, y);
    clock_gettime(CLOCK_REALTIME, &stop);
    printf("Time: %f\n", 1e-9 * (stop.tv_nsec - start.tv_nsec) + (stop.tv_sec - start.tv_sec));
}

void f()
{
    printf("mpn: %x\n", mpn);
    printf("add_mul_1: %x\n", __gmpn_addmul_1);
    printf("diff: %x\n", (long)__gmpn_addmul_1 - (long)mpn);
}
#include "common.h"

int main()
{
    mp_limb_t x[t+1], y, z[t+1];
    mpn(x, y, z);
}
addmul_1.opt.s
Description: Binary data
addmul_1.s
Description: Binary data