https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245
Bug ID: 71245
Summary: std::atomic<double> load/store bounces the data to the stack using fild/fistp
Product: gcc
Version: 6.1.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu

Same result with gcc 4.8, gcc5, and gcc6.1. Didn't test exhaustively.

#include <atomic>
std::atomic<double> d(5.0);

void foo_d(void) {
  d = d + 1.0;
  // d += 1.0;   // unimplemented
}

With gcc6.1 -m32 -O3 -march=i586 (https://godbolt.org/g/w3VKpG) this compiles to

foo_d():
        subl    $20, %esp
        fildq   d
        fistpq  (%esp)          # copy d's bits to the stack (with no loss)
        fldl    (%esp)
        fadds   .LC0
        fstpl   (%esp)          # store d + 1.0 to the stack
        fildq   (%esp)
        fistpq  d               # atomic store using fild/fistp
        lock orl $0, (%esp)     # mfence equivalent
        addl    $20, %esp
        ret

I assume fild/fistp is gcc's trick for implementing 64-bit atomic loads and stores without resorting to cmpxchg8b. Clever, since the 80-bit x87 format can represent every 64-bit integer exactly, so the round trip can't munge the data.

The fild/fistp pairs are of course not necessary in this case, where the data is already a 64-bit float. The function should just fld / fstp to d directly.

With -march=i486 or lower, gcc correctly doesn't assume that 64-bit FP loads/stores are atomic, so it calls a library function to do the atomic load and store. With SSE or SSE2 available, it uses an SSE load/store to copy to the stack. With -msse2 and -mfpmath=sse, we finally load/store directly from/to d, with movq / addsd / movq. movq vs. movsd shouldn't make a performance difference, I think.

--------

We also don't need to allocate any stack space. We could implement the StoreLoad barrier with lock orl $0, -4(%esp) instead of reserving extra stack; the offset below %esp avoids doing the locked operation on our return address (which would introduce extra store-forwarding delay before the ret could eventually retire).