https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066
--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #7)
> Instead of generating:
>
>         movl    f(%rip), %eax
> .L2:
>         movd    %eax, %xmm0
>         addss   .LC0(%rip), %xmm0
>         movd    %xmm0, %edx
>         lock cmpxchgl   %edx, f(%rip)
>         jne     .L2
>         ret
>
> we want
>
>         movl    f(%rip), %eax
> .L2:
>         movd    %eax, %xmm0
>         addss   .LC0(%rip), %xmm0
>         movd    %xmm0, %edx
>         cmpl    f(%rip), %eax
>         jne     .L2
>         lock cmpxchgl   %edx, f(%rip)
>         jne     .L2
>         ret

No, certainly not. The initial mov, or the value remembered from the previous
lock cmpxchgl, already holds the right value unless the atomic memory is
extremely contended, so you don't want to add the non-atomic comparison in
between.

Not to mention that the way you've written it totally breaks it: if the memory
is not equal to the expected value, you should get the current value back
(a failing lock cmpxchgl loads it into %eax), and the added cmpl/jne pair
does not do that. With the above code, if f is modified by another thread
between the initial movl f(%rip), %eax and the cmpl f(%rip), %eax and never
after it, %eax stays stale and it will loop forever.

I believe what the above paper is talking about should be addressed by users of
these intrinsics if they care and if it is beneficial (e.g. depending on extra
information about how much the lock is contended; in OpenMP one has the
omp_sync_hint_* constants one can use in the hint clause to tell whether the
lock is contended, uncontended, speculative, non-speculative or unknown).
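For illustration only, here is a minimal C11 sketch of the two loop shapes discussed above. It is not what GCC emits and not code from this report: the names f_bits, atomic_add_float and atomic_add_float_checked are hypothetical, and the float is stored as its 32-bit pattern so that plain <stdatomic.h> integer operations can be used. The first function is the ordinary compare-and-swap loop, where a failed exchange refreshes expected with the current value. The second shows how a user who wants a non-atomic check before the locked instruction could write it without the livelock, by refreshing expected when the check fails.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical shared variable: the bit pattern of the float "f". */
static _Atomic uint32_t f_bits;

static float bits_to_float(uint32_t b) { float x; memcpy(&x, &b, sizeof x); return x; }
static uint32_t float_to_bits(float x) { uint32_t b; memcpy(&b, &x, sizeof b); return b; }

/* Ordinary CAS loop: one load up front, then retry until the exchange succeeds. */
static float atomic_add_float(_Atomic uint32_t *p, float delta)
{
    uint32_t expected = atomic_load_explicit(p, memory_order_relaxed);
    uint32_t desired;
    do {
        desired = float_to_bits(bits_to_float(expected) + delta);
        /* On failure, expected is overwritten with the value currently in *p. */
    } while (!atomic_compare_exchange_weak_explicit(p, &expected, desired,
                                                    memory_order_seq_cst,
                                                    memory_order_relaxed));
    return bits_to_float(desired);
}

/* User-side variant with a plain load before the CAS (an assumption about
   what a caller might want under heavy contention). */
static float atomic_add_float_checked(_Atomic uint32_t *p, float delta)
{
    uint32_t expected = atomic_load_explicit(p, memory_order_relaxed);
    for (;;) {
        uint32_t current = atomic_load_explicit(p, memory_order_relaxed);
        if (current != expected) {
            /* Refresh the expected value; retrying with a stale one would
               never make progress once other threads stop writing. */
            expected = current;
            continue;
        }
        uint32_t desired = float_to_bits(bits_to_float(expected) + delta);
        if (atomic_compare_exchange_weak_explicit(p, &expected, desired,
                                                  memory_order_seq_cst,
                                                  memory_order_relaxed))
            return bits_to_float(desired);
    }
}
```

Whether the extra load in the second variant helps depends on how contended the location is, which is information the caller has and the compiler generally does not.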