Hello, I'm not sure whether this is better filed as a missed optimization, but there is a gap in the functionality of GCC's __sync_* function family.
When implementing a reference-counting smart pointer, two operations are of crucial importance:

    void __sync_increment(T* p);
    bool __sync_decrement_iszero(T* p);

The former simply increments the location pointed to by p; the latter decrements it and returns true if and only if the result is zero. Both can be implemented in terms of the existing __sync functions (and what can't be, since there is __sync_bool_compare_and_swap()?), e.g.:

    void __sync_increment(T* p) {
        __sync_fetch_and_add(p, 1);
    }

    bool __sync_decrement_iszero(T* p) {
        return __sync_fetch_and_add(p, -1) == 1;
    }

Unfortunately, on x86/x64 both are compiled in a rather poor way. The increment becomes:

    lock addl $0x1, (ptr)

which is longer than:

    lock incl (ptr)

The decrement becomes:

    movl $-1, %rA
    lock xadd %rA, (ptr)
    cmpl $0x1, %rA
    je/jne ...

which is undoubtedly longer than "lock decl" and wastes a register.

I can implement the increment optimally with a bit of inline assembly, but the decrement is not so lucky, as there is no way to inform the compiler that the result is in the flags register. One must retreat to something like:

    lock decl (ptr)
    sete %rA

which GCC will then use to perform the comparison in an if(), emitting:

    lock decl (ptr)
    sete %rA
    testb %rA, %rA
    je/jne ...

which is hardly an improvement. On the other hand, the __sync functions integrate perfectly with the flags (i.e. pairs like cmpxchg/jne), so implementing these changes in the compiler offers far better opportunities to emit an optimal sequence than inline assembly ever can.

As my code is to a high degree propelled by atomic operations, I would like to ask you to provide these functions, or to tweak the optimizer so that it recognizes the aforementioned idioms.

There is also no generic __sync_exchange() -- quite an important operation in lock-free programming.
It could be implemented in terms of compare-and-swap, but many platforms have native support for it, so it should be exposed at the API level; tweaking the optimizer is not the proper way, IMHO.

Best regards,
Piotr Wydersk