Hello,

I'm not sure whether this would be better handled as a missed
optimization, but there is a certain lack of functionality in GCC's
__sync_* function family.

When implementing a reference counting smart pointer, two operations
are of crucial importance:

    void __sync_increment(T* p);
    bool __sync_decrement_iszero(T* p);

The former only increments the location pointed to by p, the latter decrements
it and returns true if and only if the result was zero.

Both can be implemented in terms of existing __sync functions (and what
can't? -- since there is __sync_bool_compare_and_swap()), e.g.:

    void __sync_increment(T* p) {
        __sync_fetch_and_add(p, 1);
    }

    bool __sync_decrement_iszero(T* p) {
        /* fetch_and_add returns the old value: 1 means the result is 0 */
        return __sync_fetch_and_add(p, -1) == 1;
    }
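
To show where the two operations sit in a reference-counted object,
here is a minimal sketch in C. The struct and all function names
(refobj, ref_increment, ref_decrement_iszero, refobj_new,
refobj_acquire, refobj_release) are hypothetical and chosen only for
illustration; T is assumed to be int.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reference-counted object; names are illustrative only. */
struct refobj {
    int refs;     /* reference count, 1 right after creation */
    int payload;
};

/* The two operations from the text, built on the existing builtins. */
static inline void ref_increment(int *p)
{
    __sync_fetch_and_add(p, 1);
}

static inline bool ref_decrement_iszero(int *p)
{
    /* The builtin returns the value held before the addition. */
    return __sync_fetch_and_add(p, -1) == 1;
}

struct refobj *refobj_new(int payload)
{
    struct refobj *o = malloc(sizeof *o);
    if (o) {
        o->refs = 1;
        o->payload = payload;
    }
    return o;
}

void refobj_acquire(struct refobj *o)
{
    ref_increment(&o->refs);
}

/* Drops one reference; returns true if this call freed the object. */
bool refobj_release(struct refobj *o)
{
    if (ref_decrement_iszero(&o->refs)) {
        free(o);
        return true;
    }
    return false;
}
```

Only the thread that observes the transition to zero frees the object,
which is why the decrement has to report the result atomically.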

Unfortunately, on x86/x64 both are compiled in a rather poor way:

__sync_increment:

    lock addl $0x01, (ptr)

which is longer than:

   lock incl (ptr)

__sync_decrement_iszero:

    movl $-1, %rA
    lock xadd %rA, (ptr)
    cmpl $0x01, %rA
    je/jne...

which is undoubtedly longer than "lock decl" and wastes a register.
I can implement the increment function optimally with a bit of inline
assembly, but decrement is not so lucky, as there is no way to
inform the compiler that the result is in the flags register. One must
fall back to something like this:

    lock decl (ptr)
    sete %rA

which GCC will then use to perform the comparison in an if(), emitting:

    lock decl (ptr)
    sete %rA
    testb %rA, %rA
    je/jne...

which is hardly an improvement. On the other hand, the __sync functions
integrate perfectly with the flag system (i.e. pairs like cmpxchg/jne),
so implementing the changes in the compiler gives far better
opportunities to emit an optimal sequence than inline assembly can.
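
For reference, the inline assembly workaround described above could
look roughly like the sketch below. The function names increment_asm
and decrement_iszero_asm are hypothetical, T is assumed to be int, and
on non-x86 targets the sketch simply falls back to the existing
builtins.

```c
#include <stdbool.h>

/* Hypothetical helper: atomic increment via "lock incl" on x86. */
static inline void increment_asm(int *p)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("lock; incl %0"
                         : "+m"(*p)
                         :
                         : "memory", "cc");
#else
    __sync_fetch_and_add(p, 1);
#endif
}

/* Hypothetical helper: atomic decrement, true if the result is zero. */
static inline bool decrement_iszero_asm(int *p)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned char zero;
    /* The zero flag has to be copied into a register with sete,
       because the compiler cannot see the flags set by the asm. */
    __asm__ __volatile__("lock; decl %0\n\t"
                         "sete %1"
                         : "+m"(*p), "=q"(zero)
                         :
                         : "memory", "cc");
    return zero;
#else
    return __sync_sub_and_fetch(p, 1) == 0;
#endif
}
```

A caller writing if (decrement_iszero_asm(&count)) then gets the extra
sete/testb pair shown above, since the flag state is lost at the asm
boundary.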

As my code is to a high degree propelled by atomic power, I would like to
ask you to provide these functions or tweak the optimizer in order to notice
the aforementioned idioms.

There is also no generic __sync_exchange() -- quite an important
operation in lock-free programming. It could be implemented in terms of
compare_exchange, but many platforms have native support for it, and
thus it should be exposed at the API level; tweaking the optimizer is
not the proper way IMHO.
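
The compare-and-swap fallback mentioned above can be sketched as
follows. The name exchange_via_cas is hypothetical and T is assumed to
be int; this is only an illustration of the loop, not a proposal for
the actual builtin.

```c
/* Hypothetical helper: stores newval into *p and returns the value it
   replaced, using only the existing compare-and-swap builtin. */
static inline int exchange_via_cas(int *p, int newval)
{
    /* Initial guess at the current value; the loop corrects it. */
    int old = *(volatile int *)p;

    for (;;) {
        /* Returns the value that was in *p before the operation. */
        int seen = __sync_val_compare_and_swap(p, old, newval);
        if (seen == old)
            return old;   /* swap succeeded */
        old = seen;       /* another thread changed *p, retry */
    }
}
```

On a platform with a native exchange instruction the loop is pure
overhead, which is the argument for exposing the operation directly.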

    Best regards,
    Piotr Wydersk
