RE: [tip:locking/core] refcount_t: Introduce a special purpose refcount type

Reshetova, Elena Wed, 15 Feb 2017 03:18:07 -0800

> * Reshetova, Elena <[email protected]> wrote:
> 
> > > Subject: refcount: Out-of-line everything
> > > From: Peter Zijlstra <[email protected]>
> > > Date: Fri Feb 10 16:27:52 CET 2017
> > >
> > > Linus asked to please make this real C code.
> >
> > Perhaps a completely stupid question, but I am going to ask anyway since
> > only this way I can learn. What a real difference it makes? Or are we
> > talking more about styling and etc.? Is it because of size concerns? This
> > way it is certainly now done differently than rest of atomic and kref...
> 
> It's about generated code size mostly.
> 
> This inline function is way too large to be inlined:
> 
> static inline __refcount_check
> bool refcount_add_not_zero(unsigned int i, refcount_t *r)
> {
>       unsigned int old, new, val = atomic_read(&r->refs);
> 
>       for (;;) {
>               if (!val)
>                       return false;
> 
>               if (unlikely(val == UINT_MAX))
>                       return true;
> 
>               new = val + i;
>               if (new < val)
>                       new = UINT_MAX;
>               old = atomic_cmpxchg_relaxed(&r->refs, val, new);
>               if (old == val)
>                       break;
> 
>               val = old;
>       }
> 
>       REFCOUNT_WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
> 
>       return true;
> }
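
To check my reading of the loop above, here is a minimal user-space sketch of the same saturating add-unless-zero logic using C11 atomics. The names demo_refcount_t and demo_refcount_add_not_zero are made up for illustration, the REFCOUNT_WARN is left out, and mapping atomic_cmpxchg_relaxed to memory_order_relaxed is my assumption. It is not the kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's refcount_t. */
typedef struct {
    atomic_uint refs;
} demo_refcount_t;

/*
 * Add i to the counter unless the counter is zero.
 * Saturates at UINT_MAX instead of wrapping around.
 * Returns false only when the counter was already zero.
 */
bool demo_refcount_add_not_zero(unsigned int i, demo_refcount_t *r)
{
    unsigned int val, next;

    assert(r != NULL);
    val = atomic_load_explicit(&r->refs, memory_order_relaxed);

    do {
        if (val == 0)
            return false;       /* object is already gone */
        if (val == UINT_MAX)
            return true;        /* saturated: the count stays pinned */

        next = val + i;
        if (next < val)         /* unsigned overflow detected */
            next = UINT_MAX;
        /* On failure the call stores the current value into val and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(&r->refs, &val, next,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return true;
}
```

Even in this reduced form there are two early exits, an overflow check and a compare-and-swap retry loop, which makes it easier to see where the bytes at each call site come from.
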
> 
> When used then this function generates this much code on x86-64 defconfig:
> 
> 00000000000084d0 <test>:
>     84d0:     8b 0f                   mov    (%rdi),%ecx
>     84d2:     55                      push   %rbp
>     84d3:     48 89 e5                mov    %rsp,%rbp
> 
>     84d6:     85 c9                   test   %ecx,%ecx                |
>     84d8:     74 30                   je     850a <test+0x3a>         |
>     84da:     83 f9 ff                cmp    $0xffffffff,%ecx         |
>     84dd:     be ff ff ff ff          mov    $0xffffffff,%esi         |
>     84e2:     75 04                   jne    84e8 <test+0x18>         |
>     84e4:     eb 1d                   jmp    8503 <test+0x33>         |
>     84e6:     89 c1                   mov    %eax,%ecx                |
>     84e8:     8d 51 01                lea    0x1(%rcx),%edx           |
>     84eb:     89 c8                   mov    %ecx,%eax                |
>     84ed:     39 ca                   cmp    %ecx,%edx                |
>     84ef:     0f 42 d6                cmovb  %esi,%edx                |
>     84f2:     f0 0f b1 17             lock cmpxchg %edx,(%rdi)        |
>     84f6:     39 c8                   cmp    %ecx,%eax                |
>     84f8:     74 09                   je     8503 <test+0x33>         |
>     84fa:     85 c0                   test   %eax,%eax                |
>     84fc:     74 0c                   je     850a <test+0x3a>         |
>     84fe:     83 f8 ff                cmp    $0xffffffff,%eax         |
>     8501:     75 e3                   jne    84e6 <test+0x16>         |
>     8503:     b8 01 00 00 00          mov    $0x1,%eax                |
> 
>     8508:     5d                      pop    %rbp
>     8509:     c3                      retq
>     850a:     31 c0                   xor    %eax,%eax
>     850c:     5d                      pop    %rbp
>     850d:     c3                      retq
> 
> 
> (I've annotated the fastpath impact with '|'. Out of line code generally
> does not count.)
> 
> It's ~50 bytes of code per usage site. It might be better in some cases,
> but not by much.
> 
> This is way above any sane inlining threshold. The 'unconditionally good'
> inlining threshold is at 1-2 instructions and below ~10 bytes of code.
> 
> So for example refcount_set() and refcount_read() can stay inlined:
> 
> static inline void refcount_set(refcount_t *r, unsigned int n)
> {
>       atomic_set(&r->refs, n);
> }
> 
> static inline unsigned int refcount_read(const refcount_t *r)
> {
>       return atomic_read(&r->refs);
> }
> 
> 
> ... because they compile into a single instruction with 2-5 bytes I$ overhead.
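
If I understand the resulting split correctly, it could be sketched in user space roughly as below: the trivial accessors stay static inline, while the cmpxchg loop gets a single out-of-line definition that call sites reach through a plain function call. All sketch_* names are hypothetical, C11 atomics stand in for the kernel's atomic_t, and both parts are shown in one file only to keep the example self-contained. In a real tree the declaration would sit in a header and the definition in a .c file.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for refcount_t. */
typedef struct {
    atomic_uint refs;
} sketch_refcount_t;

/* "Header" part: accessors that boil down to one load or store stay inline. */
static inline void sketch_refcount_set(sketch_refcount_t *r, unsigned int n)
{
    atomic_store_explicit(&r->refs, n, memory_order_relaxed);
}

static inline unsigned int sketch_refcount_read(sketch_refcount_t *r)
{
    return atomic_load_explicit(&r->refs, memory_order_relaxed);
}

/* The header would carry only this declaration for the larger operation. */
bool sketch_refcount_inc_not_zero(sketch_refcount_t *r);

/* ".c file" part: one shared copy of the retry loop for every caller. */
bool sketch_refcount_inc_not_zero(sketch_refcount_t *r)
{
    unsigned int val = sketch_refcount_read(r);

    do {
        if (val == 0)
            return false;   /* never resurrect a zero count */
        if (val == UINT_MAX)
            return true;    /* saturated: leave the count pinned */
        /* val < UINT_MAX here, so val + 1 cannot wrap. */
    } while (!atomic_compare_exchange_weak_explicit(&r->refs, &val, val + 1,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return true;
}
```

Each caller of sketch_refcount_inc_not_zero() then pays for a call instruction rather than a full copy of the loop.
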
> 
> Thanks,
> 
>       Ingo


Thank you very much, Ingo, for such a detailed and nice explanation!

Best Regards,
Elena
