On Mon, Feb 02, 2015 at 03:29:09PM -0800, Andrew Morton wrote:
> On Mon, 2 Feb 2015 11:55:03 +0800 "Wang, Yalin" <yalin.w...@sonymobile.com> 
> wrote:
> 
> > This patch changes the non-atomic bitops to test the bit with an
> > if () condition before setting/clearing it, so that we don't dirty
> > the cache line when the bit is already set/cleared. On SMP systems,
> > dirtying a cache line forces the other processors to invalidate
> > their copies of that line, which has some performance impact.
> > 
> > --- a/include/asm-generic/bitops/non-atomic.h
> > +++ b/include/asm-generic/bitops/non-atomic.h
> > @@ -17,7 +17,9 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
> >     unsigned long mask = BIT_MASK(nr);
> >     unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
> >  
> > -   *p  |= mask;
> > +   if ((*p & mask) == 0)
> > +           *p  |= mask;
> > +
> >  }
> 
> hm, maybe.
> 
> It will speed up set_bit on an already-set bit.  But it will slow down
> set_bit on a not-set bit.  And the latter case is presumably much, much
> more common.
> 
> How do we know the patch is a net performance gain?
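For reference, the change boils down to the following pattern (a userspace
sketch of the generic __set_bit(), not the kernel code itself; BIT_MASK and
BIT_WORD reimplemented here for a standalone build):

```c
#include <limits.h>

#define BITS_PER_LONG	(CHAR_BIT * sizeof(unsigned long))
#define BIT_MASK(nr)	(1UL << ((nr) % BITS_PER_LONG))
#define BIT_WORD(nr)	((nr) / BITS_PER_LONG)

/* Unconditional store: always dirties the cache line holding *p. */
static void set_bit_plain(int nr, unsigned long *addr)
{
	unsigned long mask = BIT_MASK(nr);
	unsigned long *p = addr + BIT_WORD(nr);

	*p |= mask;
}

/* Guarded store: skips the write, and therefore the dirtying,
 * when the bit is already set. */
static void set_bit_checked(int nr, unsigned long *addr)
{
	unsigned long mask = BIT_MASK(nr);
	unsigned long *p = addr + BIT_WORD(nr);

	if ((*p & mask) == 0)
		*p |= mask;
}
```

Both variants leave the bitmap in the same state; the only difference is
whether the store is issued for an already-set bit.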

Let's try to measure. The microbenchmark:

        #include <stdio.h>
        #include <time.h>
        #include <sys/mman.h>

        #ifdef CACHE_HOT
        #define SIZE (2UL << 20)
        #define TIMES 10000000
        #else
        #define SIZE (1UL << 30)
        #define TIMES 10000
        #endif

        int main(int argc, char **argv)
        {
                struct timespec a, b, diff;
                unsigned long i, *p, times = TIMES;

                p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
                if (p == MAP_FAILED)
                        return 1;

                clock_gettime(CLOCK_MONOTONIC, &a);
                while (times--) {
                        for (i = 0; i < SIZE/64/sizeof(*p); i++) {
        #ifdef CHECK_BEFORE_SET
                                if (p[i] != times)
        #endif
                                        p[i] = times;
                        }
                }
                clock_gettime(CLOCK_MONOTONIC, &b);

                diff.tv_sec = b.tv_sec - a.tv_sec;
                if (a.tv_nsec > b.tv_nsec) {
                        diff.tv_sec--;
                        diff.tv_nsec = 1000000000 + b.tv_nsec - a.tv_nsec;
                } else
                        diff.tv_nsec = b.tv_nsec - a.tv_nsec;

                printf("%lu.%09lu\n", diff.tv_sec, diff.tv_nsec);
                return 0;
        }

Results for 10 runs on my laptop -- i5-3427U (Ivy Bridge, 1.8 GHz, 2.8 GHz
Turbo, 3MB LLC):

                                Avg             Stddev
baseline                        21.5351         0.5315
-DCHECK_BEFORE_SET              21.9834         0.0789
-DCACHE_HOT                     14.9987         0.0365
-DCACHE_HOT -DCHECK_BEFORE_SET  29.9010         0.0204

The difference between -DCACHE_HOT and -DCACHE_HOT -DCHECK_BEFORE_SET looks
huge, but recalculated to CPU cycles per inner-loop iteration at 2.8 GHz,
it's 1.02530 vs. 2.04401 CPU cycles respectively.

Basically, the check is free on a decent CPU.

-- 
 Kirill A. Shutemov