qspinlock: Use generic smp_cond_load_relaxed

Nicholas Piggin Mon, 15 Mar 2021 22:00:07 -0700

Excerpts from Davidlohr Bueso's message of March 9, 2021 11:59 am:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-waiting pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
> spinlocks, queued spinlocks benefit more being kept in medium priority,
> with a cpu_relax() instead, being a low+medium combo on powerpc.


Thanks for tracking this down and the comprehensive results, great
work.

It's only a relatively recent patch, so I think the revert is a
good idea (i.e., don't keep it around for possibly other code to
hit problems with).

One request, could you add a comment in place that references
smp_cond_load_relaxed() so this commit can be found again if
someone looks at it? Something like this

/*
 * smp_cond_load_relaxed was found to have performance problems if
 * implemented with spin_begin()/spin_end().
 */

I wonder if it should have a Fixes: tag to the original commit as
well.

Otherwise,

Acked-by: Nicholas Piggin <npig...@gmail.com>

Thanks,
Nick

> 
> Data is from three benchmarks on a Power9: 9008-22L 64 CPUs with
> 2 sockets and 8 threads per core.
> 
> 1. locktorture.
> 
> This is data for the lowest and most artificial/pathological level,
> with increasing thread counts pounding on the lock. Metrics are total
> ops/minute. Despite some small hits in the 4-8 range, scenarios are
> either neutral or favorable to this patch.
> 
> +=========+==========+==========+=======+
> | # tasks | vanilla  | dirty    | %diff |
> +=========+==========+==========+=======+
> | 2       | 46718565 | 48751350 | 4.35  |
> +---------+----------+----------+-------+
> | 4       | 51740198 | 50369082 | -2.65 |
> +---------+----------+----------+-------+
> | 8       | 63756510 | 62568821 | -1.86 |
> +---------+----------+----------+-------+
> | 16      | 67824531 | 70966546 | 4.63  |
> +---------+----------+----------+-------+
> | 32      | 53843519 | 61155508 | 13.58 |
> +---------+----------+----------+-------+
> | 64      | 53005778 | 53104412 | 0.18  |
> +---------+----------+----------+-------+
> | 128     | 53331980 | 54606910 | 2.39  |
> +=========+==========+==========+=======+
> 
> 2. sockperf (tcp throughput)
> 
> Here a client will do one-way throughput tests to a localhost server, with
> increasing message sizes, dealing with the sk_lock. This patch shows to put
> the performance of the qspinlock back to par with that of the simple lock:
> 
>                    simple-spinlock           vanilla                  dirty
> Hmean     14        73.50 (   0.00%)       54.44 * -25.93%*       73.45 * 
> -0.07%*
> Hmean     100      654.47 (   0.00%)      385.61 * -41.08%*      771.43 * 
> 17.87%*
> Hmean     300     2719.39 (   0.00%)     2181.67 * -19.77%*     2666.50 * 
> -1.94%*
> Hmean     500     4400.59 (   0.00%)     3390.77 * -22.95%*     4322.14 * 
> -1.78%*
> Hmean     850     6726.21 (   0.00%)     5264.03 * -21.74%*     6863.12 * 
> 2.04%*
> 
> 3. dbench (tmpfs)
> 
> Configured to run with up to ncpusx8 clients, it shows both latency and
> throughput metrics. For the latency, with the exception of the 64 case,
> there is really nothing to go by:
>                                    vanilla                dirty
> Amean     latency-1          1.67 (   0.00%)        1.67 *   0.09%*
> Amean     latency-2          2.15 (   0.00%)        2.08 *   3.36%*
> Amean     latency-4          2.50 (   0.00%)        2.56 *  -2.27%*
> Amean     latency-8          2.49 (   0.00%)        2.48 *   0.31%*
> Amean     latency-16         2.69 (   0.00%)        2.72 *  -1.37%*
> Amean     latency-32         2.96 (   0.00%)        3.04 *  -2.60%*
> Amean     latency-64         7.78 (   0.00%)        8.17 *  -5.07%*
> Amean     latency-512      186.91 (   0.00%)      186.41 *   0.27%*
> 
> For the dbench4 Throughput (misleading but traditional) there's a small
> but rather constant improvement:
> 
>                            vanilla                dirty
> Hmean     1        849.13 (   0.00%)      851.51 *   0.28%*
> Hmean     2       1664.03 (   0.00%)     1663.94 *  -0.01%*
> Hmean     4       3073.70 (   0.00%)     3104.29 *   1.00%*
> Hmean     8       5624.02 (   0.00%)     5694.16 *   1.25%*
> Hmean     16      9169.49 (   0.00%)     9324.43 *   1.69%*
> Hmean     32     11969.37 (   0.00%)    12127.09 *   1.32%*
> Hmean     64     15021.12 (   0.00%)    15243.14 *   1.48%*
> Hmean     512    14891.27 (   0.00%)    15162.11 *   1.82%*
> 
> Measuring the dbench4 Per-VFS Operation latency, shows some very minor
> differences within the noise level, around the 0-1% ranges.
> 
> Signed-off-by: Davidlohr Bueso <dbu...@suse.de>
> ---
>  arch/powerpc/include/asm/barrier.h | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h 
> b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -80,22 +80,6 @@ do {                                                       
>                 \
>       ___p1;                                                          \
>  })
>  
> -#ifdef CONFIG_PPC64
> -#define smp_cond_load_relaxed(ptr, cond_expr) ({             \
> -     typeof(ptr) __PTR = (ptr);                              \
> -     __unqual_scalar_typeof(*ptr) VAL;                       \
> -     VAL = READ_ONCE(*__PTR);                                \
> -     if (unlikely(!(cond_expr))) {                           \
> -             spin_begin();                                   \
> -             do {                                            \
> -                     VAL = READ_ONCE(*__PTR);                \
> -             } while (!(cond_expr));                         \
> -             spin_end();                                     \
> -     }                                                       \
> -     (typeof(*ptr))VAL;                                      \
> -})
> -#endif
> -
>  #ifdef CONFIG_PPC_BOOK3S_64
>  #define NOSPEC_BARRIER_SLOT   nop
>  #elif defined(CONFIG_PPC_FSL_BOOK3E)
> -- 
> 2.26.2
> 
>

Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

Reply via email to