Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-18 Thread Davidlohr Bueso

On Tue, 16 Mar 2021, Nicholas Piggin wrote:


One request, could you add a comment in place that references
smp_cond_load_relaxed() so this commit can be found again if
someone looks at it? Something like this

/*
 * smp_cond_load_relaxed was found to have performance problems if
 * implemented with spin_begin()/spin_end().
 */


Sure, let me see where I can fit that in and send out a v2.

Similarly, but unrelated to this patch, is there any chance we could
remove the whole spin_until_cond() machinery and make it specific to
powerpc? This was introduced in 2017 and doesn't really have any users
outside of powerpc, except for these:

drivers/firmware/arm_scmi/driver.c: spin_until_cond(scmi_xfer_done_no_timeout(cinfo, xfer, stop));
drivers/firmware/arm_scmi/shmem.c:  spin_until_cond(ioread32(&shmem->channel_status) &
drivers/net/ethernet/xilinx/ll_temac_main.c: spin_until_cond(hard_acs_rdy_or_timeout(lp, timeout));

... of which afaict only the xilinx one can actually be built on powerpc.
Regardless, these could be converted to smp_cond_load_relaxed(), which is
the more standard way to do optimized busy-waiting and cares about the
family of barriers rather than ad-hoc SMT priorities. Of course, I have
no way of testing any of these changes.
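For reference, the machinery in question in include/linux/processor.h
looks roughly like this (a sketch from memory, not a verbatim copy; on
everything but powerpc, spin_begin()/spin_end() are no-ops and
spin_cpu_relax() falls back to cpu_relax()):

#define spin_until_cond(cond)					\
do {								\
	if (unlikely(!(cond))) {				\
		/* lower the SMT priority while busy-waiting */	\
		spin_begin();					\
		do {						\
			spin_cpu_relax();			\
		} while (!(cond));				\
		/* condition met, restore the priority */	\
		spin_end();					\
	}							\
} while (0)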


I wonder if it should have a Fixes: tag to the original commit as
well.


I'm not sure either. I've actually been informed recently of other
workloads that benefit from the revert on large Power9 boxes. So I'll
go ahead and add it.
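Presumably something like:

Fixes: 49a7d46a06c3 ("powerpc: Implement smp_cond_load_relaxed()")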



Otherwise,

Acked-by: Nicholas Piggin 


Thanks,
Davidlohr


Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-15 Thread Nicholas Piggin
Excerpts from Davidlohr Bueso's message of March 9, 2021 11:59 am:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-wait pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
> spinlocks, queued spinlocks benefit more from being kept at medium
> priority, using a cpu_relax() instead, which is a low+medium combo on powerpc.

Thanks for tracking this down and the comprehensive results, great
work.

It's only a relatively recent patch, so I think the revert is a
good idea (i.e., don't keep it around for possibly other code to
hit problems with).

One request, could you add a comment in place that references
smp_cond_load_relaxed() so this commit can be found again if
someone looks at it? Something like this

/*
 * smp_cond_load_relaxed was found to have performance problems if
 * implemented with spin_begin()/spin_end().
 */

I wonder if it should have a Fixes: tag to the original commit as
well.

Otherwise,

Acked-by: Nicholas Piggin 

Thanks,
Nick

> 
> Data is from three benchmarks on a Power9 9008-22L: 64 CPUs with
> 2 sockets and 8 threads per core.
> 
> 1. locktorture.
> 
> This is data for the lowest and most artificial/pathological level,
> with increasing thread counts pounding on the lock. Metrics are total
> ops/minute. Despite some small hits in the 4-8 range, scenarios are
> either neutral or favorable to this patch.
> 
> +=========+==========+==========+=======+
> | # tasks | vanilla  | dirty    | %diff |
> +=========+==========+==========+=======+
> | 2       | 46718565 | 48751350 | 4.35  |
> +---------+----------+----------+-------+
> | 4       | 51740198 | 50369082 | -2.65 |
> +---------+----------+----------+-------+
> | 8       | 63756510 | 62568821 | -1.86 |
> +---------+----------+----------+-------+
> | 16      | 67824531 | 70966546 | 4.63  |
> +---------+----------+----------+-------+
> | 32      | 53843519 | 61155508 | 13.58 |
> +---------+----------+----------+-------+
> | 64      | 53005778 | 53104412 | 0.18  |
> +---------+----------+----------+-------+
> | 128     | 53331980 | 54606910 | 2.39  |
> +=========+==========+==========+=======+
> 
> 2. sockperf (tcp throughput)
> 
> Here a client does one-way throughput tests to a localhost server, with
> increasing message sizes, contending on the sk_lock. This patch brings
> the performance of the qspinlock back to par with that of the simple spinlock:
>
>                  simple-spinlock           vanilla               dirty
> Hmean 14      73.50 (   0.00%)     54.44 * -25.93%*    73.45 *  -0.07%*
> Hmean 100    654.47 (   0.00%)    385.61 * -41.08%*   771.43 *  17.87%*
> Hmean 300   2719.39 (   0.00%)   2181.67 * -19.77%*  2666.50 *  -1.94%*
> Hmean 500   4400.59 (   0.00%)   3390.77 * -22.95%*  4322.14 *  -1.78%*
> Hmean 850   6726.21 (   0.00%)   5264.03 * -21.74%*  6863.12 *   2.04%*
> 
> 3. dbench (tmpfs)
> 
> Configured to run with up to ncpus*8 clients, it reports both latency and
> throughput metrics. For the latency, with the exception of the 64-client
> case, there is really nothing of note:
>
>                        vanilla                dirty
> Amean latency-1     1.67 (   0.00%)     1.67 *   0.09%*
> Amean latency-2     2.15 (   0.00%)     2.08 *   3.36%*
> Amean latency-4     2.50 (   0.00%)     2.56 *  -2.27%*
> Amean latency-8     2.49 (   0.00%)     2.48 *   0.31%*
> Amean latency-16    2.69 (   0.00%)     2.72 *  -1.37%*
> Amean latency-32    2.96 (   0.00%)     3.04 *  -2.60%*
> Amean latency-64    7.78 (   0.00%)     8.17 *  -5.07%*
> Amean latency-512 186.91 (   0.00%)   186.41 *   0.27%*
> 
> For the dbench4 Throughput (misleading but traditional) there's a small
> but rather consistent improvement:
>
>                       vanilla                dirty
> Hmean 1      849.13 (   0.00%)    851.51 *   0.28%*
> Hmean 2     1664.03 (   0.00%)   1663.94 *  -0.01%*
> Hmean 4     3073.70 (   0.00%)   3104.29 *   1.00%*
> Hmean 8     5624.02 (   0.00%)   5694.16 *   1.25%*
> Hmean 16    9169.49 (   0.00%)   9324.43 *   1.69%*
> Hmean 32   11969.37 (   0.00%)  12127.09 *   1.32%*
> Hmean 64   15021.12 (   0.00%)  15243.14 *   1.48%*
> Hmean 512  14891.27 (   0.00%)  15162.11 *   1.82%*
> 
> Measuring the dbench4 per-VFS operation latency shows only very minor
> differences within the noise level, in the 0-1% range.
> 
> Signed-off-by: Davidlohr Bueso 
> ---
>  arch/powerpc/include/asm/barrier.h | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- 

Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-09 Thread Michal Suchánek
On Tue, Mar 09, 2021 at 07:46:11AM -0800, Davidlohr Bueso wrote:
> On Tue, 09 Mar 2021, Michal Suchánek wrote:
> 
> > On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
> > > 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> > > busy-wait pausing with a preferred SMT priority pattern, lowering
> > > the priority (reducing decode cycles) during the whole loop slowpath.
> > > 
> > > However, data shows that while this pattern works well with simple
> >  ^^
> > > spinlocks, queued spinlocks benefit more from being kept at medium
> > > priority, using a cpu_relax() instead, which is a low+medium combo on powerpc.
> > ...
> > > 
> > > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > > index aecfde829d5d..7ae29cfb06c0 100644
> > > --- a/arch/powerpc/include/asm/barrier.h
> > > +++ b/arch/powerpc/include/asm/barrier.h
> > > @@ -80,22 +80,6 @@ do {						\
> > >   	___p1;						\
> > >  })
> > > 
> > > -#ifdef CONFIG_PPC64
> > Maybe it should be kept for the simple spinlock case then?
> 
> It is kept; note that simple spinlocks don't use smp_cond_load_relaxed,
> but instead deal with the priorities in arch_spin_lock(), so it will
> spin in low priority until it sees a chance to take the lock, at which
> point it switches back to medium.

Indeed, thanks for the clarification.

Michal


Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-09 Thread Davidlohr Bueso

On Tue, 09 Mar 2021, Michal Suchánek wrote:


On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:

49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-wait pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple

 ^^

spinlocks, queued spinlocks benefit more from being kept at medium
priority, using a cpu_relax() instead, which is a low+medium combo on powerpc.

...


diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do {						\
	___p1;						\
 })

-#ifdef CONFIG_PPC64

Maybe it should be kept for the simple spinlock case then?


It is kept; note that simple spinlocks don't use smp_cond_load_relaxed,
but instead deal with the priorities in arch_spin_lock(), so it will
spin in low priority until it sees a chance to take the lock, at which
point it switches back to medium.
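For reference, the relevant loop is roughly this (a sketch of
arch/powerpc/include/asm/simple_spinlock.h, details elided):

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
	while (1) {
		if (likely(__arch_spin_trylock(lock) == 0))
			break;
		do {
			/* drop SMT priority while the lock is busy */
			HMT_low();
			if (is_shared_processor())
				splpar_spin_yield(lock);
		} while (unlikely(lock->slock != 0));
		/* lock looks free, back to medium to try again */
		HMT_medium();
	}
}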

Thanks,
Davidlohr


Re: [PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-09 Thread Michal Suchánek
On Mon, Mar 08, 2021 at 05:59:50PM -0800, Davidlohr Bueso wrote:
> 49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
> busy-wait pausing with a preferred SMT priority pattern, lowering
> the priority (reducing decode cycles) during the whole loop slowpath.
> 
> However, data shows that while this pattern works well with simple
  ^^
> spinlocks, queued spinlocks benefit more from being kept at medium
> priority, using a cpu_relax() instead, which is a low+medium combo on powerpc.
...
> 
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index aecfde829d5d..7ae29cfb06c0 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -80,22 +80,6 @@ do {						\
>   	___p1;						\
>  })
>  
> -#ifdef CONFIG_PPC64
Maybe it should be kept for the simple spinlock case then?

Thanks

Michal
> -#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
> -	typeof(ptr) __PTR = (ptr);			\
> -	__unqual_scalar_typeof(*ptr) VAL;		\
> -	VAL = READ_ONCE(*__PTR);			\
> -	if (unlikely(!(cond_expr))) {			\
> -		spin_begin();				\
> -		do {					\
> -			VAL = READ_ONCE(*__PTR);	\
> -		} while (!(cond_expr));			\
> -		spin_end();				\
> -	}						\
> -	(typeof(*ptr))VAL;				\
> -})
> -#endif
> -
>  #ifdef CONFIG_PPC_BOOK3S_64
>  #define NOSPEC_BARRIER_SLOT   nop
>  #elif defined(CONFIG_PPC_FSL_BOOK3E)
> -- 
> 2.26.2
> 


[PATCH 3/3] powerpc/qspinlock: Use generic smp_cond_load_relaxed

2021-03-08 Thread Davidlohr Bueso
49a7d46a06c3 (powerpc: Implement smp_cond_load_relaxed()) added
busy-wait pausing with a preferred SMT priority pattern, lowering
the priority (reducing decode cycles) during the whole loop slowpath.

However, data shows that while this pattern works well with simple
spinlocks, queued spinlocks benefit more from being kept at medium
priority, using a cpu_relax() instead, which is a low+medium combo on powerpc.
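For context, with the arch-specific version removed, powerpc falls back
to the generic definition, which pauses with cpu_relax(). Roughly (a
sketch of include/asm-generic/barrier.h, not verbatim):

#ifndef smp_cond_load_relaxed
#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
	typeof(ptr) __PTR = (ptr);			\
	__unqual_scalar_typeof(*ptr) VAL;		\
	for (;;) {					\
		VAL = READ_ONCE(*__PTR);		\
		if (cond_expr)				\
			break;				\
		cpu_relax();				\
	}						\
	(typeof(*ptr))VAL;				\
})
#endif

where PPC64's cpu_relax() is essentially the low+medium combo:

#define cpu_relax()	do { HMT_low(); HMT_medium(); barrier(); } while (0)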

Data is from three benchmarks on a Power9 9008-22L: 64 CPUs with
2 sockets and 8 threads per core.

1. locktorture.

This is data for the lowest and most artificial/pathological level,
with increasing thread counts pounding on the lock. Metrics are total
ops/minute. Despite some small hits in the 4-8 range, scenarios are
either neutral or favorable to this patch.

+=========+==========+==========+=======+
| # tasks | vanilla  | dirty    | %diff |
+=========+==========+==========+=======+
| 2       | 46718565 | 48751350 | 4.35  |
+---------+----------+----------+-------+
| 4       | 51740198 | 50369082 | -2.65 |
+---------+----------+----------+-------+
| 8       | 63756510 | 62568821 | -1.86 |
+---------+----------+----------+-------+
| 16      | 67824531 | 70966546 | 4.63  |
+---------+----------+----------+-------+
| 32      | 53843519 | 61155508 | 13.58 |
+---------+----------+----------+-------+
| 64      | 53005778 | 53104412 | 0.18  |
+---------+----------+----------+-------+
| 128     | 53331980 | 54606910 | 2.39  |
+=========+==========+==========+=======+

2. sockperf (tcp throughput)

Here a client does one-way throughput tests to a localhost server, with
increasing message sizes, contending on the sk_lock. This patch brings
the performance of the qspinlock back to par with that of the simple spinlock:

                 simple-spinlock           vanilla               dirty
Hmean 14      73.50 (   0.00%)     54.44 * -25.93%*    73.45 *  -0.07%*
Hmean 100    654.47 (   0.00%)    385.61 * -41.08%*   771.43 *  17.87%*
Hmean 300   2719.39 (   0.00%)   2181.67 * -19.77%*  2666.50 *  -1.94%*
Hmean 500   4400.59 (   0.00%)   3390.77 * -22.95%*  4322.14 *  -1.78%*
Hmean 850   6726.21 (   0.00%)   5264.03 * -21.74%*  6863.12 *   2.04%*

3. dbench (tmpfs)

Configured to run with up to ncpus*8 clients, it reports both latency and
throughput metrics. For the latency, with the exception of the 64-client
case, there is really nothing of note:

                       vanilla                dirty
Amean latency-1     1.67 (   0.00%)     1.67 *   0.09%*
Amean latency-2     2.15 (   0.00%)     2.08 *   3.36%*
Amean latency-4     2.50 (   0.00%)     2.56 *  -2.27%*
Amean latency-8     2.49 (   0.00%)     2.48 *   0.31%*
Amean latency-16    2.69 (   0.00%)     2.72 *  -1.37%*
Amean latency-32    2.96 (   0.00%)     3.04 *  -2.60%*
Amean latency-64    7.78 (   0.00%)     8.17 *  -5.07%*
Amean latency-512 186.91 (   0.00%)   186.41 *   0.27%*

For the dbench4 Throughput (misleading but traditional) there's a small
but rather consistent improvement:

                      vanilla                dirty
Hmean 1      849.13 (   0.00%)    851.51 *   0.28%*
Hmean 2     1664.03 (   0.00%)   1663.94 *  -0.01%*
Hmean 4     3073.70 (   0.00%)   3104.29 *   1.00%*
Hmean 8     5624.02 (   0.00%)   5694.16 *   1.25%*
Hmean 16    9169.49 (   0.00%)   9324.43 *   1.69%*
Hmean 32   11969.37 (   0.00%)  12127.09 *   1.32%*
Hmean 64   15021.12 (   0.00%)  15243.14 *   1.48%*
Hmean 512  14891.27 (   0.00%)  15162.11 *   1.82%*

Measuring the dbench4 per-VFS operation latency shows only very minor
differences within the noise level, in the 0-1% range.

Signed-off-by: Davidlohr Bueso 
---
 arch/powerpc/include/asm/barrier.h | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index aecfde829d5d..7ae29cfb06c0 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -80,22 +80,6 @@ do {						\
	___p1;						\
 })
 
-#ifdef CONFIG_PPC64
-#define smp_cond_load_relaxed(ptr, cond_expr) ({	\
-	typeof(ptr) __PTR = (ptr);			\
-	__unqual_scalar_typeof(*ptr) VAL;		\
-	VAL = READ_ONCE(*__PTR);			\
-	if (unlikely(!(cond_expr))) {			\
-		spin_begin();				\
-		do {					\
-			VAL = READ_ONCE(*__PTR);	\
-		} while (!(cond_expr));			\
-		spin_end();