Re: [PATCH] xen/smp: Speed up on_selected_cpus()

2022-02-08 Thread Jan Beulich
On 07.02.2022 18:06, Andrew Cooper wrote:
> On 07/02/2022 08:11, Jan Beulich wrote:
>>  (And of
>> course I still have that conversion to POPCNT alternatives patching pending,
>> where Roger did ask for some re-work in reply to v2, but where it has
>> remained unclear whether investing time into that wouldn't be in vein,
>> considering some of your replies on v1. Thus would have further shrunk the
>> difference, without me meaning to say the change here isn't a good one.)
> 
> There is a perfectly clear and simple way forward.  It's the one which
> doesn't fight the optimiser and actively regress the code generation in
> the calling functions, and add an unreasonable quantity technical debt
> into the marginal paths.
> 
> I will ack a version where you're not adding complexity for negative gains.

Thanks, at least some form of a reply. I'm afraid I can't really translate
this to which parts of the change you'd be okay with and which parts need
changing. I didn't think I would "fight the optimiser and actively regress
the code generation in the calling functions" in v2 (this may have been
different in v1, but I haven't gone back to check; I wonder though whether
you're mixing this with e.g. the BMI patching series I've long given up on).

Jan




Re: [PATCH] xen/smp: Speed up on_selected_cpus()

2022-02-07 Thread Andrew Cooper
On 06/02/2022 19:40, Julien Grall wrote:
> Hi,
>
> On 04/02/2022 20:31, Andrew Cooper wrote:
>> cpumask_weight() is a horribly expensive way to find if no bits are
>> set, made
>> worse by the fact that the calculation is performed with the global
>> call_lock
>> held.
>
> I looked at the archive because I was wondering why we were using
> cpumask_weight here. It looks like this was a left-over of the rework
> in ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to
> not race updates".

That change shuffled the code, but didn't introduce the problem.

I'm pretty sure it was 433f14699d48 which dropped the !=0 user of nr_cpus.


Talking of, there is more efficiency to be gained by reworking the
second cpumask_empty() call to not restart from 0 on failure, because
that removes useless reads.


>
>>
>> Switch to using cpumask_empty() instead, which will short circuit as
>> soon as
>> it find any set bit in the cpumask.
>>
>> Signed-off-by: Andrew Cooper 
>
> Reviewed-by: Julien Grall 

Thanks.

~Andrew


Re: [PATCH] xen/smp: Speed up on_selected_cpus()

2022-02-07 Thread Andrew Cooper
On 07/02/2022 08:11, Jan Beulich wrote:
> On 04.02.2022 21:31, Andrew Cooper wrote:
>> cpumask_weight() is a horribly expensive way to find if no bits are set, made
>> worse by the fact that the calculation is performed with the global call_lock
>> held.
>>
>> Switch to using cpumask_empty() instead, which will short circuit as soon as
>> it find any set bit in the cpumask.
>>
>> Signed-off-by: Andrew Cooper 
> May I suggest to drop "horribly"? How expensive one is compared to the other
> depends on the number of CPUs actually enumerated in the system.

In absolute terms perhaps, but they both scale as O(nr_cpus).  Hamming
weight has a far larger constant.

>  (And of
> course I still have that conversion to POPCNT alternatives patching pending,
> where Roger did ask for some re-work in reply to v2, but where it has
> remained unclear whether investing time into that wouldn't be in vein,
> considering some of your replies on v1. Thus would have further shrunk the
> difference, without me meaning to say the change here isn't a good one.)

There is a perfectly clear and simple way forward.  It's the one which
doesn't fight the optimiser and actively regress the code generation in
the calling functions, and add an unreasonable quantity technical debt
into the marginal paths.

I will ack a version where you're not adding complexity for negative gains.

~Andrew


Re: [PATCH] xen/smp: Speed up on_selected_cpus()

2022-02-07 Thread Jan Beulich
On 04.02.2022 21:31, Andrew Cooper wrote:
> cpumask_weight() is a horribly expensive way to find if no bits are set, made
> worse by the fact that the calculation is performed with the global call_lock
> held.
> 
> Switch to using cpumask_empty() instead, which will short circuit as soon as
> it find any set bit in the cpumask.
> 
> Signed-off-by: Andrew Cooper 

May I suggest to drop "horribly"? How expensive one is compared to the other
depends on the number of CPUs actually enumerated in the system. (And of
course I still have that conversion to POPCNT alternatives patching pending,
where Roger did ask for some re-work in reply to v2, but where it has
remained unclear whether investing time into that wouldn't be in vein,
considering some of your replies on v1. Thus would have further shrunk the
difference, without me meaning to say the change here isn't a good one.)

Reviewed-by: Jan Beulich 

Jan




Re: [PATCH] xen/smp: Speed up on_selected_cpus()

2022-02-06 Thread Julien Grall

Hi,

On 04/02/2022 20:31, Andrew Cooper wrote:

cpumask_weight() is a horribly expensive way to find if no bits are set, made
worse by the fact that the calculation is performed with the global call_lock
held.


I looked at the archive because I was wondering why we were using 
cpumask_weight here. It looks like this was a left-over of the rework in 
ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to not 
race updates".




Switch to using cpumask_empty() instead, which will short circuit as soon as
it find any set bit in the cpumask.

Signed-off-by: Andrew Cooper 


Reviewed-by: Julien Grall 


---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 
CC: Juergen Gross 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Volodymyr Babchuk 
CC: Bertrand Marquis 

I have not done performance testing, but I would be surprised if this cannot
be measured on a busy or large box.
---
  xen/common/smp.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/xen/common/smp.c b/xen/common/smp.c
index 781bcf2c246c..a011f541f1ea 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -50,8 +50,6 @@ void on_selected_cpus(
  void *info,
  int wait)
  {
-unsigned int nr_cpus;
-
  ASSERT(local_irq_is_enabled());
  ASSERT(cpumask_subset(selected, &cpu_online_map));
  
@@ -59,8 +57,7 @@ void on_selected_cpus(
  
  cpumask_copy(&call_data.selected, selected);
  
-nr_cpus = cpumask_weight(&call_data.selected);

-if ( nr_cpus == 0 )
+if ( cpumask_empty(&call_data.selected) )
  goto out;
  
  call_data.func = func;


--
Julien Grall



[PATCH] xen/smp: Speed up on_selected_cpus()

2022-02-04 Thread Andrew Cooper
cpumask_weight() is a horribly expensive way to find if no bits are set, made
worse by the fact that the calculation is performed with the global call_lock
held.

Switch to using cpumask_empty() instead, which will short circuit as soon as
it find any set bit in the cpumask.

Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 
CC: Juergen Gross 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Volodymyr Babchuk 
CC: Bertrand Marquis 

I have not done performance testing, but I would be surprised if this cannot
be measured on a busy or large box.
---
 xen/common/smp.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/xen/common/smp.c b/xen/common/smp.c
index 781bcf2c246c..a011f541f1ea 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -50,8 +50,6 @@ void on_selected_cpus(
 void *info,
 int wait)
 {
-unsigned int nr_cpus;
-
 ASSERT(local_irq_is_enabled());
 ASSERT(cpumask_subset(selected, &cpu_online_map));
 
@@ -59,8 +57,7 @@ void on_selected_cpus(
 
 cpumask_copy(&call_data.selected, selected);
 
-nr_cpus = cpumask_weight(&call_data.selected);
-if ( nr_cpus == 0 )
+if ( cpumask_empty(&call_data.selected) )
 goto out;
 
 call_data.func = func;
-- 
2.11.0