Re: [PATCH] xen/smp: Speed up on_selected_cpus()
On 07.02.2022 18:06, Andrew Cooper wrote: > On 07/02/2022 08:11, Jan Beulich wrote: >> (And of >> course I still have that conversion to POPCNT alternatives patching pending, >> where Roger did ask for some re-work in reply to v2, but where it has >> remained unclear whether investing time into that wouldn't be in vein, >> considering some of your replies on v1. Thus would have further shrunk the >> difference, without me meaning to say the change here isn't a good one.) > > There is a perfectly clear and simple way forward. It's the one which > doesn't fight the optimiser and actively regress the code generation in > the calling functions, and add an unreasonable quantity technical debt > into the marginal paths. > > I will ack a version where you're not adding complexity for negative gains. Thanks, at least some form of a reply. I'm afraid I can't really translate this to which parts of the change you'd be okay with and which parts need changing. I didn't think I would "fight the optimiser and actively regress the code generation in the calling functions" in v2 (this may have been different in v1, but I haven't gone back to check; I wonder though whether you're mixing this with e.g. the BMI patching series I've long given up on). Jan
Re: [PATCH] xen/smp: Speed up on_selected_cpus()
On 06/02/2022 19:40, Julien Grall wrote: > Hi, > > On 04/02/2022 20:31, Andrew Cooper wrote: >> cpumask_weight() is a horribly expensive way to find if no bits are >> set, made >> worse by the fact that the calculation is performed with the global >> call_lock >> held. > > I looked at the archive because I was wondering why we were using > cpumask_weight here. It looks like this was a left-over of the rework > in ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to > not race updates". That change shuffled the code, but didn't introduce the problem. I'm pretty sure it was 433f14699d48 which dropped the !=0 user of nr_cpus. Talking of, there is more efficiency to be gained by reworking the second cpumask_empty() call to not restart from 0 on failure, because that removes useless reads. > >> >> Switch to using cpumask_empty() instead, which will short circuit as >> soon as >> it find any set bit in the cpumask. >> >> Signed-off-by: Andrew Cooper > > Reviewed-by: Julien Grall Thanks. ~Andrew
Re: [PATCH] xen/smp: Speed up on_selected_cpus()
On 07/02/2022 08:11, Jan Beulich wrote: > On 04.02.2022 21:31, Andrew Cooper wrote: >> cpumask_weight() is a horribly expensive way to find if no bits are set, made >> worse by the fact that the calculation is performed with the global call_lock >> held. >> >> Switch to using cpumask_empty() instead, which will short circuit as soon as >> it find any set bit in the cpumask. >> >> Signed-off-by: Andrew Cooper > May I suggest to drop "horribly"? How expensive one is compared to the other > depends on the number of CPUs actually enumerated in the system. In absolute terms perhaps, but they both scale as O(nr_cpus). Hamming weight has a far larger constant. > (And of > course I still have that conversion to POPCNT alternatives patching pending, > where Roger did ask for some re-work in reply to v2, but where it has > remained unclear whether investing time into that wouldn't be in vein, > considering some of your replies on v1. Thus would have further shrunk the > difference, without me meaning to say the change here isn't a good one.) There is a perfectly clear and simple way forward. It's the one which doesn't fight the optimiser and actively regress the code generation in the calling functions, and add an unreasonable quantity technical debt into the marginal paths. I will ack a version where you're not adding complexity for negative gains. ~Andrew
Re: [PATCH] xen/smp: Speed up on_selected_cpus()
On 04.02.2022 21:31, Andrew Cooper wrote: > cpumask_weight() is a horribly expensive way to find if no bits are set, made > worse by the fact that the calculation is performed with the global call_lock > held. > > Switch to using cpumask_empty() instead, which will short circuit as soon as > it find any set bit in the cpumask. > > Signed-off-by: Andrew Cooper May I suggest to drop "horribly"? How expensive one is compared to the other depends on the number of CPUs actually enumerated in the system. (And of course I still have that conversion to POPCNT alternatives patching pending, where Roger did ask for some re-work in reply to v2, but where it has remained unclear whether investing time into that wouldn't be in vein, considering some of your replies on v1. Thus would have further shrunk the difference, without me meaning to say the change here isn't a good one.) Reviewed-by: Jan Beulich Jan
Re: [PATCH] xen/smp: Speed up on_selected_cpus()
Hi, On 04/02/2022 20:31, Andrew Cooper wrote: cpumask_weight() is a horribly expensive way to find if no bits are set, made worse by the fact that the calculation is performed with the global call_lock held. I looked at the archive because I was wondering why we were using cpumask_weight here. It looks like this was a left-over of the rework in ac3fc35d919c "x86: Fix flush_area_mask() and on_selected_cpus() to not race updates". Switch to using cpumask_empty() instead, which will short circuit as soon as it find any set bit in the cpumask. Signed-off-by: Andrew Cooper Reviewed-by: Julien Grall --- CC: Jan Beulich CC: Roger Pau Monné CC: Wei Liu CC: Juergen Gross CC: Stefano Stabellini CC: Julien Grall CC: Volodymyr Babchuk CC: Bertrand Marquis I have not done performance testing, but I would be surprised if this cannot be measured on a busy or large box. --- xen/common/smp.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/xen/common/smp.c b/xen/common/smp.c index 781bcf2c246c..a011f541f1ea 100644 --- a/xen/common/smp.c +++ b/xen/common/smp.c @@ -50,8 +50,6 @@ void on_selected_cpus( void *info, int wait) { -unsigned int nr_cpus; - ASSERT(local_irq_is_enabled()); ASSERT(cpumask_subset(selected, &cpu_online_map)); @@ -59,8 +57,7 @@ void on_selected_cpus( cpumask_copy(&call_data.selected, selected); -nr_cpus = cpumask_weight(&call_data.selected); -if ( nr_cpus == 0 ) +if ( cpumask_empty(&call_data.selected) ) goto out; call_data.func = func; -- Julien Grall
[PATCH] xen/smp: Speed up on_selected_cpus()
cpumask_weight() is a horribly expensive way to find if no bits are set, made worse by the fact that the calculation is performed with the global call_lock held. Switch to using cpumask_empty() instead, which will short circuit as soon as it find any set bit in the cpumask. Signed-off-by: Andrew Cooper --- CC: Jan Beulich CC: Roger Pau Monné CC: Wei Liu CC: Juergen Gross CC: Stefano Stabellini CC: Julien Grall CC: Volodymyr Babchuk CC: Bertrand Marquis I have not done performance testing, but I would be surprised if this cannot be measured on a busy or large box. --- xen/common/smp.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/xen/common/smp.c b/xen/common/smp.c index 781bcf2c246c..a011f541f1ea 100644 --- a/xen/common/smp.c +++ b/xen/common/smp.c @@ -50,8 +50,6 @@ void on_selected_cpus( void *info, int wait) { -unsigned int nr_cpus; - ASSERT(local_irq_is_enabled()); ASSERT(cpumask_subset(selected, &cpu_online_map)); @@ -59,8 +57,7 @@ void on_selected_cpus( cpumask_copy(&call_data.selected, selected); -nr_cpus = cpumask_weight(&call_data.selected); -if ( nr_cpus == 0 ) +if ( cpumask_empty(&call_data.selected) ) goto out; call_data.func = func; -- 2.11.0