On 2026/2/7 4:37, Waiman Long wrote:
> Now that we are going to defer any changes to the HK_TYPE_DOMAIN
> housekeeping cpumasks to either task_work or workqueue, where the
> rebuild_sched_domains() call will be issued, the current
> rebuild_sched_domains_locked() call near the end of the cpuset critical
> section can be removed in such cases.
>
> Currently, a boolean force_sd_rebuild flag is used to decide if the
> rebuild_sched_domains_locked() call needs to be invoked. To allow
> such deferral, we change it to a tri-state sd_rebuild enumeration
> type.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index d26c77a726b2..e224df321e34 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -173,7 +173,11 @@ static bool isolcpus_twork_queued; /* T */
> * Note that update_relax_domain_level() in cpuset-v1.c can still call
> * rebuild_sched_domains_locked() directly without using this flag.
> */
> -static bool force_sd_rebuild; /* RWCS */
> +static enum {
> + SD_NO_REBUILD = 0,
> + SD_REBUILD,
> + SD_DEFER_REBUILD,
> +} sd_rebuild; /* RWCS */
>
> /*
> * Partition root states:
> @@ -990,7 +994,7 @@ void rebuild_sched_domains_locked(void)
>
> lockdep_assert_cpus_held();
> lockdep_assert_cpuset_lock_held();
> - force_sd_rebuild = false;
> + sd_rebuild = SD_NO_REBUILD;
>
> /* Generate domain masks and attrs */
> ndoms = generate_sched_domains(&doms, &attr);
> @@ -1377,6 +1381,9 @@ static void update_isolation_cpumasks(void)
> else
> isolated_cpus_updating = false;
>
If isolated_hk_cpus is defined, I believe isolated_cpus_updating becomes
redundant.
> + /* Defer rebuild_sched_domains() to task_work or wq */
> + sd_rebuild = SD_DEFER_REBUILD;
> +
There is a potential issue: we defer all domain rebuilds here, including those
triggered by hotplug events which may change the isolation state.
The problem is that functions like cpuset_cpu_active(), which rely on the
scheduler domains being up to date, will also be delayed. Is that okay?
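
To illustrate the concern, here is a minimal userspace sketch of the
tri-state logic as I read this patch. It is not kernel code: the model_*
names and rebuild_count are made up for illustration, and the order of
calls inside the hotplug path is an assumption on my side.

```c
/* Userspace model of the sd_rebuild tri-state; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>

enum sd_rebuild_state {
	SD_NO_REBUILD = 0,
	SD_REBUILD,
	SD_DEFER_REBUILD,
};

enum sd_rebuild_state sd_rebuild;
int rebuild_count;	/* hypothetical: counts synchronous rebuilds */

/* Models rebuild_sched_domains_locked(): clears the state and rebuilds. */
void model_rebuild_sched_domains(void)
{
	sd_rebuild = SD_NO_REBUILD;
	rebuild_count++;
}

/* Models cpuset_force_rebuild() after the patch. */
void model_cpuset_force_rebuild(void)
{
	if (!sd_rebuild)
		sd_rebuild = SD_REBUILD;
}

/* Models the new assignment in update_isolation_cpumasks(). */
void model_update_isolation_cpumasks(void)
{
	sd_rebuild = SD_DEFER_REBUILD;
}

/*
 * Models the check at the end of cpuset_handle_hotplug().
 * Returns true if the rebuild ran synchronously.
 */
bool model_hotplug_tail(void)
{
	if (sd_rebuild == SD_REBUILD) {
		model_rebuild_sched_domains();
		return true;
	}
	return false;
}

/*
 * Assumed scenario: a hotplug event that also changes the isolation
 * state. The call order here is a guess, not taken from the patch.
 */
bool model_hotplug_with_isolation_change(void)
{
	model_update_isolation_cpumasks();
	model_cpuset_force_rebuild();
	/* The force request does not override the deferred state. */
	assert(sd_rebuild == SD_DEFER_REBUILD);
	return model_hotplug_tail();
}
```

In this model, model_hotplug_with_isolation_change() returns false, so
the hotplug path returns without a synchronous rebuild and the sched
domains stay stale until the deferred work runs.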
> /*
> * This function can be reached either directly from regular cpuset
> * control file write or via CPU hotplug. In the latter case, it is
> @@ -3011,7 +3018,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
> update_partition_sd_lb(cs, old_prs);
>
> notify_partition_change(cs, old_prs);
> - if (force_sd_rebuild)
> + if (sd_rebuild == SD_REBUILD)
> rebuild_sched_domains_locked();
> free_tmpmasks(&tmpmask);
> return 0;
> @@ -3288,7 +3295,7 @@ ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
> }
>
> free_cpuset(trialcs);
> - if (force_sd_rebuild)
> + if (sd_rebuild == SD_REBUILD)
> rebuild_sched_domains_locked();
> out_unlock:
> cpuset_full_unlock();
> @@ -3771,7 +3778,8 @@ hotplug_update_tasks(struct cpuset *cs,
>
> void cpuset_force_rebuild(void)
> {
> - force_sd_rebuild = true;
> + if (!sd_rebuild)
> + sd_rebuild = SD_REBUILD;
> }
>
> /**
> @@ -3981,7 +3989,7 @@ static void cpuset_handle_hotplug(void)
> }
>
> /* rebuild sched domains if necessary */
> - if (force_sd_rebuild)
> + if (sd_rebuild == SD_REBUILD)
> rebuild_sched_domains_cpuslocked();
>
> free_tmpmasks(ptmp);
--
Best regards,
Ridong