Hi Peter,

Thanks for looking at this.

On Thu, Sep 24, 2020 at 12:57:14PM +0200, Peter Zijlstra wrote:
> On Sat, Sep 19, 2020 at 09:42:49AM +0800, Peng Liu wrote:
> > When user changes sched_rt_{runtime, period}_us, then
> > 
> >   sched_rt_handler()
> >     -->     sched_dl_bandwidth_validate()
> >     {
> >             new_bw = global_rt_runtime()/global_rt_period();
> > 
> >             for_each_possible_cpu(cpu) {
> >                     dl_b = dl_bw_of(cpu);
> >                     if (new_bw < dl_b->total_bw)
> >                             ret = -EBUSY;
> >             }
> >     }
> > 
> > Under CONFIG_SMP, dl_bw is per root domain, not per CPU, and
> > dl_b->total_bw is the allocated bandwidth of the whole root domain.
> > We should compare dl_b->total_bw against cpus*new_bw, where 'cpus'
> > is the number of CPUs of the root domain.
> 
> Is there an actual problem there? Spell it out.

I created another root domain (containing 2 CPUs) besides the default
one; the global default rt bandwidth is 95%. Then I launched a DL
process which needs 25% bandwidth and moved it into the new root
domain. So far so good.

Then I tried to change the global rt bandwidth to 20% with:
        echo 200000 > /proc/sys/kernel/sched_rt_runtime_us
but it failed with -EBUSY ("device busy"). Only values greater than
250000 worked.

The new root domain contains two CPUs, so it should be able to provide
2*20% = 40% (> 25%) bandwidth in total. Hence the error is strange.

Finally I found that sched_dl_global_validate() does the validation
wrongly: it doesn't multiply new_bw by the root domain weight (the
number of CPUs in the root domain).
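To put numbers on it (my arithmetic, assuming BW_SHIFT = 20 as used
by to_ratio(), and that the dl25 task runs e.g. 25ms every 100ms; any
runtime/period pair with the same ratio gives the same result):

  new_bw   = to_ratio(1000000, 200000) = (200000 << 20) / 1000000 = 209715  /* 20% */
  total_bw = to_ratio(100000, 25000)   = (25000 << 20) / 100000   = 262144  /* 25% */

  current check: 209715 < 262144                -> -EBUSY
  fixed check:   209715 * 2 = 419430 >= 262144  -> OK

That also explains the 250000 boundary: to_ratio(1000000, 250000) is
exactly 262144, the allocated bandwidth.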

way to reproduce:
cd /sys/fs/cgroup/cpuset/
mkdir cluster
echo 0 > cpuset.sched_load_balance
cd cluster
echo 10-11 > cpuset.cpus
echo 0 > cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo pid-of-dl25 > tasks
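
Here 'dl25' stands for a toy SCHED_DEADLINE task reserving 25%
bandwidth. A minimal sketch of what I mean (the 25ms/100ms numbers are
illustrative; glibc has no sched_setattr() wrapper, so it goes through
syscall() directly):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>		/* SCHED_DEADLINE */

/* Declare the attr layout ourselves, matching the kernel uapi. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr = {
		.size           = sizeof(attr),
		.sched_policy   = SCHED_DEADLINE,
		.sched_runtime  =  25 * 1000 * 1000,	/*  25ms */
		.sched_deadline = 100 * 1000 * 1000,	/* 100ms */
		.sched_period   = 100 * 1000 * 1000,	/* 100ms */
	};

	/* pid 0 == current task */
	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	for (;;)	/* spin so the 25% reservation is actually used */
		;
}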

> > 
> > [1] https://lkml.org/lkml/2010/2/28/119
> 
> Don't use lkml.org links, use lkml.kernel.org/r/$MsgID instead.

OK, I will.

> 
> > [!CONFIG_SMP build error]
> > Reported-by: kernel test robot <l...@intel.com>
> > Signed-off-by: Peng Liu <iwtba...@gmail.com>
> 
> Quite frankly this patch is horrible #ifdef soup.

Frankly speaking, I also hate the ugly #ifdefs, but I had no idea how
to eliminate them until I saw your method. Indeed, it is much cleaner.
I will refine the patch according to your suggestion. Thanks.

> 
> Can't you make something like the below work?
> 
> ---
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 3862a28cd05d..3f309e0f69f5 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -97,6 +97,17 @@ static inline unsigned long dl_bw_capacity(int i)
>               return __dl_bw_capacity(i);
>       }
>  }
> +
> +static inline bool dl_bw_visited(int cpu, u64 gen)
> +{
> > +     struct root_domain *rd = cpu_rq(cpu)->rd;
> +
> +     if (rd->visit_gen == gen)
> +             return true;
> +
> +     rd->visit_gen = gen;
> +     return false;
> +}
>  #else
>  static inline struct dl_bw *dl_bw_of(int i)
>  {
> @@ -112,6 +123,11 @@ static inline unsigned long dl_bw_capacity(int i)
>  {
>       return SCHED_CAPACITY_SCALE;
>  }
> +
> +static inline bool dl_bw_visited(int cpu, u64 gen)
> +{
> +     return false;
> +}
>  #endif
>  
>  static inline
> @@ -2513,31 +2529,35 @@ const struct sched_class dl_sched_class
>  
>  int sched_dl_global_validate(void)
>  {
> +     static u64 generation = 0;
>       u64 runtime = global_rt_runtime();
>       u64 period = global_rt_period();
>       u64 new_bw = to_ratio(period, runtime);
> -     struct dl_bw *dl_b;
> -     int cpu, ret = 0;
> +     int cpu, cpus, ret = 0;
>       unsigned long flags;
> +     struct dl_bw *dl_b;
> +     u64 gen = ++generation;
>  
>       /*
>        * Here we want to check the bandwidth not being set to some
>        * value smaller than the currently allocated bandwidth in
>        * any of the root_domains.
> -      *
> -      * FIXME: Cycling on all the CPUs is overdoing, but simpler than
> -      * cycling on root_domains... Discussion on different/better
> -      * solutions is welcome!
>        */
>       for_each_possible_cpu(cpu) {
> +
>               rcu_read_lock_sched();
> +             if (dl_bw_visited(cpu, gen))
> +                     goto next;
> +
>               dl_b = dl_bw_of(cpu);
> +             cpus = dl_bw_cpus(cpu);
>  
>               raw_spin_lock_irqsave(&dl_b->lock, flags);
> -             if (new_bw < dl_b->total_bw)
> +             if (new_bw * cpus < dl_b->total_bw)
>                       ret = -EBUSY;
>               raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>  
> +     next:
>               rcu_read_unlock_sched();
>  
>               if (ret)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6b0975..7f0947db6e2c 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -801,6 +801,8 @@ struct root_domain {
>       struct dl_bw            dl_bw;
>       struct cpudl            cpudl;
>  
> +     u64                     visit_gen;
> +
>  #ifdef HAVE_RT_PUSH_IPI
>       /*
>        * For IPI pull requests, loop across the rto_mask.

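If I read the sketch right, the generation counter makes the
for_each_possible_cpu() loop effectively per-root-domain: the first
CPU of each rd stamps rd->visit_gen with the current generation, and
its sibling CPUs bail out early. I assume visit_gen can rely on
alloc_rootdomain()'s kzalloc() for its initial zero value; if an
explicit init is preferred, something like this (my assumption, in
init_rootdomain()) would do:

	rd->visit_gen = 0;	/* 0 is never handed out: gen starts from ++generation == 1 */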