I appreciate your effort and detailed reply, however I'm still seeing a
performance hit in partition_sched_domains(). It seems the issue is due to the
large number of cpus.
I used the suggested method 2: I applied the patch and used the isolcpus
command line switch to kill load-balancing.
That saved a few hundredths of a second per cpu. When I limited the number of
available cpus (via the present and possible cpu masks) to 48, the execution
time of this function dropped dramatically:

With 4K available cpus:
[   48.890000] ## CPU16 LIVE ##: Executing Code...
[   48.910000] partition_sched_domains start
[   49.360000] partition_sched_domains end

With 48 available cpus:
[   36.950000] ## CPU16 LIVE ##: Executing Code...
[   36.950000] partition_sched_domains start
[   36.960000] partition_sched_domains end

Note that I'm currently using kernel version 4.8.0.17.0600.00.0000, in case
this has any influence.
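
For completeness, the 48-cpu run was set up roughly as sketched below: the
boot parameters plus the online loop from your earlier mail. The nr_cpus=
parameter is only my shorthand for capping the possible/present cpu count;
the way I actually limit those masks on this platform differs, so take that
part as illustrative:

  # kernel command line: bring up only cpu 0, kill load balancing, and
  # (illustratively) cap the possible cpus at 48; the 4K run omitted the cap
  maxcpus=1 nr_cpus=48 isolcpus=1-47

  # the application on cpu 0 then onlines the rest:
  for ((i = 1; i < 48; i++)); do
      echo 1 > /sys/devices/system/cpu/cpu$i/online
  done
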
Would appreciate your thoughts.


Thanks
-Ofer



> -----Original Message-----
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> Sent: Tuesday, August 8, 2017 1:16 PM
> To: Ofer Levi(SW) <ofe...@mellanox.com>
> Cc: ru...@rustcorp.com.au; mi...@redhat.com;
> vineet.gup...@synopsys.com; linux-kernel@vger.kernel.org; Tejun Heo
> <t...@kernel.org>
> Subject: Re: hotplug support for arch/arc/plat-eznps platform
> 
> On Tue, Aug 08, 2017 at 06:49:39AM +0000, Ofer Levi(SW) wrote:
> 
> > The idea behind implementing hotplug for this arch is to shorten time
> > to traffic processing.  This way instead of waiting ~5 min for all
> > cpus to boot, the application running on cpu 0 will loop booting other
> > cpus and assigning the traffic processing application to them.
> > Outgoing traffic will build up until all cpus are up and running at full
> > traffic rate.  This method allows traffic processing to start after
> > ~20 sec instead of 5 min.
> 
> Ah, ok. So only online is ever used. Offline is a whole other can of worms.
> 
> > So how can boot be different than hot-plugging them?
> >
> > Please have a look at following code kernel/sched/core.c,
> sched_cpu_activate() :
> >
> >     if (sched_smp_initialized) {
> >             sched_domains_numa_masks_set(cpu);
> >             cpuset_cpu_active();
> >     }
> 
> Ah, cute, I totally missed we did that. Yes that avoids endless domain 
> rebuilds
> on boot.
> 
> > The cpuset_cpu_active call eventually leads to the function in
> > question partition_sched_domains() When cold-booting cpus the
> > sched_smp_initialized flag is false and therefore
> > partition_sched_domains is not executing.
> 
> So you're booting with "maxcpus=1" to only online the one. And then you
> want to online the rest once userspace runs.
> 
> There's two possibilities. The one I prefer (but which appears the most
> broken with the current code) is using the cpuset controller.
> 
> 1)
> 
>   Once you're up and running with a single CPU do:
> 
>   $ mkdir /cgroup
>   $ mount none /cgroup -t cgroup -o cpuset
>   $ echo 0 > /cgroup/cpuset.sched_load_balance
>   $ for ((i=1;i<4096;i++))
>     do
>       echo 1 > /sys/devices/system/cpu/cpu$i/online;
>     done
> 
>   And then, if you want load-balancing, you can re-enable it globally,
>   or only on a subset of CPUs.
> 
> 
> 2)
> 
>   The alternative is to use "isolcpus=1-4095" to completely kill
>   load-balancing. This more or less works with the current code,
>   except that it will keep rebuilding the CPU0 sched-domain, which
>   is somewhat pointless (also fixed by the below patch).
> 
>   The reason I don't particularly like this option is that it's boot-time
>   only: you cannot reconfigure your system at runtime, but that might
>   be good enough for you.
> 
> 
> With the attached patch, either option generates (I only have 40 CPUs):
> 
> [   44.305563] CPU0 attaching NULL sched-domain.
> [   51.954872] SMP alternatives: switching to SMP code
> [   51.976923] x86: Booting SMP configuration:
> [   51.981602] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   52.057756] microcode: sig=0x306e4, pf=0x1, revision=0x416
> [   52.064740] microcode: updated to revision 0x428, date = 2014-05-29
> [   52.080854] smpboot: Booting Node 0 Processor 2 APIC 0x4
> [   52.164124] smpboot: Booting Node 0 Processor 3 APIC 0x6
> [   52.244615] smpboot: Booting Node 0 Processor 4 APIC 0x8
> [   52.324564] smpboot: Booting Node 0 Processor 5 APIC 0x10
> [   52.405407] smpboot: Booting Node 0 Processor 6 APIC 0x12
> [   52.485460] smpboot: Booting Node 0 Processor 7 APIC 0x14
> [   52.565333] smpboot: Booting Node 0 Processor 8 APIC 0x16
> [   52.645364] smpboot: Booting Node 0 Processor 9 APIC 0x18
> [   52.725314] smpboot: Booting Node 1 Processor 10 APIC 0x20
> [   52.827517] smpboot: Booting Node 1 Processor 11 APIC 0x22
> [   52.912271] smpboot: Booting Node 1 Processor 12 APIC 0x24
> [   52.996101] smpboot: Booting Node 1 Processor 13 APIC 0x26
> [   53.081239] smpboot: Booting Node 1 Processor 14 APIC 0x28
> [   53.164990] smpboot: Booting Node 1 Processor 15 APIC 0x30
> [   53.250146] smpboot: Booting Node 1 Processor 16 APIC 0x32
> [   53.333894] smpboot: Booting Node 1 Processor 17 APIC 0x34
> [   53.419026] smpboot: Booting Node 1 Processor 18 APIC 0x36
> [   53.502820] smpboot: Booting Node 1 Processor 19 APIC 0x38
> [   53.587938] smpboot: Booting Node 0 Processor 20 APIC 0x1
> [   53.659828] microcode: sig=0x306e4, pf=0x1, revision=0x428
> [   53.674857] smpboot: Booting Node 0 Processor 21 APIC 0x3
> [   53.756346] smpboot: Booting Node 0 Processor 22 APIC 0x5
> [   53.836793] smpboot: Booting Node 0 Processor 23 APIC 0x7
> [   53.917753] smpboot: Booting Node 0 Processor 24 APIC 0x9
> [   53.998717] smpboot: Booting Node 0 Processor 25 APIC 0x11
> [   54.079674] smpboot: Booting Node 0 Processor 26 APIC 0x13
> [   54.160636] smpboot: Booting Node 0 Processor 27 APIC 0x15
> [   54.241592] smpboot: Booting Node 0 Processor 28 APIC 0x17
> [   54.322553] smpboot: Booting Node 0 Processor 29 APIC 0x19
> [   54.403487] smpboot: Booting Node 1 Processor 30 APIC 0x21
> [   54.487676] smpboot: Booting Node 1 Processor 31 APIC 0x23
> [   54.571921] smpboot: Booting Node 1 Processor 32 APIC 0x25
> [   54.656508] smpboot: Booting Node 1 Processor 33 APIC 0x27
> [   54.740835] smpboot: Booting Node 1 Processor 34 APIC 0x29
> [   54.824466] smpboot: Booting Node 1 Processor 35 APIC 0x31
> [   54.908374] smpboot: Booting Node 1 Processor 36 APIC 0x33
> [   54.992322] smpboot: Booting Node 1 Processor 37 APIC 0x35
> [   55.076333] smpboot: Booting Node 1 Processor 38 APIC 0x37
> [   55.160249] smpboot: Booting Node 1 Processor 39 APIC 0x39
> 
> 
> ---
> Subject: sched,cpuset: Avoid spurious/wrong domain rebuilds
> 
> When disabling cpuset.sched_load_balance we expect to be able to online
> CPUs without generating sched_domains. However this is currently
> completely broken.
> 
> What happens is that we generate the sched_domains and then destroy
> them. This is because of the spurious 'default' domain build in
> cpuset_update_active_cpus(). That builds a single machine wide domain and
> then schedules a work to build the 'real' domains. The work then finds there
> are _no_ domains and destroys the lot again.
> 
> Furthermore, if there actually were cpusets, building the machine wide
> domain is actively wrong, because it would allow tasks to 'escape' their
> cpuset. Also I don't think it's needed; the scheduler really should respect the
> active mask.
> 
> Also (this should probably be a separate patch) fix
> partition_sched_domains() to try and preserve the existing machine wide
> domain instead of unconditionally destroying it. We do this by attempting to
> allocate the new single domain; only when that fails do we reuse the
> fallback_doms.
> 
> Cc: Tejun Heo <t...@kernel.org>
> Almost-Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> ---
>  kernel/cgroup/cpuset.c  |  6 ------
>  kernel/sched/topology.c | 15 ++++++++++++---
>  2 files changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index ca8376e5008c..e557cdba2350 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -2342,13 +2342,7 @@ void cpuset_update_active_cpus(void)
>        * We're inside cpu hotplug critical region which usually nests
>        * inside cgroup synchronization.  Bounce actual hotplug processing
>        * to a work item to avoid reverse locking order.
> -      *
> -      * We still need to do partition_sched_domains() synchronously;
> -      * otherwise, the scheduler will get confused and put tasks to the
> -      * dead CPU.  Fall back to the default single domain.
> -      * cpuset_hotplug_workfn() will rebuild it as necessary.
>        */
> -     partition_sched_domains(1, NULL, NULL);
>       schedule_work(&cpuset_hotplug_work);
>  }
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 79895aec281e..1b74b2cc5dba 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1854,7 +1854,17 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>       /* Let the architecture update CPU core mappings: */
>       new_topology = arch_update_cpu_topology();
> 
> -     n = doms_new ? ndoms_new : 0;
> +     if (!doms_new) {
> +             WARN_ON_ONCE(dattr_new);
> +             n = 0;
> +             doms_new = alloc_sched_domains(1);
> +             if (doms_new) {
> +                     n = 1;
> +                     cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
> +             }
> +     } else {
> +             n = ndoms_new;
> +     }
> 
>       /* Destroy deleted domains: */
>       for (i = 0; i < ndoms_cur; i++) {
> @@ -1870,11 +1880,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>       }
> 
>       n = ndoms_cur;
> -     if (doms_new == NULL) {
> +     if (!doms_new) {
>               n = 0;
>               doms_new = &fallback_doms;
>               cpumask_andnot(doms_new[0], cpu_active_mask, cpu_isolated_map);
> -             WARN_ON_ONCE(dattr_new);
>       }
> 
>       /* Build new domains: */
