> OK, so to really do anything different (from a non-partitioned setup),
> you would need to set sched_load_balance=0 for the root cpuset?

Yup - exactly.  In fact one code fragment in my patch highlights this:

        /* Special case for the 99% of systems with one, full, sched domain */
        if (is_sched_load_balance(&top_cpuset)) {
                ndoms = 1;
                doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
                *doms = top_cpuset.cpus_allowed;
                goto rebuild;
        }

This code says: if the top cpuset is load balanced, you've got one
big fat sched domain covering all (nonisolated) CPUs - end of story.
None of the other 'sched_load_balance' flags matter in this case.

Logically, the above code fragment is not needed.  Without it, the
code would still do the same thing, just wasting more CPU cycles doing
it.

> Suppose you do that to hard partition the machine, what happens to
> newly created tasks like kernel threads or things that aren't in a
> cpuset?

Well ... --every-- task is in a cpuset, always.  Newly created tasks
start in the cpuset of their parent.  Grep for 'the_top_cpuset_hack'
in kernel/cpuset.c to see the lengths to which we go to ensure that
current->cpuset always resolves somewhere.
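
If you want to see that from userspace, here is a little sketch (mine,
not part of the patch; it just assumes CONFIG_CPUSETS and the
/proc/<pid>/cpuset interface) that prints the cpuset path of the
calling task - a task still in the top cpuset shows up as just "/":

        /* Sketch: print the cpuset of the current task via /proc. */
        #include <stdio.h>

        int main(void)
        {
                char buf[256];
                FILE *f = fopen("/proc/self/cpuset", "r");

                if (!f) {
                        perror("/proc/self/cpuset");
                        return 1;
                }
                if (fgets(buf, sizeof(buf), f))
                        printf("this task's cpuset: %s", buf);
                fclose(f);
                return 0;
        }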

The usual case on the big systems that I care about the most is
that we move (almost) every task out of the top cpuset, into smaller
cpusets, because we don't want some random thread intruding on the
CPUs dedicated to a particular job.  The only threads left in the root
cpuset are pinned kernel threads, such as for thread migration, per-cpu
irq handlers and various per-cpu and per-node disk and file flushers
and such.  These threads aren't going anywhere, regardless.  But no
thread that is able to run anywhere is left free to run anywhere:
anything unpinned gets moved into one of the smaller cpusets.
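
The mechanics of moving a task are just writes to the cpuset
filesystem.  A rough sketch (mine, not from the patch; it assumes the
cpuset filesystem is mounted at /dev/cpuset and that a child cpuset,
here called "boot", already exists):

        /*
         * Sketch: move a task out of the top cpuset by writing its pid
         * to a child cpuset's "tasks" file.  "/dev/cpuset/boot" is just
         * an example name.
         */
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <unistd.h>

        static int move_to_cpuset(const char *cpuset, pid_t pid)
        {
                char path[512];
                FILE *f;

                snprintf(path, sizeof(path), "%s/tasks", cpuset);
                f = fopen(path, "w");
                if (!f)
                        return -1;
                fprintf(f, "%d\n", (int)pid);
                return fclose(f);       /* write lands when the stream is flushed */
        }

        int main(int argc, char **argv)
        {
                pid_t pid = argc > 1 ? atoi(argv[1]) : getpid();

                if (move_to_cpuset("/dev/cpuset/boot", pid) != 0) {
                        perror("move_to_cpuset");
                        return 1;
                }
                return 0;
        }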

I will advise my third party batch scheduler developers to turn off
sched_load_balance on their main cpuset, and on any big "holding tank"
cpusets they have that hold only inactive jobs.  This way, on big
systems that are managed to optimize for this, the kernel scheduler
won't waste time load balancing the batch scheduler's big cpusets that
don't need it.  With the 'sched_load_balance' flag defined the way
it is, the batch scheduler won't have to make system-wide decisions
about sched domain partitioning.  It can just make local 'advisory'
markings on particular cpusets that (1) are or might be big, and (2)
don't hold any active tasks that might need load balancing.  The system
will take it from there, providing the finest-granularity sched domain
partitioning that accomplishes that.
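
Such an 'advisory' marking is nothing more than a write of '0' to that
cpuset's sched_load_balance file.  A sketch (mine again, assuming the
cpuset filesystem is mounted at /dev/cpuset and using a made-up
"holding_tank" cpuset name):

        /*
         * Sketch: a batch scheduler clearing sched_load_balance on a
         * big "holding tank" cpuset full of inactive jobs.
         */
        #include <stdio.h>

        static int set_sched_load_balance(const char *cpuset, int on)
        {
                char path[512];
                FILE *f;

                snprintf(path, sizeof(path), "%s/sched_load_balance", cpuset);
                f = fopen(path, "w");
                if (!f)
                        return -1;
                fprintf(f, "%d\n", on);
                return fclose(f);
        }

        int main(void)
        {
                if (set_sched_load_balance("/dev/cpuset/holding_tank", 0) != 0) {
                        perror("sched_load_balance");
                        return 1;
                }
                return 0;
        }

The same write against the top cpuset's own sched_load_balance file is
the "turn it off on the top cpuset" step in the advice below.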

I will advise the system admins of bigger systems to turn off
sched_load_balance on the top cpuset, as part of the above work
routinely done to get all non-pinned tasks out of the top cpuset.

I will advise the real time developers using cpusets to: (1) turn off
sched_load_balance on their real time cpusets, and (2) insist that
the sys admins using their products turn off sched_load_balance on
the top cpuset, to ensure the expected realtime performance is obtained.

Most systems, even medium-sized ones (for some definition of medium,
perhaps dozens of CPUs?), so long as they aren't running realtime on
some CPUs, can just run with the default - one big fat load balanced
sched domain ... unless of course they have some other need not
considered above.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401