Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets

2005-04-22 Thread Paul Jackson
Dinakar wrote:
> Ok, Let me begin at the beginning and attempt to define what I am 
> doing here

The statement of requirements and approach helps.  Thank you.

And the comments in the code patch are much easier for me
to understand.  Thanks.

Let me step back and consider where we are here.

I've not been entirely happy with the cpu_exclusive (and mem_exclusive)
properties.  They were easy to code, and they require only looking at
one's siblings and parent, but they don't provide all that people usually
want, which is system-wide exclusivity, because they don't exclude tasks
in one's parent (or more remote ancestor) cpusets from stealing resources.

I take your isolated cpusets as a reasonable attempt to provide what's
really wanted.  I had avoided simple, system-wide exclusivity because
I really wanted cpusets to be hierarchical.  One should be able to
subdivide and manage one subtree of the cpuset hierarchy, oblivious
to what someone else is doing with a disjoint subtree.  Your work shows
how to provide a stronger form of isolation (exclusivity) without
abandoning the hierarchical structure.

There are three directions we could go from here.  I am not yet decided
between them:

 1) Remove cpu and mem exclusive flags - they are of limited use.

 2) Leave code as is.

 3) Extend the exclusive capability to include isolation from parents,
along the lines of your patch.

If I was redoing cpusets from scratch, I might not include the exclusive
feature at all - not sure.  But it's cheap, at least in terms of code,
and of some use to some users.  So I would choose (2) over (1), given
where we are now.  The main cost at present of the exclusive flags is
the cost in understanding - they tend to confuse people at first glance,
due to their somewhat unusual approach.

If we go with (3), then I'd like to consider the overall design of this
a bit more.  Your patch, as is common for patches, attempts to work within
the current framework, minimizing change.  Better to take a step back and
consider what would have been the best design as if the past didn't matter,
then with that clearly in mind, ask how best to get there from here.

I don't think we would have both isolated and exclusive flags, in the
'ideal design.'  The exclusive flags are essentially half (or a third)
of what's needed, and the isolated flags and masks the rest of it.

Essentially, your patch replaces the single set of CPUs in a cpuset
with three, related sets:
 A] the set of all CPUs managed by that cpuset
 B] the set of CPUs allowed to tasks attached to that cpuset
 C] the set of CPUs isolated for the dedicated use of some descendent

Sets [B] and [C] form a partition of [A] -- their intersection is empty,
and their union is [A].
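
For a concrete illustration (using numbers from Dinakar's example further
down this thread, so nothing here is new), the top/others cpuset after
cint is isolated has:

 [A] cpus          2-7
 [B] cpus_allowed  4-7   (usable by tasks attached to top/others)
 [C] isolated_map  2-3   (dedicated to top/others/cint)

and indeed [B] and [C] are disjoint, with union equal to [A].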

Your current presentation of these sets of CPUs shows set [B] in the
cpus file, followed by set [C] in brackets, if I am recalling correctly.
This format changes the format of the current cpus_allowed file, and it
violates the preference for a single value or vector per file.  I would
like to consider alternatives.
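
One alternative, sketched here only as a possibility, would keep the cpus
file a plain list and expose the isolated CPUs via a separate read-only
file (much like the isolated_map your patch already maintains internally):

 $ cat cpus
 0-3
 $ cat isolated_map
 4-7

That way each file holds a single vector, and the existing cpus format is
unchanged.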

Your code automatically updates [C] if the child cpuset adds or removes
CPUs from those it manages in isolation (though I am not sure that your
code manages this change all the way back up the hierarchy to the top
cpuset, and I am wondering if perhaps your code should be doing this, as
noted in my detailed comments on your patch earlier today.)

I'd be tempted, if taking this approach (3), to consider a couple of
alternatives.

As I spelled out a few days ago, one could mark some cpusets that form a
partition of the system's CPUs, for the purposes of establishing isolated
scheduler domains, without requiring the above three related sets per
cpuset instead of one.  I am still unsure how much of your motivation is
the need to make the scheduler more efficient by establishing useful
isolated sched domains, and how much is the need to keep the usage of
CPUs by various jobs isolated, even from tasks attached to parent cpusets.

One can obtain the job isolation just in user code - if you don't want a
task to use a parent cpuset's access to your isolated cpuset, then simply
don't attach any tasks to the parent cpusets.  I do not understand yet how
strong your requirement is to have the _kernel_ enforce that there are
no tasks in a parent cpuset which could intrude on the non-isolated
resources of a child.  I provide (non open source) user level tools to
my users which enable them to conveniently ensure that there are no such
unwanted tasks, so they don't have a problem with a parent cpuset's CPUs
overlapping a cpuset that they are using for an isolated job.  Perhaps I
could persuade my employer that it would be appropriate to open source
these tools.
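
A minimal sketch of the kind of check such a tool performs (the paths here
are hypothetical; the per-cpuset 'tasks' file lists the pids attached to
that cpuset):

 # warn if any tasks are still attached to the parent cpuset
 if [ -n "$(cat /dev/cpuset/others/tasks)" ]; then
     echo "warning: tasks still attached to /others" >&2
 fi

A real tool would also walk any other cpusets whose CPUs overlap those of
the isolated job.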

In any case, going with (3) would result in _one_ attribute, not two (both
exclusive and isolated, with overlapping semantics, which is confusing).

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul 

Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets

2005-04-21 Thread Dinakar Guniguntala
On Wed, Apr 20, 2005 at 12:09:46PM -0700, Paul Jackson wrote:
> Earlier, I wrote to Dinakar:
> > What are your invariants, and how can you assure yourself and us
> > that your code preserves these invariants?

Ok, Let me begin at the beginning and attempt to define what I am 
doing here

1. I need a method to isolate a random set of cpus in such a way that
   only the set of processes that are specifically assigned can
   make use of these CPUs
2. I need to ensure that the sched load balance code does not pull
   any tasks other than the assigned ones onto these cpus
3. I need to be able to create multiple such groupings of cpus
   that are disjoint from the rest and run only specified tasks
4. I need a user interface to specify which random set of cpus
   form such a grouping of disjoint cpus
5. I need to be able to dynamically create and destroy these
   groupings of disjoint cpus
6. I need to be able to add/remove cpus to/from this grouping


Now if you try to fit these requirements onto cpusets, keeping in mind
that it already has a user interface and some of the framework
required to create disjoint groupings of cpus:

1. An exclusive cpuset ensures that the cpus it has are disjoint from
   all other cpusets except its parent and children
2. So now I need a way to disassociate the cpus of an exclusive
   cpuset from its parent, so that this set of cpus are truly
   disjoint from the rest of the system.
3. After I have done (2) above, I now need to build two sets of sched 
   domains corresponding to the cpus of this exclusive cpuset and the 
   remaining cpus of its parent
4. Ensure that the current rules of non-isolated cpusets are all
   preserved such that if this feature is not used, all other features
   work as before

This is exactly what I have tried to do.

1. Maintain a flag to indicate whether a cpuset is isolated
2. Maintain an isolated_map for every cpuset. This contains a cache of 
   all cpus associated with isolated children
3. To isolate a cpuset x, x has to be an exclusive cpuset and its
   parent has to be an isolated cpuset
4. On isolating a cpuset by issuing
   /bin/echo 1 > cpu_isolated
   
   It ensures that conditions in (3) are satisfied and then removes the 
   cpus of the current cpuset from the parent cpus_allowed mask. (It also
   puts the cpus of the current cpuset into the isolated_map of its parent)
   This ensures that only the current cpuset and its children will have
   access to the now isolated cpus.
   It also rebuilds the sched domains into two new domains consisting of
   a. All cpus in the parent->cpus_allowed
   b. All cpus in current->cpus_allowed
5. Similarly, on setting isolated off on an isolated cpuset (or on doing
   an rmdir on an isolated cpuset), it adds all of the cpus of the current 
   cpuset into its parent cpuset's cpus_allowed mask and removes them from 
   its parent's isolated_map

   This ensures that all of the cpus in the current cpuset are now
   visible to the parent cpuset.

   It now rebuilds only one sched domain consisting of all of the cpus
   in its parent's cpus_allowed mask.
6. You can also modify the cpus present in an isolated cpuset x provided
   that x does not have any children that are also isolated.
7. On adding or removing cpus from an isolated cpuset that does not
   have any isolated children, it reworks the parent cpuset's
   cpus_allowed and isolated_map masks and rebuilds the sched domains
   appropriately
8. Since the function update_cpu_domains, which does all of the
   above updates to the parent cpuset's masks, is always called with
   cpuset_sem held, it ensures that all these changes are atomic.
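
As a quick sketch of steps 4 and 5 above (using the proposed cpu_isolated
file; writing 0 is assumed here to be how isolation is turned back off):

   cd /top/others/cint
   /bin/echo 1 > cpu_isolated   # cpus 2-3 move from others' cpus_allowed
                                # to others' isolated_map; domains rebuilt
   /bin/echo 0 > cpu_isolated   # cpus 2-3 move back into others'
                                # cpus_allowed; a single domain is rebuilt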


> > He removes cpus 4-5 from batch and adds them to cint
> 
> Could you spell out the exact steps the user would take, for this part
> of your example?  What does the user do, what does the kernel do in
> response, and what state the cpusets end up in, after each action of the
> user?


   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          0              0-7
 top/lowlat           0-1    1          0-1            0
 top/others           2-7    1          4-7            2-3
 top/others/cint      2-3    1          2-3            0
 top/others/batch     4-7    0          4-7            0

At this point to remove cpus 4-5 from batch and add them to cint, the admin
would do the following steps

# Remove cpus 4-5 from batch
# batch is not an isolated cpuset and hence this step 
# has no other implications
/bin/echo 6-7 > /top/others/batch/cpus 

   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          0              0-7
 top/lowlat           0-1    1          0-1            0
 top/others           2-7    1          4-7            2-3
 top/others/cint      2-3    1          2-3            0

Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets

2005-04-20 Thread Paul Jackson
Earlier, I wrote to Dinakar:
> What are your invariants, and how can you assure yourself and us
> that your code preserves these invariants?

I repeat that question.

===

On my first reading of your example, I see the following.

It is sinking into my dense skull more than it had before that your
patch changes the meaning of the cpuset field 'cpus_allowed', to only
include the cpus not in isolated children.  However there are other uses
of the 'cpus_allowed' field in the cpuset code that are not changed, and
comments and documentation describing this field that are not changed. 
I suspect this is an incomplete change.

You don't actually state it that I noticed, but the main point of your
example seems to be that you support incrementally moving individual
cpus between cpusets, without the constraint that both cpusets be in the
same subset of the partition (the same isolation group).  So you can
move a cpu in and out of an isolated group without tearing the
group down first, only to rebuild it after.

To do this, you've added new semantics to some of the operations that
write the 'cpus' special file of a cpuset, if and only if that cpuset is
marked isolated, which involves changing some other masks.  These new
semantics are something along the lines of "adding a cpu here implies
removing it from there".  This presumably allows you to move cpus into,
out of, or between isolated cpusets, while preserving the essential
properties of a partition - that it is a disjoint covering.
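(For instance, with the numbers from your example, the partition of sched
domains goes from [0-1] [2-3] [4-7] to [0-1] [2-5] [6-7] - the subsets
stay pairwise disjoint and together still cover cpus 0-7.)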

> He removes cpus 4-5 from batch and adds them to cint

Could you spell out the exact steps the user would take, for this part
of your example?  What does the user do, what does the kernel do in
response, and what state the cpusets end up in, after each action of the
user?

===

So far, to be honest, I am finding your patch to be rather frustrating.

Perhaps the essential reason is this.  The interface that cpusets
presents in the cpuset file system, mounted at /dev/cpuset, is not in my
intentions primarily a human interface.  It is primarily a programmatic
interface.

As such, there is a high premium on clarity of design, consistency of
behaviour and absence of side effects.  Each operation should do one
thing, clearly defined, changing only what is operated on, preserving
clearly spelled out invariants.

If it takes three steps instead of one to accomplish a typical task,
that's fine.  The programs that layer on top of /dev/cpuset don't mind
doing three things to get one thing done.  But such programs are a pain
in the backside to program correctly if the effects of each operation
are not clearly defined, not focused on the obvious object being
operated on, or not precisely consistent with an overriding model.

This patch seems to add side effects and change the meanings of
things, doing so with only the most minimal mention in the description,
without clearly and consistently spelling out the new mental model, and
without uniformly changing all uses, comments and documentation to fit
the new model.

This cpuset facility is also a less commonly used kernel facility, and
changes to cpusets, outside of a few key hooks in the scheduler and
allocator, are not performance critical.  This means that there is a
premium in keeping the kernel code minimal, leaving as many details as
practical to userland.  This patch seems to increase the kernel text
size, for an ia64 SN2 build using gcc 3.2.3 of a 2.6.12-rc1-mm4 tree I
had at hand, _just_ for the cpuset.c changes, from 23071 bytes to 28999.

  That's over a 25% increase in the kernel text size of the file
  kernel/cpuset.o, just for this feature.  That's too much, in my view.
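  (That is, (28999 - 23071) / 23071 = 5928 / 23071, or roughly a 25.7%
  increase, using the sizes quoted above.)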

I don't know yet if the ability to move cpus between isolated sched
domains without tearing them down and rebuilding them, is a critical
feature for you or not.  You have not been clear on what are the
essential requirements of this feature.  I don't even know for sure yet
that this is the one key feature in your view that separates your
proposal from the variations I explored.

But if this is for you the critical feature that your proposal has, and
mine lack, then I'd like to see if there is a way to do it without
implicit side effects, without messing with the semantics of what's
there now, and with significantly fewer bytes of kernel text space.  And
I'd like to see if we can have uniform and precisely spelled out
semantics, in the code, comments and documentation, with any changes to
the current semantics made everywhere, uniformly.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets

2005-04-20 Thread Dinakar Guniguntala
On Tue, Apr 19, 2005 at 10:23:48AM -0700, Paul Jackson wrote:
> 
> How does this play out in your interface?  Are you convinced that
> your invariants are preserved at all times, to all users?  Can
> you present a convincing argument to others that this is so?


Let me give an example of how the current version of isolated cpusets can
be used and hopefully clarify my approach.


Consider a system with 8 cpus that needs to run a mix of workloads.
One set of applications has low latency requirements and another
set has a mixed workload. The administrator decides to allot
2 cpus to the low latency application and the rest to other apps.
To do this, he creates two cpusets
(All cpusets are considered to be exclusive for this discussion)

   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          0-7            0
 top/lowlat           0-1    0          0-1            0
 top/others           2-7    0          2-7            0
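
A sketch of the commands assumed to produce this state, using the existing
cpus and cpu_exclusive files (/mount-point being wherever the cpuset
filesystem is mounted):

cd /mount-point
mkdir lowlat others
/bin/echo 0-1 > lowlat/cpus
/bin/echo 1 > lowlat/cpu_exclusive
/bin/echo 2-7 > others/cpus
/bin/echo 1 > others/cpu_exclusive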

He now wants to partition the system along these lines as he wants
to isolate lowlat from the rest of the system to ensure that
a. No tasks from the parent cpuset (top_cpuset in this case)
   use these cpus
b. load balance does not run across all cpus 0-7

He does this by

cd /mount-point/lowlat
/bin/echo 1 > cpu_isolated

Internally it takes the cpuset_sem, does some sanity checks and ensures
that these cpus are not visible to any other cpuset including its parent
(by removing these cpus from its parent's cpus_allowed mask and adding
them to its parent's isolated_map) and then calls sched code to partition
the system as

[0-1] [2-7]

   The internal state of the data structures is as follows

   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          2-7            0-1
 top/lowlat           0-1    1          0-1            0
 top/others           2-7    0          2-7            0

---


The administrator now wants to further partition the "others" cpuset into
a cpu intensive application and a batch one

   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          2-7            0-1
 top/lowlat           0-1    1          0-1            0
 top/others           2-7    0          2-7            0
 top/others/cint      2-3    0          2-3            0
 top/others/batch     4-7    0          4-7            0


If now the administrator wants to isolate the cint cpuset...

cd /mount-point/others
/bin/echo 1 > cpu_isolated

(At this point no new sched domains are built
 as there exists a sched domain which exactly
 matches the cpus in the "others" cpuset.)

cd /mount-point/others/cint
/bin/echo 1 > cpu_isolated

At this point cpus from the "others" cpuset are also taken away from its
parent cpus_allowed mask and put into the parent's isolated_map. This means
that the parent cpus_allowed mask is empty.  This would now result in
partitioning the "others" cpuset, building two new sched domains as follows

[2-3] [4-7]

Notice that cpus 0-1, having already been isolated, are not affected
by this operation

   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          0              0-7
 top/lowlat           0-1    1          0-1            0
 top/others           2-7    1          4-7            2-3
 top/others/cint      2-3    1          2-3            0
 top/others/batch     4-7    0          4-7            0

---

The admin now wants to run more applications in the cint cpuset
and decides to borrow a couple of cpus from the batch cpuset
He removes cpus 4-5 from batch and adds them to cint

   cpuset             cpus   isolated   cpus_allowed   isolated_map
 top                  0-7    1          0              0-7
 top/lowlat           0-1    1          0-1            0
 top/others           2-7    1          6-7            2-5
 top/others/cint      2-5    1          2-5            0
 top/others/batch     6-7    0          6-7            0

As cint is already isolated, adding cpus causes the sched domains to be
rebuilt for all cpus covered by its cpus_allowed and its parent's
cpus_allowed, so the new sched domains will look as follows

[2-5] [6-7]

cpus 0-1 are of course still not affected

Similarly the admin can remove cpus from cint, which will
result in the domains being rebuilt to what was before

[2-3] [4-7]

---


Hope this clears up my approach. Also note that 

Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets

2005-04-19 Thread Paul Jackson
Dinakar wrote:
> I was hoping that by the time we are done with this, we would
> be able to completely get rid of the isolcpus= option.

I won't miss it.  Though, since it's in the main line kernel,
do you need to mark it deprecated for a while first?

> For that
> ofcourse we need to be able build domains that dont run
> load balance

Ah - so that's what these isolcpus are - ones not load balanced?
This was never clear to me.


> The wording [/* Set ... */ ] was from the users point of view
> for what action was being done, guess I'll change that

Ok - at least now I can read and understand the comments, knowing this. 
The other comments in cpuset.c don't follow this convention, of speaking
in the "user's voice", but rather speak in the "responding systems
voice."  Best to remain consistent in this matter.

> It is complicated because it has to handle all of the different
> possible actions that the user can initiate. It can be simplified
> if we have stricter rules of what the user can/cannot do
> w.r.t to isolated cpusets

It is complicated because you are trying to do a complex state change
one step at a time, without a precise statement (at least, not that I
saw) of what the invariants are, and without atomic operations that
preserve the invariants.

> > First, let me verify one thing.  I understand that the _key_
> > purpose of your patch is not so much to isolate cpus, as it
> > is to allow for structuring scheduling domains to align with
> > cpuset boundaries.  I understand real isolated cpus to be ones
> > that don't have a scheduling domain (have only the dummy one),
> > as requested by the "isolcpus=..." boot flag.
> 
> Not really. Isolated cpusets allows you to do a soft-partition
> of the system, and it would make sense to continue to have load
> balancing within these partitions. I would think not having
> load balancing should be one of the options available

Ok ... then is it correct to say that your purpose is to partition
the system's CPUs into subsets, such that for each subset, either
there is a scheduler domain covering exactly the CPUs in that subset,
or none of the CPUs in the subset are in any scheduler domain?

> I must confess that I havent looked at the memory side all that much,
> having more interest in trying to build soft-partitioning of the cpu's

This is an understandable focus of interest.  Just know that one of the
sanity tests I will apply to a solution for CPUs is whether there is a
corresponding solution for Memory Nodes, using much the same principles,
invariants and conventions.

> ok I need to spend more time on you model Paul, but my first
> guess is that it doesn't seem to be very intuitive and seems
> to make it very complex from the users perspective. However as
> I said I need to understand your model a bit more before I
> comment on it

Well ... I can't claim that my approach is simple.  It does have a
clearly defined (well, clear to me ;) mathematical model, with some
invariants that are always preserved in what user space sees, with
atomic operations for changing from one legal state to the next.

The primary invariant is that the sets of CPUs in the cpusets
marked domain_cpu_current form a partition (disjoint covering)
of the CPUs in the system.
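
A sketch of how a userspace check of that invariant might look (the
domain_cpu_current file and the /dev/cpuset paths are as proposed in this
exploration, so this is illustrative only):

# print each cpuset marked domain_cpu_current together with its cpus;
# the printed cpu lists should be disjoint and together cover 0-7
for f in $(find /dev/cpuset -name domain_cpu_current); do
    d=$(dirname $f)
    [ "$(cat $f)" = "1" ] && echo "$d: $(cat $d/cpus)"
done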

What are your invariants, and how can you assure yourself and us
that your code preserves these invariants?

Also, I don't know that the sequence of user operations required
by my interface is that much worse than yours.  Let's take an
example, and compare what the user would have to do.

Let's say we have the following cpusets on our 8 CPU system:

/           # CPUs 0-7
/Alpha      # CPUs 0-3
/Alpha/phi  # CPUs 0-1
/Alpha/chi  # CPUs 2-3
/Beta       # CPUs 4-7

Let's say we currently have three scheduler domains, for three isolated
(in your terms) cpusets: /Alpha/phi, /Alpha/chi and /Beta.

Let's say we want to change the configuration to have just two scheduler
domains (two isolated cpusets): /Alpha and /Beta.

A user of my API would do the operations:

echo 1 > /Alpha/domain_cpu_pending
echo 1 > /Beta/domain_cpu_pending
echo 0 > /Alpha/phi/domain_cpu_pending
echo 0 > /Alpha/chi/domain_cpu_pending
echo 1 > /domain_cpu_rebuild

The domain_cpu_current state would not change until the final write
(echo) above, at which time the cpuset_sem lock would be taken, and the
system would, atomically to all viewing tasks, change from having the
three cpusets /Alpha/phi, /Alpha/chi and /Beta marked with a true
domain_cpu_current, to having the two cpusets /Alpha and /Beta so
marked.

The alternative API, which I didn't explore, could do this in one step
by writing the new list of cpusets defining the partition, doing the
rough equivalent (need nul separators, not space separators) of:

echo /Alpha /Beta > /list_cpu_subdomains

How does this play out in your interface?  Are you convinced that
your invariants are preserved at all times, to all users?  Can
you present a convincing argument to others that this is so?

Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets

2005-04-19 Thread Dinakar Guniguntala
On Mon, Apr 18, 2005 at 10:54:27PM -0700, Paul Jackson wrote:
> Hmmm ... interesting patch.  My reaction to the changes in
> kernel/cpuset.c are complicated:

Thanks Paul for taking time off your vacation to reply to this.
I was expecting to see one of your huge mails but this has
exceeded all my expectations :)

>  * I'd probably ditch the all_cpus() macro, on the
>concern that it obfuscates more than it helps.
>  * The need for _both_ a per-cpuset flag 'CS_CPU_ISOLATED'
>and another per-cpuset mask 'isolated_map' concerns me.
>I guess that the isolated_map is just a cache of the
>set of CPUs isolated in child cpusets, not an independently
>settable mask, but it needs to be clearly marked as such
>if so.

Currently the isolated_map is read-only as you have guessed.
I did think of the user adding cpus to this map from the 
cpus_allowed mask but thought the current approach made more sense

>  * Some code lines go past column 80.

I need to set my vi to wrap past 80...

>  * The name 'isolated'  probably won't work.  There is already
>a boottime option "isolcpus=..." for 'isolated' cpus which
>is (I think ?) rather different.  Perhaps a better name will
>fall out of the conceptual discussion, below.

I was hoping that by the time we are done with this, we would
be able to completely get rid of the isolcpus= option. For that
of course we need to be able to build domains that don't run
load balance

>  * The change to the output format of the special cpuset file
>'cpus', to look like '0-3[4-7]' bothers me in a couple of
>ways.  It complicates the format from being a simple list.
>And it means that the output format is not the same as the
>input format (you can't just write back what you read from
>such a file anymore).

As I had said in my earlier mail, this was just one way of
representing what I call isolated cpus. The other was to expose
isolated_map to userspace and move cpus between cpus_allowed
and isolated_map

>  * Several comments start with the word 'Set', as in:
>   Set isolated ON on a non exclusive cpuset
>Such wording suggests to me that something is being set,
>some bit or value changed or turned on.  But in each case,
>you are just testing for some condition that will return
>or error out.  Some phrasing such as "If ..." or other
>conditional would be clearer.

The wording was from the users point of view for what
action was being done, guess I'll change that

>  * The update_sched_domains() routine is complicated, and
>hence a primary clue that the conceptual model is not
>clean yet.

It is complicated because it has to handle all of the different
possible actions that the user can initiate. It can be simplified
if we have stricter rules of what the user can/cannot do
w.r.t to isolated cpusets

>  * None of this was explained in Documentation/cpusets.txt.

Yes I plan to add the documentation shortly

>  * Too bad that cpuset_common_file_write() has to have special
>logic for this isolated case.  The other flag settings just
>turn on and off the associated bit, and don't trigger any
>kernel code to adapt to new cpu or memory settings.  We
>should make an exception to that behaviour only if we must,
>and then we must be explicit about the exception.

See my notes on isolated_map above

> First, let me verify one thing.  I understand that the _key_
> purpose of your patch is not so much to isolate cpus, as it
> is to allow for structuring scheduling domains to align with
> cpuset boundaries.  I understand real isolated cpus to be ones
> that don't have a scheduling domain (have only the dummy one),
> as requested by the "isolcpus=..." boot flag.

Not really. Isolated cpusets allows you to do a soft-partition
of the system, and it would make sense to continue to have load
balancing within these partitions. I would think not having
load balancing should be one of the options available

> 
> Second, let me describe how this same issue shows up on the
> memory side.
> 

...snip...

> 
> 
> In the case of cpus, we really do prefer the partitions to be
> disjoint, because it would be better not to confuse the domain
> scheduler with overlapping domains.

Absolutely, one of the problems I had was to map the flat disjoint
hierarchy of sched domains to the tree-like hierarchy of cpusets

> 
> In the case of memory, we technically probably don't _have_ to
> keep the partitions disjoint.  I doubt that the page allocator
> (mm/page_alloc.c:__alloc_pages()) really cares.  It will strive
> valiantly to satisfy the memory request from any of the zones
> (each node specific) in the list passed into it.
> 
I must confess that I haven't looked at the memory side all that much,
having more interest in trying to build soft-partitioning of the cpus

> But for the purposes of providing a clear conceptual model to
> our users, I think it is best that we impose this constraint on
> the memory side as well as on the cpu side.  And I don't think
> it will deprive users of any useful 
