Re: [patch 2/2] cpusets: add interleave_over_allowed option

David Rientjes Sun, 28 Oct 2007 10:22:11 -0800

On Sat, 27 Oct 2007, Paul Jackson wrote:

> > but I actually would recommend against any flag to effect Choice A.
> > It's simply going to be too complex to describe and is going to be a
> > headache to code and support. 
> 
> While I am sorely tempted to agree entirely with this, I suspect that
> Christoph has a point when he cautions against breaking this kernel API.
> 
> Especially for users of the set/get mempolicy calls coming in via
> libnuma, we have to be very careful not to break the current behaviour,
> whether it is documented API or just an accident of the implementation.
>


From a standpoint of the MPOL_PREFERRED memory policy itself, there is no
documented behavior or standard that specifies its interaction with 
cpusets.  Thus, it's "undefined."  We are completely free to implement an 
undefined behavior as we choose and change it as Linux matures.

Once it is defined, however, we carry the burden of protecting 
applications that are written on that definition.  That's the point where 
we need to get it right and if we don't, we're stuck with it forever; I 
don't believe we're at that point with MPOL_PREFERRED policies under 
cpusets right now.

> There is a fairly deep and important stack of software, involving a
> well known DBMS product whose name begins with 'O', sitting on that
> libnuma software stack.  Steering that solution stack is like steering
> a giant oil tanker near shore.  You take it slow and easy, and listen
> closely to the advice of the ancient harbor master.  The harbor masters
> in this case are or were Andi Kleen and Christoph Lameter.
> 

Ok, let's take a look at some specific non-proprietary examples of tasks
that use set_mempolicy(MPOL_PREFERRED) for a specific node, intending it
to be the actual system node offset, and that are then assigned to a
cpuset that doesn't require that offset to be allowed.

I think it's going to become pretty difficult to find an example because 
the whole scenario is pretty lame: you would need to already know which 
nodes you're going to be assigned to in the cpuset to ask for one of them 
as your preferred node.  I don't imagine any application can have that 
type of foresight and, if it does, then we certainly shouldn't support the 
preferred node_remap() when it changes mems.

You're trying to support a scheme, in Choice A, where an application knows 
it's going to be assigned to a range of nodes (for example, 1-3) and wants 
the preferred node to be included (for example, 2).  So now the 
application must have control over both its memory policy and its cpuset 
placement.  Then it must be willing to change its cpuset placement to a 
different set of nodes (with equal or greater cardinality) and have the 
preferred node offset respected.  Why can't it simply then issue another 
set_mempolicy(MPOL_PREFERRED) call for the new preferred node?
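
A minimal user-space sketch of that re-issue, for illustration only.  It
assumes nodes 0-63 fit in one 64-bit mask; nth_allowed_node() and
preferred_mask_for() are hypothetical helpers, not kernel or libnuma
functions.  Given the task's current mems and the offset it wants, they
compute the single-node mask to hand to set_mempolicy(MPOL_PREFERRED).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: return the system node number of the n-th
 * (0-based) node set in `allowed`, or -1 if fewer than n+1 nodes are
 * allowed.  Assumes at most 64 nodes. */
static int nth_allowed_node(uint64_t allowed, int n)
{
    assert(n >= 0);
    for (int node = 0; node < 64; node++) {
        if (allowed & (UINT64_C(1) << node)) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/* Hypothetical helper: build the single-node mask an application would
 * pass to set_mempolicy(MPOL_PREFERRED, ...) after its mems changed.
 * Returns 0 if the offset does not fit within the allowed nodes. */
static uint64_t preferred_mask_for(uint64_t allowed, int offset)
{
    int node = nth_allowed_node(allowed, offset);
    return node < 0 ? 0 : UINT64_C(1) << node;
}
```

With mems 1-3 (mask 0xE) and offset 1 this picks node 2; after a move to
mems 4-6 (mask 0x70) the same offset picks node 5.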

See?  The problem is that you're trying to protect an application that
knows its initial cpuset mems [the only way it could ever send a
set_mempolicy(MPOL_PREFERRED) for the right node in that range in the
first place] but then seemingly loses control over its cpuset and expects
the kernel to fix things up for it without the burden of issuing another
set_mempolicy() call.

And you're trying to protect an application that based its
implementation not on a standard or documentation, but on observed
behavior.  My bet is that it's going to issue that subsequent
set_mempolicy(), at least if libnuma returned a numa_preferred() value
that it wasn't expecting.

> True, which is why I am hoping we can keep this modal flag, if such be,
> from having to be used on every set/get mempolicy call.  The ordinary
> coder of new code using these calls directly should just see Choice B
> behaviour.  However the user of libnuma should continue to see whatever
> API libnuma supports, with no change whatsoever, and various versions of
> libnuma, including those already shipped years ago, must continue to
> behave without any changes in node numbering.
> 

I don't see how you can accomplish that.  If the default behavior is 
Choice B, which is different from what is currently implemented in the 
kernel, you're going to either require a modification to the application 
to set a flag asking for Choice A again or make the default kernel 
behavior that of Choice A and set a flag implicitly via libnuma when 
future versions are released.

In the former case, just ask the application to adjust its node numbering 
scheme or check the result of numa_preferred().  In the latter case, we're 
not even talking about changing the kernel default anymore to Choice B.

>  2) We have a per-task mode flag selecting whether Choice A or B
>     node numbering apply to the masks passed in to set_mempolicy.
> 
>     The kernel implementation is fairly easy.  (Yeah, I know, I
>     too cringe every time I read that line ;)
> 

If you add this per-task mode flag to default to Choice A for preferred 
memory policies, it'll be extremely confusing to document and support.  If 
it's already decided that we should default to Choice B, it's going to 
require an update to the application to write to /proc/pid/i_want_choice_A 
or use the new set_mempolicy() option anyway, so instead of adding that 
hack you should simply fix your node numbering.

And I suspect that if that per-task mode flag is added, it will eventually 
be the subject of a thread with the subject "is this highly specialized 
flag even used anymore?" at which point it will be marked deprecated and 
eventually obsoleted.

>     The bulk of the kernel's mempolicy code is coded for Choice B.
> 
>     If Choice B is active, we don't enforce the subset check in
>     contextualize_policy(), and we don't invoke nodes_remap() in either
>     of the set or get mempolicy code paths.
> 

Yeah, remapping the nodemask is a bad idea anyway to get a preferred node.  
Preferred nodes inherently deal with offsets from node 0 anyway.
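
For reference, the remap under discussion can be modelled in user space
as below.  This is a sketch, not the kernel's code: it assumes a 64-node
mask and assumes the remap preserves a node's ordinal position (its rank
among the set bits of the old mask, taken modulo the weight of the new
mask).  All three function names are hypothetical.

```c
#include <stdint.h>

/* Model only: 0-based rank of `node` among the set bits of `mask`,
 * or -1 if `node` is out of range or not set in `mask`. */
static int node_ord(uint64_t mask, int node)
{
    if (node < 0 || node >= 64 || !(mask & (UINT64_C(1) << node)))
        return -1;
    int ord = 0;
    for (int i = 0; i < node; i++)
        if (mask & (UINT64_C(1) << i))
            ord++;
    return ord;
}

/* Model only: number of nodes set in `mask`. */
static int mask_weight(uint64_t mask)
{
    int w = 0;
    for (; mask; mask &= mask - 1)
        w++;
    return w;
}

/* Model only: map `node` from `old_mask` to the node with the same
 * ordinal position in `new_mask`, wrapping around if `new_mask` has
 * fewer nodes.  A node not in `old_mask` is returned unchanged. */
static int model_node_remap(int node, uint64_t old_mask, uint64_t new_mask)
{
    int ord = node_ord(old_mask, node);
    int w = mask_weight(new_mask);

    if (ord < 0 || w == 0)
        return node;
    ord %= w;
    for (int i = 0; i < 64; i++) {
        if (new_mask & (UINT64_C(1) << i)) {
            if (ord == 0)
                return i;
            ord--;
        }
    }
    return node; /* not reached when w > 0 */
}
```

Under this model node 2 in mems 1-3 maps to node 5 in mems 4-6, and a
node that was not in the old mask is left unchanged.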

>     A new option to get_mempolicy() would query the current state of
>     this mode flag, and a new option to set_mempolicy() would set
>     and clear this mode flag.  Perhaps Christoph had this in mind
>     when he wrote in an earlier message "The alternative is to add
>     new set/get mempolicy functions."
> 

That still requires a change to the application.  So they should simply 
rethink their node numbering instead and fix their application to follow a 
behavior that will, at that point, be documented.

Any application that doesn't check the return value of
set_mempolicy(MPOL_PREFERRED) isn't worth supporting anyway.

There are two cases to think about:

 - When the cpuset assignment changes from the root cpuset to a
   user-created cpuset with a subset of system mems and then
   set_mempolicy() is called, and

 - When set_mempolicy() is called and then the cpuset mems change either
   because it was attached to a different cpuset or someone wrote to its
   'mems' file.

In the first case, the new API should return -EINVAL if you ask for a
preferred node offset that is not smaller than the cardinality of your
mems_allowed.  That will catch some of these applications that may have
actually been implemented based on the current undocumented behavior.

In the second case, the first node in the nodemask passed to 
set_mempolicy() was a system node offset anyway and had nothing to do with 
cpusets (it was a member of the root cpuset with access to all mems) so it 
already behaves as Choice B.
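
The check for the first case could look roughly like the sketch below.
It reads the rule as "reject an offset that does not fit within
mems_allowed", which is an interpretation, and it assumes a 64-node
mask.  check_preferred_offset() is a hypothetical name, not an existing
kernel function.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical check: a cpuset-relative preferred node offset is valid
 * only if it indexes one of the nodes in mems_allowed, that is, if
 * 0 <= offset < number of allowed nodes.  Returns 0 or -EINVAL. */
static int check_preferred_offset(uint64_t mems_allowed, int offset)
{
    int weight = 0;

    /* Count the allowed nodes by clearing the lowest set bit each pass. */
    for (uint64_t m = mems_allowed; m; m &= m - 1)
        weight++;

    if (offset < 0 || offset >= weight)
        return -EINVAL;
    return 0;
}
```

With mems 1-3 (mask 0xE) offsets 0 through 2 are accepted and offset 3
gets -EINVAL.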

> There are two major user level libraries sitting on top of this API,
> libnuma and libcpuset.  Libnuma is well known; it was written by Andi
> Kleen.  I wrote libcpuset, and while it is LGPL licensed, it has not
> been publicized very well yet.  I can speak for libcpuset: it could
> adapt to the above proposal, in particular to the details in way (2),
> just fine.  Old versions of libcpuset running on new kernels will
> have a little bit of subtle breakage, but not in areas that I expect
> will cause much grief.  Someone more familiar with libnuma than I would
> have to examine the above proposal in way (2) to be sure that we weren't
> throwing libnuma some curveball that was unnecessarily troublesome.
> 

I think any application that gets constrained to a subset of nodes in its 
mems_allowed and then bases its preferred node number off that subset to 
create an offset that is intended to be preserved over subsequent mems 
changes without rechecking the result with numa_preferred() or issuing a 
subsequent set_mempolicy() is poorly written.  Especially since that 
behavior was undocumented.

                David