On Fri, 26 Oct 2007, Paul Jackson wrote:

> Choice A:
>     as it does today, the second node in the task's cpuset or it could
>     mean
> 
> Choice B:
>     the fourth node in the cpuset, if available, just as
>     it did in the case above involving a cpuset on nodes 10 and 11.
> 

Thanks for describing the situation with MPOL_PREFERRED so thoroughly.

I prefer Choice B because it does not force mempolicies to have any 
dependence on cpusets with regard to what nodemask is passed.

        [EMAIL PROTECTED] ~]$ man set_mempolicy | grep -i cpuset | wc -l
        0

It would be very good to store the passed nodemask to set_mempolicy in 
struct mempolicy, as you've already recommended for MPOL_INTERLEAVE, so 
that you can try to match the intent of the application as much as 
possible.  But since cpusets are built on top of mempolicies, I don't 
think there's any reason the passed nodemask should be interpreted 
relative to the current cpuset context, whether the policy is preferred 
or interleave.
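
Something along these lines is all I mean (just a sketch, not the real 
struct mempolicy layout; the field names are made up):

        /*
         * Sketch only -- not the kernel's struct mempolicy.  The point is
         * just that the nodemask handed to set_mempolicy() gets recorded
         * verbatim, so the application's original intent is still known
         * when the cpuset's mems_allowed later changes.
         */
        struct mempolicy_sketch {
                int             mode;           /* MPOL_PREFERRED, MPOL_INTERLEAVE, ... */
                unsigned long   user_nodemask;  /* mask exactly as the task passed it */
        };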

So if you were to pass a nodemask with only the fourth node set for an 
MPOL_PREFERRED mempolicy, the correct behavior would be to prefer the 
fourth node on the system or, if constrained by cpusets, the fourth node 
in the cpuset.  If the cpuset has fewer than four nodes, the behavior 
should be undefined (probably implemented to just cycle the set of 
mems_allowed until you reach the fourth entry).  That's the result of 
constraining a task that obviously wants access to more nodes to such a 
cpuset -- it's a userspace mistake, and abusing cpusets that way means 
the task does not get what it expects.
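
To make the fallback concrete, here's a minimal sketch of the remapping 
I have in mind (the pick_nth_node() helper and the hardcoded node 
numbers are purely illustrative, not actual kernel code):

        #include <stdio.h>

        /*
         * Illustrative helper: return the n-th (0-based) set bit of
         * mems_allowed, cycling around when the mask has fewer than
         * n+1 bits set -- the "cycle mems_allowed until you reach the
         * fourth entry" fallback described above.  Assumes mems_allowed
         * is non-empty.
         */
        static int pick_nth_node(unsigned long mems_allowed, int n)
        {
                int node = -1;

                while (n >= 0) {
                        do {
                                node = (node + 1) % (8 * (int)sizeof(mems_allowed));
                        } while (!(mems_allowed & (1UL << node)));
                        n--;
                }
                return node;
        }

        int main(void)
        {
                /* the task asked to prefer the fourth node (node 3) ... */
                int wanted = 3;

                /* ... but its cpuset only contains nodes 10 and 11 */
                unsigned long mems_allowed = (1UL << 10) | (1UL << 11);

                printf("preferred node becomes %d\n",
                       pick_nth_node(mems_allowed, wanted));
                return 0;
        }

With the nodes 10-11 cpuset from your example above, the fourth entry 
wraps around and lands on node 11, so the task at least ends up with a 
sane preferred node.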

That concept isn't actually new: we already restrict tasks to a certain 
amount of memory by writing to the mems file, and the fact that a task 
would have access to more memory when unconstrained by cpusets doesn't 
matter.  You've placed it in a cpuset that wasn't prepared to deal with 
what the task was asking for.  In the MPOL_PREFERRED case you describe 
above, the task is at least handled more gracefully: it still gets a 
preferred node rather than being OOM killed once it has exhausted its 
available cpuset-constrained memory.

I'd prefer a solution where mempolicies can always be described and used 
without ever considering cpusets.  Then, a sane implementation will 
configure the cpuset accordingly to accommodate its tasks' mempolicies.  
We don't want to get into a situation where we deny attaching a task to 
a cpuset because, for example, the cpuset has fewer nodes than the 
preferred node requires; instead we can fall back to better behavior by 
at least giving the task a preferred node in the MPOL_PREFERRED case.

                David
