On Fri, 26 Oct 2007, Paul Jackson wrote:
> Choice A:
>     as it does today, the second node in the tasks cpuset
> or it could mean
>
> Choice B:
>     the fourth node in the cpuset, if available, just as
>     it did in the case above involving a cpuset on nodes 10 and 11.
Thanks for describing the situation with MPOL_PREFERRED so thoroughly.

I prefer Choice B because it does not force mempolicies to have any dependence on cpusets with regard to what nodemask is passed. Indeed, the set_mempolicy(2) man page never mentions cpusets at all:

[EMAIL PROTECTED] ~]$ man set_mempolicy | grep -i cpuset | wc -l
0

It would be very good to store the nodemask passed to set_mempolicy() in struct mempolicy, as you've already recommended for MPOL_INTERLEAVE, so that we can match the intent of the application as closely as possible. But since cpusets are built on top of mempolicies, I don't think there's any reason why we should interpret a nodemask in terms of the current cpuset context, whether the policy is MPOL_PREFERRED or MPOL_INTERLEAVE.

So if you were to pass a nodemask with only the fourth node set for an MPOL_PREFERRED mempolicy, the correct behavior would be to prefer the fourth node on the system or, if constrained by cpusets, the fourth node in the cpuset (see the userspace sketch appended below). If the cpuset has fewer than four nodes, the behavior should be undefined -- probably implemented by cycling through mems_allowed until you reach the fourth entry (a hypothetical sketch of that cycling is also appended below). That's the consequence of constraining to a cpuset a task that obviously wants access to more nodes: it's a userspace mistake, an abuse of cpusets that leaves the task without what it expects.

That concept isn't actually new: we already restrict tasks to a subset of memory by writing to the cpuset's mems file, and the fact that a task would have access to more memory when unconstrained by cpusets doesn't matter. You've placed it in a cpuset that wasn't prepared to deal with what the task was asking for. At least in the MPOL_PREFERRED case you describe above, the situation is handled far more gracefully: the task at least gets a preferred node instead of being OOM killed once it has exhausted its cpuset-constrained memory.

I'd prefer a solution where mempolicies can always be described and used without ever considering cpusets; a sane configuration would then set up the cpuset to accommodate its tasks' mempolicies. We don't want to end up refusing to attach a task to a cpuset because, say, the cpuset has fewer nodes than the preferred node's index, when we can fall back to better behavior by at least giving the task a preferred node in the MPOL_PREFERRED case.

David
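
Purely for illustration, here is a minimal userspace sketch of the call discussed above. It assumes libnuma's numaif.h wrapper (link with -lnuma), and node 3 stands in for "the fourth node":

#include <stdio.h>
#include <stdlib.h>
#include <numaif.h>	/* set_mempolicy(), MPOL_PREFERRED */

int main(void)
{
	/* Nodemask with only bit 3 set: "the fourth node". */
	unsigned long nodemask = 1UL << 3;

	/*
	 * Prefer allocations from node 3.  The third argument is the
	 * number of bits in the mask, not its size in bytes.
	 */
	if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return EXIT_FAILURE;
	}

	/* Subsequent page allocations for this task now prefer node 3. */
	return EXIT_SUCCESS;
}

Under Choice B, the kernel would map that node 3 onto the third offset within the task's mems_allowed rather than taking it as an absolute system node when the task is cpuset-constrained.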
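
And a hypothetical sketch of the "cycle mems_allowed" fallback for a too-small cpuset. nth_node_in_cpuset() is an invented name, not an existing kernel function, and it assumes a non-empty mask:

#include <linux/nodemask.h>	/* nodemask_t, first_node(), next_node() */

/*
 * Return the n-th node (0-based) of *mems, wrapping around when the
 * cpuset has fewer than n + 1 nodes, so the task still gets something
 * sane rather than an outright failure.
 */
static int nth_node_in_cpuset(const nodemask_t *mems, int n)
{
	int node = first_node(*mems);
	int i;

	for (i = 0; i < n; i++) {
		node = next_node(node, *mems);
		if (node == MAX_NUMNODES)	/* ran off the end: wrap */
			node = first_node(*mems);
	}
	return node;
}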