On Fri, 26 Oct 2007, Paul Jackson wrote:

>  1) If you want the current behaviour, where set_mempolicy(MPOL_INTERLEAVE)
>     calls mean what they say, and cpusets tries as best it can (imperfectly)
>     to honor those memory policy calls, even in the face of changing cpusets,
>     then leave memory_spread_user turned off (the default, right?)
>  2) If you want MPOL_INTERLEAVE tasks to interleave user memory across all
>     nodes in the cpuset, whatever that might be, then enable
>     memory_spread_user.
> 
> This is admittedly less flexible than your patch provided, but I am
> attracted to the simpler API - easier to explain.
> 

That seems to follow the convention better with respect to 
memory_spread_page and memory_spread_slab anyway.  Either all the tasks 
attached to the cpuset get that behavior or none of them do.  We can do 
the same for memory_spread_user.

Sounds good.

> This does raise yet another question: shouldn't memory_spread_user force
> interleaving of user memory -regardless- of mempolicy?
> 

Sure, I don't see any compelling reason why it shouldn't.

> And yet another question: what about the MPOL_BIND mempolicy?  It too,
> to a lesser extent, has the same problems with cpusets that shrink and
> then expand.  Several tasks in a cpuset with multiple nodes could carefully
> bind to a separate node each, but then end up collapsed all onto the same
> node if the cpuset was shrunk to one node and then expanded again.
> 

Hmm.  At some point we're going to have to just say that if you use 
mempolicies such as MPOL_BIND in your application and then you insanely 
take those nodes away from your application via cpusets, you are 
actually getting exactly what you asked for.

There are two ways to fix that: try to remap the MPOL_BIND nodes onto the 
new set of mems_allowed regardless of the cardinality of the two sets, or 
refuse to update the nodemask of the cpuset if you're taking one or more 
nodes away from an attached task that has such a policy.  I favor the 
former because, in conjunction with a sane memory_migrate setting, it 
shouldn't actually matter that much.  The memory you previously allocated 
will still be on the removed nodes; only your future allocations will 
actually respect the new nodemask.
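
To make the remapping option concrete, here is a minimal userspace sketch 
of mapping a policy nodemask onto a new mems_allowed of any cardinality.  
It is a model only, not the kernel's nodes_remap(): nodeset_t, the helper 
names and the 64-node limit are all assumptions made for illustration.  
Each policy node keeps its ordinal position within the old allowed set, 
wrapping around when the new set is smaller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a node set is a 64-bit mask, bit n set = node n. */
typedef uint64_t nodeset_t;

/* Number of nodes in the set. */
static int nodeset_weight(nodeset_t s)
{
    int w = 0;

    for (; s; s &= s - 1)   /* clear the lowest set bit each pass */
        w++;
    return w;
}

/* Node number of the ord-th (0-based) node in s, or -1 if there is none. */
static int nodeset_nth(nodeset_t s, int ord)
{
    for (int n = 0; n < 64; n++) {
        if ((s >> n) & 1) {
            if (ord-- == 0)
                return n;
        }
    }
    return -1;
}

/* 0-based position of node n among the nodes of s, or -1 if n is not in s. */
static int nodeset_ord(nodeset_t s, int n)
{
    if (!((s >> n) & 1))
        return -1;
    return nodeset_weight(s & (((nodeset_t)1 << n) - 1));
}

/*
 * Remap policy from old_allowed onto new_allowed, wrapping modulo the
 * size of new_allowed.  Policy nodes outside old_allowed are dropped
 * (a simplification chosen for this sketch).
 */
nodeset_t remap_policy(nodeset_t policy, nodeset_t old_allowed,
                       nodeset_t new_allowed)
{
    int new_w = nodeset_weight(new_allowed);
    nodeset_t out = 0;

    assert(new_w > 0);      /* an empty mems_allowed is not handled */
    for (int n = 0; n < 64; n++) {
        int ord;

        if (!((policy >> n) & 1))
            continue;
        ord = nodeset_ord(old_allowed, n);
        if (ord < 0)
            continue;
        out |= (nodeset_t)1 << nodeset_nth(new_allowed, ord % new_w);
    }
    return out;
}
```

With this model, two tasks bound to nodes 2 and 3 of a four-node cpuset 
(0xF) both land on node 0 when the cpuset shrinks to 0x1, and remapping 
0x1 from 0x1 back onto 0xF still yields 0x1 for both, which is the 
collapse Paul describes.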

The MPOL_INTERLEAVE case is more interesting because we're trying to 
reduce bus contention and decrease our latency with quicker memory access.  
So, using the true definition of a node as a premise, we should get better 
results in terms of performance if we expand the nodemask as much as 
possible.  That's exactly what we've been trying to address: when an 
application's mems_allowed is expanded to allow more nodes, the 
application is unaware of the change and can't take advantage of it 
(without the get_mempolicy() - set_mempolicy() loop).  That's my whole 
case for why cpusets should be modifying MPOL_INTERLEAVE policies in the 
first place: because they are the ones that allowed access to more memory.

> On a different point, we could, if it was worth the extra bit of code,
> improve the current code's handling of mempolicy rebinding when the
> cpuset adds memory nodes.  If we kept both the original cpuset's
> mems_allowed, and the original MPOL_INTERLEAVE nodemask requested by
> the user in a call to set_mempolicy, then we could rebind (nodes_remap)
> the currently active policy v.nodes using that pair of saved masks to
> guide the rebinding.  This way, if say a cpuset shrunk, then regrew back
> to its original size (original number of nodes) we would end up
> replicating the original MPOL_INTERLEAVE request, cpuset relative.

Keeping a copy of the nodemask passed to set_mempolicy() in struct 
mempolicy is an interesting idea and could, with the logic you describe, 
help guide the remapping as the set of allowed nodes changes.  We'd have 
two interleaved nodemasks, the actual (pol->v.nodes) and the requested 
(something like pol->passed_nodemask).  get_mempolicy() would always 
return the actual nodemask so the application can be aware of what it has 
access to and what it doesn't.  I like it.
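
Roughly, the bookkeeping could look like the userspace sketch below.  It 
is a model under stated assumptions, not a patch: struct 
interleave_policy, passed_nodemask, orig_allowed, remap() and the 64-bit 
nodeset_t are hypothetical names, and only nodes stands in for the real 
pol->v.nodes.  Every rebind starts from the saved pair of masks rather 
than from the currently active mask.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t nodeset_t;     /* model only: bit n set = node n */

/* Hypothetical policy that remembers what the user originally asked for. */
struct interleave_policy {
    nodeset_t nodes;            /* actual mask, stands in for pol->v.nodes */
    nodeset_t passed_nodemask;  /* mask passed to set_mempolicy() (assumed name) */
    nodeset_t orig_allowed;     /* mems_allowed at that time (assumed field) */
};

/*
 * Map each node of src from its ordinal position in 'from' to the same
 * position in 'to', wrapping when 'to' has fewer nodes.
 */
static nodeset_t remap(nodeset_t src, nodeset_t from, nodeset_t to)
{
    int to_nodes[64], to_w = 0, ord = 0;
    nodeset_t out = 0;

    for (int n = 0; n < 64; n++)
        if ((to >> n) & 1)
            to_nodes[to_w++] = n;
    assert(to_w > 0);           /* an empty target set is not handled */

    for (int n = 0; n < 64; n++) {
        if (!((from >> n) & 1))
            continue;
        if ((src >> n) & 1)
            out |= (nodeset_t)1 << to_nodes[ord % to_w];
        ord++;                  /* ordinal position within 'from' */
    }
    return out;
}

/* Model of set_mempolicy(): record the request and the cpuset it was made in. */
void policy_set(struct interleave_policy *pol, nodeset_t requested,
                nodeset_t allowed)
{
    assert((requested & ~allowed) == 0);    /* sketch assumes a subset */
    pol->passed_nodemask = requested;
    pol->orig_allowed = allowed;
    pol->nodes = requested;
}

/* Model of the cpuset changing: always rebind from the saved pair of masks. */
void policy_rebind(struct interleave_policy *pol, nodeset_t new_allowed)
{
    pol->nodes = remap(pol->passed_nodemask, pol->orig_allowed, new_allowed);
}
```

A task that asks for nodes 0 and 2 in a cpuset of nodes 0-3 gets 0x5; 
after policy_rebind() to a one-node cpuset (0x1) its actual mask is 0x1, 
and after policy_rebind() back to 0xF it is 0x5 again, while 
passed_nodemask never changes.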

                David