On Sat, 27 Oct 2007, Paul Jackson wrote:

> > but I actually would recommend against any flag to effect Choice A.
> > It's simply going to be too complex to describe and is going to be a
> > headache to code and support.
>
> While I am sorely tempted to agree entirely with this, I suspect that
> Christoph has a point when he cautions against breaking this kernel API.
>
> Especially for users of the set/get mempolicy calls coming in via
> libnuma, we have to be very careful not to break the current behaviour,
> whether it is documented API or just an accident of the implementation.
>
From a standpoint of the MPOL_PREFERRED memory policy itself, there is no
documented behavior or standard that specifies its interaction with
cpusets. Thus, it's "undefined." We are completely free to implement
undefined behavior as we choose and to change it as Linux matures. Once
it is defined, however, we carry the burden of protecting applications
that are written against that definition. That's the point where we need
to get it right and, if we don't, we're stuck with it forever; I don't
believe we're at that point with MPOL_PREFERRED policies under cpusets
right now.

> There is a fairly deep and important stack of software, involving a
> well known DBMS product whose name begins with 'O', sitting on that
> libnuma software stack. Steering that solution stack is like steering
> a giant oil tanker near shore. You take it slow and easy, and listen
> closely to the advice of the ancient harbor master. The harbor masters
> in this case are or were Andi Kleen and Christoph Lameter.
>

Ok, let's take a look at some specific non-proprietary examples of tasks
that use set_mempolicy(MPOL_PREFERRED) for a specific node, intending it
to be the actual system node offset, and that are then assigned to a
cpuset that doesn't require that offset to be allowed. I think it's going
to become pretty difficult to find an example because the whole scenario
is pretty lame: you would need to already know which nodes you're going
to be assigned to in the cpuset to ask for one of them as your preferred
node. I don't imagine any application can have that type of foresight
and, if it does, then we certainly shouldn't support the preferred
node_remap() when it changes mems.

You're trying to support a scheme, in Choice A, where an application
knows it's going to be assigned to a range of nodes (for example, 1-3)
and wants the preferred node to be included (for example, 2). So now the
application must have control over both its memory policy and its cpuset
placement.
Then it must be willing to change its cpuset placement to a different set
of nodes (with equal or greater cardinality) and have the preferred node
offset respected. Why can't it simply then issue another
set_mempolicy(MPOL_PREFERRED) call for the new preferred node?

See? The problem is that you're trying to protect an application that
knows its initial cpuset mems [the only way it could ever send a
set_mempolicy(MPOL_PREFERRED) for the right node in that range in the
first place] but then seemingly loses control over its cpuset and expects
the kernel to fix things up for it without having to issue another
set_mempolicy() call. And you're trying to protect an application that
based its implementation not on a standard or documentation, but on
observed behavior. My bet is that it's going to issue that subsequent
set_mempolicy(), at least if libnuma returned a numa_preferred() value
that it wasn't expecting.

> True, which is why I am hoping we can keep this modal flag, if such be,
> from having to be used on every set/get mempolicy call. The ordinary
> coder of new code using these calls directly should just see Choice B
> behaviour. However the user of libnuma should continue to see whatever
> API libnuma supports, with no change whatsoever, and various versions of
> libnuma, including those already shipped years ago, must continue to
> behave without any changes in node numbering.
>

I don't see how you can accomplish that. If the default behavior is
Choice B, which is different from what is currently implemented in the
kernel, you're going to either require a modification to the application
to set a flag asking for Choice A again, or make the default kernel
behavior that of Choice A and set a flag implicitly via libnuma when
future versions are released. In the former case, just ask the
application to adjust its node numbering scheme or check the result of
numa_preferred().
In the latter case, we're not even talking about changing the kernel
default to Choice B anymore.

>  2) We have a per-task mode flag selecting whether Choice A or B
>     node numbering apply to the masks passed in to set_mempolicy.
>
>     The kernel implementation is fairly easy. (Yeah, I know, I
>     too cringe everytime I read that line ;)
>

If you add this per-task mode flag to default to Choice A for preferred
memory policies, it'll be extremely confusing to document and support.
If it's already decided that we should default to Choice B, it's going to
require an update to the application to write to
/proc/pid/i_want_choice_A or use the new set_mempolicy() option anyway,
so instead of adding that hack you should simply fix your node numbering.

And I suspect that if that per-task mode flag is added, it will
eventually be the subject of a thread titled "is this highly specialized
flag even used anymore?", at which point it will be marked deprecated and
eventually obsoleted.

>     The bulk of the kernel's mempolicy code is coded for Choice B.
>
>     If Choice B is active, we don't enforce the subset check in
>     contextualize_policy(), and we don't invoke nodes_remap() in either
>     of the set or get mempolicy code paths.
>

Yeah, remapping the nodemask is a bad way to get a preferred node anyway.
Preferred nodes inherently deal with offsets from node 0.

>     A new option to get_mempolicy() would query the current state of
>     this mode flag, and a new option to set_mempolicy() would set
>     and clear this mode flag. Perhaps Christoph had this in mind
>     when he wrote in an earlier message "The alternative is to add
>     new set/get mempolicy functions."
>

That still requires a change to the application. So they should simply
rethink their node numbering instead and fix their application to follow
a behavior that will, at that point, be documented. Any application that
doesn't respect the return value of set_mempolicy(MPOL_PREFERRED) isn't
worth supporting anyway.
There are two cases to think about:

 - When the cpuset assignment changes from the root cpuset to a
   user-created cpuset with a subset of system mems and then
   set_mempolicy() is called, and

 - When set_mempolicy() is called and then the cpuset mems change,
   either because the task was attached to a different cpuset or because
   someone wrote to its 'mems' file.

In the first case, the new API should return -EINVAL if you ask for a
preferred node offset that is not smaller than the cardinality of your
mems_allowed. That will catch some of these applications that may have
actually been implemented based on the current undocumented behavior.

In the second case, the first node in the nodemask passed to
set_mempolicy() was a system node offset anyway and had nothing to do
with cpusets (it was a member of the root cpuset with access to all
mems), so it already behaves as Choice B.

> There are two major user level libraries sitting on top of this API,
> libnuma and libcpuset. Libnuma is well known; it was written by Andi
> Kleen. I wrote libcpuset, and while it is LGPL licensed, it has not
> been publicized very well yet. I can speak for libcpuset: it could
> adapt to the above proposal, in particular to the details in way (2),
> just fine. Old versions of libcpuset running on new kernels will
> have a little bit of subtle breakage, but not in areas that I expect
> will cause much grief. Someone more familiar with libnuma than I would
> have to examine the above proposal in way (2) to be sure that we weren't
> throwing libnuma some curveball that was unnecessarily troublesome.
>

I think any application that gets constrained to a subset of nodes in its
mems_allowed, and then bases its preferred node number on that subset to
create an offset that is intended to be preserved over subsequent mems
changes, without rechecking the result with numa_preferred() or issuing a
subsequent set_mempolicy(), is poorly written, especially since that
behavior was undocumented.
David