On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
> On 2019/10/9 20:25, Robin Murphy wrote:
> > On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> >> On 2019/9/25 18:41, Peter Zijlstra wrote:
> >>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> >>>> From the discussion above, it seems that making node_to_cpumask_map()
> >>>> NUMA_NO_NODE aware is the most feasible way to move forward.
> >>>
> >>> That's still wrong.
> >>
> >> Hi, Peter
> >>
> >> It seems this discussion has gone in circles.
> >>
> >> From my understanding, NUMA_NO_NODE, which means no NUMA node
> >> preference, is the state that describes a virtual device, or a
> >> physical device that has equal distance to all CPUs.
> >>
> >> We can be stricter if the device does have a nearer node, but we
> >> cannot deny that a device may have no NUMA node preference or
> >> affinity at all, which also means the control or data buffers can
> >> be allocated on the node where the process is running.
> >>
> >> As you have proposed, making it -2 and having dev_to_node() warn if
> >> the device does have a nearer node that was not set by the fw is a
> >> way to be stricter.
> >>
> >> But I think being stricter is not really relevant to NUMA_NO_NODE,
> >> because we do need a state to describe a device that has equal
> >> distance to all nodes, even if that is not physically scalable.
> >>
> >> Any better suggestion to move this forward?
> >
> > FWIW (since this is in my inbox), it sounds like the fundamental
> > issue is that NUMA_NO_NODE is conflated for at least two different
> > purposes, so trying to sort that out would be a good first step.
> > AFAICS we have genuine "don't care" cases like alloc_pages_node(),
> > where if the producer says it doesn't matter then the consumer is
> > free to make its own judgement on what to do, and fundamentally
> > different "we expect this thing to have an affinity but it doesn't,
> > so we can't say what's appropriate" cases, which could really do
> > with some separate indicator like "NUMA_INVALID_NODE".
> >
> > The tricky part is then bestowed on the producers to decide whether
> > they can downgrade "invalid" to "don't care". You can technically
> > build 'a device' whose internal logic is distributed between nodes
> > and thus appears to have equal affinity - interrupt controllers, for
> > example, may have per-CPU or per-node interfaces that end up looking
> > like that - so although it's unlikely, it's not outright
> > nonsensical. Similarly, a 'device' that's actually emulated behind a
> > firmware call interface may well effectively have no real affinity.
>
> We may set the node of a physical device to NUMA_INVALID_NODE when
> the fw does not provide one.
>
> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is
> called with nid being NUMA_INVALID_NODE?
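For reference, the "don't care" path Robin describes is what alloc_pages_node() already implements in the kernel: a NUMA_NO_NODE nid falls back to the local memory node. The sketch below is a minimal user-space model of that fallback, plus one possible downgrade policy for the proposed NUMA_INVALID_NODE; the -2 sentinel, the downgrade choice, and resolve_nid() itself are illustrative assumptions, not kernel code.

```c
#include <assert.h>
#include <stdio.h>

/* User-space model; -1 mirrors the kernel's NUMA_NO_NODE, and -2 is
 * Peter's suggested sentinel for "affinity expected but missing". */
#define NUMA_NO_NODE      (-1)  /* genuine "don't care" */
#define NUMA_INVALID_NODE (-2)  /* fw should have set a node but didn't */

/* Stand-in for numa_mem_id(): the memory node of the running CPU. */
static int local_node = 0;

/* Models the consumer-side decision: an "invalid" nid cannot be acted
 * on sensibly, so it is reported and downgraded to "don't care" rather
 * than failing the allocation; a "don't care" nid then falls back to
 * the local node, as alloc_pages_node() does. */
static int resolve_nid(int nid)
{
	if (nid == NUMA_INVALID_NODE) {
		fprintf(stderr, "device node unset by firmware\n");
		nid = NUMA_NO_NODE;	/* downgrade "invalid" to "don't care" */
	}
	if (nid == NUMA_NO_NODE)
		nid = local_node;	/* local-node fallback */
	return nid;
}
```

An explicit node is passed through untouched; only the two sentinel values trigger the fallback logic.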
There is nothing sensible the allocator can do. The only point of
NUMA_INVALID_NODE would be to catch potential misconfigurations and
report them to users so they can complain to their HW/FW suppliers.
Pushing it to other subsystems doesn't make much sense IMHO because
there is nothing really actionable there. Refusing the allocation
altogether sounds like a bad plan to me.

> If we change the node to a default one (like node 0) when the node of
> the device is NUMA_INVALID_NODE in device_add(), how do we know the
> default one (like node 0) is the right one to choose?

Exactly. We cannot really assume any node in that situation.

> From the previous discussion, the points below do not seem to have
> reached consensus yet:
> 1) Do we need a state like NUMA_NO_NODE to describe that the device
>    does not have any numa preference?

This is the traditional meaning the MM subsystem is using.

> 2) What do we do if the fw does not provide a node for the device?
>    Should we guess and pick one for it, and how do we do the
>    guessing? Or leave it as it is and handle it as NUMA_NO_NODE?

As already pointed out several times, picking a node is rather error
prone. You can never assume topology. We used to assume that there
would always be a node 0, but that is not really the case (see commit
3e8589963773 ("memcg: make it work on sparse non-0-node systems")).
Nodes might also come and go, so this might just lead to all sorts of
subtle problems. On the other hand, using NUMA_NO_NODE as "no
preference" could at worst lead to slightly sub-optimal performance.

I do agree with Peter that reporting a lack of affinity might be
useful, but we shouldn't try to be clever and make up the affinity
willy-nilly.
--
Michal Hocko
SUSE Labs