On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
> On 2019/10/9 20:25, Robin Murphy wrote:
> > On 2019-10-08 9:38 am, Yunsheng Lin wrote:
> >> On 2019/9/25 18:41, Peter Zijlstra wrote:
> >>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
> >>>>  From the discussion above, It seems making the node_to_cpumask_map()
> >>>> NUMA_NO_NODE aware is the most feasible way to move forward.
> >>>
> >>> That's still wrong.
> >>
> >> Hi, Peter
> >>
> >> It seems this discussion has gone round in circles.
> >>
> >>  From my understanding, NUMA_NO_NODE, meaning no NUMA node preference,
> >> is the state describing the node of a virtual device, or of a physical
> >> device that has equal distance to all cpus.
> >>
> >> We can be stricter if the device does have a nearer node, but we cannot
> >> deny that a device may have no NUMA node preference or node affinity,
> >> which also means the control or data buffer can be allocated at the node
> >> where the process is running.
> >>
> >> As you have proposed, making it -2 and having dev_to_node() warn if the
> >> device does have a nearer node that is not set by the fw is a way to be
> >> stricter.
> >>
> >> But I think being stricter is not really relevant to NUMA_NO_NODE,
> >> because we do need a state to describe a device that has equal distance
> >> to all nodes, even if that is not physically scalable.
> >>
> >> Any better suggestion to move this forward?
> > 
> > FWIW (since this is in my inbox), it sounds like the fundamental issue is 
> > that NUMA_NO_NODE is conflated for at least two different purposes, so 
> > trying to sort that out would be a good first step. AFAICS we have genuine 
> > "don't care" cases like alloc_pages_node(), where if the producer says it 
> > doesn't matter then the consumer is free to make its own judgement on what 
> > to do, and fundamentally different "we expect this thing to have an 
> > affinity but it doesn't, so we can't say what's appropriate" cases which 
> > could really do with some separate indicator like "NUMA_INVALID_NODE".
> > 
> > The tricky part is then bestowed on the producers to decide whether they 
> > can downgrade "invalid" to "don't care". You can technically build 'a 
> > device' whose internal logic is distributed between nodes and thus appears 
> > to have equal affinity - interrupt controllers, for example, may have 
> > per-CPU or per-node interfaces that end up looking like that - so although 
> > it's unlikely it's not outright nonsensical. Similarly a 'device' that's 
> > actually emulated behind a firmware call interface may well effectively 
> > have no real affinity.
> 
> We may set the node of a physical device to NUMA_INVALID_NODE when the fw
> does not provide one.
> 
> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is called
> with nid being NUMA_INVALID_NODE?

There is nothing sensible the allocator can do. The only point of
NUMA_INVALID_NODE would be to catch potential misconfigurations and
report them to users so they can complain to their HW/FW suppliers.

Pushing it to other subsystems doesn't make much sense IMHO because there
is nothing really actionable there. Refusing the allocation altogether
sounds like a bad plan to me.
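A rough userspace sketch of that policy (NUMA_INVALID_NODE as -2 is only the hypothetical marker from this thread, not an upstream constant, and sanitize_nid() is a made-up helper name): report the missing affinity, then downgrade to plain "no preference" rather than refuse the allocation.

```c
#include <stdio.h>

/* NUMA_NO_NODE mirrors the kernel's convention; NUMA_INVALID_NODE (-2)
 * is the hypothetical marker discussed in this thread, not an upstream
 * constant. */
#define NUMA_NO_NODE      (-1)
#define NUMA_INVALID_NODE (-2)

/* Made-up helper: complain about the misconfiguration so users can report
 * it, then treat the device as having no preference instead of failing
 * the allocation. */
static int sanitize_nid(int nid)
{
	if (nid == NUMA_INVALID_NODE) {
		fprintf(stderr, "device has no FW-provided node, check your HW/FW\n");
		return NUMA_NO_NODE;
	}
	return nid;
}
```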
 
> If we change the node to a default one (like node 0) when the node of the
> device is NUMA_INVALID_NODE in device_add(), how do we know the default
> one (like node 0) is the right one to choose?

Exactly. We cannot really assume any node in that situation.
 
> From the previous discussion, the points below seem not to have reached
> consensus yet:
> 1) Do we need a state like NUMA_NO_NODE to describe that the device does not
>    have any numa preference?

This is the traditional meaning the MM subsystem uses.

> 2) What do we do if the fw does not provide a node for the device? Should
>    we guess and pick one for it and how do we do the guessing? Or leave it
>    as it is and handle it as NUMA_NO_NODE?

As already pointed out several times, picking any node is rather error
prone. You can never assume topology. We used to assume that there would
always be a node 0, but that is not really the case (see 3e8589963773
("memcg: make it work on sparse non-0-node systems")). Nodes might also
come and go, so this might just lead to all sorts of subtle problems.

On the other hand, using NUMA_NO_NODE as no preference can only lead to
slightly suboptimal performance.
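For comparison, the "no preference" case already has a well-defined fallback: alloc_pages_node() resolves NUMA_NO_NODE to the node the caller is running on. A userspace sketch of just that resolution step (local_node() stands in for the kernel's numa_mem_id() and is pinned to a fixed value here so the behaviour is checkable):

```c
#define NUMA_NO_NODE (-1)

/* Stand-in for the kernel's numa_mem_id(); pinned to node 2 here so the
 * sketch is deterministic. */
static int local_node(void)
{
	return 2;
}

/* Sketch of the fallback in alloc_pages_node(): "no preference" resolves
 * to the local node, which is at worst slightly suboptimal, never wrong. */
static int resolve_nid(int nid)
{
	if (nid == NUMA_NO_NODE)
		return local_node();
	return nid;
}
```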

I do agree with Peter that reporting a lack of affinity might be useful,
but we shouldn't really try to be clever and make up the affinity
willy-nilly.
-- 
Michal Hocko
SUSE Labs
