Hi Stefano,

> -----Original Message-----
> From: Stefano Stabellini <sstabell...@kernel.org>
> Sent: 27 September 2021 13:05
> To: Stefano Stabellini <sstabell...@kernel.org>
> Cc: Wei Chen <wei.c...@arm.com>; xen-devel@lists.xenproject.org;
> jul...@xen.org; Bertrand Marquis <bertrand.marq...@arm.com>;
> jbeul...@suse.com; andrew.coop...@citrix.com; roger....@citrix.com;
> w...@xen.org
> Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous node
> memory range
>
> On Sun, 26 Sep 2021, Stefano Stabellini wrote:
> > On Sun, 26 Sep 2021, Wei Chen wrote:
> > > > -----Original Message-----
> > > > From: Stefano Stabellini <sstabell...@kernel.org>
> > > > Sent: 25 September 2021 3:53
> > > > To: Wei Chen <wei.c...@arm.com>
> > > > Cc: Stefano Stabellini <sstabell...@kernel.org>;
> > > > xen-devel@lists.xenproject.org; jul...@xen.org; Bertrand Marquis
> > > > <bertrand.marq...@arm.com>; jbeul...@suse.com;
> > > > andrew.coop...@citrix.com; roger....@citrix.com; w...@xen.org
> > > > Subject: RE: [PATCH 08/37] xen/x86: add detection of discontinous
> > > > node memory range
> > > >
> > > > On Fri, 24 Sep 2021, Wei Chen wrote:
> > > > > > -----Original Message-----
> > > > > > From: Stefano Stabellini <sstabell...@kernel.org>
> > > > > > Sent: 24 September 2021 8:26
> > > > > > To: Wei Chen <wei.c...@arm.com>
> > > > > > Cc: xen-devel@lists.xenproject.org; sstabell...@kernel.org;
> > > > > > jul...@xen.org; Bertrand Marquis <bertrand.marq...@arm.com>;
> > > > > > jbeul...@suse.com; andrew.coop...@citrix.com;
> > > > > > roger....@citrix.com; w...@xen.org
> > > > > > Subject: Re: [PATCH 08/37] xen/x86: add detection of
> > > > > > discontinous node memory range
> > > > > >
> > > > > > CC'ing x86 maintainers
> > > > > >
> > > > > > On Thu, 23 Sep 2021, Wei Chen wrote:
> > > > > > > One NUMA node may contain several memory blocks. In the
> > > > > > > current Xen code, Xen maintains a node memory range for each
> > > > > > > node to cover all its memory blocks. But here comes the
> > > > > > > problem: if, in the gap between two of a node's memory blocks,
> > > > > > > there are memory blocks that don't belong to this node (remote
> > > > > > > memory blocks), this node's memory range will be expanded to
> > > > > > > cover these remote memory blocks.
> > > > > > >
> > > > > > > One node's memory range containing other nodes' memory is
> > > > > > > obviously not reasonable. This means the current NUMA code can
> > > > > > > only support nodes with contiguous memory blocks. However, on
> > > > > > > a physical machine, the addresses of multiple nodes can be
> > > > > > > interleaved.
> > > > > > >
> > > > > > > So in this patch, we add code to detect discontinuous memory
> > > > > > > blocks for one node. NUMA initialization will fail and error
> > > > > > > messages will be printed when Xen detects such a hardware
> > > > > > > configuration.
> > > > > >
> > > > > > At least on ARM, it is not just memory that can be interleaved,
> > > > > > but also MMIO regions. For instance:
> > > > > >
> > > > > > node0 bank0 0-0x1000000
> > > > > > MMIO        0x1000000-0x1002000
> > > > > > Hole        0x1002000-0x2000000
> > > > > > node0 bank1 0x2000000-0x3000000
> > > > > >
> > > > > > So I am not familiar with the SRAT format, but I think on ARM
> > > > > > the check would look different: we would just look for multiple
> > > > > > memory ranges under a device_type = "memory" node of a NUMA node
> > > > > > in device tree.
> > > > > >
> > > > >
> > > > > Should I include/refine the above message in the commit log?
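To make the quoted commit message more concrete, here is a small standalone
sketch of the idea (illustrative only: add_memblk, node_span and the
fail-on-conflict policy are assumptions for this example, not Xen's actual
code). Each node is tracked as a single [start, end) span; growing the span
harmlessly absorbs holes and MMIO between the node's own banks, but if a bank
of a different node falls inside the grown span, the nodes interleave and
initialization is rejected, which is the situation the patch detects.

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 8
    #define MAX_BLKS  32

    struct memblk { uint64_t start, end; int nid; };
    struct span   { uint64_t start, end; bool used; };

    static struct memblk blks[MAX_BLKS];
    static unsigned int nr_blks;
    static struct span node_span[MAX_NODES];

    /* Record one RAM bank and grow its node's span to cover it. */
    static bool add_memblk(int nid, uint64_t start, uint64_t end)
    {
        struct span s = node_span[nid];
        uint64_t ns = (s.used && s.start < start) ? s.start : start;
        uint64_t ne = (s.used && s.end > end) ? s.end : end;

        /*
         * Holes between this node's own banks are simply absorbed.  But
         * if another node's bank now falls inside the grown span, the
         * two nodes interleave: report it and fail.
         */
        for ( unsigned int i = 0; i < nr_blks; i++ )
            if ( blks[i].nid != nid && blks[i].start < ne && blks[i].end > ns )
            {
                printf("NODE %d (%#" PRIx64 "-%#" PRIx64 ") intertwines with "
                       "NODE %d (%#" PRIx64 "-%#" PRIx64 ")\n",
                       nid, ns, ne, blks[i].nid, blks[i].start, blks[i].end);
                return false;
            }

        blks[nr_blks++] = (struct memblk){ start, end, nid };
        node_span[nid] = (struct span){ ns, ne, true };
        return true;
    }

    int main(void)
    {
        /* node0 bank, node1 bank, then a second node0 bank above node1. */
        bool ok = add_memblk(0, 0x00000000, 0x10000000) &&
                  add_memblk(1, 0x20000000, 0x30000000) &&
                  add_memblk(0, 0x50000000, 0x60000000);

        printf("NUMA init %s\n", ok ? "can continue" : "failed");
        return 0;
    }

With the three banks in main() (node1 sitting between two node0 banks), the
third add_memblk() call reports the conflict and the sketch fails
initialization.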
> > > > Let me ask you a question first.
> > > >
> > > > With the NUMA implementation of this patch series, can we deal with
> > > > cases where each node has multiple memory banks, not interleaved?
> > >
> > > Yes.
> > >
> > > > As an example:
> > > >
> > > > node0: 0x0        - 0x10000000
> > > > MMIO : 0x10000000 - 0x20000000
> > > > node0: 0x20000000 - 0x30000000
> > > > MMIO : 0x30000000 - 0x50000000
> > > > node1: 0x50000000 - 0x60000000
> > > > MMIO : 0x60000000 - 0x80000000
> > > > node2: 0x80000000 - 0x90000000
> > > >
> > > > I assume we can deal with this case simply by setting node0's memory
> > > > to 0x0-0x30000000, even if there is actually something else, a
> > > > device, that doesn't belong to node0 in between the two node0 banks?
> > >
> > > While this configuration is rare in SoC design, it is not impossible.
> >
> > Definitely, I have seen it before.
> >
> > > > Is it only other nodes' memory being interleaved that causes issues?
> > > > In other words, is only the following a problematic scenario?
> > > >
> > > > node0: 0x0        - 0x10000000
> > > > MMIO : 0x10000000 - 0x20000000
> > > > node1: 0x20000000 - 0x30000000
> > > > MMIO : 0x30000000 - 0x50000000
> > > > node0: 0x50000000 - 0x60000000
> > > >
> > > > Because node1 is in between the two ranges of node0?
> > > >
> > >
> > > Only device_type="memory" ranges can be added to the allocator.
> > > For MMIO there are two cases:
> > > 1. The MMIO region doesn't have a NUMA id property.
> > > 2. The MMIO region has a NUMA id property, just like some PCIe
> > >    controllers. But we don't need to handle these kinds of MMIO
> > >    devices in memory block parsing, because we don't need to allocate
> > >    memory from these MMIO ranges. And for accessing them, we would
> > >    need a NUMA-aware PCIe controller driver or generic NUMA-aware
> > >    MMIO access APIs.
> >
> > Yes, I am not too worried about devices with a NUMA id property because
> > they are less common and this series doesn't handle them at all, right?
> > I imagine they would be treated like any other device without NUMA
> > awareness.
> >
> > I am thinking about the case where the memory of each NUMA node is made
> > of multiple banks. I understand that this patch adds an explicit check
> > for cases where these banks are interleaving; however, there are many
> > other cases where NUMA memory nodes are *not* interleaving but are
> > still made of multiple discontinuous banks, like in the two examples
> > above.
> >
> > My question is whether this patch series in its current form can handle
> > the two cases above correctly. If so, I am wondering how it works given
> > that we only have a single "start" and "size" parameter per node.
> >
> > On the other hand, if this series cannot handle the two cases above, my
> > question is whether it would fail explicitly or not. The new check
> > is_node_memory_continuous doesn't seem to be able to catch them.
>
> Looking at numa_update_node_memblks, it is clear that the code is meant
> to increase the range of each NUMA node to cover even MMIO regions in
> between memory banks. Also see the comment at the top of the file:
>
>  * Assumes all memory regions belonging to a single proximity domain
>  * are in one chunk. Holes between them will be included in the node.
>
> So if there are multiple banks for each node, start and end are
> stretched to cover the holes between them, and it works as long as
> memory banks of different NUMA nodes don't interleave.
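As an illustration of how a single start/end pair per node can still
describe the first (non-problematic) layout quoted above, the standalone
sketch below (illustrative names and structure, not numa_update_node_memblks
itself) stretches each node's span over all of its banks, absorbing the MMIO
holes, and then checks that the stretched spans of different nodes stay
disjoint.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    struct bank { int nid; uint64_t start, end; };

    int main(void)
    {
        /* First layout from the quoted mail: two node0 banks, MMIO holes. */
        const struct bank banks[] = {
            { 0, 0x00000000, 0x10000000 },
            { 0, 0x20000000, 0x30000000 },  /* hole 0x10000000-0x20000000 absorbed */
            { 1, 0x50000000, 0x60000000 },
            { 2, 0x80000000, 0x90000000 },
        };
        uint64_t start[3] = { UINT64_MAX, UINT64_MAX, UINT64_MAX };
        uint64_t end[3] = { 0, 0, 0 };

        /* Stretch each node's span; holes between its banks are included. */
        for ( unsigned int i = 0; i < sizeof(banks) / sizeof(banks[0]); i++ )
        {
            int n = banks[i].nid;

            if ( banks[i].start < start[n] ) start[n] = banks[i].start;
            if ( banks[i].end > end[n] )     end[n] = banks[i].end;
        }

        for ( int n = 0; n < 3; n++ )
            printf("node%d span: %#" PRIx64 "-%#" PRIx64 "\n",
                   n, start[n], end[n]);

        /* The spans are usable as long as no two of them overlap. */
        for ( int a = 0; a < 3; a++ )
            for ( int b = a + 1; b < 3; b++ )
                if ( start[a] < end[b] && start[b] < end[a] )
                    printf("node%d and node%d interleave!\n", a, b);

        return 0;
    }

This prints node0 as 0x0-0x30000000 (covering the hole), node1 and node2
unchanged, and no interleave message. Feeding it the second layout quoted
above (node1 between the two node0 banks) would stretch node0 to
0x0-0x60000000 and flag node0 and node1 as interleaving, which is the case
the new check has to reject.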
>
> I would appreciate it if you could add an in-code comment to explain
> this on top of numa_update_node_memblk.
Yes, I will do it.

> Have you had a chance to test this? If not, it would be fantastic if you
> could give it a quick test to make sure it works as intended: for
> instance, by creating multiple memory banks for each NUMA node by
> splitting a real bank into two smaller banks with a hole in between in
> the device tree, just for the sake of testing.

Yes, I have created some fake NUMA nodes in the FVP device tree to test
it. The intertwining of nodes' addresses can be detected:

(XEN) SRAT: Node 0 0000000080000000-00000000ff000000
(XEN) SRAT: Node 1 0000000880000000-00000008c0000000
(XEN) NODE 0: (0000000080000000-00000008d0000000) intertwine with NODE 1 (0000000880000000-00000008c0000000)