Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system

Roger Pau Monné Fri, 30 Sep 2022 01:26:29 -0700

On Tue, Sep 27, 2022 at 06:20:35PM +0200, Jan Beulich wrote:
> SRAT may describe individual nodes using multiple ranges. When they're
> adjacent (with or without a gap in between), only the start of the first
> such range actually needs accounting for. Furthermore the very first
> range doesn't need considering of its start address at all, as it's fine
> to associate all lower addresses (with no memory) with that same node.
> For this to work, the array of ranges needs to be sorted by address -
> adjust logic accordingly in acpi_numa_memory_affinity_init().


Speaking for myself (due to the lack of knowledge of the NUMA stuff) I
would benefit from a bit of context on why and how memnode_shift is
used.

> Signed-off-by: Jan Beulich <jbeul...@suse.com>
> ---
> On my Dinar and Rome systems this changes memnodemapsize to a single
> page. Originally they used 9 / 130 pages (with shifts going from 8 / 6
> to 15 / 16) respectively, resulting from lowmem gaps [A0,FF] / [A0,BF].
> 
> This goes on top of "x86/NUMA: correct memnode_shift calculation for
> single node system".
> 
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -127,7 +127,8 @@ static int __init extract_lsb_from_nodes
>          epdx = paddr_to_pdx(nodes[i].end - 1) + 1;
>          if ( spdx >= epdx )
>              continue;
> -        bitfield |= spdx;
> +        if ( i && (!nodeids || nodeids[i - 1] != nodeids[i]) )
> +            bitfield |= spdx;
>          if ( !i || !nodeids || nodeids[i - 1] != nodeids[i] )
>              nodes_used++;
>          if ( epdx > memtop )
> --- a/xen/arch/x86/srat.c
> +++ b/xen/arch/x86/srat.c
> @@ -312,6 +312,7 @@ acpi_numa_memory_affinity_init(const str
>       unsigned pxm;
>       nodeid_t node;
>       unsigned int i;
> +     bool next = false;
>  
>       if (srat_disabled())
>               return;
> @@ -413,14 +414,37 @@ acpi_numa_memory_affinity_init(const str
>              node, pxm, start, end - 1,
>              ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE ? " (hotplug)" : "");
>  
> -     node_memblk_range[num_node_memblks].start = start;
> -     node_memblk_range[num_node_memblks].end = end;
> -     memblk_nodeid[num_node_memblks] = node;
> +     /* Keep node_memblk_range[] sorted by address. */
> +     for (i = 0; i < num_node_memblks; ++i)
> +             if (node_memblk_range[i].start > start ||
> +                 (node_memblk_range[i].start == start &&

Maybe I'm confused, but won't .start == start means we have
overlapping ranges?

> +                  node_memblk_range[i].end > end))
> +                     break;
> +
> +     memmove(&node_memblk_range[i + 1], &node_memblk_range[i],
> +             (num_node_memblks - i) * sizeof(*node_memblk_range));
> +     node_memblk_range[i].start = start;
> +     node_memblk_range[i].end = end;
> +
> +     memmove(&memblk_nodeid[i + 1], &memblk_nodeid[i],
> +             (num_node_memblks - i) * sizeof(*memblk_nodeid));
> +     memblk_nodeid[i] = node;
> +
>       if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) {
> -             __set_bit(num_node_memblks, memblk_hotplug);
> +             next = true;
>               if (end > mem_hotplug)
>                       mem_hotplug = end;
>       }
> +     for (; i <= num_node_memblks; ++i) {
> +             bool prev = next;
> +
> +             next = test_bit(i, memblk_hotplug);
> +             if (prev)
> +                     __set_bit(i, memblk_hotplug);
> +             else
> +                     __clear_bit(i, memblk_hotplug);

Nit: I think you could avoid doing the clear for the last bit, ie:
else if (i != num_node_memblks) __clear_bit(...);

But I'm not sure it's worth adding the logic, just makes it more
complicated to follow.

Thanks, Roger.

Re: [PATCH] x86/NUMA: improve memnode_shift calculation for multi node system

Reply via email to