On Fri, Oct 14, 2016 at 03:18:00PM +0200, Laszlo Ersek wrote:
> On 10/14/16 10:05, Andrew Jones wrote:
> > On Fri, Oct 14, 2016 at 12:50:29AM +0200, Laszlo Ersek wrote:
> >> (4) Analysis (well, a lame attempt at that, because I have zero
> >> familiarity with this code). Let me quote the patch:
> >>
> >>> commit 7ba5f605f3a0d9495aad539eeb8346d726dfc183
> >>> Author: Zhen Lei <[email protected]>
> >>> Date: Thu Sep 1 14:55:04 2016 +0800
> >>>
> >>> arm64/numa: remove the limitation that cpu0 must bind to node0
> >>>
> >>> 1. Remove the old binding code.
> >>> 2. Read the nid of cpu0 from dts.
> >>> 3. Fallback the nid of cpu0 to 0 when numa=off is set in bootargs.
> >>>
> >>> Signed-off-by: Zhen Lei <[email protected]>
> >>> Signed-off-by: Will Deacon <[email protected]>
> >>>
> >>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >>> index c3c08368a685..8b048e6ec34a 100644
> >>> --- a/arch/arm64/kernel/smp.c
> >>> +++ b/arch/arm64/kernel/smp.c
> >>> @@ -624,6 +624,7 @@ static void __init of_parse_and_init_cpus(void)
> >>> }
> >>>
> >>> bootcpu_valid = true;
> >>> + early_map_cpu_to_node(0, of_node_to_nid(dn));
> >>>
> >>> /*
> >>> * cpu_logical_map has already been
> >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>> index 0a15f010b64a..778a985c8a70 100644
> >>> --- a/arch/arm64/mm/numa.c
> >>> +++ b/arch/arm64/mm/numa.c
> >>> @@ -116,16 +116,24 @@ static void __init setup_node_to_cpumask_map(void)
> >>> */
> >>> void numa_store_cpu_info(unsigned int cpu)
> >>> {
> >>> - map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> >>> + map_cpu_to_node(cpu, cpu_to_node_map[cpu]);
> >>> }
> >>>
> >>> void __init early_map_cpu_to_node(unsigned int cpu, int nid)
> >>> {
> >>> /* fallback to node 0 */
> >>> - if (nid < 0 || nid >= MAX_NUMNODES)
> >>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
> >>> nid = 0;
> >
> > The ACPI equivalent code must be missing (at least) the above, because,
> > even with DT, mach-virt won't have cpu to node mappings unless numa
> > is configured on the command line. Can you try adding something like
> >
> > -m 512 -smp 4 \
> > -numa node,mem=256M,cpus=0-1,nodeid=0 \
> > -numa node,mem=256M,cpus=2-3,nodeid=1
> >
> > to your QEMU command line?
>
> I added the following to my domain XML, under <cpu>:
>
> <numa>
> <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
> <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>
> </numa>
>
> (See <http://libvirt.org/formatdomain.html#elementsCPU>.)
>
> With that, each NUMA node gets half of the VCPUs and half of the guest RAM.
>
> (This is in a different guest now, one that has a bleeding edge Fedora kernel
> -- I didn't want to rebuild the upstream kernel yet again, just for this
> test. So, "4.9.0-0.rc0.git7.1.fc26.aarch64" is based on upstream
> v4.8-14109-g1573d2c, and it reproduces the problem too.)
>
> > Then when you boot with ACPI you'll get a
> > SRAT.
>
> Yes, that's confirmed by the guest kernel log (see below).
>
> > If that works, then we're just missing the "no SRAT, nid = 0"
> > code (that should have been added with this patch)
>
> It still crashes with the SRAT, with the following log:
Rats.
> Note the warning message (from wq_numa_init()):
>
> workqueue: NUMA node mapping not available for cpu0, disabling NUMA support
>
> Something looks genuinely broken with the cpu <-> numa-node associations in
> the ACPI case -- it even seems to fail when the SRAT does exist.
>
> So, perhaps, commit 7ba5f605f3a0 may not have introduced the bug, only
> exposed one in the ACPI code?...
The kernel's ACPI NUMA support used to work. It was the test case for
QEMU's SRAT generation code.
Two more experiments I wouldn't mind having you try, if you have time;
1) confirm that this NUMA configured guest and this latest kernel works
with DT. This is just for sanity, but I guess it will.
2) If (1) succeeds, then try the last-known-good kernel with ACPI again,
but this time with the NUMA configured guest.
If (2) fails then we may need to expand the bisection :-(
Thanks,
drew