On Fri, Oct 14, 2016 at 03:18:00PM +0200, Laszlo Ersek wrote:
> On 10/14/16 10:05, Andrew Jones wrote:
> > On Fri, Oct 14, 2016 at 12:50:29AM +0200, Laszlo Ersek wrote:
> >> (4) Analysis (well, a lame attempt at that, because I have zero
> >> familiarity with this code). Let me quote the patch:
> >>
> >>> commit 7ba5f605f3a0d9495aad539eeb8346d726dfc183
> >>> Author: Zhen Lei <thunder.leiz...@huawei.com>
> >>> Date:   Thu Sep 1 14:55:04 2016 +0800
> >>>
> >>>     arm64/numa: remove the limitation that cpu0 must bind to node0
> >>>
> >>>     1. Remove the old binding code.
> >>>     2. Read the nid of cpu0 from dts.
> >>>     3. Fallback the nid of cpu0 to 0 when numa=off is set in bootargs.
> >>>
> >>>     Signed-off-by: Zhen Lei <thunder.leiz...@huawei.com>
> >>>     Signed-off-by: Will Deacon <will.dea...@arm.com>
> >>>
> >>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> >>> index c3c08368a685..8b048e6ec34a 100644
> >>> --- a/arch/arm64/kernel/smp.c
> >>> +++ b/arch/arm64/kernel/smp.c
> >>> @@ -624,6 +624,7 @@ static void __init of_parse_and_init_cpus(void)
> >>>                   }
> >>>
> >>>                   bootcpu_valid = true;
> >>> +                 early_map_cpu_to_node(0, of_node_to_nid(dn));
> >>>
> >>>                   /*
> >>>                    * cpu_logical_map has already been
> >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>> index 0a15f010b64a..778a985c8a70 100644
> >>> --- a/arch/arm64/mm/numa.c
> >>> +++ b/arch/arm64/mm/numa.c
> >>> @@ -116,16 +116,24 @@ static void __init setup_node_to_cpumask_map(void)
> >>>   */
> >>>  void numa_store_cpu_info(unsigned int cpu)
> >>>  {
> >>> - map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
> >>> + map_cpu_to_node(cpu, cpu_to_node_map[cpu]);
> >>>  }
> >>>
> >>>  void __init early_map_cpu_to_node(unsigned int cpu, int nid)
> >>>  {
> >>>   /* fallback to node 0 */
> >>> - if (nid < 0 || nid >= MAX_NUMNODES)
> >>> + if (nid < 0 || nid >= MAX_NUMNODES || numa_off)
> >>>           nid = 0;
> > 
> > The ACPI equivalent code must be missing (at least) the above, because,
> > even with DT, mach-virt won't have cpu to node mappings unless numa
> > is configured on the command line. Can you try adding something like
> > 
> >     -m 512 -smp 4 \
> >     -numa node,mem=256M,cpus=0-1,nodeid=0 \
> >     -numa node,mem=256M,cpus=2-3,nodeid=1
> > 
> > to your QEMU command line?
> 
> I added the following to my domain XML, under <cpu>:
> 
>     <numa>
>       <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
>       <cell id='1' cpus='2-3' memory='2097152' unit='KiB'/>
>     </numa>
> 
> (See <http://libvirt.org/formatdomain.html#elementsCPU>.)
> 
> With that, each NUMA node gets half of the VCPUs and half of the guest RAM.
> 
> (This is in a different guest now, one that has a bleeding edge Fedora kernel 
> -- I didn't want to rebuild the upstream kernel yet again, just for this 
> test. So, "4.9.0-0.rc0.git7.1.fc26.aarch64" is based on upstream 
> v4.8-14109-g1573d2c, and it reproduces the problem too.)
> 
> > Then when you boot with ACPI you'll get a
> > SRAT.
> 
> Yes, that's confirmed by the guest kernel log (see below).
> 
> > If that works, then we're just missing the "no SRAT, nid = 0"
> > code (that should have been added with this patch)
> 
> It still crashes with the SRAT, with the following log:

Rats.

> Note the warning message (from wq_numa_init()):
> 
>   workqueue: NUMA node mapping not available for cpu0, disabling NUMA support
> 
> Something looks genuinely broken with the cpu <-> numa-node associations in 
> the ACPI case -- it even seems to fail when the SRAT does exist.
> 
> So, perhaps, commit 7ba5f605f3a0 may not have introduced the bug, only 
> exposed one in the ACPI code?...

The kernel's ACPI NUMA support used to work. It was the test case for
QEMU's SRAT generation code.

Two more experiments I wouldn't mind having you try, if you have time;
 1) confirm that this NUMA configured guest and this latest kernel works
    with DT. This is just for sanity, but I guess it will.
 2) If (1) succeeds, then try the last-known-good kernel with ACPI again,
    but this time with the NUMA configured guest.

If (2) fails then we may need to expand the bisection :-(

Thanks,
drew

Reply via email to