On Sun, 10 Nov 2013 12:36:29 +0200
"Michael S. Tsirkin" <m...@redhat.com> wrote:

> On Fri, Nov 08, 2013 at 06:33:12PM +0100, Igor Mammedov wrote:
> > On Fri, 8 Nov 2013 12:22:12 +0200
> > Vasilis Liaskovitis <vasilis.liaskovi...@profitbricks.com> wrote:
> > 
> > > Hi,
> > > 
> > > On Thu, Nov 07, 2013 at 03:03:42PM +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 07, 2013 at 01:41:59PM +0100, Vasilis Liaskovitis wrote:
> > > > > This patch adds a _PXM method to ACPI CPU objects for the pc machine. 
> > > > > The _PXM
> > > > > value is derived from the passed in guest info, same way as CPU SRAT 
> > > > > entries.
> > > > > 
> > > > > The motivation for this patch is a CPU hot-unplug/hot-plug bug 
> > > > > observed when
> > > > > using a 3.11 linux guest kernel on a multi-NUMA node qemu/kvm VM. The 
> > > > > linux
> > > > > guest kernel parses the SRAT CPU entries at boot time and stores them 
> > > > > in the
> > > > > array __apicid_to_node. When a CPU is hot-removed, the linux guest 
> > > > > kernel
> > > > > resets the removed CPU's __apicid_to_node entry to NO_NUMA_NODE 
> > > > > (kernel commit
> > > > > c4c60524). When the removed cpu is hot-added again, the linux kernel 
> > > > > looks up
> > > > > the hot-added cpu object's _PXM method instead of somehow 
> > > > > re-discovering the
> > > > > SRAT entry info. With current qemu/seabios, the _PXM method is not 
> > > > > found, and
> > > > > the CPU is thus hot-plugged in the default NUMA node 0. (The problem 
> > > > > does not
> > > > > show up on initial hotplug of a cpu; the PXM method is still not 
> > > > > found in this
> > > > > case, but the kernel still has the correct proximity value from the 
> > > > > CPU's SRAT
> > > > > entry stored in __apicid_to_node)
> > > > > 
> > > > > ACPI spec mentions that the _PXM method is the correct way to 
> > > > > determine
> > > > > proximity information at hot-add time.
> > > > 
> > > > Where does it say this?
> > > > I found this:
> > > > If the Local APIC ID / Local SAPIC ID / Local x2APIC ID of a dynamically
> > > > added processor is not present in the System Resource Affinity Table
> > > > (SRAT), a _PXM object must exist for the processor’s device or one of
> > > > its ancestors in the ACPI Namespace.
> > > > 
> > > > Does this mean that linux is buggy, and should be fixed up to look up
> > > > the apic ID in SRAT?
> > > 
> > > The quote above suggests that if SRAT is absent, _PXM should be present.
> > > Seabios/qemu provide SRAT entries, and  no _PXM. The fact that the kernel
> > > resets the parse SRAT info on hot-remove time looks like a kernel problem.
> > > 
> > > But As Toshi Kani mentioned in the original thread, here is a quote from 
> > > ACPI
> > > 5.0, stating _PXM and only _PXM should be used at hot-plug time:
> > > 
> > > ===
> > > 17.2.1 System Resource Affinity Table Definition
> > > 
> > > This optional System Resource Affinity Table (SRAT) provides the boot
> > > time description of the processor and memory ranges belonging to a
> > > system locality. OSPM will consume the SRAT only at boot time. OSPM
> > > should use _PXM for any devices that are hot-added into the system after
> > > boot up.
> > > ====
> > > 
> > > So in this sense, the kernel is correct (kernel only uses _PXM at 
> > > hot-plug time)
> > > , and qemu/Seabios should have _PXM methods for hot operations.
> > 
> > in terms of RFC SHOULD doesn't mean MUST, and in my interpretation of above 
> > is
> > that SRAT parsed once but it doesn't mean that OS should forget data from 
> > it.
> 
> Well it says "OSPM will consume the SRAT only at boot time".
> How do you interpret that?
Exactly as I've wrote before. In the same chapter spec makes provision that 
hotplug memory entries could be in SRAT as well, which rules out using table
only at boot time.

But looking at APIC entry it doesn't have flag that marks it as hotplugable
opposed to memory entry, so maybe we are doing useless work putting not present
CPUs into it.

> 
> > Anyway we surely can have both in QEMU.
> > > 
> > > > 
> > > > > So far, qemu/seabios do not provide this
> > > > > method for CPUs. So regardless of kernel behaviour, it is a good idea 
> > > > > to add
> > > > > this _PXM method. Since ACPI table generation has recently been moved 
> > > > > from
> > > > > seabios to qemu, we do this in qemu.
> > > > > 
> > > > > Note that the above hot-remove/hot-add scenario has been tested on an 
> > > > > older
> > > > > qemu + non-upstreamed patches for cpu hot-removal support, and not on 
> > > > > qemu
> > > > > master (since cpu-del support is still not on master). The only 
> > > > > testing done
> > > > > with qemu/seabios master and this patch, are successful boots of 
> > > > > multi-node
> > > > > linux and windows8 guests.
> > > > > 
> > > > > For the initial discussion on seabios and linux-acpi lists see
> > > > > http://www.spinics.net/lists/linux-acpi/msg47058.html
> > > > > 
> > > > > Signed-off-by: Vasilis Liaskovitis 
> > > > > <vasilis.liaskovi...@profitbricks.com>
> > > > > Reviewed-by: Thilo Fromm <t...@thilo-fromm.de>
> > > > 
> > > > Even if this is a linux bug, I have no issue with working around
> > > > it in qemu.
> > > > 
> > > > But I think proper testing needs to be done with rebased upport for 
> > > > cpu-del.
> > > 
> > > Ok, I can try to rebase cpu-del support for testing. If there are cpu-del 
> > > bits
> > > already somewhere (Igor?) and not merged yet, please point me to them.
> > > 
> > > > 
> > > > > ---
> > > > >  hw/i386/acpi-build.c  |    2 ++
> > > > >  hw/i386/ssdt-proc.dsl |    2 ++
> > > > >  2 files changed, 4 insertions(+)
> > > > > 
> > > > > diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> > > > > index 6cfa044..9373f5e 100644
> > > > > --- a/hw/i386/acpi-build.c
> > > > > +++ b/hw/i386/acpi-build.c
> > > > > @@ -603,6 +603,7 @@ static inline char acpi_get_hex(uint32_t val)
> > > > >  #define ACPI_PROC_OFFSET_CPUHEX (*ssdt_proc_name - *ssdt_proc_start 
> > > > > + 2)
> > > > >  #define ACPI_PROC_OFFSET_CPUID1 (*ssdt_proc_name - *ssdt_proc_start 
> > > > > + 4)
> > > > >  #define ACPI_PROC_OFFSET_CPUID2 (*ssdt_proc_id - *ssdt_proc_start)
> > > > > +#define ACPI_PROC_OFFSET_CPUPXM (*ssdt_proc_pxm - *ssdt_proc_start)
> > > > >  #define ACPI_PROC_SIZEOF (*ssdt_proc_end - *ssdt_proc_start)
> > > > >  #define ACPI_PROC_AML (ssdp_proc_aml + *ssdt_proc_start)
> > > > >  
> > > > > @@ -724,6 +725,7 @@ build_ssdt(GArray *table_data, GArray *linker,
> > > > >              proc[ACPI_PROC_OFFSET_CPUHEX+1] = acpi_get_hex(i);
> > > > >              proc[ACPI_PROC_OFFSET_CPUID1] = i;
> > > > >              proc[ACPI_PROC_OFFSET_CPUID2] = i;
> > > > > +            proc[ACPI_PROC_OFFSET_CPUPXM] = guest_info->node_cpu[i];
> > > > >          }
> > > > >  
> > > > >          /* build this code:
> > > > > diff --git a/hw/i386/ssdt-proc.dsl b/hw/i386/ssdt-proc.dsl
> > > > > index 8229bfd..7eef8b2 100644
> > > > > --- a/hw/i386/ssdt-proc.dsl
> > > > > +++ b/hw/i386/ssdt-proc.dsl
> > > > > @@ -47,6 +47,8 @@ DefinitionBlock ("ssdt-proc.aml", "SSDT", 0x01, 
> > > > > "BXPC", "BXSSDT", 0x1)
> > > > >   * also updating the C code.
> > > > >   */
> > > > >          Name(_HID, "ACPI0007")
> > > > > +        ACPI_EXTRACT_NAME_BYTE_CONST ssdt_proc_pxm
> > > > > +        Name(_PXM, 0xAA)
> > > > 
> > > > The ACPI spec says this should be a DWORD value:
> > > > 
> > > > Return Value:
> > > > An Integer (DWORD) containing a proximity domain identifier.
> > > 
> > > ok, I 'll change this.
> > > 
> > > thanks,
> > > 
> > > - Vasilis
> > 
> > 
> > -- 
> > Regards,
> >   Igor
> 


-- 
Regards,
  Igor

Reply via email to