Re: [Qemu-devel] [PATCH] xen-hvm: stop faking I/O to access PCI config space

2018-05-03 Thread Alexey G
On Thu, 3 May 2018 12:49:59 +
Paul Durrant  wrote:
>> >This patch removes the current hackery where IOREQ_TYPE_PCI_CONFIG
>> >requests are handled by faking PIO to 0xcf8 and 0xcfc and replaces
>> >it with direct calls to pci_host_config_read/write_common().
>> >Doing so necessitates mapping BDFs to PCIDevices but maintaining a
>> >simple QLIST in xen_device_realize/unrealize() will suffice.
>> >
>> >NOTE: whilst config space accesses are currently limited to
>> >  PCI_CONFIG_SPACE_SIZE, this patch paves the way to increasing the
>> >  limit to PCIE_CONFIG_SPACE_SIZE when Xen gains the ability to
>> >  emulate MCFG table accesses.
>> >
>> >Signed-off-by: Paul Durrant   
>> 
>> Minor problem here is a possible incompatibility with PCI-PCI bridges
>> which we'll need to use eventually for Q35 PT -- IIRC changing
>> secondary bus numbers does not cause an unrealize/realize pair to be
>> called for affected PCI devices. This means that dev_list may
>> contain stale BDF information if any related bus number changes.  
>
>It also means that emulation won't work in general since, unless the
>devices are re-registered with Xen under their new BDFs things are not
>going to get steered correctly. This patch will not change that
>behaviour so no regression is introduced by removing the I/O fakery.

Completely agree, this was what I meant by "PCI-PCI bridges must be
handled specifically".

>> 
>> Anyway, PCI-PCI bridges and their secondary bus numbers must be
>> handled specifically, so it can be ignored for now.
>>   
>
>As we've discussed before, Xen needs to own the topology so it knows
>what's going on.
>
>> I'll try to reuse this patch for my Xen patch for supporting MMCONFIG
>> ioreq forwarding to multiple ioreq servers.
>>   
>
>It should be ok (with the increased limit of course).

I've adjusted the limits for PCI conf space size in one of the Q35 RFC
patches (which are still waiting for review):

xen/pt: add support for PCIe Extended Capabilities and larger config space
http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg03594.html

Also, hw/xen/xen-pt*.c needs one patch from upstream QEMU which is
currently still missing in the qemu-xen repo (the one which defaults
is_express for 'xen-pci-passthrough' devices). Otherwise the new limits
won't work due to is_express=0.

>  Paul
>
>> >--
>> >Cc: Stefano Stabellini 
>> >Cc: Anthony Perard 
>> >Cc: "Michael S. Tsirkin" 
>> >Cc: Marcel Apfelbaum 
>> >Cc: Paolo Bonzini 
>> >Cc: Richard Henderson 
>> >Cc: Eduardo Habkost 
>> >---
>> > hw/i386/xen/trace-events |   2 +
>> > hw/i386/xen/xen-hvm.c    | 101 +--
>> > 2 files changed, 83 insertions(+), 20 deletions(-)
>> >
>> >diff --git a/hw/i386/xen/trace-events b/hw/i386/xen/trace-events
>> >index 8dab7bc..f576f1b 100644
>> >--- a/hw/i386/xen/trace-events
>> >+++ b/hw/i386/xen/trace-events
>> >@@ -15,6 +15,8 @@ cpu_ioreq_pio(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, uint64
>> > cpu_ioreq_pio_read_reg(void *req, uint64_t data, uint64_t addr, uint32_t size) "I/O=%p pio read reg data=0x%"PRIx64" port=0x%"PRIx64" size=%d"
>> > cpu_ioreq_pio_write_reg(void *req, uint64_t data, uint64_t addr, uint32_t size) "I/O=%p pio write reg data=0x%"PRIx64" port=0x%"PRIx64" size=%d"
>> > cpu_ioreq_move(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p copy dir=%d df=%d ptr=%d port=0x%"PRIx64" data=0x%"PRIx64" count=%d size=%d"
>> >+cpu_ioreq_config_read(void *req, uint32_t sbdf, uint32_t reg, uint32_t size, uint32_t data) "I/O=%p sbdf=0x%x reg=%u size=%u data=0x%x"
>> >+cpu_ioreq_config_write(void *req, uint32_t sbdf, uint32_t reg, uint32_t size, uint32_t data) "I/O=%p sbdf=0x%x reg=%u size=%u data=0x%x"
>> >
>> > # xen-mapcache.c
>> > xen_map_cache(uint64_t phys_addr) "want 0x%"PRIx64
>> >diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
>> >index caa563b..c139d29 100644
>> >--- a/hw/i386/xen/xen-hvm.c
>> >+++ b/hw/i386/xen/xen-hvm.c
>> >@@ -12,6 +12,7 @@
>> >
>> > #include "cpu.h"
>> > #include "hw/pci/pci.h"
>> >+#include "hw/pci/pci_host.h"
>> > #include "hw/i386/pc.h"
>> > #include "hw/i386/apic-msidef.h"
>> > #include "hw/xen/xen_common.h"
>> >@@ -86,6 +87,12 @@ typedef struct XenPhysmap {
>> > QLIST_ENTRY(XenPhysmap) list;
>> > } XenPhysmap;
>> >
>> >+typedef struct XenPciDevice {
>> >+PCIDevice *pci_dev;
>> >+uint32_t sbdf;
>> >+QLIST_ENTRY(XenPciDevice) entry;
>> >+} XenPciDevice;
>> >+
>> > typedef struct XenIOState {
>> > ioservid_t ioservid;
>> > shared_iopage_t *shared_page;
>> >@@ -105,6 +112,7 @@ typedef struct XenIOState {
>> > struct xs_handle *xenstore;
>> > MemoryListener memory_listener;
>> > 

Re: [Qemu-devel] [PATCH] xen-hvm: stop faking I/O to access PCI config space

2018-05-03 Thread Alexey G
On Thu, 3 May 2018 12:18:40 +0100
Paul Durrant  wrote:

>This patch removes the current hackery where IOREQ_TYPE_PCI_CONFIG
>requests are handled by faking PIO to 0xcf8 and 0xcfc and replaces it
>with direct calls to pci_host_config_read/write_common().
>Doing so necessitates mapping BDFs to PCIDevices but maintaining a
>simple QLIST in xen_device_realize/unrealize() will suffice.
>
>NOTE: whilst config space accesses are currently limited to
>  PCI_CONFIG_SPACE_SIZE, this patch paves the way to increasing the
>  limit to PCIE_CONFIG_SPACE_SIZE when Xen gains the ability to
>  emulate MCFG table accesses.
>
>Signed-off-by: Paul Durrant 

Minor problem here is a possible incompatibility with PCI-PCI bridges
which we'll need to use eventually for Q35 PT -- IIRC changing secondary
bus numbers does not cause an unrealize/realize pair to be called for
affected PCI devices. This means that dev_list may contain stale BDF
information if any related bus number changes.

Anyway, PCI-PCI bridges and their secondary bus numbers must be handled
specifically, so it can be ignored for now.
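To illustrate the staleness problem, here is a minimal, hypothetical model (DevModel, realize_snapshot and live_bdf are made-up names; only PCI_BUILD_BDF mirrors the macro the patch uses): the sbdf recorded at realize time diverges from the device's actual BDF once a bridge's secondary bus number is reprogrammed, because no unrealize/realize pair runs.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors QEMU's PCI_BUILD_BDF(bus, devfn): bus in bits 15:8, devfn in 7:0 */
#define PCI_BUILD_BDF(bus, devfn)  (((bus) << 8) | (devfn))

/* Hypothetical stand-in for a device behind a PCI-PCI bridge */
typedef struct DevModel {
    uint8_t bus_num;       /* secondary bus number assigned by the bridge */
    uint8_t devfn;
    uint32_t cached_sbdf;  /* snapshot taken once, at realize time */
} DevModel;

/* What xen_device_realize() effectively records in dev_list */
static void realize_snapshot(DevModel *d)
{
    d->cached_sbdf = PCI_BUILD_BDF(d->bus_num, d->devfn);
}

/* The BDF the device actually responds to right now */
static uint32_t live_bdf(const DevModel *d)
{
    return PCI_BUILD_BDF(d->bus_num, d->devfn);
}
```

A guest write to the bridge's secondary bus number register changes bus_num, so the cached value no longer matches; config cycles looked up by sbdf would then miss the device.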

I'll try to reuse this patch for my Xen patch for supporting MMCONFIG
ioreq forwarding to multiple ioreq servers.

>--
>Cc: Stefano Stabellini 
>Cc: Anthony Perard 
>Cc: "Michael S. Tsirkin" 
>Cc: Marcel Apfelbaum 
>Cc: Paolo Bonzini 
>Cc: Richard Henderson 
>Cc: Eduardo Habkost 
>---
> hw/i386/xen/trace-events |   2 +
> hw/i386/xen/xen-hvm.c    | 101 +--
> 2 files changed, 83 insertions(+), 20 deletions(-)
>
>diff --git a/hw/i386/xen/trace-events b/hw/i386/xen/trace-events
>index 8dab7bc..f576f1b 100644
>--- a/hw/i386/xen/trace-events
>+++ b/hw/i386/xen/trace-events
>@@ -15,6 +15,8 @@ cpu_ioreq_pio(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, uint64
> cpu_ioreq_pio_read_reg(void *req, uint64_t data, uint64_t addr, uint32_t size) "I/O=%p pio read reg data=0x%"PRIx64" port=0x%"PRIx64" size=%d"
> cpu_ioreq_pio_write_reg(void *req, uint64_t data, uint64_t addr, uint32_t size) "I/O=%p pio write reg data=0x%"PRIx64" port=0x%"PRIx64" size=%d"
> cpu_ioreq_move(void *req, uint32_t dir, uint32_t df, uint32_t data_is_ptr, uint64_t addr, uint64_t data, uint32_t count, uint32_t size) "I/O=%p copy dir=%d df=%d ptr=%d port=0x%"PRIx64" data=0x%"PRIx64" count=%d size=%d"
>+cpu_ioreq_config_read(void *req, uint32_t sbdf, uint32_t reg, uint32_t size, uint32_t data) "I/O=%p sbdf=0x%x reg=%u size=%u data=0x%x"
>+cpu_ioreq_config_write(void *req, uint32_t sbdf, uint32_t reg, uint32_t size, uint32_t data) "I/O=%p sbdf=0x%x reg=%u size=%u data=0x%x"
> 
> # xen-mapcache.c
> xen_map_cache(uint64_t phys_addr) "want 0x%"PRIx64
>diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
>index caa563b..c139d29 100644
>--- a/hw/i386/xen/xen-hvm.c
>+++ b/hw/i386/xen/xen-hvm.c
>@@ -12,6 +12,7 @@
> 
> #include "cpu.h"
> #include "hw/pci/pci.h"
>+#include "hw/pci/pci_host.h"
> #include "hw/i386/pc.h"
> #include "hw/i386/apic-msidef.h"
> #include "hw/xen/xen_common.h"
>@@ -86,6 +87,12 @@ typedef struct XenPhysmap {
> QLIST_ENTRY(XenPhysmap) list;
> } XenPhysmap;
> 
>+typedef struct XenPciDevice {
>+PCIDevice *pci_dev;
>+uint32_t sbdf;
>+QLIST_ENTRY(XenPciDevice) entry;
>+} XenPciDevice;
>+
> typedef struct XenIOState {
> ioservid_t ioservid;
> shared_iopage_t *shared_page;
>@@ -105,6 +112,7 @@ typedef struct XenIOState {
> struct xs_handle *xenstore;
> MemoryListener memory_listener;
> MemoryListener io_listener;
>+QLIST_HEAD(, XenPciDevice) dev_list;
> DeviceListener device_listener;
> QLIST_HEAD(, XenPhysmap) physmap;
> hwaddr free_phys_offset;
>@@ -569,6 +577,12 @@ static void xen_device_realize(DeviceListener
>*listener,
> 
> if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> PCIDevice *pci_dev = PCI_DEVICE(dev);
>+XenPciDevice *xendev = g_new(XenPciDevice, 1);
>+
>+xendev->pci_dev = pci_dev;
>+xendev->sbdf = PCI_BUILD_BDF(pci_dev_bus_num(pci_dev),
>+ pci_dev->devfn);
>+QLIST_INSERT_HEAD(&state->dev_list, xendev, entry);
> 
> xen_map_pcidev(xen_domid, state->ioservid, pci_dev);
> }
>@@ -581,8 +595,17 @@ static void xen_device_unrealize(DeviceListener
>*listener,
> 
> if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
> PCIDevice *pci_dev = PCI_DEVICE(dev);
>+XenPciDevice *xendev, *next;
> 
> xen_unmap_pcidev(xen_domid, state->ioservid, pci_dev);
>+
>+QLIST_FOREACH_SAFE(xendev, &state->dev_list, entry, next) {
>+if (xendev->pci_dev == pci_dev) {
>+QLIST_REMOVE(xendev, entry);
>+g_free(xendev);
>+break;
>+}
>+}
> }
> }
> 
>@@ -903,6 

Re: [Qemu-devel] [RFC PATCH 13/30] pc/xen: Xen Q35 support: provide IRQ handling for PCI devices

2018-03-14 Thread Alexey G
On Wed, 14 Mar 2018 11:48:46 +0100
Paolo Bonzini  wrote:

>On 12/03/2018 19:33, Alexey Gerasimenko wrote:
>> xen_pci_slot_get_pirq --> xen_cmn_pci_slot_get_pirq
>> xen_piix3_set_irq --> xen_cmn_set_irq  
>
>Don't abbrvt names, xen_hvm_ is a better prefix.

Agree, will rename xen_cmn_* to xen_hvm_*

>> +fprintf(stderr, "WARNING: guest domain attempted to use PIRQ%c "
>> +"routing which is not supported for Xen/Q35 currently\n",
>> +(char)(address - ICH9_LPC_PIRQE_ROUT + 'E'));  
>
>Use error_report instead.

OK, will change to error_report().
There are multiple fprintf(stderr, ...) calls still left in the file
though; an additional cleanup patch replacing all such instances with
error_report() calls might be needed later.



Re: [Qemu-devel] [RFC PATCH 16/30] q35/xen: Add Xen platform device support for Q35

2018-03-13 Thread Alexey G
On Mon, 12 Mar 2018 18:44:02 -0300
Eduardo Habkost <ehabk...@redhat.com> wrote:

>On Tue, Mar 13, 2018 at 06:56:37AM +1000, Alexey G wrote:
>> On Mon, 12 Mar 2018 16:44:06 -0300
>> Eduardo Habkost <ehabk...@redhat.com> wrote:
>>   
>> >On Tue, Mar 13, 2018 at 04:34:01AM +1000, Alexey Gerasimenko
>> >wrote:  
>> >> Current Xen/QEMU method to control Xen Platform device on i440 is
>> >> a bit odd -- enabling/disabling Xen platform device actually
>> >> modifies the QEMU emulated machine type, namely xenfv <--> pc.
>> >> 
>> >> In order to avoid multiplying machine types, use a new way to
>> >> control Xen Platform device for QEMU -- "xen-platform-dev" machine
>> >> property (bool). To maintain backward compatibility with existing
>> >> Xen/QEMU setups, this is only applicable to q35 machine currently.
>> >> i440 emulation still uses the old method (i.e. xenfv/pc machine
>> >> selection) to control Xen Platform device, this may be changed
>> >> later to xen-platform-dev property as well.
>> >> 
>> >> This way we can use a single machine type (q35) and change just
>> >> xen-platform-dev value to on/off to control Xen platform device.
>> >> 
>> >> Signed-off-by: Alexey Gerasimenko <x19...@gmail.com>
>> >> ---
>> >[...]  
>> >> diff --git a/qemu-options.hx b/qemu-options.hx
>> >> index 6585058c6c..cee0b92028 100644
>> >> --- a/qemu-options.hx
>> >> +++ b/qemu-options.hx
>> >> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>> >>  "dump-guest-core=on|off include guest memory
>> >> in a core dump (default=on)\n" "mem-merge=on|off
>> >> controls memory merge support (default: on)\n" "
>> >> igd-passthru=on|off controls IGD GFX passthrough support
>> >> (default=off)\n"
>> >> +"xen-platform-dev=on|off controls Xen
>> >> Platform device (default=off)\n" "
>> >> aes-key-wrap=on|off controls support for AES key wrapping
>> >> (default=on)\n" "dea-key-wrap=on|off controls
>> >> support for DEA key wrapping (default=on)\n" "
>> >> suppress-vmdesc=on|off disables self-describing migration
>> >> (default=off)\n"
>> >
>> >What are the obstacles preventing "-device xen-platform" from
>> >working?  It would be better than adding a new boolean option to
>> >-machine.  
>> 
>> I guess the initial assumption was that changing the
>> xen_platform_device value in Xen's options may cause some additional
>> changes in platform configuration besides adding (or not) the Xen
>> Platform device, hence a completely different machine type was chosen
>> (xenfv).
>> 
>> At the moment pc,accel=xen/xenfv selection mostly governs
>> only the Xen Platform device presence. Also setting max_cpus to
>> HVM_MAX_VCPUS depends on it, but this isn't applicable to a
>> 'pc,accel=xen' machine for some reason.
>> 
>> If applying HVM_MAX_VCPUS to max_cpus is really necessary I think
>> it's better to set it unconditionally for all 'accel=xen' HVM machine
>> types inside xen_enabled() block. Right now it's missing for
>> pc,accel=xen and q35,accel=xen.  
>
>If you are talking about MachineClass::max_cpus, note that it is
>returned by query-machines, so it's supposed to be a static
>value.  Changing it a runtime would mean the query-machines value
>is incorrect.
>
>Is HVM_MAX_CPUS higher or lower than 255?  If it's higher, does
>it mean the current value on pc and q35 isn't accurate?

HVM_MAX_VCPUS is 128 currently, but there is an ongoing work from Intel
to support more vcpus and >8bit APIC IDs, so this number will likely
change soon.

According to the code, using HVM_MAX_VCPUS in QEMU is a bit excessive as
the maximum number of vcpus is controlled on the Xen side anyway.
Currently HVM_MAX_VCPUS is used in a one-time check of the maxcpus value
(which itself comes from libxl).
I think for future compatibility it's better to set mc->max_cpus to
HVM_MAX_VCPUS for all accel=xen HVM-supported machine types, not just
xenfv.

The '-device' approach you suggested seems more preferable than a
machine bool property, I'll try switching to it.

>Is HVM_MAX_CPUS something that needs to be enabled because of
>accel=xen or because or the xen-platform device?
>
>If it's just because of accel=xen, we could introduce a
>AccelClass::max_cpus() method (we also have KVM-imposed CPU count
>limits, currently implemented inside kvm_init()).



Re: [Qemu-devel] [RFC PATCH 00/30] Xen Q35 Bringup patches + support for PCIe Extended Capabilities for passed through devices

2018-03-13 Thread Alexey G
On Tue, 13 Mar 2018 09:21:54 +
Daniel P. Berrangé  wrote:

>The subject line says to expect 30 patches, but you've only sent 18 to
>the list here. I eventually figured out that the first 12 patches were
>in Xen code and so not sent to qemu-devel.
>
>For future if you have changes that affect multiple completely separate
>projects, send them as separate series. ie just send PATCH 00/18 to
>QEMU devel so it doesn't look like a bunch of patches have gone
>missing.

OK, will do for the next versions.

>> A new domain config option was implemented: device_model_machine.
>> It's a string which has following possible values:
>> - "i440" -- i440 emulation (default)
>> - "q35"  -- emulate a Q35 machine. By default, the storage interface
>> is AHCI.  
>
>Presumably this is mapping to the QEMU -machine arg, so it feels
>desirable to keep the same naming scheme. ie allow any of the
>versioned machine names that QEMU uses. eg any of "pc-q35-2.x"
>versioned types, or 'q35' as an alias for latest, and use
>"pc-i440fx-2.x" versioned types of 'pc' as an alias for latest, rather
>than 'i440' which is needlessly divering from the QEMU machine type.

Yes, it is translated into the '-machine' argument.

A direct mapping between the Xen device_model_machine option and the QEMU
'-machine' argument won't be accepted by Xen maintainers, I guess.

The main problem with this approach is the requirement to have a match
between Xen/libxl and QEMU versions. If, for example,
device_model_machine specifies something like "pc-q35-2.11" and we later
downgrade QEMU to some older version, we'll likely have a problem
without changing anything in the domain config. So I guess the "use the
latest available" approach for machine selection (pc, q35, etc.) is the
only possible option. Perhaps having a way to specify the exact QEMU
machine name and version in a separate domain config parameter (for
advanced use) might be feasible.

Also, the parameter names do not speak for themselves, I'm afraid. This
way we'll have, for example, device_model_machine="pc" vs
device_model_machine="q35"... a bit unclear I think. This may be obvious
to a QEMU user, but many Xen users aren't used to QEMU machine types,
and some might wonder why "q35" is not a "pc" and why "pc" is precisely
an i440 system.

Another obstacle here is the xen_platform_device option, which
indirectly selects the QEMU machine type for i440 at the moment
(pc/xenfv), but this may be addressed by controlling the Xen platform
device independently via a separate machine property or '-device
xen-platform', as Eduardo Habkost suggested.



Re: [Qemu-devel] [RFC PATCH 16/30] q35/xen: Add Xen platform device support for Q35

2018-03-12 Thread Alexey G
On Mon, 12 Mar 2018 16:44:06 -0300
Eduardo Habkost  wrote:

>On Tue, Mar 13, 2018 at 04:34:01AM +1000, Alexey Gerasimenko wrote:
>> Current Xen/QEMU method to control Xen Platform device on i440 is a
>> bit odd -- enabling/disabling Xen platform device actually modifies
>> the QEMU emulated machine type, namely xenfv <--> pc.
>> 
>> In order to avoid multiplying machine types, use a new way to
>> control Xen Platform device for QEMU -- "xen-platform-dev" machine
>> property (bool). To maintain backward compatibility with existing
>> Xen/QEMU setups, this is only applicable to q35 machine currently.
>> i440 emulation still uses the old method (i.e. xenfv/pc machine
>> selection) to control Xen Platform device, this may be changed later
>> to xen-platform-dev property as well.
>> 
>> This way we can use a single machine type (q35) and change just
>> xen-platform-dev value to on/off to control Xen platform device.
>> 
>> Signed-off-by: Alexey Gerasimenko 
>> ---  
>[...]
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 6585058c6c..cee0b92028 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -38,6 +38,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>  "dump-guest-core=on|off include guest memory in
>> a core dump (default=on)\n" "mem-merge=on|off
>> controls memory merge support (default: on)\n" "
>> igd-passthru=on|off controls IGD GFX passthrough support
>> (default=off)\n"
>> +"xen-platform-dev=on|off controls Xen Platform
>> device (default=off)\n" "aes-key-wrap=on|off
>> controls support for AES key wrapping (default=on)\n"
>> "dea-key-wrap=on|off controls support for DEA key
>> wrapping (default=on)\n" "suppress-vmdesc=on|off
>> disables self-describing migration (default=off)\n"  
>
>What are the obstacles preventing "-device xen-platform" from
>working?  It would be better than adding a new boolean option to
>-machine.

I guess the initial assumption was that changing the
xen_platform_device value in Xen's options may cause some additional
changes in platform configuration besides adding (or not) the Xen
Platform device, hence a completely different machine type was chosen
(xenfv).

At the moment pc,accel=xen/xenfv selection mostly governs
only the Xen Platform device presence. Also setting max_cpus to
HVM_MAX_VCPUS depends on it, but this isn't applicable to a
'pc,accel=xen' machine for some reason.

If applying HVM_MAX_VCPUS to max_cpus is really necessary I think it's
better to set it unconditionally for all 'accel=xen' HVM machine
types inside xen_enabled() block. Right now it's missing for
pc,accel=xen and q35,accel=xen.

I'll check if supplying the Xen platform device via the '-device' option
will be ok for all usage cases.



[Qemu-devel] [PATCH] xen-mapcache: Fix the bug when overlapping emulated DMA operations may cause inconsistency in guest memory mappings

2017-07-10 Thread Alexey G
(Sorry, sent to a wrong mailing list first instead of QEMU-devel)

Under certain circumstances normal xen-mapcache functioning may be broken
by guest's actions. This may lead to either QEMU performing exit() due to
a caught bad pointer (and with QEMU process gone the guest domain simply
appears hung afterwards) or actual use of the incorrect pointer inside
QEMU address space -- a write to unmapped memory is possible. The bug is
hard to reproduce on an i440 machine as multiple DMA sources are required
(though it's possible in theory, using multiple emulated devices), but can
be reproduced somewhat easily on a Q35 machine using an emulated AHCI
controller -- each NCQ queue command slot may be used as an independent
DMA source, e.g. using the READ FPDMA QUEUED command, so a single storage
device on the AHCI controller port will be enough to produce multiple DMAs
(up to 32). The detailed description of the issue follows.

Xen-mapcache provides an ability to map parts of a guest memory into
QEMU's own address space to work with.

There are two types of cache lookups:
 - translating a guest physical address into a pointer in QEMU's address
   space, mapping a part of guest domain memory if necessary (while trying
   to reduce a number of such (re)mappings to a minimum)
 - translating a QEMU's pointer back to its physical address in guest RAM

These lookups are managed via two linked-lists of structures.
MapCacheEntry is used for forward cache lookups, while MapCacheRev -- for
reverse lookups.

Every guest physical address is broken down into 2 parts:
address_index  = phys_addr >> MCACHE_BUCKET_SHIFT;
address_offset = phys_addr & (MCACHE_BUCKET_SIZE - 1);

MCACHE_BUCKET_SHIFT depends on the host (32/64-bit) and is equal to 20 on
a 64-bit system (which is assumed for the further description).
Basically, this means that we deal with 1 MB chunks and offsets within
those 1 MB chunks. All mappings are created with 1MB granularity, i.e.
1MB/2MB/3MB etc. Most DMA transfers are typically less than 1MB; however,
if a transfer crosses any 1MB border(s), then the nearest larger mapping
size will be used, so e.g. a 512-byte DMA transfer with the start address
700FFF80h will actually require a 2MB range.
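The index/offset split and the bucket rounding can be sketched as follows (the helper names are mine; the constants match the 64-bit values described above, the real code lives in hw/i386/xen/xen-mapcache.c):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCACHE_BUCKET_SHIFT 20                       /* 64-bit host: 1 MB buckets */
#define MCACHE_BUCKET_SIZE  (1UL << MCACHE_BUCKET_SHIFT)

/* Split a guest physical address the way xen-mapcache does */
static uint64_t address_index(uint64_t phys_addr)
{
    return phys_addr >> MCACHE_BUCKET_SHIFT;
}

static uint64_t address_offset(uint64_t phys_addr)
{
    return phys_addr & (MCACHE_BUCKET_SIZE - 1);
}

/* Mapping size: whole 1 MB buckets covering [phys_addr, phys_addr + len) */
static uint64_t mapping_size(uint64_t phys_addr, size_t len)
{
    uint64_t span = address_offset(phys_addr) + len;

    return ((span + MCACHE_BUCKET_SIZE - 1) >> MCACHE_BUCKET_SHIFT)
           << MCACHE_BUCKET_SHIFT;
}
```

The 512-byte transfer at 700FFF80h from the example ends at 70100180h, crossing the 1MB border at 70100000h, so the rounding yields a 2MB mapping.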

The current implementation assumes that MapCacheEntries are unique for a
given address_index and size pair and that a single MapCacheEntry may be
reused by multiple requests -- in this case the 'lock' field will be
larger than 1. On the other hand, each requested guest physical address
(with the 'lock' flag) is described by its own MapCacheRev. So there may
be multiple MapCacheRev entries corresponding to a single MapCacheEntry.
The xen-mapcache code uses MapCacheRev entries to retrieve the
address_index & size pair, which in turn is used to find a related
MapCacheEntry. The 'lock' field within a MapCacheEntry structure is
actually a reference counter which shows the number of corresponding
MapCacheRev entries.

The bug lies in the ability of the guest to indirectly manipulate the
xen-mapcache MapCacheEntry list via a special sequence of DMA
operations, typically for storage devices. In order to trigger the bug,
the guest needs to issue DMA operations in a specific order and timing.
Although xen-mapcache is protected by a mutex lock, this doesn't help in
this case, as the bug is not due to a race condition.

Suppose we have 3 DMA transfers, namely A, B and C, where
- transfer A crosses 1MB border and thus uses a 2MB mapping
- transfers B and C are normal transfers within 1MB range
- and all 3 transfers belong to the same address_index

In this case, if all these transfers are to be executed one-by-one
(without overlaps), no special treatment necessary -- each transfer's
mapping lock will be set and then cleared on unmap before starting
the next transfer.
The situation changes when DMA transfers overlap in time, ex. like this:

  |== transfer A (2MB) ==|
         |== transfer B (1MB) ==|
                            |== transfer C (1MB) ==|
  time --->

In this situation the following sequence of actions happens:

1. transfer A creates a mapping to 2MB area (lock=1)
2. transfer B (1MB) tries to find available mapping but cannot find one
   because transfer A is still in progress, and it has 2MB size + non-zero
   lock. So transfer B creates another mapping -- same address_index,
   but 1MB size.
3. transfer A completes, making 1st mapping entry available by setting its
   lock to 0
4. transfer C starts and tries to find available mapping entry and sees
   that 1st entry has lock=0, so it uses this entry but remaps the mapping
   to a 1MB size
5. transfer B completes and by this time
  - there are two locked entries in the MapCacheEntry list with the SAME
values for both address_index and size
  - the entry for transfer B actually resides farther in list while
transfer C's entry is first
6. xen_ram_addr_from_mapcache() for transfer B gets correct address_index
   and size pair from corresponding MapCacheRev entry, 
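The sequence above can be reproduced with a toy model of one bucket's linked list (simplified, hypothetical code -- sizes are in whole MB rather than bytes, and map()/unmap() only mimic the list-walk logic of xen_map_cache_unlocked(), they do no real mapping):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy MapCacheEntry: one bucket's linked list, sizes in whole MB */
typedef struct Entry {
    unsigned long index;   /* address_index */
    unsigned size_mb;      /* mapping size */
    int lock;              /* reference count */
    struct Entry *next;
} Entry;

static Entry *head;

/* Walk the list the way xen_map_cache_unlocked() does: stop at the first
 * unlocked entry or at an exact (index, size) match; append a new entry
 * at the tail if nothing stops the walk; remap a non-matching unlocked
 * entry to the requested size. */
static Entry *map(unsigned long index, unsigned size_mb)
{
    Entry *e = head, *prev = NULL;

    while (e && e->lock && !(e->index == index && e->size_mb == size_mb)) {
        prev = e;
        e = e->next;
    }
    if (!e) {
        e = calloc(1, sizeof(*e));
        if (prev) {
            prev->next = e;
        } else {
            head = e;
        }
    }
    if (!(e->index == index && e->size_mb == size_mb)) {
        e->index = index;      /* remap: entry is reused for a new request */
        e->size_mb = size_mb;
    }
    e->lock++;
    return e;
}

static void unmap(Entry *e)
{
    e->lock--;                 /* mapping stays cached for reuse */
}
```

Running the A/B/C schedule against this model ends with two locked entries carrying identical (address_index, size) keys, and a reverse lookup by that key hits transfer C's entry first -- exactly the inconsistent state described above.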

Re: [Qemu-devel] [Xen-devel] [RFC/BUG] xen-mapcache: buggy invalidate map cache?

2017-04-12 Thread Alexey G
On Tue, 11 Apr 2017 15:32:09 -0700 (PDT)
Stefano Stabellini  wrote:

> On Tue, 11 Apr 2017, hrg wrote:
> > On Tue, Apr 11, 2017 at 3:50 AM, Stefano Stabellini
> >  wrote:  
> > > On Mon, 10 Apr 2017, Stefano Stabellini wrote:  
> > >> On Mon, 10 Apr 2017, hrg wrote:  
> > >> > On Sun, Apr 9, 2017 at 11:55 PM, hrg  wrote:  
> > >> > > On Sun, Apr 9, 2017 at 11:52 PM, hrg  wrote:  
> > >> > >> Hi,
> > >> > >>
> > >> > >> In xen_map_cache_unlocked(), map to guest memory maybe in
> > >> > >> entry->next instead of first level entry (if map to rom other than
> > >> > >> guest memory comes first), while in xen_invalidate_map_cache(),
> > >> > >> when VM ballooned out memory, qemu did not invalidate cache entries
> > >> > >> in linked list(entry->next), so when VM balloon back in memory,
> > >> > >> gfns probably mapped to different mfns, thus if guest asks device
> > >> > >> to DMA to these GPA, qemu may DMA to stale MFNs.
> > >> > >>
> > >> > >> So I think in xen_invalidate_map_cache() linked lists should also be
> > >> > >> checked and invalidated.
> > >> > >>
> > >> > >> What’s your opinion? Is this a bug? Is my analyze correct?  
> > >>
> > >> Yes, you are right. We need to go through the list for each element of
> > >> the array in xen_invalidate_map_cache. Can you come up with a patch?  
> > >
> > > I spoke too soon. In the regular case there should be no locked mappings
> > > when xen_invalidate_map_cache is called (see the DPRINTF warning at the
> > > beginning of the functions). Without locked mappings, there should never
> > > be more than one element in each list (see xen_map_cache_unlocked:
> > > entry->lock == true is a necessary condition to append a new entry to
> > > the list, otherwise it is just remapped).
> > >
> > > Can you confirm that what you are seeing are locked mappings
> > > when xen_invalidate_map_cache is called? To find out, enable the DPRINTK
> > > by turning it into a printf or by defininig MAPCACHE_DEBUG.  
> > 
> > In fact, I think the DPRINTF above is incorrect too. In
> > pci_add_option_rom(), rtl8139 rom is locked mapped in
> > pci_add_option_rom->memory_region_get_ram_ptr (after
> > memory_region_init_ram). So actually I think we should remove the
> > DPRINTF warning as it is normal.  
> 
> Let me explain why the DPRINTF warning is there: emulated dma operations
> can involve locked mappings. Once a dma operation completes, the related
> mapping is unlocked and can be safely destroyed. But if we destroy a
> locked mapping in xen_invalidate_map_cache, while a dma is still
> ongoing, QEMU will crash. We cannot handle that case.
> 
> However, the scenario you described is different. It has nothing to do
> with DMA. It looks like pci_add_option_rom calls
> memory_region_get_ram_ptr to map the rtl8139 rom. The mapping is a
> locked mapping and it is never unlocked or destroyed.
> 
> It looks like "ptr" is not used after pci_add_option_rom returns. Does
> the append patch fix the problem you are seeing? For the proper fix, I
> think we probably need some sort of memory_region_unmap wrapper or maybe
> a call to address_space_unmap.

Hmm, for some reason my message to the Xen-devel list got rejected but was sent
to Qemu-devel instead, without any notice. Sorry if I'm missing something
obvious as a list newbie.

Stefano, hrg,

There is an issue with inconsistency between the list of normal
MapCacheEntry's and their 'reverse' counterparts -- the MapCacheRev's in
locked_entries. When the bad situation happens, there are multiple
(locked) MapCacheEntry entries in the bucket's linked list along with a
number of MapCacheRev's. And when it comes to a reverse lookup,
xen-mapcache picks the wrong entry from the first list and calculates a
wrong pointer from it, which may then be caught by the "Bad RAM offset"
check (or not). Mapcache invalidation might be related to this issue as
well, I think.

I'll try to provide test code which can reproduce the issue from the
guest side using an emulated IDE controller, though it's much simpler to
achieve this result with an AHCI controller using multiple NCQ I/O
commands. So far I've seen this issue only with a Windows 7 (and above)
guest on AHCI, but any block I/O DMA should be enough, I think.



Re: [Qemu-devel] [Xen-devel] [RFC/BUG] xen-mapcache: buggy invalidate map cache?

2017-04-09 Thread Alexey G
On Mon, 10 Apr 2017 00:36:02 +0800
hrg  wrote:

Hi,

> On Sun, Apr 9, 2017 at 11:55 PM, hrg  wrote:
> > On Sun, Apr 9, 2017 at 11:52 PM, hrg  wrote:  
> >> Hi,
> >>
> >> In xen_map_cache_unlocked(), map to guest memory maybe in entry->next
> >> instead of first level entry (if map to rom other than guest memory
> >> comes first), while in xen_invalidate_map_cache(), when VM ballooned
> >> out memory, qemu did not invalidate cache entries in linked
> >> list(entry->next), so when VM balloon back in memory, gfns probably
> >> mapped to different mfns, thus if guest asks device to DMA to these
> >> GPA, qemu may DMA to stale MFNs.
> >>
> >> So I think in xen_invalidate_map_cache() linked lists should also be
> >> checked and invalidated.
> >>
> >> What’s your opinion? Is this a bug? Is my analyze correct?  
> >
> > Added Jun Nakajima and Alexander Graf  
> And correct Stefano Stabellini's email address.

There is a real issue with xen-mapcache corruption, in fact. I
encountered it a few months ago while experimenting with Q35 support on
Xen. Q35 emulation uses an AHCI controller by default, with NCQ mode
enabled. The issue can be (somewhat) easily reproduced there, though
normal i440 emulation might also allow reproducing it, using dedicated
test code on the guest side. In the case of Q35+NCQ the issue can be
reproduced "as is".

The issue occurs when a guest domain performs an intensive disk I/O, ex. while
guest OS booting. QEMU crashes with "Bad ram offset 980aa000"
message logged, where the address is different each time. The hard thing with
this issue is that it has a very low reproducibility rate.

The corruption happens when there are multiple I/O commands in the NCQ
queue. So there are overlapping emulated DMA operations in flight, and
QEMU uses a sequence of mapcache actions which can be executed in the
"wrong" order, thus leading to an inconsistent xen-mapcache -- so a bad
address from the wrong entry is returned.

The bad thing about this issue is that a QEMU crash due to "Bad ram
offset" is a relatively good outcome, in the sense that this is a caught
error. But there might be a much worse (artificial) situation where the
returned address looks valid but points to different mapped memory.

The fix itself is not hard (e.g. an additional checked field in
MapCacheEntry), but there is a need for some reliable way to test it,
considering the low reproducibility rate.

Regards,
Alex