On 25/10/2016 01:00, Alex Williamson wrote:
> With a vfio assigned device we lay down a base MemoryRegion registered
> as an IO region, giving us read & write accessors.  If the region
> supports mmap, we lay down a higher priority sub-region MemoryRegion
> on top of the base layer initialized as a RAM device pointer to the
> mmap.  Finally, if we have any quirks for the device (ie. address
> ranges that need additional virtualization support), we put another IO
> sub-region on top of the mmap MemoryRegion.  When this is flattened,
> we now potentially have sub-page mmap MemoryRegions exposed which
> cannot be directly mapped through KVM.
> 
> This is as expected, but a subtle detail of this is that we end up
> with two different access mechanisms through QEMU.  If we disable the
> mmap MemoryRegion, we make use of the IO MemoryRegion and service
> accesses using pread and pwrite to the vfio device file descriptor.
> If the mmap MemoryRegion is enabled and results in one of these
> sub-page gaps, QEMU handles the access as RAM, using memcpy to the
> mmap.  Using either pread/pwrite or the mmap directly should be
> correct, but using memcpy causes us problems.  I expect that not only
> does memcpy not necessarily honor the original width and alignment in
> performing a copy, but it potentially also uses processor instructions
> not intended for MMIO spaces.  It turns out that this has been a
> problem for Realtek NIC assignment, which has such a quirk that
> creates a sub-page mmap MemoryRegion access.
> 
> To resolve this, we disable memory_access_is_direct() for ram_device
> regions since QEMU assumes that it can use memcpy for those regions.
> Instead we access through MemoryRegionOps, which replaces the memcpy
> with simple de-references of standard sizes to the host memory.  This
> also allows us to eliminate the ram_device bool from the MemoryRegion
> structure since we can simply test the ops pointer.
> 
> With this patch we attempt to provide unrestricted access to the RAM
> device, allowing byte through qword access as well as unaligned
> access.  The assumption here is that accesses initiated by the VM are
> driven by a device specific driver, which knows the device
> capabilities.  If unaligned accesses are not supported by the device,
> we don't want them to work in a VM by performing multiple aligned
> accesses to compose the unaligned access.  A down-side of this
> philosophy is that the xp command from the monitor attempts to use
> the largest available access weidth, unaware of the underlying
> device.  Using memcpy had this same restriction, but at least now an
> operator can dump individual registers, even if blocks of device
> memory may result in access widths beyond the capabilities of a
> given device (RTL NICs only support up to dword).
> 
> Reported-by: Thorsten Kohfeldt <thorsten.kohfe...@gmx.de>
> Signed-off-by: Alex Williamson <alex.william...@redhat.com>
> ---
>  include/exec/memory.h |    7 +++--
>  memory.c              |   70 
> ++++++++++++++++++++++++++++++++++++++++++++++++-
>  trace-events          |    2 +
>  3 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a75b8c3..2d4a287 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -209,7 +209,6 @@ struct MemoryRegion {
>      void (*destructor)(MemoryRegion *mr);
>      uint64_t align;
>      bool terminates;
> -    bool ram_device;
>      bool enabled;
>      bool warning_printed; /* For reservations */
>      uint8_t vga_logging_count;
> @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t 
> addr);
>  static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
>  {
>      if (is_write) {
> -        return memory_region_is_ram(mr) && !mr->readonly;
> +        return memory_region_is_ram(mr) &&
> +               !mr->readonly && !memory_region_is_ram_device(mr);
>      } else {
> -        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
> +        return (memory_region_is_ram(mr) && 
> !memory_region_is_ram_device(mr)) ||
> +               memory_region_is_romd(mr);

Oops, I hadn't thought of this.  This is a bit slower, especially so if 
you have to compare the ops pointer.  It's probably best to keep the 
mr->ram_device flag, so that later we can place all five relevant flags 
(ram, readonly, ram_device, rom_device, romd_mode) into the same word 
and play games with bits, like the following:

        // write case
        mr->flags == MR_RAM

        // read cases
        (mr->flags & ~MR_READONLY) == MR_RAM ||
        (mr->flags & ~MR_READONLY) == (MR_ROM_DEVICE | MR_ROMD_MODE)

(the latter in turn can be optimized to a range check if the flags are
arranged cleverly).

The patch is okay with mr->ram_device left in place, feel free to merge 
it through your own VFIO tree.

Paolo

>      }
>  }
>  
> diff --git a/memory.c b/memory.c
> index 7ffcff1..d07f785 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops = {
>      .endianness = DEVICE_NATIVE_ENDIAN,
>  };
>  
> +static uint64_t memory_region_ram_device_read(void *opaque,
> +                                              hwaddr addr, unsigned size)
> +{
> +    MemoryRegion *mr = opaque;
> +    uint64_t data = (uint64_t)~0;
> +
> +    switch (size) {
> +    case 1:
> +        data = *(uint8_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 2:
> +        data = *(uint16_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 4:
> +        data = *(uint32_t *)(mr->ram_block->host + addr);
> +        break;
> +    case 8:
> +        data = *(uint64_t *)(mr->ram_block->host + addr);
> +        break;
> +    }
> +
> +    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, 
> size);
> +
> +    return data;
> +}
> +
> +static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> +                                           uint64_t data, unsigned size)
> +{
> +    MemoryRegion *mr = opaque;
> +
> +    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, 
> size);
> +
> +    switch (size) {
> +    case 1:
> +        *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data;
> +        break;
> +    case 2:
> +        *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data;
> +        break;
> +    case 4:
> +        *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data;
> +        break;
> +    case 8:
> +        *(uint64_t *)(mr->ram_block->host + addr) = data;
> +        break;
> +    }
> +}
> +
> +static const MemoryRegionOps ram_device_mem_ops = {
> +    .read = memory_region_ram_device_read,
> +    .write = memory_region_ram_device_write,
> +    .endianness = DEVICE_NATIVE_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +        .unaligned = true,
> +    },
> +};
> +
>  bool memory_region_access_valid(MemoryRegion *mr,
>                                  hwaddr addr,
>                                  unsigned size,
> @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr,
>                                         void *ptr)
>  {
>      memory_region_init_ram_ptr(mr, owner, name, size, ptr);
> -    mr->ram_device = true;
> +    mr->ops = &ram_device_mem_ops;
> +    mr->opaque = mr;
>  }
>  
>  void memory_region_init_alias(MemoryRegion *mr,
> @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion *mr)
>  
>  bool memory_region_is_ram_device(MemoryRegion *mr)
>  {
> -    return mr->ram_device;
> +    return mr->ops == &ram_device_mem_ops;
>  }
>  
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
> diff --git a/trace-events b/trace-events
> index 8ecded5..f74e1d3 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr, 
> uint64_t offset, uint64_t va
>  memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, 
> uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value 
> %#"PRIx64" size %u"
>  memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, unsigned 
> size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
>  memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, 
> unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
> +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, 
> uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" 
> size %u"
> +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, 
> uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" 
> size %u"
>  
>  ### Guest events, keep at bottom
>  
> 

Reply via email to