On 25/10/2016 01:00, Alex Williamson wrote: > With a vfio assigned device we lay down a base MemoryRegion registered > as an IO region, giving us read & write accessors. If the region > supports mmap, we lay down a higher priority sub-region MemoryRegion > on top of the base layer initialized as a RAM device pointer to the > mmap. Finally, if we have any quirks for the device (ie. address > ranges that need additional virtualization support), we put another IO > sub-region on top of the mmap MemoryRegion. When this is flattened, > we now potentially have sub-page mmap MemoryRegions exposed which > cannot be directly mapped through KVM. > > This is as expected, but a subtle detail of this is that we end up > with two different access mechanisms through QEMU. If we disable the > mmap MemoryRegion, we make use of the IO MemoryRegion and service > accesses using pread and pwrite to the vfio device file descriptor. > If the mmap MemoryRegion is enabled and results in one of these > sub-page gaps, QEMU handles the access as RAM, using memcpy to the > mmap. Using either pread/pwrite or the mmap directly should be > correct, but using memcpy causes us problems. I expect that not only > does memcpy not necessarily honor the original width and alignment in > performing a copy, but it potentially also uses processor instructions > not intended for MMIO spaces. It turns out that this has been a > problem for Realtek NIC assignment, which has such a quirk that > creates a sub-page mmap MemoryRegion access. > > To resolve this, we disable memory_access_is_direct() for ram_device > regions since QEMU assumes that it can use memcpy for those regions. > Instead we access through MemoryRegionOps, which replaces the memcpy > with simple de-references of standard sizes to the host memory. This > also allows us to eliminate the ram_device bool from the MemoryRegion > structure since we can simply test the ops pointer. > > With this patch we attempt to provide unrestricted access to the RAM > device, allowing byte through qword access as well as unaligned > access. The assumption here is that accesses initiated by the VM are > driven by a device specific driver, which knows the device > capabilities. If unaligned accesses are not supported by the device, > we don't want them to work in a VM by performing multiple aligned > accesses to compose the unaligned access. A down-side of this > philosophy is that the xp command from the monitor attempts to use > the largest available access weidth, unaware of the underlying > device. Using memcpy had this same restriction, but at least now an > operator can dump individual registers, even if blocks of device > memory may result in access widths beyond the capabilities of a > given device (RTL NICs only support up to dword). > > Reported-by: Thorsten Kohfeldt <thorsten.kohfe...@gmx.de> > Signed-off-by: Alex Williamson <alex.william...@redhat.com> > --- > include/exec/memory.h | 7 +++-- > memory.c | 70 > ++++++++++++++++++++++++++++++++++++++++++++++++- > trace-events | 2 + > 3 files changed, 74 insertions(+), 5 deletions(-) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index a75b8c3..2d4a287 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -209,7 +209,6 @@ struct MemoryRegion { > void (*destructor)(MemoryRegion *mr); > uint64_t align; > bool terminates; > - bool ram_device; > bool enabled; > bool warning_printed; /* For reservations */ > uint8_t vga_logging_count; > @@ -1480,9 +1479,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t > addr); > static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write) > { > if (is_write) { > - return memory_region_is_ram(mr) && !mr->readonly; > + return memory_region_is_ram(mr) && > + !mr->readonly && !memory_region_is_ram_device(mr); > } else { > - return memory_region_is_ram(mr) || memory_region_is_romd(mr); > + return (memory_region_is_ram(mr) && > !memory_region_is_ram_device(mr)) || > + memory_region_is_romd(mr);
Oops, I hadn't thought of this. This is a bit slower, especially so if you have to compare the ops pointer. It's probably best to keep the mr->ram_device flag, so that later we can place all five relevant flags (ram, readonly, ram_device, rom_device, romd_mode) into the same word and play games with bits, like the following: // write case mr->flags == MR_RAM // read cases (mr->flags & ~MR_READONLY) == MR_RAM || (mr->flags & ~MR_READONLY) == (MR_ROM_DEVICE | MR_ROMD_MODE) (the latter in turn can be optimized to a range check if the flags are arranged cleverly). The patch is okay with mr->ram_device left in place, feel free to merge it through your own VFIO tree. Paolo > } > } > > diff --git a/memory.c b/memory.c > index 7ffcff1..d07f785 100644 > --- a/memory.c > +++ b/memory.c > @@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops = { > .endianness = DEVICE_NATIVE_ENDIAN, > }; > > +static uint64_t memory_region_ram_device_read(void *opaque, > + hwaddr addr, unsigned size) > +{ > + MemoryRegion *mr = opaque; > + uint64_t data = (uint64_t)~0; > + > + switch (size) { > + case 1: > + data = *(uint8_t *)(mr->ram_block->host + addr); > + break; > + case 2: > + data = *(uint16_t *)(mr->ram_block->host + addr); > + break; > + case 4: > + data = *(uint32_t *)(mr->ram_block->host + addr); > + break; > + case 8: > + data = *(uint64_t *)(mr->ram_block->host + addr); > + break; > + } > + > + trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, > size); > + > + return data; > +} > + > +static void memory_region_ram_device_write(void *opaque, hwaddr addr, > + uint64_t data, unsigned size) > +{ > + MemoryRegion *mr = opaque; > + > + trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, > size); > + > + switch (size) { > + case 1: > + *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data; > + break; > + case 2: > + *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data; > + break; > + case 4: > + *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data; > + break; > + case 8: > + *(uint64_t *)(mr->ram_block->host + addr) = data; > + break; > + } > +} > + > +static const MemoryRegionOps ram_device_mem_ops = { > + .read = memory_region_ram_device_read, > + .write = memory_region_ram_device_write, > + .endianness = DEVICE_NATIVE_ENDIAN, > + .valid = { > + .min_access_size = 1, > + .max_access_size = 8, > + .unaligned = true, > + }, > + .impl = { > + .min_access_size = 1, > + .max_access_size = 8, > + .unaligned = true, > + }, > +}; > + > bool memory_region_access_valid(MemoryRegion *mr, > hwaddr addr, > unsigned size, > @@ -1362,7 +1427,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr, > void *ptr) > { > memory_region_init_ram_ptr(mr, owner, name, size, ptr); > - mr->ram_device = true; > + mr->ops = &ram_device_mem_ops; > + mr->opaque = mr; > } > > void memory_region_init_alias(MemoryRegion *mr, > @@ -1498,7 +1564,7 @@ const char *memory_region_name(const MemoryRegion *mr) > > bool memory_region_is_ram_device(MemoryRegion *mr) > { > - return mr->ram_device; > + return mr->ops == &ram_device_mem_ops; > } > > uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr) > diff --git a/trace-events b/trace-events > index 8ecded5..f74e1d3 100644 > --- a/trace-events > +++ b/trace-events > @@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr, > uint64_t offset, uint64_t va > memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, > uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value > %#"PRIx64" size %u" > memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, unsigned > size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u" > memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, > unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u" > +memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, > uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" > size %u" > +memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, > uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" > size %u" > > ### Guest events, keep at bottom > >