On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote:
> Il 18/01/2013 17:04, Luigi Rizzo ha scritto:
> > Hi,
> > with a bunch of e1000 improvements we are at a point where we are
> > doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
> > between two guests, and two things that are high in the "perf top"
> > stats are phys_page_find() and related memory copies.
> >
> > Both are triggered by the pci_dma_read() and pci_dma_write(),
> > which on e1000 (and presumably other frontends) are called on
> > every single descriptor and every single buffer.
> >
> > I have then tried to access the guest memory without going every
> > time through the page lookup. [...]
> >
> > This relies on the assumption that the ring (which is contiguous in the
> > guest's physical address space) is also contiguous in the host's virtual
> > address space. In principle the property could be easily verified once
> > the ring is set up.
>
> IIRC, the amount of contiguous memory is written by address_space_map in
> the plen parameter.

Unfortunately the plen parameter is modified only if the area is
smaller than the request, and there is no method that I can find
that returns [base,len] of a RAMBlock.

What I came up with, also to check for invalid addresses,
IOMMU and the like, is something like this:

// addr is the address we want to map into host VM
int mappable_addr(PCIDevice *dev, hwaddr addr,
        uint64_t *guest_ha_low, uint64_t *guest_ha_high,
        uint64_t *gpa_to_hva)
{
    AddressSpace *as = dev->as;
    AddressSpaceDispatch *d = as->dispatch;
    MemoryRegionSection *section;
    RAMBlock *block;

    if (dma_has_iommu(pci_dma_context(dev)))
        return 0;       // no direct access

    section = phys_page_find(d, addr >> TARGET_PAGE_BITS);
    if (!memory_region_is_ram(section->mr) || section->readonly)
        return 0;       // no direct access

    QLIST_FOREACH(block, &ram_list.blocks, next) {
        if (addr - block->offset < block->length) {
            /* set 3 variables indicating the valid range
             * and the offset between the two address spaces.
             */
            *guest_ha_low = block->offset;
            *guest_ha_high = block->offset + block->length;
            *gpa_to_hva = (uint64_t)block->host - block->offset;
            return 1;
        }
    }
    return 0;
}

(this probably needs to be put in exec.c or some other place
that can access phys_page_find() and RAMBlock)

The interested client (hw/e1000.c in my case) could then do a
memory_listener_register() to be notified of changes, invoke
mappable_addr() on the first data buffer it has to translate,
cache the result, and use it to speed up subsequent translations
in case of a hit (with pci_dma_read/write being the fallback
methods in case of a miss).

The cache is invalidated on updates arriving from the
memory listener, and refreshed at the next access.

Is this more sound?
The only missing piece then is the call to invalidate_and_set_dirty()

cheers
luigi
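
For illustration only, the client-side cache described above could look roughly like the sketch below. It is plain C with no QEMU headers, and the names DmaFastMap, dma_fastmap_fill(), dma_fastmap_lookup() and dma_fastmap_invalidate() are made up for this example, not existing QEMU API. The three stored values are assumed to be the ones mappable_addr() hands back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached translation window, filled from the three values
 * returned by mappable_addr(). valid == 0 means "look it up again". */
typedef struct {
    uint64_t lo;      /* first guest address covered by the window */
    uint64_t hi;      /* one past the last guest address covered */
    uint64_t offset;  /* added to a guest address to get a host address */
    int valid;
} DmaFastMap;

/* Meant to be called from the memory listener callback: drop the window
 * so that the next access goes through mappable_addr() again. */
static void dma_fastmap_invalidate(DmaFastMap *m)
{
    m->valid = 0;
}

/* Store a fresh window after a successful mappable_addr() call. */
static void dma_fastmap_fill(DmaFastMap *m, uint64_t lo, uint64_t hi,
                             uint64_t offset)
{
    assert(lo <= hi);
    m->lo = lo;
    m->hi = hi;
    m->offset = offset;
    m->valid = 1;
}

/* Return a host pointer for [addr, addr + len) on a hit, NULL on a miss.
 * On NULL the caller is expected to fall back to pci_dma_read() or
 * pci_dma_write(). The whole buffer must lie inside the window. */
static void *dma_fastmap_lookup(const DmaFastMap *m, uint64_t addr,
                                uint64_t len)
{
    if (!m->valid || addr < m->lo || addr > m->hi || len > m->hi - addr) {
        return NULL;
    }
    /* Unsigned arithmetic: the sum is correct modulo 2^64 even when
     * offset was computed as a "negative" difference. */
    return (void *)(uintptr_t)(addr + m->offset);
}
```

The bounds check is written as len > hi - addr rather than addr + len > hi so that a huge len cannot wrap around and pass. Writes through the returned pointer would still need the dirty tracking mentioned above.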