On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote:
> On 18/01/2013 17:04, Luigi Rizzo wrote:
> > Hi,
> > with a bunch of e1000 improvements we are at a point where we are
> > doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
> > between two guests, and two things that are high in the "perf top"
> > stats are phys_page_find() and related memory copies.
> > 
> > Both are triggered by the pci_dma_read() and pci_dma_write(),
> > which on e1000 (and presumably other frontends) are called on
> > every single descriptor and every single buffer.
> > 
> > I have then tried to access the guest memory without going every
> > time through the page lookup. [...]
> > 
> > This relies on the assumption that the ring (which is contiguous in the
> > guest's physical address space) is also contiguous in the host's virtual
> > address space.  In principle the property could be easily verified once
> > the ring is set up.
> 
> IIRC, the amount of contiguous memory is written by address_space_map in
> the plen parameter.

Unfortunately the plen parameter is modified only if the area
is smaller than the request, so it does not report the full extent
of the contiguous region, and there is no method that I can
find that returns [base,len] of a RAMBlock.

What I came up with, also to check for invalid addresses,
IOMMU and the like, is something like this:

    // addr is the address we want to map into host VM
    int mappable_addr(PCIDevice *dev, hwaddr addr,
                uint64_t *guest_ha_low, uint64_t *guest_ha_high,
                uint64_t *gpa_to_hva)
    {
        AddressSpace *as = dev->as;
        AddressSpaceDispatch *d = as->dispatch;
        MemoryRegionSection *section;
        RAMBlock *block;

        if (dma_has_iommu(pci_dma_context(dev)))
            return 0;   // no direct access

        section = phys_page_find(d, addr >> TARGET_PAGE_BITS);
        if (!memory_region_is_ram(section->mr) || section->readonly)
            return 0;   // no direct access

        QLIST_FOREACH(block, &ram_list.blocks, next) {
            if (addr - block->offset < block->length) {
                /* set 3 variables indicating the valid range
                 * and the offset between the two address spaces.
                 */
                *guest_ha_low =  block->offset;
                *guest_ha_high = block->offset + block->length;
                *gpa_to_hva = (uint64_t)(uintptr_t)block->host - block->offset;
                return 1;
            }
        }
        return 0;
    }

(this probably needs to be put in exec.c or some other place
that can access phys_page_find() and RAMBlock)
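
A note on the range test inside the loop, as a standalone sketch rather
than QEMU code (in_block() and its plain uint64_t arguments are made up
for illustration). With unsigned arithmetic, addr - offset wraps to a
huge value when addr is below offset, so a single comparison covers both
the lower and the upper bound:

```c
#include <stdint.h>

/* Hypothetical helper, not part of QEMU.
 * Returns 1 if addr lies in [offset, offset + length), 0 otherwise.
 * If addr < offset the unsigned subtraction wraps around to a value
 * close to UINT64_MAX, which is never smaller than a realistic length,
 * so no separate "addr >= offset" test is needed.
 */
static int in_block(uint64_t addr, uint64_t offset, uint64_t length)
{
    return addr - offset < length;
}
```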

The interested client (hw/e1000.c in my case) could then call
memory_listener_register() to be notified of changes, invoke
mappable_addr() on the first data buffer it has to translate,
and cache the result. Subsequent translations that hit the
cached range use it directly; on a miss, pci_dma_read/write
remain the fallback methods.

The cache is invalidated on updates arriving from the
memory listener, and refreshed at the next access.
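
To make the hit/miss/invalidate cycle concrete, here is a rough
standalone sketch. It is not QEMU code: struct xlat_cache and the
xlat_*() helpers are hypothetical names, and the cache simply stores the
three values that mappable_addr() hands back. A NULL return means "miss",
where the caller would go through pci_dma_read/write instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-entry cache, filled from mappable_addr()'s outputs. */
struct xlat_cache {
    uint64_t lo;          /* first valid guest address */
    uint64_t hi;          /* one past the last valid guest address */
    uint64_t gpa_to_hva;  /* add to a guest address to get a host address */
    int valid;            /* cleared by the memory listener callback */
};

/* Refresh the cache, e.g. at the first access after an invalidation. */
static void xlat_fill(struct xlat_cache *c, uint64_t lo, uint64_t hi,
                      uint64_t gpa_to_hva)
{
    c->lo = lo;
    c->hi = hi;
    c->gpa_to_hva = gpa_to_hva;
    c->valid = 1;
}

/* Would be called from the memory listener when the layout changes. */
static void xlat_invalidate(struct xlat_cache *c)
{
    c->valid = 0;
}

/* Fast path: return a host pointer if [addr, addr + len) lies entirely
 * inside the cached range, NULL otherwise (caller uses the slow path).
 */
static void *xlat_lookup(const struct xlat_cache *c, uint64_t addr,
                         uint64_t len)
{
    assert(len > 0);
    if (!c->valid || addr < c->lo || addr >= c->hi || len > c->hi - addr)
        return NULL;
    /* unsigned arithmetic: the offset may have wrapped, the sum undoes it */
    return (void *)(uintptr_t)(addr + c->gpa_to_hva);
}
```

The length check is written as len > hi - addr rather than
addr + len > hi so that a bogus, very large len coming from the guest
cannot overflow the sum and slip through.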

Is this more sound?
The only missing piece then is the call to
invalidate_and_set_dirty().


cheers
luigi
