2015-12-16 10:34 GMT+01:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> On 16/12/2015 10:28, Vincenzo Maffione wrote:
>> Assuming my TX experiments with a disconnected backend (and with CPU
>> dynamic performance scaling disabled, etc.):
>> 1) after patches 1 and 2, the virtio bottleneck jumps from ~1 Mpps to
>>    1.910 Mpps.
>> 2) after patches 1, 2 and 3, the virtio bottleneck jumps to 2.039 Mpps.
>>
>> So I see an improvement from patch 3, and I guess it's because we avoid
>> an additional memory translation and the related overhead. I believe
>> that avoiding the memory translation is more beneficial than avoiding
>> the variable-sized memcpy.
>> I'm not surprised by that: taking a brief look at what happens under
>> the hood when you call an access_memory() function, it looks like a
>> lot of operations.
>
> Great, thanks for confirming!
>
> Paolo
No problem. I have some additional (orthogonal) curiosities:

1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (the VQ
data structures made directly accessible in host virtual memory, with
the guest-physical-to-host-virtual mapping done statically at setup
time), why isn't QEMU using this approach for virtio-net as well? I see
it is used by virtio-blk only. (A minimal sketch of the idea is
appended below.)

2) In any case (vring or not), QEMU dynamically maps data buffers from
guest physical memory for each descriptor to be processed: e1000 uses
pci_dma_read()/pci_dma_write(), virtio uses
cpu_physical_memory_map()/cpu_physical_memory_unmap(), and vring uses
the more specialized vring_map()/vring_unmap(). All of these go through
expensive lookups and related operations to perform the address
translation. Have you considered caching the translation results to
remove this bottleneck (maybe just for virtio devices)? Or is there any
consistency- or migration-related problem that would create issues?
Just to give an example of what I'm talking about:
https://github.com/vmaffione/qemu/blob/master/hw/net/e1000.c#L349-L423
(a sketch of the same caching idea is also appended below).

At very high packet rates, once notifications (kicks and interrupts)
have been amortized in some way, memory translation becomes the major
bottleneck. And this (points 1 and 2) is why the QEMU virtio
implementation cannot achieve the same throughput as bhyve does
(5-6 Mpps or more, IIRC).

Cheers,
  Vincenzo

--
Vincenzo Maffione
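P.S.: to make point 1 concrete, here is a minimal self-contained sketch
of the "translate once at setup" idea. Everything in it is a stand-in:
map_guest_region() plays the role that vring_map() plays in
hw/virtio/dataplane/vring.c, and fake_guest_ram[] replaces real guest
memory so the sketch compiles on its own; only struct vring_desc follows
the virtio specification layout.

#include <stdint.h>
#include <stddef.h>

typedef uint64_t hwaddr;

/* Fake guest RAM, identity-mapped, so the example is self-contained. */
static uint8_t fake_guest_ram[65536];

/* Stand-in for the one-time guest-physical-to-host-virtual mapping. */
static void *map_guest_region(hwaddr gpa, size_t len)
{
    if (gpa + len > sizeof(fake_guest_ram)) {
        return NULL;
    }
    return fake_guest_ram + gpa;
}

/* Virtio descriptor layout, as defined by the virtio specification. */
struct vring_desc {
    uint64_t addr;   /* guest-physical address of the data buffer */
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct vring_state {
    struct vring_desc *desc;   /* host pointer, computed once at setup */
    unsigned int num;
};

/* Setup time: pay the translation cost exactly once. */
static int vring_setup(struct vring_state *vr, hwaddr desc_gpa,
                       unsigned int num)
{
    vr->desc = map_guest_region(desc_gpa, num * sizeof(*vr->desc));
    vr->num = num;
    return vr->desc ? 0 : -1;
}

/*
 * Datapath: descriptors are read through a plain pointer, with no
 * per-descriptor address translation.
 */
static struct vring_desc *vring_desc_at(struct vring_state *vr,
                                        unsigned int i)
{
    return &vr->desc[i % vr->num];
}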
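And here is a sketch of the per-device translation cache from point 2,
in the same spirit as the e1000 prototype linked above. The names
(tcache_*, slow_translate()) are hypothetical; in real QEMU code the
slow path would be cpu_physical_memory_map() or a pci_dma_* helper.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef uint64_t hwaddr;

/* Fake guest RAM again, so the sketch is self-contained. */
static uint8_t fake_guest_ram[65536];

/*
 * Stand-in for the expensive lookup, i.e. what
 * cpu_physical_memory_map() or pci_dma_read() end up doing.
 */
static void *slow_translate(hwaddr gpa, uint64_t len)
{
    if (gpa + len > sizeof(fake_guest_ram)) {
        return NULL;
    }
    return fake_guest_ram + gpa;
}

/* One cached guest-physical-to-host-virtual mapping. */
struct tcache {
    hwaddr   gpa;    /* guest-physical base of the cached region */
    uint64_t len;    /* length of the cached region */
    uint8_t *hva;    /* host-virtual address it translates to */
    bool     valid;
};

/*
 * Fast path: if the requested range falls inside the cached region,
 * the host address is computed with plain arithmetic. On a miss, do
 * the expensive translation once and remember the result. (A real
 * implementation would cache the whole RAM block containing gpa,
 * not just the requested range.)
 */
static void *tcache_translate(struct tcache *tc, hwaddr gpa, uint64_t len)
{
    if (tc->valid && gpa >= tc->gpa && gpa + len <= tc->gpa + tc->len) {
        return tc->hva + (gpa - tc->gpa);   /* cache hit, no lookup */
    }
    tc->hva = slow_translate(gpa, len);
    tc->valid = (tc->hva != NULL);
    if (tc->valid) {
        tc->gpa = gpa;
        tc->len = len;
    }
    return tc->hva;
}

/*
 * The cache must be dropped whenever the guest memory layout can
 * change (memory hotplug, reset, before migration): this is exactly
 * the consistency/migration question raised above.
 */
static void tcache_invalidate(struct tcache *tc)
{
    tc->valid = false;
}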