2015-12-16 12:02 GMT+01:00 Michael S. Tsirkin <m...@redhat.com>:
> On Wed, Dec 16, 2015 at 11:39:46AM +0100, Vincenzo Maffione wrote:
>> 2015-12-16 10:34 GMT+01:00 Paolo Bonzini <pbonz...@redhat.com>:
>> >
>> >
>> > On 16/12/2015 10:28, Vincenzo Maffione wrote:
>> >> Assuming my TX experiments with a disconnected backend (and with CPU
>> >> dynamic frequency scaling disabled, etc.):
>> >> 1) after patches 1 and 2, the virtio bottleneck jumps from ~1 Mpps to
>> >> 1.910 Mpps.
>> >> 2) after patches 1, 2 and 3, the virtio bottleneck jumps to 2.039 Mpps.
>> >>
>> >> So I see an improvement from patch 3, and I guess it's because we avoid
>> >> an additional memory translation and the related overhead. I believe
>> >> that avoiding the memory translation is more beneficial than avoiding
>> >> the variable-sized memcpy. I'm not surprised by that, because a brief
>> >> look at what happens under the hood when you call an access_memory()
>> >> function shows a lot of operations.
>> >
>> > Great, thanks for confirming!
>> >
>> > Paolo
>>
>> No problem.
>>
>> I have some additional (orthogonal) curiosities:
>>
>> 1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
>> data structures directly accessible in host virtual memory, with the
>> guest-physical-to-host-virtual mapping done statically at setup time),
>> why isn't QEMU using this approach for virtio-net as well? I see it is
>> used by virtio-blk only.
>
> Because on Linux nothing would be gained compared to using vhost-net
> in the kernel or vhost-user with DPDK. virtio-net is there for non-Linux
> hosts, and keeping it simple is important to avoid e.g. security problems.
> Same as serial, etc.
Ok, thanks for the clarification.

>
>> 2) In any case (vring or not), QEMU dynamically maps data buffers
>> from guest physical memory for each descriptor to be processed: e1000
>> uses pci_dma_read/pci_dma_write, virtio uses
>> cpu_physical_memory_map()/cpu_physical_memory_unmap(), and vring uses
>> the more specialized vring_map()/vring_unmap(). All of these go through
>> expensive lookups and related operations to do the address translation.
>> Have you considered caching the translation result to remove this
>> bottleneck (maybe just for virtio devices)? Or is there some consistency
>> or migration-related problem that would create issues?
>> Just to give an example of what I'm talking about:
>> https://github.com/vmaffione/qemu/blob/master/hw/net/e1000.c#L349-L423.
>>
>> At very high packet rates, once notifications (kicks and interrupts)
>> have been amortized in some way, memory translation becomes the major
>> bottleneck. And this (1 and 2) is why the QEMU virtio implementation
>> cannot achieve the same throughput as bhyve does (5-6 Mpps or more,
>> IIRC).
>>
>> Cheers,
>> Vincenzo
>>
>>
>> --
>> Vincenzo Maffione

--
Vincenzo Maffione
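
[Editor's note: to make the translation-caching idea in point 2 concrete, here is a minimal, hypothetical sketch. It is not the code at the linked URL and not an existing QEMU API; only cpu_physical_memory_map()/cpu_physical_memory_unmap() are real QEMU calls, while the TranslationCache struct and cached_map() helper are illustrative names, and invalidation on memory-map changes, migration, or bounce-buffer mappings is deliberately omitted.]

/* Hypothetical sketch: remember the last guest-physical-to-host-virtual
 * translation so that consecutive descriptors falling inside the same
 * region skip the full memory-API lookup. Not actual QEMU code. */

#include "qemu/osdep.h"
#include "exec/cpu-common.h"    /* cpu_physical_memory_map()/unmap() */

typedef struct TranslationCache {
    hwaddr gpa;       /* guest-physical base of the cached region */
    hwaddr len;       /* length actually mapped */
    void *hva;        /* host-virtual address returned by the map call */
    int is_write;
    bool valid;
} TranslationCache;

/* Return a host pointer covering [gpa, gpa + len): the fast path reuses
 * the cached mapping, the slow path falls back to the normal map call. */
static void *cached_map(TranslationCache *tc, hwaddr gpa, hwaddr len,
                        int is_write)
{
    if (tc->valid && is_write == tc->is_write &&
        gpa >= tc->gpa && gpa + len <= tc->gpa + tc->len) {
        return (uint8_t *)tc->hva + (gpa - tc->gpa);
    }

    /* Slow path: release the old mapping and translate again. */
    if (tc->valid) {
        cpu_physical_memory_unmap(tc->hva, tc->len, tc->is_write, tc->len);
        tc->valid = false;
    }

    hwaddr plen = len;
    void *hva = cpu_physical_memory_map(gpa, &plen, is_write);
    if (!hva || plen < len) {
        /* Region not mapped or not physically contiguous: give up. */
        if (hva) {
            cpu_physical_memory_unmap(hva, plen, is_write, 0);
        }
        return NULL;
    }

    tc->gpa = gpa;
    tc->len = plen;
    tc->hva = hva;
    tc->is_write = is_write;
    tc->valid = true;
    return tc->hva;
}

The consistency concern raised in the email is exactly what such a cache would have to address: the cached host pointer becomes stale whenever the guest memory map changes (e.g. memory hotplug, region remapping) or across migration, so a real implementation would need invalidation hooks for those events, which this sketch does not provide.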