Hi,

I am running performance experiments to test how QEMU behaves when the guest transmits (short) network packets at very high rates, say over 1 Mpps. I use a netmap application in the guest to generate these packet rates, but that is not relevant to this discussion. The only important fact is that the generator running in the guest is not the bottleneck; in fact, its CPU utilization is low (about 20%).
Moreover, I am not using vhost-net to boost virtio-net hypervisor-side processing, because I want to do performance unit tests on the QEMU userspace virtio implementation (hw/virtio/virtio.c).

In the most common benchmarks - e.g. netperf TCP_STREAM, TCP_RR, UDP_STREAM, etc., with one end of the communication in the guest and the other in the host, for instance with the simplest TAP networking setup - the virtio-net adapter definitely outperforms the emulated e1000 adapter (and all the other emulated devices). This was expected, given the well-known benefits of I/O paravirtualization.

However, I was surprised to find that the situation changes drastically at very high packet rates. My measurements show that the emulated e1000 adapter is able to transmit over 3.5 Mpps when the network backend is disconnected (I disconnect the backend to see at what packet rate e1000 itself becomes the bottleneck). The same experiment, however, shows that virtio-net hits a bottleneck at only 1 Mpps.

Having verified that TX VQ kicks and TX VQ interrupts are properly amortized/suppressed, I found that the bottleneck is partially due to the way the code accesses the VQ in guest physical memory, since each access involves an expensive address space translation. For each VQ element processed I counted over 15 such accesses, while e1000 needs just 2 accesses to its rings.

This series slightly rewrites the code to reduce the number of accesses, since many of them seem unnecessary to me. After this reduction, the bottleneck jumps from 1 Mpps to 2 Mpps.

The patches are not complete (e.g. they still do not properly manage endianness, the code is not clean, etc.). I just wanted to ask whether you think the idea makes sense, and whether a proper patch series in this direction would be accepted.
Thanks,
  Vincenzo

CHANGELOG:
  - rebased on bonzini/dataplane branch
  - removed optimization (combined descriptor read) already present in the dataplane branch
  - 3 separate optimizations split into 3 patches

Vincenzo Maffione (3):
  virtio: cache used_idx in a VirtQueue field
  virtio: read avail_idx from VQ only when necessary
  virtio: combine write of an entry into used ring

 hw/virtio/virtio.c | 62 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 20 deletions(-)

-- 
2.6.4