2015-12-16 12:37 GMT+01:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> On 16/12/2015 11:39, Vincenzo Maffione wrote:
>> No problems.
>>
>> I have some additional (orthogonal) curiosities:
>>
>>   1) Assuming "hw/virtio/dataplane/vring.c" is what I think it is (VQ
>> data structures directly accessible in the host virtual memory, with
>> guest-physical-to-host-virtual mapping done statically at setup time),
>> why doesn't QEMU use this approach for virtio-net as well? I see it is
>> used only by virtio-blk.
>
> Yes, it's a good idea.  vring.c's roots are in the old "virtio-blk
> dataplane" code, which bypassed all of QEMU.  It had a separate event
> loop, a separate I/O queue, a separate implementation of the memory map,
> and a separate implementation of virtio.  virtio-blk dataplane is plenty
> fast, which is why there isn't a vhost-blk.
>
> Right now the only part that survives is the last one, which is vring.c.
>  The event loop has been reconciled with AioContext, the I/O queue uses
> BlockDriverState, and the memory map uses memory_region_find.
>
> virtio-net dataplane also existed, as a possible competitor of
> vhost-net.  However, vhost-net actually had better performance, so
> virtio-net dataplane was never committed.  As Michael mentioned, in
> practice on Linux you use vhost, and on non-Linux hypervisors you do
> not use QEMU. :)

Yes, I understand. However, another possible use case would be using
QEMU + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed
packet generators or packet processors, where very high packet rates
are common), where it is not possible to use vhost.

>
> Indeed the implementation in virtio.c does kind of suck.  On the other
> hand, vring.c doesn't support migration because it doesn't know how to
> mark guest memory as dirty.  And virtio.c is used by plenty of
> devices---including virtio-blk and virtio-scsi unless you enable usage
> of a separate I/O thread---and converting them to vring.c is bug-prone.
> This is why I would like to drop vring.c and improve virtio.c, rather
> than use vring.c even more.

+1

>
> The main optimization that vring.c has is to cache the translation of
> the rings.  Using address_space_map/unmap for rings in virtio.c would be
> a noticeable improvement, as your numbers for patch 3 show.  However, by
> caching translations you also conveniently "forget" to promptly mark the
> pages as dirty.  As you pointed out this is obviously an issue for
> migration.  You can then add a notifier for runstate changes.  When
> entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
> be unmapped, and then remapped the next time the VM starts running again.
>

Ok, so it seems feasible with a bit of care. The numbers we've been
seeing in various experiments have consistently shown that this
optimization could easily double the current 2 Mpps packet rate
bottleneck.
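
To make the idea concrete, here is a minimal sketch of what I
understand you are proposing (not actual QEMU code; VRingCache and all
names are invented for illustration): cache the ring translation
obtained from address_space_map(), and drop it from a VM state change
handler so that address_space_unmap() marks the pages dirty before
migration copies them.

#include "exec/memory.h"
#include "sysemu/sysemu.h"

/* Invented type: caches the host mapping of one virtqueue ring. */
typedef struct VRingCache {
    AddressSpace *as;
    hwaddr gpa;    /* guest-physical base of the ring */
    hwaddr len;    /* size of the ring in bytes */
    void *hva;     /* cached host virtual address, NULL when unmapped */
} VRingCache;

static void *vring_cache_map(VRingCache *c)
{
    if (!c->hva) {
        hwaddr len = c->len;
        /* is_write=true: the device writes to the used ring */
        c->hva = address_space_map(c->as, c->gpa, &len, true);
    }
    return c->hva;
}

static void vring_cache_unmap(VRingCache *c)
{
    if (c->hva) {
        /* access_len == len: the whole region is marked dirty here */
        address_space_unmap(c->as, c->hva, c->len, true, c->len);
        c->hva = NULL;
    }
}

static void vring_vmstate_change(void *opaque, int running, RunState state)
{
    VRingCache *c = opaque;

    if (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_SAVE_VM) {
        vring_cache_unmap(c);  /* remapped lazily on the next access */
    }
}

/* At setup: qemu_add_vm_change_state_handler(vring_vmstate_change, c); */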

> You also guessed right that there are consistency issues; for these you
> can add a MemoryListener that invalidates all mappings.

Yeah, but I don't know exactly what kind of inconsistencies there can
be. Maybe the memory we are mapping could be hot-unplugged?
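
If I understand correctly, something like the following would cover
that case (again a sketch with invented names, assuming the VRingCache
above grows a "MemoryListener listener;" member):

/* Sketch: drop the cached translation whenever the memory map changes
 * (e.g. on memory hot-unplug); the next access remaps through exec.c. */
static void vring_cache_memory_commit(MemoryListener *listener)
{
    VRingCache *c = container_of(listener, VRingCache, listener);

    /* Called after every memory-map transaction commits; be
     * conservative and invalidate unconditionally. */
    vring_cache_unmap(c);
}

/* At setup:
 *     c->listener.commit = vring_cache_memory_commit;
 *     memory_listener_register(&c->listener, c->as);
 */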

>
> That said, I'm wondering where the cost of address translation lies---is
> it cache-unfriendly data structures, locked operations, or simply too
> much code to execute?  It was quite surprising to me that on virtio-blk
> benchmarks we were spending 5% of the time doing memcpy! (I have just
> extracted from my branch the patches to remove that, and sent them to
> qemu-devel).

I feel it's just too much code (but I may be wrong).
I'm not sure whether you are thinking that 5% is too much or too little.
To me it's too little, showing that most of the overhead is somewhere
else (e.g. memory translation, or backend processing). In an ideal
transmission system, most of the time should be spent on copying,
because it means that you have successfully managed to suppress
notification and translation overhead.

I've tried out your patches, but unfortunately I don't see any effect
on my transmission tests (fast transmission over virtio-net with
disconnected backend).

>
> Examples of missing optimizations in exec.c include:
>
> * caching enough information in RAM MemoryRegions to avoid the calls to
> qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
>
> * adding a MRU cache to address_space_lookup_region.
>
> In particular, the former should be easy if you want to give it a
> try---easier than caching ring translations in virtio.c.
>
> Paolo

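If I understand the second suggestion correctly, the MRU idea would
look roughly like this (a toy sketch, not exec.c code; Section, mru
and slow_lookup() are invented names):

#include <stdint.h>

/* One-entry MRU cache in front of an expensive lookup; the real
 * change would live in address_space_lookup_region() in exec.c. */
typedef struct Section {
    uint64_t base;   /* start of the section in the address space */
    uint64_t size;   /* length of the section */
    /* ... translation data ... */
} Section;

static Section *mru;                 /* last section returned */

Section *slow_lookup(uint64_t addr); /* the radix-tree walk (expensive) */

static Section *lookup(uint64_t addr)
{
    /* unsigned wraparound makes addr < base fall through to a miss */
    if (mru && addr - mru->base < mru->size) {
        return mru;                  /* hit: skip the tree walk */
    }
    mru = slow_lookup(addr);         /* miss: walk and remember */
    return mru;
}

The first suggestion (caching a RAMBlock pointer in the MemoryRegion)
would similarly replace the qemu_get_ram_block() search with a direct
pointer dereference.
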
Thank you so much for the insights :)

Cheers,
  Vincenzo
