Daniel P. Berrangé <berra...@redhat.com> wrote:
> On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
>> Zero-copy multifd migration sends both the header and the memory pages in a
>> single syscall. Since it's necessary to flush before reusing a header, a
>> header array was implemented, so each write call uses a different entry of
>> the array, and flushing only takes place after all headers have been used,
>> meaning 1 flush for every N writes.
>> 
>> This method has a bottleneck, though: after the last write, a flush has to
>> wait for all previous writes to finish, which can be a lot of them, meaning
>> the recvmsg() syscall called in qio_channel_socket_flush() will be called
>> many times. On top of that, it creates a period when the I/O queue is
>> empty and nothing is getting sent: between the flush and the next write.
>> 
>> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
>> until at most half of the array is still in use (i.e. the LRU half of the
>> array can be reused).
>> 
>> Flushing only down to the LRU half of the array is much faster, since it
>> does not have to wait for the most recent writes to finish, making up for
>> having to flush twice per array.
>> 
>> As a main benefit, this approach keeps the I/O queue from going empty while
>> there is still data to be sent, making it easier to keep the I/O at maximum
>> throughput while consuming less CPU time.
>
> Doesn't this defeat the reason for adding the flush in the first
> place, which was to ensure that a migration iteration was fully
> sent before starting the next iteration over RAM? If it is OK to
> only partially flush on each iteration, then why do we need to
> flush at all?

Now we need to do the flush in two places:
- on sync_main (the old place)
- on the migration thread, when we run out of array entries.
  This one has nothing to do with correctness; it is only there to free up
  array entries when we need more space (rough sketch of both flush sites
  below).

So, in this regard, I think that the patches are correct.
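
Just so we are all picturing the same thing, this is roughly how I read the
two flush sites.  It is only a sketch, not the actual code: PacketHeader,
channel_write_one() and channel_flush() are made-up stand-ins for the real
multifd / qio_channel helpers, and max_pending is the new parameter that the
series adds to qio_channel_flush().

#include <stddef.h>

#define HDR_ARRAY_LEN 1024               /* entries per channel in the series */

typedef struct PacketHeader { unsigned char raw[128]; } PacketHeader;

typedef struct Channel {
    PacketHeader hdr[HDR_ARRAY_LEN];     /* one header per in-flight write */
    unsigned next;                       /* next header slot to hand out   */
} Channel;

/* Stand-ins: submit one zero-copy write that uses @hdr, and wait until at
 * most @max_pending of this channel's zero-copy writes are still pending. */
void channel_write_one(Channel *c, PacketHeader *hdr, const void *pages,
                       size_t len)
{
    (void)c; (void)hdr; (void)pages; (void)len;  /* real code: writev + MSG_ZEROCOPY */
}

void channel_flush(Channel *c, unsigned max_pending)
{
    (void)c; (void)max_pending;                  /* real code: drain the error queue */
}

/* Migration-thread path: flush only at half-array boundaries, and only far
 * enough that the half we are about to reuse is already on the wire.      */
void channel_send_packet(Channel *c, const void *pages, size_t len)
{
    if (c->next % (HDR_ARRAY_LEN / 2) == 0) {
        channel_flush(c, HDR_ARRAY_LEN / 2);
    }
    channel_write_one(c, &c->hdr[c->next], pages, len);
    c->next = (c->next + 1) % HDR_ARRAY_LEN;
}

/* sync_main path: full flush, nothing may still be in flight before the
 * next RAM iteration starts -- this is the flush that matters for
 * correctness. */
void channel_sync(Channel *c)
{
    channel_flush(c, 0);                 /* max_pending == 0: wait for everything */
}

The flush in channel_sync() is the one Daniel is asking about, and it is
still a full flush; the one in channel_send_packet() only exists to recycle
header slots, so relaxing it does not weaken what sync_main guarantees.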

But on the other hand, I am not sure that I like the size of the array.
Leonardo is using 1024 entries for each migration channel.  At 512 KB per
packet, that allows 1024 entries * 512 KB = 512 MB of unflushed data in
each channel.  I think that this is still too much.

I will need some data from testing, but my understanding of how zero copy
works is that having around 10 MB in flight in each channel would be more
than enough to saturate the link.  And once the amount of data in flight is
that small, we can just flush whenever we run out of entries.
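
(Back-of-envelope numbers, which I am assuming rather than measuring: a
100 Gbit/s link moves about 12.5 GB/s, so with a round-trip time of about
1 ms there is only around 12.5 GB/s * 1 ms = 12.5 MB of data in flight on
the whole link at any moment.  Split across several channels, each channel's
share is well below 10 MB, which is why I expect ~10 MB per channel to be
plenty.)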

My idea here was to work out the size the other way around: add a parameter
where the user says how much memory is available for mlocking, and just give
each channel memory/channels worth of array entries (rough sketch after
points a and b).  That will:

a - limit the amount of mlocked memory that we need.
    10 MB/channel for 10 channels is 100 MB of mlocked memory, which looks
    reasonable for a guest that has many gigabytes of RAM.

b - We don't synchronize after each write, because I still claim that
    doing a non-asynchronous write on the channel just syncs everything
    (otherwise I can't see how the hardware can even work).
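
Roughly, the sizing I have in mind would look like this.  Again only a
sketch: "zerocopy_mlock_limit" is a hypothetical parameter name, and
MULTIFD_PACKET_SIZE is just the 512 KB packet size mentioned above.

#include <stddef.h>

#define MULTIFD_PACKET_SIZE (512 * 1024)   /* bytes per multifd packet */

/* How many header-array entries each channel gets, given a user-supplied
 * limit on how much memory may be mlocked for zero-copy sends.           */
size_t entries_per_channel(size_t zerocopy_mlock_limit, unsigned channels)
{
    size_t per_channel_bytes = zerocopy_mlock_limit / channels;
    size_t entries = per_channel_bytes / MULTIFD_PACKET_SIZE;

    return entries ? entries : 1;          /* always leave room for one write */
}

For example, a 100 MB limit over 10 channels gives 10 MB per channel, i.e.
20 in-flight 512 KB packets per channel, instead of the 1024 entries
(512 MB per channel) in the current series.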

So I guess that the best way to decide this is:
- get a couple of nice 100 Gigabit networking cards
- enable 16 channels or so, so we know that the CPU is not going to be
  the bottleneck
- test with this patch
- remove the last two patches and test with several array sizes
  (10, 20, 30, ...) and we will see that after some size, performance
  does not improve at all.
- take that value as the default one.

What do you think?

Later, Juan.

PD.  Yes, I don't like to add another parameter, so it could be something
     you change by recompiling with different values, or we just drop the
     parameter once we find the right value.

