On Tue, 2022-10-25 at 12:07 +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berra...@redhat.com> wrote:
> > On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> > > Zero-copy multifd migration sends both the header and the memory pages
> > > in a single syscall. Since it's necessary to flush before reusing the
> > > header, a header array was implemented, so each write call uses a
> > > different array entry, and flushing only takes place after all headers
> > > have been used, meaning 1 flush for each N writes.
> > > 
> > > This method has a bottleneck, though: after the last write, a flush
> > > will have to wait for all writes to finish, which will be a lot of
> > > them, meaning the recvmsg() syscall called in qio_channel_socket_flush()
> > > will be called many times. On top of that, it creates a period, between
> > > the flush and the next write, when the I/O queue is empty and nothing
> > > is being sent.
> > > 
> > > To avoid that, use qio_channel_flush()'s new max_pending parameter to
> > > wait until at most half of the array is still in use (i.e. the LRU
> > > half of the array can be reused).
> > > 
> > > Flushing for the LRU half of the array is much faster, since it does
> > > not have to wait for the most recent writes to finish, making up for
> > > having to flush twice per array.
> > > 
> > > As a main benefit, this approach keeps the I/O queue from being empty
> > > while there is still data to be sent, making it easier to sustain
> > > maximum I/O throughput while consuming less cpu time.
> > 
> > Doesn't this defeat the reason for adding the flush in the first
> > place, which was to ensure that a migration iteration was fully
> > sent before starting the next iteration over RAM? If it is OK to
> > only partially flush on each iteration, then why do we need to
> > flush at all?
> 
> Now we need to do the flush in two places:
> - on sync_main (the old place)
> - on the migration thread, when we run out of array entries.
>   This one has nothing to do with correctness; it is just that we have
>   more space than needed.

That's correct! In fact, sync_main (the old place) will call flush with
max_remaining = 0, meaning it will only return when there is nothing else
pending.
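
For illustration, here is a minimal, self-contained sketch of those two
flush points. It only models the behaviour described above (a full flush
at sync_main, and a flush down to half the array when the migration thread
runs out of header entries); every name in it is made up for the example,
and channel_flush(max_pending) merely mimics the qio_channel_flush()
variant this series proposes. It is not the actual QEMU code.

/* flush_sketch.c -- illustrative model only, not QEMU code. */
#include <stdio.h>

#define HDR_ARRAY_LEN 1024

typedef struct { long unused; } Header;

static Header header_array[HDR_ARRAY_LEN];
static int next_hdr;   /* next header slot to hand out */
static int pending;    /* zero-copy sends queued but not yet completed */

/* Stand-in for the zero-copy send of header + pages in one syscall. */
static void zero_copy_send(Header *hdr, const void *pages)
{
    (void)hdr;
    (void)pages;
    pending++;
}

/* Stand-in for "flush with max_pending": wait (in reality by reading
 * completion notifications with recvmsg()) until at most max_pending
 * sends remain in flight. */
static void channel_flush(int max_pending)
{
    if (pending > max_pending) {
        pending = max_pending;   /* pretend the oldest sends completed */
    }
}

/* Migration thread: one multifd packet per call. */
static void send_one_packet(const void *pages)
{
    zero_copy_send(&header_array[next_hdr], pages);
    next_hdr = (next_hdr + 1) % HDR_ARRAY_LEN;

    if (next_hdr % (HDR_ARRAY_LEN / 2) == 0) {
        /* One half of the array has just been used up: wait until at
         * most half of the sends are still pending, which guarantees
         * the LRU half is reusable while the newest half keeps the
         * I/O queue busy.  This is the "flush twice per array" case. */
        channel_flush(HDR_ARRAY_LEN / 2);
    }
}

/* sync_main (the old place): nothing may remain pending afterwards. */
static void sync_main_flush(void)
{
    channel_flush(0);
}

int main(void)
{
    static char pages[512 * 1024];   /* stand-in for one 512 KB packet */

    for (int i = 0; i < 5000; i++) {
        send_one_packet(pages);
    }
    sync_main_flush();
    printf("pending after full flush: %d\n", pending);
    return 0;
}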
> So, in this regard, I think that the patches are correct.
> 
> But on the other hand, I am not sure that I like the size of the array.
> Leonardo is using 1024 entries for each migration channel. That means
> it is allowed to have 1024 entries * 512 KB per packet = 512 MB of
> unflushed data in each channel. I think that this is still too much.

The size is correct, meaning we need to allow 512 MB of locked memory per
multifd channel.

> I will need some data from testing, but my understanding of how zero
> copy works is that having around 10 MB in each channel would be more
> than enough to saturate the link. And once the data in flight is
> smaller, we can just flush when we run out of entries.
> 
> My idea here was to work out the size the other way around: add a
> parameter where the user says how much memory is available for
> mlocking, and just give each channel memory/channels array entries.
> That will:
> 
> a - Limit the amount of mlocked memory that we need.
>     10 MB/channel for 10 channels is 100 MB of mlocked memory, which
>     looks reasonable for a guest that has many gigabytes of RAM.
> 
> b - We don't synchronize after each write, because I still claim that
>     doing a non-asynchronous write on the channel just syncs everything
>     (otherwise I can't see how the hardware could even work).

So the user sets the available locked memory and we split it between the
number of channels during migration. Seems a good strategy, since it will
exploit the hardware well regardless of the channel count.

> So I guess that the best way to decide this is:
> - Get a couple of nice 100 Gigabit networking cards.
> - Enable 16 channels or so, so we know that the CPU is not going to be
>   the bottleneck.
> - Test with this patch.
> - Remove the last two patches and test with several array sizes
>   (10, 20, 30, ...); we will see that after some size, performance does
>   not improve at all.
> - Take that value as the default.
> 
> What do you think?

Most of it is good. I only disagree on removing the last two patches.

I did some tests with smaller array sizes, and without the last two
patches performance was totally destroyed. I blame the fact that once
every N writes the sender has to wait for the queue to become completely
empty before it can add more data. Also, waiting for the completion of a
send that was just scheduled takes multiple recvmsg() syscalls, raising
cpu usage.

Other than that, it seems a good strategy.

Thank you both for reviewing!

Best regards,
Leo

> Later, Juan.
> 
> PD. Yes, I don't like adding another parameter, so either you can
> recompile with different values, or we will just not add the parameter
> once we find the right value.
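
For reference, a back-of-the-envelope sketch of the locked-memory sizing
discussed above. The budget-splitting helper and its names are
hypothetical and not part of any posted patch; only the 512 KB packet
size and the example figures (1024 entries per channel, a 100 MB budget
over 10 channels) come from the thread.

/* sizing_sketch.c -- the locked-memory arithmetic from the discussion. */
#include <stdio.h>

#define PACKET_SIZE (512L * 1024)   /* one multifd packet, 512 KB */

/* Current series: a fixed number of header entries per channel. */
static long locked_bytes_per_channel(long entries)
{
    return entries * PACKET_SIZE;
}

/* Proposed alternative: the user supplies an overall mlock budget and it
 * is split evenly across the multifd channels. */
static long entries_per_channel(long locked_budget, int channels)
{
    return locked_budget / channels / PACKET_SIZE;
}

int main(void)
{
    /* 1024 entries * 512 KB = 512 MB of lockable data per channel. */
    printf("current: %ld MB locked per channel\n",
           locked_bytes_per_channel(1024) >> 20);

    /* A 100 MB budget over 10 channels -> 10 MB, i.e. 20 entries each. */
    printf("budgeted: %ld entries per channel\n",
           entries_per_channel(100L << 20, 10));
    return 0;
}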