On 24.04.2024 00:27, Peter Xu wrote:
On Tue, Apr 23, 2024 at 06:14:18PM +0200, Maciej S. Szmigiero wrote:
We don't lose any genericity since by default the transfer is done via
mixed RAM / device state multifd channels from a shared pool.

It's only when x-multifd-channels-device-state is set to a value > 0 that
the requested number of multifd channels gets dedicated to device state.

It could be seen as a fine-tuning option for cases where tests show that
it provides some benefit to a particular workload - just like many
other existing migration options.
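
As a purely illustrative sketch of that semantics (the helper below and the
dedicated-count getter are hypothetical names, not the actual patch code):
with a 15/4 setting the first 11 channels would stay in the shared
RAM / device state pool and the last 4 would carry device state only.

static bool multifd_channel_is_device_state_only(int channel_id)
{
    /* total number of multifd channels, e.g. 15 */
    int total = migrate_multifd_channels();
    /* hypothetical getter for x-multifd-channels-device-state, e.g. 4 */
    int dedicated = migrate_multifd_channels_device_state();

    /* the last 'dedicated' channels are reserved for device state */
    return channel_id >= total - dedicated;
}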

A 14% downtime improvement is too much to waste - I'm not sure it's only
due to avoiding RAM syncs; it's possible that there are other subtle
performance interactions too.

For even more genericity this option could be named like
x-multifd-channels-map and contain an array of channel settings like
"ram,ram,ram,device-state,device-state".
Then possible future other uses of multifd channels wouldn't even need
a new dedicated option.
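
Just to illustrate the idea (x-multifd-channels-map does not exist today;
the parser below and all of its names are made up), such a map could be
turned into per-channel roles roughly like this:

typedef enum {
    MULTIFD_CHANNEL_RAM,
    MULTIFD_CHANNEL_DEVICE_STATE,
} MultiFDChannelRole;

/*
 * Parse e.g. "ram,ram,ram,device-state,device-state" into roles[].
 * Needs <string.h> for strcspn()/strncmp().
 */
static int parse_multifd_channels_map(const char *map,
                                      MultiFDChannelRole *roles,
                                      int max_channels)
{
    int n = 0;

    while (*map && n < max_channels) {
        size_t len = strcspn(map, ",");

        if (len == 3 && !strncmp(map, "ram", len)) {
            roles[n++] = MULTIFD_CHANNEL_RAM;
        } else if (len == 12 && !strncmp(map, "device-state", len)) {
            roles[n++] = MULTIFD_CHANNEL_DEVICE_STATE;
        } else {
            return -1; /* unknown channel role */
        }
        map += len;
        if (*map == ',') {
            map++;
        }
    }
    return n; /* number of channels described by the map */
}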

Yeah, I understand such an option would only provide more choices.

However, once such an option gets introduced, users will start to do their
own "optimizations" on how to provision the multifd channels, and IMHO
it'll be great if we as developers can be crystal clear on why it needs to
be introduced in the first place, rather than making all channels open to
all purposes.

So I don't think I'm strongly against such a parameter, but I want to double
check we really understand what's behind this to justify such a parameter.
Meanwhile I'd always be pretty cautious about introducing any migration
parameters, due to the compatibility nightmares.  The fewer parameters the
better.

Ack, I am also curious why dedicated device state multifd channels bring
such a downtime improvement.



I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
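
A rough, simplified sketch of what that means on the send side (not the
actual QEMU code; the device-state-only check is the hypothetical helper
from above) - only the RAM-carrying channels take part in the sync round,
so the dedicated channels never have to stop for it:

static void multifd_send_sync_ram_channels(void)
{
    for (int i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        /* hypothetical helper: dedicated device state channels skip the sync */
        if (multifd_channel_is_device_state_only(i)) {
            continue;
        }

        qemu_mutex_lock(&p->mutex);
        p->flags |= MULTIFD_FLAG_SYNC;  /* ask this channel to emit a sync packet */
        qemu_mutex_unlock(&p->mutex);
        qemu_sem_post(&p->sem);
    }
    /* ... then wait for each participating channel to ack the sync ... */
}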

Firstly, I'm wondering whether we can have better names for these new
hooks.  Currently (only comment on the async* stuff):

    - complete_precopy_async
    - complete_precopy
    - complete_precopy_async_wait

But perhaps better:

    - complete_precopy_begin
    - complete_precopy
    - complete_precopy_end

?

As I don't see why the device must do something with async in such a hook.
To me it's more like you're splitting one process into multiple, then
begin/end sounds more generic.

Ack, I will rename these hooks to begin/end.
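
For clarity, a sketch of how the renamed hooks could look in SaveVMHandlers
(illustrative only - the begin/end names follow the suggestion above, and
the exact signatures may end up differing in the actual patches):

struct SaveVMHandlers {
    /* ... existing handlers ... */
    int (*save_live_complete_precopy_begin)(QEMUFile *f, void *opaque);
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
    int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
    /* ... */
};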

Then, with that in mind, IIUC we can already split ram_save_complete()
into >1 phases too. For example, I would be curious whether the performance
will go back to normal if we offload multifd_send_sync_main() into the
complete_precopy_end(), because we really only need one shot of that, and I
am quite surprised it already greatly affects VFIO dumping its own things.

AFAIK there's already just one multifd_send_sync_main() during downtime -
the one called from save_live_complete_precopy SaveVMHandler.

In order to truly never interfere with device state transfer the sync would
need to be ordered after the device state transfer is complete - that is,
after the VFIO complete_precopy_end (complete_precopy_async_wait) handler
returns.

Do you think it'll be worthwhile to give it a shot, even if we can't decide
yet on the order of end()s to be called?

Upon closer inspection it looks like there are, in fact, *two*
RAM syncs done during the downtime - besides the one at the end of
ram_save_complete() there's another one in find_dirty_block(). This function
is called earlier from ram_save_complete() -> ram_find_and_save_block().

Unfortunately, skipping that intermediate sync in find_dirty_block() and
moving the one from the end of ram_save_complete() to another SaveVMHandler
that's called only after VFIO device state transfer doesn't actually
improve downtime (at least not on its own).
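
For reference, this is roughly what the experiment looked like (a minimal
sketch assuming the begin/end hook split, with signatures simplified - not
the actual patch): the sync is dropped from ram_save_complete() and issued
once from a new end hook instead, so it only runs after VFIO has finished
sending its device state.

static int ram_save_complete(QEMUFile *f, void *opaque)
{
    /*
     * ... flush the remaining dirty RAM via ram_find_and_save_block(),
     * with the intermediate sync in find_dirty_block() skipped ...
     */

    /* multifd_send_sync_main() intentionally NOT called here anymore */
    return 0;
}

static int ram_complete_precopy_end(QEMUFile *f, void *opaque)
{
    /* single multifd sync, ordered after the device state transfer */
    multifd_send_sync_main();
    return 0;
}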

It'll be great if we could look into these issues instead of workarounds,
and figure out what is affecting the performance behind the scenes, and also
whether that can be fixed without such a parameter.

I've been looking at this and added some measurements around device state
queuing for submission in multifd_queue_device_state().
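
For context, the measurement was done with instrumentation along these lines
(multifd_queue_device_state() is from this patch set; the exact signature
assumed below and the timing wrapper itself are illustrative only):

static int64_t total_queue_time_ns;

/* wrapper that accumulates how long queuing device state for multifd takes */
static int multifd_queue_device_state_timed(char *idstr, uint32_t instance_id,
                                            char *data, size_t len)
{
    int64_t start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    int ret = multifd_queue_device_state(idstr, instance_id, data, len);

    total_queue_time_ns += qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start;
    return ret;
}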

To my surprise, the mixed RAM / device state config of 15/0 has a *much*
lower total queuing time of around 100 msec, compared to the dedicated
device state channels 15/4 config with a total queuing time of around
300 msec.

Despite that, the 15/4 config still has significantly lower overall
downtime.

This means that any reason for the downtime difference is probably on
the receive / load side of the migration rather than on the save /
send side.

I guess the reason for the lower device state queuing time in the 15/0
case is that this data could be sent via any of the 15 multifd channels
rather than just the 4 dedicated ones in the 15/4 case.

Nevertheless, I will continue to look at this problem to at least find
some explanation for the difference in downtime that dedicated device
state multifd channels bring.

Thanks,
Maciej

