On 27/02/2023 19:43, Alex Williamson wrote:
On Mon, 27 Feb 2023 13:26:00 -0400
Jason Gunthorpe <j...@nvidia.com> wrote:
On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
But we have no requirement to send all init_bytes before stop-copy.
This is a hack to achieve a theoretical benefit that a driver might be
able to improve the latency on the target by completing another
iteration.
I think this is another half-step at this point.
The goal is to not stop the VM until the target VFIO driver has
completed loading initial_bytes.
This signals that the time consuming pre-setup is completed in the
device and we don't have to use downtime to do that work.
We've measured this in our devices and the time-shift can be
significant: on the order of seconds removed from the downtime
period.
Stopping the VM before this pre-setup is done is simply extending the
stopped VM downtime.
Really what we want is to have the far side acknowledge that
initial_bytes has completed loading.
As a reminder, what mlx5 is doing here with precopy is time-shifting work,
not data. We want to put expensive work (i.e. time) into the period when
the VM is still running and have less downtime.
This challenges the assumption built into QEMU that all data has equal
time cost and that it can estimate downtime simply by scaling the estimated
data. We have a data-size independent time component to deal with as
well.
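
To make the distinction concrete, here is a minimal sketch of the two
estimates. It is an illustration only, not QEMU code: the function names,
the fixed_setup_ms parameter and the initial_bytes_loaded flag are all
hypothetical, and the units (bytes per millisecond) are chosen just to keep
the arithmetic simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: downtime estimated purely by scaling the pending
 * data by the measured bandwidth (bytes per millisecond).
 */
static uint64_t est_downtime_ms_scaled(uint64_t pending_bytes,
                                       uint64_t bw_bytes_per_ms)
{
    assert(bw_bytes_per_ms > 0);
    return pending_bytes / bw_bytes_per_ms;
}

/*
 * Hypothetical model: same estimate plus a data-size independent setup
 * cost that is only paid during downtime if the target has not finished
 * loading initial_bytes before the VM is stopped.
 */
static uint64_t est_downtime_ms_with_setup(uint64_t pending_bytes,
                                           uint64_t bw_bytes_per_ms,
                                           uint64_t fixed_setup_ms,
                                           bool initial_bytes_loaded)
{
    uint64_t t = est_downtime_ms_scaled(pending_bytes, bw_bytes_per_ms);

    if (!initial_bytes_loaded) {
        t += fixed_setup_ms;
    }
    return t;
}
```

With the same pending data, the second estimate differs by the whole setup
cost depending on whether the pre-setup finished while the VM was running,
which a pure data-scaling estimate cannot express.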
As I mentioned before, I understand the motivation, but imo the
implementation is exploiting the interface it extended in order to force
a device driven policy which is specifically not a requirement of the
vfio migration uAPI. It sounds like there's more work required in the
QEMU migration interfaces to properly factor this information into the
algorithm. Until then, this seems like a follow-on improvement unless
you can convince the migration maintainers that providing false
information in order to force another pre-copy iteration is a valid use
of passing the threshold value to the driver.
In my previous message I suggested dropping this exploit and instead
changing the QEMU migration API to introduce the concept of
pre-copy initial bytes -- data that must be transferred before the source
VM stops (unlike the current @must_precopy, which represents
data that can be transferred even when the VM is stopped).
We could do it by adding a new parameter "init_precopy_size" to the
state_pending_{estimate,exact} handlers and every migration user could
use it (RAM, block, etc).
We will also change the migration algorithm to take this new parameter
into account when deciding to move to stop-copy.
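
Roughly, the check could look like the sketch below. This is only an
illustration of the idea under simplifying assumptions, not the existing
QEMU logic: the function name and the way the sizes are passed in are
hypothetical, and "init_precopy_size" is the proposed parameter, summed
over all migration users.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical decision helper (not an existing QEMU function).
 *
 * must_precopy:      pending data that may still be sent after the VM stops
 * init_precopy_size: proposed new value, data that must be sent while the
 *                    source VM is still running
 * threshold_size:    amount of data assumed to fit in the allowed downtime
 */
static bool can_switch_to_stop_copy(uint64_t must_precopy,
                                    uint64_t init_precopy_size,
                                    uint64_t threshold_size)
{
    /* Never stop the VM while initial pre-copy data is outstanding. */
    if (init_precopy_size > 0) {
        return false;
    }
    /* Otherwise fall back to the usual size-versus-threshold comparison. */
    return must_precopy < threshold_size;
}
```

The point of keeping init_precopy_size separate is that it gates the
transition on its own, regardless of how small the remaining data is, so no
user has to misreport its pending size to get another pre-copy iteration.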
Of course this will have to be approved by migration maintainers first,
but if it's done in a standard way such as above, via the migration API,
would it be OK by you to go this way?
Thanks.