On Tue, Jul 25, 2023 at 05:09:57PM +0100, Daniel P. Berrangé wrote:
> On Tue, Jul 25, 2023 at 11:54:52AM -0400, Peter Xu wrote:
> > We can make the semantics specific, no strong opinion here.  I wished it
> > can be as generic / easy as possible but maybe I went too far.
> > 
> > Though, is there anything else we can choose from besides
> > "max-convergence-bandwidth"? Or am I the only one that thinks it's hard to
> > understand when put "max" and "convergence" together?
> > 
> > When I take one step back to look at the whole "bandwidth" parameters, I am
> > not sure why we'd even need both "convergence" and "postcopy" bandwidth
> > being separate.  With my current understanding of migration, we may
> > actually need:
> > 
> >   - One bandwidth that we may want to run the background migration, aka,
> >     precopy migration, where we don't rush on pushing data.
> > 
> >   - One bandwidth that is whatever we can have maximum; for dedicated NIC
> >     that's the line speed.  We should always use this full speed for
> >     important things.  I'd say postcopy falls into this, and this
> >     "convergence" calculation should also rely on this.
> 
> I don't think postcopy should be assumed to run at line speed.
> 
> At the point where you flip to post-copy mode, there could
> conceivably still be GB's worth of data still dirty and
> pending transfer.
> 
> The migration convergance step is reasonable to put at line
> speed, because the max downtime parameter caps how long this
> burst will be, generally to some fraction of a second.
> 
> Once in post-copy mode, while the remaining data to transfer
> is finite, the wall clock time to complete that transfer may
> still be huge. It is unreasonable to assume users want to
> run at max linespeed for many minutes to finish post-copy
> at least in terms of the background transfer. You could make
> a  case for the page fault handling to run at a higher bandwidth
> cap than the background transfer, but I think it is still probably
> not reasonable to run page fault fetches at line speed by default.
> 
> IOW, I don't think we can put the same bandwidth limit on the
> short convergance operation, as on the longer post-copy operation.

Postcopy still heavily affects the performance of the VM for its whole
duration, and afaiu that's so far the major reason (after we fix postcopy
interruptions with recovery capability) why postcopy may not be wanted in
many cases.

If I were the admin, I'd want it to run at full speed even for pages that
were not directly requested, just to shrink the duration of postcopy; I'd
only want to make sure the requested pages are queued sooner.

But that's okay if any of us still thinks that three values would be
helpful here, because we can simply set the latter two to the same value
when we want.  Three is a superset of two anyway.
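
To make the "superset" point concrete, here is a rough sketch of how three
values could collapse into two.  Everything in it is an assumption for
illustration only: the class, field and function names are made up and are
not QEMU's actual parameters or code, and the downtime estimate is simply
pending bytes divided by the convergence bandwidth.

```python
from dataclasses import dataclass
from typing import Optional

MiB = 1024 * 1024


@dataclass
class BandwidthLimits:
    """Hypothetical holder for the three limits, all in bytes per second."""
    precopy: int                    # background precopy transfer, not rushed
    convergence: int                # assumed bandwidth for the final switchover burst
    postcopy: Optional[int] = None  # None means "reuse the convergence value"

    def effective_postcopy(self) -> int:
        # With postcopy unset, three values degrade to the two-value model.
        return self.convergence if self.postcopy is None else self.postcopy


def expected_downtime_ms(pending_bytes: int, limits: BandwidthLimits) -> float:
    """Estimated time to flush the remaining dirty data at the convergence bandwidth."""
    return pending_bytes * 1000 / limits.convergence


def can_converge(pending_bytes: int, limits: BandwidthLimits,
                 downtime_limit_ms: int) -> bool:
    """True if the final burst is expected to fit within the downtime limit."""
    return expected_downtime_ms(pending_bytes, limits) <= downtime_limit_ms


# Two-value setup: postcopy follows the convergence bandwidth.
two = BandwidthLimits(precopy=100 * MiB, convergence=1000 * MiB)
# Three-value setup: postcopy gets its own, lower cap.
three = BandwidthLimits(precopy=100 * MiB, convergence=1000 * MiB,
                        postcopy=500 * MiB)
```

Whoever wants two values leaves postcopy unset, and whoever wants a separate
postcopy cap sets it explicitly; the convergence check is the same either
way.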

I see you used "convergance" explicitly even after PeterM's reply, is that
what you prefer over "convergence"?  I do see more occurrences of
"convergence" as a word in migration context, though.  Any better name you
can come up with, before I just go with "max-convergence-bandwidth" (I
really cannot come up with anything better than this or available-bandwidth
for now)?

Thanks,

-- 
Peter Xu

