On Wed, Aug 7, 2024 at 12:45 AM Peter Xu <pet...@redhat.com> wrote:

> On Mon, Aug 05, 2024 at 03:03:27PM +0800, Yong Huang wrote:
> > Sorry for the late reply.
> >
> > On Wed, Jul 31, 2024 at 4:01 AM Peter Xu <pet...@redhat.com> wrote:
> >
> > > On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote:
> > > > Currently, the convergence algorithm determines that the migration
> > > > cannot converge according to the following principle:
> > > > The dirty pages generated in the current iteration exceed a specific
> > > > percentage (throttle-trigger-threshold, 50 by default) of the number
> > > > of bytes transmitted. Let's refer to this criterion as the
> > > > "transmission speed". If this criterion is met two or more times
> > > > (dirty_rate_high_cnt >= 2), the throttle percentage needs to be
> > > > increased.
> > > >
> > > > In most cases, the above implementation is appropriate. However, for
> > > > a VM with huge memory and a heavy memory workload, each iteration is
> > >
> > > When you're talking about huge memory, I do agree the current throttle
> > > logic doesn't look like it scales, because migration_trigger_throttle()
> > > is only called once per iteration, so it won't be invoked for a long
> > > time if one iteration takes a long time.
> > >
> > > I wonder whether you considered fixing that for the huge-VM case in
> > > some way, so that we may need to pay more for the sync overhead, but
> > > maybe the throttle logic can still get scheduled from time to time.
> >
> >
> > > E.g., on a 10TB system with even a 100Gbps network, one iteration can
> > > take:
> > >
> > >   10TB / ~10GB/s = ~1000 seconds = ~17min
> > >
> > > It means migration_trigger_throttle() will only be invoked once every
> > > 17 mins.
> >
> >
> > Agree. There is another case as well: a VM running at a high dirty rate
> > while migrating similarly makes each iteration take a long time.
> >
> >
> > > > time-consuming. The VM's computing performance may be throttled at a
> > > > high percentage for a long time due to the repeated confirmation
> > > > behavior, which may be intolerable for some computationally sensitive
> > > > software in the VM. The refinement targets this scenario.
> > > >
> > > > As the comment in the migration_trigger_throttle function mentions,
> > > > the original algorithm confirms the criterion repeatedly in order to
> > > > avoid erroneous detection. Put differently, once the detection
> > > > becomes more reliable, it does not need to be confirmed twice.
> > > >
> > > > In the refinement, in order to make the detection more accurate, we
> > > > introduce another criterion, called the "valid transmission ratio",
> > > > to determine migration convergence. The "valid transmission ratio"
> > > > is the ratio of bytes_xfer_period to bytes_dirty_period, which
> > > > actually describes the migration efficiency. For example, if an
> > > > iteration transfers 10 GB while the guest dirties 8 GB, the ratio is
> > > > 1.25; if the next iteration's ratio drops below that, the migration
> > > > is making less progress per byte sent.
> > > >
> > > > When the algorithm repeatedly detects that the current iteration's
> > > > "valid transmission ratio" is lower than the previous iteration's,
> > > > it determines that the migration cannot converge.
> > > >
> > > > For the "transmission speed" and the "valid transmission ratio", if
> > > > either of the two criteria is met, the penalty percentage is
> > > > increased. This saves the time of the entire iteration and therefore
> > > > reduces the duration of VM performance degradation.
> > > >
> > > > In conclusion, this refinement significantly reduces the time the
> > > > throttle percentage needs to step up to its maximum while the VM is
> > > > under a heavy memory load.
> > > >
> > > > The test environment is as follows:
> > > >
> > > > ----------------------------------------------------------------------
> > > > a. Test tool:
> > > > guestperf
> > > >
> > > > Refer to the following link to see details:
> > > > https://github.com/qemu/qemu/tree/master/tests/migration/guestperf
> > > >
> > > > b. Test VM scale:
> > > > CPU: 10; Memory: 28GB
> > >
> > > 28GB isn't a huge VM at all, is it?  It'll be nice to know exactly
> > > what the underlying problem is first.  So are we talking about "real
> > > huge VMs" (I'd say at least a few hundred GBs), or "28GB VMs"
> > > (mostly.. every VM QEMU invokes)?
> > >
> > > >
> > > > c. Average bandwidth between source and destination for migration:
> > > > 1.53 Gbits/sec
> > > >
> > > > ----------------------------------------------------------------------
> > > > All the supplementary test data shown below is based on the above
> > > > test environment.
> > > >
> > > > We use the stress tool contained in initrd-stress.img to dirty
> > > > ramsize MB of memory on every CPU in the guest; refer to the
> > > > following link for the source code:
> > > > https://github.com/qemu/qemu/blob/master/tests/migration/stress.c
> > > >
> > > > We run the following command to compare our refined QEMU with the
> > > > original QEMU:
> > > >
> > > > $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
> > > > --binary $(path_to_qemu-kvm) \
> > > > --dst-host $(destination_ip) \
> > > > --transport tcp --kernel $(path_to_vmlinuz) \
> > > > --initrd $(path_to_initrd-stress.img) \
> > > > --auto-converge \
> > > > --auto-converge-step 10
> > >
> > > So this is for auto-converge.  How's the dirty limit solution?
> > >
> >
> > This patch is a general solution, not just for auto-converge; I missed
> > the dirty limit test case here. I'll post the dirty limit test results
> > in the next version.
> >
> >
> > > I am surprised you switched to the old one.  Does it mean that
> > > auto-converge is better in some cases?
> > >
> >
> > For non-x86 hardware platforms, auto-converge is still the only
> > solution for throttling the guest, so they still reap the benefits of
> > this patch.
>
> ARM should have that too now, if that's what you're talking about.
>

OK, thanks for the tip; I missed these important series:

https://patchew.org/QEMU/20230509022122.20888-1-gs...@redhat.com/
https://lore.kernel.org/kvmarm/20220819005601.198436-1-gs...@redhat.com/#r

I'll try to enable the dirty-limit on both arches and see how it goes.
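For reference, here is a minimal QMP sketch of what I plan to test,
assuming the dirty-limit migration capability and the vcpu-dirty-limit
parameter (in MB/s) from the series above; the value 50 is just a
placeholder:

  { "execute": "migrate-set-capabilities",
    "arguments": { "capabilities": [
        { "capability": "dirty-limit", "state": true } ] } }

  { "execute": "migrate-set-parameters",
    "arguments": { "vcpu-dirty-limit": 50 } }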


> That said, for quite some time I was wondering whether upstream should be
> focused on some solution that will work out for convergence, where I
> ultimately chose postcopy.  That's also why I was asking whether you
> considered postcopy.
>

Got it. So far, applying postcopy still needs more testing and practice in
our production environment. For stability reasons, it may not be our first
choice.


>
> It'll be more focused for everyone if we can have one major solution to
> this problem rather than spreading the energy across a few.
>

Agree.


>
> >
> >
> > >
> > > >
> > > > The following data shows the migration test results as the stress
> > > > increases. Ignoring the title row, the odd rows show the unrefined
> > > > QEMU test data, and the even rows show the refined QEMU test data:
> > > >
> > > >
> > > > |---------+---------+-----------+----------+------------+------------|
> > > > | ramsize | success | totaltime | downtime | iterations | switchover |
> > > > | (MB)    |         | (ms)      | (ms)     |            | throttle   |
> > > > |         |         |           |          |            | percent    |
> > > > |---------+---------+-----------+----------+------------+------------|
> > > > |     350 |     yes |    170285 |      399 |         23 |         99 |
> > > > |     350 |     yes |     85962 |      412 |         24 |         70 |
> > > > |     350 |     yes |    176305 |      199 |         20 |         99 |
> > > > |     350 |     yes |     66729 |      321 |         11 |         40 |
> > > > |     400 |     yes |    183042 |      469 |         21 |         99 |
> > > > |     400 |     yes |     77225 |      421 |         10 |         30 |
> > > > |     400 |     yes |    183641 |      866 |         27 |         99 |
> > > > |     400 |     yes |     59796 |      479 |         15 |         50 |
> > > > |     450 |     yes |    165881 |      820 |         21 |         99 |
> > > > |     450 |     yes |     87930 |      368 |         20 |         90 |
> > > > |     450 |     yes |    195448 |      452 |         23 |         99 |
> > > > |     450 |     yes |     70394 |      295 |          6 |         20 |
> > > > |     500 |     yes |    112587 |      471 |         19 |         99 |
> > > > |     500 |     yes |     57401 |      299 |          5 |         30 |
> > > > |     500 |     yes |    110683 |      657 |         21 |         99 |
> > > > |     500 |     yes |     69949 |      649 |          8 |         40 |
> > > > |     600 |     yes |    111036 |      324 |         23 |         99 |
> > > > |     600 |     yes |     63455 |      346 |          4 |         20 |
> > > > |     600 |     yes |    126667 |      426 |         20 |         99 |
> > > > |     600 |     yes |    101024 |      643 |         20 |         99 |
> > > > |    1000 |     yes |    296216 |      660 |         23 |         99 |
> > > > |    1000 |     yes |    106915 |      468 |         16 |         99 |
> > > > |    1000 |      no |    300000 |          |            |            |
> > > > |    1000 |     yes |    125819 |      824 |         17 |         99 |
> > > > |    1200 |      no |    300000 |          |            |            |
> > > > |    1200 |     yes |    127379 |      664 |         14 |         90 |
> > > > |    1200 |      no |    300000 |          |            |            |
> > > > |    1200 |     yes |     67086 |      793 |         11 |         50 |
> > > > |---------+---------+-----------+----------+------------+------------|
> > > >
> > > > To summarize the data above, no result implies a negative
> > > > optimization, and moreover, the throttle algorithm seems to become
> > > > more responsive to dirty rate increases due to the refined detection.
> > > >
> > > > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> > > > ---
> > > >  migration/ram.c        | 48 +++++++++++++++++++++++++++++++++++++++---
> > > >  migration/trace-events |  1 +
> > > >  2 files changed, 46 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index edec1a2d07..18b2d422b5 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -414,6 +414,17 @@ struct RAMState {
> > > >       * RAM migration.
> > > >       */
> > > >      unsigned int postcopy_bmap_sync_requested;
> > > > +
> > > > +    /*
> > > > +     * Ratio of bytes_dirty_period to bytes_xfer_period in the last
> > > > +     * iteration
> > > > +     */
> > > > +    uint64_t dirty_ratio_pct;
> > > > +    /*
> > > > +     * How many times the most recent iteration's dirty ratio is
> > > > +     * higher than the previous one
> > > > +     */
> > > > +    int dirty_ratio_high_cnt;
> > > >  };
> > > >  typedef struct RAMState RAMState;
> > > >
> > > > @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void)
> > > >      trace_migration_dirty_limit_guest(quota_dirtyrate);
> > > >  }
> > > >
> > > > +static bool migration_dirty_ratio_unexpected(RAMState *rs)
> > > > +{
> > > > +    uint64_t threshold = migrate_throttle_trigger_threshold();
> > > > +    uint64_t bytes_xfer_period =
> > > > +        migration_transferred_bytes() - rs->bytes_xfer_prev;
> > > > +    uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
> > > > +    uint64_t prev, curr;
> > > > +
> > > > +    /* Skip the first iteration since it isn't helpful */
> > > > +    if (stat64_get(&mig_stats.dirty_sync_count) == 1 || !bytes_xfer_period) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
> > > > +
> > > > +    prev = rs->dirty_ratio_pct;
> > > > +    rs->dirty_ratio_pct = curr;
> > > > +
> > > > +    if (prev == 0 || curr <= threshold) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    trace_dirty_ratio_pct(curr, prev);
> > > > +    return curr >= prev;
> > > > +}
> > > > +
> > > >  static void migration_trigger_throttle(RAMState *rs)
> > > >  {
> > > >      uint64_t threshold = migrate_throttle_trigger_threshold();
> > > > @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState *rs)
> > > >       * we were in this routine reaches the threshold. If that happens
> > > >       * twice, start or increase throttling.
> > > >       */
> > > > -    if ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > > -        (++rs->dirty_rate_high_cnt >= 2)) {
> > > > -        rs->dirty_rate_high_cnt = 0;
> > > > +    if ((migration_dirty_ratio_unexpected(rs) &&
> > > > +         (++rs->dirty_ratio_high_cnt >= 2)) ||
> > > > +        ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > > +         (++rs->dirty_rate_high_cnt >= 2))) {
> > >
> > > I'm afraid this is a mess.
> > >
> > > Now it's (A||B) and either condition can trigger this throttle logic.
> > > Both A & B make decisions on essentially the same parameters.  It's
> > > totally unpredictable to me how these cnts bump, and not readable.
> > >
> >
> > Indeed, this is not readable. How about introducing a migration
> > capability like "precise-detection" to make this patch
> > backward-compatible?
>
> If we want to make some change here anyway, I'd think it better to do it
> in the best form rather than keep adding caps.


> For now "stick with postcopy" is still one way to go, but not sure whether
> it applies in your setup.  Otherwise I'd consider we sync more for huge
> VMs.  I mean "throttle only for each sync, sync for each iteration" may
> make sense in the old days, but perhaps not anymore.  That sounds like
> the real root cause of this issue.
>

This inspired me; I'll try this approach. :)
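To make sure I understand the direction, here is a rough standalone C
sketch (hypothetical code, not QEMU internals: send_some_dirty_pages(),
sync_dirty_bitmap() and migration_trigger_throttle_check() are placeholder
names, and the 1000ms interval is an arbitrary assumption) of re-syncing
the dirty bitmap on a timer inside the send loop, so the throttle check
still runs even when one iteration of a huge VM lasts many minutes:

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define SYNC_INTERVAL_MS 1000   /* assumed tunable: one sync per second */

static uint64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Placeholder hooks standing in for the real migration internals. */
extern bool send_some_dirty_pages(void);  /* false when nothing left to send */
extern void sync_dirty_bitmap(void);      /* dirty log sync + accounting */
extern void migration_trigger_throttle_check(void);

void iterate_with_periodic_sync(void)
{
    uint64_t last_sync = now_ms();

    while (send_some_dirty_pages()) {
        if (now_ms() - last_sync >= SYNC_INTERVAL_MS) {
            /* Pay for extra syncs so the throttle logic gets scheduled
             * from time to time within a single long iteration. */
            sync_dirty_bitmap();
            migration_trigger_throttle_check();
            last_sync = now_ms();
        }
    }
}

The trade-off is exactly the extra sync overhead you mentioned; the
interval would need tuning against the dirty log sync cost.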


>
> >
> >
> > >
> > > What I kind of see is how this could make migration shorter: now
> > > either A or B can trigger the throttle, so it triggers at a faster
> > > pace than before.  E.g., IIUC, if we drop dirty_rate_high_cnt
> > > completely it'll also achieve a similar thing, at least in the
> > > guestperf tests.
> > >
> >
> > Yes!  I get it; my original idea was to drop the dirty_rate_high_cnt.
> >
> > As we mentioned above, migration needs to pay more for the sync overhead
> > once a VM is configured with huge memory or runs at a high dirty rate.
> >
> > Dropping the dirty_rate_high_cnt will make each iteration take less time
> > in the above cases.  I think this is feasible since there is no other
> > reason to retain the dirty_rate_high_cnt once we make the detection
> > precise; let me know if I missed something.
>

I still think dropping the dirty_rate_high_cnt is a worthwhile refinement
once migration has a precise detection of convergence. What do you think?
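For illustration only, a hypothetical (untested) delta on top of the patch
above would then look like:

--- a/migration/ram.c
+++ b/migration/ram.c
@@ static void migration_trigger_throttle(RAMState *rs)
-    if ((migration_dirty_ratio_unexpected(rs) &&
-         (++rs->dirty_ratio_high_cnt >= 2)) ||
-        ((bytes_dirty_period > bytes_dirty_threshold) &&
-         (++rs->dirty_rate_high_cnt >= 2))) {
+    /* With precise detection, trigger directly: no double confirmation,
+     * and the repeat counters can be dropped from RAMState entirely. */
+    if (migration_dirty_ratio_unexpected(rs) ||
+        (bytes_dirty_period > bytes_dirty_threshold)) {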


> >
> >
> > >
> > > Have you considered what I mentioned above that may make auto-converge
> > > work better with "real huge VMs" (by allowing sync >1 time per
> > > iteration), or have you considered postcopy?
> > >
> >
> > Yes. IMHO, this is another refinement direction for auto-converge;
> > we'll try it and keep communicating with upstream once it makes
> > progress.
>
> Thanks,
>
> --
> Peter Xu
>
>

-- 
Best regards

Yong Huang