Sorry for the late reply. On Wed, Jul 31, 2024 at 4:01 AM Peter Xu <pet...@redhat.com> wrote:
> On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote: > > Currently, the convergence algorithm determines that the migration > > cannot converge according to the following principle: > > The dirty pages generated in current iteration exceed a specific > > percentage (throttle-trigger-threshold, 50 by default) of the number > > of transmissions. Let's refer to this criteria as the > > "transmission speed," If this criteria is met more than or equal to > > twice (dirty_rate_high_cnt >= 2), the throttle percentage needs to > > be increased. > > > > In most cases, above implementation is appropriate. However, for a > > VM with a huge memory and high memory overload, each iteration is > > When you're talking about huge memory, I do agree the current throttle > logic doesn't look like to scale, because migration_trigger_throttle() is > only called for each iteration, so it won't be invoked for a long time if > one iteration can take a long time. > > I wonder whether you considered fixing that for a huge VM case in some way, > so that we may need to pay for the sync overhead more, but maybe the > throttle logic can still get scheduled from time to time. > E.g., on a 10TB system with even a 100Gbps network, one iteration can take: > > 10TB / ~10GB/s = ~1000 seconds = ~17min > > It means migration_trigger_throttle() will only be invoked once every 17 > mins. Agree, and another case, VM is at a high dirty rate when migrating, similarly making the iteration take a long time. > > time-consuming. The VM's computing performance may be throttled at > > a high percentage and last for a long time due to the repeated > > confirmation behavior. Which may be intolerable for some > > computationally sensitive software in the VM. The refinement is for > > this scenario. > > > > As the comment mentioned in the migration_trigger_throttle function, > > in order to avoid erroneous detection, the original algorithm confirms > > the criteria repeatedly. Put differently, once the detection become > > more reliable, it does not need to be confirmed twice. > > > > In the refinement, in order to make the detection more accurate, we > > introduce another criteria, called the "valid transmission ratio" > > to determine the migration convergence. The "valid transmission ratio" > > is the ratio of bytes_xfer_period and bytes_dirty_period, which > > actually describe the migration efficiency. > > > > When the algorithm repeatedly detects that the current iteration > > "valid transmission ratio" is lower than the previous iteration, > > the algorithm determines that the migration cannot converge. > > > > For the "transmission speed" and "valid transmission ratio", if one > > of the two criteria is met, the penalty percentage would be increased. > > This saves the time of the entire iteration and therefore reduces > > the time of VM performance degradation. > > > > In conclusion, this refinement significantly reduces the processing > > time required for the throttle percentage step to its maximum while > > the VM is under a high memory load. > > > > The following are test environment: > > ---------------------------------------------------------------------- > > a. Test tool: > > guestperf > > > > Refer to the following link to see details: > > https://github.com/qemu/qemu/tree/master/tests/migration/guestperf > > > > b. Test VM scale: > > CPU: 10; Memory: 28GB > > Isn't 28GB not a huge VM at all? It'll be nice to know exactly what's the > problem behind first. So are we talking about "real huge VM"s (I'd say at > least a few hundreds GBs), or "28GB VMs" (mostly.. every VM QEMU invokes)? > > > > > c. Average bandwidth between source and destination for migration: > > 1.53 Gbits/sec > > ---------------------------------------------------------------------- > > All the supplementary test data shown as follows are basing on > > above test environment. > > > > We use stress tool contained in the initrd-stress.img to update > > ramsize MB on every CPU in guest, refer to the following link to > > see the source code: > > https://github.com/qemu/qemu/blob/master/tests/migration/stress.c > > > > We run the following command to compare our refined QEMU with the > > original QEMU: > > > > $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \ > > --binary $(path_to_qemu-kvm) \ > > --dst-host $(destination_ip) \ > > --transport tcp --kernel $(path_to_vmlinuz) \ > > --initrd $(path_to_initrd-stress.img) \ > > --auto-converge \ > > --auto-converge-step 10 \ > > --max-iters 40 > > So this is for aut-converge. How's the dirty limit solution? I am > This patch is a general solution, not just for auto-convergence, here I missed the dirty limit test case, i'll post the dirty limit test result in the next version. > surprised you switched to the old one. Does it mean that auto-converge is > better in some cases? > Since for non-x86 hardware platforms, auto-converge is still the only solution to throttle the guest. It still reap the benefits from this patch. > > > > > The following data shows the migration test results with an increase in > > stress, ignoring the title row, the odd rows show the unrefined QEMU > > test data, and the even rows show the refined QEMU test data: > > > > |---------+--------+-----------+--------------+------------+------------| > > | ramsize | sucess | totaltime | downtime(ms) | iterations | switchover | > > | (MB) | | (ms) | (ms) | | throttle | > > | | | | | | percent | > > |---------+--------+-----------+--------------+------------+------------| > > | 350 | yes | 170285 | 399 | 23 | 99 | > > | 350 | yes | 85962 | 412 | 24 | 70 | > > | 350 | yes | 176305 | 199 | 20 | 99 | > > | 350 | yes | 66729 | 321 | 11 | 40 | > > | 400 | yes | 183042 | 469 | 21 | 99 | > > | 400 | yes | 77225 | 421 | 10 | 30 | > > | 400 | yes | 183641 | 866 | 27 | 99 | > > | 400 | yes | 59796 | 479 | 15 | 50 | > > | 450 | yes | 165881 | 820 | 21 | 99 | > > | 450 | yes | 87930 | 368 | 20 | 90 | > > | 450 | yes | 195448 | 452 | 23 | 99 | > > | 450 | yes | 70394 | 295 | 6 | 20 | > > | 500 | yes | 112587 | 471 | 19 | 99 | > > | 500 | yes | 57401 | 299 | 5 | 30 | > > | 500 | yes | 110683 | 657 | 21 | 99 | > > | 500 | yes | 69949 | 649 | 8 | 40 | > > | 600 | yes | 111036 | 324 | 23 | 99 | > > | 600 | yes | 63455 | 346 | 4 | 20 | > > | 600 | yes | 126667 | 426 | 20 | 99 | > > | 600 | yes | 101024 | 643 | 20 | 99 | > > | 1000 | yes | 296216 | 660 | 23 | 99 | > > | 1000 | yes | 106915 | 468 | 16 | 99 | > > | 1000 | no | 300000 | | | | > > | 1000 | yes | 125819 | 824 | 17 | 99 | > > | 1200 | no | 300000 | | | | > > | 1200 | yes | 127379 | 664 | 14 | 90 | > > | 1200 | no | 300000 | | | | > > | 1200 | yes | 67086 | 793 | 11 | 50 | > > |---------+--------+-----------+--------------+------------+------------| > > > > To summarize the data above, any data that implies negative optimization > > does not appear, and morover, the throttle algorithm seems to become more > > responsive to dirty rate increases due to the refined detection. > > > > Signed-off-by: Hyman Huang <yong.hu...@smartx.com> > > --- > > migration/ram.c | 48 +++++++++++++++++++++++++++++++++++++++--- > > migration/trace-events | 1 + > > 2 files changed, 46 insertions(+), 3 deletions(-) > > > > diff --git a/migration/ram.c b/migration/ram.c > > index edec1a2d07..18b2d422b5 100644 > > --- a/migration/ram.c > > +++ b/migration/ram.c > > @@ -414,6 +414,17 @@ struct RAMState { > > * RAM migration. > > */ > > unsigned int postcopy_bmap_sync_requested; > > + > > + /* > > + * Ratio of bytes_dirty_period and bytes_xfer_period in the last > > + * iteration > > + */ > > + uint64_t dirty_ratio_pct; > > + /* > > + * How many times is the most recent iteration dirty ratio is > > + * higher than previous one > > + */ > > + int dirty_ratio_high_cnt; > > }; > > typedef struct RAMState RAMState; > > > > @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void) > > trace_migration_dirty_limit_guest(quota_dirtyrate); > > } > > > > +static bool migration_dirty_ratio_unexpected(RAMState *rs) > > +{ > > + uint64_t threshold = migrate_throttle_trigger_threshold(); > > + uint64_t bytes_xfer_period = > > + migration_transferred_bytes() - rs->bytes_xfer_prev; > > + uint64_t bytes_dirty_period = rs->num_dirty_pages_period * > TARGET_PAGE_SIZE; > > + uint64_t prev, curr; > > + > > + /* Skip the first iterations since it isn't helpful */ > > + if (stat64_get(&mig_stats.dirty_sync_count) == 1 || > !bytes_xfer_period) { > > + return false; > > + } > > + > > + curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period); > > + > > + prev = rs->dirty_ratio_pct; > > + rs->dirty_ratio_pct = curr; > > + > > + if (prev == 0 || curr <= threshold) { > > + return false; > > + } > > + > > + trace_dirty_ratio_pct(curr, prev); > > + return curr >= prev; > > +} > > + > > static void migration_trigger_throttle(RAMState *rs) > > { > > uint64_t threshold = migrate_throttle_trigger_threshold(); > > @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState > *rs) > > * we were in this routine reaches the threshold. If that happens > > * twice, start or increase throttling. > > */ > > - if ((bytes_dirty_period > bytes_dirty_threshold) && > > - (++rs->dirty_rate_high_cnt >= 2)) { > > - rs->dirty_rate_high_cnt = 0; > > + if ((migration_dirty_ratio_unexpected(rs) && > > + (++rs->dirty_ratio_high_cnt >= 2)) || > > + ((bytes_dirty_period > bytes_dirty_threshold) && > > + (++rs->dirty_rate_high_cnt >= 2))) { > > I'm afraid this is a mess. > > Now it's (A||B) and any condition can trigger this throttle logic. Both A > & B make decisions on merely the same parameters. It's totally > unpredictable to me on how these cnts bumps, and not readable. > Indeed, this is not readable. How about introducing a migration capability like "precise-detection" to make this patch backward-compatible? > > What I kind of see how this could make migration shorter is that now either > A or B can trigger the throttle, so it triggered in a faster pace than > before. E.g. IIUC if we drop dirty_rate_high_cnt completely it'll also > achieve similar thing at least in guestperf tests. > Yes ! I get it, and what my original idea is to drop the dirty_rate_high_cnt. As we mentioned above: migration needs to pay for the sync overhead more once a VM is configured with huge memory or running at a high dirty rate. Dropping the dirty_rate_high_cnt will make the iteration take less time in above cases. I think this is feasible since there is no other reasons to retain the dirty_rate_high_cnt once we make the detection precise, let me know if i missed something. > > Have you considered what I mentioned above that may make auto converge work > better with "real huge VM"s (by allowing sync >1 times for each iteration), > or have you considered postcopy? > Yes, IMHO, this is another refinement direction for the auto-converge, we'll try this and keep communication with upstream once it gets progress. > > Thanks, > > > + rs->dirty_rate_high_cnt = > > + rs->dirty_rate_high_cnt >= 2 ? 0 : rs->dirty_rate_high_cnt; > > + rs->dirty_ratio_high_cnt = > > + rs->dirty_ratio_high_cnt >= 2 ? 0 : > rs->dirty_ratio_high_cnt; > > if (migrate_auto_converge()) { > > trace_migration_throttle(); > > mig_throttle_guest_down(bytes_dirty_period, > > diff --git a/migration/trace-events b/migration/trace-events > > index 0b7c3324fb..654c52c5e4 100644 > > --- a/migration/trace-events > > +++ b/migration/trace-events > > @@ -90,6 +90,7 @@ put_qlist_end(const char *field_name, const char > *vmsd_name) "%s(%s)" > > qemu_file_fclose(void) "" > > > > # ram.c > > +dirty_ratio_pct(uint64_t cur, uint64_t prev) "current ratio: %" PRIu64 > " previous ratio: %" PRIu64 > > get_queued_page(const char *block_name, uint64_t tmp_offset, unsigned > long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx" > > get_queued_page_not_dirty(const char *block_name, uint64_t tmp_offset, > unsigned long page_abs) "%s/0x%" PRIx64 " page_abs=0x%lx" > > migration_bitmap_sync_start(void) "" > > -- > > 2.39.1 > > > > -- > Peter Xu > > -- Best regards