On Wed, Aug 7, 2024 at 12:45 AM Peter Xu <pet...@redhat.com> wrote:
> On Mon, Aug 05, 2024 at 03:03:27PM +0800, Yong Huang wrote:
> > Sorry for the late reply.
> >
> > On Wed, Jul 31, 2024 at 4:01 AM Peter Xu <pet...@redhat.com> wrote:
> > > On Wed, Jul 24, 2024 at 07:39:29PM +0800, Hyman Huang wrote:
> > > > Currently, the convergence algorithm decides that the migration
> > > > cannot converge according to the following principle: the dirty
> > > > pages generated in the current iteration exceed a specific
> > > > percentage (throttle-trigger-threshold, 50 by default) of the
> > > > bytes transferred in that iteration. Let's refer to this criterion
> > > > as the "transmission speed". If this criterion is met two or more
> > > > times (dirty_rate_high_cnt >= 2), the throttle percentage needs to
> > > > be increased.
> > > >
> > > > In most cases, the above implementation is appropriate. However,
> > > > for a VM with huge memory under a high memory load, each iteration
> > > > is
> > >
> > > When you're talking about huge memory, I do agree the current
> > > throttle logic doesn't look like it scales, because
> > > migration_trigger_throttle() is only called once per iteration, so
> > > it won't be invoked for a long time if one iteration takes a long
> > > time.
> > >
> > > I wonder whether you considered fixing that for the huge-VM case in
> > > some way, so that we may need to pay more for the sync overhead, but
> > > the throttle logic can still get scheduled from time to time.
> > >
> > > E.g., on a 10TB system with even a 100Gbps network, one iteration
> > > can take:
> > >
> > >   10TB / ~10GB/s = ~1000 seconds = ~17min
> > >
> > > It means migration_trigger_throttle() will only be invoked once
> > > every 17 mins.
> >
> > Agree. And there is another case: a VM migrating while at a high
> > dirty rate similarly makes each iteration take a long time.
> >
> > > > time-consuming. The VM's computing performance may be throttled at
> > > > a high percentage and stay there for a long time due to the
> > > > repeated confirmation behavior, which may be intolerable for some
> > > > computationally sensitive software in the VM. The refinement is
> > > > for this scenario.
> > > >
> > > > As the comment in the migration_trigger_throttle function
> > > > mentions, the original algorithm confirms the criterion repeatedly
> > > > in order to avoid erroneous detection. Put differently, once the
> > > > detection becomes more reliable, it does not need to be confirmed
> > > > twice.
> > > >
> > > > In the refinement, in order to make the detection more accurate,
> > > > we introduce another criterion, called the "valid transmission
> > > > ratio", to determine migration convergence. The "valid
> > > > transmission ratio" is the ratio of bytes_dirty_period to
> > > > bytes_xfer_period, which actually describes the migration
> > > > efficiency (the lower, the more of the transferred data stays
> > > > clean).
> > > >
> > > > When the algorithm repeatedly detects that the current iteration's
> > > > "valid transmission ratio" is no better than the previous
> > > > iteration's, it decides that the migration cannot converge.
> > > >
> > > > If either of the two criteria, "transmission speed" or "valid
> > > > transmission ratio", is met, the penalty percentage is increased.
> > > > This saves time across the entire iteration and therefore reduces
> > > > the duration of VM performance degradation.
> > > >
> > > > In conclusion, this refinement significantly reduces the time
> > > > needed for the throttle percentage to step up to its maximum while
> > > > the VM is under a high memory load.
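
(To make the two criteria above concrete, here is a minimal standalone
sketch; the function and variable names are illustrative only, not the
patch code itself -- the actual implementation is in the diff further
down:)

    #include <stdbool.h>
    #include <stdint.h>

    /* Criterion 1, "transmission speed": dirtying outpaces transfer. */
    static bool transmission_speed_high(uint64_t bytes_dirty_period,
                                        uint64_t bytes_xfer_period,
                                        uint64_t threshold_pct)
    {
        /* dirty bytes exceed threshold_pct% of the bytes transferred */
        return bytes_dirty_period >
               bytes_xfer_period * threshold_pct / 100;
    }

    /* Criterion 2, "valid transmission ratio": efficiency stops improving. */
    static bool transmission_ratio_unexpected(uint64_t curr_ratio_pct,
                                              uint64_t prev_ratio_pct,
                                              uint64_t threshold_pct)
    {
        /* ratio = 100 * dirty / xfer; higher means less useful work */
        return prev_ratio_pct != 0 &&
               curr_ratio_pct > threshold_pct &&
               curr_ratio_pct >= prev_ratio_pct;
    }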
> > > > The following is the test environment:
> > > > ----------------------------------------------------------------------
> > > > a. Test tool:
> > > >    guestperf
> > > >
> > > >    Refer to the following link for details:
> > > >    https://github.com/qemu/qemu/tree/master/tests/migration/guestperf
> > > >
> > > > b. Test VM scale:
> > > >    CPU: 10; Memory: 28GB
> > >
> > > 28GB isn't a huge VM at all, is it? It'd be nice to know exactly
> > > what the problem is first. So are we talking about "real huge VMs"
> > > (I'd say at least a few hundred GBs), or "28GB VMs" (mostly.. every
> > > VM QEMU invokes)?
> > >
> > > > c. Average bandwidth between source and destination for migration:
> > > >    1.53 Gbits/sec
> > > > ----------------------------------------------------------------------
> > > > All the supplementary test data shown below are based on the above
> > > > test environment.
> > > >
> > > > We use the stress tool contained in initrd-stress.img to update
> > > > ramsize MB on every CPU in the guest; refer to the following link
> > > > for the source code:
> > > > https://github.com/qemu/qemu/blob/master/tests/migration/stress.c
> > > >
> > > > We run the following command to compare our refined QEMU with the
> > > > original QEMU:
> > > >
> > > > $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
> > > >       --binary $(path_to_qemu-kvm) \
> > > >       --dst-host $(destination_ip) \
> > > >       --transport tcp --kernel $(path_to_vmlinuz) \
> > > >       --initrd $(path_to_initrd-stress.img) \
> > > >       --auto-converge \
> > > >       --auto-converge-step 10
> > >
> > > So this is for auto-converge. How's the dirty-limit solution? I am
> > > surprised you switched to the old one. Does it mean that
> > > auto-converge is better in some cases?
> >
> > This patch is a general solution, not just for auto-converge; I
> > missed the dirty-limit test case here, and I'll post the dirty-limit
> > test results in the next version.
> >
> > Besides, for non-x86 hardware platforms, auto-converge is still the
> > only solution to throttle the guest, so it still reaps the benefits
> > of this patch.
>
> ARM should have that too now, if that's what you're talking about.

Ok, thanks for the tip, I missed the important series:
https://patchew.org/QEMU/20230509022122.20888-1-gs...@redhat.com/
https://lore.kernel.org/kvmarm/20220819005601.198436-1-gs...@redhat.com/#r

I'll try to enable the dirty-limit on both arches and see how it goes (a
sketch of the intended comparison run follows below).

> Said that, for quite some time I was wondering whether upstream should
> be focused on some solution that will work out for convergence, where
> I ultimately chose postcopy. That's also why I was asking whether you
> considered postcopy.

Got it. So far, applying postcopy still needs more tests and practice in
our production environment; for stability reasons, it may not be our
first choice.

> It'll be more focused for everyone if we can have one major solution to
> this problem rather than spreading the energy over a few.

Agree.
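
For reference, the dirty-limit comparison I have in mind would mirror the
auto-converge command above, assuming the dirty-limit options that the
guestperf dirty-limit series introduces (the option names and the limit
value below are my assumptions from that series and may differ):

    $ ./migration/guestperf.py --cpus 10 --mem 28 --max-iters 40 \
          --binary $(path_to_qemu-kvm) \
          --dst-host $(destination_ip) \
          --transport tcp --kernel $(path_to_vmlinuz) \
          --initrd $(path_to_initrd-stress.img) \
          --dirty-limit --vcpu-dirty-limit 50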
> > > > The following data shows the migration test results with
> > > > increasing stress. Ignoring the title row, the odd rows show the
> > > > unrefined QEMU test data, and the even rows show the refined QEMU
> > > > test data:
> > > >
> > > > |---------+---------+-----------+----------+------------+------------|
> > > > | ramsize | success | totaltime | downtime | iterations | switchover |
> > > > | (MB)    |         | (ms)      | (ms)     |            | throttle   |
> > > > |         |         |           |          |            | percent    |
> > > > |---------+---------+-----------+----------+------------+------------|
> > > > | 350     | yes     | 170285    | 399      | 23         | 99         |
> > > > | 350     | yes     | 85962     | 412      | 24         | 70         |
> > > > | 350     | yes     | 176305    | 199      | 20         | 99         |
> > > > | 350     | yes     | 66729     | 321      | 11         | 40         |
> > > > | 400     | yes     | 183042    | 469      | 21         | 99         |
> > > > | 400     | yes     | 77225     | 421      | 10         | 30         |
> > > > | 400     | yes     | 183641    | 866      | 27         | 99         |
> > > > | 400     | yes     | 59796     | 479      | 15         | 50         |
> > > > | 450     | yes     | 165881    | 820      | 21         | 99         |
> > > > | 450     | yes     | 87930     | 368      | 20         | 90         |
> > > > | 450     | yes     | 195448    | 452      | 23         | 99         |
> > > > | 450     | yes     | 70394     | 295      | 6          | 20         |
> > > > | 500     | yes     | 112587    | 471      | 19         | 99         |
> > > > | 500     | yes     | 57401     | 299      | 5          | 30         |
> > > > | 500     | yes     | 110683    | 657      | 21         | 99         |
> > > > | 500     | yes     | 69949     | 649      | 8          | 40         |
> > > > | 600     | yes     | 111036    | 324      | 23         | 99         |
> > > > | 600     | yes     | 63455     | 346      | 4          | 20         |
> > > > | 600     | yes     | 126667    | 426      | 20         | 99         |
> > > > | 600     | yes     | 101024    | 643      | 20         | 99         |
> > > > | 1000    | yes     | 296216    | 660      | 23         | 99         |
> > > > | 1000    | yes     | 106915    | 468      | 16         | 99         |
> > > > | 1000    | no      | 300000    |          |            |            |
> > > > | 1000    | yes     | 125819    | 824      | 17         | 99         |
> > > > | 1200    | no      | 300000    |          |            |            |
> > > > | 1200    | yes     | 127379    | 664      | 14         | 90         |
> > > > | 1200    | no      | 300000    |          |            |            |
> > > > | 1200    | yes     | 67086     | 793      | 11         | 50         |
> > > > |---------+---------+-----------+----------+------------+------------|
> > > >
> > > > To summarize the data above, no result implies a negative
> > > > optimization; moreover, the throttle algorithm seems to become
> > > > more responsive to dirty-rate increases thanks to the refined
> > > > detection.
> > > >
> > > > Signed-off-by: Hyman Huang <yong.hu...@smartx.com>
> > > > ---
> > > >  migration/ram.c        | 48 +++++++++++++++++++++++++++++++++++++++---
> > > >  migration/trace-events |  1 +
> > > >  2 files changed, 46 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index edec1a2d07..18b2d422b5 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -414,6 +414,17 @@ struct RAMState {
> > > >       * RAM migration.
> > > >       */
> > > >      unsigned int postcopy_bmap_sync_requested;
> > > > +
> > > > +    /*
> > > > +     * Ratio of bytes_dirty_period and bytes_xfer_period in the last
> > > > +     * iteration
> > > > +     */
> > > > +    uint64_t dirty_ratio_pct;
> > > > +    /*
> > > > +     * How many times the most recent iteration's dirty ratio was
> > > > +     * higher than the previous one's
> > > > +     */
> > > > +    int dirty_ratio_high_cnt;
> > > >  };
> > > >  typedef struct RAMState RAMState;
> > > >
> > > > @@ -1013,6 +1024,32 @@ static void migration_dirty_limit_guest(void)
> > > >      trace_migration_dirty_limit_guest(quota_dirtyrate);
> > > >  }
> > > >
> > > > +static bool migration_dirty_ratio_unexpected(RAMState *rs)
> > > > +{
> > > > +    uint64_t threshold = migrate_throttle_trigger_threshold();
> > > > +    uint64_t bytes_xfer_period =
> > > > +        migration_transferred_bytes() - rs->bytes_xfer_prev;
> > > > +    uint64_t bytes_dirty_period = rs->num_dirty_pages_period * TARGET_PAGE_SIZE;
> > > > +    uint64_t prev, curr;
> > > > +
> > > > +    /* Skip the first iteration since it isn't helpful */
> > > > +    if (stat64_get(&mig_stats.dirty_sync_count) == 1 || !bytes_xfer_period) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    curr = 100 * (bytes_dirty_period * 1.0 / bytes_xfer_period);
> > > > +
> > > > +    prev = rs->dirty_ratio_pct;
> > > > +    rs->dirty_ratio_pct = curr;
> > > > +
> > > > +    if (prev == 0 || curr <= threshold) {
> > > > +        return false;
> > > > +    }
> > > > +
> > > > +    trace_dirty_ratio_pct(curr, prev);
> > > > +    return curr >= prev;
> > > > +}
> > > > +
> > > >  static void migration_trigger_throttle(RAMState *rs)
> > > >  {
> > > >      uint64_t threshold = migrate_throttle_trigger_threshold();
> > > > @@ -1028,9 +1065,14 @@ static void migration_trigger_throttle(RAMState *rs)
> > > >       * we were in this routine reaches the threshold. If that happens
> > > >       * twice, start or increase throttling.
> > > >       */
> > > > -    if ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > > -        (++rs->dirty_rate_high_cnt >= 2)) {
> > > > -        rs->dirty_rate_high_cnt = 0;
> > > > +    if ((migration_dirty_ratio_unexpected(rs) &&
> > > > +         (++rs->dirty_ratio_high_cnt >= 2)) ||
> > > > +        ((bytes_dirty_period > bytes_dirty_threshold) &&
> > > > +         (++rs->dirty_rate_high_cnt >= 2))) {
> > >
> > > I'm afraid this is a mess.
> > >
> > > Now it's (A||B) and either condition can trigger this throttle
> > > logic. Both A & B make decisions on merely the same parameters. It's
> > > totally unpredictable to me how these counts bump, and not readable.
> >
> > Indeed, this is not readable. How about introducing a migration
> > capability like "precise-detection" to make this patch
> > backward-compatible?
>
> If we want to do some change here anyway, I'd think it better we do it
> in the best form rather than keep adding caps.
>
> For now "stick with postcopy" is still one way to go, but I'm not sure
> whether it applies to your setup. Otherwise I'd consider syncing more
> often for huge VMs. I mean, "throttle only at each sync, sync once per
> iteration" may have made sense in the old days, but perhaps not
> anymore. That sounds like the real root cause of this issue.

These inspired me, I'll try this way. :)
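
Roughly what I'll experiment with, as a sketch only (the hook name and
the 1GB budget below are hypothetical, not existing QEMU code; the real
change would need to fit the existing sync/throttle plumbing in
migration/ram.c):

    /*
     * Sketch: decouple throttle evaluation from iteration boundaries by
     * re-syncing the dirty bitmap once a byte budget has been sent,
     * instead of only once per full pass over RAM.
     */
    #define SYNC_BYTES_BUDGET (1ULL << 30)   /* hypothetical: ~1GB */

    static void ram_save_maybe_resync(RAMState *rs)
    {
        uint64_t sent = migration_transferred_bytes() - rs->bytes_xfer_prev;

        if (sent >= SYNC_BYTES_BUDGET) {
            /*
             * migration_bitmap_sync() is where migration_trigger_throttle()
             * gets invoked, so with this hook called from the send loop
             * the throttle would be reconsidered every ~1GB of transfer
             * rather than once per iteration (~17 minutes in the 10TB
             * example above).
             */
            migration_bitmap_sync_precopy(rs, false);
        }
    }

The extra syncs cost additional dirty-bitmap scans, so the budget would
probably need to scale with the guest memory size; I'll measure that.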
> > > What I kind of see how this could make migration shorter is that
> > > now either A or B can trigger the throttle, so it triggers at a
> > > faster pace than before. E.g., IIUC, if we drop dirty_rate_high_cnt
> > > completely it'll also achieve a similar thing, at least in the
> > > guestperf tests.
> >
> > Yes! I get it; my original idea was indeed to drop the
> > dirty_rate_high_cnt.
> >
> > As we mentioned above, migration needs to pay more for the sync
> > overhead once a VM is configured with huge memory or running at a
> > high dirty rate.
> >
> > Dropping the dirty_rate_high_cnt will make the iteration take less
> > time in the above cases. I think this is feasible since there is no
> > other reason to retain the dirty_rate_high_cnt once we make the
> > detection precise; let me know if I missed something.

I still think dropping the dirty_rate_high_cnt is a refinement. If
migration has a precise detection of convergence, what do you think of
it?

> > > Have you considered what I mentioned above that may make
> > > auto-converge work better with "real huge VMs" (by allowing sync >1
> > > times for each iteration), or have you considered postcopy?
> >
> > Yes, IMHO, this is another refinement direction for auto-converge;
> > we'll try it and keep communicating with upstream as it progresses.

> Thanks,
>
> --
> Peter Xu

--
Best regards