* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Mon, Mar 06, 2023 at 01:44:38PM +0000, Dr. David Alan Gilbert wrote:
> > * Thomas Huth (th...@redhat.com) wrote:
> > > On 03/03/2023 13.05, Peter Maydell wrote:
> > > > On Fri, 3 Mar 2023 at 11:29, Thomas Huth <th...@redhat.com> wrote:
> > > > >
> > > > > On 03/03/2023 12.18, Peter Maydell wrote:
> > > > > > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quint...@redhat.com> wrote:
> > > > > > >
> > > > > > > Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > > > > > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> > > > > > > > > migration-test has been flaky for a long time, both in CI and
> > > > > > > > > otherwise:
> > > > > > > > >
> > > > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> > > > > > > > > (a FreeBSD job)
> > > > > > > > > 32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
> > > > > > > > > assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> > > > > > > > >
> > > > > > > > > on a local macos x86 box:
> > > > > > > > >
> > > > > > >
> > > > > > > What is really weird with this failure is that:
> > > > > > > - it only happens on non-x86
> > > > > >
> > > > > > No, I have seen it on x86 macos, and x86 OpenBSD
> > > > > >
> > > > > > > - on code that is not arch dependent
> > > > > > > - on cancel, what we really do there is close fd's for the multifd
> > > > > > >   channel threads to get out of the recv, i.e. again, nothing that
> > > > > > >   should be arch dependent.
> > > > > >
> > > > > > I'm pretty sure that it tends to happen when the machine that's
> > > > > > running the test is heavily loaded. You probably have a race
> > > > > > condition.
> > > > >
> > > > > I think I can second that. IIRC I've seen it a couple of times on my x86
> > > > > laptop when running "make check -j$(nproc) SPEED=slow" here.
> > > >
> > > > And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> > > > https://gitlab.com/qemu-project/qemu/-/jobs/3870165180
> > >
> > > And FWIW, I just saw this while doing "make vm-build-netbsd J=4":
> > >
> > > ▶ 31/645 ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge:
> > > 'got_stop' should be FALSE ERROR
> >
> > That one is kind of interesting; this is an auto converge test - so it
> > tries to set up migration so it won't finish, to check that the auto
> > converge kicks in. Except in this case the migration *did* finish
> > without the autoconverge (significantly) kicking in.
> >
> > So I guess any of:
> >   a) The CPU thread never got much CPU time so not much dirtying
> >      happened.
> >   b) The bandwidth calculations might be bad enough/coarse enough
> >      that it's passing the (very low) bandwidth limit due to bad
> >      approximation at bandwidth needed.
> >   c) The autoconverge jump happens fast enough for that loop
> >      to hit the got_stop in the loop time of that loop.
> >
> > I guess we could:
> >   i) Reduce the usleep in test_migrate_auto_converge
> >      (So it is more likely to correctly drop out of that loop
> >      as soon as autoconverge kicks in)
>
> The CPU time spent by the dirtying guest CPUs should dominate
> here, so we can afford to reduce that timeout down a bit to
> be more responsive.
>
> >   ii) Reduce inc_pct so that autoconverge kicks in slower
> >   iii) Reduce max-bandwidth in migrate_ensure_non_converge
> >        even further.
>
> migrate_ensure_non_converge is trying to guarantee non-convergence,
> but obviously we're only achieving a probabilistic chance of
> non-convergence. To get the probability closer to 100% we should make
> it massively smaller, say 100kbs instead of 30mbs.
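[Editor's note: Daniel's probabilistic-convergence argument can be sketched with a toy numeric model. This is illustrative pseudocode, not QEMU code, and the dirty-rate figures are hypothetical; roughly, migration can complete once the allowed transfer rate exceeds the guest's dirty rate.]

```python
def converges(dirty_rate_bps, max_bandwidth_bps):
    """Toy model: migration can complete once it transfers pages
    faster than the guest dirties them (downtime-limit ignored)."""
    return max_bandwidth_bps > dirty_rate_bps

# A freely running dirtying vCPU might dirty ~600 MB/s (hypothetical):
assert not converges(600e6, 30e6)   # 30 MB/s cap: no convergence

# On a heavily loaded host the vCPU may be starved down to ~10 MB/s,
# at which point a 30 MB/s cap no longer prevents convergence:
assert converges(10e6, 30e6)        # test races: got_stop can fire

# A 100 kB/s cap keeps non-convergence near-certain even when starved:
assert not converges(10e6, 100e3)
```

This is why dropping the cap from 30mbs to 100kbs makes the test robust: the cap then sits far below any plausible dirty rate, even under load.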
Yeh, I'll cut a patch for this.

Dave

> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK