On Thu, Sep 14, 2023 at 12:10:04PM -0300, Fabiano Rosas wrote: > Peter Xu <pet...@redhat.com> writes: > > > On Wed, Sep 13, 2023 at 04:42:31PM -0300, Fabiano Rosas wrote: > >> Stefan Hajnoczi <stefa...@redhat.com> writes: > >> > >> > Hi, > >> > The following intermittent failure occurred in the CI and I have filed > >> > an Issue for it: > >> > https://gitlab.com/qemu-project/qemu/-/issues/1886 > >> > > >> > Output: > >> > > >> > >>> QTEST_QEMU_IMG=./qemu-img MALLOC_PERTURB_=116 > >> > QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon > >> > G_TEST_DBUS_DAEMON=/builds/qemu-project/qemu/tests/dbus-vmstate-daemon.sh > >> > QTEST_QEMU_BINARY=./qemu-system-x86_64 > >> > /builds/qemu-project/qemu/build/tests/qtest/migration-test --tap -k > >> > ――――――――――――――――――――――――――――――――――――― ✀ > >> > ――――――――――――――――――――――――――――――――――――― > >> > stderr: > >> > qemu-system-x86_64: Unable to read from socket: Connection reset by > >> > peer > >> > Memory content inconsistency at 5b43000 first_byte = bd last_byte = bc > >> > current = 4f hit_edge = 1 > >> > ** > >> > ERROR:../tests/qtest/migration-test.c:300:check_guests_ram: assertion > >> > failed: (bad == 0) > >> > (test program exited with status code -6) > >> > > >> > You can find the full output here: > >> > https://gitlab.com/qemu-project/qemu/-/jobs/5080200417 > >> > >> This is the postcopy return path issue that I'm addressing here: > >> > >> https://lore.kernel.org/r/20230911171320.24372-1-faro...@suse.de > >> Subject: [PATCH v6 00/10] Fix segfault on migration return path > >> Message-ID: <20230911171320.24372-1-faro...@suse.de> > > > > Hmm I just noticed one thing, that Stefan's failure is a ram check issue > > only, which means qemu won't crash? > > > > The source could have crashed and left the migration at an inconsistent > state and then the destination saw corrupted memory? > > > Fabiano, are you sure it's the same issue on your return-path fix? > > > > I've been running the preempt tests on my branch for thousands of > iterations and didn't see any other errors. Since there's no code going > into the migration tree recently I assume it's the same error. > > I run the tests with GDB attached to QEMU, so I'll always see a crash > before any memory corruption.
Okay, maybe that stops you from seeing the above check_guests_ram() error? Worth checking whether it fails differently always if you just don't attach gdb to it; I had a feeling that it'll always fail in the other way (I think migration-test will say something like "qemu killed" etc. in most cases), further to identify the issues. > > > I'm also trying to reproduce either of them with some loads. I think I hit > > some but it's very hard to reproduce solidly. > > Well, if you find anything else let me know and we'll fix it. I think Stefan's issue is the one I triggered once, but only once; I did see check_guests_ram() lines. I ran concurrently 10 migration-tests (on 8 cores; just to make scheduler start to really work), each looping over preempt/plain for 500 times and hit nothing.. I'm trying again with a larger host with more instances, so far I've run 200 loops over 40 instances running together, I hit nothing.. but I'm keeping trying. -- Peter Xu