On Tue, Sep 12, 2023 at 07:49:37PM -0300, Fabiano Rosas wrote: > I figured what is going on here (test #1). At postcopy_pause_incoming() > the state transition is ACTIVE -> PAUSED, but when the first recovery > fails on the incoming side, the transition would have to be RECOVER -> > PAUSED. > > Could you add that change to this patch?
Yes, and actually, see: https://lore.kernel.org/qemu-devel/20230912222145.731099-11-pet...@redhat.com/ > > -bool migration_postcopy_is_alive(void) > > +bool migration_postcopy_is_alive(int state) > > { > > MigrationState *s = migrate_get_current(); > > > > Note there's a missing hunk here to actually use the 'state'. Yes.. I fixed it in the version I just posted, here: https://lore.kernel.org/qemu-devel/20230912222145.731099-10-pet...@redhat.com/ +bool migration_postcopy_is_alive(int state) +{ + switch (state) { + case MIGRATION_STATUS_POSTCOPY_ACTIVE: + case MIGRATION_STATUS_POSTCOPY_RECOVER: + return true; + default: + return false; + } +} [...] > >> Here, with this patch the migration gets stuck unless we call > >> migrate_pause() one more time. After another round of migrate_pause + > >> recover, it finishes properly. > > Here (test #2), the issue is that the sockets are unpaired, so there's > no G_IO_IN to trigger the qio_channel watch callback. The incoming side > never calls fd_accept_incoming_migration() and the RP hangs waiting for > something. I think there's no other way to unblock aside from the > explicit qmp_migrate_pause(). Exactly, that's the "trick" I mentioned. :) Sorry when replying just now I seem to have jumped over some sections. See: https://lore.kernel.org/qemu-devel/20230912222145.731099-12-pet...@redhat.com/ I put a rich comment for that: + /* + * Write the 1st byte as QEMU_VM_COMMAND (0x8) for the dest socket, to + * emulate the 1st byte of a real recovery, but stops from there to + * keep dest QEMU in RECOVER. This is needed so that we can kick off + * the recover process on dest QEMU (by triggering the G_IO_IN event). + * + * NOTE: this trick is not needed on src QEMUs, because src doesn't + * rely on an pre-existing G_IO_IN event, so it will always trigger the + * upcoming recovery anyway even if it can read nothing. + */ +#define QEMU_VM_COMMAND 0x08 + c = QEMU_VM_COMMAND; + ret = send(pair2[1], &c, 1, 0); + g_assert_cmpint(ret, ==, 1); > We could give them both separate files and the result would be more > predictable. Please have a look at the test patch I posted (note! it's still under your name but I changed it quite a lot with my sign-off). I used your 2nd method to create socket pairs, and hopefully that provides very reliable way to put both src/dst sides into RECOVER state, then kick it out of it using qmp migrate-pause on both sides. > You can take it. Or drop it if it ends being too artificial. I like your suggestion on having the test case, and I hope the new version in above link I posted isn't so artificial; the only part I don't like about that was the "write 1 byte" trick for dest qemu, but that seems still okay. Feel free to go and have a look. Thanks a lot, -- Peter Xu