On Tue, Feb 08, 2022 at 04:44:33PM +0200, Nir Soffer wrote: > This happens because virt-v2v try to finalize the transfer *before* closing > the > connections to imageio server. > > Current imageio release mark the a ticket as canceled, but will not remove it > if the ticket has open connections from clients. If the clients are > idle, the connection > closes after 60 seconds, so when engine tries again to remove the ticket, the > operation succeeds.
I posted this patch: https://listman.redhat.com/archives/libguestfs/2022-February/msg00113.html It's possibly not quite right. We now just kill nbdkit and immediately go on to finalization, so there might still be a race (since killing nbdkit only starts the process of nbdkit shutting down, closing the Python plugin etc). I don't know if this is a concern or not, but the patch does avoid the 60s timeout in my limited testing. > Upstream version improved this flow to wait only for ongoing > requests. If there are no ongoing requests the ticket is removed > immediately, ignoring the open connections. The connections will be > closed when the client closes the connection on when reading from > the socket times out. "Ongoing" here means that an HTTP request is being processed? > You can test upstream version from here: > https://github.com/oVirt/ovirt-imageio/actions/runs/1811058094 > > But if you want to be compatible with released imageio version, you > must close the connections to imageio *before* finalizing the transfers. > > We already discussed this issue last year, when we handled the bug > when vddk blocks for many minutes during block status and imageio > connection is droped, and you opened a bug for this. > > > It's not a completely like-for-like comparison because the rhv-upload > > backend changed a lot between these versions. In particular if I was > > going to pick some suspicious change it would be my refactoring here > > which was supposed to be neutral but maybe isn't: > > > > > > https://github.com/libguestfs/virt-v2v/commit/143a22860216b94d3a81706193088d50c03fc35c > > > > Unfortunately this commit is almost impossible to revert because of > > deep restructuring of surrounding code. > > The refactoring is not the issue, the issue is that we do not terminate > the output nbdkit before finalizing. > > For some output we cannot do this since we want to query nbdkit for > more info (e.g block status) but for rhv upload we don't have anything > to do with nbkdit instance and we must close the connection before > we finalize, so we should close the output right after we finish the copy. It's not really desirable to kill nbdkit like I did in my patch. We do really want to open up virt-v2v to allow other processes to query those sockets during the transfer. An alternative to the patch I posted might be to have a way to request that nbdkit disconnects its HTTP pool. I will also note that HTTP is supposed to be stateless. An open TCP connection should not have meaning. > > Another idea: > > > > Old virt-v2v uses qemu-img convert which does not flush by default. > > New virt-v2v uses nbdcopy with the --flush option, so it will call > > imageio PATCH /path ... "op":"flush" at the end. However removing > > nbdcopy --flush didn't help very much (only a couple of seconds off). > > Calling flush is right, we cannot remove it. Although there is bug in nbdkit, > and if flushes during close() flow even if you did not send a flush. > > > Do you have any other ideas? > > > > What exactly does imageio do during the finalizing_success state? > > The flow is this: > - engine send request to vdsm to delete the ticket > - vdsm connects to imageio control socket and send DELETE /tickets/{ticket-id} > - imageio mark the ticket as canceled, so no new request can succeed > and no new connection to attach to the ticket > - imageio mark all ongoing requests as canceled, so if the request is in > read/write loop, it will abort on the the next read/write complete > - upstream: imageio waits until all ongoing operations complete > - release: imageio waits until all connections are closed This may answer my HTTP stateless point above. > - if the ticket could not be removed in 60 seconds, imageio returns 409 > Conflict > - imageio returns 200 OK > - engine retries the delete if it failed > - when engine succeeds and finish other finalization tasks, the image > transfer will > be reported as finished. > > I would expect the flow to time out on virt-v2v side, since we use 60 seconds > timeout, and when removing a ticket times out, it takes 60 seconds, so you > should not succeed to finalize in 61 seconds. There is a bug in ovirt > engine that cause this flow to succeed. Thanks, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html _______________________________________________ Libguestfs mailing list Libguestfs@redhat.com https://listman.redhat.com/mailman/listinfo/libguestfs