On Fri, May 19, 2017 at 01:51:43PM +0100, Dr. David Alan Gilbert wrote: > * Daniel P. Berrange (berra...@redhat.com) wrote: > > On Fri, May 19, 2017 at 02:43:27PM +0800, Peter Xu wrote: > > > We don't really have a return path for the other types yet. Let's check > > > this when .get_return_path() is called. > > > > > > For this, we introduce a new feature bit, and set it up only for socket > > > typed IO channels. > > > > > > This will help detect earlier failure for postcopy, e.g., logically > > > speaking postcopy cannot work with "exec:". Before this patch, when we > > > try to migrate with "migrate -d exec:cat>out", we'll hang the system. > > > With this patch, we'll get: > > > > > > (qemu) migrate -d exec:cat>out > > > Unable to open return-path for postcopy > > > > This is wrong - post-copy migration *can* work with exec: - it just entirely > > depends on what command you are running. Your example ran a command which is > > unidirectional, but if you ran 'exec:socat ...' you would have a fully > > bidirectional channel. Actually the channel is always bi-directional, but > > 'cat' simply won't ever send data back to QEMU. > > The thing is it didn't used to be able to; prior to your conversion to > channel, postcopy would reject being started with exec: because it > couldn't open a return path, so it was safe.
It safe but functionally broken because it is valid to want to use exec migration with post-copy. > > If QEMU hangs when the other end doesn't send data back, that actually seems > > like a potentially serious bug in migration code. Even if using the normal > > 'tcp' migration protocol, if the target QEMU server hangs and fails to > > send data to QEMU on the return path, the source QEMU must never hang. > > Hmm, we shouldn't get a 'hang' with a postcopy on a link without a > return path; but it does depend on how the exec: behaves on the > destination. > If the destination discards data written to it, then I think the > behaviour would be: > a) Page requests would just get dropped, they'd eventually get > fulfilled by the background page transmissions, but that could mean that > a page request would wait for minutes for the page. > b) The qemu main thread on the destination can be blocked by that, so > the monitor might not respond until the page request is fulfilled. > c) I'm not quite sure what would happen to the source return-path > thread > > The behaviour seems to have changed between 2.9.0 (f26 package) and my > reasonably recent head build. That's due to the bug we just fixed where we mistakenly didn't allow bi-directional I/O for exec commit 062d81f0e968fe1597474735f3ea038065027372 Author: Daniel P. Berrange <berra...@redhat.com> Date: Fri Apr 21 12:12:20 2017 +0100 migration: setup bi-directional I/O channel for exec: protocol Historically the migration data channel has only needed to be unidirectional. Thus the 'exec:' protocol was requesting an I/O channel with O_RDONLY on incoming side, and O_WRONLY on the outgoing side. This is fine for classic migration, but if you then try to run TLS over it, this fails because the TLS handshake requires a bi-directional channel. Signed-off-by: Daniel P. Berrange <berra...@redhat.com> Reviewed-by: Juan Quintela <quint...@redhat.com> Signed-off-by: Juan Quintela <quint...@redhat.com> > A migration_cancel also doesn't work for 'exec' because it doesn't > support shutdown() - it just sticks in 'cancelling'. > On a socket that was broken like this the cancel would work because > it issues a shutdown() which causes the socket to cleanup. I'm curious why migration_cancel requires shutdown() to work ? Why isn't it sufficient from the source POV to just close the socket entirely straight away. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|