On Thu, Oct 12, 2017 at 01:19:52PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Tue, Oct 10, 2017 at 01:30:18PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (pet...@redhat.com) wrote:
> > > > On Mon, Oct 09, 2017 at 07:58:13PM +0100, Dr. David Alan Gilbert wrote:
> > > >
> > > > [...]
> > > >
> > > > > We have to be careful about this; a network can fail in a way it
> > > > > gets stuck rather than fails - this can get stuck until a full TCP
> > > > > disconnection; and that takes about 30mins (from memory).
> > > > > The nice thing about using 'shutdown' is that you can kill the
> > > > > existing connection if it's hung. (Which then makes an interesting
> > > > > question; the rules in your migrate-incoming command become
> > > > > different if you want to declare it's failed!). Having said that,
> > > > > you're right that at this point stuff has already failed - so do we
> > > > > need the shutdown? (You might want to do the shutdown as part of
> > > > > the recovery earlier or as a separate command to force the failure)
> > > >
> > > > I assume if I call shutdown before the lock then we'll be good then.
> > >
> > > The question is what happens if you only allow recovery if we're already
> > > in postcopy-paused state; in the case of a hung socket, since no IO has
> > > actually failed yet, you will still be in postcopy-active.
> >
> > Hmm, but isn't that a problem of kernel rather than QEMU?  Since
> > sockets are after all managed by kernel.
>
> Kind of, but it comes down to what the right behaviour of a TCP socket
> is, and the kernel is probably doing the right thing.
>
> > I don't really know what is the best thing to do to detect whether a
> > socket is stuck.  Assume we can observe that (say, we see migration
> > transferred bytes keep static for 30 seconds), IIRC you mentioned
> > about iptable tricks to break an existing e.g. TCP connection, then we
> > can trigger the -EIO path.
>
> From the qemu level I'd prefer to make it a command; if we start
> adding heuristics and timeouts etc then it's very difficult to actually
> get them right.
>
> > Or do you think we should provide a way to manually trigger the paused
> > state?  Then it goes back to something we discussed with Dan in the
> > earlier post - I'd appreciate if we can postpone the manual trigger
> > support a bit (to make this series small, which is already not...).
>
> I think that manual trigger is probably necessary; it would just call a
> shutdown() on the sockets and let the things fail into the paused state.
> It'd be pretty simple.  It would be another OOB command; the tricky
> part is just making sure it's thread safe against the migration
> finishing when you issue it.
>
> I think it can wait until after this series if you want, but it would
> be good if we can figure it out.

OK.  Let me try it in my next post.  I hope it won't grow into
something bigger (which does happen sometimes... :).

--
Peter Xu
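
For readers following the thread: the manual trigger discussed above relies on shutdown() waking up a reader that is blocked on a socket which has not yet reported any error. Below is a minimal sketch of that behaviour, in Python rather than QEMU's C. The names `Channel`, `force_pause` and `run_demo` are hypothetical and only illustrate the idea; this is not QEMU code, and the lock is just one assumed way to keep the trigger safe against the channel being closed concurrently.

```python
import socket
import threading
import time


class Channel:
    """Hypothetical wrapper around a migration-like socket (illustration only)."""

    def __init__(self, sock: socket.socket) -> None:
        self._sock = sock
        self._lock = threading.Lock()
        self._closed = False

    def recv(self, n: int) -> bytes:
        # May block indefinitely if the link is hung and no I/O error is raised.
        return self._sock.recv(n)

    def force_pause(self) -> bool:
        """Shut the socket down so that blocked I/O fails right away.

        Returns False if the channel was already closed (for example because
        the transfer finished first), so a late trigger does nothing.
        """
        with self._lock:
            if self._closed:
                return False
            try:
                self._sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # the connection was already gone
            return True

    def close(self) -> None:
        with self._lock:
            if not self._closed:
                self._closed = True
                self._sock.close()


def run_demo() -> tuple:
    local, peer = socket.socketpair()  # the peer stays open but never sends
    chan = Channel(local)
    received = []

    reader = threading.Thread(target=lambda: received.append(chan.recv(4096)))
    reader.start()
    time.sleep(0.2)  # give the reader time to block in recv()

    triggered = chan.force_pause()  # the "manual trigger"
    reader.join(timeout=5)
    reader_done = not reader.is_alive()

    chan.close()
    late_trigger = chan.force_pause()  # after close: must be a no-op
    peer.close()
    return triggered, reader_done, received, late_trigger


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the reader sees end-of-file as soon as the trigger fires, instead of waiting for the kernel to give up on the connection, and a trigger issued after the channel has been closed is ignored.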