On Wed, Aug 02, 2017 at 01:56:46PM +0800, Peter Xu wrote: > On Tue, Aug 01, 2017 at 12:03:48PM +0100, Daniel P. Berrange wrote: > > On Fri, Jul 28, 2017 at 04:06:25PM +0800, Peter Xu wrote: > > > It will be used when we want to resume one paused migration. > > > > > > Signed-off-by: Peter Xu <pet...@redhat.com> > > > --- > > > hmp-commands.hx | 7 ++++--- > > > hmp.c | 4 +++- > > > migration/migration.c | 2 +- > > > qapi-schema.json | 5 ++++- > > > 4 files changed, 12 insertions(+), 6 deletions(-) > > > > I'm not seeing explicit info about how we handle the original failure > > and how it relates to this resume command, but this feels like a > > potentially racy approach to me. > > > > If we have a network problem between source & target, we could see > > two results. Either the TCP stream will simply hang (it'll still > > appear open to QEMU but no traffic will be flowing), > > (let's say this is the "1st condition") > > > or the connection > > may actually break such that we get EOF and end up closing the file > > descriptor. > > (let's say this is the "2nd condition") > > > > > In the latter case, we're ok because the original channel is now > > gone and we can safely establish the new one by issuing the new > > 'migrate --resume URI' command. > > > > In the former case, however, there is the possibility that the > > hang may come back to life at some point, concurrently with us > > trying to do 'migrate --resume URI' and I'm unclear on the > > semantics if that happens. > > > > Should the original connection carry on, and thus cause the > > 'migrate --resume' command to fail, or will we forcably terminate > > the original connection no matter what and use the new "resumed" > > connection. > > Hmm yes, this is a good question. Currently this series is only > handling the 2nd condition, say, when we can detect the error via > system calls (IIUC we can know nothing when the 1st condition is > encountered, we just e.g. block at the system calls as usual when > reading the file handle). And currently the "resume" command is only > allowed if the 2nd condition is detected (so it will never destroy an > existing channel). > > If you see the next following patch, there is something like: > > if (has_resume && resume) { > if (s->state != MIGRATION_STATUS_POSTCOPY_PAUSED) { > error_setg(errp, "Cannot resume if there is no " > "paused migration"); > return; > } > goto do_resume; > } > > And here MIGRATION_STATUS_POSTCOPY_PAUSED will only be set when the > 2nd condition is met. > > > > > There's also synchronization with the target host - at the time we > > want to recover, we need to be able to tell the target to accept > > new incoming clients again, but we don't want to do that if the > > original connection comes back to life. > > Yeah, I hacked this part in this v1 series (as you may have seen) to > keep the ports open-forever. I am not sure whether that is acceptable, > but looks not. :) > > How about this: when destination detected 2nd condition, it firstly > switch to "postcopy-pause" state, then re-opens the accept channels. > And it can turns the accept channels off when the state moves out of > "postcopy-pause". > > > > > It feels to me that if the mgmt app or admin believes the migration > > is in a stuck state, we should be able to explicitly terminate the > > existing connection via a monitor command. Then setup the target > > host to accept new client, and then issue this migrate resume on > > the source. > > Totally agree. That should be the only way to handle 1st condition > well. However, would you mind if I postpone it a bit? IMHO as long as > we can solve the 2nd condition nicely (which is the goal of this > series), then it won't be too hard to continue support the 1st > condition.
Sure, the 1st scenario is an easy bolt on to the second scenario. I just wanted to be clear about what the target of these patches is, because I think the 1st scenario is probably the most common one. I guess if you have TCP keepalives enabled with a reasonably short timeout, the 1st scenario will turn into the 2nd scenario fairly quickly. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|