On 06/16/2010 11:17 AM, Juan Quintela wrote:
Anthony Liguori <anth...@codemonkey.ws> wrote:
On 06/16/2010 08:11 AM, Juan Quintela wrote:
It's only ensured if you've got the same disk image running on another
machine. Considering that we support migrating from a file and we
support migrating block devices, I don't think it's practical.
- outgoing migration
After a successful migration, we can issue the "cont" command on the
source, leaving source and target running at the same time -> disk
corruption again.
My suggestion:
- add a third state "incoming", and cont/stop don't work in that state
- add a fourth state "migrated", where "cont" gives an explicit error, and you
have to run "cont --force" or "cont" twice (whatever) to get it to continue.
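
To make the two proposed states concrete, here is a rough sketch of the behaviour being suggested. It is only an illustration in Python, not QEMU code: the names RunState, Guest, MonitorError and the force parameter are all hypothetical, and "cont --force" is the proposal from this thread, not an existing monitor option.

```python
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    INCOMING = "incoming"   # proposed: waiting for an incoming migration
    MIGRATED = "migrated"   # proposed: outgoing migration has completed


class MonitorError(Exception):
    """Raised when a command is refused in the current state (hypothetical)."""


class Guest:
    """Toy model of the run state only; no real devices or disks involved."""

    def __init__(self, state=RunState.RUNNING):
        self.state = state

    def stop(self):
        if self.state is RunState.INCOMING:
            raise MonitorError("stop: guest is waiting for incoming migration")
        if self.state is RunState.RUNNING:
            self.state = RunState.STOPPED
        # In STOPPED or MIGRATED there is nothing to stop, so the state is kept.

    def cont(self, force=False):
        if self.state is RunState.INCOMING:
            raise MonitorError("cont: guest is waiting for incoming migration")
        if self.state is RunState.MIGRATED and not force:
            raise MonitorError(
                "cont: guest was migrated away; resuming may corrupt "
                "shared disks (use force to override)"
            )
        self.state = RunState.RUNNING

    def outgoing_migration_completed(self):
        self.state = RunState.MIGRATED

    def incoming_migration_completed(self):
        self.state = RunState.RUNNING


if __name__ == "__main__":
    source = Guest()
    source.outgoing_migration_completed()
    try:
        source.cont()
    except MonitorError as err:
        print("refused:", err)
    source.cont(force=True)  # explicit override, as in the proposal
    print(source.state)
```

The point of the sketch is only that a plain "cont" after an outgoing migration fails loudly, while the override stays available for whoever really means it.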
Very few users are going to do manual migration like this and those
that do have no good reason to execute cont in either of these
scenarios.
As of today, libvirt uses it (guess who filed that bug with me).
libvirt is not a human so I fail to see how forcing it to use a --force
option would help them.
Either we didn't document migration well enough or their developers are
not careful enough. Considering our lack of documentation, I'm sure it
was the former.
A --force command like this is equivalent to popping up a
message box saying "are you sure you really want to do this" which
most users find to be extremely annoying.
I had to debug this one from testers/field. They were testing things
and it was very "practical" to launch a guest on machine A, configure
whatever they wanted, migrate to machine B, test whatever on machine B,
then go back to machine A and continue.
Honestly, that's a terrible testing strategy. You cannot just execute
random commands and hope nothing bad happens.
You can guess what happened. The problem here is that qemu is not
giving the user even the _minimal_ advice that something could go wrong.
And it is not that it could go wrong, it is going to cause disk
corruption for sure :(
We should try to inform users when it's likely that they'll stumble
upon a dangerous action. cache=volatile is a good example of this
because a user could have used it pretty easily and it's a reasonable
expectation that we wouldn't expose a feature that could lead to
corruption in obscure cases.
This is not _so_ obscure if you run qemu by hand :(
You have a nice "(qemu)" prompt, and if you issue "cont", bad things happen.
And if you issue system_reset, quit, commit, loadvm, pci_del, or any
number of other commands, bad things can happen, including some form of
data loss or corruption.
IMHO, there's a significant difference between twiddling something where
there is a reasonable expectation that the impact is only going to be
related to performance (like -smp X, -m X, or cache=X) and just trying
random things.
If a user executes cont in either of these scenarios and has two
copies of a virtual machine running accessing the same resources, then
they surely ought to expect bad behavior.
It is not _so_ easy O:-).
Consider the example that I showed you:
(host A)                        (host B)
launch qemu                     launch qemu -incoming
migrate host B
                                .....
                                do your things
                                exit/poweroff/...
At this point you have a qemu launched on machine A, with nothing on
machine B. Running "cont" on machine A has disastrous consequences,
and there is no way to prevent it :(
If there was a reasonable belief that it wouldn't result in disaster, I
would fully support you. However, I can't think of any rational reason
why someone would do this. I can't think of a better analogy to
shooting yourself in the foot.
As I have received this bug from users a couple of times, I would like
to be able to prevent this case.
I've never seen anyone run into this before. Can you show me a bug
report? I'd love to see how someone expected this to behave.
Regards,
Anthony Liguori
Later, Juan.