Anthony Liguori <anth...@codemonkey.ws> wrote:
> On 06/16/2010 11:17 AM, Juan Quintela wrote:
>> Anthony Liguori<anth...@codemonkey.ws>  wrote:
>>    
>>> On 06/16/2010 08:11 AM, Juan Quintela wrote:
>>>      
>>    
>>> It's only ensured if you've got the same disk image running on another
>>> machine.  Considering that we support migrating from a file and we
>>> support migrating block devices, I don't think it's practical.
>>>
>>>      
>>>> - outgoing migration
>>>>
>>>> After successful migration, we can issue "cont" command in source, and
>>>> having source and target running at the same time ->   disk corruption
>>>> again.
>>>>
>>>> My suggestion:
>>>> - add a third state "incoming", and cont/stop don't work on that state
>>>> - add a fourth state "migrated", and "cont" gives an explicit error, and
>>>>     you have to run "cont --force" or "cont" twice (whatever) to get it
>>>>     to continue.
>>>>
>>>>        
>>> Very few users are going to do manual migration like this and those
>>> that do have no good reason to execute cont in either of these
>>> scenarios.
>>>      
>> as of today, libvirt uses it (guess who filed that bug with me).
>>    
>
> libvirt is not a human so I fail to see how forcing it to use a
> --force option would help them.

There were two bugs.  The incoming-vs-cont one was the libvirt one.

> Either we didn't document migration well enough or their developers
> are not careful enough.  Considering our lack of documentation, I'm
> sure it was the former.

>> I had to debug this one from testers/field.  They were testing things
>> and it was very "practical" to launch guest on machine A, configure
>> whatever they wanted, migrate to machine B.  test whatever on machine B.
>> back to machine A, continue.
>>    
>
> Honestly, that's a terrible testing strategy.  You cannot just execute
> random commands and hope nothing bad happens.

As a testing strategy: ugly, but I find that what testers do, customers
do later, so .....

>>> We should try to inform users when it's likely that they'll stumble
>>> upon a dangerous action.  cache=volatile is a good example of this
>>> because a user could have used it pretty easily and it's a reasonable
>>> expectation that we wouldn't expose a feature that could lead to
>>> corruption in obscure cases.
>>>      
>> This is not _so_ obscure if you run qemu by hand :(
>> you have a nice "(qemu)" prompt, and if you issue "cont", bad things happen.
>>    
>
> And if you issue system_reset, quit, commit, loadvm, pci_del, or any
> set of commands bad things can happen including some form of data loss
> or corruption.

pci_del, system_reset, .... look dangerous.

"cont" doesn't look dangerous; it looks like "continue a stopped machine".

> IMHO, there's a significant difference between twiddling something
> where there is a reasonable expectation that the impact is only going
> to be related to performance (like -smp X, -m X, or cache=X) and just
> trying random things.

[...]

> If there was a reasonable belief that it wouldn't result in disaster,
> I would fully support you.  However, I can't think of any rational
> reason why someone would do this.  I can't think of a better analogy
> to shooting yourself in the foot.

This has been reported to me several times.  Guess why I want to make it
more difficult.
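
One way the two extra states suggested earlier in the thread ("incoming"
and "migrated") could behave, as a minimal Python sketch rather than actual
QEMU code.  All names here (VMState, MonitorError, cont, stop) are
hypothetical and chosen only for illustration, and the behaviour of "stop"
in the "migrated" state is an assumption, since the proposal does not say.

```python
from enum import Enum, auto


class VMState(Enum):
    """Hypothetical run states; INCOMING and MIGRATED are the proposed additions."""
    RUNNING = auto()
    STOPPED = auto()
    INCOMING = auto()   # waiting for an incoming migration to complete
    MIGRATED = auto()   # outgoing migration finished; the target now runs the guest


class MonitorError(Exception):
    """Raised when a monitor command is refused in the current state."""


def cont(state, force=False):
    """Return the state after "cont", or raise MonitorError if it is refused."""
    if state is VMState.INCOMING:
        # Proposal: cont/stop do not work while an incoming migration is pending.
        raise MonitorError("cont: not allowed while waiting for incoming migration")
    if state is VMState.MIGRATED and not force:
        # Proposal: explicit error unless the user insists.
        raise MonitorError("cont: guest was migrated away; force is required")
    return VMState.RUNNING


def stop(state):
    """Return the state after "stop", or raise MonitorError if it is refused."""
    if state is VMState.INCOMING:
        raise MonitorError("stop: not allowed while waiting for incoming migration")
    if state is VMState.MIGRATED:
        # Assumption: the guest is already paused here, so "stop" changes nothing.
        return VMState.MIGRATED
    return VMState.STOPPED
```

With this, a plain "cont" on the source after migration fails loudly instead
of silently starting a second writer on the same disk image, while
cont(VMState.MIGRATED, force=True) still lets someone who really means it go
ahead.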

>> As I have received this bug from users a couple of times, I would like
>> to be able to prevent this case.
>>    
>
> I've never seen anyone run into this before.  Can you show me a
> bug report?  I'd love to see how someone expected this to behave.

Will search on bugzilla.  My main problem with it is that it took me ages to
discover that the problem was this.

Later, Juan.
