Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

Wen Congyang Tue, 02 Feb 2016 17:29:59 -0800

On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>> Secondary becomes unresponsive?
>>>>
>>>> Actually, we know about this problem, but currently there seems to be no
>>>> better way to resolve it. Do you have any ideas?
>>>
>>> Is it possible to hold the checkpoint information and acknowledge the
>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>> Secondary guest activity to complete?
>>
>> There is no way to know that the Secondary has become unresponsive.
> 
> I meant whether it is necessary for the Secondary to vm_stop() and apply
> the checkpoint before acknowledging the checkpoint to the Primary?


I don't understand this.
Here is the COLO checkpoint flow:

    Primary                                                Secondary
    new checkpoint notice                 --->
    vm_stop()                                              vm_stop()
    vm state(device state, memory, cpu)   --->
                                                           load state
                                          <---             done
    vm_start()                                             vm_start()
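
To make the ordering concrete, here is a minimal sketch of one checkpoint round
as a toy simulation. It is not QEMU code: the Node class, its log list, and
colo_checkpoint() are hypothetical names made up for illustration, and the
"vm state" is just a dict standing in for device state, memory, and cpu.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one side (Primary or Secondary)."""
    name: str
    running: bool = True
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def vm_stop(self):
        self.running = False
        self.log.append("vm_stop")

    def vm_start(self):
        self.running = True
        self.log.append("vm_start")


def colo_checkpoint(primary, secondary):
    """Run one checkpoint round in the order shown in the diagram."""
    # Primary ---> Secondary: new checkpoint notice
    secondary.log.append("recv: new checkpoint notice")

    # Both sides stop their guests.
    primary.vm_stop()
    secondary.vm_stop()

    # Primary ---> Secondary: vm state (device state, memory, cpu)
    snapshot = dict(primary.state)
    secondary.state = snapshot
    secondary.log.append("load state")

    # Secondary ---> Primary: done
    primary.log.append("recv: done")

    # Both sides resume.
    primary.vm_start()
    secondary.vm_start()


if __name__ == "__main__":
    p = Node("primary", state={"cpu": 1, "memory": b"abc"})
    s = Node("secondary")
    colo_checkpoint(p, s)
    print(p.log)
    print(s.log)
```

In this sketch the Primary does not restart until it has seen "done", which is
the step the questions above are about.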
> 
>>> I think this really means falling back to microcheckpointing until the
>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>> would prevent vcpus from running and when the last pending I/O finishes
>>> the Secondary could apply the last checkpoint.  This approach does not
>>> block QEMU (the monitor, etc).
>>>
>>
>> If the Secondary host becomes unresponsive, it means that we cannot do
>> microcheckpointing.
>> We should do failover in this case.
> 
> This is dangerous because it means that a delay/failure in the Secondary
> would cause the Primary to fail over to the broken Secondary.  All the
> more reason not to perform blocking operations on the Secondary in the
> checkpoint code path.

If the Secondary is broken, the Primary QEMU will take over.

Thanks
Wen Congyang

> 
> Stefan
>
