On 2016/2/3 17:55, Wen Congyang wrote:
On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
time if the disk is slow/failing. bdrv_drain_all() blocks until all
in-flight I/O requests have completed. What does the Primary do if the
Secondary becomes unresponsive?
Actually, we know about this problem, but currently there seems to be no better
way to resolve it. Do you have any ideas?
Is it possible to hold the checkpoint information and acknowledge the
checkpoint right away, without waiting for bdrv_drain_all() or any
Secondary guest activity to complete?
There is no way to know whether the Secondary has become unresponsive.
I meant: does the Secondary need to vm_stop() and apply the checkpoint
before acknowledging the checkpoint to the Primary?
I don't understand this.
Here is the COLO checkpoint flow:

Primary                                  Secondary
new checkpoint notice             --->
vm_stop()                                vm_stop()
vm state (device state, memory, cpu) --->
                                         load state
                                  <---   done
vm_start()                               vm_start()
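
To make the coupling concrete, here is a toy sketch of the flow above in
plain C. The function names (vm_stop, send_vm_state, etc.) are just stubs
for illustration, not QEMU's real API:

#include <stdio.h>

/* Stubs standing in for the real work. */
static void vm_stop(const char *who)  { printf("%s: vm_stop()\n", who); }
static void vm_start(const char *who) { printf("%s: vm_start()\n", who); }
static void send_vm_state(void)       { printf("Primary: send device/memory/cpu state\n"); }
static void load_vm_state(void)       { printf("Secondary: load state\n"); }

static void secondary_checkpoint(void)
{
    vm_stop("Secondary");        /* may block on in-flight disk I/O */
    load_vm_state();
    printf("Secondary: done\n"); /* ack is sent only after the slow steps */
    vm_start("Secondary");
}

static void primary_checkpoint(void)
{
    printf("Primary: new checkpoint notice\n");
    vm_stop("Primary");
    send_vm_state();
    /* The Primary blocks here until the Secondary replies "done", so a
     * stuck vm_stop() on the Secondary stalls the Primary too. */
    secondary_checkpoint();
    vm_start("Primary");
}

int main(void)
{
    primary_checkpoint();
    return 0;
}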
If the Secondary's vm_stop() call blocks then the Primary is stuck too.
I was wondering whether the Secondary can do:

                                  <---   done
                                         vm_stop()
                                         load state

It simply receives the checkpoint data into a buffer and immediately
replies with "done". vm_stop() and load state are only performed after
sending "done".
The Secondary VM is running, so we also need to fetch the pages that have
been dirtied by the Secondary VM but not by the Primary VM.
We have two ways to do it:
1. Cache all of the original memory in the Secondary QEMU.
2. Send the dirty PFN list to the Primary QEMU, and fetch those pages from it.
If we ack the checkpoint and then call vm_stop(), we can only choose option 1
(sketched below), which means the Secondary QEMU consumes more memory.
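
Here is a toy sketch of what option 1 involves (all names invented for
illustration): before the Secondary VM dirties a page, keep a copy of its
original contents, so those pages can be rolled back at the next
checkpoint; each cached page costs extra memory:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NR_PAGES  4

static char guest_ram[NR_PAGES][PAGE_SIZE];
static char *orig_page[NR_PAGES];   /* NULL = not dirtied since checkpoint */

/* Called on the first write to a page after a checkpoint. */
static void cache_original(int pfn)
{
    if (!orig_page[pfn]) {
        orig_page[pfn] = malloc(PAGE_SIZE);     /* the extra memory cost */
        memcpy(orig_page[pfn], guest_ram[pfn], PAGE_SIZE);
    }
}

/* At checkpoint time: roll dirtied pages back to their original contents. */
static void restore_originals(void)
{
    for (int pfn = 0; pfn < NR_PAGES; pfn++) {
        if (orig_page[pfn]) {
            memcpy(guest_ram[pfn], orig_page[pfn], PAGE_SIZE);
            free(orig_page[pfn]);
            orig_page[pfn] = NULL;
        }
    }
}

int main(void)
{
    cache_original(2);            /* Secondary VM is about to write page 2 */
    guest_ram[2][0] = 'X';
    restore_originals();          /* checkpoint: page 2 is rolled back */
    printf("page 2 byte 0 = %d\n", guest_ram[2][0]);
    return 0;
}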
In COLO mode we compare the outgoing packets, and do a checkpoint if the
application-level data differs. If we ack the checkpoint and then call
vm_stop(), the client cannot receive any more data until the Secondary VM
is running again, so we still end up waiting for the Secondary VM.
The advantage is that the Primary will not be delayed by the Secondary.
It's an approach that doesn't block.
But perhaps it's a problem if the Secondary is slower than the Primary
since the Secondary still needs to complete vm_stop() and load state
before it can resume execution?
I think this really means falling back to microcheckpointing until the
Secondary guest can checkpoint. Instead of a blocking vm_stop() we
would prevent vcpus from running and when the last pending I/O finishes
the Secondary could apply the last checkpoint. This approach does not
block QEMU (the monitor, etc).
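
Something along these lines (plain C, every name invented for
illustration): the checkpoint notice only pauses the vcpus and marks a
checkpoint as pending, and the completion callback of the last in-flight
I/O request applies it, so nothing ever blocks:

#include <stdio.h>
#include <stdbool.h>

static int  pending_io;
static bool checkpoint_pending;

static void pause_vcpus(void)      { printf("vcpus paused\n"); }
static void apply_checkpoint(void) { printf("checkpoint applied\n"); }

/* Called when the Primary's checkpoint arrives. */
static void checkpoint_notify(void)
{
    pause_vcpus();               /* cheap, does not wait for I/O */
    checkpoint_pending = true;
    if (pending_io == 0) {
        apply_checkpoint();
        checkpoint_pending = false;
    }
    /* Otherwise return immediately: the monitor etc. keep running. */
}

/* I/O completion callback, e.g. from the block layer. */
static void io_complete(void)
{
    if (--pending_io == 0 && checkpoint_pending) {
        apply_checkpoint();      /* the last in-flight request finished */
        checkpoint_pending = false;
    }
}

int main(void)
{
    pending_io = 2;              /* two requests in flight */
    checkpoint_notify();         /* returns without blocking */
    io_complete();
    io_complete();               /* checkpoint applied here */
    return 0;
}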
If the Secondary host becomes unresponsive, it means that we cannot do
microcheckpointing.
We should do failover in this case.
This is dangerous because it means that a delay/failure in the Secondary
would cause the Primary to fail over to the broken Secondary. All the
more reason not to perform blocking operations on the Secondary in the
checkpoint code path.
If the Secondary is broken, the Primary QEMU will take over.
Does the Primary use a timeout between "new checkpoint notice" and
Secondary's "done" so it can move on if the Secondary is unresponsive?
Stefan

To hailiang:
IIRC, we don't use a timeout, but I think we can add one. In our design, there is
an external heartbeat to check the Primary and Secondary status, and decide when
to do a checkpoint.

Thanks
Wen Congyang

Yes, we may need a timeout to help detect the unresponsive case
which cannot be caught by the external heartbeat module.
I will investigate it.

Thanks,
Hailiang
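
For reference, a toy sketch of such a timeout on the Primary side (the
5-second value and every name here are invented; a real implementation
would use the event loop's timers instead of polling):

#include <stdio.h>
#include <stdbool.h>
#include <time.h>

#define CHECKPOINT_ACK_TIMEOUT_S 5

/* Stub: poll the Secondary for "done"; returns true once the ack arrives. */
static bool ack_arrived(void) { return false; /* unresponsive Secondary */ }

static void failover_to_primary(void) { printf("Primary: failover, run alone\n"); }

static bool wait_for_done(void)
{
    time_t deadline = time(NULL) + CHECKPOINT_ACK_TIMEOUT_S;
    while (time(NULL) < deadline) {
        if (ack_arrived()) {
            return true;
        }
        /* In a real event loop this would be a timer, not busy-waiting. */
    }
    return false;
}

int main(void)
{
    if (!wait_for_done()) {
        /* Even a case the heartbeat misses is caught by the timeout. */
        failover_to_primary();
    }
    return 0;
}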