subject:"Re\: \[Qemu\-block\] \[PATCH COLO v3 01\/14\] docs\: block replication's description"

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-08 Thread Dr. David Alan Gilbert

* Kevin Wolf (kw...@redhat.com) wrote:
> Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben:
> > On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > > 
> > > > > 
> > > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > > >> > 
> > > > > >> > That can be done with drive-mirror.  But I think it's too early 
> > > > > >> > for that.
> > > > > > Do you mean use drive-mirror instead of quorum?
> > > > > 
> > > > > Only before starting up a new secondary.  Basically you do a migration
> > > > > with non-shared storage, and then start the secondary in colo mode.
> > > > > 
> > > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > > driver or filter) is fine for normal colo operation.
> > > > 
> > > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > > Secondary so that the system can be protected very quickly after
> > > > failover.
> > > > 
> > > > I think anyone serious about fault tolerance would deploy a Backup
> > > > Secondary, otherwise the system cannot survive two failures unless a
> > > > human administrator is lucky/fast enough to set up a new Secondary.
> > > 
> > > I'd assumed that a higher level management layer would do the allocation
> > > of a new secondary after the first failover, so no human need be involved.
> > 
> > That doesn't help, after the first failover is too late even if it's
> > done by a program.  There should be no window during which the VM is
> > unprotected.
> > 
> > People who want fault tolerance care about 9s of availability.  The VM
> > must be protected on the new Primary as soon as the failover occurs,
> > otherwise this isn't a serious fault tolerance solution.
> 
> If you're worried about two failures in a row, why wouldn't you be
> worried about three in a row? I think if you really want more than one
> backup to be ready, you shouldn't go to two, but to n.

Agreed, if you did multiple secondaries you'd do 'n'.

But 1+2 does satisfy all but the most paranoid; and in particular it does
mean that if you want to take a host down for some maintenance you can
do it without worrying.

But, as I said in my reply to Stefan, doing more than 1+1 gets really hairy;
the combinations of failovers are much more complicated.

Dave
  1) It means that 
  1) As Stefan mentions you get worried about the lack of protection after
the first failover; 
> Kevin


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-08 Thread Kevin Wolf

Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben:
> On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> > 
> > > > >> > That can be done with drive-mirror.  But I think it's too early 
> > > > >> > for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > > 
> > > > Only before starting up a new secondary.  Basically you do a migration
> > > > with non-shared storage, and then start the secondary in colo mode.
> > > > 
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > > 
> > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > Secondary so that the system can be protected very quickly after
> > > failover.
> > > 
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation
> > of a new secondary after the first failover, so no human need be involved.
> 
> That doesn't help, after the first failover is too late even if it's
> done by a program.  There should be no window during which the VM is
> unprotected.
> 
> People who want fault tolerance care about 9s of availability.  The VM
> must be protected on the new Primary as soon as the failover occurs,
> otherwise this isn't a serious fault tolerance solution.

If you're worried about two failures in a row, why wouldn't you be
worried about three in a row? I think if you really want more than one
backup to be ready, you shouldn't go to two, but to n.

Kevin


pgptDGm99rx0M.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-08 Thread Dr. David Alan Gilbert

* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> > 
> > > > >> > That can be done with drive-mirror.  But I think it's too early 
> > > > >> > for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > > 
> > > > Only before starting up a new secondary.  Basically you do a migration
> > > > with non-shared storage, and then start the secondary in colo mode.
> > > > 
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > > 
> > > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > > Secondary so that the system can be protected very quickly after
> > > failover.
> > > 
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation
> > of a new secondary after the first failover, so no human need be involved.
> 
> That doesn't help, after the first failover is too late even if it's
> done by a program.  There should be no window during which the VM is
> unprotected.
>
> People who want fault tolerance care about 9s of availability.  The VM
> must be protected on the new Primary as soon as the failover occurs,
> otherwise this isn't a serious fault tolerance solution.

I'm not aware of any other system that manages that, so I don't
think that's fair.

You gain a lot more availability going from a single
system to the 1+1 system that COLO (or any of the checkpointing systems)
propose, I can't say how many 9s it gets you.  It's true having multiple
secondaries would get you a bit more on top of that, but you're still
a lot better off just having the one secondary.

I had thought that having >1 secondary would be a nice addition, but it's
a big change everywhere else (e.g. having to maintain multiple migration
streams, dealing with miscompares from multiple hosts).

Dave


> 
> Stefan


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-08 Thread Stefan Hajnoczi

On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > >> > 
> > > >> > That can be done with drive-mirror.  But I think it's too early for 
> > > >> > that.
> > > > Do you mean use drive-mirror instead of quorum?
> > > 
> > > Only before starting up a new secondary.  Basically you do a migration
> > > with non-shared storage, and then start the secondary in colo mode.
> > > 
> > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > driver or filter) is fine for normal colo operation.
> > 
> > Perhaps this patch series should mirror the Secondary's disk to a Backup
> > Secondary so that the system can be protected very quickly after
> > failover.
> > 
> > I think anyone serious about fault tolerance would deploy a Backup
> > Secondary, otherwise the system cannot survive two failures unless a
> > human administrator is lucky/fast enough to set up a new Secondary.
> 
> I'd assumed that a higher level management layer would do the allocation
> of a new secondary after the first failover, so no human need be involved.

That doesn't help, after the first failover is too late even if it's
done by a program.  There should be no window during which the VM is
unprotected.

People who want fault tolerance care about 9s of availability.  The VM
must be protected on the new Primary as soon as the failover occurs,
otherwise this isn't a serious fault tolerance solution.

Stefan


pgps2BxZf832m.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-05 Thread Fam Zheng

On Wed, 05/06 02:26, Dong, Eddie wrote:
> 
> 
> > -Original Message-
> > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> > Sent: Tuesday, May 05, 2015 11:24 PM
> > To: Stefan Hajnoczi
> > Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
> > block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
> > Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com
> > Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
> > 
> > * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > > >
> > > >
> > > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > > >> >
> > > > >> > That can be done with drive-mirror.  But I think it's too early 
> > > > >> > for that.
> > > > > Do you mean use drive-mirror instead of quorum?
> > > >
> > > > Only before starting up a new secondary.  Basically you do a
> > > > migration with non-shared storage, and then start the secondary in colo
> > mode.
> > > >
> > > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > > driver or filter) is fine for normal colo operation.
> > >
> > > Perhaps this patch series should mirror the Secondary's disk to a
> > > Backup Secondary so that the system can be protected very quickly
> > > after failover.
> > >
> > > I think anyone serious about fault tolerance would deploy a Backup
> > > Secondary, otherwise the system cannot survive two failures unless a
> > > human administrator is lucky/fast enough to set up a new Secondary.
> > 
> > I'd assumed that a higher level management layer would do the allocation of 
> > a
> > new secondary after the first failover, so no human need be involved.
> > 
> 
> I agree. The cloud OS, such as open stack, will have the capability to handle
> the case, together with certain API in VMM side for this (libvirt?). 

The question here is the QMP API to switch secondary mode to primary mode is
not mentioned in this series.  I think that interface matters for this series.

Fam

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-05 Thread Dong, Eddie



> -Original Message-
> From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> Sent: Tuesday, May 05, 2015 11:24 PM
> To: Stefan Hajnoczi
> Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
> block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
> Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com
> Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
> 
> * Stefan Hajnoczi (stefa...@redhat.com) wrote:
> > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > >
> > >
> > > On 24/04/2015 11:38, Wen Congyang wrote:
> > > >> >
> > > >> > That can be done with drive-mirror.  But I think it's too early for 
> > > >> > that.
> > > > Do you mean use drive-mirror instead of quorum?
> > >
> > > Only before starting up a new secondary.  Basically you do a
> > > migration with non-shared storage, and then start the secondary in colo
> mode.
> > >
> > > But it's only for the failover case.  Quorum (or a new block/colo.c
> > > driver or filter) is fine for normal colo operation.
> >
> > Perhaps this patch series should mirror the Secondary's disk to a
> > Backup Secondary so that the system can be protected very quickly
> > after failover.
> >
> > I think anyone serious about fault tolerance would deploy a Backup
> > Secondary, otherwise the system cannot survive two failures unless a
> > human administrator is lucky/fast enough to set up a new Secondary.
> 
> I'd assumed that a higher level management layer would do the allocation of a
> new secondary after the first failover, so no human need be involved.
> 

I agree. The cloud OS, such as open stack, will have the capability to handle 
the case, together with certain API in VMM side for this (libvirt?). 

Thx Eddie

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-05 Thread Dr. David Alan Gilbert

* Stefan Hajnoczi (stefa...@redhat.com) wrote:
> On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 24/04/2015 11:38, Wen Congyang wrote:
> > >> > 
> > >> > That can be done with drive-mirror.  But I think it's too early for 
> > >> > that.
> > > Do you mean use drive-mirror instead of quorum?
> > 
> > Only before starting up a new secondary.  Basically you do a migration
> > with non-shared storage, and then start the secondary in colo mode.
> > 
> > But it's only for the failover case.  Quorum (or a new block/colo.c
> > driver or filter) is fine for normal colo operation.
> 
> Perhaps this patch series should mirror the Secondary's disk to a Backup
> Secondary so that the system can be protected very quickly after
> failover.
> 
> I think anyone serious about fault tolerance would deploy a Backup
> Secondary, otherwise the system cannot survive two failures unless a
> human administrator is lucky/fast enough to set up a new Secondary.

I'd assumed that a higher level management layer would do the allocation
of a new secondary after the first failover, so no human need be involved.

Dave

> Stefan


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-30 Thread Stefan Hajnoczi

On Wed, Apr 29, 2015 at 04:37:49PM +0800, Gonglei wrote:
> On 2015/4/29 16:29, Paolo Bonzini wrote:
> > 
> > 
> > On 27/04/2015 11:37, Stefan Hajnoczi wrote:
>  But it's only for the failover case.  Quorum (or a new 
>  block/colo.c driver or filter) is fine for normal colo 
>  operation.
> >> Perhaps this patch series should mirror the Secondary's disk to a 
> >> Backup Secondary so that the system can be protected very quickly 
> >> after failover.
> >>
> >> I think anyone serious about fault tolerance would deploy a Backup
> >>  Secondary, otherwise the system cannot survive two failures
> >> unless a human administrator is lucky/fast enough to set up a new 
> >> Secondary.
> > 
> > Let's do one thing at a time.  Otherwise nothing of this is going to
> > be ever completed...
> > 
> Yes, and the continuous backup feature is on our TODO list. We hope
> this series (including basic functions and  COLO framework) can be
> upstream first.

That's fine, I just wanted to make sure you have the issue in mind.

Stefan


pgptBg6Vh_cn5.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-29 Thread Paolo Bonzini



On 27/04/2015 11:37, Stefan Hajnoczi wrote:
>>> But it's only for the failover case.  Quorum (or a new 
>>> block/colo.c driver or filter) is fine for normal colo 
>>> operation.
> Perhaps this patch series should mirror the Secondary's disk to a 
> Backup Secondary so that the system can be protected very quickly 
> after failover.
> 
> I think anyone serious about fault tolerance would deploy a Backup
>  Secondary, otherwise the system cannot survive two failures
> unless a human administrator is lucky/fast enough to set up a new 
> Secondary.

Let's do one thing at a time.  Otherwise nothing of this is going to
be ever completed...

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-29 Thread Gonglei

On 2015/4/29 16:29, Paolo Bonzini wrote:
> 
> 
> On 27/04/2015 11:37, Stefan Hajnoczi wrote:
 But it's only for the failover case.  Quorum (or a new 
 block/colo.c driver or filter) is fine for normal colo 
 operation.
>> Perhaps this patch series should mirror the Secondary's disk to a 
>> Backup Secondary so that the system can be protected very quickly 
>> after failover.
>>
>> I think anyone serious about fault tolerance would deploy a Backup
>>  Secondary, otherwise the system cannot survive two failures
>> unless a human administrator is lucky/fast enough to set up a new 
>> Secondary.
> 
> Let's do one thing at a time.  Otherwise nothing of this is going to
> be ever completed...
> 
Yes, and the continuous backup feature is on our TODO list. We hope
this series (including basic functions and  COLO framework) can be
upstream first.

Regards,
-Gonglei

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-27 Thread Stefan Hajnoczi

On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 11:38, Wen Congyang wrote:
> >> > 
> >> > That can be done with drive-mirror.  But I think it's too early for that.
> > Do you mean use drive-mirror instead of quorum?
> 
> Only before starting up a new secondary.  Basically you do a migration
> with non-shared storage, and then start the secondary in colo mode.
> 
> But it's only for the failover case.  Quorum (or a new block/colo.c
> driver or filter) is fine for normal colo operation.

Perhaps this patch series should mirror the Secondary's disk to a Backup
Secondary so that the system can be protected very quickly after
failover.

I think anyone serious about fault tolerance would deploy a Backup
Secondary, otherwise the system cannot survive two failures unless a
human administrator is lucky/fast enough to set up a new Secondary.

Stefan

pgpL0eGPupVHC.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Paolo Bonzini



On 24/04/2015 11:53, Wen Congyang wrote:
>> > Only before starting up a new secondary.  Basically you do a migration
>> > with non-shared storage, and then start the secondary in colo mode.
>> > 
>> > But it's only for the failover case.  Quorum (or a new block/colo.c
>> > driver or filter) is fine for normal colo operation.
> Is nbd+colo needed to connect the NBD server later?

Elsewhere in the thread I proposed a new flag BDRV_O_NO_CONNECT and a
new BlockDriver function pointer bdrv_connect.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Wen Congyang

On 04/24/2015 05:36 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 11:38, Wen Congyang wrote:

 That can be done with drive-mirror.  But I think it's too early for that.
>> Do you mean use drive-mirror instead of quorum?
> 
> Only before starting up a new secondary.  Basically you do a migration
> with non-shared storage, and then start the secondary in colo mode.
> 
> But it's only for the failover case.  Quorum (or a new block/colo.c
> driver or filter) is fine for normal colo operation.

Is nbd+colo needed to connect the NBD server later?

Thanks
Wen Congyang

> 
> Paolo
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Paolo Bonzini

On 24/04/2015 11:38, Wen Congyang wrote:
>> > 
>> > That can be done with drive-mirror.  But I think it's too early for that.
> Do you mean use drive-mirror instead of quorum?

Only before starting up a new secondary.  Basically you do a migration
with non-shared storage, and then start the secondary in colo mode.

But it's only for the failover case.  Quorum (or a new block/colo.c
driver or filter) is fine for normal colo operation.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Wen Congyang

On 04/24/2015 05:04 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
 If we can add a filter dynamically, we can add a filter that's file is nbd
 dynamically after secondary qemu's nbd server is ready. In this case, I 
 think
 there is no need to touch nbd client.
>> Yes, I think maybe the harder part is getting a copy of the current disk
>> contents to the new secondary while the new primary is still running.
> 
> That can be done with drive-mirror.  But I think it's too early for that.

Do you mean use drive-mirror instead of quorum?

Hmm, I don't find the final design for primary QEMU...

Thanks
Wen Congyang

> 
> Paolo
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Paolo Bonzini



On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
>> > If we can add a filter dynamically, we can add a filter that's file is nbd
>> > dynamically after secondary qemu's nbd server is ready. In this case, I 
>> > think
>> > there is no need to touch nbd client.
> Yes, I think maybe the harder part is getting a copy of the current disk
> contents to the new secondary while the new primary is still running.

That can be done with drive-mirror.  But I think it's too early for that.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Dr. David Alan Gilbert

* Wen Congyang (we...@cn.fujitsu.com) wrote:
> On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
> > 
> > 
> > On 24/04/2015 04:16, Wen Congyang wrote:
> >> I think the primary shouldn't do any I/O after failover (and the
> >> secondary should close the NBD server) so it is probably okay to ignore
> >> the removal for now.  Inserting the filter dynamically is probably
> >> needed though.
> 
>  Or maybe just enabling/disabling?
> >> Hmm, after failover, the secondary qemu should become primary qemu, but we 
> >> don't
> >> know the nbd server's IP/port when we execute the secondary qemu. So we 
> >> need
> >> to inserting nbd client dynamically after failover.
> > 
> > True, but secondary->primary switch is already not supported in v3.
> 
> Yes, we should consider it, and support it more easily later.
> 
> If we can add a filter dynamically, we can add a filter that's file is nbd
> dynamically after secondary qemu's nbd server is ready. In this case, I think
> there is no need to touch nbd client.

Yes, I think maybe the harder part is getting a copy of the current disk
contents to the new secondary while the new primary is still running.

Dave

> 
> Thanks
> Wen Congyang
> 
> > 
> > Kevin/Stefan, is there a design document somewhere that covers at least
> > static filters?
> > 
> > Paolo
> > .
> > 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Wen Congyang

On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
> 
> 
> On 24/04/2015 04:16, Wen Congyang wrote:
>> I think the primary shouldn't do any I/O after failover (and the
>> secondary should close the NBD server) so it is probably okay to ignore
>> the removal for now.  Inserting the filter dynamically is probably
>> needed though.

 Or maybe just enabling/disabling?
>> Hmm, after failover, the secondary qemu should become primary qemu, but we 
>> don't
>> know the nbd server's IP/port when we execute the secondary qemu. So we need
>> to inserting nbd client dynamically after failover.
> 
> True, but secondary->primary switch is already not supported in v3.

Yes, we should consider it, and support it more easily later.

If we can add a filter dynamically, we can add a filter that's file is nbd
dynamically after secondary qemu's nbd server is ready. In this case, I think
there is no need to touch nbd client.

Thanks
Wen Congyang

> 
> Kevin/Stefan, is there a design document somewhere that covers at least
> static filters?
> 
> Paolo
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Paolo Bonzini



On 24/04/2015 04:16, Wen Congyang wrote:
>>> >> I think the primary shouldn't do any I/O after failover (and the
>>> >> secondary should close the NBD server) so it is probably okay to ignore
>>> >> the removal for now.  Inserting the filter dynamically is probably
>>> >> needed though.
>> > 
>> > Or maybe just enabling/disabling?
> Hmm, after failover, the secondary qemu should become primary qemu, but we 
> don't
> know the nbd server's IP/port when we execute the secondary qemu. So we need
> to inserting nbd client dynamically after failover.

True, but secondary->primary switch is already not supported in v3.

Kevin/Stefan, is there a design document somewhere that covers at least
static filters?

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang

On 04/24/2015 10:01 AM, Fam Zheng wrote:
> On Thu, 04/23 14:23, Paolo Bonzini wrote:
>>
>>
>> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
> So that means the bdrv_start_replication and bdrv_stop_replication
> callbacks are more or less redundant, at least on the primary?
>
> In fact, who calls them?  Certainly nothing in this patch set...
> :)
>>> In the main colo set (I'm looking at the February version) there
>>> are calls to them, the 'stop_replication' is called at failover time.
>>>
>>> Here is I think the later version:
>>> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html
>>
>> I think the primary shouldn't do any I/O after failover (and the
>> secondary should close the NBD server) so it is probably okay to ignore
>> the removal for now.  Inserting the filter dynamically is probably
>> needed though.
> 
> Or maybe just enabling/disabling?

Hmm, after failover, the secondary qemu should become primary qemu, but we don't
know the nbd server's IP/port when we execute the secondary qemu. So we need
to inserting nbd client dynamically after failover.

Thanks
Wen Congyang

> 
> Fam
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Fam Zheng

On Thu, 04/23 14:23, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
> >> > So that means the bdrv_start_replication and bdrv_stop_replication
> >> > callbacks are more or less redundant, at least on the primary?
> >> > 
> >> > In fact, who calls them?  Certainly nothing in this patch set...
> >> > :)
> > In the main colo set (I'm looking at the February version) there
> > are calls to them, the 'stop_replication' is called at failover time.
> > 
> > Here is I think the later version:
> > http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html
> 
> I think the primary shouldn't do any I/O after failover (and the
> secondary should close the NBD server) so it is probably okay to ignore
> the removal for now.  Inserting the filter dynamically is probably
> needed though.

Or maybe just enabling/disabling?

Fam

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini



On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
>> > So that means the bdrv_start_replication and bdrv_stop_replication
>> > callbacks are more or less redundant, at least on the primary?
>> > 
>> > In fact, who calls them?  Certainly nothing in this patch set...
>> > :)
> In the main colo set (I'm looking at the February version) there
> are calls to them, the 'stop_replication' is called at failover time.
> 
> Here is I think the later version:
> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

I think the primary shouldn't do any I/O after failover (and the
secondary should close the NBD server) so it is probably okay to ignore
the removal for now.  Inserting the filter dynamically is probably
needed though.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Dr. David Alan Gilbert

* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
> > As presented at the moment, I don't see there's any dynamic reconfiguration
> > on the primary side at the moment
> 
> So that means the bdrv_start_replication and bdrv_stop_replication
> callbacks are more or less redundant, at least on the primary?
> 
> In fact, who calls them?  Certainly nothing in this patch set...
> :)

In the main colo set (I'm looking at the February version) there
are calls to them, the 'stop_replication' is called at failover time.

Here is I think the later version:
http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

Dave

> 
> Paolo
> 
>  - it starts up in the configuration with
> > the quorum(disk, NBD), and that's the way it stays throughout the 
> > fault-tolerant
> > setup; the primary doesn't start running until the secondary is connected.
> > 
> > Similarly the secondary startups in the configuration and stays that way;
> > the interesting question to me is what happens after a failure.
> > 
> > If the secondary fails, then your primary is still quorum(disk, NBD) but
> > the NBD side is dead - so I don't think you need to do anything there
> > immediately.
> > 
> > If the primary fails, and the secondary takes over, then a lot of the
> > stuff on the secondary now becomes redundent; does that stay the same
> > and just operate in some form of passthrough - or does it need to
> > change configuration?
> > 
> > The hard part to me is how to bring it back into fault-tolerance now;
> > after a primary failure, the secondary now needs to morph into something
> > like a primary, and somehow you need to bring up a new secondary
> > and get that new secondary an image of the primaries current disk.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini



On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
> As presented at the moment, I don't see there's any dynamic reconfiguration
> on the primary side at the moment

So that means the bdrv_start_replication and bdrv_stop_replication
callbacks are more or less redundant, at least on the primary?

In fact, who calls them?  Certainly nothing in this patch set...
:)

Paolo

 - it starts up in the configuration with
> the quorum(disk, NBD), and that's the way it stays throughout the 
> fault-tolerant
> setup; the primary doesn't start running until the secondary is connected.
> 
> Similarly the secondary startups in the configuration and stays that way;
> the interesting question to me is what happens after a failure.
> 
> If the secondary fails, then your primary is still quorum(disk, NBD) but
> the NBD side is dead - so I don't think you need to do anything there
> immediately.
> 
> If the primary fails, and the secondary takes over, then a lot of the
> stuff on the secondary now becomes redundent; does that stay the same
> and just operate in some form of passthrough - or does it need to
> change configuration?
> 
> The hard part to me is how to bring it back into fault-tolerance now;
> after a primary failure, the secondary now needs to morph into something
> like a primary, and somehow you need to bring up a new secondary
> and get that new secondary an image of the primaries current disk.

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Dr. David Alan Gilbert

* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 23/04/2015 13:36, Kevin Wolf wrote:
> > Crap. Then we need to figure out dynamic reconfiguration for filters
> > (CCed Markus and Jeff).
> > 
> > And this is really part of the fundamental operation mode and not just a
> > way to give users a way to change their mind at runtime? Because if it
> > were, we could go forward without that for the start and add dynamic
> > reconfiguration in a second step.
> 
> I honestly don't know.  Wen, David?

As presented at the moment, I don't see there's any dynamic reconfiguration
on the primary side at the moment - it starts up in the configuration with
the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
setup; the primary doesn't start running until the secondary is connected.

Similarly the secondary startups in the configuration and stays that way;
the interesting question to me is what happens after a failure.

If the secondary fails, then your primary is still quorum(disk, NBD) but
the NBD side is dead - so I don't think you need to do anything there
immediately.

If the primary fails, and the secondary takes over, then a lot of the
stuff on the secondary now becomes redundent; does that stay the same
and just operate in some form of passthrough - or does it need to
change configuration?

The hard part to me is how to bring it back into fault-tolerance now;
after a primary failure, the secondary now needs to morph into something
like a primary, and somehow you need to bring up a new secondary
and get that new secondary an image of the primaries current disk.

Dave

> Paolo
> 
> > Anyway, even if we move it to a second step, it looks like we need to
> > design something rather soon now.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini



On 23/04/2015 13:36, Kevin Wolf wrote:
> Crap. Then we need to figure out dynamic reconfiguration for filters
> (CCed Markus and Jeff).
> 
> And this is really part of the fundamental operation mode and not just a
> way to give users a way to change their mind at runtime? Because if it
> were, we could go forward without that for the start and add dynamic
> reconfiguration in a second step.

I honestly don't know.  Wen, David?

Paolo

> Anyway, even if we move it to a second step, it looks like we need to
> design something rather soon now.

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Kevin Wolf

Am 23.04.2015 um 12:44 hat Paolo Bonzini geschrieben:
> On 23/04/2015 12:40, Kevin Wolf wrote:
> > The question that is still open for me is whether it would be a colo.c
> > or an active-mirror.c, i.e. if this would be tied specifically to COLO
> > or if it could be kept generic enough that it could be used for other
> > use cases as well.
> 
> Understood (now).
> 
> >>> What I think is really needed here is essentially an active mirror
> >>> filter.
> >>
> >> Yes, an active synchronous mirror.  It can be either a filter or a
> >> device.  Has anyone ever come up with a design for filters?  Colo
> >> doesn't need much more complexity than a "toy" blkverify filter.
> > 
> > I think what we're doing now for quorum/blkverify/blkdebug is okay.
> > 
> > The tricky and yet unsolved part is how to add/remove filter BDSes at
> > runtime (dynamic reconfiguration), but IIUC that isn't needed here.
> 
> Yes, it is.  The "defer connection to NBD when replication is started"
> is effectively "add the COLO filter" (with the NBD connection as a
> children) when replication is started.
> 
> Similarly "close the NBD device when replication is stopped" is
> effectively "remove the COLO filter" (which brings the NBD connection
> down with it).

Crap. Then we need to figure out dynamic reconfiguration for filters
(CCed Markus and Jeff).

And this is really part of the fundamental operation mode and not just a
way to give users a way to change their mind at runtime? Because if it
were, we could go forward without that for the start and add dynamic
reconfiguration in a second step.

Anyway, even if we move it to a second step, it looks like we need to
design something rather soon now.

Kevin

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang

On 04/23/2015 06:44 PM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 12:40, Kevin Wolf wrote:
>> The question that is still open for me is whether it would be a colo.c
>> or an active-mirror.c, i.e. if this would be tied specifically to COLO
>> or if it could be kept generic enough that it could be used for other
>> use cases as well.
> 
> Understood (now).
> 
 What I think is really needed here is essentially an active mirror
 filter.
>>>
>>> Yes, an active synchronous mirror.  It can be either a filter or a
>>> device.  Has anyone ever come up with a design for filters?  Colo
>>> doesn't need much more complexity than a "toy" blkverify filter.
>>
>> I think what we're doing now for quorum/blkverify/blkdebug is okay.
>>
>> The tricky and yet unsolved part is how to add/remove filter BDSes at
>> runtime (dynamic reconfiguration), but IIUC that isn't needed here.
> 
> Yes, it is.  The "defer connection to NBD when replication is started"
> is effectively "add the COLO filter" (with the NBD connection as a
> children) when replication is started.
> 
> Similarly "close the NBD device when replication is stopped" is
> effectively "remove the COLO filter" (which brings the NBD connection
> down with it).

Hmm, I don't understand it clearly. Do you mean:
1. COLO filter is quorum's child
2. We can add/remove quorum's child at run-time.

If I misunderstand something, please correct me.

Thanks
Wen Congyang

> 
> Paolo
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini



On 23/04/2015 12:40, Kevin Wolf wrote:
> The question that is still open for me is whether it would be a colo.c
> or an active-mirror.c, i.e. if this would be tied specifically to COLO
> or if it could be kept generic enough that it could be used for other
> use cases as well.

Understood (now).

>>> What I think is really needed here is essentially an active mirror
>>> filter.
>>
>> Yes, an active synchronous mirror.  It can be either a filter or a
>> device.  Has anyone ever come up with a design for filters?  Colo
>> doesn't need much more complexity than a "toy" blkverify filter.
> 
> I think what we're doing now for quorum/blkverify/blkdebug is okay.
> 
> The tricky and yet unsolved part is how to add/remove filter BDSes at
> runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Yes, it is.  The "defer connection to NBD when replication is started"
is effectively "add the COLO filter" (with the NBD connection as a
children) when replication is started.

Similarly "close the NBD device when replication is stopped" is
effectively "remove the COLO filter" (which brings the NBD connection
down with it).

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Kevin Wolf

Am 23.04.2015 um 12:33 hat Paolo Bonzini geschrieben:
> On 23/04/2015 12:17, Kevin Wolf wrote:
> > > Perhaps quorum is not a great match after all, and it's better to add a
> > > new "colo" driver similar to quorum but simpler and only using the read
> > > policy that you need for colo.  The new driver would also know how to
> > > use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> > > be too big.
> >
> > I thought the same, but haven't looked at the details yet. But if I
> > understand correctly, the plan is to take quorum and add options to turn
> > off the functionality of using a quorum - that's a bit odd.
> 
> Yes, indeed.  Quorum was okay for experimenting, now it's better to "cp
> quorum.c colo.c" and clean up the code instead of adding options to
> quorum.  There's not going to be more duplication between quorum.c and
> colo.c than, say, between colo.c and blkverify.c.

The question that is still open for me is whether it would be a colo.c
or an active-mirror.c, i.e. if this would be tied specifically to COLO
or if it could be kept generic enough that it could be used for other
use cases as well.

> > What I think is really needed here is essentially an active mirror
> > filter.
> 
> Yes, an active synchronous mirror.  It can be either a filter or a
> device.  Has anyone ever come up with a design for filters?  Colo
> doesn't need much more complexity than a "toy" blkverify filter.

I think what we're doing now for quorum/blkverify/blkdebug is okay.

The tricky and yet unsolved part is how to add/remove filter BDSes at
runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Kevin

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini

On 23/04/2015 12:17, Kevin Wolf wrote:
> > Perhaps quorum is not a great match after all, and it's better to add a
> > new "colo" driver similar to quorum but simpler and only using the read
> > policy that you need for colo.  The new driver would also know how to
> > use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> > be too big.
>
> I thought the same, but haven't looked at the details yet. But if I
> understand correctly, the plan is to take quorum and add options to turn
> off the functionality of using a quorum - that's a bit odd.

Yes, indeed.  Quorum was okay for experimenting, now it's better to "cp
quorum.c colo.c" and clean up the code instead of adding options to
quorum.  There's not going to be more duplication between quorum.c and
colo.c than, say, between colo.c and blkverify.c.

> What I think is really needed here is essentially an active mirror
> filter.

Yes, an active synchronous mirror.  It can be either a filter or a
device.  Has anyone ever come up with a design for filters?  Colo
doesn't need much more complexity than a "toy" blkverify filter.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Kevin Wolf

Am 23.04.2015 um 12:05 hat Paolo Bonzini geschrieben:
> 
> 
> On 23/04/2015 11:14, Wen Congyang wrote:
> > The bs->file->driver should support backing file, and use backing reference
> > already.
> > 
> > What about the primary side? We should control when to connect to NBD 
> > server,
> > not in nbd_open().

Why do you need to create the block device before the connection should
be made?

> My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
> bdrv_open and a separate bdrv_connect callback.  Open would fail if
> BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.
> 
> You would then need a way to have quorum pass BDRV_O_NO_CONNECT.

Please don't add new flags. If we have to, we can introduce a new option
(in the QDict), but first let's check if it's really necessary.

> Perhaps quorum is not a great match after all, and it's better to add a
> new "colo" driver similar to quorum but simpler and only using the read
> policy that you need for colo.  The new driver would also know how to
> use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
> be too big.

I thought the same, but haven't looked at the details yet. But if I
understand correctly, the plan is to take quorum and add options to turn
off the functionality of using a quorum - that's a bit odd.

What I think is really needed here is essentially an active mirror
filter.

Kevin

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang

On 04/23/2015 05:55 PM, Stefan Hajnoczi wrote:
> On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
>> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
>>> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
 On 21/04/2015 03:25, Wen Congyang wrote:
>>> Please do not introduce "+colo" block drivers.  This approach is
>>> invasive and makes block replication specific to only a few block
>>> drivers, e.g. NBD or qcow2.
> NBD is used to connect to secondary qemu, so it must be used. But the 
> primary
> qemu uses quorum, so the primary disk can be any format.
> The secondary disk is nbd target, and it can also be any format. The cache
> disk(active disk/hidden disk) is an empty disk, and it is created before 
> run
> COLO. The cache disk format is qcow2 now. In theory, it can be ant format 
> which
> supports backing file. But the driver should be updated to support colo 
> mode.
>
>> A cleaner approach is a QMP command or -drive options that work for any
>> BlockDriverState.
>
> OK, I will add a new drive option to avoid use "+colo".

 Actually I liked the "foo+colo" names.

 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.

 Stefan, what was your worry with the +colo block drivers?
>>>
>>> Why does NBD need to know about COLO?  It should be possible to use
>>> iSCSI or other protocols too.
>>
>> Hmm, if you want to use iSCSI or other protocols, you should update the 
>> driver
>> to implement block replication's control interface.
>>
>> Currently, we only support nbd now.
> 
> I took a quick look at the NBD patches in this series, it looks like
> they are a hacky way to make quorum dynamically reconfigurable.
> 
> In other words, what you really need is a way to enable/disable a quorum
> child or even add/remove children at run-time.
> 
> NBD is not the right place to implement that.  Add APIs to quorum so
> COLO code can use them.
> 
> Or maybe I'm misinterpreting the patches, I only took a quick look...

Hmm, if we can enable/disable or add/remove a child at run-time, it is another
choice.

Thanks
Wen Congyang

> 
> Stefan
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini

On 23/04/2015 11:14, Wen Congyang wrote:
> The bs->file->driver should support backing file, and use backing reference
> already.
> 
> What about the primary side? We should control when to connect to NBD server,
> not in nbd_open().

My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
bdrv_open and a separate bdrv_connect callback.  Open would fail if
BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.

You would then need a way to have quorum pass BDRV_O_NO_CONNECT.

Perhaps quorum is not a great match after all, and it's better to add a
new "colo" driver similar to quorum but simpler and only using the read
policy that you need for colo.  The new driver would also know how to
use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
be too big.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang

On 04/23/2015 05:00 PM, Kevin Wolf wrote:
> Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
>> On 22/04/2015 11:31, Kevin Wolf wrote:
 Actually I liked the "foo+colo" names.

 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.

 Stefan, what was your worry with the +colo block drivers?
>>>
>>> I haven't read the patches yet, so I may be misunderstanding, but
>>> wouldn't a separate filter driver be more appropriate than modifying
>>> qcow2 with logic that has nothing to do with the image format?
>>
>> Possibly; on the other hand, why multiply the size of the test matrix
>> with options that no one will use and that will bitrot?
> 
> Because it may be the right design.
> 
> If you're really worried about the test matrix, put a check in the
> filter block driver that its bs->file is qcow2. Of course, such an
> artificial restriction looks a bit ugly, but using a bad design just
> in order to get the same restriction is even worse.

The bs->file->driver should support backing file, and use backing reference
already.

What about the primary side? We should control when to connect to NBD server,
not in nbd_open().

Thanks
Wen Congyang

> 
> Stefan originally wanted to put image streaming in the QED driver. I
> think we'll agree today that it was right to reject that. It's simply
> not functionality related to the format. Adding replication logic to
> qcow2 looks similar to me in that respect.
> 
> Kevin
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Stefan Hajnoczi

On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> > On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> >> On 21/04/2015 03:25, Wen Congyang wrote:
> > Please do not introduce "+colo" block drivers.  This approach is
> > invasive and makes block replication specific to only a few block
> > drivers, e.g. NBD or qcow2.
> >>> NBD is used to connect to secondary qemu, so it must be used. But the 
> >>> primary
> >>> qemu uses quorum, so the primary disk can be any format.
> >>> The secondary disk is nbd target, and it can also be any format. The cache
> >>> disk(active disk/hidden disk) is an empty disk, and it is created before 
> >>> run
> >>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format 
> >>> which
> >>> supports backing file. But the driver should be updated to support colo 
> >>> mode.
> >>>
>  A cleaner approach is a QMP command or -drive options that work for any
>  BlockDriverState.
> >>>
> >>> OK, I will add a new drive option to avoid use "+colo".
> >>
> >> Actually I liked the "foo+colo" names.
> >>
> >> These are just internal details of the implementations and the
> >> primary/secondary disks actually can be any format.
> >>
> >> Stefan, what was your worry with the +colo block drivers?
> > 
> > Why does NBD need to know about COLO?  It should be possible to use
> > iSCSI or other protocols too.
> 
> Hmm, if you want to use iSCSI or other protocols, you should update the driver
> to implement block replication's control interface.
> 
> Currently, we only support nbd now.

I took a quick look at the NBD patches in this series, it looks like
they are a hacky way to make quorum dynamically reconfigurable.

In other words, what you really need is a way to enable/disable a quorum
child or even add/remove children at run-time.

NBD is not the right place to implement that.  Add APIs to quorum so
COLO code can use them.

Or maybe I'm misinterpreting the patches, I only took a quick look...

Stefan


pgplmUuucC3yz.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang

On 04/23/2015 05:26 PM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 11:00, Kevin Wolf wrote:
>> Because it may be the right design.
>>
>> If you're really worried about the test matrix, put a check in the
>> filter block driver that its bs->file is qcow2. Of course, such an
>> artificial restriction looks a bit ugly, but using a bad design just
>> in order to get the same restriction is even worse.
>>
>> Stefan originally wanted to put image streaming in the QED driver. I
>> think we'll agree today that it was right to reject that. It's simply
>> not functionality related to the format. Adding replication logic to
>> qcow2 looks similar to me in that respect.
> 
> Yes, I can't deny it is similar.  Still, there is a very important
> difference: limiting colo's internal workings to qcow2 or NBD doesn't
> limit what the user can do (while streaming limited the user to image
> files in QED format).
> 
> It may also depend on how the patches look like and how much the colo
> code relies on other internal state.
> 
> For NBD the answer is almost nothing, and you don't even need a filter
> driver.  You only need to separate sharply the "configure" and "open"
> phases.  So it may indeed be possible to generalize the handling of the
> secondary to non-NBD.
> 
> It may be the same for the primary; I admit I haven't even tried to read
> the qcow2 patch, as I couldn't do a meaningful review.

For qcow2, we need to read/write from NBD target directly after failover,
because the cache image(the format is qcow2) may be put in ramfs to get
better performance. The other thing is not changed.

For qcow2, if we use a filter driver, the bs->file->drv should support
backing file, and make_empty. So it can be the other format.

Thanks
Wen Congyang

> 
> Paolo
> .
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Kevin Wolf

Am 23.04.2015 um 11:26 hat Paolo Bonzini geschrieben:
> 
> 
> On 23/04/2015 11:00, Kevin Wolf wrote:
> > Because it may be the right design.
> > 
> > If you're really worried about the test matrix, put a check in the
> > filter block driver that its bs->file is qcow2. Of course, such an
> > artificial restriction looks a bit ugly, but using a bad design just
> > in order to get the same restriction is even worse.
> > 
> > Stefan originally wanted to put image streaming in the QED driver. I
> > think we'll agree today that it was right to reject that. It's simply
> > not functionality related to the format. Adding replication logic to
> > qcow2 looks similar to me in that respect.
> 
> Yes, I can't deny it is similar.  Still, there is a very important
> difference: limiting colo's internal workings to qcow2 or NBD doesn't
> limit what the user can do (while streaming limited the user to image
> files in QED format).
> 
> It may also depend on how the patches look like and how much the colo
> code relies on other internal state.
> 
> For NBD the answer is almost nothing, and you don't even need a filter
> driver.  You only need to separate sharply the "configure" and "open"
> phases.  So it may indeed be possible to generalize the handling of the
> secondary to non-NBD.
> 
> It may be the same for the primary; I admit I haven't even tried to read
> the qcow2 patch, as I couldn't do a meaningful review.

The qcow2 patch only modifies two existing lines. The rest it adds is a
the qcow2+colo BlockDriver, which references some qcow2 functions
directly and has a wrapper for others. On a quick scan, it didn't seem
like it accesses any internal qcow2 variables or calls any private
functions.

In other words, it's the perfect example for a filter.

Kevin

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini

On 23/04/2015 11:00, Kevin Wolf wrote:
> Because it may be the right design.
> 
> If you're really worried about the test matrix, put a check in the
> filter block driver that its bs->file is qcow2. Of course, such an
> artificial restriction looks a bit ugly, but using a bad design just
> in order to get the same restriction is even worse.
> 
> Stefan originally wanted to put image streaming in the QED driver. I
> think we'll agree today that it was right to reject that. It's simply
> not functionality related to the format. Adding replication logic to
> qcow2 looks similar to me in that respect.

Yes, I can't deny it is similar.  Still, there is a very important
difference: limiting colo's internal workings to qcow2 or NBD doesn't
limit what the user can do (while streaming limited the user to image
files in QED format).

It may also depend on how the patches look like and how much the colo
code relies on other internal state.

For NBD the answer is almost nothing, and you don't even need a filter
driver.  You only need to separate sharply the "configure" and "open"
phases.  So it may indeed be possible to generalize the handling of the
secondary to non-NBD.

It may be the same for the primary; I admit I haven't even tried to read
the qcow2 patch, as I couldn't do a meaningful review.

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Kevin Wolf

Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
> On 22/04/2015 11:31, Kevin Wolf wrote:
> >> Actually I liked the "foo+colo" names.
> >>
> >> These are just internal details of the implementations and the
> >> primary/secondary disks actually can be any format.
> >>
> >> Stefan, what was your worry with the +colo block drivers?
> > 
> > I haven't read the patches yet, so I may be misunderstanding, but
> > wouldn't a separate filter driver be more appropriate than modifying
> > qcow2 with logic that has nothing to do with the image format?
> 
> Possibly; on the other hand, why multiply the size of the test matrix
> with options that no one will use and that will bitrot?

Because it may be the right design.

If you're really worried about the test matrix, put a check in the
filter block driver that its bs->file is qcow2. Of course, such an
artificial restriction looks a bit ugly, but using a bad design just
in order to get the same restriction is even worse.

Stefan originally wanted to put image streaming in the QED driver. I
think we'll agree today that it was right to reject that. It's simply
not functionality related to the format. Adding replication logic to
qcow2 looks similar to me in that respect.

Kevin

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Dr. David Alan Gilbert

* Wen Congyang (we...@cn.fujitsu.com) wrote:
> Signed-off-by: Wen Congyang 
> Signed-off-by: Paolo Bonzini 
> Signed-off-by: Yang Hongyang 
> Signed-off-by: zhanghailiang 
> Signed-off-by: Gonglei 
> ---
>  docs/block-replication.txt | 153 
> +
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> ++--+++
> +|Primary Write Requests||Secondary Write Requests|
> ++--+++
> +  |   |
> +  |  (4)
> +  |   V
> +  |  /-\
> +  |  Copy and Forward| |
> +  |-(1)--+   | Disk Buffer |
> +  |  |   | |
> +  | (3)  \-/
> +  | speculative  ^
> +  |write through(2)
> +  |  |   |
> +  V  V   |
> +   +--+   ++
> +   | Primary Disk |   | Secondary Disk |
> +   +--+   ++
> +
> +1) Primary write requests will be copied and forwarded to Secondary
> +   QEMU.
> +2) Before Primary write requests are written to Secondary disk, the
> +   original sector content will be read from Secondary disk and
> +   buffered in the Disk buffer, but it will not overwrite the existing
> +   sector content(it could be from either "Secondary Write Requests" or
> +   previous COW of "Primary Write Requests") in the Disk buffer.
> +3) Primary write requests will be written to Secondary disk.
> +4) Secondary write requests will be buffered in the Disk buffer and it
> +   will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> + virtio-blk   ||
> + ^||.--
> + |||| Secondary
> +1 Quorum  ||'--
> + /  \ ||
> +/\||
> +   Primary  2 NBD  --->  2 NBD
> + disk   client|| server  
>virtio-blk
> +  ||^
> ^
> +. |||
> |
> +Primary | ||  Secondary disk <- hidden-disk 4 
> <- active-disk 3
> +' |||  backing^   backing
> +  ||| |
> +  ||| |
> +  ||'-'
> +  ||   drive-backup sync=no

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Paolo Bonzini



On 22/04/2015 11:31, Kevin Wolf wrote:
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
> 
> I haven't read the patches yet, so I may be misunderstanding, but
> wouldn't a separate filter driver be more appropriate than modifying
> qcow2 with logic that has nothing to do with the image format?

Possibly; on the other hand, why multiply the size of the test matrix
with options that no one will use and that will bitrot?

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Wen Congyang

On 04/22/2015 05:29 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
>> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
>>> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>>> One general question about the design: the Secondary host needs 3x
>>> storage space since it has the Secondary Disk, hidden-disk, and
>>> active-disk.  Each image requires a certain amount of space depending on
>>> writes or COW operations.  Is 3x the upper bound or is there a way to
>>> reduce the bound?
>>
>> active disk and hidden disk are temp file. It will be maked empty in
>> bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
>> many spaces if we do checkpoint periodically.
> 
> A question related to checkpoints: both Primary and Secondary are active
> (running) in COLO.  The Secondary will be slower since it performs extra
> work; disk I/O on the Secondary has a COW overhead.
> 
> Does this force the Primary to wait for checkpoint commit so that the
> Secondary can catch up?
> 
> I'm a little confused about that since the point of COLO is to avoid the
> overheads of microcheckpointing, but there still seems to be a
> checkpointing bottleneck for disk I/O-intensive applications.
> 
>>>
>>> The bound is important since large amounts of data become a bottleneck
>>> for writeout/commit operations.  They could cause downtime if the guest
>>> is blocked until the entire Disk Buffer has been written to the
>>> Secondary Disk during failover, for example.
>>
>> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
>> I run filebench in the guest. Is there anyway to speed it up?
> 
> Is it necessary to commit the active disk and hidden disk to the
> Secondary Disk on failover?  Maybe the VM could continue executing
> immediately and run a block-commit job.  The active disk and hidden disk
> files can be dropped once block-commit finishes.
> 

We need to stop the vm before doing checkpoint. So if vm_stop() takes
too much time, it will affect the performance.

On failover, we can commit the data while the vm is running. But the active
disk and hidden disk may be put in ramfs, The guest writes faster than
block-commit...

Thanks
Wen Congyang

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Kevin Wolf

Am 21.04.2015 um 17:28 hat Paolo Bonzini geschrieben:
> 
> 
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> > Please do not introduce "+colo" block drivers.  This approach is
> >> > invasive and makes block replication specific to only a few block
> >> > drivers, e.g. NBD or qcow2.
> > NBD is used to connect to secondary qemu, so it must be used. But the 
> > primary
> > qemu uses quorum, so the primary disk can be any format.
> > The secondary disk is nbd target, and it can also be any format. The cache
> > disk(active disk/hidden disk) is an empty disk, and it is created before run
> > COLO. The cache disk format is qcow2 now. In theory, it can be ant format 
> > which
> > supports backing file. But the driver should be updated to support colo 
> > mode.
> > 
> > > A cleaner approach is a QMP command or -drive options that work for any
> > > BlockDriverState.
> > 
> > OK, I will add a new drive option to avoid use "+colo".
> 
> Actually I liked the "foo+colo" names.
> 
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
> 
> Stefan, what was your worry with the +colo block drivers?

I haven't read the patches yet, so I may be misunderstanding, but
wouldn't a separate filter driver be more appropriate than modifying
qcow2 with logic that has nothing to do with the image format?

Kevin

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Stefan Hajnoczi

On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> > On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> > One general question about the design: the Secondary host needs 3x
> > storage space since it has the Secondary Disk, hidden-disk, and
> > active-disk.  Each image requires a certain amount of space depending on
> > writes or COW operations.  Is 3x the upper bound or is there a way to
> > reduce the bound?
> 
> active disk and hidden disk are temp file. It will be maked empty in
> bdrv_do_checkpoint(). Their format is qcow2 now, so it doesn't need too
> many spaces if we do checkpoint periodically.

A question related to checkpoints: both Primary and Secondary are active
(running) in COLO.  The Secondary will be slower since it performs extra
work; disk I/O on the Secondary has a COW overhead.

Does this force the Primary to wait for checkpoint commit so that the
Secondary can catch up?

I'm a little confused about that since the point of COLO is to avoid the
overheads of microcheckpointing, but there still seems to be a
checkpointing bottleneck for disk I/O-intensive applications.

> > 
> > The bound is important since large amounts of data become a bottleneck
> > for writeout/commit operations.  They could cause downtime if the guest
> > is blocked until the entire Disk Buffer has been written to the
> > Secondary Disk during failover, for example.
> 
> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
> I run filebench in the guest. Is there anyway to speed it up?

Is it necessary to commit the active disk and hidden disk to the
Secondary Disk on failover?  Maybe the VM could continue executing
immediately and run a block-commit job.  The active disk and hidden disk
files can be dropped once block-commit finishes.

pgpSmtN_bltYK.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Wen Congyang

On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
>> On 21/04/2015 03:25, Wen Congyang wrote:
> Please do not introduce "+colo" block drivers.  This approach is
> invasive and makes block replication specific to only a few block
> drivers, e.g. NBD or qcow2.
>>> NBD is used to connect to secondary qemu, so it must be used. But the 
>>> primary
>>> qemu uses quorum, so the primary disk can be any format.
>>> The secondary disk is nbd target, and it can also be any format. The cache
>>> disk(active disk/hidden disk) is an empty disk, and it is created before run
>>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format 
>>> which
>>> supports backing file. But the driver should be updated to support colo 
>>> mode.
>>>
 A cleaner approach is a QMP command or -drive options that work for any
 BlockDriverState.
>>>
>>> OK, I will add a new drive option to avoid use "+colo".
>>
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
> 
> Why does NBD need to know about COLO?  It should be possible to use
> iSCSI or other protocols too.

Hmm, if you want to use iSCSI or other protocols, you should update the driver
to implement block replication's control interface.

Currently, we only support nbd now.

Thanks
Wen Congyang

> 
> Stefan
>

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Stefan Hajnoczi

On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> > Please do not introduce "+colo" block drivers.  This approach is
> >> > invasive and makes block replication specific to only a few block
> >> > drivers, e.g. NBD or qcow2.
> > NBD is used to connect to secondary qemu, so it must be used. But the 
> > primary
> > qemu uses quorum, so the primary disk can be any format.
> > The secondary disk is nbd target, and it can also be any format. The cache
> > disk(active disk/hidden disk) is an empty disk, and it is created before run
> > COLO. The cache disk format is qcow2 now. In theory, it can be ant format 
> > which
> > supports backing file. But the driver should be updated to support colo 
> > mode.
> > 
> > > A cleaner approach is a QMP command or -drive options that work for any
> > > BlockDriverState.
> > 
> > OK, I will add a new drive option to avoid use "+colo".
> 
> Actually I liked the "foo+colo" names.
> 
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
> 
> Stefan, what was your worry with the +colo block drivers?

Why does NBD need to know about COLO?  It should be possible to use
iSCSI or other protocols too.

Stefan


pgpGr2J18TCwu.pgp
Description: PGP signature

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-21 Thread Paolo Bonzini



On 21/04/2015 03:25, Wen Congyang wrote:
>> > Please do not introduce "+colo" block drivers.  This approach is
>> > invasive and makes block replication specific to only a few block
>> > drivers, e.g. NBD or qcow2.
> NBD is used to connect to secondary qemu, so it must be used. But the primary
> qemu uses quorum, so the primary disk can be any format.
> The secondary disk is nbd target, and it can also be any format. The cache
> disk(active disk/hidden disk) is an empty disk, and it is created before run
> COLO. The cache disk format is qcow2 now. In theory, it can be ant format 
> which
> supports backing file. But the driver should be updated to support colo mode.
> 
> > A cleaner approach is a QMP command or -drive options that work for any
> > BlockDriverState.
> 
> OK, I will add a new drive option to avoid use "+colo".

Actually I liked the "foo+colo" names.

These are just internal details of the implementations and the
primary/secondary disks actually can be any format.

Stefan, what was your worry with the +colo block drivers?

Paolo

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-20 Thread Wen Congyang

On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang 
>> Signed-off-by: Paolo Bonzini 
>> Signed-off-by: Yang Hongyang 
>> Signed-off-by: zhanghailiang 
>> Signed-off-by: Gonglei 
>> ---
>>  docs/block-replication.txt | 153 
>> +
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 000..4426ffc
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,153 @@
>> +Block replication
>> +
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or later.
>> +See the COPYING file in the top-level directory.
>> +
>> +Block replication is used for continuous checkpoints. It is designed
>> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
>> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
>> +where the Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
>> +identical right after a VM checkpoint, but becomes different as the VM
>> +executes till the next checkpoint. To support disk contents checkpoint,
>> +the modified disk contents in the Secondary VM must be buffered, and are
>> +only dropped at next checkpoint time. To reduce the network transportation
>> +effort at the time of checkpoint, the disk modification operations of
>> +Primary disk are asynchronously forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> ++--+++
>> +|Primary Write Requests||Secondary Write Requests|
>> ++--+++
>> +  |   |
>> +  |  (4)
>> +  |   V
>> +  |  /-\
>> +  |  Copy and Forward| |
>> +  |-(1)--+   | Disk Buffer |
>> +  |  |   | |
>> +  | (3)  \-/
>> +  | speculative  ^
>> +  |write through(2)
>> +  |  |   |
>> +  V  V   |
>> +   +--+   ++
>> +   | Primary Disk |   | Secondary Disk |
>> +   +--+   ++
>> +
>> +1) Primary write requests will be copied and forwarded to Secondary
>> +   QEMU.
>> +2) Before Primary write requests are written to Secondary disk, the
>> +   original sector content will be read from Secondary disk and
>> +   buffered in the Disk buffer, but it will not overwrite the existing
>> +   sector content(it could be from either "Secondary Write Requests" or
>> +   previous COW of "Primary Write Requests") in the Disk buffer.
>> +3) Primary write requests will be written to Secondary disk.
>> +4) Secondary write requests will be buffered in the Disk buffer and it
>> +   will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> + virtio-blk   ||
>> + ^||.--
>> + |||| Secondary
>> +1 Quorum  ||'--
>> + /  \ ||
>> +/\||
>> +   Primary  2 NBD  --->  2 NBD
>> + disk   client|| server 
>> virtio-blk
>> +  ||^   
>>  ^
>> +. |||   
>>  |
>> +Primary | ||  Secondary disk <- hidden-disk 4 
>> <- active-disk 3
>> +' |||  backing^   
>> backing
>> +  ||| |
>> +

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-20 Thread Stefan Hajnoczi

On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> Signed-off-by: Wen Congyang 
> Signed-off-by: Paolo Bonzini 
> Signed-off-by: Yang Hongyang 
> Signed-off-by: zhanghailiang 
> Signed-off-by: Gonglei 
> ---
>  docs/block-replication.txt | 153 
> +
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
> 
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COurse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> ++--+++
> +|Primary Write Requests||Secondary Write Requests|
> ++--+++
> +  |   |
> +  |  (4)
> +  |   V
> +  |  /-\
> +  |  Copy and Forward| |
> +  |-(1)--+   | Disk Buffer |
> +  |  |   | |
> +  | (3)  \-/
> +  | speculative  ^
> +  |write through(2)
> +  |  |   |
> +  V  V   |
> +   +--+   ++
> +   | Primary Disk |   | Secondary Disk |
> +   +--+   ++
> +
> +1) Primary write requests will be copied and forwarded to Secondary
> +   QEMU.
> +2) Before Primary write requests are written to Secondary disk, the
> +   original sector content will be read from Secondary disk and
> +   buffered in the Disk buffer, but it will not overwrite the existing
> +   sector content(it could be from either "Secondary Write Requests" or
> +   previous COW of "Primary Write Requests") in the Disk buffer.
> +3) Primary write requests will be written to Secondary disk.
> +4) Secondary write requests will be buffered in the Disk buffer and it
> +   will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> + virtio-blk   ||
> + ^||.--
> + |||| Secondary
> +1 Quorum  ||'--
> + /  \ ||
> +/\||
> +   Primary  2 NBD  --->  2 NBD
> + disk   client|| server  
>virtio-blk
> +  ||^
> ^
> +. |||
> |
> +Primary | ||  Secondary disk <- hidden-disk 4 
> <- active-disk 3
> +' |||  backing^   backing
> +  ||| |
> +  ||| |
> +  ||'-'
> +  ||   dri

50 matches

Mail list logo