Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-08 Thread Stefan Hajnoczi
On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
 * Stefan Hajnoczi (stefa...@redhat.com) wrote:
  On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
   
   
   On 24/04/2015 11:38, Wen Congyang wrote:
 
 That can be done with drive-mirror.  But I think it's too early for 
 that.
Do you mean use drive-mirror instead of quorum?
   
   Only before starting up a new secondary.  Basically you do a migration
   with non-shared storage, and then start the secondary in colo mode.
   
   But it's only for the failover case.  Quorum (or a new block/colo.c
   driver or filter) is fine for normal colo operation.
  
  Perhaps this patch series should mirror the Secondary's disk to a Backup
  Secondary so that the system can be protected very quickly after
  failover.
  
  I think anyone serious about fault tolerance would deploy a Backup
  Secondary, otherwise the system cannot survive two failures unless a
  human administrator is lucky/fast enough to set up a new Secondary.
 
 I'd assumed that a higher level management layer would do the allocation
 of a new secondary after the first failover, so no human need be involved.

That doesn't help: after the first failover it is already too late, even
if it's done by a program.  There should be no window during which the VM
is unprotected.

People who want fault tolerance care about 9s of availability.  The VM
must be protected on the new Primary as soon as the failover occurs,
otherwise this isn't a serious fault tolerance solution.

Stefan




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-08 Thread Dr. David Alan Gilbert
* Stefan Hajnoczi (stefa...@redhat.com) wrote:
 On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote:
  * Stefan Hajnoczi (stefa...@redhat.com) wrote:
   On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:


On 24/04/2015 11:38, Wen Congyang wrote:
  
  That can be done with drive-mirror.  But I think it's too early 
  for that.
 Do you mean use drive-mirror instead of quorum?

Only before starting up a new secondary.  Basically you do a migration
with non-shared storage, and then start the secondary in colo mode.

But it's only for the failover case.  Quorum (or a new block/colo.c
driver or filter) is fine for normal colo operation.
   
   Perhaps this patch series should mirror the Secondary's disk to a Backup
   Secondary so that the system can be protected very quickly after
   failover.
   
   I think anyone serious about fault tolerance would deploy a Backup
   Secondary, otherwise the system cannot survive two failures unless a
   human administrator is lucky/fast enough to set up a new Secondary.
  
  I'd assumed that a higher level management layer would do the allocation
  of a new secondary after the first failover, so no human need be involved.
 
 That doesn't help: after the first failover it is already too late, even
 if it's done by a program.  There should be no window during which the VM
 is unprotected.

 People who want fault tolerance care about 9s of availability.  The VM
 must be protected on the new Primary as soon as the failover occurs,
 otherwise this isn't a serious fault tolerance solution.

I'm not aware of any other system that manages that, so I don't
think that's fair.

You gain a lot more availability going from a single system to the 1+1
system that COLO (or any of the checkpointing systems) proposes; I can't
say how many 9s it gets you.  It's true that having multiple secondaries
would get you a bit more on top of that, but you're still a lot better
off just having the one secondary.

I had thought that having more than one secondary would be a nice
addition, but it's a big change everywhere else (e.g. having to maintain
multiple migration streams and deal with miscompares from multiple hosts).

Dave


 
 Stefan


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-05 Thread Fam Zheng
On Wed, 05/06 02:26, Dong, Eddie wrote:
 
 
  -Original Message-
  From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
  Sent: Tuesday, May 05, 2015 11:24 PM
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
  block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
  Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com
  Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
  
  * Stefan Hajnoczi (stefa...@redhat.com) wrote:
   On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
   
   
On 24/04/2015 11:38, Wen Congyang wrote:
 
  That can be done with drive-mirror.  But I think it's too early 
  for that.
 Do you mean use drive-mirror instead of quorum?
   
Only before starting up a new secondary.  Basically you do a
migration with non-shared storage, and then start the secondary in colo
  mode.
   
But it's only for the failover case.  Quorum (or a new block/colo.c
driver or filter) is fine for normal colo operation.
  
   Perhaps this patch series should mirror the Secondary's disk to a
   Backup Secondary so that the system can be protected very quickly
   after failover.
  
   I think anyone serious about fault tolerance would deploy a Backup
   Secondary, otherwise the system cannot survive two failures unless a
   human administrator is lucky/fast enough to set up a new Secondary.
  
  I'd assumed that a higher level management layer would do the allocation
  of a new secondary after the first failover, so no human need be involved.
  
 
 I agree. A cloud OS such as OpenStack will have the capability to handle
 this case, together with a certain API on the VMM side (libvirt?).

The question here is that the QMP API for switching from secondary mode to
primary mode is not mentioned in this series.  I think that interface
matters for this series.

Fam



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-05 Thread Dong, Eddie


 -Original Message-
 From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
 Sent: Tuesday, May 05, 2015 11:24 PM
 To: Stefan Hajnoczi
 Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu
 block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang
 Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com
 Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description
 
 * Stefan Hajnoczi (stefa...@redhat.com) wrote:
  On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
  
  
   On 24/04/2015 11:38, Wen Congyang wrote:

 That can be done with drive-mirror.  But I think it's too early for 
 that.
Do you mean use drive-mirror instead of quorum?
  
   Only before starting up a new secondary.  Basically you do a
   migration with non-shared storage, and then start the secondary in colo
 mode.
  
   But it's only for the failover case.  Quorum (or a new block/colo.c
   driver or filter) is fine for normal colo operation.
 
  Perhaps this patch series should mirror the Secondary's disk to a
  Backup Secondary so that the system can be protected very quickly
  after failover.
 
  I think anyone serious about fault tolerance would deploy a Backup
  Secondary, otherwise the system cannot survive two failures unless a
  human administrator is lucky/fast enough to set up a new Secondary.
 
 I'd assumed that a higher level management layer would do the allocation of a
 new secondary after the first failover, so no human need be involved.
 

I agree. A cloud OS such as OpenStack will have the capability to handle
this case, together with a certain API on the VMM side (libvirt?).

Thx Eddie



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-05-05 Thread Dr. David Alan Gilbert
* Stefan Hajnoczi (stefa...@redhat.com) wrote:
 On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote:
  
  
  On 24/04/2015 11:38, Wen Congyang wrote:

That can be done with drive-mirror.  But I think it's too early for 
that.
   Do you mean use drive-mirror instead of quorum?
  
  Only before starting up a new secondary.  Basically you do a migration
  with non-shared storage, and then start the secondary in colo mode.
  
  But it's only for the failover case.  Quorum (or a new block/colo.c
  driver or filter) is fine for normal colo operation.
 
 Perhaps this patch series should mirror the Secondary's disk to a Backup
 Secondary so that the system can be protected very quickly after
 failover.
 
 I think anyone serious about fault tolerance would deploy a Backup
 Secondary, otherwise the system cannot survive two failures unless a
 human administrator is lucky/fast enough to set up a new Secondary.

I'd assumed that a higher level management layer would do the allocation
of a new secondary after the first failover, so no human need be involved.

Dave

 Stefan


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-30 Thread Stefan Hajnoczi
On Wed, Apr 29, 2015 at 04:37:49PM +0800, Gonglei wrote:
 On 2015/4/29 16:29, Paolo Bonzini wrote:
  
  
  On 27/04/2015 11:37, Stefan Hajnoczi wrote:
  But it's only for the failover case.  Quorum (or a new 
  block/colo.c driver or filter) is fine for normal colo 
  operation.
  Perhaps this patch series should mirror the Secondary's disk to a 
  Backup Secondary so that the system can be protected very quickly 
  after failover.
 
  I think anyone serious about fault tolerance would deploy a Backup
   Secondary, otherwise the system cannot survive two failures
  unless a human administrator is lucky/fast enough to set up a new 
  Secondary.
  
  Let's do one thing at a time.  Otherwise nothing of this is going to
  be ever completed...
  
 Yes, and the continuous backup feature is on our TODO list. We hope
 this series (including basic functions and  COLO framework) can be
 upstream first.

That's fine, I just wanted to make sure you have the issue in mind.

Stefan




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-29 Thread Gonglei
On 2015/4/29 16:29, Paolo Bonzini wrote:
 
 
 On 27/04/2015 11:37, Stefan Hajnoczi wrote:
 But it's only for the failover case.  Quorum (or a new 
 block/colo.c driver or filter) is fine for normal colo 
 operation.
 Perhaps this patch series should mirror the Secondary's disk to a 
 Backup Secondary so that the system can be protected very quickly 
 after failover.

 I think anyone serious about fault tolerance would deploy a Backup
  Secondary, otherwise the system cannot survive two failures
 unless a human administrator is lucky/fast enough to set up a new 
 Secondary.
 
 Let's do one thing at a time.  Otherwise nothing of this is going to
 be ever completed...
 
Yes, and the continuous backup feature is on our TODO list. We hope
this series (including basic functions and  COLO framework) can be
upstream first.

Regards,
-Gonglei




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Paolo Bonzini


On 24/04/2015 10:58, Dr. David Alan Gilbert wrote:
  If we can add a filter dynamically, we can add a filter whose file is NBD
  after the secondary qemu's NBD server is ready. In that case, I think
  there is no need to touch the NBD client.
 Yes, I think maybe the harder part is getting a copy of the current disk
 contents to the new secondary while the new primary is still running.

That can be done with drive-mirror.  But I think it's too early for that.

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Paolo Bonzini


On 24/04/2015 11:53, Wen Congyang wrote:
  Only before starting up a new secondary.  Basically you do a migration
  with non-shared storage, and then start the secondary in colo mode.
  
  But it's only for the failover case.  Quorum (or a new block/colo.c
  driver or filter) is fine for normal colo operation.
 Is nbd+colo needed to connect to the NBD server later?

Elsewhere in the thread I proposed a new flag BDRV_O_NO_CONNECT and a
new BlockDriver function pointer bdrv_connect.

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-24 Thread Dr. David Alan Gilbert
* Wen Congyang (we...@cn.fujitsu.com) wrote:
 On 04/24/2015 03:47 PM, Paolo Bonzini wrote:
  
  
  On 24/04/2015 04:16, Wen Congyang wrote:
  I think the primary shouldn't do any I/O after failover (and the
  secondary should close the NBD server) so it is probably okay to ignore
  the removal for now.  Inserting the filter dynamically is probably
  needed though.
 
  Or maybe just enabling/disabling?
  Hmm, after failover, the secondary qemu should become the primary qemu,
  but we don't know the NBD server's IP/port when we start the secondary
  qemu. So we need to insert the NBD client dynamically after failover.
  
  True, but secondary-primary switch is already not supported in v3.
 
 Yes, we should consider it, and make it easier to support later.
 
 If we can add a filter dynamically, we can add a filter whose file is NBD
 after the secondary qemu's NBD server is ready. In that case, I think
 there is no need to touch the NBD client.

Yes, I think maybe the harder part is getting a copy of the current disk
contents to the new secondary while the new primary is still running.

Dave

 
 Thanks
 Wen Congyang
 
  
  Kevin/Stefan, is there a design document somewhere that covers at least
  static filters?
  
  Paolo
  .
  
 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang
On 04/23/2015 05:26 PM, Paolo Bonzini wrote:
 
 
 On 23/04/2015 11:00, Kevin Wolf wrote:
 Because it may be the right design.

 If you're really worried about the test matrix, put a check in the
 filter block driver that its bs->file is qcow2. Of course, such an
 artificial restriction looks a bit ugly, but using a bad design just
 in order to get the same restriction is even worse.

 Stefan originally wanted to put image streaming in the QED driver. I
 think we'll agree today that it was right to reject that. It's simply
 not functionality related to the format. Adding replication logic to
 qcow2 looks similar to me in that respect.
 
 Yes, I can't deny it is similar.  Still, there is a very important
 difference: limiting colo's internal workings to qcow2 or NBD doesn't
 limit what the user can do (while streaming limited the user to image
 files in QED format).
 
 It may also depend on how the patches look like and how much the colo
 code relies on other internal state.
 
 For NBD the answer is almost nothing, and you don't even need a filter
 driver.  You only need to separate sharply the configure and open
 phases.  So it may indeed be possible to generalize the handling of the
 secondary to non-NBD.
 
 It may be the same for the primary; I admit I haven't even tried to read
 the qcow2 patch, as I couldn't do a meaningful review.

For qcow2, we need to read/write from the NBD target directly after failover,
because the cache image (qcow2 format) may be put in ramfs to get better
performance. The other things are unchanged.

For qcow2, if we use a filter driver, the bs->file driver should support
backing files and make_empty, so it can be another format.

Thanks
Wen Congyang

 
 Paolo
 .
 




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini


On 23/04/2015 11:00, Kevin Wolf wrote:
 Because it may be the right design.
 
 If you're really worried about the test matrix, put a check in the
 filter block driver that its bs->file is qcow2. Of course, such an
 artificial restriction looks a bit ugly, but using a bad design just
 in order to get the same restriction is even worse.
 
 Stefan originally wanted to put image streaming in the QED driver. I
 think we'll agree today that it was right to reject that. It's simply
 not functionality related to the format. Adding replication logic to
 qcow2 looks similar to me in that respect.

Yes, I can't deny it is similar.  Still, there is a very important
difference: limiting colo's internal workings to qcow2 or NBD doesn't
limit what the user can do (while streaming limited the user to image
files in QED format).

It may also depend on how the patches look like and how much the colo
code relies on other internal state.

For NBD the answer is almost nothing, and you don't even need a filter
driver.  You only need to separate sharply the configure and open
phases.  So it may indeed be possible to generalize the handling of the
secondary to non-NBD.

It may be the same for the primary; I admit I haven't even tried to read
the qcow2 patch, as I couldn't do a meaningful review.

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Stefan Hajnoczi
On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
 On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
  On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
  On 21/04/2015 03:25, Wen Congyang wrote:
  Please do not introduce name+colo block drivers.  This approach is
  invasive and makes block replication specific to only a few block
  drivers, e.g. NBD or qcow2.
  NBD is used to connect to the secondary qemu, so it must be used. But the
  primary qemu uses quorum, so the primary disk can be any format.
  The secondary disk is the NBD target, and it can also be any format. The
  cache disk (active disk/hidden disk) is an empty disk created before COLO
  runs. The cache disk format is qcow2 now. In theory, it can be any format
  which supports backing files, but the driver would need to be updated to
  support colo mode.
 
  A cleaner approach is a QMP command or -drive options that work for any
  BlockDriverState.
 
  OK, I will add a new drive option to avoid use name+colo.
 
  Actually I liked the foo+colo names.
 
  These are just internal details of the implementations and the
  primary/secondary disks actually can be any format.
 
  Stefan, what was your worry with the +colo block drivers?
  
  Why does NBD need to know about COLO?  It should be possible to use
  iSCSI or other protocols too.
 
 Hmm, if you want to use iSCSI or other protocols, you should update the driver
 to implement block replication's control interface.
 
 Currently, we only support NBD.

I took a quick look at the NBD patches in this series, it looks like
they are a hacky way to make quorum dynamically reconfigurable.

In other words, what you really need is a way to enable/disable a quorum
child or even add/remove children at run-time.

NBD is not the right place to implement that.  Add APIs to quorum so
COLO code can use them.

Or maybe I'm misinterpreting the patches, I only took a quick look...

Stefan




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini


On 23/04/2015 12:17, Kevin Wolf wrote:
  Perhaps quorum is not a great match after all, and it's better to add a
  new colo driver similar to quorum but simpler and only using the read
  policy that you need for colo.  The new driver would also know how to
  use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
  be too big.

 I thought the same, but haven't looked at the details yet. But if I
 understand correctly, the plan is to take quorum and add options to turn
 off the functionality of using a quorum - that's a bit odd.

Yes, indeed.  Quorum was okay for experimenting, now it's better to cp
quorum.c colo.c and clean up the code instead of adding options to
quorum.  There's not going to be more duplication between quorum.c and
colo.c than, say, between colo.c and blkverify.c.

 What I think is really needed here is essentially an active mirror
 filter.

Yes, an active synchronous mirror.  It can be either a filter or a
device.  Has anyone ever come up with a design for filters?  Colo
doesn't need much more complexity than a toy blkverify filter.

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini


On 23/04/2015 12:40, Kevin Wolf wrote:
 The question that is still open for me is whether it would be a colo.c
 or an active-mirror.c, i.e. if this would be tied specifically to COLO
 or if it could be kept generic enough that it could be used for other
 use cases as well.

Understood (now).

 What I think is really needed here is essentially an active mirror
 filter.

 Yes, an active synchronous mirror.  It can be either a filter or a
 device.  Has anyone ever come up with a design for filters?  Colo
 doesn't need much more complexity than a toy blkverify filter.
 
 I think what we're doing now for quorum/blkverify/blkdebug is okay.
 
 The tricky and yet unsolved part is how to add/remove filter BDSes at
 runtime (dynamic reconfiguration), but IIUC that isn't needed here.

Yes, it is.  "Defer the connection to NBD until replication is started" is
effectively "add the COLO filter (with the NBD connection as a child) when
replication is started".

Similarly, "close the NBD device when replication is stopped" is
effectively "remove the COLO filter (which brings the NBD connection
down with it)".

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang
On 04/23/2015 05:55 PM, Stefan Hajnoczi wrote:
 On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote:
 On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
 On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
 On 21/04/2015 03:25, Wen Congyang wrote:
 Please do not introduce name+colo block drivers.  This approach is
 invasive and makes block replication specific to only a few block
 drivers, e.g. NBD or qcow2.
 NBD is used to connect to the secondary qemu, so it must be used. But the
 primary qemu uses quorum, so the primary disk can be any format.
 The secondary disk is the NBD target, and it can also be any format. The
 cache disk (active disk/hidden disk) is an empty disk created before COLO
 runs. The cache disk format is qcow2 now. In theory, it can be any format
 which supports backing files, but the driver would need to be updated to
 support colo mode.

 A cleaner approach is a QMP command or -drive options that work for any
 BlockDriverState.

 OK, I will add a new drive option to avoid use name+colo.

 Actually I liked the foo+colo names.

 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.

 Stefan, what was your worry with the +colo block drivers?

 Why does NBD need to know about COLO?  It should be possible to use
 iSCSI or other protocols too.

 Hmm, if you want to use iSCSI or other protocols, you should update the
 driver to implement block replication's control interface.

 Currently, we only support NBD.
 
 I took a quick look at the NBD patches in this series, it looks like
 they are a hacky way to make quorum dynamically reconfigurable.
 
 In other words, what you really need is a way to enable/disable a quorum
 child or even add/remove children at run-time.
 
 NBD is not the right place to implement that.  Add APIs to quorum so
 COLO code can use them.
 
 Or maybe I'm misinterpreting the patches, I only took a quick look...

Hmm, if we can enable/disable or add/remove a child at run-time, it is another
choice.

Thanks
Wen Congyang

 
 Stefan
 




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang
On 04/23/2015 05:00 PM, Kevin Wolf wrote:
 Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben:
 On 22/04/2015 11:31, Kevin Wolf wrote:
 Actually I liked the foo+colo names.

 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.

 Stefan, what was your worry with the +colo block drivers?

 I haven't read the patches yet, so I may be misunderstanding, but
 wouldn't a separate filter driver be more appropriate than modifying
 qcow2 with logic that has nothing to do with the image format?

 Possibly; on the other hand, why multiply the size of the test matrix
 with options that no one will use and that will bitrot?
 
 Because it may be the right design.
 
 If you're really worried about the test matrix, put a check in the
 filter block driver that its bs->file is qcow2. Of course, such an
 artificial restriction looks a bit ugly, but using a bad design just
 in order to get the same restriction is even worse.

The bs->file driver should support backing files, and it already uses a
backing reference.

What about the primary side? We should control when to connect to the NBD
server, rather than connecting in nbd_open().

Thanks
Wen Congyang

 
 Stefan originally wanted to put image streaming in the QED driver. I
 think we'll agree today that it was right to reject that. It's simply
 not functionality related to the format. Adding replication logic to
 qcow2 looks similar to me in that respect.
 
 Kevin
 .
 




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini


On 23/04/2015 11:14, Wen Congyang wrote:
 The bs-file-driver should support backing file, and use backing reference
 already.
 
 What about the primary side? We should control when to connect to NBD server,
 not in nbd_open().

My naive suggestion could be to add a BDRV_O_NO_CONNECT option to
bdrv_open and a separate bdrv_connect callback.  Open would fail if
BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL.

You would then need a way to have quorum pass BDRV_O_NO_CONNECT.

Perhaps quorum is not a great match after all, and it's better to add a
new colo driver similar to quorum but simpler and only using the read
policy that you need for colo.  The new driver would also know how to
use BDRV_O_NO_CONNECT.  In any case the amount of work needed would not
be too big.

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 23/04/2015 14:05, Dr. David Alan Gilbert wrote:
  As presented at the moment, I don't see there's any dynamic reconfiguration
  on the primary side at the moment
 
 So that means the bdrv_start_replication and bdrv_stop_replication
 callbacks are more or less redundant, at least on the primary?
 
 In fact, who calls them?  Certainly nothing in this patch set...
 :)

In the main colo set (I'm looking at the February version) there
are calls to them, the 'stop_replication' is called at failover time.

Here is I think the later version:
http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

Dave

 
 Paolo
 
  - it starts up in the configuration with
  the quorum(disk, NBD), and that's the way it stays throughout the 
  fault-tolerant
  setup; the primary doesn't start running until the secondary is connected.
  
  Similarly the secondary startups in the configuration and stays that way;
  the interesting question to me is what happens after a failure.
  
  If the secondary fails, then your primary is still quorum(disk, NBD) but
  the NBD side is dead - so I don't think you need to do anything there
  immediately.
  
  If the primary fails, and the secondary takes over, then a lot of the
  stuff on the secondary now becomes redundant; does that stay the same
  and just operate in some form of passthrough - or does it need to
  change configuration?
  
  The hard part to me is how to bring it back into fault-tolerance now;
  after a primary failure, the secondary now needs to morph into something
  like a primary, and somehow you need to bring up a new secondary
  and get that new secondary an image of the primary's current disk.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini


On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
  So that means the bdrv_start_replication and bdrv_stop_replication
  callbacks are more or less redundant, at least on the primary?
  
  In fact, who calls them?  Certainly nothing in this patch set...
  :)
 In the main colo set (I'm looking at the February version) there
 are calls to them, the 'stop_replication' is called at failover time.
 
 Here is I think the later version:
 http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

I think the primary shouldn't do any I/O after failover (and the
secondary should close the NBD server) so it is probably okay to ignore
the removal for now.  Inserting the filter dynamically is probably
needed though.

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Paolo Bonzini


On 23/04/2015 13:36, Kevin Wolf wrote:
 Crap. Then we need to figure out dynamic reconfiguration for filters
 (CCed Markus and Jeff).
 
 And this is really part of the fundamental operation mode and not just a
 way to give users a way to change their mind at runtime? Because if it
 were, we could go forward without that for the start and add dynamic
 reconfiguration in a second step.

I honestly don't know.  Wen, David?

Paolo

 Anyway, even if we move it to a second step, it looks like we need to
 design something rather soon now.



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
 
 
 On 23/04/2015 13:36, Kevin Wolf wrote:
  Crap. Then we need to figure out dynamic reconfiguration for filters
  (CCed Markus and Jeff).
  
  And this is really part of the fundamental operation mode and not just a
  way to give users a way to change their mind at runtime? Because if it
  were, we could go forward without that for the start and add dynamic
  reconfiguration in a second step.
 
 I honestly don't know.  Wen, David?

As presented at the moment, I don't see there's any dynamic reconfiguration
on the primary side at the moment - it starts up in the configuration with
the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant
setup; the primary doesn't start running until the secondary is connected.

Similarly the secondary startups in the configuration and stays that way;
the interesting question to me is what happens after a failure.

If the secondary fails, then your primary is still quorum(disk, NBD) but
the NBD side is dead - so I don't think you need to do anything there
immediately.

If the primary fails, and the secondary takes over, then a lot of the
stuff on the secondary now becomes redundant; does that stay the same
and just operate in some form of passthrough - or does it need to
change configuration?

The hard part to me is how to bring it back into fault-tolerance now;
after a primary failure, the secondary now needs to morph into something
like a primary, and somehow you need to bring up a new secondary
and get that new secondary an image of the primary's current disk.

Dave

 Paolo
 
  Anyway, even if we move it to a second step, it looks like we need to
  design something rather soon now.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang
On 04/23/2015 06:44 PM, Paolo Bonzini wrote:
 
 
 On 23/04/2015 12:40, Kevin Wolf wrote:
 The question that is still open for me is whether it would be a colo.c
 or an active-mirror.c, i.e. if this would be tied specifically to COLO
 or if it could be kept generic enough that it could be used for other
 use cases as well.
 
 Understood (now).
 
 What I think is really needed here is essentially an active mirror
 filter.

 Yes, an active synchronous mirror.  It can be either a filter or a
 device.  Has anyone ever come up with a design for filters?  Colo
 doesn't need much more complexity than a toy blkverify filter.

 I think what we're doing now for quorum/blkverify/blkdebug is okay.

 The tricky and yet unsolved part is how to add/remove filter BDSes at
 runtime (dynamic reconfiguration), but IIUC that isn't needed here.
 
 Yes, it is.  The defer connection to NBD when replication is started
 is effectively add the COLO filter (with the NBD connection as a
 children) when replication is started.
 
 Similarly close the NBD device when replication is stopped is
 effectively remove the COLO filter (which brings the NBD connection
 down with it).

Hmm, I don't understand it clearly. Do you mean:
1. The COLO filter is quorum's child
2. We can add/remove a quorum child at run-time

If I misunderstand something, please correct me.

Thanks
Wen Congyang

 
 Paolo
 .
 




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Kevin Wolf
Am 23.04.2015 um 12:44 hat Paolo Bonzini geschrieben:
 On 23/04/2015 12:40, Kevin Wolf wrote:
  The question that is still open for me is whether it would be a colo.c
  or an active-mirror.c, i.e. if this would be tied specifically to COLO
  or if it could be kept generic enough that it could be used for other
  use cases as well.
 
 Understood (now).
 
  What I think is really needed here is essentially an active mirror
  filter.
 
  Yes, an active synchronous mirror.  It can be either a filter or a
  device.  Has anyone ever come up with a design for filters?  Colo
  doesn't need much more complexity than a toy blkverify filter.
  
  I think what we're doing now for quorum/blkverify/blkdebug is okay.
  
  The tricky and yet unsolved part is how to add/remove filter BDSes at
  runtime (dynamic reconfiguration), but IIUC that isn't needed here.
 
 Yes, it is.  The defer connection to NBD when replication is started
 is effectively add the COLO filter (with the NBD connection as a
 children) when replication is started.
 
 Similarly close the NBD device when replication is stopped is
 effectively remove the COLO filter (which brings the NBD connection
 down with it).

Crap. Then we need to figure out dynamic reconfiguration for filters
(CCed Markus and Jeff).

And this is really part of the fundamental operation mode and not just a
way to give users a way to change their mind at runtime? Because if it
were, we could go forward without that for the start and add dynamic
reconfiguration in a second step.

Anyway, even if we move it to a second step, it looks like we need to
design something rather soon now.

Kevin



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-23 Thread Wen Congyang
On 04/24/2015 10:01 AM, Fam Zheng wrote:
 On Thu, 04/23 14:23, Paolo Bonzini wrote:


 On 23/04/2015 14:19, Dr. David Alan Gilbert wrote:
 So that means the bdrv_start_replication and bdrv_stop_replication
 callbacks are more or less redundant, at least on the primary?

 In fact, who calls them?  Certainly nothing in this patch set...
 :)
 In the main colo set (I'm looking at the February version) there
 are calls to them, the 'stop_replication' is called at failover time.

 Here is I think the later version:
 http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html

 I think the primary shouldn't do any I/O after failover (and the
 secondary should close the NBD server) so it is probably okay to ignore
 the removal for now.  Inserting the filter dynamically is probably
 needed though.
 
 Or maybe just enabling/disabling?

Hmm, after failover, the secondary qemu should become the primary qemu, but we
don't know the nbd server's IP/port when we start the secondary qemu. So we need
to insert the NBD client dynamically after failover.
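A minimal model of that constraint (illustrative Python; the helper names are
invented - QEMU had no runtime command for this at the time, which is exactly
the gap under discussion):

```python
# The node that survives failover only learns the replacement
# Secondary's address afterwards, so the NBD client cannot be part of
# its startup configuration.  All names are hypothetical.

def start_promoted_primary():
    # At startup the ex-Secondary has only its local disk; the future
    # NBD server's host/port are simply unknown at this point.
    return {"quorum_children": ["local-disk"]}

def attach_nbd_client(node, host, port):
    # Called only after a new Secondary has been brought up and its
    # NBD server address is finally known.
    node["quorum_children"].append(f"nbd://{host}:{port}/secondary")

node = start_promoted_primary()
attach_nbd_client(node, "192.168.0.2", 10809)
```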

Thanks
Wen Congyang

 
 Fam
 .
 




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Paolo Bonzini


On 22/04/2015 11:31, Kevin Wolf wrote:
 Actually I liked the foo+colo names.

 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.

 Stefan, what was your worry with the +colo block drivers?
 
 I haven't read the patches yet, so I may be misunderstanding, but
 wouldn't a separate filter driver be more appropriate than modifying
 qcow2 with logic that has nothing to do with the image format?

Possibly; on the other hand, why multiply the size of the test matrix
with options that no one will use and that will bitrot?

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Stefan Hajnoczi
On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
 On 21/04/2015 03:25, Wen Congyang wrote:
   Please do not introduce name+colo block drivers.  This approach is
   invasive and makes block replication specific to only a few block
   drivers, e.g. NBD or qcow2.
  NBD is used to connect to the secondary qemu, so it must be used. But the
  primary qemu uses quorum, so the primary disk can be any format.
  The secondary disk is the nbd target, and it can also be any format. The cache
  disk (active disk/hidden disk) is an empty disk, and it is created before
  COLO runs. The cache disk format is qcow2 now. In theory, it can be any
  format which supports backing files. But the driver should be updated to
  support colo mode.
  
   A cleaner approach is a QMP command or -drive options that work for any
   BlockDriverState.
  
  OK, I will add a new drive option to avoid using name+colo.
 
 Actually I liked the foo+colo names.
 
 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.
 
 Stefan, what was your worry with the +colo block drivers?

Why does NBD need to know about COLO?  It should be possible to use
iSCSI or other protocols too.

Stefan




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Wen Congyang
On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
 On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
 On 21/04/2015 03:25, Wen Congyang wrote:
 Please do not introduce name+colo block drivers.  This approach is
 invasive and makes block replication specific to only a few block
 drivers, e.g. NBD or qcow2.
 NBD is used to connect to the secondary qemu, so it must be used. But the
 primary qemu uses quorum, so the primary disk can be any format.
 The secondary disk is the nbd target, and it can also be any format. The cache
 disk (active disk/hidden disk) is an empty disk, and it is created before
 COLO runs. The cache disk format is qcow2 now. In theory, it can be any
 format which supports backing files. But the driver should be updated to
 support colo mode.

 A cleaner approach is a QMP command or -drive options that work for any
 BlockDriverState.

 OK, I will add a new drive option to avoid using name+colo.

 Actually I liked the foo+colo names.

 These are just internal details of the implementations and the
 primary/secondary disks actually can be any format.

 Stefan, what was your worry with the +colo block drivers?
 
 Why does NBD need to know about COLO?  It should be possible to use
 iSCSI or other protocols too.

Hmm, if you want to use iSCSI or other protocols, you should update the driver
to implement block replication's control interface.

Currently, we only support NBD.

Thanks
Wen Congyang

 
 Stefan
 




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-22 Thread Stefan Hajnoczi
On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote:
 On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
  On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
  One general question about the design: the Secondary host needs 3x
  storage space since it has the Secondary Disk, hidden-disk, and
  active-disk.  Each image requires a certain amount of space depending on
  writes or COW operations.  Is 3x the upper bound or is there a way to
  reduce the bound?
 
 active disk and hidden disk are temp files. They will be made empty in
 bdrv_do_checkpoint(). Their format is qcow2 now, so they don't need too
 much space if we do checkpoints periodically.

A question related to checkpoints: both Primary and Secondary are active
(running) in COLO.  The Secondary will be slower since it performs extra
work; disk I/O on the Secondary has a COW overhead.

Does this force the Primary to wait for checkpoint commit so that the
Secondary can catch up?

I'm a little confused about that since the point of COLO is to avoid the
overheads of microcheckpointing, but there still seems to be a
checkpointing bottleneck for disk I/O-intensive applications.

  
  The bound is important since large amounts of data become a bottleneck
  for writeout/commit operations.  They could cause downtime if the guest
  is blocked until the entire Disk Buffer has been written to the
  Secondary Disk during failover, for example.
 
 OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if
 I run filebench in the guest. Is there any way to speed it up?

Is it necessary to commit the active disk and hidden disk to the
Secondary Disk on failover?  Maybe the VM could continue executing
immediately and run a block-commit job.  The active disk and hidden disk
files can be dropped once block-commit finishes.
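A toy model of that idea, treating each image as a dict of sectors
(illustrative Python; QEMU's real block-commit is a background job streaming
clusters, not a one-shot merge):

```python
# Committing the overlays down after failover: active-disk sits on
# hidden-disk, which sits on the Secondary Disk.  Newer layers win.

def block_commit(chain):
    """chain[0] is the base image; later entries are overlays, oldest
    first.  Returns the base with all overlays merged in."""
    base = chain[0]
    for overlay in chain[1:]:
        base.update(overlay)   # an overlay's sectors override the base
        overlay.clear()        # the overlay file can be dropped now
    return base

secondary_disk = {0: "S0", 1: "S1"}
hidden_disk    = {1: "H1"}
active_disk    = {1: "A1", 2: "A2"}
merged = block_commit([secondary_disk, hidden_disk, active_disk])
```

The VM would keep reading/writing through the top overlay while the job runs,
which is what lets failover avoid a long stop-the-world commit.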




Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-21 Thread Paolo Bonzini


On 21/04/2015 03:25, Wen Congyang wrote:
  Please do not introduce name+colo block drivers.  This approach is
  invasive and makes block replication specific to only a few block
  drivers, e.g. NBD or qcow2.
 NBD is used to connect to the secondary qemu, so it must be used. But the
 primary qemu uses quorum, so the primary disk can be any format.
 The secondary disk is the nbd target, and it can also be any format. The cache
 disk (active disk/hidden disk) is an empty disk, and it is created before
 COLO runs. The cache disk format is qcow2 now. In theory, it can be any
 format which supports backing files. But the driver should be updated to
 support colo mode.
 
  A cleaner approach is a QMP command or -drive options that work for any
  BlockDriverState.
 
 OK, I will add a new drive option to avoid using name+colo.

Actually I liked the foo+colo names.

These are just internal details of the implementations and the
primary/secondary disks actually can be any format.

Stefan, what was your worry with the +colo block drivers?

Paolo



Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-20 Thread Wen Congyang
On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
 On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Yang Hongyang yan...@cn.fujitsu.com
 Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
 Signed-off-by: Gonglei arei.gong...@huawei.com
 ---
  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++
  1 file changed, 153 insertions(+)
  create mode 100644 docs/block-replication.txt

 diff --git a/docs/block-replication.txt b/docs/block-replication.txt
 new file mode 100644
 index 000..4426ffc
 --- /dev/null
 +++ b/docs/block-replication.txt
 @@ -0,0 +1,153 @@
 +Block replication
 +
 +Copyright Fujitsu, Corp. 2015
 +Copyright (c) 2015 Intel Corporation
 +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
 +
 +This work is licensed under the terms of the GNU GPL, version 2 or later.
 +See the COPYING file in the top-level directory.
 +
 +Block replication is used for continuous checkpoints. It is designed
 +for COLO (COarse-grained LOck-stepping) where the Secondary VM is running.
 +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
 +where the Secondary VM is not running.
 +
 +This document gives an overview of block replication's design.
 +
 +== Background ==
 +High availability solutions such as micro checkpoint and COLO will do
 +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
 +identical right after a VM checkpoint, but becomes different as the VM
 +executes till the next checkpoint. To support disk contents checkpoint,
 +the modified disk contents in the Secondary VM must be buffered, and are
 +only dropped at next checkpoint time. To reduce the network transportation
 +effort at the time of checkpoint, the disk modification operations of
 +Primary disk are asynchronously forwarded to the Secondary node.
 +
 +== Workflow ==
 +The following is the image of block replication workflow:
 +
 +        +----------------------+            +------------------------+
 +        |Primary Write Requests|            |Secondary Write Requests|
 +        +----------------------+            +------------------------+
 +                  |                                       |
 +                  |                                      (4)
 +                  |                                       V
 +                  |                              /-------------\
 +                  |      Copy and Forward        |             |
 +                  |---------(1)----------+       | Disk Buffer |
 +                  |                      |       |             |
 +                  |                     (3)      \-------------/
 +                  |                 speculative      ^
 +                  |                write through    (2)
 +                  |                      |           |
 +                  V                      V           |
 +           +--------------+           +----------------+
 +           | Primary Disk |           | Secondary Disk |
 +           +--------------+           +----------------+
 +
 +1) Primary write requests will be copied and forwarded to Secondary
 +   QEMU.
 +2) Before Primary write requests are written to Secondary disk, the
 +   original sector content will be read from Secondary disk and
 +   buffered in the Disk buffer, but it will not overwrite the existing
 +   sector content(it could be from either Secondary Write Requests or
 +   previous COW of Primary Write Requests) in the Disk buffer.
 +3) Primary write requests will be written to Secondary disk.
 +4) Secondary write requests will be buffered in the Disk buffer and it
 +   will overwrite the existing sector content in the buffer.
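The COW rule in steps 2) through 4) can be sketched as a tiny model
(illustrative Python with hypothetical names; this is not code from the
patch series):

```python
# Toy model of the Disk Buffer.  Sectors and contents are plain dict
# entries; self.buf holds the Secondary's divergence this checkpoint.

class DiskBuffer:
    def __init__(self, secondary_disk):
        self.buf = {}                  # sector -> content diverged this checkpoint
        self.disk = secondary_disk     # the Secondary disk, sector -> content

    def primary_write(self, sector, data):
        # Step 2: COW the original content into the buffer, but never
        # overwrite a sector already buffered (from a Secondary write
        # or an earlier COW).
        if sector not in self.buf:
            self.buf[sector] = self.disk.get(sector)
        # Step 3: the Primary write goes through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes land in the buffer and DO overwrite
        # whatever is buffered for that sector.
        self.buf[sector] = data

    def secondary_read(self, sector):
        # The Secondary VM sees the buffer first, then the disk.
        return self.buf.get(sector, self.disk.get(sector))

    def checkpoint(self):
        # At a checkpoint the buffered divergence is dropped, so the
        # Secondary's view snaps back to the (Primary-tracking) disk.
        self.buf.clear()
```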
 +
 +== Architecture ==
 +We are going to implement COLO block replication from many basic
 +blocks that are already in QEMU.
 +
 +         virtio-blk       ||
 +             ^            ||                            .----------
 +             |            ||                            | Secondary
 +        1 Quorum          ||                            '----------
 +         /      \         ||
 +        /        \        ||
 +   Primary      2 NBD  ------->  2 NBD
 +     disk       client    ||     server                                 virtio-blk
 +                          ||        ^                                        ^
 +--------.                 ||        |                                        |
 +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
 +--------'                 ||        |          backing        ^       backing
 +                          ||        |                         |
 +                          ||        |                         |
 +                          ||

Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description

2015-04-20 Thread Stefan Hajnoczi
On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Yang Hongyang yan...@cn.fujitsu.com
 Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
 Signed-off-by: Gonglei arei.gong...@huawei.com
 ---
  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++++++
  1 file changed, 153 insertions(+)
  create mode 100644 docs/block-replication.txt
 
 diff --git a/docs/block-replication.txt b/docs/block-replication.txt
 new file mode 100644
 index 000..4426ffc
 --- /dev/null
 +++ b/docs/block-replication.txt
 @@ -0,0 +1,153 @@
 +Block replication
 +
 +Copyright Fujitsu, Corp. 2015
 +Copyright (c) 2015 Intel Corporation
 +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
 +
 +This work is licensed under the terms of the GNU GPL, version 2 or later.
 +See the COPYING file in the top-level directory.
 +
 +Block replication is used for continuous checkpoints. It is designed
 +for COLO (COarse-grained LOck-stepping) where the Secondary VM is running.
 +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
 +where the Secondary VM is not running.
 +
 +This document gives an overview of block replication's design.
 +
 +== Background ==
 +High availability solutions such as micro checkpoint and COLO will do
 +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
 +identical right after a VM checkpoint, but becomes different as the VM
 +executes till the next checkpoint. To support disk contents checkpoint,
 +the modified disk contents in the Secondary VM must be buffered, and are
 +only dropped at next checkpoint time. To reduce the network transportation
 +effort at the time of checkpoint, the disk modification operations of
 +Primary disk are asynchronously forwarded to the Secondary node.
 +
 +== Workflow ==
 +The following is the image of block replication workflow:
 +
 +        +----------------------+            +------------------------+
 +        |Primary Write Requests|            |Secondary Write Requests|
 +        +----------------------+            +------------------------+
 +                  |                                       |
 +                  |                                      (4)
 +                  |                                       V
 +                  |                              /-------------\
 +                  |      Copy and Forward        |             |
 +                  |---------(1)----------+       | Disk Buffer |
 +                  |                      |       |             |
 +                  |                     (3)      \-------------/
 +                  |                 speculative      ^
 +                  |                write through    (2)
 +                  |                      |           |
 +                  V                      V           |
 +           +--------------+           +----------------+
 +           | Primary Disk |           | Secondary Disk |
 +           +--------------+           +----------------+
 +
 +1) Primary write requests will be copied and forwarded to Secondary
 +   QEMU.
 +2) Before Primary write requests are written to Secondary disk, the
 +   original sector content will be read from Secondary disk and
 +   buffered in the Disk buffer, but it will not overwrite the existing
 +   sector content(it could be from either Secondary Write Requests or
 +   previous COW of Primary Write Requests) in the Disk buffer.
 +3) Primary write requests will be written to Secondary disk.
 +4) Secondary write requests will be buffered in the Disk buffer and it
 +   will overwrite the existing sector content in the buffer.
 +
 +== Architecture ==
 +We are going to implement COLO block replication from many basic
 +blocks that are already in QEMU.
 +
 +         virtio-blk       ||
 +             ^            ||                            .----------
 +             |            ||                            | Secondary
 +        1 Quorum          ||                            '----------
 +         /      \         ||
 +        /        \        ||
 +   Primary      2 NBD  ------->  2 NBD
 +     disk       client    ||     server                                 virtio-blk
 +                          ||        ^                                        ^
 +--------.                 ||        |                                        |
 +Primary |                 ||  Secondary disk <--------- hidden-disk 4 <--------- active-disk 3
 +--------'                 ||        |          backing        ^       backing
 +                          ||        |                         |
 +                          ||        |                         |
 +                          ||        '-------------------------'
 +                          ||