Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Kevin Wolf (kw...@redhat.com) wrote: > Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben: > > On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote: > > > * Stefan Hajnoczi (stefa...@redhat.com) wrote: > > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > > > > >> > > > > > > >> > That can be done with drive-mirror. But I think it's too early > > > > > >> > for that. > > > > > > Do you mean use drive-mirror instead of quorum? > > > > > > > > > > Only before starting up a new secondary. Basically you do a migration > > > > > with non-shared storage, and then start the secondary in colo mode. > > > > > > > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > > > > driver or filter) is fine for normal colo operation. > > > > > > > > Perhaps this patch series should mirror the Secondary's disk to a Backup > > > > Secondary so that the system can be protected very quickly after > > > > failover. > > > > > > > > I think anyone serious about fault tolerance would deploy a Backup > > > > Secondary, otherwise the system cannot survive two failures unless a > > > > human administrator is lucky/fast enough to set up a new Secondary. > > > > > > I'd assumed that a higher level management layer would do the allocation > > > of a new secondary after the first failover, so no human need be involved. > > > > That doesn't help, after the first failover is too late even if it's > > done by a program. There should be no window during which the VM is > > unprotected. > > > > People who want fault tolerance care about 9s of availability. The VM > > must be protected on the new Primary as soon as the failover occurs, > > otherwise this isn't a serious fault tolerance solution. > > If you're worried about two failures in a row, why wouldn't you be > worried about three in a row? 
> I think if you really want more than one backup to be ready, you shouldn't go to two, but to n. Agreed, if you did multiple secondaries you'd do 'n'. But 1+2 does satisfy all but the most paranoid; and in particular it means that if you want to take a host down for some maintenance you can do it without worrying. But, as I said in my reply to Stefan, doing more than 1+1 gets really hairy: 1) as Stefan mentions, you worry about the lack of protection after the first failover; 2) the combinations of failovers get much more complicated. Dave > Kevin -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 08.05.2015 um 10:42 hat Stefan Hajnoczi geschrieben: > On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote: > > * Stefan Hajnoczi (stefa...@redhat.com) wrote: > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > > > >> > > > > > >> > That can be done with drive-mirror. But I think it's too early > > > > >> > for that. > > > > > Do you mean use drive-mirror instead of quorum? > > > > > > > > Only before starting up a new secondary. Basically you do a migration > > > > with non-shared storage, and then start the secondary in colo mode. > > > > > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > > > driver or filter) is fine for normal colo operation. > > > > > > Perhaps this patch series should mirror the Secondary's disk to a Backup > > > Secondary so that the system can be protected very quickly after > > > failover. > > > > > > I think anyone serious about fault tolerance would deploy a Backup > > > Secondary, otherwise the system cannot survive two failures unless a > > > human administrator is lucky/fast enough to set up a new Secondary. > > > > I'd assumed that a higher level management layer would do the allocation > > of a new secondary after the first failover, so no human need be involved. > > That doesn't help, after the first failover is too late even if it's > done by a program. There should be no window during which the VM is > unprotected. > > People who want fault tolerance care about 9s of availability. The VM > must be protected on the new Primary as soon as the failover occurs, > otherwise this isn't a serious fault tolerance solution. If you're worried about two failures in a row, why wouldn't you be worried about three in a row? I think if you really want more than one backup to be ready, you shouldn't go to two, but to n. Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Stefan Hajnoczi (stefa...@redhat.com) wrote: > On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote: > > * Stefan Hajnoczi (stefa...@redhat.com) wrote: > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > > > >> > > > > > >> > That can be done with drive-mirror. But I think it's too early > > > > >> > for that. > > > > > Do you mean use drive-mirror instead of quorum? > > > > > > > > Only before starting up a new secondary. Basically you do a migration > > > > with non-shared storage, and then start the secondary in colo mode. > > > > > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > > > driver or filter) is fine for normal colo operation. > > > > > > Perhaps this patch series should mirror the Secondary's disk to a Backup > > > Secondary so that the system can be protected very quickly after > > > failover. > > > > > > I think anyone serious about fault tolerance would deploy a Backup > > > Secondary, otherwise the system cannot survive two failures unless a > > > human administrator is lucky/fast enough to set up a new Secondary. > > > > I'd assumed that a higher level management layer would do the allocation > > of a new secondary after the first failover, so no human need be involved. > > That doesn't help, after the first failover is too late even if it's > done by a program. There should be no window during which the VM is > unprotected. > > People who want fault tolerance care about 9s of availability. The VM > must be protected on the new Primary as soon as the failover occurs, > otherwise this isn't a serious fault tolerance solution. I'm not aware of any other system that manages that, so I don't think that's fair. You gain a lot more availability going from a single system to the 1+1 system that COLO (or any of the checkpointing systems) propose, I can't say how many 9s it gets you. 
It's true having multiple secondaries would get you a bit more on top of that, but you're still a lot better off just having the one secondary. I had thought that having >1 secondary would be a nice addition, but it's a big change everywhere else (e.g. having to maintain multiple migration streams, dealing with miscompares from multiple hosts). Dave > > Stefan -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Tue, May 05, 2015 at 04:23:56PM +0100, Dr. David Alan Gilbert wrote: > * Stefan Hajnoczi (stefa...@redhat.com) wrote: > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > > >> > > > > >> > That can be done with drive-mirror. But I think it's too early for > > > >> > that. > > > > Do you mean use drive-mirror instead of quorum? > > > > > > Only before starting up a new secondary. Basically you do a migration > > > with non-shared storage, and then start the secondary in colo mode. > > > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > > driver or filter) is fine for normal colo operation. > > > > Perhaps this patch series should mirror the Secondary's disk to a Backup > > Secondary so that the system can be protected very quickly after > > failover. > > > > I think anyone serious about fault tolerance would deploy a Backup > > Secondary, otherwise the system cannot survive two failures unless a > > human administrator is lucky/fast enough to set up a new Secondary. > > I'd assumed that a higher level management layer would do the allocation > of a new secondary after the first failover, so no human need be involved. That doesn't help, after the first failover is too late even if it's done by a program. There should be no window during which the VM is unprotected. People who want fault tolerance care about 9s of availability. The VM must be protected on the new Primary as soon as the failover occurs, otherwise this isn't a serious fault tolerance solution. Stefan
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Wed, 05/06 02:26, Dong, Eddie wrote: > > > > -Original Message- > > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com] > > Sent: Tuesday, May 05, 2015 11:24 PM > > To: Stefan Hajnoczi > > Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu > > block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang > > Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com > > Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description > > > > * Stefan Hajnoczi (stefa...@redhat.com) wrote: > > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > > > >> > > > > > >> > That can be done with drive-mirror. But I think it's too early > > > > >> > for that. > > > > > Do you mean use drive-mirror instead of quorum? > > > > > > > > Only before starting up a new secondary. Basically you do a > > > > migration with non-shared storage, and then start the secondary in colo > > mode. > > > > > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > > > driver or filter) is fine for normal colo operation. > > > > > > Perhaps this patch series should mirror the Secondary's disk to a > > > Backup Secondary so that the system can be protected very quickly > > > after failover. > > > > > > I think anyone serious about fault tolerance would deploy a Backup > > > Secondary, otherwise the system cannot survive two failures unless a > > > human administrator is lucky/fast enough to set up a new Secondary. > > > > I'd assumed that a higher level management layer would do the allocation of > > a > > new secondary after the first failover, so no human need be involved. > > > > I agree. The cloud OS, such as open stack, will have the capability to handle > the case, together with certain API in VMM side for this (libvirt?). The question here is the QMP API to switch secondary mode to primary mode is not mentioned in this series. 
I think that interface matters for this series. Fam
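Fam's point is that the secondary-to-primary promotion is described only in prose; no QMP command for it exists in the series. Purely for illustration, a promotion sequence on the surviving host might look like the exchange below. `nbd-server-stop` is an existing QMP command; the failover command name is hypothetical at the time of this thread (QEMU later settled on `x-colo-lost-heartbeat`):

```
-> { "execute": "nbd-server-stop" }
<- { "return": {} }
-> { "execute": "x-colo-lost-heartbeat" }
<- { "return": {} }
```

Whatever the final name, the interface would have to both tear down the replication plumbing (the NBD server on the secondary) and let the guest run from the local disk.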
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
> -Original Message- > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com] > Sent: Tuesday, May 05, 2015 11:24 PM > To: Stefan Hajnoczi > Cc: Paolo Bonzini; Wen Congyang; Fam Zheng; Kevin Wolf; Lai Jiangshan; qemu > block; Jiang, Yunhong; Dong, Eddie; qemu devel; Max Reitz; Gonglei; Yang > Hongyang; zhanghailiang; arm...@redhat.com; jc...@redhat.com > Subject: Re: [PATCH COLO v3 01/14] docs: block replication's description > > * Stefan Hajnoczi (stefa...@redhat.com) wrote: > > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > > >> > > > > >> > That can be done with drive-mirror. But I think it's too early for > > > >> > that. > > > > Do you mean use drive-mirror instead of quorum? > > > > > > Only before starting up a new secondary. Basically you do a > > > migration with non-shared storage, and then start the secondary in colo > mode. > > > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > > driver or filter) is fine for normal colo operation. > > > > Perhaps this patch series should mirror the Secondary's disk to a > > Backup Secondary so that the system can be protected very quickly > > after failover. > > > > I think anyone serious about fault tolerance would deploy a Backup > > Secondary, otherwise the system cannot survive two failures unless a > > human administrator is lucky/fast enough to set up a new Secondary. > > I'd assumed that a higher level management layer would do the allocation of a > new secondary after the first failover, so no human need be involved. > I agree. The cloud OS, such as open stack, will have the capability to handle the case, together with certain API in VMM side for this (libvirt?). Thx Eddie
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Stefan Hajnoczi (stefa...@redhat.com) wrote: > On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > > > > On 24/04/2015 11:38, Wen Congyang wrote: > > >> > > > >> > That can be done with drive-mirror. But I think it's too early for > > >> > that. > > > Do you mean use drive-mirror instead of quorum? > > > > Only before starting up a new secondary. Basically you do a migration > > with non-shared storage, and then start the secondary in colo mode. > > > > But it's only for the failover case. Quorum (or a new block/colo.c > > driver or filter) is fine for normal colo operation. > > Perhaps this patch series should mirror the Secondary's disk to a Backup > Secondary so that the system can be protected very quickly after > failover. > > I think anyone serious about fault tolerance would deploy a Backup > Secondary, otherwise the system cannot survive two failures unless a > human administrator is lucky/fast enough to set up a new Secondary. I'd assumed that a higher level management layer would do the allocation of a new secondary after the first failover, so no human need be involved. Dave > Stefan -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Wed, Apr 29, 2015 at 04:37:49PM +0800, Gonglei wrote: > On 2015/4/29 16:29, Paolo Bonzini wrote: > > > > > > On 27/04/2015 11:37, Stefan Hajnoczi wrote: > But it's only for the failover case. Quorum (or a new > block/colo.c driver or filter) is fine for normal colo > operation. > >> Perhaps this patch series should mirror the Secondary's disk to a > >> Backup Secondary so that the system can be protected very quickly > >> after failover. > >> > >> I think anyone serious about fault tolerance would deploy a Backup > >> Secondary, otherwise the system cannot survive two failures > >> unless a human administrator is lucky/fast enough to set up a new > >> Secondary. > > > > Let's do one thing at a time. Otherwise nothing of this is going to > > be ever completed... > > > Yes, and the continuous backup feature is on our TODO list. We hope > this series (including basic functions and COLO framework) can be > upstream first. That's fine, I just wanted to make sure you have the issue in mind. Stefan
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 27/04/2015 11:37, Stefan Hajnoczi wrote: >>> But it's only for the failover case. Quorum (or a new >>> block/colo.c driver or filter) is fine for normal colo >>> operation. > Perhaps this patch series should mirror the Secondary's disk to a > Backup Secondary so that the system can be protected very quickly > after failover. > > I think anyone serious about fault tolerance would deploy a Backup > Secondary, otherwise the system cannot survive two failures > unless a human administrator is lucky/fast enough to set up a new > Secondary. Let's do one thing at a time. Otherwise nothing of this is going to be ever completed... Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 2015/4/29 16:29, Paolo Bonzini wrote: > > > On 27/04/2015 11:37, Stefan Hajnoczi wrote: But it's only for the failover case. Quorum (or a new block/colo.c driver or filter) is fine for normal colo operation. >> Perhaps this patch series should mirror the Secondary's disk to a >> Backup Secondary so that the system can be protected very quickly >> after failover. >> >> I think anyone serious about fault tolerance would deploy a Backup >> Secondary, otherwise the system cannot survive two failures >> unless a human administrator is lucky/fast enough to set up a new >> Secondary. > > Let's do one thing at a time. Otherwise nothing of this is going to > be ever completed... > Yes, and the continuous backup feature is on our TODO list. We hope this series (including basic functions and COLO framework) can be upstream first. Regards, -Gonglei
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Fri, Apr 24, 2015 at 11:36:35AM +0200, Paolo Bonzini wrote: > > > On 24/04/2015 11:38, Wen Congyang wrote: > >> > > >> > That can be done with drive-mirror. But I think it's too early for that. > > Do you mean use drive-mirror instead of quorum? > > Only before starting up a new secondary. Basically you do a migration > with non-shared storage, and then start the secondary in colo mode. > > But it's only for the failover case. Quorum (or a new block/colo.c > driver or filter) is fine for normal colo operation. Perhaps this patch series should mirror the Secondary's disk to a Backup Secondary so that the system can be protected very quickly after failover. I think anyone serious about fault tolerance would deploy a Backup Secondary, otherwise the system cannot survive two failures unless a human administrator is lucky/fast enough to set up a new Secondary. Stefan
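For readers following the thread, the quorum(disk, NBD) arrangement under discussion is, roughly, the primary-side configuration below. This is a sketch based on the design described in the thread, not the exact syntax of the series; the image filename, host, port and export name are made up:

```
# Primary disk: a quorum with the local image as child 0 and an NBD
# client (pointing at the Secondary's NBD server) as child 1.
# read-pattern=fifo sends reads to the first child (the local image);
# writes go to both children, which is what replicates the disk.
qemu-system-x86_64 ... \
  -drive if=virtio,driver=quorum,read-pattern=fifo,vote-threshold=1,\
children.0.file.filename=primary.img,children.0.driver=raw,\
children.1.file.driver=nbd,children.1.file.host=secondary.example,\
children.1.file.port=8889,children.1.file.export=colo-disk0,\
children.1.driver=raw
```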
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 24/04/2015 11:53, Wen Congyang wrote: >> > Only before starting up a new secondary. Basically you do a migration >> > with non-shared storage, and then start the secondary in colo mode. >> > >> > But it's only for the failover case. Quorum (or a new block/colo.c >> > driver or filter) is fine for normal colo operation. > Is nbd+colo needed to connect the NBD server later? Elsewhere in the thread I proposed a new flag BDRV_O_NO_CONNECT and a new BlockDriver function pointer bdrv_connect. Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/24/2015 05:36 PM, Paolo Bonzini wrote: > > > On 24/04/2015 11:38, Wen Congyang wrote: That can be done with drive-mirror. But I think it's too early for that. >> Do you mean use drive-mirror instead of quorum? > > Only before starting up a new secondary. Basically you do a migration > with non-shared storage, and then start the secondary in colo mode. > > But it's only for the failover case. Quorum (or a new block/colo.c > driver or filter) is fine for normal colo operation. Is nbd+colo needed to connect the NBD server later? Thanks Wen Congyang > > Paolo > . >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 24/04/2015 11:38, Wen Congyang wrote: >> > >> > That can be done with drive-mirror. But I think it's too early for that. > Do you mean use drive-mirror instead of quorum? Only before starting up a new secondary. Basically you do a migration with non-shared storage, and then start the secondary in colo mode. But it's only for the failover case. Quorum (or a new block/colo.c driver or filter) is fine for normal colo operation. Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/24/2015 05:04 PM, Paolo Bonzini wrote: > > > On 24/04/2015 10:58, Dr. David Alan Gilbert wrote: If we can add a filter dynamically, we can add a filter that's file is nbd dynamically after secondary qemu's nbd server is ready. In this case, I think there is no need to touch nbd client. >> Yes, I think maybe the harder part is getting a copy of the current disk >> contents to the new secondary while the new primary is still running. > > That can be done with drive-mirror. But I think it's too early for that. Do you mean use drive-mirror instead of quorum? Hmm, I don't find the final design for primary QEMU... Thanks Wen Congyang > > Paolo > . >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 24/04/2015 10:58, Dr. David Alan Gilbert wrote: >> > If we can add a filter dynamically, we can add a filter that's file is nbd >> > dynamically after secondary qemu's nbd server is ready. In this case, I >> > think >> > there is no need to touch nbd client. > Yes, I think maybe the harder part is getting a copy of the current disk > contents to the new secondary while the new primary is still running. That can be done with drive-mirror. But I think it's too early for that. Paolo
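The drive-mirror Paolo refers to is the existing QMP command. Copying the new primary's disk to a freshly provisioned secondary over NBD might look roughly like the following; the device name, host and export are assumptions, and `"mode": "existing"` presumes the target image was pre-created on the new secondary:

```
{ "execute": "drive-mirror",
  "arguments": { "device": "colo-disk0",
                 "target": "nbd://new-secondary.example:8889/colo-disk0",
                 "sync": "full",
                 "mode": "existing" } }
```

Once the mirror reaches steady state, the new secondary can be started in colo mode, which matches the "migration with non-shared storage" step Paolo describes.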
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Wen Congyang (we...@cn.fujitsu.com) wrote: > On 04/24/2015 03:47 PM, Paolo Bonzini wrote: > > > > > > On 24/04/2015 04:16, Wen Congyang wrote: > >> I think the primary shouldn't do any I/O after failover (and the > >> secondary should close the NBD server) so it is probably okay to ignore > >> the removal for now. Inserting the filter dynamically is probably > >> needed though. > > Or maybe just enabling/disabling? > >> Hmm, after failover, the secondary qemu should become primary qemu, but we > >> don't > >> know the nbd server's IP/port when we execute the secondary qemu. So we > >> need > >> to inserting nbd client dynamically after failover. > > > > True, but secondary->primary switch is already not supported in v3. > > Yes, we should consider it, and support it more easily later. > > If we can add a filter dynamically, we can add a filter that's file is nbd > dynamically after secondary qemu's nbd server is ready. In this case, I think > there is no need to touch nbd client. Yes, I think maybe the harder part is getting a copy of the current disk contents to the new secondary while the new primary is still running. Dave > > Thanks > Wen Congyang > > > > > Kevin/Stefan, is there a design document somewhere that covers at least > > static filters? > > > > Paolo > > . > > > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/24/2015 03:47 PM, Paolo Bonzini wrote: > > > On 24/04/2015 04:16, Wen Congyang wrote: >> I think the primary shouldn't do any I/O after failover (and the >> secondary should close the NBD server) so it is probably okay to ignore >> the removal for now. Inserting the filter dynamically is probably >> needed though. Or maybe just enabling/disabling? >> Hmm, after failover, the secondary qemu should become primary qemu, but we >> don't >> know the nbd server's IP/port when we execute the secondary qemu. So we need >> to inserting nbd client dynamically after failover. > > True, but secondary->primary switch is already not supported in v3. Yes, we should consider it, and support it more easily later. If we can add a filter dynamically, we can add a filter whose file is NBD dynamically after the secondary qemu's NBD server is ready. In this case, I think there is no need to touch the NBD client. Thanks Wen Congyang > > Kevin/Stefan, is there a design document somewhere that covers at least > static filters? > > Paolo > . >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 24/04/2015 04:16, Wen Congyang wrote: >>> >> I think the primary shouldn't do any I/O after failover (and the >>> >> secondary should close the NBD server) so it is probably okay to ignore >>> >> the removal for now. Inserting the filter dynamically is probably >>> >> needed though. >> > >> > Or maybe just enabling/disabling? > Hmm, after failover, the secondary qemu should become primary qemu, but we > don't > know the nbd server's IP/port when we execute the secondary qemu. So we need > to inserting nbd client dynamically after failover. True, but secondary->primary switch is already not supported in v3. Kevin/Stefan, is there a design document somewhere that covers at least static filters? Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/24/2015 10:01 AM, Fam Zheng wrote: > On Thu, 04/23 14:23, Paolo Bonzini wrote: >> >> >> On 23/04/2015 14:19, Dr. David Alan Gilbert wrote: > So that means the bdrv_start_replication and bdrv_stop_replication > callbacks are more or less redundant, at least on the primary? > > In fact, who calls them? Certainly nothing in this patch set... > :) >>> In the main colo set (I'm looking at the February version) there >>> are calls to them, the 'stop_replication' is called at failover time. >>> >>> Here is I think the later version: >>> http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html >> >> I think the primary shouldn't do any I/O after failover (and the >> secondary should close the NBD server) so it is probably okay to ignore >> the removal for now. Inserting the filter dynamically is probably >> needed though. > > Or maybe just enabling/disabling? Hmm, after failover, the secondary qemu should become the primary qemu, but we don't know the NBD server's IP/port when we start the secondary qemu. So we need to insert the NBD client dynamically after failover. Thanks Wen Congyang > > Fam > . >
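The "insert the NBD client dynamically" step maps naturally onto blockdev-add once the new secondary's server address is known. In the shape that interface eventually took (the node name, host and port here are made up, and blockdev-add was still experimental at the time of this thread), the addition would look roughly like:

```
{ "execute": "blockdev-add",
  "arguments": { "driver": "nbd",
                 "node-name": "replication0",
                 "server": { "type": "inet",
                             "host": "new-secondary.example",
                             "port": "8889" },
                 "export": "colo-disk0" } }
```

The remaining piece, which the thread identifies as unsolved, is attaching that node as a new child of the running quorum without restarting the guest.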
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Thu, 04/23 14:23, Paolo Bonzini wrote: > > > On 23/04/2015 14:19, Dr. David Alan Gilbert wrote: > >> > So that means the bdrv_start_replication and bdrv_stop_replication > >> > callbacks are more or less redundant, at least on the primary? > >> > > >> > In fact, who calls them? Certainly nothing in this patch set... > >> > :) > > In the main colo set (I'm looking at the February version) there > > are calls to them, the 'stop_replication' is called at failover time. > > > > Here is I think the later version: > > http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html > > I think the primary shouldn't do any I/O after failover (and the > secondary should close the NBD server) so it is probably okay to ignore > the removal for now. Inserting the filter dynamically is probably > needed though. Or maybe just enabling/disabling? Fam
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 14:19, Dr. David Alan Gilbert wrote: >> > So that means the bdrv_start_replication and bdrv_stop_replication >> > callbacks are more or less redundant, at least on the primary? >> > >> > In fact, who calls them? Certainly nothing in this patch set... >> > :) > In the main colo set (I'm looking at the February version) there > are calls to them, the 'stop_replication' is called at failover time. > > Here is I think the later version: > http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html I think the primary shouldn't do any I/O after failover (and the secondary should close the NBD server) so it is probably okay to ignore the removal for now. Inserting the filter dynamically is probably needed though. Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Paolo Bonzini (pbonz...@redhat.com) wrote: > > > On 23/04/2015 14:05, Dr. David Alan Gilbert wrote: > > As presented at the moment, I don't see there's any dynamic reconfiguration > > on the primary side at the moment > > So that means the bdrv_start_replication and bdrv_stop_replication > callbacks are more or less redundant, at least on the primary? > > In fact, who calls them? Certainly nothing in this patch set... > :) In the main colo set (I'm looking at the February version) there are calls to them, the 'stop_replication' is called at failover time. Here is I think the later version: http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg05391.html Dave > > Paolo > > - it starts up in the configuration with > > the quorum(disk, NBD), and that's the way it stays throughout the > > fault-tolerant > > setup; the primary doesn't start running until the secondary is connected. > > > > Similarly the secondary startups in the configuration and stays that way; > > the interesting question to me is what happens after a failure. > > > > If the secondary fails, then your primary is still quorum(disk, NBD) but > > the NBD side is dead - so I don't think you need to do anything there > > immediately. > > > > If the primary fails, and the secondary takes over, then a lot of the > > stuff on the secondary now becomes redundent; does that stay the same > > and just operate in some form of passthrough - or does it need to > > change configuration? > > > > The hard part to me is how to bring it back into fault-tolerance now; > > after a primary failure, the secondary now needs to morph into something > > like a primary, and somehow you need to bring up a new secondary > > and get that new secondary an image of the primaries current disk. -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 14:05, Dr. David Alan Gilbert wrote: > As presented at the moment, I don't see there's any dynamic reconfiguration > on the primary side at the moment So that means the bdrv_start_replication and bdrv_stop_replication callbacks are more or less redundant, at least on the primary? In fact, who calls them? Certainly nothing in this patch set... :) Paolo - it starts up in the configuration with > the quorum(disk, NBD), and that's the way it stays throughout the > fault-tolerant > setup; the primary doesn't start running until the secondary is connected. > > Similarly the secondary startups in the configuration and stays that way; > the interesting question to me is what happens after a failure. > > If the secondary fails, then your primary is still quorum(disk, NBD) but > the NBD side is dead - so I don't think you need to do anything there > immediately. > > If the primary fails, and the secondary takes over, then a lot of the > stuff on the secondary now becomes redundent; does that stay the same > and just operate in some form of passthrough - or does it need to > change configuration? > > The hard part to me is how to bring it back into fault-tolerance now; > after a primary failure, the secondary now needs to morph into something > like a primary, and somehow you need to bring up a new secondary > and get that new secondary an image of the primaries current disk.
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Paolo Bonzini (pbonz...@redhat.com) wrote: > > > On 23/04/2015 13:36, Kevin Wolf wrote: > > Crap. Then we need to figure out dynamic reconfiguration for filters > > (CCed Markus and Jeff). > > > > And this is really part of the fundamental operation mode and not just a > > way to give users a way to change their mind at runtime? Because if it > > were, we could go forward without that for the start and add dynamic > > reconfiguration in a second step. > > I honestly don't know. Wen, David? As presented at the moment, I don't see there's any dynamic reconfiguration on the primary side at the moment - it starts up in the configuration with the quorum(disk, NBD), and that's the way it stays throughout the fault-tolerant setup; the primary doesn't start running until the secondary is connected. Similarly the secondary startups in the configuration and stays that way; the interesting question to me is what happens after a failure. If the secondary fails, then your primary is still quorum(disk, NBD) but the NBD side is dead - so I don't think you need to do anything there immediately. If the primary fails, and the secondary takes over, then a lot of the stuff on the secondary now becomes redundent; does that stay the same and just operate in some form of passthrough - or does it need to change configuration? The hard part to me is how to bring it back into fault-tolerance now; after a primary failure, the secondary now needs to morph into something like a primary, and somehow you need to bring up a new secondary and get that new secondary an image of the primaries current disk. Dave > Paolo > > > Anyway, even if we move it to a second step, it looks like we need to > > design something rather soon now. -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 13:36, Kevin Wolf wrote: > Crap. Then we need to figure out dynamic reconfiguration for filters > (CCed Markus and Jeff). > > And this is really part of the fundamental operation mode and not just a > way to give users a way to change their mind at runtime? Because if it > were, we could go forward without that for the start and add dynamic > reconfiguration in a second step. I honestly don't know. Wen, David? Paolo > Anyway, even if we move it to a second step, it looks like we need to > design something rather soon now.
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 23.04.2015 um 12:44 hat Paolo Bonzini geschrieben: > On 23/04/2015 12:40, Kevin Wolf wrote: > > The question that is still open for me is whether it would be a colo.c > > or an active-mirror.c, i.e. if this would be tied specifically to COLO > > or if it could be kept generic enough that it could be used for other > > use cases as well. > > Understood (now). > > >>> What I think is really needed here is essentially an active mirror > >>> filter. > >> > >> Yes, an active synchronous mirror. It can be either a filter or a > >> device. Has anyone ever come up with a design for filters? Colo > >> doesn't need much more complexity than a "toy" blkverify filter. > > > > I think what we're doing now for quorum/blkverify/blkdebug is okay. > > > > The tricky and yet unsolved part is how to add/remove filter BDSes at > > runtime (dynamic reconfiguration), but IIUC that isn't needed here. > > Yes, it is. The "defer connection to NBD when replication is started" > is effectively "add the COLO filter" (with the NBD connection as a > children) when replication is started. > > Similarly "close the NBD device when replication is stopped" is > effectively "remove the COLO filter" (which brings the NBD connection > down with it). Crap. Then we need to figure out dynamic reconfiguration for filters (CCed Markus and Jeff). And this is really part of the fundamental operation mode and not just a way to give users a way to change their mind at runtime? Because if it were, we could go forward without that for the start and add dynamic reconfiguration in a second step. Anyway, even if we move it to a second step, it looks like we need to design something rather soon now. Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/23/2015 06:44 PM, Paolo Bonzini wrote: > > > On 23/04/2015 12:40, Kevin Wolf wrote: >> The question that is still open for me is whether it would be a colo.c >> or an active-mirror.c, i.e. if this would be tied specifically to COLO >> or if it could be kept generic enough that it could be used for other >> use cases as well. > > Understood (now). > What I think is really needed here is essentially an active mirror filter. >>> >>> Yes, an active synchronous mirror. It can be either a filter or a >>> device. Has anyone ever come up with a design for filters? Colo >>> doesn't need much more complexity than a "toy" blkverify filter. >> >> I think what we're doing now for quorum/blkverify/blkdebug is okay. >> >> The tricky and yet unsolved part is how to add/remove filter BDSes at >> runtime (dynamic reconfiguration), but IIUC that isn't needed here. > > Yes, it is. The "defer connection to NBD when replication is started" > is effectively "add the COLO filter" (with the NBD connection as a > children) when replication is started. > > Similarly "close the NBD device when replication is stopped" is > effectively "remove the COLO filter" (which brings the NBD connection > down with it). Hmm, I don't understand it clearly. Do you mean: 1. COLO filter is quorum's child 2. We can add/remove quorum's child at run-time. If I misunderstand something, please correct me. Thanks Wen Congyang > > Paolo > . >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 12:40, Kevin Wolf wrote: > The question that is still open for me is whether it would be a colo.c > or an active-mirror.c, i.e. if this would be tied specifically to COLO > or if it could be kept generic enough that it could be used for other > use cases as well. Understood (now). >>> What I think is really needed here is essentially an active mirror >>> filter. >> >> Yes, an active synchronous mirror. It can be either a filter or a >> device. Has anyone ever come up with a design for filters? Colo >> doesn't need much more complexity than a "toy" blkverify filter. > > I think what we're doing now for quorum/blkverify/blkdebug is okay. > > The tricky and yet unsolved part is how to add/remove filter BDSes at > runtime (dynamic reconfiguration), but IIUC that isn't needed here. Yes, it is. The "defer connection to NBD when replication is started" is effectively "add the COLO filter" (with the NBD connection as a children) when replication is started. Similarly "close the NBD device when replication is stopped" is effectively "remove the COLO filter" (which brings the NBD connection down with it). Paolo
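The add/remove-at-runtime operation Paolo describes (inserting the COLO filter with the NBD connection as its child when replication starts, and taking it out again when replication stops) can be pictured with a toy chain of single-child nodes. This is an illustrative sketch only, not the real BlockDriverState graph; `ToyBDS`, `insert_filter`, and `remove_filter` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* A toy node with one child, like a filter's bs->file. */
typedef struct ToyBDS {
    const char *name;
    struct ToyBDS *file;
} ToyBDS;

/* Splice @filter in between @parent and its current child, e.g.
 * adding the COLO filter with the NBD connection as its child. */
void insert_filter(ToyBDS *parent, ToyBDS *filter)
{
    filter->file = parent->file;
    parent->file = filter;
}

/* Drop @parent's immediate child and reattach the grandchild - the
 * filter goes away and takes only itself out of the chain. */
void remove_filter(ToyBDS *parent)
{
    if (parent->file) {
        parent->file = parent->file->file;
    }
}
```

The point of the sketch is that both operations are local pointer splices on the graph; the hard part being debated in the thread is doing this safely on a live block graph, not the splice itself.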
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 23.04.2015 um 12:33 hat Paolo Bonzini geschrieben: > On 23/04/2015 12:17, Kevin Wolf wrote: > > > Perhaps quorum is not a great match after all, and it's better to add a > > > new "colo" driver similar to quorum but simpler and only using the read > > > policy that you need for colo. The new driver would also know how to > > > use BDRV_O_NO_CONNECT. In any case the amount of work needed would not > > > be too big. > > > > I thought the same, but haven't looked at the details yet. But if I > > understand correctly, the plan is to take quorum and add options to turn > > off the functionality of using a quorum - that's a bit odd. > > Yes, indeed. Quorum was okay for experimenting, now it's better to "cp > quorum.c colo.c" and clean up the code instead of adding options to > quorum. There's not going to be more duplication between quorum.c and > colo.c than, say, between colo.c and blkverify.c. The question that is still open for me is whether it would be a colo.c or an active-mirror.c, i.e. if this would be tied specifically to COLO or if it could be kept generic enough that it could be used for other use cases as well. > > What I think is really needed here is essentially an active mirror > > filter. > > Yes, an active synchronous mirror. It can be either a filter or a > device. Has anyone ever come up with a design for filters? Colo > doesn't need much more complexity than a "toy" blkverify filter. I think what we're doing now for quorum/blkverify/blkdebug is okay. The tricky and yet unsolved part is how to add/remove filter BDSes at runtime (dynamic reconfiguration), but IIUC that isn't needed here. Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 12:17, Kevin Wolf wrote: > > Perhaps quorum is not a great match after all, and it's better to add a > > new "colo" driver similar to quorum but simpler and only using the read > > policy that you need for colo. The new driver would also know how to > > use BDRV_O_NO_CONNECT. In any case the amount of work needed would not > > be too big. > > I thought the same, but haven't looked at the details yet. But if I > understand correctly, the plan is to take quorum and add options to turn > off the functionality of using a quorum - that's a bit odd. Yes, indeed. Quorum was okay for experimenting, now it's better to "cp quorum.c colo.c" and clean up the code instead of adding options to quorum. There's not going to be more duplication between quorum.c and colo.c than, say, between colo.c and blkverify.c. > What I think is really needed here is essentially an active mirror > filter. Yes, an active synchronous mirror. It can be either a filter or a device. Has anyone ever come up with a design for filters? Colo doesn't need much more complexity than a "toy" blkverify filter. Paolo
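The "active synchronous mirror" idea above can be modelled in miniature: unlike the background mirror job, every write completes on both children before it completes at all, so the remote copy never lags. A toy in-memory sketch, with hypothetical names and one-byte sectors:

```c
#include <assert.h>

/* Toy model of an active synchronous mirror; illustrative only. */
enum { SECS = 4 };

typedef struct ToyMirror {
    char local[SECS];    /* the primary's disk */
    char target[SECS];   /* e.g. the NBD connection to the secondary */
} ToyMirror;

/* Returns only once both copies carry the data - the "synchronous"
 * part that distinguishes this from a background mirror job. */
void mirror_write(ToyMirror *m, int sector, char data)
{
    m->local[sector] = data;
    m->target[sector] = data;
}

/* Reads are served from the local child only, like quorum with a
 * read-first policy. */
char mirror_read(ToyMirror *m, int sector)
{
    return m->local[sector];
}
```

This is also why quorum was a workable stand-in: writing to all children and reading from one is exactly what COLO needs, minus the voting machinery.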
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 23.04.2015 um 12:05 hat Paolo Bonzini geschrieben: > > > On 23/04/2015 11:14, Wen Congyang wrote: > > The bs->file->driver should support backing file, and use backing reference > > already. > > > > What about the primary side? We should control when to connect to NBD > > server, > > not in nbd_open(). Why do you need to create the block device before the connection should be made? > My naive suggestion could be to add a BDRV_O_NO_CONNECT option to > bdrv_open and a separate bdrv_connect callback. Open would fail if > BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL. > > You would then need a way to have quorum pass BDRV_O_NO_CONNECT. Please don't add new flags. If we have to, we can introduce a new option (in the QDict), but first let's check if it's really necessary. > Perhaps quorum is not a great match after all, and it's better to add a > new "colo" driver similar to quorum but simpler and only using the read > policy that you need for colo. The new driver would also know how to > use BDRV_O_NO_CONNECT. In any case the amount of work needed would not > be too big. I thought the same, but haven't looked at the details yet. But if I understand correctly, the plan is to take quorum and add options to turn off the functionality of using a quorum - that's a bit odd. What I think is really needed here is essentially an active mirror filter. Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/23/2015 05:55 PM, Stefan Hajnoczi wrote: > On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote: >> On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote: >>> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote: On 21/04/2015 03:25, Wen Congyang wrote: >>> Please do not introduce "+colo" block drivers. This approach is >>> invasive and makes block replication specific to only a few block >>> drivers, e.g. NBD or qcow2. > NBD is used to connect to secondary qemu, so it must be used. But the > primary > qemu uses quorum, so the primary disk can be any format. > The secondary disk is nbd target, and it can also be any format. The cache > disk(active disk/hidden disk) is an empty disk, and it is created before > run > COLO. The cache disk format is qcow2 now. In theory, it can be ant format > which > supports backing file. But the driver should be updated to support colo > mode. > >> A cleaner approach is a QMP command or -drive options that work for any >> BlockDriverState. > > OK, I will add a new drive option to avoid use "+colo". Actually I liked the "foo+colo" names. These are just internal details of the implementations and the primary/secondary disks actually can be any format. Stefan, what was your worry with the +colo block drivers? >>> >>> Why does NBD need to know about COLO? It should be possible to use >>> iSCSI or other protocols too. >> >> Hmm, if you want to use iSCSI or other protocols, you should update the >> driver >> to implement block replication's control interface. >> >> Currently, we only support nbd now. > > I took a quick look at the NBD patches in this series, it looks like > they are a hacky way to make quorum dynamically reconfigurable. > > In other words, what you really need is a way to enable/disable a quorum > child or even add/remove children at run-time. > > NBD is not the right place to implement that. Add APIs to quorum so > COLO code can use them. 
> > Or maybe I'm misinterpreting the patches, I only took a quick look... Hmm, if we can enable/disable or add/remove a child at run-time, it is another choice. Thanks Wen Congyang > > Stefan >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 11:14, Wen Congyang wrote: > The bs->file->driver should support backing file, and use backing reference > already. > > What about the primary side? We should control when to connect to NBD server, > not in nbd_open(). My naive suggestion could be to add a BDRV_O_NO_CONNECT option to bdrv_open and a separate bdrv_connect callback. Open would fail if BDRV_O_NO_CONNECT is specified and drv->bdrv_connect is NULL. You would then need a way to have quorum pass BDRV_O_NO_CONNECT. Perhaps quorum is not a great match after all, and it's better to add a new "colo" driver similar to quorum but simpler and only using the read policy that you need for colo. The new driver would also know how to use BDRV_O_NO_CONNECT. In any case the amount of work needed would not be too big. Paolo
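Paolo's deferred-connect split can be sketched as a small toy model. The flag and callback names mirror the mail (`BDRV_O_NO_CONNECT`, `bdrv_connect`), but the structs and functions here are illustrative stand-ins, not QEMU's real block layer: open refuses the flag when the driver has no separate connect callback, and otherwise either connects eagerly (the old behaviour) or leaves the connection for later.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag mirroring the suggestion in the mail. */
#define BDRV_O_NO_CONNECT 0x1

/* Toy stand-in for a BlockDriver: bdrv_connect is the proposed
 * separate callback; NULL means the driver cannot defer. */
typedef struct ToyDriver {
    const char *name;
    int (*bdrv_connect)(void);
} ToyDriver;

/* Open fails if BDRV_O_NO_CONNECT is requested but the driver has no
 * bdrv_connect callback; otherwise the connection is either made
 * eagerly or left for a later explicit connect call. */
int toy_bdrv_open(const ToyDriver *drv, int flags)
{
    if (flags & BDRV_O_NO_CONNECT) {
        return drv->bdrv_connect ? 0 : -1;   /* cannot defer */
    }
    return drv->bdrv_connect ? drv->bdrv_connect() : 0;
}
```

A driver like NBD would supply the callback; a plain file driver would not, and would simply never be opened with the flag.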
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/23/2015 05:00 PM, Kevin Wolf wrote: > Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben: >> On 22/04/2015 11:31, Kevin Wolf wrote: Actually I liked the "foo+colo" names. These are just internal details of the implementations and the primary/secondary disks actually can be any format. Stefan, what was your worry with the +colo block drivers? >>> >>> I haven't read the patches yet, so I may be misunderstanding, but >>> wouldn't a separate filter driver be more appropriate than modifying >>> qcow2 with logic that has nothing to do with the image format? >> >> Possibly; on the other hand, why multiply the size of the test matrix >> with options that no one will use and that will bitrot? > > Because it may be the right design. > > If you're really worried about the test matrix, put a check in the > filter block driver that its bs->file is qcow2. Of course, such an > artificial restriction looks a bit ugly, but using a bad design just > in order to get the same restriction is even worse. The bs->file->driver should support backing file, and use backing reference already. What about the primary side? We should control when to connect to NBD server, not in nbd_open(). Thanks Wen Congyang > > Stefan originally wanted to put image streaming in the QED driver. I > think we'll agree today that it was right to reject that. It's simply > not functionality related to the format. Adding replication logic to > qcow2 looks similar to me in that respect. > > Kevin > . >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Wed, Apr 22, 2015 at 05:28:01PM +0800, Wen Congyang wrote: > On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote: > > On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote: > >> On 21/04/2015 03:25, Wen Congyang wrote: > > Please do not introduce "+colo" block drivers. This approach is > > invasive and makes block replication specific to only a few block > > drivers, e.g. NBD or qcow2. > >>> NBD is used to connect to secondary qemu, so it must be used. But the > >>> primary > >>> qemu uses quorum, so the primary disk can be any format. > >>> The secondary disk is nbd target, and it can also be any format. The cache > >>> disk(active disk/hidden disk) is an empty disk, and it is created before > >>> run > >>> COLO. The cache disk format is qcow2 now. In theory, it can be ant format > >>> which > >>> supports backing file. But the driver should be updated to support colo > >>> mode. > >>> > A cleaner approach is a QMP command or -drive options that work for any > BlockDriverState. > >>> > >>> OK, I will add a new drive option to avoid use "+colo". > >> > >> Actually I liked the "foo+colo" names. > >> > >> These are just internal details of the implementations and the > >> primary/secondary disks actually can be any format. > >> > >> Stefan, what was your worry with the +colo block drivers? > > > > Why does NBD need to know about COLO? It should be possible to use > > iSCSI or other protocols too. > > Hmm, if you want to use iSCSI or other protocols, you should update the driver > to implement block replication's control interface. > > Currently, we only support nbd now. I took a quick look at the NBD patches in this series, it looks like they are a hacky way to make quorum dynamically reconfigurable. In other words, what you really need is a way to enable/disable a quorum child or even add/remove children at run-time. NBD is not the right place to implement that. Add APIs to quorum so COLO code can use them. 
Or maybe I'm misinterpreting the patches, I only took a quick look... Stefan
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/23/2015 05:26 PM, Paolo Bonzini wrote: > > > On 23/04/2015 11:00, Kevin Wolf wrote: >> Because it may be the right design. >> >> If you're really worried about the test matrix, put a check in the >> filter block driver that its bs->file is qcow2. Of course, such an >> artificial restriction looks a bit ugly, but using a bad design just >> in order to get the same restriction is even worse. >> >> Stefan originally wanted to put image streaming in the QED driver. I >> think we'll agree today that it was right to reject that. It's simply >> not functionality related to the format. Adding replication logic to >> qcow2 looks similar to me in that respect. > > Yes, I can't deny it is similar. Still, there is a very important > difference: limiting colo's internal workings to qcow2 or NBD doesn't > limit what the user can do (while streaming limited the user to image > files in QED format). > > It may also depend on how the patches look like and how much the colo > code relies on other internal state. > > For NBD the answer is almost nothing, and you don't even need a filter > driver. You only need to separate sharply the "configure" and "open" > phases. So it may indeed be possible to generalize the handling of the > secondary to non-NBD. > > It may be the same for the primary; I admit I haven't even tried to read > the qcow2 patch, as I couldn't do a meaningful review. For qcow2, we need to read/write from NBD target directly after failover, because the cache image(the format is qcow2) may be put in ramfs to get better performance. The other thing is not changed. For qcow2, if we use a filter driver, the bs->file->drv should support backing file, and make_empty. So it can be the other format. Thanks Wen Congyang > > Paolo > . >
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 23.04.2015 um 11:26 hat Paolo Bonzini geschrieben: > > > On 23/04/2015 11:00, Kevin Wolf wrote: > > Because it may be the right design. > > > > If you're really worried about the test matrix, put a check in the > > filter block driver that its bs->file is qcow2. Of course, such an > > artificial restriction looks a bit ugly, but using a bad design just > > in order to get the same restriction is even worse. > > > > Stefan originally wanted to put image streaming in the QED driver. I > > think we'll agree today that it was right to reject that. It's simply > > not functionality related to the format. Adding replication logic to > > qcow2 looks similar to me in that respect. > > Yes, I can't deny it is similar. Still, there is a very important > difference: limiting colo's internal workings to qcow2 or NBD doesn't > limit what the user can do (while streaming limited the user to image > files in QED format). > > It may also depend on how the patches look like and how much the colo > code relies on other internal state. > > For NBD the answer is almost nothing, and you don't even need a filter > driver. You only need to separate sharply the "configure" and "open" > phases. So it may indeed be possible to generalize the handling of the > secondary to non-NBD. > > It may be the same for the primary; I admit I haven't even tried to read > the qcow2 patch, as I couldn't do a meaningful review. The qcow2 patch only modifies two existing lines. The rest it adds is a the qcow2+colo BlockDriver, which references some qcow2 functions directly and has a wrapper for others. On a quick scan, it didn't seem like it accesses any internal qcow2 variables or calls any private functions. In other words, it's the perfect example for a filter. Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 23/04/2015 11:00, Kevin Wolf wrote: > Because it may be the right design. > > If you're really worried about the test matrix, put a check in the > filter block driver that its bs->file is qcow2. Of course, such an > artificial restriction looks a bit ugly, but using a bad design just > in order to get the same restriction is even worse. > > Stefan originally wanted to put image streaming in the QED driver. I > think we'll agree today that it was right to reject that. It's simply > not functionality related to the format. Adding replication logic to > qcow2 looks similar to me in that respect. Yes, I can't deny it is similar. Still, there is a very important difference: limiting colo's internal workings to qcow2 or NBD doesn't limit what the user can do (while streaming limited the user to image files in QED format). It may also depend on how the patches look like and how much the colo code relies on other internal state. For NBD the answer is almost nothing, and you don't even need a filter driver. You only need to separate sharply the "configure" and "open" phases. So it may indeed be possible to generalize the handling of the secondary to non-NBD. It may be the same for the primary; I admit I haven't even tried to read the qcow2 patch, as I couldn't do a meaningful review. Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 22.04.2015 um 12:12 hat Paolo Bonzini geschrieben: > On 22/04/2015 11:31, Kevin Wolf wrote: > >> Actually I liked the "foo+colo" names. > >> > >> These are just internal details of the implementations and the > >> primary/secondary disks actually can be any format. > >> > >> Stefan, what was your worry with the +colo block drivers? > > > > I haven't read the patches yet, so I may be misunderstanding, but > > wouldn't a separate filter driver be more appropriate than modifying > > qcow2 with logic that has nothing to do with the image format? > > Possibly; on the other hand, why multiply the size of the test matrix > with options that no one will use and that will bitrot? Because it may be the right design. If you're really worried about the test matrix, put a check in the filter block driver that its bs->file is qcow2. Of course, such an artificial restriction looks a bit ugly, but using a bad design just in order to get the same restriction is even worse. Stefan originally wanted to put image streaming in the QED driver. I think we'll agree today that it was right to reject that. It's simply not functionality related to the format. Adding replication logic to qcow2 looks similar to me in that respect. Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
* Wen Congyang (we...@cn.fujitsu.com) wrote: > Signed-off-by: Wen Congyang > Signed-off-by: Paolo Bonzini > Signed-off-by: Yang Hongyang > Signed-off-by: zhanghailiang > Signed-off-by: Gonglei > ---
> docs/block-replication.txt | 153 +
> 1 file changed, 153 insertions(+)
> create mode 100644 docs/block-replication.txt
>
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is
> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +1) Primary write requests will be copied and forwarded to Secondary
> +   QEMU.
> +2) Before Primary write requests are written to Secondary disk, the
> +   original sector content will be read from Secondary disk and
> +   buffered in the Disk buffer, but it will not overwrite the existing
> +   sector content (it could be from either "Secondary Write Requests" or
> +   previous COW of "Primary Write Requests") in the Disk buffer.
> +3) Primary write requests will be written to Secondary disk.
> +4) Secondary write requests will be buffered in the Disk buffer and it
> +   will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +        virtio-blk       ||
> +            ^            ||                          .----------
> +            |            ||                          | Secondary
> +       1 Quorum          ||                          '----------
> +        /      \         ||
> +       /        \        ||
> +  Primary     2 NBD  ------->  2 NBD
> +    disk      client     ||    server                               virtio-blk
> +                         ||      ^                                       ^
> +  .----------            ||      |                                       |
> +  | Primary              ||  Secondary disk <------- hidden-disk 4 <------- active-disk 3
> +  '----------            ||      |        backing           ^       backing
> +                         ||      |                          |
> +                         ||      '--------------------------'
> +                         ||         drive-backup sync=none
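The COW rules in workflow steps (2) and (4) of the document above can be modelled in miniature. This is an in-memory toy with one-byte sectors, illustrative only (the names `ToyReplication`, `primary_write`, etc. are invented for the sketch, not part of the patches): a forwarded primary write saves the original sector into the Disk Buffer only if nothing is buffered yet, then writes through; a secondary write lands in the buffer and always overwrites; the secondary's view reads buffer-first; a checkpoint drops the buffer.

```c
#include <assert.h>

/* Toy model of the Disk Buffer semantics in steps (1)-(4). */
enum { SECTORS = 4, EMPTY = -1 };

typedef struct ToyReplication {
    char secondary_disk[SECTORS];
    int  buffer[SECTORS];        /* EMPTY means "nothing buffered" */
} ToyReplication;

/* Steps (2)+(3): a forwarded primary write first saves the original
 * sector content into the buffer - unless something is buffered
 * already - then writes through to the secondary disk. */
void primary_write(ToyReplication *r, int sector, char data)
{
    if (r->buffer[sector] == EMPTY) {
        r->buffer[sector] = r->secondary_disk[sector];  /* COW */
    }
    r->secondary_disk[sector] = data;   /* speculative write through */
}

/* Step (4): a secondary write only lands in the buffer, and always
 * overwrites whatever is buffered for that sector. */
void secondary_write(ToyReplication *r, int sector, char data)
{
    r->buffer[sector] = data;
}

/* The Secondary VM reads the buffer first, then the disk. */
char secondary_read(ToyReplication *r, int sector)
{
    return r->buffer[sector] == EMPTY ? r->secondary_disk[sector]
                                      : (char)r->buffer[sector];
}

/* At a checkpoint the buffered contents are dropped, so the
 * Secondary's view converges to the Primary's writes. */
void checkpoint(ToyReplication *r)
{
    for (int i = 0; i < SECTORS; i++) {
        r->buffer[i] = EMPTY;
    }
}
```

The invariant the sketch demonstrates is that primary writes never disturb what the Secondary VM observes between checkpoints, which is exactly what makes the buffer droppable at checkpoint time.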
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 22/04/2015 11:31, Kevin Wolf wrote: >> Actually I liked the "foo+colo" names. >> >> These are just internal details of the implementations and the >> primary/secondary disks actually can be any format. >> >> Stefan, what was your worry with the +colo block drivers? > > I haven't read the patches yet, so I may be misunderstanding, but > wouldn't a separate filter driver be more appropriate than modifying > qcow2 with logic that has nothing to do with the image format? Possibly; on the other hand, why multiply the size of the test matrix with options that no one will use and that will bitrot? Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/22/2015 05:29 PM, Stefan Hajnoczi wrote: > On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote: >> On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote: >>> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote: >>> One general question about the design: the Secondary host needs 3x >>> storage space since it has the Secondary Disk, hidden-disk, and >>> active-disk. Each image requires a certain amount of space depending on >>> writes or COW operations. Is 3x the upper bound or is there a way to >>> reduce the bound? >> >> active disk and hidden disk are temp files. They will be made empty in >> bdrv_do_checkpoint(). Their format is qcow2 now, so they don't need too >> much space if we do checkpoint periodically. > > A question related to checkpoints: both Primary and Secondary are active > (running) in COLO. The Secondary will be slower since it performs extra > work; disk I/O on the Secondary has a COW overhead. > > Does this force the Primary to wait for checkpoint commit so that the > Secondary can catch up? > > I'm a little confused about that since the point of COLO is to avoid the > overheads of microcheckpointing, but there still seems to be a > checkpointing bottleneck for disk I/O-intensive applications. > >>> >>> The bound is important since large amounts of data become a bottleneck > >> for writeout/commit operations. They could cause downtime if the guest > >> is blocked until the entire Disk Buffer has been written to the > >> Secondary Disk during failover, for example. >> >> OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if >> I run filebench in the guest. Is there any way to speed it up? > > Is it necessary to commit the active disk and hidden disk to the > Secondary Disk on failover? Maybe the VM could continue executing > immediately and run a block-commit job. The active disk and hidden disk > files can be dropped once block-commit finishes. > We need to stop the vm before doing checkpoint. 
So if vm_stop() takes too much time, it will affect the performance. On failover, we can commit the data while the vm is running. But the active disk and hidden disk may be put in ramfs, and the guest writes faster than block-commit... Thanks Wen Congyang
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
Am 21.04.2015 um 17:28 hat Paolo Bonzini geschrieben: > > > On 21/04/2015 03:25, Wen Congyang wrote: > >> > Please do not introduce "+colo" block drivers. This approach is > >> > invasive and makes block replication specific to only a few block > >> > drivers, e.g. NBD or qcow2. > > NBD is used to connect to secondary qemu, so it must be used. But the > > primary > > qemu uses quorum, so the primary disk can be any format. > > The secondary disk is nbd target, and it can also be any format. The cache > > disk(active disk/hidden disk) is an empty disk, and it is created before run > > COLO. The cache disk format is qcow2 now. In theory, it can be ant format > > which > > supports backing file. But the driver should be updated to support colo > > mode. > > > > > A cleaner approach is a QMP command or -drive options that work for any > > > BlockDriverState. > > > > OK, I will add a new drive option to avoid use "+colo". > > Actually I liked the "foo+colo" names. > > These are just internal details of the implementations and the > primary/secondary disks actually can be any format. > > Stefan, what was your worry with the +colo block drivers? I haven't read the patches yet, so I may be misunderstanding, but wouldn't a separate filter driver be more appropriate than modifying qcow2 with logic that has nothing to do with the image format? Kevin
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Tue, Apr 21, 2015 at 09:25:59AM +0800, Wen Congyang wrote: > On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote: > > On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote: > > One general question about the design: the Secondary host needs 3x > > storage space since it has the Secondary Disk, hidden-disk, and > > active-disk. Each image requires a certain amount of space depending on > > writes or COW operations. Is 3x the upper bound or is there a way to > > reduce the bound? > > active disk and hidden disk are temp files. They will be made empty in > bdrv_do_checkpoint(). Their format is qcow2 now, so they don't need too > much space if we do checkpoint periodically. A question related to checkpoints: both Primary and Secondary are active (running) in COLO. The Secondary will be slower since it performs extra work; disk I/O on the Secondary has a COW overhead. Does this force the Primary to wait for checkpoint commit so that the Secondary can catch up? I'm a little confused about that since the point of COLO is to avoid the overheads of microcheckpointing, but there still seems to be a checkpointing bottleneck for disk I/O-intensive applications. > > > > The bound is important since large amounts of data become a bottleneck > > for writeout/commit operations. They could cause downtime if the guest > > is blocked until the entire Disk Buffer has been written to the > > Secondary Disk during failover, for example. > > OK, I will test it. In my test, vm_stop() will take about 2-3 seconds if > I run filebench in the guest. Is there any way to speed it up? Is it necessary to commit the active disk and hidden disk to the Secondary Disk on failover? Maybe the VM could continue executing immediately and run a block-commit job. The active disk and hidden disk files can be dropped once block-commit finishes.
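Stefan's suggestion above - keep the VM running and commit the overlay in the background rather than blocking in vm_stop() - can be sketched with a toy overlay/base pair. One byte per sector, invented names, nothing like the real block-commit job machinery; the property being illustrated is that a commit pass must not change what the guest reads:

```c
#include <assert.h>

/* Toy model of committing an active overlay down to its base image. */
enum { N = 4, UNALLOC = -1 };

typedef struct ToyOverlay {
    char base[N];      /* the Secondary Disk */
    int  active[N];    /* active/hidden overlay; UNALLOC = not written */
} ToyOverlay;

/* Guest reads go through the overlay, falling back to the base. */
char overlay_read(ToyOverlay *o, int s)
{
    return o->active[s] == UNALLOC ? o->base[s] : (char)o->active[s];
}

/* Guest writes land in the overlay; they may race a running commit. */
void overlay_write(ToyOverlay *o, int s, char d)
{
    o->active[s] = d;
}

/* Commit: push every allocated overlay sector into the base and drop
 * it.  Guest-visible contents are identical before and after. */
void commit_overlay(ToyOverlay *o)
{
    for (int s = 0; s < N; s++) {
        if (o->active[s] != UNALLOC) {
            o->base[s] = (char)o->active[s];
            o->active[s] = UNALLOC;
        }
    }
}
```

Wen's later objection also falls out of this picture: if the guest keeps calling `overlay_write` faster than `commit_overlay` can drain the overlay, the overlay (in ramfs) never empties.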
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/22/2015 05:18 PM, Stefan Hajnoczi wrote:
> On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
>> On 21/04/2015 03:25, Wen Congyang wrote:
>>>> Please do not introduce "+colo" block drivers. This approach is
>>>> invasive and makes block replication specific to only a few block
>>>> drivers, e.g. NBD or qcow2.
>>>
>>> NBD is used to connect to the secondary QEMU, so it must be used. But
>>> the primary QEMU uses quorum, so the primary disk can be any format.
>>> The secondary disk is the NBD target, and it can also be any format.
>>> The cache disk (active disk/hidden disk) is an empty disk, and it is
>>> created before COLO runs. The cache disk format is qcow2 now. In
>>> theory, it can be any format which supports a backing file. But the
>>> driver should be updated to support COLO mode.
>>>
>>>> A cleaner approach is a QMP command or -drive options that work for
>>>> any BlockDriverState.
>>>
>>> OK, I will add a new drive option to avoid using "+colo".
>>
>> Actually I liked the "foo+colo" names.
>>
>> These are just internal details of the implementations and the
>> primary/secondary disks actually can be any format.
>>
>> Stefan, what was your worry with the +colo block drivers?
>
> Why does NBD need to know about COLO? It should be possible to use
> iSCSI or other protocols too.

Hmm, if you want to use iSCSI or other protocols, you should update the
driver to implement block replication's control interface. Currently we
only support NBD.

Thanks
Wen Congyang

> Stefan
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Tue, Apr 21, 2015 at 05:28:01PM +0200, Paolo Bonzini wrote:
> On 21/04/2015 03:25, Wen Congyang wrote:
> >> Please do not introduce "+colo" block drivers. This approach is
> >> invasive and makes block replication specific to only a few block
> >> drivers, e.g. NBD or qcow2.
> >
> > NBD is used to connect to the secondary QEMU, so it must be used. But
> > the primary QEMU uses quorum, so the primary disk can be any format.
> > The secondary disk is the NBD target, and it can also be any format.
> > The cache disk (active disk/hidden disk) is an empty disk, and it is
> > created before COLO runs. The cache disk format is qcow2 now. In
> > theory, it can be any format which supports a backing file. But the
> > driver should be updated to support COLO mode.
> >
> > > A cleaner approach is a QMP command or -drive options that work for
> > > any BlockDriverState.
> >
> > OK, I will add a new drive option to avoid using "+colo".
>
> Actually I liked the "foo+colo" names.
>
> These are just internal details of the implementations and the
> primary/secondary disks actually can be any format.
>
> Stefan, what was your worry with the +colo block drivers?

Why does NBD need to know about COLO? It should be possible to use iSCSI
or other protocols too.

Stefan
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 21/04/2015 03:25, Wen Congyang wrote:
>> Please do not introduce "+colo" block drivers. This approach is
>> invasive and makes block replication specific to only a few block
>> drivers, e.g. NBD or qcow2.
>
> NBD is used to connect to the secondary QEMU, so it must be used. But
> the primary QEMU uses quorum, so the primary disk can be any format.
> The secondary disk is the NBD target, and it can also be any format.
> The cache disk (active disk/hidden disk) is an empty disk, and it is
> created before COLO runs. The cache disk format is qcow2 now. In
> theory, it can be any format which supports a backing file. But the
> driver should be updated to support COLO mode.
>
> > A cleaner approach is a QMP command or -drive options that work for
> > any BlockDriverState.
>
> OK, I will add a new drive option to avoid using "+colo".

Actually I liked the "foo+colo" names.

These are just internal details of the implementations and the
primary/secondary disks actually can be any format.

Stefan, what was your worry with the +colo block drivers?

Paolo
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On 04/20/2015 11:30 PM, Stefan Hajnoczi wrote:
> On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
>> Signed-off-by: Wen Congyang
>> Signed-off-by: Paolo Bonzini
>> Signed-off-by: Yang Hongyang
>> Signed-off-by: zhanghailiang
>> Signed-off-by: Gonglei
>> ---
>>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 153 insertions(+)
>>  create mode 100644 docs/block-replication.txt
>>
>> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
>> new file mode 100644
>> index 000..4426ffc
>> --- /dev/null
>> +++ b/docs/block-replication.txt
>> @@ -0,0 +1,153 @@
>> +Block replication
>> +
>> +Copyright Fujitsu, Corp. 2015
>> +Copyright (c) 2015 Intel Corporation
>> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> +
>> +This work is licensed under the terms of the GNU GPL, version 2 or
>> +later. See the COPYING file in the top-level directory.
>> +
>> +Block replication is used for continuous checkpoints. It is designed
>> +for COLO (COarse-grain LOck-stepping) where the Secondary VM is
>> +running. It can also be applied to the FT/HA (Fault-tolerance/High
>> +Assurance) scenario, where the Secondary VM is not running.
>> +
>> +This document gives an overview of block replication's design.
>> +
>> +== Background ==
>> +High availability solutions such as micro checkpoint and COLO will do
>> +consecutive checkpoints. The VM state of the Primary VM and Secondary
>> +VM is identical right after a VM checkpoint, but becomes different as
>> +the VM executes till the next checkpoint. To support disk contents
>> +checkpoint, the modified disk contents in the Secondary VM must be
>> +buffered, and are only dropped at next checkpoint time. To reduce the
>> +network transportation effort at the time of checkpoint, the disk
>> +modification operations of the Primary disk are asynchronously
>> +forwarded to the Secondary node.
>> +
>> +== Workflow ==
>> +The following is the image of block replication workflow:
>> +
>> +        +----------------------+            +------------------------+
>> +        |Primary Write Requests|            |Secondary Write Requests|
>> +        +----------------------+            +------------------------+
>> +                  |                                       |
>> +                  |                                      (4)
>> +                  |                                       V
>> +                  |                              /-------------\
>> +                  |      Copy and Forward        |             |
>> +                  |---------(1)----------+       | Disk Buffer |
>> +                  |                      |       |             |
>> +                  |                     (3)      \-------------/
>> +                  |                 speculative      ^
>> +                  |                write through    (2)
>> +                  |                      |           |
>> +                  V                      V           |
>> +           +--------------+           +----------------+
>> +           | Primary Disk |           | Secondary Disk |
>> +           +--------------+           +----------------+
>> +
>> +1) Primary write requests will be copied and forwarded to Secondary
>> +   QEMU.
>> +2) Before Primary write requests are written to the Secondary disk,
>> +   the original sector content will be read from the Secondary disk
>> +   and buffered in the Disk buffer, but it will not overwrite the
>> +   existing sector content (it could be from either "Secondary Write
>> +   Requests" or a previous COW of "Primary Write Requests") in the
>> +   Disk buffer.
>> +3) Primary write requests will be written to the Secondary disk.
>> +4) Secondary write requests will be buffered in the Disk buffer and
>> +   will overwrite the existing sector content in the buffer.
>> +
>> +== Architecture ==
>> +We are going to implement COLO block replication from many basic
>> +blocks that are already in QEMU.
>> +
>> +         virtio-blk       ||
>> +             ^            ||                            .----------
>> +             |            ||                            | Secondary
>> +        1 Quorum          ||                            '----------
>> +         /      \         ||
>> +        /        \        ||
>> +   Primary      2 NBD  ------->  2 NBD
>> +     disk       client    ||     server            virtio-blk
>> +                          ||        ^                   ^
>> +--------.                 ||        |                   |
>> +Primary |                 ||  Secondary disk <- hidden-disk 4 <- active-disk 3
>> +--------'                 ||        |    backing    ^    backing
>> +                          ||        |               |
>> +                          ||        '---------------'
>> +                          ||      drive-backup sync=none
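The Disk Buffer rules in steps (2) and (4) of the quoted workflow can be
sketched as follows (illustrative Python, not QEMU code; all names are
invented): a Primary write COWs the original Secondary-disk content into
the buffer only if no entry exists yet, while a Secondary write always
overwrites the buffer entry.

```python
# Sketch of the Disk Buffer semantics from the workflow above
# (illustrative Python only, not QEMU code).

disk_buffer = {}      # sector -> buffered content (dropped at checkpoint)
secondary_disk = {}   # sector -> current on-disk content

def primary_write(sector, data):
    # (2) COW: preserve the original secondary content in the buffer,
    # but never overwrite an entry that is already there.
    if sector not in disk_buffer:
        disk_buffer[sector] = secondary_disk.get(sector, b"\0")
    # (3) The primary write then goes through to the secondary disk.
    secondary_disk[sector] = data

def secondary_write(sector, data):
    # (4) Secondary writes are buffered and DO overwrite buffer entries.
    disk_buffer[sector] = data

def checkpoint():
    # Buffered modifications are dropped at the next checkpoint.
    disk_buffer.clear()

secondary_disk[0] = b"S0"
secondary_write(0, b"guest")   # buffer now holds the Secondary's write
primary_write(0, b"P0")        # buffer entry is kept, disk gets b"P0"
assert disk_buffer[0] == b"guest"
assert secondary_disk[0] == b"P0"
```

The asserts show the rule in step (2): the existing buffer entry (here
from a Secondary write) survives a later Primary write.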
Re: [Qemu-block] [PATCH COLO v3 01/14] docs: block replication's description
On Fri, Apr 03, 2015 at 06:01:07PM +0800, Wen Congyang wrote:
> Signed-off-by: Wen Congyang
> Signed-off-by: Paolo Bonzini
> Signed-off-by: Yang Hongyang
> Signed-off-by: zhanghailiang
> Signed-off-by: Gonglei
> ---
>  docs/block-replication.txt | 153 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 153 insertions(+)
>  create mode 100644 docs/block-replication.txt
>
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 000..4426ffc
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,153 @@
> +Block replication
> +
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> +
> +This work is licensed under the terms of the GNU GPL, version 2 or
> +later. See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COarse-grain LOck-stepping) where the Secondary VM is
> +running. It can also be applied to the FT/HA (Fault-tolerance/High
> +Assurance) scenario, where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of the Primary VM and Secondary
> +VM is identical right after a VM checkpoint, but becomes different as
> +the VM executes till the next checkpoint. To support disk contents
> +checkpoint, the modified disk contents in the Secondary VM must be
> +buffered, and are only dropped at next checkpoint time. To reduce the
> +network transportation effort at the time of checkpoint, the disk
> +modification operations of the Primary disk are asynchronously
> +forwarded to the Secondary node.
> +
> +== Workflow ==
> +The following is the image of block replication workflow:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +
> +1) Primary write requests will be copied and forwarded to Secondary
> +   QEMU.
> +2) Before Primary write requests are written to the Secondary disk,
> +   the original sector content will be read from the Secondary disk
> +   and buffered in the Disk buffer, but it will not overwrite the
> +   existing sector content (it could be from either "Secondary Write
> +   Requests" or a previous COW of "Primary Write Requests") in the
> +   Disk buffer.
> +3) Primary write requests will be written to the Secondary disk.
> +4) Secondary write requests will be buffered in the Disk buffer and
> +   will overwrite the existing sector content in the buffer.
> +
> +== Architecture ==
> +We are going to implement COLO block replication from many basic
> +blocks that are already in QEMU.
> +
> +         virtio-blk       ||
> +             ^            ||                            .----------
> +             |            ||                            | Secondary
> +        1 Quorum          ||                            '----------
> +         /      \         ||
> +        /        \        ||
> +   Primary      2 NBD  ------->  2 NBD
> +     disk       client    ||     server            virtio-blk
> +                          ||        ^                   ^
> +--------.                 ||        |                   |
> +Primary |                 ||  Secondary disk <- hidden-disk 4 <- active-disk 3
> +--------'                 ||        |    backing    ^    backing
> +                          ||        |               |
> +                          ||        '---------------'
> +                          ||      dri
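The primary side of the architecture above (child 1: quorum over the
local primary disk and an NBD client) can be caricatured like this
(illustrative Python, not QEMU code; all names are invented): guest
writes fan out to every quorum child, so they reach both the local disk
and, via NBD, the Secondary, while reads are served locally.

```python
# Caricature of the primary's quorum node in the diagram above: writes
# replicate to the local primary disk and to the NBD client (which
# stands in for the network hop to the secondary's NBD server); reads
# come from the local disk. Illustrative Python only, not QEMU code.

class QuorumChild:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, sector, payload):
        self.data[sector] = payload

primary_disk = QuorumChild("primary-disk")
nbd_client = QuorumChild("nbd-client")

def quorum_write(sector, payload):
    # Replicate the guest write to all children.
    for child in (primary_disk, nbd_client):
        child.write(sector, payload)

def quorum_read(sector):
    # Reads are satisfied from the local primary disk.
    return primary_disk.data[sector]

quorum_write(7, b"hello")
assert quorum_read(7) == b"hello"
assert nbd_client.data[7] == b"hello"   # forwarded toward the Secondary
```

This is why the primary disk can be any format: quorum only sees
BlockDriverState children, and the replication hop is just another
child.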