from:"Changlong Xie"

[Qemu-block] [PATCH] MAINTAINERS: update Wen's email address

2017-04-17 Thread Changlong Xie

So he can get CC'ed on future patches and bugs for this feature

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index c60235e..5638992 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1817,7 +1817,7 @@ S: Supported
 F: tests/image-fuzzer/
 
 Replication
-M: Wen Congyang <we...@cn.fujitsu.com>
+M: Wen Congyang <wencongya...@huawei.com>
 M: Changlong Xie <xiecl.f...@cn.fujitsu.com>
 S: Supported
 F: replication*
-- 
1.9.3

[Qemu-block] [PATCH V1] replication: clarify permissions

2017-03-14 Thread Changlong Xie

Even if hidden_disk, secondary_disk are backing files, they all need
write permissions in replication scenario. Otherwise we will encouter
below exceptions on secondary side during adding nbd server:

{'execute': 'nbd-server-add', 'arguments': {'device': 'colo-disk', 'writable': 
true } }
{"error": {"class": "GenericError", "desc": "Conflicts with use by 
hidden-qcow2-driver as 'backing', which does not allow 'write' on 
sec-qcow2-driver-for-nbd"}}

CC: Zhang Hailiang <zhang.zhanghaili...@huawei.com>
CC: Zhang Chen <zhangchen.f...@cn.fujitsu.com>
CC: Wen Congyang <wencongya...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/replication.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 22f170f..bf3c395 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -155,6 +155,18 @@ static void replication_close(BlockDriverState *bs)
 replication_remove(s->rs);
 }
 
+static void replication_child_perm(BlockDriverState *bs, BdrvChild *c,
+   const BdrvChildRole *role,
+   uint64_t perm, uint64_t shared,
+   uint64_t *nperm, uint64_t *nshared)
+{
+*nperm = *nshared = BLK_PERM_CONSISTENT_READ \
+| BLK_PERM_WRITE \
+| BLK_PERM_WRITE_UNCHANGED;
+
+return;
+}
+
 static int64_t replication_getlength(BlockDriverState *bs)
 {
 return bdrv_getlength(bs->file->bs);
@@ -660,7 +672,7 @@ BlockDriver bdrv_replication = {
 
 .bdrv_open  = replication_open,
 .bdrv_close = replication_close,
-.bdrv_child_perm= bdrv_filter_default_perms,
+.bdrv_child_perm= replication_child_perm,
 
 .bdrv_getlength = replication_getlength,
 .bdrv_co_readv  = replication_co_readv,
-- 
1.9.3

Re: [Qemu-block] [PATCH RFC v2 4/6] replication: fix code logic with the new shared_disk option

2016-12-20 Thread Changlong Xie


On 12/05/2016 04:35 PM, zhanghailiang wrote:

Some code logic only be needed in non-shared disk, here
we adjust these codes to prepare for shared disk scenario.

Signed-off-by: zhanghailiang 
---
  block/replication.c | 47 ---
  1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 35e9ab3..6574cc2 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -531,21 +531,28 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  aio_context_release(aio_context);
  return;
  }
-bdrv_op_block_all(top_bs, s->blocker);
-bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);

-job = backup_job_create(NULL, s->secondary_disk->bs, 
s->hidden_disk->bs,
-0, MIRROR_SYNC_MODE_NONE, NULL, false,
+/*
+ * Only in the case of non-shared disk,
+ * the backup job is in the secondary side
+ */
+if (!s->is_shared_disk) {
+bdrv_op_block_all(top_bs, s->blocker);
+bdrv_op_unblock(top_bs, BLOCK_OP_TYPE_DATAPLANE, s->blocker);
+job = backup_job_create(NULL, s->secondary_disk->bs,
+s->hidden_disk->bs, 0,
+MIRROR_SYNC_MODE_NONE, NULL, false,
  BLOCKDEV_ON_ERROR_REPORT,
  BLOCKDEV_ON_ERROR_REPORT, BLOCK_JOB_INTERNAL,
  backup_job_completed, bs, NULL, _err);


Coding style here.


-if (local_err) {
-error_propagate(errp, local_err);
-backup_job_cleanup(bs);
-aio_context_release(aio_context);
-return;
+if (local_err) {
+error_propagate(errp, local_err);
+backup_job_cleanup(bs);
+aio_context_release(aio_context);
+return;
+}
+block_job_start(job);
  }
-block_job_start(job);

  secondary_do_checkpoint(s, errp);
  break;
@@ -575,14 +582,16 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
  case REPLICATION_MODE_PRIMARY:
  break;
  case REPLICATION_MODE_SECONDARY:
-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-break;
-}
-backup_do_checkpoint(s->secondary_disk->bs->job, _err);
-if (local_err) {
-error_propagate(errp, local_err);
-break;
+if (!s->is_shared_disk) {
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
  }
  secondary_do_checkpoint(s, errp);
  break;
@@ -663,7 +672,7 @@ static void replication_stop(ReplicationState *rs, bool 
failover, Error **errp)
   * before the BDS is closed, because we will access hidden
   * disk, secondary disk in backup_job_completed().
   */
-if (s->secondary_disk->bs->job) {
+if (!s->is_shared_disk && s->secondary_disk->bs->job) {
  block_job_cancel_sync(s->secondary_disk->bs->job);
  }

Re: [Qemu-block] [PATCH RFC v2 3/6] replication: Split out backup_do_checkpoint() from secondary_do_checkpoint()

2016-12-20 Thread Changlong Xie


On 12/05/2016 04:35 PM, zhanghailiang wrote:

The helper backup_do_checkpoint() will be used for primary related
codes. Here we split it out from secondary_do_checkpoint().

Besides, it is unnecessary to call backup_do_checkpoint() in
replication starting and normally stop replication path.
We only need call it while do real checkpointing.

Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
---
  block/replication.c | 36 +++-
  1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index e87ae87..35e9ab3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -332,20 +332,8 @@ static bool 
replication_recurse_is_first_non_filter(BlockDriverState *bs,

  static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
  {
-Error *local_err = NULL;
  int ret;

-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-return;
-}
-
-backup_do_checkpoint(s->secondary_disk->bs->job, _err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-
  ret = s->active_disk->bs->drv->bdrv_make_empty(s->active_disk->bs);
  if (ret < 0) {
  error_setg(errp, "Cannot make active disk empty");
@@ -558,6 +546,8 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  return;
  }
  block_job_start(job);
+
+secondary_do_checkpoint(s, errp);
  break;
  default:
  aio_context_release(aio_context);
@@ -566,10 +556,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,

  s->replication_state = BLOCK_REPLICATION_RUNNING;

-if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
-}
-
  s->error = 0;
  aio_context_release(aio_context);
  }
@@ -579,13 +565,29 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
  BlockDriverState *bs = rs->opaque;
  BDRVReplicationState *s;
  AioContext *aio_context;
+Error *local_err = NULL;

  aio_context = bdrv_get_aio_context(bs);
  aio_context_acquire(aio_context);
  s = bs->opaque;

-if (s->mode == REPLICATION_MODE_SECONDARY) {
+switch (s->mode) {
+case REPLICATION_MODE_PRIMARY:
+break;
+case REPLICATION_MODE_SECONDARY:
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
  secondary_do_checkpoint(s, errp);
+break;
+    default:
+    abort();
  }
  aio_context_release(aio_context);
  }



Looks good to me

Reviewed-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>

Re: [Qemu-block] [PATCH RFC v2 2/6] replication: add shared-disk and shared-disk-id options

2016-12-20 Thread Changlong Xie


On 12/05/2016 04:35 PM, zhanghailiang wrote:

diff --git a/qapi/block-core.json b/qapi/block-core.json
index c29bef7..52d7e0d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2232,12 +2232,19 @@
  #  node who owns the replication node chain. Must not be given in
  #  primary mode.
  #
+# @shared-disk-id: #optional The id of shared disk while in replication mode.
+#
+# @shared-disk: #optional To indicate whether or not a disk is shared by
+#   primary VM and secondary VM.
+#


I think we need more detailed description here.

For @shared-disk, we can only both enable or disable it on both side.
For @shared-disk-id, it must/only be given when @shared-disk enable on 
Primary side.


More, you also need to perfect the replication_open() logic.


  # Since: 2.8
  ##
  { 'struct': 'BlockdevOptionsReplication',
'base': 'BlockdevOptionsGenericFormat',
'data': { 'mode': 'ReplicationMode',
-'*top-id': 'str' } }
+'*top-id': 'str',
+'*shared-disk-id': 'str',
+'*shared-disk': 'bool' } }

  ##
  # @NFSTransport
--

Re: [Qemu-block] [PATCH RFC v2 1/6] docs/block-replication: Add description for shared-disk case

2016-12-20 Thread Changlong Xie

   |   ||  |
+  V |   |
+   +---+|
+   |   shared disk | <--+
+   +---+
+
+
+1) Primary writes will read original data and forward it to Secondary
+   QEMU.
+2) The hidden-disk buffers the original content that is modified by the
+   primary VM. It should also be an empty disk, and the driver supports
+   bdrv_make_empty() and backing file.
+3) Primary write requests will be written to Shared disk.
+4) Secondary write requests will be buffered in the active disk and it
+   will overwrite the existing sector content in the buffer.
+
  == Failure Handling ==
  There are 7 internal errors when block replication is running:
  1. I/O error on primary disk
@@ -145,7 +213,7 @@ d. replication_stop_all()
 things except failover. The caller must hold the I/O mutex lock if it is
 in migration/checkpoint thread.

-== Usage ==
+== Non-shared disk usage ==
  Primary:
-drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
   children.0.file.filename=1.raw,\
@@ -234,6 +302,69 @@ Secondary:
The primary host is down, so we should do the following thing:
{ 'execute': 'nbd-server-stop' }

+== Shared disk usage ==
+Primary:
+ -drive if=virtio,id=primary_disk0,file.filename=1.raw,driver=raw
+
+Issue qmp command:
+  { 'execute': 'blockdev-add',
+'arguments': {
+'driver': 'replication',
+'node-name': 'rep',
+'mode': 'primary',
+'shared-disk-id': 'primary_disk0',
+'shared-disk': true,
+'file': {
+'driver': 'nbd',
+'export': 'hidden_disk0',
+'server': {
+'type': 'inet',
+'data': {
+'host': 'xxx.xxx.xxx.xxx',
+'port': 'yyy'
+}
+}
+}
+ }
+  }
+
+Secondary:
+ -drive 
if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,\
+backing.driver=raw,backing.file.filename=1.raw \
+ -drive if=virtio,id=active-disk0,driver=replication,mode=secondary,\
+file.driver=qcow2,top-id=active-disk0,\
+file.file.filename=/mnt/ramfs/active_disk.img,\
+file.backing=hidden_disk0,shared-disk=on
+
+Issue qmp command:
+1. { 'execute': 'nbd-server-start',
+ 'arguments': {
+'addr': {
+'type': 'inet',
+'data': {
+'host': '0',
+'port': 'yyy'
+}
+}
+ }
+   }
+2. { 'execute': 'nbd-server-add',
+ 'arguments': {
+'device': 'hidden_disk0',
+'writable': true
+}
+  }
+
+After Failover:
+Primary:
+  { 'execute': 'x-blockdev-del',
+'arguments': {
+'node-name': 'rep'
+}
+  }
+
+Secondary:
+  {'execute': 'nbd-server-stop' }
+
  TODO:
  1. Continuous block replication
-2. Shared disk



Looks good to me

Reviewed-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>

Re: [Qemu-block] [PATCH RFC 1/7] docs/block-replication: Add description for shared-disk case

2016-11-27 Thread Changlong Xie


On 11/28/2016 01:13 PM, Hailiang Zhang wrote:


On 2016/10/25 17:03, Changlong Xie wrote:

On 10/20/2016 09:57 PM, zhanghailiang wrote:

Introuduce the scenario of shared-disk block replication
and how to use it.

Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Zhang Chen <zhangchen.f...@cn.fujitsu.com>
---
   docs/block-replication.txt | 131
+++--
   1 file changed, 127 insertions(+), 4 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..97fcfc1 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the
network transportation
   effort during a vmstate checkpoint, the disk modification
operations of
   the Primary disk are asynchronously forwarded to the Secondary node.

-== Workflow ==
+== Non-shared disk workflow ==
   The following is the image of block replication workflow:

   +--+
++
@@ -57,7 +57,7 @@ The following is the image of block replication
workflow:
   4) Secondary write requests will be buffered in the Disk
buffer and it
  will overwrite the existing sector content in the buffer.

-== Architecture ==
+== None-shared disk architecture ==


s/None-shared/Non-shared/g




   We are going to implement block replication from many basic
   blocks that are already in QEMU.

@@ -106,6 +106,74 @@ any state that would otherwise be lost by the
speculative write-through
   of the NBD server into the secondary disk. So before block
replication,
   the primary disk and secondary disk should contain the same data.

+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  | (2)Forward and write through | |
+  | +--> | Disk Buffer |
+  | || |
+  | |\-/
+  | |(1)read   |
+  | |  |
+   (3)write   | |  | backing file
+  V |  |
+ +-+   |
+ | Shared Disk | <-+
+ +-+
+
+1) Primary writes will read original data and forward it to
Secondary
+   QEMU.
+2) Before Primary write requests are written to Shared disk, the
+   original sector content will be read from Shared disk and
+   forwarded and buffered in the Disk buffer on the secondary site,
+   but it will not overwrite the existing


extra spaces at the end of line




+   sector content(it could be from either "Secondary Write
Requests" or


Need a space before "(" for better style.




+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Shared disk.
+4) Secondary write requests will be buffered in the Disk buffer
and it
+   will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+ virtio-blk
||   .--
+ /
||   | Secondary
+/
||   '--
+   /
|| virtio-blk
+  /
||  |
+  |
||   replication(5)
+  |NBD  >   NBD
(2)   |
+  |  client ||server ---> hidden
disk <-- active disk(4)
+  | ^   ||  |
+  |  replication(1) ||  |
+  | |   ||  |
+  |   +-'   ||  |
+ (3)  |drive-backup sync=none   ||  |
+. |   +-+   ||  |
+Primary | | |   ||   backing|
+' |

Re: [Qemu-block] [Qemu-devel] [PATCH] docs/block-replication.txt: Introduce nbd qmp commands

2016-11-07 Thread Changlong Xie


On 11/07/2016 03:50 PM, Markus Armbruster wrote:

Changlong Xie <xiecl.f...@cn.fujitsu.com> writes:


Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
  docs/block-replication.txt | 22 +-
  1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..6b9c77b 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -152,9 +152,22 @@ Primary:
   children.0.driver=raw

Run qmp command in primary qemu:
-{ 'execute': 'human-monitor-command',
+{ 'execute': 'blockdev-add',
'arguments': {
-  'command-line': 'drive_add -n buddy 
driver=replication,mode=primary,file.driver=nbd,file.host=,file.port=,file.export=colo1,node-name=nbd_client1'
+  'driver': 'replication',
+  'node-name': 'nbd_client1',
+  'mode': 'primary',
+  'file': {
+  'driver': 'nbd',
+  'export': 'colo1',
+  'server': {
+  'type': 'inet',
+  'data': {
+  'host': '',
+  'port': ''
+  }
+  }
+  }
}
  }
  { 'execute': 'x-blockdev-change',
@@ -223,12 +236,11 @@ Primary:
  'child': 'children.1'
  }
}
-  { 'execute': 'human-monitor-command',
+  { 'execute': 'x-blockdev-del',
  'arguments': {
-'command-line': 'drive_del '
+'node-name': 'nbd_client1'
  }
}
-  Note: there is no qmp command to remove the blockdev now

  Secondary:
The primary host is down, so we should do the following thing:


This is premature: both blockdev-add and x-blockdev-del still aren't
ready for production.  Getting close, though.


Sound nice : ), so let this patch pending here.

Thanks
-Xie



.

[Qemu-block] [PATCH] docs/block-replication.txt: Introduce nbd qmp commands

2016-11-06 Thread Changlong Xie

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 docs/block-replication.txt | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..6b9c77b 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -152,9 +152,22 @@ Primary:
  children.0.driver=raw
 
   Run qmp command in primary qemu:
-{ 'execute': 'human-monitor-command',
+{ 'execute': 'blockdev-add',
   'arguments': {
-  'command-line': 'drive_add -n buddy 
driver=replication,mode=primary,file.driver=nbd,file.host=,file.port=,file.export=colo1,node-name=nbd_client1'
+  'driver': 'replication',
+  'node-name': 'nbd_client1',
+  'mode': 'primary',
+  'file': {
+  'driver': 'nbd',
+  'export': 'colo1',
+  'server': {
+  'type': 'inet',
+  'data': {
+  'host': '',
+  'port': ''
+  }
+  }
+  }
   }
 }
 { 'execute': 'x-blockdev-change',
@@ -223,12 +236,11 @@ Primary:
 'child': 'children.1'
 }
   }
-  { 'execute': 'human-monitor-command',
+  { 'execute': 'x-blockdev-del',
 'arguments': {
-'command-line': 'drive_del '
+'node-name': 'nbd_client1'
 }
   }
-  Note: there is no qmp command to remove the blockdev now
 
 Secondary:
   The primary host is down, so we should do the following thing:
-- 
1.9.3

Re: [Qemu-block] [PATCH RFC 0/7] COLO block replication supports shared disk case

2016-10-25 Thread Changlong Xie


I did't review p5/p6, I think you can merge p5/p6 into a single one.
Also don't forget update qapi/block-core.json with p3.

Thanks
-Xie

On 10/20/2016 09:57 PM, zhanghailiang wrote:

COLO block replication doesn't support the shared disk case,
Here we try to implement it.

Just as the scenario of non-shared disk block replication,
we are going to implement block replication from many basic
blocks that are already in QEMU.
The architecture is:

  virtio-blk ||   
.--
  /  ||   | 
Secondary
 /   ||   
'--
/|| 
virtio-blk
   / || 
 |
   | ||   
replication(5)
   |NBD  >   NBD   (2)  
 |
   |  client ||server ---> hidden disk <-- 
active disk(4)
   | ^   ||  |
   |  replication(1) ||  |
   | |   ||  |
   |   +-'   ||  |
  (3)  |drive-backup sync=none   ||  |
. |   +-+   ||  |
Primary | | |   ||   backing|
' | |   ||  |
   V |   |
+---+|
|   shared disk | <--+
+---+
1) Primary writes will read original data and forward it to Secondary
QEMU.
2) The hidden-disk will buffers the original content that is modified
by the primary VM. It should also be an empty disk, and
the driver supports bdrv_make_empty() and backing file.
3) Primary write requests will be written to Shared disk.
4) Secondary write requests will be buffered in the active disk and it
will overwrite the existing sector content in the buffe

For more details, please refer to patch 1.

The complete codes can be found from the link:
https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk

Test steps:
1. Secondary:
# x86_64-softmmu/qemu-system-x86_64 -boot c -m 2048 -smp 2 -qmp stdio -vnc :9 
-name secondary -enable-kvm -cpu qemu64,+kvmclock -device piix3-usb-uhci -drive 
if=none,driver=qcow2,file.filename=/mnt/ramfs/hidden_disk.img,id=hidden_disk0,backing.driver=raw,backing.file.filename=/work/kvm/suse11_sp3_64
  -drive 
if=virtio,id=active-disk0,driver=replication,mode=secondary,file.driver=qcow2,top-id=active-disk0,file.file.filename=/mnt/ramfs/active_disk.img,file.backing=hidden_disk0,shared-disk=on
 -incoming tcp:0:

Issue qmp commands:
{'execute':'qmp_capabilities'}
{'execute': 'nbd-server-start', 'arguments': {'addr': {'type': 'inet', 'data': 
{'host': '0', 'port': '9998'} } } }
{'execute': 'nbd-server-add', 'arguments': {'device': 'hidden_disk0', 
'writable': true } }

2.Primary:
# x86_64-softmmu/qemu-system-x86_64 -enable-kvm -m 2048 -smp 2 -qmp stdio -vnc 
:9 -name primary -cpu qemu64,+kvmclock -device piix3-usb-uhci -drive 
if=virtio,id=primary_disk0,file.filename=/work/kvm/suse11_sp3_64,driver=raw -S

Issue qmp commands:
{'execute':'qmp_capabilities'}
{'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_add 
-n buddy 
driver=replication,mode=primary,file.driver=nbd,file.host=9.42.3.17,file.port=9998,file.export=hidden_disk0,shared-disk-id=primary_disk0,shared-disk=on,node-name=rep'}}
{'execute': 'migrate-set-capabilities', 'arguments': {'capabilities': [ 
{'capability': 'x-colo', 'state': true } ] } }
{'execute': 'migrate', 'arguments': {'uri': 'tcp:9.42.3.17:' } }

3. Failover
Secondary side:
Issue qmp commands:
{ 'execute': 'nbd-server-stop' }
{ "execute": "x-colo-lost-heartbeat" }

Please review and any commits are welcomed.

Cc: Juan Quintela 
Cc: Amit Shah 
Cc: Dr. David Alan Gilbert (git) 

zhanghailiang (7):
   docs/block-replication: Add description for shared-disk case
   block-backend: Introduce blk_root() helper
   replication: add shared-disk and shared-disk-id options
   replication: Split out backup_do_checkpoint() from
 secondary_do_checkpoint()
   replication: fix code logic with the new shared_disk option
   replication: Implement block replication for shared disk case
   nbd/replication: implement .bdrv_get_info() for nbd and replication
 driver

  block/block-backend.c  |   5 ++
  block/nbd.c|  12

Re: [Qemu-block] [PATCH RFC 3/7] replication: add shared-disk and shared-disk-id options

2016-10-25 Thread Changlong Xie


On 10/20/2016 09:57 PM, zhanghailiang wrote:

We use these two options to identify which disk is
shared

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
  block/replication.c | 33 +
  1 file changed, 33 insertions(+)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..2a2fdb2 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -25,9 +25,12 @@
  typedef struct BDRVReplicationState {
  ReplicationMode mode;
  int replication_state;
+bool is_shared_disk;
+char *shared_disk_id;
  BdrvChild *active_disk;
  BdrvChild *hidden_disk;
  BdrvChild *secondary_disk;
+BdrvChild *primary_disk;
  char *top_id;
  ReplicationState *rs;
  Error *blocker;
@@ -53,6 +56,9 @@ static void replication_stop(ReplicationState *rs, bool 
failover,

  #define REPLICATION_MODE"mode"
  #define REPLICATION_TOP_ID  "top-id"
+#define REPLICATION_SHARED_DISK "shared-disk"
+#define REPLICATION_SHARED_DISK_ID "shared-disk-id"
+
  static QemuOptsList replication_runtime_opts = {
  .name = "replication",
  .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
@@ -65,6 +71,14 @@ static QemuOptsList replication_runtime_opts = {
  .name = REPLICATION_TOP_ID,
  .type = QEMU_OPT_STRING,
  },
+{
+.name = REPLICATION_SHARED_DISK_ID,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_SHARED_DISK,
+.type = QEMU_OPT_BOOL,
+},
  { /* end of list */ }
  },
  };
@@ -85,6 +99,8 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
  QemuOpts *opts = NULL;
  const char *mode;
  const char *top_id;
+const char *shared_disk_id;
+BlockBackend *blk;

  ret = -EINVAL;
  opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
@@ -114,6 +130,22 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
 "The option mode's value should be primary or secondary");
  goto fail;
  }


Now we have four runtime options 
"mode"/"top-id"/"shared-disk"/"shared-disk-id". But the current checking 
logic is too weak, i think you need enhance it to avoid opts misusage.



+s->is_shared_disk = qemu_opt_get_bool(opts, REPLICATION_SHARED_DISK,
+false);


Missing one space.


+if (s->is_shared_disk && (s->mode == REPLICATION_MODE_PRIMARY)) {
+shared_disk_id = qemu_opt_get(opts, REPLICATION_SHARED_DISK_ID);
+if (!shared_disk_id) {
+error_setg(_err, "Missing shared disk blk");
+goto fail;
+}
+s->shared_disk_id = g_strdup(shared_disk_id);
+blk = blk_by_name(s->shared_disk_id);
+if (!blk) {
+error_setg(_err, "There is no %s block", s->shared_disk_id);
+goto fail;
+}
+s->primary_disk = blk_root(blk);
+}

  s->rs = replication_new(bs, _ops);

@@ -130,6 +162,7 @@ static void replication_close(BlockDriverState *bs)
  {
  BDRVReplicationState *s = bs->opaque;

+g_free(s->shared_disk_id);
  if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
  replication_stop(s->rs, false, NULL);
  }

Re: [Qemu-block] [PATCH RFC 4/7] replication: Split out backup_do_checkpoint() from secondary_do_checkpoint()

2016-10-25 Thread Changlong Xie


On 10/20/2016 09:57 PM, zhanghailiang wrote:

The helper backup_do_checkpoint() will be used for primary related
codes. Here we split it out from secondary_do_checkpoint().

Besides, it is unnecessary to call backup_do_checkpoint() in
replication starting and normally stop replication path.


This patch is unnecessary. We *really* need clean 
backup_job->done_bitmap in replication_start/stop path.



We only need call it while do real checkpointing.

Signed-off-by: zhanghailiang 
---
  block/replication.c | 36 +++-
  1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 2a2fdb2..d687ffc 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -320,20 +320,8 @@ static bool 
replication_recurse_is_first_non_filter(BlockDriverState *bs,

  static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
  {
-Error *local_err = NULL;
  int ret;

-if (!s->secondary_disk->bs->job) {
-error_setg(errp, "Backup job was cancelled unexpectedly");
-return;
-}
-
-backup_do_checkpoint(s->secondary_disk->bs->job, _err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-
  ret = s->active_disk->bs->drv->bdrv_make_empty(s->active_disk->bs);
  if (ret < 0) {
  error_setg(errp, "Cannot make active disk empty");
@@ -539,6 +527,8 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  aio_context_release(aio_context);
  return;
  }
+
+secondary_do_checkpoint(s, errp);
  break;
  default:
  aio_context_release(aio_context);
@@ -547,10 +537,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,

  s->replication_state = BLOCK_REPLICATION_RUNNING;

-if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
-}
-
  s->error = 0;
  aio_context_release(aio_context);
  }
@@ -560,13 +546,29 @@ static void replication_do_checkpoint(ReplicationState 
*rs, Error **errp)
  BlockDriverState *bs = rs->opaque;
  BDRVReplicationState *s;
  AioContext *aio_context;
+Error *local_err = NULL;

  aio_context = bdrv_get_aio_context(bs);
  aio_context_acquire(aio_context);
  s = bs->opaque;

-if (s->mode == REPLICATION_MODE_SECONDARY) {
+switch (s->mode) {
+case REPLICATION_MODE_PRIMARY:
+break;
+case REPLICATION_MODE_SECONDARY:
+if (!s->secondary_disk->bs->job) {
+error_setg(errp, "Backup job was cancelled unexpectedly");
+break;
+}
+backup_do_checkpoint(s->secondary_disk->bs->job, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+break;
+}
  secondary_do_checkpoint(s, errp);
+break;
+default:
+abort();
  }
  aio_context_release(aio_context);
  }

Re: [Qemu-block] [PATCH RFC 3/7] replication: add shared-disk and shared-disk-id options

2016-10-25 Thread Changlong Xie


On 10/20/2016 09:57 PM, zhanghailiang wrote:

We use these two options to identify which disk is
shared

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
  block/replication.c | 33 +
  1 file changed, 33 insertions(+)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..2a2fdb2 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -25,9 +25,12 @@
  typedef struct BDRVReplicationState {
  ReplicationMode mode;
  int replication_state;
+bool is_shared_disk;
+char *shared_disk_id;
  BdrvChild *active_disk;
  BdrvChild *hidden_disk;
  BdrvChild *secondary_disk;
+BdrvChild *primary_disk;
  char *top_id;
  ReplicationState *rs;
  Error *blocker;
@@ -53,6 +56,9 @@ static void replication_stop(ReplicationState *rs, bool 
failover,

  #define REPLICATION_MODE"mode"
  #define REPLICATION_TOP_ID  "top-id"
+#define REPLICATION_SHARED_DISK "shared-disk"
+#define REPLICATION_SHARED_DISK_ID "shared-disk-id"
+
  static QemuOptsList replication_runtime_opts = {
  .name = "replication",
  .head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
@@ -65,6 +71,14 @@ static QemuOptsList replication_runtime_opts = {
  .name = REPLICATION_TOP_ID,
  .type = QEMU_OPT_STRING,
  },
+{
+.name = REPLICATION_SHARED_DISK_ID,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_SHARED_DISK,
+.type = QEMU_OPT_BOOL,
+},
  { /* end of list */ }
  },
  };
@@ -85,6 +99,8 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
  QemuOpts *opts = NULL;
  const char *mode;
  const char *top_id;
+const char *shared_disk_id;
+BlockBackend *blk;

  ret = -EINVAL;
  opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
@@ -114,6 +130,22 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
 "The option mode's value should be primary or secondary");
  goto fail;
  }
+s->is_shared_disk = qemu_opt_get_bool(opts, REPLICATION_SHARED_DISK,
+false);
+if (s->is_shared_disk && (s->mode == REPLICATION_MODE_PRIMARY)) {
+shared_disk_id = qemu_opt_get(opts, REPLICATION_SHARED_DISK_ID);
+if (!shared_disk_id) {
+error_setg(_err, "Missing shared disk blk");
+goto fail;
+}
+s->shared_disk_id = g_strdup(shared_disk_id);
+blk = blk_by_name(s->shared_disk_id);
+if (!blk) {
+error_setg(_err, "There is no %s block", s->shared_disk_id);


g_free(s->shared_disk_id);


+goto fail;
+}
+s->primary_disk = blk_root(blk);
+}

  s->rs = replication_new(bs, _ops);

@@ -130,6 +162,7 @@ static void replication_close(BlockDriverState *bs)
  {
  BDRVReplicationState *s = bs->opaque;

+g_free(s->shared_disk_id);
  if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
  replication_stop(s->rs, false, NULL);
  }

Re: [Qemu-block] [PATCH RFC 2/7] block-backend: Introduce blk_root() helper

2016-10-25 Thread Changlong Xie

I know you need blk->root in the next patch, but we strongly don't 
recommend your current solution.  Please refer Kevin's cf2ab8fc


1409 /* XXX Ugly way to get blk->root, but that's a feature, not a 
bug. This
1410  * hack makes it obvious that vhdx_write_header() bypasses the 
BlockBackend

1411  * here, which it really shouldn't be doing. */
1412 child = QLIST_FIRST(>parents);
1413 assert(!QLIST_NEXT(child, next_parent));

Then you can drop this commit.

On 10/20/2016 09:57 PM, zhanghailiang wrote:

With this helper function, we can get the BdrvChild struct
from BlockBackend

Signed-off-by: zhanghailiang 
---
  block/block-backend.c  | 5 +
  include/sysemu/block-backend.h | 1 +
  2 files changed, 6 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 1a724a8..66387f0 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -389,6 +389,11 @@ BlockDriverState *blk_bs(BlockBackend *blk)
  return blk->root ? blk->root->bs : NULL;
  }

+BdrvChild *blk_root(BlockBackend *blk)
+{
+return blk->root;
+}
+
  static BlockBackend *bdrv_first_blk(BlockDriverState *bs)
  {
  BdrvChild *child;
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index b07159b..867f9f5 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -99,6 +99,7 @@ void blk_remove_bs(BlockBackend *blk);
  void blk_insert_bs(BlockBackend *blk, BlockDriverState *bs);
  bool bdrv_has_blk(BlockDriverState *bs);
  bool bdrv_is_root_node(BlockDriverState *bs);
+BdrvChild *blk_root(BlockBackend *blk);

  void blk_set_allow_write_beyond_eof(BlockBackend *blk, bool allow);
  void blk_iostatus_enable(BlockBackend *blk);

Re: [Qemu-block] [PATCH RFC 1/7] docs/block-replication: Add description for shared-disk case

2016-10-25 Thread Changlong Xie


On 10/20/2016 09:57 PM, zhanghailiang wrote:

Introuduce the scenario of shared-disk block replication
and how to use it.

Signed-off-by: zhanghailiang 
Signed-off-by: Wen Congyang 
Signed-off-by: Zhang Chen 
---
  docs/block-replication.txt | 131 +++--
  1 file changed, 127 insertions(+), 4 deletions(-)

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 6bde673..97fcfc1 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -24,7 +24,7 @@ only dropped at next checkpoint time. To reduce the network 
transportation
  effort during a vmstate checkpoint, the disk modification operations of
  the Primary disk are asynchronously forwarded to the Secondary node.

-== Workflow ==
+== Non-shared disk workflow ==
  The following is the image of block replication workflow:

  +--+++
@@ -57,7 +57,7 @@ The following is the image of block replication workflow:
  4) Secondary write requests will be buffered in the Disk buffer and it
 will overwrite the existing sector content in the buffer.

-== Architecture ==
+== None-shared disk architecture ==


s/None-shared/Non-shared/g


  We are going to implement block replication from many basic
  blocks that are already in QEMU.

@@ -106,6 +106,74 @@ any state that would otherwise be lost by the speculative 
write-through
  of the NBD server into the secondary disk. So before block replication,
  the primary disk and secondary disk should contain the same data.

+== Shared Disk Mode Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  | (2)Forward and write through | |
+  | +--> | Disk Buffer |
+  | || |
+  | |\-/
+  | |(1)read   |
+  | |  |
+   (3)write   | |  | backing file
+  V |  |
+ +-+   |
+ | Shared Disk | <-+
+ +-+
+
+1) Primary writes will read original data and forward it to Secondary
+   QEMU.
+2) Before Primary write requests are written to Shared disk, the
+   original sector content will be read from Shared disk and
+   forwarded and buffered in the Disk buffer on the secondary site,
+   but it will not overwrite the existing


extra spaces at the end of line


+   sector content(it could be from either "Secondary Write Requests" or


Need a space before "(" for better style.


+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Shared disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Shared Disk Mode Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+ virtio-blk ||   
.--
+ /  ||   | 
Secondary
+/   ||   
'--
+   /|| 
virtio-blk
+  / || 
 |
+  | ||   
replication(5)
+  |NBD  >   NBD   (2)  
 |
+  |  client ||server ---> hidden disk <-- 
active disk(4)
+  | ^   ||  |
+  |  replication(1) ||  |
+  | |   ||  |
+  |   +-'   ||  |
+ (3)  |drive-backup sync=none   ||  |
+. |   +-+   ||  |
+Primary | | |   ||   backing|
+' |

Re: [Qemu-block] [Qemu-devel] [PATCH] nbd: Use CoQueue for free_sema instead of CoMutex

2016-10-24 Thread Changlong Xie


On 10/24/2016 05:36 PM, Paolo Bonzini wrote:



On 24/10/2016 03:44, Changlong Xie wrote:

Ping. Any comments? It's really a problem for NBD.


Sorry, I haven't been sending pull requests.  I'll do it this week.



Thanks : )


Paolo


Thanks
 -Xie

On 10/12/2016 06:18 PM, Changlong Xie wrote:

NBD is using the CoMutex in a way that wasn't anticipated. For
example, if there are
N(N=26, MAX_NBD_REQUESTS=16) nbd write requests, so we will invoke
nbd_client_co_pwritev
N times.


time request Actions
11   in_flight=1, Coroutine=C1
22   in_flight=2, Coroutine=C2
...
15   15  in_flight=15, Coroutine=C15
16   16  in_flight=16, Coroutine=C16, free_sema->holder=C16,
mutex->locked=true
17   17  in_flight=16, Coroutine=C17, queue C17 into free_sema->queue
18   18  in_flight=16, Coroutine=C18, queue C18 into free_sema->queue
...
26   N   in_flight=16, Coroutine=C26, queue C26 into free_sema->queue



Once nbd client recieves request No.16' reply, we will re-enter C16.
It's ok, because
it's equal to 'free_sema->holder'.


time request Actions
27   16  in_flight=15, Coroutine=C16, free_sema->holder=C16,
mutex->locked=false



Then nbd_coroutine_end invokes qemu_co_mutex_unlock what will pop
coroutines from
free_sema->queue's head and enter C17. More free_sema->holder is C17 now.


time request Actions
28   17  in_flight=16, Coroutine=C17, free_sema->holder=C17,
mutex->locked=true



In above scenario, we only recieves request No.16' reply. As time goes
by, nbd client will
almostly recieves replies from requests 1 to 15 rather than request 17
who owns C17. In this
case, we will encounter assert "mutex->holder == self" failed since
Kevin's commit 0e438cdc
"coroutine: Let CoMutex remember who holds it". For example, if nbd
client recieves request
No.15' reply, qemu will stop unexpectedly:


time request   Actions
29   15(most case) in_flight=15, Coroutine=C15, free_sema->holder=C17,
mutex->locked=false



Per Paolo's suggestion "The simplest fix is to change it to CoQueue,
which is like a condition
variable", this patch replaces CoMutex with CoQueue.

Cc: Wen Congyang <we...@cn.fujitsu.com>
Reported-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
   block/nbd-client.c | 8 
   block/nbd-client.h | 2 +-
   2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 2cf3237..40b28ab 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -199,8 +199,8 @@ static void nbd_coroutine_start(NbdClientSession *s,
   {
   /* Poor man semaphore.  The free_sema is locked when no other
request
* can be accepted, and unlocked after receiving one reply.  */
-if (s->in_flight >= MAX_NBD_REQUESTS - 1) {
-qemu_co_mutex_lock(>free_sema);
+if (s->in_flight == MAX_NBD_REQUESTS) {
+qemu_co_queue_wait(>free_sema);
   assert(s->in_flight < MAX_NBD_REQUESTS);
   }
   s->in_flight++;
@@ -214,7 +214,7 @@ static void nbd_coroutine_end(NbdClientSession *s,
   int i = HANDLE_TO_INDEX(s, request->handle);
   s->recv_coroutine[i] = NULL;
   if (s->in_flight-- == MAX_NBD_REQUESTS) {
-qemu_co_mutex_unlock(>free_sema);
+qemu_co_queue_next(>free_sema);
   }
   }

@@ -386,7 +386,7 @@ int nbd_client_init(BlockDriverState *bs,
   }

   qemu_co_mutex_init(>send_mutex);
-qemu_co_mutex_init(>free_sema);
+qemu_co_queue_init(>free_sema);
   client->sioc = sioc;
   object_ref(OBJECT(client->sioc));

diff --git a/block/nbd-client.h b/block/nbd-client.h
index 044aca4..307b8b1 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -24,7 +24,7 @@ typedef struct NbdClientSession {
   off_t size;

   CoMutex send_mutex;
-CoMutex free_sema;
+CoQueue free_sema;
   Coroutine *send_coroutine;
   int in_flight;










.

Re: [Qemu-block] [Qemu-devel] [PATCH] nbd: Use CoQueue for free_sema instead of CoMutex

2016-10-12 Thread Changlong Xie


On 10/12/2016 06:18 PM, Changlong Xie wrote:


time request   Actions
29   15(most case) in_flight=15, Coroutine=C15, free_sema->holder=C17, 
mutex->locked=false


Per Paolo's suggestion "The simplest fix is to change it to CoQueue, which is 
like a condition
variable", this patch replaces CoMutex with CoQueue.

Cc: Wen Congyang<we...@cn.fujitsu.com>
Reported-by: zhanghailiang<zhang.zhanghaili...@huawei.com>
Suggested-by: Paolo Bonzini<pbonz...@redhat.com>
Signed-off-by: Changlong Xie<xiecl.f...@cn.fujitsu.com>
---


Please refer to:

http://lists.nongnu.org/archive/html/qemu-devel/2016-10/msg02172.html

[Qemu-block] [PATCH] nbd: Use CoQueue for free_sema instead of CoMutex

2016-10-12 Thread Changlong Xie

NBD is using the CoMutex in a way that wasn't anticipated. For example, if 
there are
N(N=26, MAX_NBD_REQUESTS=16) nbd write requests, so we will invoke 
nbd_client_co_pwritev
N times.

time request Actions
11   in_flight=1, Coroutine=C1
22   in_flight=2, Coroutine=C2
...
15   15  in_flight=15, Coroutine=C15
16   16  in_flight=16, Coroutine=C16, free_sema->holder=C16, 
mutex->locked=true
17   17  in_flight=16, Coroutine=C17, queue C17 into free_sema->queue
18   18  in_flight=16, Coroutine=C18, queue C18 into free_sema->queue
...
26   N   in_flight=16, Coroutine=C26, queue C26 into free_sema->queue


Once nbd client recieves request No.16' reply, we will re-enter C16. It's ok, 
because
it's equal to 'free_sema->holder'.

time request Actions
27   16  in_flight=15, Coroutine=C16, free_sema->holder=C16, 
mutex->locked=false


Then nbd_coroutine_end invokes qemu_co_mutex_unlock what will pop coroutines 
from
free_sema->queue's head and enter C17. More free_sema->holder is C17 now.

time request Actions
28   17  in_flight=16, Coroutine=C17, free_sema->holder=C17, 
mutex->locked=true


In above scenario, we only recieves request No.16' reply. As time goes by, nbd 
client will
almostly recieves replies from requests 1 to 15 rather than request 17 who owns 
C17. In this
case, we will encounter assert "mutex->holder == self" failed since Kevin's 
commit 0e438cdc
"coroutine: Let CoMutex remember who holds it". For example, if nbd client 
recieves request
No.15' reply, qemu will stop unexpectedly:

time request   Actions
29   15(most case) in_flight=15, Coroutine=C15, free_sema->holder=C17, 
mutex->locked=false


Per Paolo's suggestion "The simplest fix is to change it to CoQueue, which is 
like a condition
variable", this patch replaces CoMutex with CoQueue.

Cc: Wen Congyang <we...@cn.fujitsu.com>
Reported-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/nbd-client.c | 8 
 block/nbd-client.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 2cf3237..40b28ab 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -199,8 +199,8 @@ static void nbd_coroutine_start(NbdClientSession *s,
 {
 /* Poor man semaphore.  The free_sema is locked when no other request
  * can be accepted, and unlocked after receiving one reply.  */
-if (s->in_flight >= MAX_NBD_REQUESTS - 1) {
-qemu_co_mutex_lock(>free_sema);
+if (s->in_flight == MAX_NBD_REQUESTS) {
+qemu_co_queue_wait(>free_sema);
 assert(s->in_flight < MAX_NBD_REQUESTS);
 }
 s->in_flight++;
@@ -214,7 +214,7 @@ static void nbd_coroutine_end(NbdClientSession *s,
 int i = HANDLE_TO_INDEX(s, request->handle);
 s->recv_coroutine[i] = NULL;
 if (s->in_flight-- == MAX_NBD_REQUESTS) {
-qemu_co_mutex_unlock(>free_sema);
+qemu_co_queue_next(>free_sema);
 }
 }
 
@@ -386,7 +386,7 @@ int nbd_client_init(BlockDriverState *bs,
 }
 
 qemu_co_mutex_init(>send_mutex);
-qemu_co_mutex_init(>free_sema);
+qemu_co_queue_init(>free_sema);
 client->sioc = sioc;
 object_ref(OBJECT(client->sioc));
 
diff --git a/block/nbd-client.h b/block/nbd-client.h
index 044aca4..307b8b1 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -24,7 +24,7 @@ typedef struct NbdClientSession {
 off_t size;
 
 CoMutex send_mutex;
-CoMutex free_sema;
+CoQueue free_sema;
 Coroutine *send_coroutine;
 int in_flight;
 
-- 
1.9.3

Re: [Qemu-block] [Questions] NBD issue or CoMutex->holder issue?

2016-10-11 Thread Changlong Xie

On 10/11/2016 06:47 PM, Paolo Bonzini wrote:

the free_sema->queue head, so set free_sema->holder as
>revelant coroutine.

NBD is using the CoMutex in a way that wasn't anticipated.  The simplest
fix is to change it to CoQueue, which is like a condition variable.
Instead of locking if in_flight >= MAX_NBD_REQUESTS - 1, wait on the
queue while in_flight == MAX_NBD_REQUESTS.  Instead of unlocking, use
qemu_co_queue_next to wake up one request.

Thanks for your explanation! will send out a patch later.

Thanks
-Xie

Thanks for the report!

Paolo

>For example if there are N(N=26 and MAX_NBD_REQUESTS=16) nbd write
>requests, so we'll invoke nbd_client_co_pwritev 26 times.
>time request No   Actions
>1 1   in_flight=1, Coroutine=C1
>2 2   in_flight=2, Coroutine=C2

Re: [Qemu-block] [Qemu-devel] [PATCH v2 2/2] block/replication: Clarify 'top-id' parameter usage

2016-10-11 Thread Changlong Xie


On 10/11/2016 10:54 PM, Eric Blake wrote:

The replication driver only supports the 'top-id' parameter for the
secondary side; it must not be supplied for the primary side.


Will apply in next version.

Thanks
-Xie

Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/2] block/replication: prefect the logic to acquire 'top_id'

2016-10-11 Thread Changlong Xie


On 10/11/2016 10:52 PM, Eric Blake wrote:

On 10/11/2016 05:46 AM, Changlong Xie wrote:

Only g_strdup(top_id) if 'top_id' is not NULL, although there
is no memory leak here

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
  block/replication.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..5b432d9 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -104,11 +104,11 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
  } else if (!strcmp(mode, "secondary")) {
  s->mode = REPLICATION_MODE_SECONDARY;
  top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
-s->top_id = g_strdup(top_id);


g_strdup(NULL) is safe; it returns NULL in that case.


Yes, that's why i said 'there is no memory leak here' in the commit 
message.





-if (!s->top_id) {
+if (!top_id) {
  error_setg(_err, "Missing the option top-id");
  goto fail;
  }
+s->top_id = g_strdup(top_id);


I see no point to this patch, rather than churn.


It just reduce on execution path. Maybe i'm too academic :)
Will remove it in the next series.

Thanks
-Xie

[Qemu-block] [PATCH v2 0/2] block/replication fixes

2016-10-11 Thread Changlong Xie

V2:
1. fix typo

Changlong Xie (2):
  block/replication: prefect the logic to acquire 'top_id'
  block/replication: Clarify 'top-id' parameter usage

 block/replication.c  | 9 +++--
 qapi/block-core.json | 3 ++-
 2 files changed, 9 insertions(+), 3 deletions(-)

-- 
1.9.3

[Qemu-block] [PATCH v2 1/2] block/replication: prefect the logic to acquire 'top_id'

2016-10-11 Thread Changlong Xie

Only g_strdup(top_id) if 'top_id' is not NULL, although there
is no memory leak here

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/replication.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..5b432d9 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -104,11 +104,11 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
 } else if (!strcmp(mode, "secondary")) {
 s->mode = REPLICATION_MODE_SECONDARY;
 top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
-s->top_id = g_strdup(top_id);
-if (!s->top_id) {
+if (!top_id) {
 error_setg(_err, "Missing the option top-id");
 goto fail;
 }
+s->top_id = g_strdup(top_id);
 } else {
 error_setg(_err,
"The option mode's value should be primary or secondary");
-- 
1.9.3

[Qemu-block] [PATCH v1 1/2] block/replication: prefect the logic to acquire 'top_id'

2016-10-10 Thread Changlong Xie

Only g_strdup(top_id) if 'top_id' is not NULL, although there
is no memory leak here

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/replication.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..5b432d9 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -104,11 +104,11 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
 } else if (!strcmp(mode, "secondary")) {
 s->mode = REPLICATION_MODE_SECONDARY;
 top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
-s->top_id = g_strdup(top_id);
-if (!s->top_id) {
+if (!top_id) {
 error_setg(_err, "Missing the option top-id");
 goto fail;
 }
+s->top_id = g_strdup(top_id);
 } else {
 error_setg(_err,
"The option mode's value should be primary or secondary");
-- 
1.9.3

[Qemu-block] [PATCH v1 0/2] block/replication fixes

2016-10-10 Thread Changlong Xie

Changlong Xie (2):
  block/replication: prefect the logic to acquire 'top_id'
  block/replication: Clarify 'top-id' parameter usage

 block/replication.c  | 9 +++--
 qapi/block-core.json | 3 ++-
 2 files changed, 9 insertions(+), 3 deletions(-)

-- 
1.9.3

[Qemu-block] [PATCH v1 2/2] block/replication: Clarify 'top-id' parameter usage

2016-10-10 Thread Changlong Xie

Replication driver only support 'top-id' parameter in secondary side,
and it must not be supplied in primary side

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/replication.c  | 5 +
 qapi/block-core.json | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index 5b432d9..1e8284b 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -101,6 +101,11 @@ static int replication_open(BlockDriverState *bs, QDict 
*options,
 
 if (!strcmp(mode, "primary")) {
 s->mode = REPLICATION_MODE_PRIMARY;
+top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
+if (top_id) {
+error_setg(_err, "The primary side do not support option 
top-id");
+goto fail;
+}
 } else if (!strcmp(mode, "secondary")) {
 s->mode = REPLICATION_MODE_SECONDARY;
 top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 4badb97..ec92df4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2197,7 +2197,8 @@
 # @mode: the replication mode
 #
 # @top-id: #optional In secondary mode, node name or device ID of the root
-#  node who owns the replication node chain. Ignored in primary mode.
+#  node who owns the replication node chain. Must not be given in
+#  primary mode.
 #
 # Since: 2.8
 ##
-- 
1.9.3

Re: [Qemu-block] [PATCH v24 11/12] support replication driver in blockdev-add

2016-08-15 Thread Changlong Xie


On 08/15/2016 04:37 PM, Kevin Wolf wrote:

Am 15.08.2016 um 03:49 hat Changlong Xie geschrieben:

On 08/09/2016 05:08 PM, Kevin Wolf wrote:

Am 27.07.2016 um 09:01 hat Changlong Xie geschrieben:

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>



@@ -2078,6 +2079,23 @@
  { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }

  ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# @top-id: #optional In secondary mode, node name or device ID of the root
+#  node who owns the replication node chain. Ignored in primary mode.


Can we change this to "Must not be given in primary mode"? Not sure what
the code currently does, but I think it should error out if top-id is


Replication driver will ignore "top-id" parameter in Primary mode.


This is not good behaviour, which is why I requested a change.



Hi stefan

Would you like me send another [PATCH v25] based your block-next? Or a 
separate patch until your tree is merged.


Thanks
-Xie


Kevin


given there.


+#
+# Since: 2.8
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode',
+'*top-id': 'str' } }


Kevin



.

Re: [Qemu-block] [PATCH v24 11/12] support replication driver in blockdev-add

2016-08-14 Thread Changlong Xie


On 08/09/2016 05:08 PM, Kevin Wolf wrote:

Am 27.07.2016 um 09:01 hat Changlong Xie geschrieben:

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>



@@ -2078,6 +2079,23 @@
  { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }

  ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# @top-id: #optional In secondary mode, node name or device ID of the root
+#  node who owns the replication node chain. Ignored in primary mode.


Can we change this to "Must not be given in primary mode"? Not sure what
the code currently does, but I think it should error out if top-id is


Replication driver will ignore "top-id" parameter in Primary mode.


given there.


+#
+# Since: 2.8
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode',
+'*top-id': 'str' } }


Kevin


.

[Qemu-block] [PATCH v24 12/12] MAINTAINERS: add maintainer for replication

2016-07-27 Thread Changlong Xie

As per Stefan's suggestion, add Wen and I as co-maintainers
of replication.

Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 MAINTAINERS | 9 +
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d0e2c3..25b9438 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1619,6 +1619,15 @@ L: qemu-block@nongnu.org
 S: Supported
 F: tests/image-fuzzer/
 
+Replication
+M: Wen Congyang <we...@cn.fujitsu.com>
+M: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+S: Supported
+F: replication*
+F: block/replication.c
+F: tests/test-replication.c
+F: docs/block-replication.txt
+
 Build and test automation
 -
 M: Alex Bennée <alex.ben...@linaro.org>
-- 
1.9.3

[Qemu-block] [PATCH v24 11/12] support replication driver in blockdev-add

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 qapi/block-core.json | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 8c250ec..aadc75a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
 #   2.6: 'luks' added
+#   2.8: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1671,8 +1672,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'luks', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', 'tftp',
+'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile
@@ -2078,6 +2079,23 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# @top-id: #optional In secondary mode, node name or device ID of the root
+#  node who owns the replication node chain. Ignored in primary mode.
+#
+# Since: 2.8
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode',
+'*top-id': 'str' } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
@@ -2142,6 +2160,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-block] [PATCH v24 10/12] tests: add unit test case for replication

2016-07-27 Thread Changlong Xie

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 tests/.gitignore |   1 +
 tests/Makefile.include   |   4 +
 tests/test-replication.c | 575 +++
 3 files changed, 580 insertions(+)
 create mode 100644 tests/test-replication.c

diff --git a/tests/.gitignore b/tests/.gitignore
index dbb5263..b4a9cfc 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -63,6 +63,7 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
+test-replication
 test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 2010b11..34fc840 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -111,6 +111,7 @@ check-unit-y += tests/test-crypto-xts$(EXESUF)
 check-unit-y += tests/test-crypto-block$(EXESUF)
 gcov-files-test-logging-y = tests/test-logging.c
 check-unit-y += tests/test-logging$(EXESUF)
+check-unit-$(CONFIG_REPLICATION) += tests/test-replication$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -472,6 +473,9 @@ tests/test-base64$(EXESUF): tests/test-base64.o \
 
 tests/test-logging$(EXESUF): tests/test-logging.o $(test-util-obj-y)
 
+tests/test-replication$(EXESUF): tests/test-replication.o $(test-util-obj-y) \
+   $(test-block-obj-y)
+
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json 
$(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
diff --git a/tests/test-replication.c b/tests/test-replication.c
new file mode 100644
index 000..b63f1ef
--- /dev/null
+++ b/tests/test-replication.c
@@ -0,0 +1,575 @@
+/*
+ * Block replication tests
+ *
+ * Copyright (c) 2016 FUJITSU LIMITED
+ * Author: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "replication.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+
+#define IMG_SIZE (64 * 1024 * 1024)
+
+/* primary */
+#define P_ID "primary-id"
+static char p_local_disk[] = "/tmp/p_local_disk.XX";
+
+/* secondary */
+#define S_ID "secondary-id"
+#define S_LOCAL_DISK_ID "secondary-local-disk-id"
+static char s_local_disk[] = "/tmp/s_local_disk.XX";
+static char s_active_disk[] = "/tmp/s_active_disk.XX";
+static char s_hidden_disk[] = "/tmp/s_hidden_disk.XX";
+
+/* FIXME: steal from blockdev.c */
+QemuOptsList qemu_drive_opts = {
+.name = "drive",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_drive_opts.head),
+.desc = {
+{ /* end of list */ }
+},
+};
+
+#define NOT_DONE 0x7fff
+
+static void blk_rw_done(void *opaque, int ret)
+{
+*(int *)opaque = ret;
+}
+
+static void test_blk_read(BlockBackend *blk, long pattern,
+  int64_t pattern_offset, int64_t pattern_count,
+  int64_t offset, int64_t count,
+  bool expect_failed)
+{
+void *pattern_buf = NULL;
+QEMUIOVector qiov;
+void *cmp_buf = NULL;
+int async_ret = NOT_DONE;
+
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+pattern_buf = g_malloc(count);
+if (pattern) {
+memset(pattern_buf, pattern, count);
+} else {
+memset(pattern_buf, 0x00, count);
+}
+
+qemu_iovec_init(, 1);
+qemu_iovec_add(, pattern_buf, count);
+
+blk_aio_preadv(blk, offset, , 0, blk_rw_done, _ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+if (expect_failed) {
+g_assert(async_ret != 0);
+} else {
+g_assert(async_ret == 0);
+if (pattern) {
+g_assert(memcmp(pattern_buf + pattern_offset,
+cmp_buf, pattern_count) <= 0);
+}
+}
+
+g_free(pattern_buf);
+}
+
+static void test_blk_write(BlockBackend *blk, long pattern, int64_t offset,
+   int64_t count, bool expect_failed)
+{
+void *pattern_buf = NULL;
+QEMUIOVector qiov;
+int async_ret = NOT_DONE;
+
+pattern_buf = g_malloc(count);
+if (pattern) {
+memset(pattern_buf, pattern, count);
+} else {
+memset(pattern_buf, 0x00, count);
+}
+
+qemu_iovec_init(, 1);
+qemu_iovec_add(, pattern_buf, count);
+
+blk_aio_pwritev(blk, offset, , 0, blk_rw_done, _ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+if (expect_failed) {
+g_assert(a

[Qemu-block] [PATCH v24 09/12] Implement new driver for block replication

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 659 
 2 files changed, 660 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 8a3270b..55da626 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-$(CONFIG_REPLICATION) += replication.o
 
 block-obj-y += crypto.o
 
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..1117f02
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,659 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <we...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "block/nbd.h"
+#include "block/blockjob.h"
+#include "block/block_int.h"
+#include "block/block_backup.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BdrvChild *active_disk;
+BdrvChild *hidden_disk;
+BdrvChild *secondary_disk;
+char *top_id;
+ReplicationState *rs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed */
+BLOCK_REPLICATION_DONE, /* block replication is done */
+};
+
+static void replication_start(ReplicationState *rs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(ReplicationState *rs, Error **errp);
+static void replication_get_error(ReplicationState *rs, Error **errp);
+static void replication_stop(ReplicationState *rs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+#define REPLICATION_TOP_ID  "top-id"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_TOP_ID,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static ReplicationOps replication_ops = {
+.start = replication_start,
+.checkpoint = replication_do_checkpoint,
+.get_error = replication_get_error,
+.stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+const char *top_id;
+
+ret = -EINVAL;
+opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
+qemu_opts_absorb_qdict(opts, options, _err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
+s->top_id = g_strdup(top_id);
+if (!s->top_id) {
+error_setg(_err, "Missing the option top-id");
+goto fail;
+}
+} else {
+error_setg(_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+s->rs = replication_new(bs, _ops);
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+error_propagate(errp, local_e

[Qemu-block] [PATCH v24 05/12] docs: block replication's description

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 docs/block-replication.txt | 239 +
 1 file changed, 239 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..6bde673
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,239 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2016
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of the Primary and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort during a vmstate checkpoint, the disk modification operations of
+the Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+'

[Qemu-block] [PATCH v24 07/12] configure: support replication

2016-07-27 Thread Changlong Xie

configure --(enable/disable)-replication to switch replication
support on/off, and it is on by default.
We later introduce replation support.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 configure | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/configure b/configure
index 5ada56d..6f5d151 100755
--- a/configure
+++ b/configure
@@ -320,6 +320,7 @@ vhdx=""
 numa=""
 tcmalloc="no"
 jemalloc="no"
+replication="yes"
 
 # parse CC options first
 for opt do
@@ -1150,6 +1151,10 @@ for opt do
   ;;
   --enable-jemalloc) jemalloc="yes"
   ;;
+  --disable-replication) replication="no"
+  ;;
+  --enable-replication) replication="yes"
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1380,6 +1385,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   numalibnuma support
   tcmalloctcmalloc support
   jemallocjemalloc support
+  replication replication support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -4895,6 +4901,7 @@ echo "NUMA host support $numa"
 echo "tcmalloc support  $tcmalloc"
 echo "jemalloc support  $jemalloc"
 echo "avx2 optimization $avx2_opt"
+echo "replication support $replication"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -5465,6 +5472,10 @@ if test "$have_rtnetlink" = "yes" ; then
   echo "CONFIG_RTNETLINK=y" >> $config_host_mak
 fi
 
+if test "$replication" = "yes" ; then
+  echo "CONFIG_REPLICATION=y" >> $config_host_mak
+fi
+
 # Hold two types of flag:
 #   CONFIG_THREAD_SETNAME_BYTHREAD  - we've got a way of setting the name on
 # a thread we have a handle to
-- 
1.9.3

[Qemu-block] [PATCH v24 08/12] Introduce new APIs to do replication operation

2016-07-27 Thread Changlong Xie

This commit introduces six replication interfaces(for block, network etc).
Firstly we can use replication_(new/remove) to create/destroy replication
instances, then in migration we can use replication_(start/stop/do_checkpoint
/get_error)_all to handle all replication operations. More detail please
refer to replication.h

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 Makefile.objs|   1 +
 qapi/block-core.json |  13 
 replication.c| 107 +++
 replication.h| 174 +++
 4 files changed, 295 insertions(+)
 create mode 100644 replication.c
 create mode 100644 replication.h

diff --git a/Makefile.objs b/Makefile.objs
index 7f1f0a3..a9d82c3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
 block-obj-y += block/
 block-obj-y += qemu-io-cmds.o
+block-obj-$(CONFIG_REPLICATION) += replication.o
 
 block-obj-m = block/
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 8c92918..8c250ec 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2065,6 +2065,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.8
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
diff --git a/replication.c b/replication.c
new file mode 100644
index 000..be3a42f
--- /dev/null
+++ b/replication.c
@@ -0,0 +1,107 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+static QLIST_HEAD(, ReplicationState) replication_states;
+
+ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
+{
+ReplicationState *rs;
+
+assert(ops != NULL);
+rs = g_new0(ReplicationState, 1);
+rs->opaque = opaque;
+rs->ops = ops;
+QLIST_INSERT_HEAD(_states, rs, node);
+
+return rs;
+}
+
+void replication_remove(ReplicationState *rs)
+{
+if (rs) {
+QLIST_REMOVE(rs, node);
+g_free(rs);
+}
+}
+
+/*
+ * The caller of the function MUST make sure vm stopped
+ */
+void replication_start_all(ReplicationMode mode, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->start) {
+rs->ops->start(rs, mode, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_do_checkpoint_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->checkpoint) {
+rs->ops->checkpoint(rs, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_get_error_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->get_error) {
+rs->ops->get_error(rs, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_stop_all(bool failover, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->stop) {
+rs->ops->stop(rs, failover, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
diff --git a/replication.h b/replication.h
new file mode 100644
index 000..ece6ca6
--- /dev/null
+++ b/replication.h
@@ -0,0 +1,174 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 In

[Qemu-block] [PATCH v24 03/12] Backup: export interfaces for extra serialization

2016-07-27 Thread Changlong Xie

Normal backup(sync='none') workflow:
step 1. NBD peformance I/O write from client to server
   qcow2_co_writev
bdrv_co_writev
 ...
   bdrv_aligned_pwritev
notifier_with_return_list_notify -> backup_do_cow
 bdrv_driver_pwritev // write new contents

step 2. drive-backup sync=none
   backup_do_cow
   {
wait_for_overlapping_requests
cow_request_begin
for(; start < end; start++) {
bdrv_co_readv_no_serialising //read old contents from Secondary disk
bdrv_co_writev // write old contents to hidden-disk
}
cow_request_end
   }

step 3. Then roll back to "step 1" to write new contents to Secondary disk.

And for replication, we must make sure that we only read the old contents from
Secondary disk in order to keep contents consistent.

1) Replication workflow of Secondary
 virtio-blk
  ^
--->  1 NBD   |
   || server   3 replication
   ||^^
   |||   backing backing  |
   ||  Secondary disk 6< hidden-disk 5 < active-disk 4
   ||| ^
   ||'-'
   ||   drive-backup sync=none 2

Hence, we need these interfaces to implement coarse-grained serialization 
between
COW of Secondary disk and the read operation of replication.

Example codes about how to use them:

*#include "block/block_backup.h"

static coroutine_fn int xxx_co_readv()
{
CowRequest req;
BlockJob *job = secondary_disk->bs->job;

if (job) {
  backup_wait_for_overlapping_requests(job, start, end);
  backup_cow_request_begin(, job, start, end);
  ret = bdrv_co_readv();
  backup_cow_request_end();
  goto out;
}
ret = bdrv_co_readv();
out:
    return ret;
}

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 block/backup.c   | 41 ++---
 include/block/block_backup.h | 14 ++
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3bce416..919b63a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -28,13 +28,6 @@
 #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
 #define SLICE_TIME 1ULL /* ns */
 
-typedef struct CowRequest {
-int64_t start;
-int64_t end;
-QLIST_ENTRY(CowRequest) list;
-CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
-
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
@@ -271,6 +264,40 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 bitmap_zero(backup_job->done_bitmap, len);
 }
 
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+wait_for_overlapping_requests(backup_job, start, end);
+}
+
+void backup_cow_request_begin(CowRequest *req, BlockJob *job,
+  int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+cow_request_begin(req, backup_job, start, end);
+}
+
+void backup_cow_request_end(CowRequest *req)
+{
+cow_request_end(req);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
index 157596c..8a75947 100644
--- a/include/block/block_backup.h
+++ b/include/block/block_backup.h
@@ -20,6 +20,20 @@
 
 #include "block/block_int.h"
 
+typedef struct CowRequest {
+int64_t start;
+int64_t end;
+QLIST_ENTRY(CowRequest) list;
+CoQueue wait_queue; /* coroutines blocked on this request */
+} CowRequest;
+
+void backup_wait_for_overlapping_requests(BlockJob *

[Qemu-block] [PATCH v24 06/12] auto complete active commit

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index b1e633e..63ed253 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -856,7 +856,8 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 
@@ -892,6 +893,9 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -930,14 +934,15 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
 mirror_start_job(job_id, bs, target, replaces,
  speed, granularity, buf_size, backing_mode,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- _job_driver, is_none_mode, base);
+ _job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(const char *job_id, BlockDriverState *bs,
  BlockDriverState *base, int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -978,7 +983,7 @@ void commit_active_start(const char *job_id, 
BlockDriverState *bs,
 mirror_start_job(job_id, bs, base, NULL, speed, 0, 0,
  MIRROR_LEAVE_BACKING_CHAIN,
  on_error, on_error, false, cb, opaque, _err,
- _active_job_driver, false, base);
+ _active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 2d38537..38d4c89 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3145,7 +3145,7 @@ void qmp_block_commit(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 commit_active_start(has_job_id ? job_id : NULL, bs, base_bs, speed,
-on_error, block_job_cb, bs, _err);
+on_error, block_job_cb, bs, _err, false);
 } else {
 commit_start(has_job_id ? job_id : NULL, bs, base_bs, top_bs, speed,
  on_error, block_job_cb, bs,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 8054146..0586cee 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -694,13 +694,14 @@ void commit_start(const char *job_id, BlockDriverState 
*bs,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(const char *job_id, BlockDriverState *bs,
  BlockDriverState *base, int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @job_id: The id of the newly-created job, or %NULL to use the
diff --git a/qemu-img.c b/qemu-img.c
index 2e40e1f..ae204c9 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -921,7 +921,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start("commit", bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, , _err);
+common_block_job_cb, , _err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

[Qemu-block] [PATCH v24 01/12] unblock backup operations in backing file

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 block.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 9f037db..63e4636 100644
--- a/block.c
+++ b/block.c
@@ -1310,6 +1310,23 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1 and 2, neither the source nor the target is the backing file.
+ * In case 3, we will block the top BDS, so there is only one block job
+ * for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

[Qemu-block] [PATCH v24 04/12] Link backup into block core

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Some programs that add a dependency on it will use
the block layer directly.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 2593a2f..8a3270b 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,11 +22,11 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 block-obj-y += crypto.o
 
 common-obj-y += stream.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-block] [PATCH v24 02/12] Backup: clear all bitmap when doing block checkpoint

2016-07-27 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block/backup.c   | 18 ++
 include/block/block_backup.h | 25 +
 2 files changed, 43 insertions(+)
 create mode 100644 include/block/block_backup.h

diff --git a/block/backup.c b/block/backup.c
index 2c05323..3bce416 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -17,6 +17,7 @@
 #include "block/block.h"
 #include "block/block_int.h"
 #include "block/blockjob.h"
+#include "block/block_backup.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/ratelimit.h"
@@ -253,6 +254,23 @@ static void backup_attached_aio_context(BlockJob *job, 
AioContext *aio_context)
 blk_set_aio_context(s->target, aio_context);
 }
 
+void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t len;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+len = DIV_ROUND_UP(backup_job->common.len, backup_job->cluster_size);
+bitmap_zero(backup_job->done_bitmap, len);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
new file mode 100644
index 000..157596c
--- /dev/null
+++ b/include/block/block_backup.h
@@ -0,0 +1,25 @@
+/*
+ * QEMU backup
+ *
+ * Copyright (c) 2013 Proxmox Server Solutions
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Authors:
+ *  Dietmar Maurer <diet...@proxmox.com>
+ *  Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef BLOCK_BACKUP_H
+#define BLOCK_BACKUP_H
+
+#include "block/block_int.h"
+
+void backup_do_checkpoint(BlockJob *job, Error **errp);
+
+#endif
-- 
1.9.3

[Qemu-block] [PATCH v24 00/12] Block replication for continuous checkpoints

2016-07-27 Thread Changlong Xie

1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (5):
  Backup: export interfaces for extra serialization
  configure: support replication
  Introduce new APIs to do replication operation
  tests: add unit test case for replication
  MAINTAINERS: add maintainer for replication

Wen Congyang (7):
  unblock backup operations in backing file
  Backup: clear all bitmap when doing block checkpoint
  Link backup into block core
  docs: block replication's description
  auto complete active commit
  Implement new driver for block replication
  support replication driver in blockdev-add

 MAINTAINERS  |   9 +
 Makefile.objs|   1 +
 block.c  |  17 ++
 block/Makefile.objs  |   3 +-
 block/backup.c   |  59 +++-
 block/mirror.c   |  13 +-
 block/replication.c  | 659 +++
 blockdev.c   |   2 +-
 configure|  11 +
 docs/block-replication.txt   | 239 
 include/block/block_backup.h |  39 +++
 include/block/block_int.h|   3 +-
 qapi/block-core.json |  36 ++-
 qemu-img.c   |   2 +-
 replication.c| 107 +++
 replication.h| 174 
 tests/.gitignore |   1 +
 tests/Makefile.include   |   4 +
 tests/test-replication.c | 575 +
 19 files changed, 1937 insertions(+), 17 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt
 create mode 100644 include/block/block_backup.h
 create mode 100644 replication.c
 create mode 100644 replication.h
 create mode 100644 tests/test-replication.c

-- 
1.9.3

Re: [Qemu-block] [PATCH v23 12/12] MAINTAINERS: add maintainer for replication

2016-07-26 Thread Changlong Xie

On 07/27/2016 12:25 AM, Max Reitz wrote:

+replication

While some acronyms are written fully in lower case in this file, this
is not an acronym, so I'd capitalize it as "Replication", or maybe call
it "Block replication" instead.

I just know the rule and "Repliation" is good for me.

>+M: Wen Congyang<we...@cn.fujitsu.com>
>+M: Changlong Xie<xiecl.f...@cn.fujitsu.com>
>+S: Supported
>+F: replication*
>+F: block/replication.c
>+F: test/test-replication.c

docs/block-replication.txt should probably be mentioned as well.

Surely

Max

>+
>  Build and test automation
>  -
>  M: Alex Bennée<alex.ben...@linaro.org>
>

Re: [Qemu-block] [PATCH v23 11/12] support replication driver in blockdev-add

2016-07-26 Thread Changlong Xie


On 07/27/2016 12:22 AM, Max Reitz wrote:

On 26.07.2016 10:15, Changlong Xie wrote:

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
  qapi/block-core.json | 22 --
  1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7258a87..48aa112 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
  #   2.3: 'host_floppy' deprecated
  #   2.5: 'host_floppy' dropped
  #   2.6: 'luks' added
+#   2.8: 'replication' added
  #
  # @backing_file: #optional the name of the backing file (for copy-on-write)
  #
@@ -1696,8 +1697,8 @@
'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
  'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom',
  'host_device', 'http', 'https', 'luks', 'null-aio', 'null-co',
-'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp',
-'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 
'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }

  ##
  # @BlockdevOptionsFile
@@ -2160,6 +2161,22 @@
  { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }

  ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# @top-id: the id to protect replication model chain


It's hard for me to understand this sentence without reading the code
and thus knowing what this ID is used for. I'd use the following instead:

@top-id: In secondary mode, node name or device ID of the root node who
owns the replication node chain. Ignored in primary mode.


Pretty description



Also, since this parameter is only necessary in secondary mode and
completely ignored in primary mode, I would probably make it an optional
parameter.


Surely.



Max


+#
+# Since: 2.8
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode',
+'top-id': 'str' } }
+
+##
  # @BlockdevOptions
  #
  # Options for creating a block device.  Many options are available for all
@@ -2224,6 +2241,7 @@
'quorum': 'BlockdevOptionsQuorum',
'raw':'BlockdevOptionsGenericFormat',
  # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
  # TODO sheepdog: Wait for structured options
  # TODO ssh: Should take InetSocketAddress for 'host'?
'tftp':   'BlockdevOptionsFile',

[Qemu-block] [PATCH v23 10/12] tests: add unit test case for replication

2016-07-26 Thread Changlong Xie

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 tests/.gitignore |   1 +
 tests/Makefile.include   |   4 +
 tests/test-replication.c | 575 +++
 3 files changed, 580 insertions(+)
 create mode 100644 tests/test-replication.c

diff --git a/tests/.gitignore b/tests/.gitignore
index dbb5263..b4a9cfc 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -63,6 +63,7 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
+test-replication
 test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
diff --git a/tests/Makefile.include b/tests/Makefile.include
index 9286148..bc6a44e 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -111,6 +111,7 @@ check-unit-y += tests/test-crypto-xts$(EXESUF)
 check-unit-y += tests/test-crypto-block$(EXESUF)
 gcov-files-test-logging-y = tests/test-logging.c
 check-unit-y += tests/test-logging$(EXESUF)
+check-unit-$(CONFIG_REPLICATION) += tests/test-replication$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -478,6 +479,9 @@ tests/test-base64$(EXESUF): tests/test-base64.o \
 
 tests/test-logging$(EXESUF): tests/test-logging.o $(test-util-obj-y)
 
+tests/test-replication$(EXESUF): tests/test-replication.o $(test-util-obj-y) \
+   $(test-block-obj-y)
+
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json 
$(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
diff --git a/tests/test-replication.c b/tests/test-replication.c
new file mode 100644
index 000..b63f1ef
--- /dev/null
+++ b/tests/test-replication.c
@@ -0,0 +1,575 @@
+/*
+ * Block replication tests
+ *
+ * Copyright (c) 2016 FUJITSU LIMITED
+ * Author: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "replication.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+
+#define IMG_SIZE (64 * 1024 * 1024)
+
+/* primary */
+#define P_ID "primary-id"
+static char p_local_disk[] = "/tmp/p_local_disk.XX";
+
+/* secondary */
+#define S_ID "secondary-id"
+#define S_LOCAL_DISK_ID "secondary-local-disk-id"
+static char s_local_disk[] = "/tmp/s_local_disk.XX";
+static char s_active_disk[] = "/tmp/s_active_disk.XX";
+static char s_hidden_disk[] = "/tmp/s_hidden_disk.XX";
+
+/* FIXME: steal from blockdev.c */
+QemuOptsList qemu_drive_opts = {
+.name = "drive",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_drive_opts.head),
+.desc = {
+{ /* end of list */ }
+},
+};
+
+#define NOT_DONE 0x7fff
+
+static void blk_rw_done(void *opaque, int ret)
+{
+*(int *)opaque = ret;
+}
+
+static void test_blk_read(BlockBackend *blk, long pattern,
+  int64_t pattern_offset, int64_t pattern_count,
+  int64_t offset, int64_t count,
+  bool expect_failed)
+{
+void *pattern_buf = NULL;
+QEMUIOVector qiov;
+void *cmp_buf = NULL;
+int async_ret = NOT_DONE;
+
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+pattern_buf = g_malloc(count);
+if (pattern) {
+memset(pattern_buf, pattern, count);
+} else {
+memset(pattern_buf, 0x00, count);
+}
+
+qemu_iovec_init(, 1);
+qemu_iovec_add(, pattern_buf, count);
+
+blk_aio_preadv(blk, offset, , 0, blk_rw_done, _ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+if (expect_failed) {
+g_assert(async_ret != 0);
+} else {
+g_assert(async_ret == 0);
+if (pattern) {
+g_assert(memcmp(pattern_buf + pattern_offset,
+cmp_buf, pattern_count) <= 0);
+}
+}
+
+g_free(pattern_buf);
+}
+
+static void test_blk_write(BlockBackend *blk, long pattern, int64_t offset,
+   int64_t count, bool expect_failed)
+{
+void *pattern_buf = NULL;
+QEMUIOVector qiov;
+int async_ret = NOT_DONE;
+
+pattern_buf = g_malloc(count);
+if (pattern) {
+memset(pattern_buf, pattern, count);
+} else {
+memset(pattern_buf, 0x00, count);
+}
+
+qemu_iovec_init(, 1);
+qemu_iovec_add(, pattern_buf, count);
+
+blk_aio_pwritev(blk, offset, , 0, blk_rw_done, _ret);
+while (async_ret == NOT_DONE) {
+main_loop_wait(false);
+}
+
+if (expect_failed) {
+g_assert(a

[Qemu-block] [PATCH v23 09/12] Implement new driver for block replication

2016-07-26 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 658 
 2 files changed, 659 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 8a3270b..55da626 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-$(CONFIG_REPLICATION) += replication.o
 
 block-obj-y += crypto.o
 
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..ec35348
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,658 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <we...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "block/nbd.h"
+#include "block/blockjob.h"
+#include "block/block_int.h"
+#include "block/block_backup.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BdrvChild *active_disk;
+BdrvChild *hidden_disk;
+BdrvChild *secondary_disk;
+char *top_id;
+ReplicationState *rs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed */
+BLOCK_REPLICATION_DONE, /* block replication is done */
+};
+
+static void replication_start(ReplicationState *rs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(ReplicationState *rs, Error **errp);
+static void replication_get_error(ReplicationState *rs, Error **errp);
+static void replication_stop(ReplicationState *rs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+#define REPLICATION_TOP_ID  "top-id"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_TOP_ID,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static ReplicationOps replication_ops = {
+.start = replication_start,
+.checkpoint = replication_do_checkpoint,
+.get_error = replication_get_error,
+.stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+const char *top_id;
+
+ret = -EINVAL;
+opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
+qemu_opts_absorb_qdict(opts, options, _err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
+s->top_id = g_strdup(top_id);
+if (!s->top_id) {
+error_setg(_err, "Missing the option top-id");
+goto fail;
+}
+} else {
+error_setg(_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+s->rs = replication_new(bs, _ops);
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+error_propagate(errp, local_e

[Qemu-block] [PATCH v23 08/12] Introduce new APIs to do replication operation

2016-07-26 Thread Changlong Xie

This commit introduces six replication interfaces(for block, network etc).
Firstly we can use replication_(new/remove) to create/destroy replication
instances, then in migration we can use replication_(start/stop/do_checkpoint
/get_error)_all to handle all replication operations. More detail please
refer to replication.h

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 Makefile.objs|   1 +
 qapi/block-core.json |  13 
 replication.c| 107 +++
 replication.h| 174 +++
 4 files changed, 295 insertions(+)
 create mode 100644 replication.c
 create mode 100644 replication.h

diff --git a/Makefile.objs b/Makefile.objs
index 6d5ddcf..7301544 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
 block-obj-y += block/
 block-obj-y += qemu-io-cmds.o
+block-obj-$(CONFIG_REPLICATION) += replication.o
 
 block-obj-m = block/
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index f462345..7258a87 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2147,6 +2147,19 @@
 '*debug_level': 'int' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.8
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
diff --git a/replication.c b/replication.c
new file mode 100644
index 000..be3a42f
--- /dev/null
+++ b/replication.c
@@ -0,0 +1,107 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+static QLIST_HEAD(, ReplicationState) replication_states;
+
+ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
+{
+ReplicationState *rs;
+
+assert(ops != NULL);
+rs = g_new0(ReplicationState, 1);
+rs->opaque = opaque;
+rs->ops = ops;
+QLIST_INSERT_HEAD(_states, rs, node);
+
+return rs;
+}
+
+void replication_remove(ReplicationState *rs)
+{
+if (rs) {
+QLIST_REMOVE(rs, node);
+g_free(rs);
+}
+}
+
+/*
+ * The caller of the function MUST make sure vm stopped
+ */
+void replication_start_all(ReplicationMode mode, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->start) {
+rs->ops->start(rs, mode, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_do_checkpoint_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->checkpoint) {
+rs->ops->checkpoint(rs, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_get_error_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->get_error) {
+rs->ops->get_error(rs, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_stop_all(bool failover, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->stop) {
+rs->ops->stop(rs, failover, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
diff --git a/replication.h b/replication.h
new file mode 100644
index 000..ece6ca6
--- /dev/null
+++ b/replication.h
@@ -0,0 +1,174 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copy

[Qemu-block] [PATCH v23 11/12] support replication driver in blockdev-add

2016-07-26 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 qapi/block-core.json | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7258a87..48aa112 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
 #   2.6: 'luks' added
+#   2.8: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1696,8 +1697,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom',
 'host_device', 'http', 'https', 'luks', 'null-aio', 'null-co',
-'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp',
-'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 
'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile
@@ -2160,6 +2161,22 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# @top-id: the id to protect replication model chain
+#
+# Since: 2.8
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode',
+'top-id': 'str' } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
@@ -2224,6 +2241,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-block] [PATCH v23 12/12] MAINTAINERS: add maintainer for replication

2016-07-26 Thread Changlong Xie

As per Stefan's suggestion, add Wen and I as co-maintainers
of replication.

Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 MAINTAINERS | 8 
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index d1439a8..8fa2a25 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1619,6 +1619,14 @@ L: qemu-block@nongnu.org
 S: Supported
 F: tests/image-fuzzer/
 
+replication
+M: Wen Congyang <we...@cn.fujitsu.com>
+M: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+S: Supported
+F: replication*
+F: block/replication.c
+F: test/test-replication.c
+
 Build and test automation
 -
 M: Alex Bennée <alex.ben...@linaro.org>
-- 
1.9.3

[Qemu-block] [PATCH v23 07/12] configure: support replication

2016-07-26 Thread Changlong Xie

configure --(enable/disable)-replication to switch replication
support on/off, and it is on by default.
We later introduce replation support.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 configure | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/configure b/configure
index 6ffa4a8..20a6564 100755
--- a/configure
+++ b/configure
@@ -320,6 +320,7 @@ vhdx=""
 numa=""
 tcmalloc="no"
 jemalloc="no"
+replication="yes"
 
 # parse CC options first
 for opt do
@@ -1150,6 +1151,10 @@ for opt do
   ;;
   --enable-jemalloc) jemalloc="yes"
   ;;
+  --disable-replication) replication="no"
+  ;;
+  --enable-replication) replication="yes"
+  ;;
   *)
   echo "ERROR: unknown option $opt"
   echo "Try '$0 --help' for more information"
@@ -1380,6 +1385,7 @@ disabled with --disable-FEATURE, default is enabled if 
available:
   numalibnuma support
   tcmalloctcmalloc support
   jemallocjemalloc support
+  replication replication support
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -4896,6 +4902,7 @@ echo "NUMA host support $numa"
 echo "tcmalloc support  $tcmalloc"
 echo "jemalloc support  $jemalloc"
 echo "avx2 optimization $avx2_opt"
+echo "replication support $replication"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -5466,6 +5473,10 @@ if test "$have_rtnetlink" = "yes" ; then
   echo "CONFIG_RTNETLINK=y" >> $config_host_mak
 fi
 
+if test "$replication" = "yes" ; then
+  echo "CONFIG_REPLICATION=y" >> $config_host_mak
+fi
+
 # Hold two types of flag:
 #   CONFIG_THREAD_SETNAME_BYTHREAD  - we've got a way of setting the name on
 # a thread we have a handle to
-- 
1.9.3

[Qemu-block] [PATCH v23 05/12] docs: block replication's description

2016-07-26 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 docs/block-replication.txt | 239 +
 1 file changed, 239 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..6bde673
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,239 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2016
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of the Primary and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort during a vmstate checkpoint, the disk modification operations of
+the Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+'

[Qemu-block] [PATCH v23 00/12] Block replication for continuous checkpoints

2016-07-26 Thread Changlong Xie

ere are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (5):
  Backup: export interfaces for extra serialization
  configure: support replication
  Introduce new APIs to do replication operation
  tests: add unit test case for replication
  MAINTAINERS: add maintainer for replication

Wen Congyang (7):
  unblock backup operations in backing file
  Backup: clear all bitmap when doing block checkpoint
  Link backup into block core
  docs: block replication's description
  auto complete active commit
  Implement new driver for block replication
  support replication driver in blockdev-add

 MAINTAINERS  |   8 +
 Makefile.objs|   1 +
 block.c  |  17 ++
 block/Makefile.objs  |   3 +-
 block/backup.c   |  59 +++-
 block/mirror.c   |  13 +-
 block/replication.c  | 658 +++
 blockdev.c   |   2 +-
 configure|  11 +
 docs/block-replication.txt   | 239 
 include/block/block_backup.h |  39 +++
 include/block/block_int.h|   3 +-
 qapi/block-core.json |  35 ++-
 qemu-img.c   |   2 +-
 replication.c| 107 +++
 replication.h| 174 
 tests/.gitignore |   1 +
 tests/Makefile.include   |   4 +
 tests/test-replication.c | 575 +
 19 files changed, 1934 insertions(+), 17 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt
 create mode 100644 include/block/block_backup.h
 create mode 100644 replication.c
 create mode 100644 replication.h
 create mode 100644 tests/test-replication.c

-- 
1.9.3

[Qemu-block] [PATCH v23 04/12] Link backup into block core

2016-07-26 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Some programs that add a dependency on it will use
the block layer directly.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 2593a2f..8a3270b 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,11 +22,11 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 block-obj-y += crypto.o
 
 common-obj-y += stream.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-block] [PATCH v23 03/12] Backup: export interfaces for extra serialization

2016-07-26 Thread Changlong Xie

Normal backup(sync='none') workflow:
step 1. NBD peformance I/O write from client to server
   qcow2_co_writev
bdrv_co_writev
 ...
   bdrv_aligned_pwritev
notifier_with_return_list_notify -> backup_do_cow
 bdrv_driver_pwritev // write new contents

step 2. drive-backup sync=none
   backup_do_cow
   {
wait_for_overlapping_requests
cow_request_begin
for(; start < end; start++) {
bdrv_co_readv_no_serialising //read old contents from Secondary disk
bdrv_co_writev // write old contents to hidden-disk
}
cow_request_end
   }

step 3. Then roll back to "step 1" to write new contents to Secondary disk.

And for replication, we must make sure that we only read the old contents from
Secondary disk in order to keep contents consistent.

1) Replication workflow of Secondary
 virtio-blk
  ^
--->  1 NBD   |
   || server   3 replication
   ||^^
   |||   backing backing  |
   ||  Secondary disk 6< hidden-disk 5 < active-disk 4
   ||| ^
   ||'-'
   ||   drive-backup sync=none 2

Hence, we need these interfaces to implement coarse-grained serialization 
between
COW of Secondary disk and the read operation of replication.

Example codes about how to use them:

*#include "block/block_backup.h"

static coroutine_fn int xxx_co_readv()
{
CowRequest req;
BlockJob *job = secondary_disk->bs->job;

if (job) {
  backup_wait_for_overlapping_requests(job, start, end);
  backup_cow_request_begin(, job, start, end);
  ret = bdrv_co_readv();
  backup_cow_request_end();
  goto out;
}
ret = bdrv_co_readv();
out:
    return ret;
}

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 block/backup.c   | 41 ++---
 include/block/block_backup.h | 14 ++
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3bce416..919b63a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -28,13 +28,6 @@
 #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
 #define SLICE_TIME 1ULL /* ns */
 
-typedef struct CowRequest {
-int64_t start;
-int64_t end;
-QLIST_ENTRY(CowRequest) list;
-CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
-
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
@@ -271,6 +264,40 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 bitmap_zero(backup_job->done_bitmap, len);
 }
 
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+wait_for_overlapping_requests(backup_job, start, end);
+}
+
+void backup_cow_request_begin(CowRequest *req, BlockJob *job,
+  int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+cow_request_begin(req, backup_job, start, end);
+}
+
+void backup_cow_request_end(CowRequest *req)
+{
+cow_request_end(req);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
index 157596c..8a75947 100644
--- a/include/block/block_backup.h
+++ b/include/block/block_backup.h
@@ -20,6 +20,20 @@
 
 #include "block/block_int.h"
 
+typedef struct CowRequest {
+int64_t start;
+int64_t end;
+QLIST_ENTRY(CowRequest) list;
+CoQueue wait_queue; /* coroutines blocked on this request */
+} CowRequest;
+
+void backup_wait_for_overlapping_requests(BlockJob *

[Qemu-block] [PATCH v23 01/12] unblock backup operations in backing file

2016-07-26 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 block.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 30d64e6..194a060 100644
--- a/block.c
+++ b/block.c
@@ -1311,6 +1311,23 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1 and 2, neither the source nor the target is the backing file.
+ * In case 3, we will block the top BDS, so there is only one block job
+ * for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

[Qemu-block] [PATCH v23 06/12] auto complete active commit

2016-07-26 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 69a1a7c..30c2477 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -906,7 +906,8 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 
@@ -942,6 +943,9 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -980,14 +984,15 @@ void mirror_start(const char *job_id, BlockDriverState 
*bs,
 mirror_start_job(job_id, bs, target, replaces,
  speed, granularity, buf_size, backing_mode,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- _job_driver, is_none_mode, base);
+ _job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(const char *job_id, BlockDriverState *bs,
  BlockDriverState *base, int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -1028,7 +1033,7 @@ void commit_active_start(const char *job_id, 
BlockDriverState *bs,
 mirror_start_job(job_id, bs, base, NULL, speed, 0, 0,
  MIRROR_LEAVE_BACKING_CHAIN,
  on_error, on_error, false, cb, opaque, _err,
- _active_job_driver, false, base);
+ _active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index eafeba9..be7be7b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3140,7 +3140,7 @@ void qmp_block_commit(bool has_job_id, const char 
*job_id, const char *device,
 goto out;
 }
 commit_active_start(has_job_id ? job_id : NULL, bs, base_bs, speed,
-on_error, block_job_cb, bs, _err);
+on_error, block_job_cb, bs, _err, false);
 } else {
 commit_start(has_job_id ? job_id : NULL, bs, base_bs, top_bs, speed,
  on_error, block_job_cb, bs,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1fe0fd9..f812740 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -699,13 +699,14 @@ void commit_start(const char *job_id, BlockDriverState 
*bs,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(const char *job_id, BlockDriverState *bs,
  BlockDriverState *base, int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @job_id: The id of the newly-created job, or %NULL to use the
diff --git a/qemu-img.c b/qemu-img.c
index 2e40e1f..ae204c9 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -921,7 +921,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start("commit", bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, , _err);
+common_block_job_cb, , _err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

Re: [Qemu-block] [PATCH v22 00/10] Block replication for continuous checkpoints

2016-07-25 Thread Changlong Xie


On 07/25/2016 10:34 PM, Stefan Hajnoczi wrote:

On Mon, Jul 25, 2016 at 11:44:34AM +0800, Changlong Xie wrote:

COLO block is the necessary prerequisite of COLO framework and COLO network,
what are blocked by these patchsets now.

Since v19, Stefan said he had reviewed most part of this patchsets. So, this
series *REALLY* need more comments from all of you.

I've ping so many times, but no response until now. All i have to do is
rebase them to the lastest code.

Do you have time to review this?  Any response would be appreciate.


There has been no review activity for a long time.  You have been
patient and addressed issues that I raised back when I reviewed the
series.  Both you and I have pinged maintainers numerous times.  It's
not fair to keep this out of tree at this stage.

Since no one else seems to care I suggest rebasing/testing a final time
with the following changes:

1. Update the MAINTAINERS for new files you are adding.
2. Make the feature optional in ./configure so distros that don't feel
comfortable shipping it yet can easily disable it:

  $ ./configure --disable-replication && make -j4



Will follow you advise in next version.


Then I will merge it if there are no comments.

Stefan

Re: [Qemu-block] [PATCH v22 10/10] support replication driver in blockdev-add

2016-07-25 Thread Changlong Xie


On 07/26/2016 07:00 AM, Max Reitz wrote:

On 22.07.2016 12:16, Wang WeiWei wrote:

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
  qapi/block-core.json | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7f05b68..59565e9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
  #   2.3: 'host_floppy' deprecated
  #   2.5: 'host_floppy' dropped
  #   2.6: 'luks' added
+#   2.7: 'replication' added


Probably 2.8


yes




  #
  # @backing_file: #optional the name of the backing file (for copy-on-write)
  #
@@ -1696,8 +1697,8 @@
'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
  'dmg', 'file', 'ftp', 'ftps', 'gluster', 'host_cdrom',
  'host_device', 'http', 'https', 'luks', 'null-aio', 'null-co',
-'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp',
-'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+'parallels', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 
'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }

  ##
  # @BlockdevOptionsFile
@@ -2160,6 +2161,19 @@
  { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }

  ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode


What about top-id?


Sorry, i'll add it in next version




+#
+# Since: 2.7


2.8


yes.



Max


+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
  # @BlockdevOptions
  #
  # Options for creating a block device.  Many options are available for all
@@ -2224,6 +2238,7 @@
'quorum': 'BlockdevOptionsQuorum',
'raw':'BlockdevOptionsGenericFormat',
  # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
  # TODO sheepdog: Wait for structured options
  # TODO ssh: Should take InetSocketAddress for 'host'?
'tftp':   'BlockdevOptionsFile',

Re: [Qemu-block] [PATCH v22 07/10] Introduce new APIs to do replication operation

2016-07-25 Thread Changlong Xie


On 07/26/2016 06:23 AM, Max Reitz wrote:

+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>+#
>+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>+#
>+# Since: 2.7

Probably 2.8 now.



I'll update 2.7 to 2.8 for all these series


Max

Re: [Qemu-block] [PATCH v22 03/10] Backup: export interfaces for extra serialization

2016-07-25 Thread Changlong Xie


On 07/26/2016 05:50 AM, Max Reitz wrote:

On 22.07.2016 12:16, Wang WeiWei wrote:

From: Changlong Xie <xiecl.f...@cn.fujitsu.com>

Normal backup(sync='none') workflow:
step 1. NBD peformance I/O write from client to server
qcow2_co_writev
 bdrv_co_writev
  ...
bdrv_aligned_pwritev
 notifier_with_return_list_notify -> backup_do_cow
  bdrv_driver_pwritev // write new contents

step 2. drive-backup sync=none
backup_do_cow
{
 wait_for_overlapping_requests
 cow_request_begin
 for(; start < end; start++) {
 bdrv_co_readv_no_serialising //read old contents from Secondary 
disk
 bdrv_co_writev // write old contents to hidden-disk
 }
 cow_request_end
}

step 3. Then roll back to "step 1" to write new contents to Secondary disk.

And for replication, we must make sure that we only read the old contents from
Secondary disk in order to keep contents consistent.

1) Replication workflow of Secondary
  virtio-blk
   ^
--->  1 NBD   |
|| server   3 replication
||^^
|||   backing backing  |
||  Secondary disk 6< hidden-disk 5 < active-disk 4
||| ^
||'-'
||   drive-backup sync=none 2

Hence, we need these interfaces to implement coarse-grained serialization 
between
COW of Secondary disk and the read operation of replication.

Example codes about how to use them:

*#include "block/block_backup.h"

static coroutine_fn int xxx_co_readv()
{
 CowRequest req;
 BlockJob *job = secondary_disk->bs->job;

 if (job) {
   backup_wait_for_overlapping_requests(job, start, end);
   backup_cow_request_begin(, job, start, end);
   ret = bdrv_co_readv();
   backup_cow_request_end();
   goto out;
 }
 ret = bdrv_co_readv();
out:
     return ret;
}

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
  block/backup.c   | 41 ++---
  include/block/block_backup.h | 14 ++
  2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3bce416..919b63a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -28,13 +28,6 @@
  #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
  #define SLICE_TIME 1ULL /* ns */

-typedef struct CowRequest {
-int64_t start;
-int64_t end;
-QLIST_ENTRY(CowRequest) list;
-CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
-
  typedef struct BackupBlockJob {
  BlockJob common;
  BlockBackend *target;
@@ -271,6 +264,40 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
  bitmap_zero(backup_job->done_bitmap, len);
  }

+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);


It's not really ideal to use backup_job here...


For COLO, Secondary disk always does backup job before failover.




+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);


...and then only here assert that it's actually a valid object pointer.

Not catastrophic, though, since it's an assertion and thus deemed
impossible to go wrong anyway.


This is just what I thought



Max


+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+wait_for_overlapping_requests(backup_job, start, end);
+}
+
+void backup_cow_request_begin(CowRequest *req, BlockJob *job,
+  int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+cow_request_begin(req, backup_job, start, end);
+}
+
+void backup_cow_request_end(CowRequest *req)
+{
+cow_request_end(req);
+}
+
  static const BlockJobDriver backup_job_driver = {
  .instance_size  = sizeof(BackupBlockJob),
  .job_type

Re: [Qemu-block] [PATCH v22 02/10] Backup: clear all bitmap when doing block checkpoint

2016-07-25 Thread Changlong Xie


On 07/26/2016 05:18 AM, Max Reitz wrote:

On 22.07.2016 12:15, Wang WeiWei wrote:

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wang WeiWei <wangww.f...@cn.fujitsu.com>
---
  block/backup.c   | 18 ++
  include/block/block_backup.h |  3 +++
  2 files changed, 21 insertions(+)
  create mode 100644 include/block/block_backup.h


[...]


diff --git a/include/block/block_backup.h b/include/block/block_backup.h
new file mode 100644
index 000..3753bcb
--- /dev/null
+++ b/include/block/block_backup.h
@@ -0,0 +1,3 @@
+#include "block/block_int.h"
+
+void backup_do_checkpoint(BlockJob *job, Error **errp);


Include guards and copyright notice missing. I don't mind the missing
copyright notice because it will default to GPL anyway. Also, I


I think i can copy the Copyright from backup.c and update it


personally don't really mind the missing include guards, but I thought


Btw i'll add #ifndef...#define...#endif in next version


I'd say something anyway.


warm welcome



Max

Re: [Qemu-block] [PATCH v22 00/10] Block replication for continuous checkpoints

2016-07-24 Thread Changlong Xie

ing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (3):
   Backup: export interfaces for extra serialization
   Introduce new APIs to do replication operation
   tests: add unit test case for replication

Wen Congyang (7):
   unblock backup operations in backing file
   Backup: clear all bitmap when doing block checkpoint
   Link backup into block core
   docs: block replication's description
   auto complete active commit
   Implement new driver for block replication
   support replication driver in blockdev-add

  Makefile.objs|   1 +
  block.c  |  17 ++
  block/Makefile.objs  |   3 +-
  block/backup.c   |  59 +++-
  block/mirror.c   |  13 +-
  block/replication.c  | 658 +++
  blockdev.c   |   2 +-
  docs/block-replication.txt   | 239 
  include/block/block_backup.h |  17 ++
  include/block/block_int.h|   3 +-
  qapi/block-core.json |  32 ++-
  qemu-img.c   |   2 +-
  replication.c| 107 +++
  replication.h| 174 
  tests/.gitignore |   1 +
  tests/Makefile.include   |   4 +
  tests/test-replication.c | 575 +
  17 files changed, 1890 insertions(+), 17 deletions(-)
  create mode 100644 block/replication.c
  create mode 100644 docs/block-replication.txt
  create mode 100644 include/block/block_backup.h
  create mode 100644 replication.c
  create mode 100644 replication.h
  create mode 100644 tests/test-replication.c

[Qemu-block] [PATCH v21 08/10] Implement new driver for block replication

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 657 
 2 files changed, 658 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fbfe647..5e28b45 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 block-obj-y += crypto.o
 
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..1dabb5d
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,657 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <we...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "block/nbd.h"
+#include "block/blockjob.h"
+#include "block/block_int.h"
+#include "block/block_backup.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BdrvChild *active_disk;
+BdrvChild *hidden_disk;
+BdrvChild *secondary_disk;
+char *top_id;
+ReplicationState *rs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed */
+BLOCK_REPLICATION_DONE, /* block replication is done */
+};
+
+static void replication_start(ReplicationState *rs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(ReplicationState *rs, Error **errp);
+static void replication_get_error(ReplicationState *rs, Error **errp);
+static void replication_stop(ReplicationState *rs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+#define REPLICATION_TOP_ID  "top-id"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_TOP_ID,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static ReplicationOps replication_ops = {
+.start = replication_start,
+.checkpoint = replication_do_checkpoint,
+.get_error = replication_get_error,
+.stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+const char *top_id;
+
+ret = -EINVAL;
+opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
+qemu_opts_absorb_qdict(opts, options, _err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
+s->top_id = g_strdup(top_id);
+if (!s->top_id) {
+error_setg(_err, "Missing the option top-id");
+goto fail;
+}
+} else {
+error_setg(_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+s->rs = replication_new(bs, _ops);
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+error_propagate(errp, local_err);
+
+return ret;
+}
+
+static void replication_close(BlockDrive

[Qemu-block] [PATCH v21 05/10] docs: block replication's description

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 docs/block-replication.txt | 239 +
 1 file changed, 239 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..6bde673
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,239 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2016
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of the Primary and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort during a vmstate checkpoint, the disk modification operations of
+the Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+'

[Qemu-block] [PATCH v21 09/10] tests: add unit test case for replication

2016-07-05 Thread Changlong Xie

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 tests/.gitignore |   1 +
 tests/Makefile.include   |   4 +
 tests/test-replication.c | 557 +++
 3 files changed, 562 insertions(+)
 create mode 100644 tests/test-replication.c

diff --git a/tests/.gitignore b/tests/.gitignore
index 840ea39..fc41498 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -62,6 +62,7 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
+test-replication
 test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
diff --git a/tests/Makefile.include b/tests/Makefile.include
index f8e3c6b..270ef43 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -109,6 +109,7 @@ check-unit-y += tests/test-crypto-xts$(EXESUF)
 check-unit-y += tests/test-crypto-block$(EXESUF)
 gcov-files-test-logging-y = tests/test-logging.c
 check-unit-y += tests/test-logging$(EXESUF)
+check-unit-y += tests/test-replication$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -469,6 +470,9 @@ tests/test-base64$(EXESUF): tests/test-base64.o \
 
 tests/test-logging$(EXESUF): tests/test-logging.o $(test-util-obj-y)
 
+tests/test-replication$(EXESUF): tests/test-replication.o $(test-util-obj-y) \
+   $(test-block-obj-y)
+
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json 
$(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
diff --git a/tests/test-replication.c b/tests/test-replication.c
new file mode 100644
index 000..23d25f9
--- /dev/null
+++ b/tests/test-replication.c
@@ -0,0 +1,557 @@
+/*
+ * Block replication tests
+ *
+ * Copyright (c) 2016 FUJITSU LIMITED
+ * Author: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "replication.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+
+#define IMG_SIZE (64 * 1024 * 1024)
+
+/* primary */
+static char p_local_disk[] = "/tmp/p_local_disk.XX";
+
+/* secondary */
+#define S_ID "secondary-id"
+#define S_LOCAL_DISK_ID "secondary-local-disk-id"
+static char s_local_disk[] = "/tmp/s_local_disk.XX";
+static char s_active_disk[] = "/tmp/s_active_disk.XX";
+static char s_hidden_disk[] = "/tmp/s_hidden_disk.XX";
+
+/* FIXME: steal from blockdev.c */
+QemuOptsList qemu_drive_opts = {
+.name = "drive",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_drive_opts.head),
+.desc = {
+{ /* end of list */ }
+},
+};
+
+static void io_read(BlockDriverState *bs, long pattern, int64_t pattern_offset,
+int64_t pattern_count, int64_t offset, int64_t count,
+bool expect_failed)
+{
+char *buf;
+void *cmp_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+/* alloc read buffer */
+buf = qemu_blockalign(bs, count);
+memset(buf, 0xab, count);
+
+/* do read */
+ret = bdrv_read(bs, offset >> BDRV_SECTOR_BITS, (uint8_t *)buf,
+count >> BDRV_SECTOR_BITS);
+
+/* assert and compare buf */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+if (pattern) {
+g_assert(memcmp(buf + pattern_offset, cmp_buf, pattern_count) <= 
0);
+}
+}
+
+g_free(cmp_buf);
+qemu_vfree(buf);
+}
+
+static void io_write(BlockDriverState *bs, long pattern, int64_t offset,
+ int64_t count, bool expect_failed)
+{
+void *pattern_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+pattern_buf = qemu_blockalign(bs, count);
+memset(pattern_buf, pattern, count);
+}
+
+/* do write */
+if (pattern) {
+ret = bdrv_write(bs, offset >> BDRV_SECTOR_BITS, (uint8_t 
*)pattern_buf,
+ count >> BDRV_SECTOR_BITS);
+} else {
+ret = bdrv_pwrite_zeroes(bs, offset, count, 0);
+}
+
+/* assert */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+}
+
+qemu_vfree(pattern_buf);
+}
+
+/*
+ * Create a uniquely-named empty temporary file.
+ */
+static void make_temp(char *template)
+{
+int fd;
+
+fd = mkstemp(template);
+g_assert(fd >= 0);
+close(fd);
+}
+
+
+static void prepare_imgs(void)
+{
+Error *local_err = NULL;
+
+make_temp(p_local_disk);
+

[Qemu-block] [PATCH v21 06/10] auto complete active commit

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 8d96049..c5ae246 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -853,7 +853,8 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 
@@ -889,6 +890,9 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -927,14 +931,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState 
*target,
 mirror_start_job(bs, target, replaces,
  speed, granularity, buf_size, backing_mode,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- _job_driver, is_none_mode, base);
+ _job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -974,7 +979,7 @@ void commit_active_start(BlockDriverState *bs, 
BlockDriverState *base,
 
 mirror_start_job(bs, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN,
  on_error, on_error, false, cb, opaque, _err,
- _active_job_driver, false, base);
+ _active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 3a104a0..e755680 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3168,7 +3168,7 @@ void qmp_block_commit(const char *device,
 goto out;
 }
 commit_active_start(bs, base_bs, speed, on_error, block_job_cb,
-bs, _err);
+bs, _err, false);
 } else {
 commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs,
  has_backing_file ? backing_file : NULL, _err);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 2057156..39d14f1 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -670,13 +670,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState 
*base,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @bs: Block device to operate on.
diff --git a/qemu-img.c b/qemu-img.c
index 3322a1e..9cdafb0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -922,7 +922,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, , _err);
+common_block_job_cb, , _err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

[Qemu-block] [PATCH v21 10/10] support replication driver in blockdev-add

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 qapi/block-core.json | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e56cdf4..b9f9839 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
 #   2.6: 'luks' added
+#   2.7: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1632,6 +1633,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.7
 #
 # Since: 2.0
 ##
@@ -1639,8 +1641,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'luks', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', 'tftp',
+'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile
@@ -2045,6 +2047,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.7
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
@@ -2125,6 +2140,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-block] [PATCH v21 03/10] Backup: export interfaces for extra serialization

2016-07-05 Thread Changlong Xie

Normal backup(sync='none') workflow:
step 1. NBD peformance I/O write from client to server
   qcow2_co_writev
bdrv_co_writev
 ...
   bdrv_aligned_pwritev
notifier_with_return_list_notify -> backup_do_cow
 bdrv_driver_pwritev // write new contents

step 2. drive-backup sync=none
   backup_do_cow
   {
wait_for_overlapping_requests
cow_request_begin
for(; start < end; start++) {
bdrv_co_readv_no_serialising //read old contents from Secondary disk
bdrv_co_writev // write old contents to hidden-disk
}
cow_request_end
   }

step 3. Then roll back to "step 1" to write new contents to Secondary disk.

And for replication, we must make sure that we only read the old contents from
Secondary disk in order to keep contents consistent.

1) Replication workflow of Secondary
 virtio-blk
  ^
--->  1 NBD   |
   || server   3 replication
   ||^^
   |||   backing backing  |
   ||  Secondary disk 6< hidden-disk 5 < active-disk 4
   ||| ^
   ||'-'
   ||   drive-backup sync=none 2

Hence, we need these interfaces to implement coarse-grained serialization 
between
COW of Secondary disk and the read operation of replication.

Example codes about how to use them:

*#include "block/block_backup.h"

static coroutine_fn int xxx_co_readv()
{
CowRequest req;
BlockJob *job = secondary_disk->bs->job;

if (job) {
  backup_wait_for_overlapping_requests(job, start, end);
  backup_cow_request_begin(, job, start, end);
  ret = bdrv_co_readv();
  backup_cow_request_end();
  goto out;
}
ret = bdrv_co_readv();
out:
    return ret;
}

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block/backup.c   | 41 ++---
 include/block/block_backup.h | 14 ++
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 1964a5a..bc935b4 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -28,13 +28,6 @@
 #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
 #define SLICE_TIME 1ULL /* ns */
 
-typedef struct CowRequest {
-int64_t start;
-int64_t end;
-QLIST_ENTRY(CowRequest) list;
-CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
-
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
@@ -271,6 +264,40 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 bitmap_zero(backup_job->done_bitmap, len);
 }
 
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+wait_for_overlapping_requests(backup_job, start, end);
+}
+
+void backup_cow_request_begin(CowRequest *req, BlockJob *job,
+  int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+cow_request_begin(req, backup_job, start, end);
+}
+
+void backup_cow_request_end(CowRequest *req)
+{
+cow_request_end(req);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
index 3753bcb..e0e7ce6 100644
--- a/include/block/block_backup.h
+++ b/include/block/block_backup.h
@@ -1,3 +1,17 @@
 #include "block/block_int.h"
 
+typedef struct CowRequest {
+int64_t start;
+int64_t end;
+QLIST_ENTRY(CowRequest) list;
+CoQueue wait_queue; /* coroutines blocked on this request */
+} CowRequest;
+
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+

[Qemu-block] [PATCH v21 00/10] Block replication for continuous checkpoints

2016-07-05 Thread Changlong Xie

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v21

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v20

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V21:
1. Rebase to the lastest code
2. use bdrv_pwrite_zeroes() and BDRV_SECTOR_BITS for p9
V20 Resend:
1. Resend to avoid bothering qemu-trivial maintainers
2. Address comments from Eric, fix header file issue and add a brief commit 
message for p7
V20:
1. Rebase to the lastest code
2. Address comments from stefan
p8: 
1. error_setg() with an error message when check_top_bs() fails. 
2. remove bdrv_ref(s->hidden_disk->bs) since commit 5c438bc6
3. use bloc_job_cancel_sync() before active commit
p9: 
1. fix uninitialized 'pattern_buf'
2. introduce mkstemp(3) to fix unique filenames
3. use qemu_vfree() for qemu_blockalign() memory
4. add missing replication_start_all()
5. remove useless pattern for io_write()
V19:
1. Rebase to v2.6.0
2. Address comments from stefan
p3: a new patch that export interfaces for extra serialization
p8: 
1. call replication_stop() before freeing s->top_id
2. check top_bs
3. reopen file readonly in error return paths
4. enable extra serialization between read and COW
p9: try to hanlde SIGABRT
V18:
p6: add local_err in all replication callbacks to prevent "errp == NULL"
p7: add missing qemu_iovec_destroy(xxx)
V17:
1. Rebase to the lastest codes 
p2: refactor backup_do_checkpoint addressed comments from Jeff Cody
p4: fix bugs in "drive_add buddy xxx" hmp commands
p6: add "since: 2.7"
p7: fix bug in replication_close(), add missing "qapi/error.h", add 
test-replication 
p8: add "since: 2.7"
V16:
1. Rebase to the newest codes
2. Address comments from Stefan & hailiang
p3: we don't need this patch now
p4: add "top-id" parameters for secondary
p6: fix NULL pointer in replication callbacks, remove unnecessary typedefs, 
add doc comments that explain the semantics of Replication
p7: Refactor AioContext for thread-safe, remove unnecessary get_top_bs()
*Note*: I'm working on replication testcase now, will send out in V17
V15:
1. Rebase to the newest codes
2. Fix typos and coding style addresed Eric's comments
3. Address Stefan's comments
   1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
   2) Update the message and description for [PATCH 4/9]
   3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
   4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
   5) Use BdrvChild instead of holding on to BlockDriverState * pointers
4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771  
5. Introduce replication_get_error_all to check replication status
6. Remove useless discard interface
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
   structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c 
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from

[Qemu-block] [PATCH v21 04/10] Link backup into block core

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Some programs that add a dependency on it will use
the block layer directly.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 44a5416..fbfe647 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,12 +22,12 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 block-obj-y += crypto.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-block] [PATCH v21 02/10] Backup: clear all bitmap when doing block checkpoint

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/backup.c   | 18 ++
 include/block/block_backup.h |  3 +++
 2 files changed, 21 insertions(+)
 create mode 100644 include/block/block_backup.h

diff --git a/block/backup.c b/block/backup.c
index f87f8d5..1964a5a 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -17,6 +17,7 @@
 #include "block/block.h"
 #include "block/block_int.h"
 #include "block/blockjob.h"
+#include "block/block_backup.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/ratelimit.h"
@@ -253,6 +254,23 @@ static void backup_attached_aio_context(BlockJob *job, 
AioContext *aio_context)
 blk_set_aio_context(s->target, aio_context);
 }
 
+void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t len;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+len = DIV_ROUND_UP(backup_job->common.len, backup_job->cluster_size);
+bitmap_zero(backup_job->done_bitmap, len);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
new file mode 100644
index 000..3753bcb
--- /dev/null
+++ b/include/block/block_backup.h
@@ -0,0 +1,3 @@
+#include "block/block_int.h"
+
+void backup_do_checkpoint(BlockJob *job, Error **errp);
-- 
1.9.3

[Qemu-block] [PATCH v21 07/10] Introduce new APIs to do replication operation

2016-07-05 Thread Changlong Xie

This commit introduces six replication interfaces(for block, network etc).
Firstly we can use replication_(new/remove) to create/destroy replication
instances, then in migration we can use replication_(start/stop/do_checkpoint
/get_error)_all to handle all replication operations. More detail please
refer to replication.h

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 Makefile.objs|   1 +
 qapi/block-core.json |  13 
 replication.c| 107 +++
 replication.h| 174 +++
 4 files changed, 295 insertions(+)
 create mode 100644 replication.c
 create mode 100644 replication.h

diff --git a/Makefile.objs b/Makefile.objs
index 7f1f0a3..4abdc81 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
 block-obj-y += block/
 block-obj-y += qemu-io-cmds.o
+block-obj-y += replication.o
 
 block-obj-m = block/
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 98a20d2..e56cdf4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2032,6 +2032,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.7
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
diff --git a/replication.c b/replication.c
new file mode 100644
index 000..be3a42f
--- /dev/null
+++ b/replication.c
@@ -0,0 +1,107 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+static QLIST_HEAD(, ReplicationState) replication_states;
+
+ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
+{
+ReplicationState *rs;
+
+assert(ops != NULL);
+rs = g_new0(ReplicationState, 1);
+rs->opaque = opaque;
+rs->ops = ops;
+QLIST_INSERT_HEAD(_states, rs, node);
+
+return rs;
+}
+
+void replication_remove(ReplicationState *rs)
+{
+if (rs) {
+QLIST_REMOVE(rs, node);
+g_free(rs);
+}
+}
+
+/*
+ * The caller of the function MUST make sure vm stopped
+ */
+void replication_start_all(ReplicationMode mode, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->start) {
+rs->ops->start(rs, mode, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_do_checkpoint_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->checkpoint) {
+rs->ops->checkpoint(rs, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_get_error_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->get_error) {
+rs->ops->get_error(rs, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void replication_stop_all(bool failover, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->stop) {
+rs->ops->stop(rs, failover, _err);
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
diff --git a/replication.h b/replication.h
new file mode 100644
index 000..ece6ca6
--- /dev/null
+++ b/replication.h
@@ -0,0 +1,174 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlo

[Qemu-block] [PATCH v21 01/10] unblock backup operations in backing file

2016-07-05 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index f4648e9..f7e7e43 100644
--- a/block.c
+++ b/block.c
@@ -1309,6 +1309,23 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1 and 2, neither the source nor the target is the backing file.
+ * In case 3, we will block the top BDS, so there is only one block job
+ * for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

Re: [Qemu-block] [PATCH v20 Resend 09/10] tests: add unit test case for replication

2016-07-03 Thread Changlong Xie


On 06/14/2016 03:53 PM, Changlong Xie wrote:

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
  tests/.gitignore |   1 +
  tests/Makefile   |   4 +
  tests/test-replication.c | 555 +++
  3 files changed, 560 insertions(+)
  create mode 100644 tests/test-replication.c

diff --git a/tests/.gitignore b/tests/.gitignore
index a06a8ba..d22ab06 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -58,6 +58,7 @@ test-qmp-introspect.[ch]
  test-qmp-marshal.c
  test-qmp-output-visitor
  test-rcu-list
+test-replication
  test-rfifolock
  test-string-input-visitor
  test-string-output-visitor
diff --git a/tests/Makefile b/tests/Makefile
index a3e20e3..901b8e4 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -103,6 +103,7 @@ check-unit-y += tests/test-crypto-xts$(EXESUF)
  check-unit-y += tests/test-crypto-block$(EXESUF)
  gcov-files-test-logging-y = tests/test-logging.c
  check-unit-y += tests/test-logging$(EXESUF)
+check-unit-y += tests/test-replication$(EXESUF)

  check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh

@@ -451,6 +452,9 @@ tests/test-base64$(EXESUF): tests/test-base64.o \

  tests/test-logging$(EXESUF): tests/test-logging.o $(test-util-obj-y)

+tests/test-replication$(EXESUF): tests/test-replication.o $(test-util-obj-y) \
+   $(test-block-obj-y)
+
  tests/test-qapi-types.c tests/test-qapi-types.h :\
  $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json 
$(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
diff --git a/tests/test-replication.c b/tests/test-replication.c
new file mode 100644
index 000..b5bb2eb
--- /dev/null
+++ b/tests/test-replication.c
@@ -0,0 +1,555 @@
+/*
+ * Block replication tests
+ *
+ * Copyright (c) 2016 FUJITSU LIMITED
+ * Author: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "replication.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+
+#define IMG_SIZE (64 * 1024 * 1024)
+
+/* primary */
+static char p_local_disk[] = "/tmp/p_local_disk.XX";
+
+/* secondary */
+#define S_ID "secondary-id"
+#define S_LOCAL_DISK_ID "secondary-local-disk-id"
+static char s_local_disk[] = "/tmp/s_local_disk.XX";
+static char s_active_disk[] = "/tmp/s_active_disk.XX";
+static char s_hidden_disk[] = "/tmp/s_hidden_disk.XX";
+
+/* FIXME: steal from blockdev.c */
+QemuOptsList qemu_drive_opts = {
+.name = "drive",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_drive_opts.head),
+.desc = {
+{ /* end of list */ }
+},
+};
+
+static void io_read(BlockDriverState *bs, long pattern, int64_t pattern_offset,
+int64_t pattern_count, int64_t offset, int64_t count,
+bool expect_failed)
+{
+char *buf;
+void *cmp_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+/* alloc read buffer */
+buf = qemu_blockalign(bs, count);
+memset(buf, 0xab, count);
+
+/* do read */
+ret = bdrv_read(bs, offset >> 9, (uint8_t *)buf, count >> 9);
+
+/* assert and compare buf */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+if (pattern) {
+g_assert(memcmp(buf + pattern_offset, cmp_buf, pattern_count) <= 
0);
+}
+}
+
+g_free(cmp_buf);
+qemu_vfree(buf);
+}
+
+static void io_write(BlockDriverState *bs, long pattern, int64_t offset,
+ int64_t count, bool expect_failed)
+{
+void *pattern_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+pattern_buf = qemu_blockalign(bs, count);
+memset(pattern_buf, pattern, count);
+}
+
+/* do write */
+if (pattern) {
+ret = bdrv_write(bs, offset >> 9, (uint8_t *)pattern_buf, count >> 9);
+} else {
+ret = bdrv_write_zeroes(bs, offset >> 9, count >> 9, 0);


Commit 74021bc "block: Switch bdrv_write_zeroes() to byte interface", so 
i'll use bdrv_pwrite_zeroes() in next version. Also will 
s/9/BDRV_SECTOR_BITS/



+}
+
+/* assert */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+}
+
+qemu_vfree(pattern_buf);
+}
+
+/*
+ * Create a uniquely-named empty temporary file.
+ */
+static void make_temp(char *template)
+{
+int fd;
+
+fd = mkstemp(template);
+g_assert(fd >= 0);
+cl

Re: [Qemu-block] [PATCH v20 Resend 00/10] Block replication for continuous checkpoints

2016-06-17 Thread Changlong Xie


For v19, Stefan said he had reviewed most part of this patchsets.
So, this series need more comments from block and block job maintainers.

@Jeff and/or Kevin, ping...

On 06/14/2016 03:53 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v20-resend

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v19

TODO:
1. Continuous block replication. It will be started after basic functions
are accepted.

Changs Log:
V20 Resend:
1. Resend to avoid bothering qemu-trivial maintainers
2. Address comments from Eric, fix header file issue and add a brief commit 
message for p7
V20:
1. Rebase to the lastest code
2. Address comments from stefan
p8:
1. error_setg() with an error message when check_top_bs() fails.
2. remove bdrv_ref(s->hidden_disk->bs) since commit 5c438bc6
3. use bloc_job_cancel_sync() before active commit
p9:
1. fix uninitialized 'pattern_buf'
2. introduce mkstemp(3) to fix unique filenames
3. use qemu_vfree() for qemu_blockalign() memory
4. add missing replication_start_all()
5. remove useless pattern for io_write()
V19:
1. Rebase to v2.6.0
2. Address comments from stefan
p3: a new patch that export interfaces for extra serialization
p8:
1. call replication_stop() before freeing s->top_id
2. check top_bs
3. reopen file readonly in error return paths
4. enable extra serialization between read and COW
p9: try to hanlde SIGABRT
V18:
p6: add local_err in all replication callbacks to prevent "errp == NULL"
p7: add missing qemu_iovec_destroy(xxx)
V17:
1. Rebase to the lastest codes
p2: refactor backup_do_checkpoint addressed comments from Jeff Cody
p4: fix bugs in "drive_add buddy xxx" hmp commands
p6: add "since: 2.7"
p7: fix bug in replication_close(), add missing "qapi/error.h", add 
test-replication
p8: add "since: 2.7"
V16:
1. Rebase to the newest codes
2. Address comments from Stefan & hailiang
p3: we don't need this patch now
p4: add "top-id" parameters for secondary
p6: fix NULL pointer in replication callbacks, remove unnecessary typedefs,
add doc comments that explain the semantics of Replication
p7: Refactor AioContext for thread-safe, remove unnecessary get_top_bs()
*Note*: I'm working on replication testcase now, will send out in V17
V15:
1. Rebase to the newest codes
2. Fix typos and coding style addresed Eric's comments
3. Address Stefan's comments
1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
2) Update the message and description for [PATCH 4/9]
3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
5) Use BdrvChild instead of holding on to BlockDriverState * pointers
4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771
5. Introduce replication_get_error_all to check replication status
6. Remove useless discard interface
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2

[Qemu-block] [PATCH v20 Resend 09/10] tests: add unit test case for replication

2016-06-14 Thread Changlong Xie

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 tests/.gitignore |   1 +
 tests/Makefile   |   4 +
 tests/test-replication.c | 555 +++
 3 files changed, 560 insertions(+)
 create mode 100644 tests/test-replication.c

diff --git a/tests/.gitignore b/tests/.gitignore
index a06a8ba..d22ab06 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -58,6 +58,7 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
+test-replication
 test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
diff --git a/tests/Makefile b/tests/Makefile
index a3e20e3..901b8e4 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -103,6 +103,7 @@ check-unit-y += tests/test-crypto-xts$(EXESUF)
 check-unit-y += tests/test-crypto-block$(EXESUF)
 gcov-files-test-logging-y = tests/test-logging.c
 check-unit-y += tests/test-logging$(EXESUF)
+check-unit-y += tests/test-replication$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -451,6 +452,9 @@ tests/test-base64$(EXESUF): tests/test-base64.o \
 
 tests/test-logging$(EXESUF): tests/test-logging.o $(test-util-obj-y)
 
+tests/test-replication$(EXESUF): tests/test-replication.o $(test-util-obj-y) \
+   $(test-block-obj-y)
+
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json 
$(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
diff --git a/tests/test-replication.c b/tests/test-replication.c
new file mode 100644
index 000..b5bb2eb
--- /dev/null
+++ b/tests/test-replication.c
@@ -0,0 +1,555 @@
+/*
+ * Block replication tests
+ *
+ * Copyright (c) 2016 FUJITSU LIMITED
+ * Author: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "replication.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+
+#define IMG_SIZE (64 * 1024 * 1024)
+
+/* primary */
+static char p_local_disk[] = "/tmp/p_local_disk.XX";
+
+/* secondary */
+#define S_ID "secondary-id"
+#define S_LOCAL_DISK_ID "secondary-local-disk-id"
+static char s_local_disk[] = "/tmp/s_local_disk.XX";
+static char s_active_disk[] = "/tmp/s_active_disk.XX";
+static char s_hidden_disk[] = "/tmp/s_hidden_disk.XX";
+
+/* FIXME: steal from blockdev.c */
+QemuOptsList qemu_drive_opts = {
+.name = "drive",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_drive_opts.head),
+.desc = {
+{ /* end of list */ }
+},
+};
+
+static void io_read(BlockDriverState *bs, long pattern, int64_t pattern_offset,
+int64_t pattern_count, int64_t offset, int64_t count,
+bool expect_failed)
+{
+char *buf;
+void *cmp_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+/* alloc read buffer */
+buf = qemu_blockalign(bs, count);
+memset(buf, 0xab, count);
+
+/* do read */
+ret = bdrv_read(bs, offset >> 9, (uint8_t *)buf, count >> 9);
+
+/* assert and compare buf */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+if (pattern) {
+g_assert(memcmp(buf + pattern_offset, cmp_buf, pattern_count) <= 
0);
+}
+}
+
+g_free(cmp_buf);
+qemu_vfree(buf);
+}
+
+static void io_write(BlockDriverState *bs, long pattern, int64_t offset,
+ int64_t count, bool expect_failed)
+{
+void *pattern_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+pattern_buf = qemu_blockalign(bs, count);
+memset(pattern_buf, pattern, count);
+}
+
+/* do write */
+if (pattern) {
+ret = bdrv_write(bs, offset >> 9, (uint8_t *)pattern_buf, count >> 9);
+} else {
+ret = bdrv_write_zeroes(bs, offset >> 9, count >> 9, 0);
+}
+
+/* assert */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+}
+
+qemu_vfree(pattern_buf);
+}
+
+/*
+ * Create a uniquely-named empty temporary file.
+ */
+static void make_temp(char *template)
+{
+int fd;
+
+fd = mkstemp(template);
+g_assert(fd >= 0);
+close(fd);
+}
+
+
+static void prepare_imgs(void)
+{
+Error *local_err = NULL;
+
+make_temp(p_local_disk);
+make_temp(s_local_disk);
+make_temp(s_active_disk);
+make_temp(s_hidden_disk);
+
+/* Primary */
+bdrv_

[Qemu-block] [PATCH v20 Resend 08/10] Implement new driver for block replication

2016-06-14 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 657 
 2 files changed, 658 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fbfe647..5e28b45 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 block-obj-y += crypto.o
 
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..1dabb5d
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,657 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <we...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "block/nbd.h"
+#include "block/blockjob.h"
+#include "block/block_int.h"
+#include "block/block_backup.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BdrvChild *active_disk;
+BdrvChild *hidden_disk;
+BdrvChild *secondary_disk;
+char *top_id;
+ReplicationState *rs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed */
+BLOCK_REPLICATION_DONE, /* block replication is done */
+};
+
+static void replication_start(ReplicationState *rs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(ReplicationState *rs, Error **errp);
+static void replication_get_error(ReplicationState *rs, Error **errp);
+static void replication_stop(ReplicationState *rs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+#define REPLICATION_TOP_ID  "top-id"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_TOP_ID,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static ReplicationOps replication_ops = {
+.start = replication_start,
+.checkpoint = replication_do_checkpoint,
+.get_error = replication_get_error,
+.stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+const char *top_id;
+
+ret = -EINVAL;
+opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
+qemu_opts_absorb_qdict(opts, options, _err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
+s->top_id = g_strdup(top_id);
+if (!s->top_id) {
+error_setg(_err, "Missing the option top-id");
+goto fail;
+}
+} else {
+error_setg(_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+s->rs = replication_new(bs, _ops);
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+error_propagate(errp, local_err);
+
+return ret;
+}
+
+static void replication_close(BlockDrive

[Qemu-block] [PATCH v20 Resend 07/10] Introduce new APIs to do replication operation

2016-06-14 Thread Changlong Xie

This commit introduces six replication interfaces(for block, network etc).
Firstly we can use replication_(new/remove) to create/destroy replication
instances, then in migration we can use replication_(start/stop/do_checkpoint
/get_error)_all to handle all replication operations. More detail please
refer to replication.h

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 Makefile.objs|   1 +
 qapi/block-core.json |  13 
 replication.c| 107 +++
 replication.h| 174 +++
 4 files changed, 295 insertions(+)
 create mode 100644 replication.c
 create mode 100644 replication.h

diff --git a/Makefile.objs b/Makefile.objs
index da49b71..f77f6b0 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
 block-obj-$(CONFIG_WIN32) += aio-win32.o
 block-obj-y += block/
 block-obj-y += qemu-io-cmds.o
+block-obj-y += replication.o
 
 block-obj-m = block/
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 98a20d2..e56cdf4 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2032,6 +2032,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.7
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
diff --git a/replication.c b/replication.c
new file mode 100644
index 000..c95e523
--- /dev/null
+++ b/replication.c
@@ -0,0 +1,107 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+static QLIST_HEAD(, ReplicationState) replication_states;
+
+ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
+{
+ReplicationState *rs;
+
+assert(ops != NULL);
+rs = g_new0(ReplicationState, 1);
+rs->opaque = opaque;
+rs->ops = ops;
+QLIST_INSERT_HEAD(_states, rs, node);
+
+return rs;
+}
+
+void replication_remove(ReplicationState *rs)
+{
+if (rs) {
+QLIST_REMOVE(rs, node);
+g_free(rs);
+}
+}
+
+/*
+ * The caller of the function MUST make sure vm stopped
+ */
+void replication_start_all(ReplicationMode mode, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->start) {
+rs->ops->start(rs, mode, _err);
+}
+if (local_err) {
+   error_propagate(errp, local_err);
+   return;
+}
+}
+}
+
+void replication_do_checkpoint_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->checkpoint) {
+rs->ops->checkpoint(rs, _err);
+}
+if (local_err) {
+   error_propagate(errp, local_err);
+   return;
+}
+}
+}
+
+void replication_get_error_all(Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->get_error) {
+rs->ops->get_error(rs, _err);
+}
+if (local_err) {
+   error_propagate(errp, local_err);
+   return;
+}
+}
+}
+
+void replication_stop_all(bool failover, Error **errp)
+{
+ReplicationState *rs, *next;
+Error *local_err = NULL;
+
+QLIST_FOREACH_SAFE(rs, _states, node, next) {
+if (rs->ops && rs->ops->stop) {
+rs->ops->stop(rs, failover, _err);
+}
+if (local_err) {
+   error_propagate(errp, local_err);
+   return;
+}
+}
+}
diff --git a/replication.h b/replication.h
new file mode 100644
index 000..ece6ca6
--- /dev/null
+++ b/replication.h
@@ -0,0 +1,174 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <

[Qemu-block] [PATCH v20 Resend 10/10] support replication driver in blockdev-add

2016-06-14 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 qapi/block-core.json | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e56cdf4..b9f9839 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
 #   2.6: 'luks' added
+#   2.7: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1632,6 +1633,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.7
 #
 # Since: 2.0
 ##
@@ -1639,8 +1641,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'luks', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', 'tftp',
+'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile
@@ -2045,6 +2047,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.7
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
@@ -2125,6 +2140,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-block] [PATCH v20 Resend 05/10] docs: block replication's description

2016-06-14 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 docs/block-replication.txt | 239 +
 1 file changed, 239 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..6bde673
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,239 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2016
+Copyright (c) 2016 Intel Corporation
+Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of the Primary and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort during a vmstate checkpoint, the disk modification operations of
+the Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
++--+++
+|Primary Write Requests||Secondary Write Requests|
++--+++
+  |   |
+  |  (4)
+  |   V
+  |  /-\
+  |  Copy and Forward| |
+  |-(1)--+   | Disk Buffer |
+  |  |   | |
+  | (3)  \-/
+  | speculative  ^
+  |write through(2)
+  |  |   |
+  V  V   |
+   +--+   ++
+   | Primary Disk |   | Secondary Disk |
+   +--+   ++
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+ virtio-blk   ||
+ ^||.--
+ |||| Secondary
+1 Quorum  ||'--
+ /  \ ||
+/\||
+   Primary2 filter
+ disk ^
 virtio-blk
+  |
  ^
+3 NBD  --->  3 NBD 
  |
+client|| server
  2 filter
+  ||^  
  ^
+. |||  
  |
+Primary | ||  Secondary disk <- hidden-disk 5 
<- active-disk 4
+'

[Qemu-block] [PATCH v20 Resend 02/10] Backup: clear all bitmap when doing block checkpoint

2016-06-14 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/backup.c   | 18 ++
 include/block/block_backup.h |  3 +++
 2 files changed, 21 insertions(+)
 create mode 100644 include/block/block_backup.h

diff --git a/block/backup.c b/block/backup.c
index feeb9f8..baf3936 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -17,6 +17,7 @@
 #include "block/block.h"
 #include "block/block_int.h"
 #include "block/blockjob.h"
+#include "block/block_backup.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/ratelimit.h"
@@ -246,6 +247,23 @@ static void backup_abort(BlockJob *job)
 }
 }
 
+void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t len;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+len = DIV_ROUND_UP(backup_job->common.len, backup_job->cluster_size);
+bitmap_zero(backup_job->done_bitmap, len);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
new file mode 100644
index 000..3753bcb
--- /dev/null
+++ b/include/block/block_backup.h
@@ -0,0 +1,3 @@
+#include "block/block_int.h"
+
+void backup_do_checkpoint(BlockJob *job, Error **errp);
-- 
1.9.3

[Qemu-block] [PATCH v20 Resend 06/10] auto complete active commit

2016-06-14 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 80fd3c7..40fad19 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -805,7 +805,8 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 
@@ -840,6 +841,9 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -877,14 +881,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState 
*target,
 mirror_start_job(bs, target, replaces,
  speed, granularity, buf_size,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- _job_driver, is_none_mode, base);
+ _job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -924,7 +929,7 @@ void commit_active_start(BlockDriverState *bs, 
BlockDriverState *base,
 
 mirror_start_job(bs, base, NULL, speed, 0, 0,
  on_error, on_error, false, cb, opaque, _err,
- _active_job_driver, false, base);
+ _active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 717785e..734bfb0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3155,7 +3155,7 @@ void qmp_block_commit(const char *device,
 goto out;
 }
 commit_active_start(bs, base_bs, speed, on_error, block_job_cb,
-bs, _err);
+bs, _err, false);
 } else {
 commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs,
  has_backing_file ? backing_file : NULL, _err);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 30a9717..89b66e8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -653,13 +653,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState 
*base,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @bs: Block device to operate on.
diff --git a/qemu-img.c b/qemu-img.c
index 4b56ad3..e6a480f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -911,7 +911,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, , _err);
+common_block_job_cb, , _err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

[Qemu-block] [PATCH v20 Resend 01/10] unblock backup operations in backing file

2016-06-14 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 736432f..dcf63f4 100644
--- a/block.c
+++ b/block.c
@@ -1310,6 +1310,23 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1 and 2, neither the source nor the target is the backing file.
+ * In case 3, we will block the top BDS, so there is only one block job
+ * for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

Re: [Qemu-block] [PATCH v20 00/10] Block replication for continuous checkpoints

2016-06-12 Thread Changlong Xie


On 06/08/2016 09:36 PM, Eric Blake wrote:

On 06/07/2016 07:11 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).



Side note: Including qemu-trivial in CC: on a patch series at v20 feels
wrong.  Obviously it is not trivial to be ten patches with this much churn.



To avoid bothering qemu-trivial maintainers, i'll resend the patch sets 
until you have some comments on [PATCH V20 07/10].


Thanks
-Xie

Re: [Qemu-block] [PATCH v20 07/10] Introduce new APIs to do replication operation

2016-06-12 Thread Changlong Xie


On 06/08/2016 09:41 PM, Eric Blake wrote:

On 06/07/2016 07:11 PM, Changlong Xie wrote:

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>


No mention of the API names in the commit message?  Grepping 'git log'
is easier if there is something useful to grep for.


I'll use a brief description, such as below:

This commit introduces six replication interfaces(for block, network 
etc). Firstly we can use replication_(new/remove) to create/destroy 
replication instances, then in migration we can use 
replication_(start/stop/do_checkpoint/get_error)_all to handle all 
replication operations. More detail please refer to replication.h





---
  Makefile.objs|   1 +
  qapi/block-core.json |  13 
  replication.c| 105 ++
  replication.h| 176 +++
  4 files changed, 295 insertions(+)
  create mode 100644 replication.c
  create mode 100644 replication.h




+++ b/qapi/block-core.json
@@ -2032,6 +2032,19 @@
  '*read-pattern': 'QuorumReadPattern' } }

  ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.7
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+


This part is fine, from an interface point of view. However, I have not
closely reviewed the rest of the patch or series.  That said, here's
some quick things that caught my eye.


Appreciate.





+++ b/replication.c
@@ -0,0 +1,105 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "replication.h"


All new .c files must include "qemu/osdep.h" first.



+++ b/replication.h
@@ -0,0 +1,176 @@
+/*
+ * Replication filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef REPLICATION_H
+#define REPLICATION_H
+
+#include "qemu/osdep.h"


And .h files must NOT include osdep.h.


+#include "qapi/error.h"


Do you really need the full error.h, or is typedefs.h enough to get the
forward declaration of Error?


It's for error_propagate() in replication.c, and in summary to your 
comments on replication.h/replication.c, i'll

1) remove all uncorrelated *.h in replication.h
2) use following header files in replication.c

#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/queue.h"
#include "replication.h"

Thanks

Re: [Qemu-block] [PATCH v20 00/10] Block replication for continuous checkpoints

2016-06-12 Thread Changlong Xie


On 06/10/2016 09:22 PM, Michael Tokarev wrote:

08.06.2016 04:11, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

...

I'm not sure I understand why this has been sent to qemu-trivial? :)



$HOME/.gitconfig mislead me, will remove it.

Thanks


Thanks

/mjt


.

[Qemu-block] [PATCH v20 08/10] Implement new driver for block replication

2016-06-07 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 657 
 2 files changed, 658 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fbfe647..5e28b45 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 block-obj-y += crypto.o
 
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..1dabb5d
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,657 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2016 Intel Corporation
+ * Copyright (c) 2016 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <we...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "block/nbd.h"
+#include "block/blockjob.h"
+#include "block/block_int.h"
+#include "block/block_backup.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "replication.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BdrvChild *active_disk;
+BdrvChild *hidden_disk;
+BdrvChild *secondary_disk;
+char *top_id;
+ReplicationState *rs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_FAILOVER, /* failover is running in background */
+BLOCK_REPLICATION_FAILOVER_FAILED,  /* failover failed */
+BLOCK_REPLICATION_DONE, /* block replication is done */
+};
+
+static void replication_start(ReplicationState *rs, ReplicationMode mode,
+  Error **errp);
+static void replication_do_checkpoint(ReplicationState *rs, Error **errp);
+static void replication_get_error(ReplicationState *rs, Error **errp);
+static void replication_stop(ReplicationState *rs, bool failover,
+ Error **errp);
+
+#define REPLICATION_MODE"mode"
+#define REPLICATION_TOP_ID  "top-id"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{
+.name = REPLICATION_TOP_ID,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static ReplicationOps replication_ops = {
+.start = replication_start,
+.checkpoint = replication_do_checkpoint,
+.get_error = replication_get_error,
+.stop = replication_stop,
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+const char *top_id;
+
+ret = -EINVAL;
+opts = qemu_opts_create(_runtime_opts, NULL, 0, _abort);
+qemu_opts_absorb_qdict(opts, options, _err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+top_id = qemu_opt_get(opts, REPLICATION_TOP_ID);
+s->top_id = g_strdup(top_id);
+if (!s->top_id) {
+error_setg(_err, "Missing the option top-id");
+goto fail;
+}
+} else {
+error_setg(_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+s->rs = replication_new(bs, _ops);
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+error_propagate(errp, local_err);
+
+return ret;
+}
+
+static void replication_close(BlockDrive

[Qemu-block] [PATCH v20 09/10] tests: add unit test case for replication

2016-06-07 Thread Changlong Xie

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 tests/.gitignore |   1 +
 tests/Makefile   |   4 +
 tests/test-replication.c | 555 +++
 3 files changed, 560 insertions(+)
 create mode 100644 tests/test-replication.c

diff --git a/tests/.gitignore b/tests/.gitignore
index a06a8ba..d22ab06 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -58,6 +58,7 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
+test-replication
 test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
diff --git a/tests/Makefile b/tests/Makefile
index a3e20e3..901b8e4 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -103,6 +103,7 @@ check-unit-y += tests/test-crypto-xts$(EXESUF)
 check-unit-y += tests/test-crypto-block$(EXESUF)
 gcov-files-test-logging-y = tests/test-logging.c
 check-unit-y += tests/test-logging$(EXESUF)
+check-unit-y += tests/test-replication$(EXESUF)
 
 check-block-$(CONFIG_POSIX) += tests/qemu-iotests-quick.sh
 
@@ -451,6 +452,9 @@ tests/test-base64$(EXESUF): tests/test-base64.o \
 
 tests/test-logging$(EXESUF): tests/test-logging.o $(test-util-obj-y)
 
+tests/test-replication$(EXESUF): tests/test-replication.o $(test-util-obj-y) \
+   $(test-block-obj-y)
+
 tests/test-qapi-types.c tests/test-qapi-types.h :\
 $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json 
$(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
diff --git a/tests/test-replication.c b/tests/test-replication.c
new file mode 100644
index 000..b5bb2eb
--- /dev/null
+++ b/tests/test-replication.c
@@ -0,0 +1,555 @@
+/*
+ * Block replication tests
+ *
+ * Copyright (c) 2016 FUJITSU LIMITED
+ * Author: Changlong Xie <xiecl.f...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qapi/error.h"
+#include "replication.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+
+#define IMG_SIZE (64 * 1024 * 1024)
+
+/* primary */
+static char p_local_disk[] = "/tmp/p_local_disk.XX";
+
+/* secondary */
+#define S_ID "secondary-id"
+#define S_LOCAL_DISK_ID "secondary-local-disk-id"
+static char s_local_disk[] = "/tmp/s_local_disk.XX";
+static char s_active_disk[] = "/tmp/s_active_disk.XX";
+static char s_hidden_disk[] = "/tmp/s_hidden_disk.XX";
+
+/* FIXME: steal from blockdev.c */
+QemuOptsList qemu_drive_opts = {
+.name = "drive",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_drive_opts.head),
+.desc = {
+{ /* end of list */ }
+},
+};
+
+static void io_read(BlockDriverState *bs, long pattern, int64_t pattern_offset,
+int64_t pattern_count, int64_t offset, int64_t count,
+bool expect_failed)
+{
+char *buf;
+void *cmp_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+/* alloc read buffer */
+buf = qemu_blockalign(bs, count);
+memset(buf, 0xab, count);
+
+/* do read */
+ret = bdrv_read(bs, offset >> 9, (uint8_t *)buf, count >> 9);
+
+/* assert and compare buf */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+if (pattern) {
+g_assert(memcmp(buf + pattern_offset, cmp_buf, pattern_count) <= 
0);
+}
+}
+
+g_free(cmp_buf);
+qemu_vfree(buf);
+}
+
+static void io_write(BlockDriverState *bs, long pattern, int64_t offset,
+ int64_t count, bool expect_failed)
+{
+void *pattern_buf = NULL;
+int ret;
+
+/* alloc pattern buffer */
+if (pattern) {
+pattern_buf = qemu_blockalign(bs, count);
+memset(pattern_buf, pattern, count);
+}
+
+/* do write */
+if (pattern) {
+ret = bdrv_write(bs, offset >> 9, (uint8_t *)pattern_buf, count >> 9);
+} else {
+ret = bdrv_write_zeroes(bs, offset >> 9, count >> 9, 0);
+}
+
+/* assert */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+}
+
+qemu_vfree(pattern_buf);
+}
+
+/*
+ * Create a uniquely-named empty temporary file.
+ */
+static void make_temp(char *template)
+{
+int fd;
+
+fd = mkstemp(template);
+g_assert(fd >= 0);
+close(fd);
+}
+
+
+static void prepare_imgs(void)
+{
+Error *local_err = NULL;
+
+make_temp(p_local_disk);
+make_temp(s_local_disk);
+make_temp(s_active_disk);
+make_temp(s_hidden_disk);
+
+/* Primary */
+bdrv_

[Qemu-block] [PATCH v20 10/10] support replication driver in blockdev-add

2016-06-07 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 qapi/block-core.json | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e56cdf4..b9f9839 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -248,6 +248,7 @@
 #   2.3: 'host_floppy' deprecated
 #   2.5: 'host_floppy' dropped
 #   2.6: 'luks' added
+#   2.7: 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1632,6 +1633,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.7
 #
 # Since: 2.0
 ##
@@ -1639,8 +1641,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'luks', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication', 'tftp',
+'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile
@@ -2045,6 +2047,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.7
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.  Many options are available for all
@@ -2125,6 +2140,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
1.9.3

[Qemu-block] [PATCH v20 06/10] auto complete active commit

2016-06-07 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Auto complete mirror job in background to prevent from
blocking synchronously

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/mirror.c| 13 +
 blockdev.c|  2 +-
 include/block/block_int.h |  3 ++-
 qemu-img.c|  2 +-
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 80fd3c7..40fad19 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -805,7 +805,8 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
  BlockCompletionFunc *cb,
  void *opaque, Error **errp,
  const BlockJobDriver *driver,
- bool is_none_mode, BlockDriverState *base)
+ bool is_none_mode, BlockDriverState *base,
+ bool auto_complete)
 {
 MirrorBlockJob *s;
 
@@ -840,6 +841,9 @@ static void mirror_start_job(BlockDriverState *bs, 
BlockDriverState *target,
 s->granularity = granularity;
 s->buf_size = ROUND_UP(buf_size, granularity);
 s->unmap = unmap;
+if (auto_complete) {
+s->should_complete = true;
+}
 
 s->dirty_bitmap = bdrv_create_dirty_bitmap(bs, granularity, NULL, errp);
 if (!s->dirty_bitmap) {
@@ -877,14 +881,15 @@ void mirror_start(BlockDriverState *bs, BlockDriverState 
*target,
 mirror_start_job(bs, target, replaces,
  speed, granularity, buf_size,
  on_source_error, on_target_error, unmap, cb, opaque, errp,
- _job_driver, is_none_mode, base);
+ _job_driver, is_none_mode, base, false);
 }
 
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp)
+ void *opaque, Error **errp,
+ bool auto_complete)
 {
 int64_t length, base_length;
 int orig_base_flags;
@@ -924,7 +929,7 @@ void commit_active_start(BlockDriverState *bs, 
BlockDriverState *base,
 
 mirror_start_job(bs, base, NULL, speed, 0, 0,
  on_error, on_error, false, cb, opaque, _err,
- _active_job_driver, false, base);
+ _active_job_driver, false, base, auto_complete);
 if (local_err) {
 error_propagate(errp, local_err);
 goto error_restore_flags;
diff --git a/blockdev.c b/blockdev.c
index 717785e..734bfb0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3155,7 +3155,7 @@ void qmp_block_commit(const char *device,
 goto out;
 }
 commit_active_start(bs, base_bs, speed, on_error, block_job_cb,
-bs, _err);
+bs, _err, false);
 } else {
 commit_start(bs, base_bs, top_bs, speed, on_error, block_job_cb, bs,
  has_backing_file ? backing_file : NULL, _err);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 30a9717..89b66e8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -653,13 +653,14 @@ void commit_start(BlockDriverState *bs, BlockDriverState 
*base,
  * @cb: Completion function for the job.
  * @opaque: Opaque pointer value passed to @cb.
  * @errp: Error object.
+ * @auto_complete: Auto complete the job.
  *
  */
 void commit_active_start(BlockDriverState *bs, BlockDriverState *base,
  int64_t speed,
  BlockdevOnError on_error,
  BlockCompletionFunc *cb,
- void *opaque, Error **errp);
+ void *opaque, Error **errp, bool auto_complete);
 /*
  * mirror_start:
  * @bs: Block device to operate on.
diff --git a/qemu-img.c b/qemu-img.c
index 4b56ad3..e6a480f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -911,7 +911,7 @@ static int img_commit(int argc, char **argv)
 };
 
 commit_active_start(bs, base_bs, 0, BLOCKDEV_ON_ERROR_REPORT,
-common_block_job_cb, , _err);
+common_block_job_cb, , _err, false);
 if (local_err) {
 goto done;
 }
-- 
1.9.3

[Qemu-block] [PATCH v20 03/10] Backup: export interfaces for extra serialization

2016-06-07 Thread Changlong Xie

Normal backup(sync='none') workflow:
step 1. NBD peformance I/O write from client to server
   qcow2_co_writev
bdrv_co_writev
 ...
   bdrv_aligned_pwritev
notifier_with_return_list_notify -> backup_do_cow
 bdrv_driver_pwritev // write new contents

step 2. drive-backup sync=none
   backup_do_cow
   {
wait_for_overlapping_requests
cow_request_begin
for(; start < end; start++) {
bdrv_co_readv_no_serialising //read old contents from Secondary disk
bdrv_co_writev // write old contents to hidden-disk
}
cow_request_end
   }

step 3. Then roll back to "step 1" to write new contents to Secondary disk.

And for replication, we must make sure that we only read the old contents from
Secondary disk in order to keep contents consistent.

1) Replication workflow of Secondary
 virtio-blk
  ^
--->  1 NBD   |
   || server   3 replication
   ||^^
   |||   backing backing  |
   ||  Secondary disk 6< hidden-disk 5 < active-disk 4
   ||| ^
   ||'-'
   ||   drive-backup sync=none 2

Hence, we need these interfaces to implement coarse-grained serialization 
between
COW of Secondary disk and the read operation of replication.

Example codes about how to use them:

*#include "block/block_backup.h"

static coroutine_fn int xxx_co_readv()
{
CowRequest req;
BlockJob *job = secondary_disk->bs->job;

if (job) {
  backup_wait_for_overlapping_requests(job, start, end);
  backup_cow_request_begin(, job, start, end);
  ret = bdrv_co_readv();
  backup_cow_request_end();
  goto out;
}
ret = bdrv_co_readv();
out:
    return ret;
}

Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block/backup.c   | 41 ++---
 include/block/block_backup.h | 14 ++
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index baf3936..b8e1c44 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -28,13 +28,6 @@
 #define BACKUP_CLUSTER_SIZE_DEFAULT (1 << 16)
 #define SLICE_TIME 1ULL /* ns */
 
-typedef struct CowRequest {
-int64_t start;
-int64_t end;
-QLIST_ENTRY(CowRequest) list;
-CoQueue wait_queue; /* coroutines blocked on this request */
-} CowRequest;
-
 typedef struct BackupBlockJob {
 BlockJob common;
 BlockBackend *target;
@@ -264,6 +257,40 @@ void backup_do_checkpoint(BlockJob *job, Error **errp)
 bitmap_zero(backup_job->done_bitmap, len);
 }
 
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+wait_for_overlapping_requests(backup_job, start, end);
+}
+
+void backup_cow_request_begin(CowRequest *req, BlockJob *job,
+  int64_t sector_num,
+  int nb_sectors)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t sectors_per_cluster = cluster_size_sectors(backup_job);
+int64_t start, end;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+start = sector_num / sectors_per_cluster;
+end = DIV_ROUND_UP(sector_num + nb_sectors, sectors_per_cluster);
+cow_request_begin(req, backup_job, start, end);
+}
+
+void backup_cow_request_end(CowRequest *req)
+{
+cow_request_end(req);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
index 3753bcb..e0e7ce6 100644
--- a/include/block/block_backup.h
+++ b/include/block/block_backup.h
@@ -1,3 +1,17 @@
 #include "block/block_int.h"
 
+typedef struct CowRequest {
+int64_t start;
+int64_t end;
+QLIST_ENTRY(CowRequest) list;
+CoQueue wait_queue; /* coroutines blocked on this request */
+} CowRequest;
+
+void backup_wait_for_overlapping_requests(BlockJob *job, int64_t sector_num,
+

[Qemu-block] [PATCH v20 04/10] Link backup into block core

2016-06-07 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Some programs that add a dependency on it will use
the block layer directly.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 44a5416..fbfe647 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,12 +22,12 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o dirty-bitmap.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 block-obj-y += crypto.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
1.9.3

[Qemu-block] [PATCH v20 00/10] Block replication for continuous checkpoints

2016-06-07 Thread Changlong Xie

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v20

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v19

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Changs Log:
V20:
1. Rebase to the lastest code
2. Address comments from stefan
p8: 
1. error_setg() with an error message when check_top_bs() fails. 
2. remove bdrv_ref(s->hidden_disk->bs) since commit 5c438bc6
3. use bloc_job_cancel_sync() before active commit
p9: 
1. fix uninitialized 'pattern_buf'
2. introduce mkstemp(3) to fix unique filenames
3. use qemu_vfree() for qemu_blockalign() memory
4. add missing replication_start_all()
5. remove useless pattern for io_write()
V19:
1. Rebase to v2.6.0
2. Address comments from stefan
p3: a new patch that export interfaces for extra serialization
p8: 
1. call replication_stop() before freeing s->top_id
2. check top_bs
3. reopen file readonly in error return paths
4. enable extra serialization between read and COW
p9: try to hanlde SIGABRT
V18:
p6: add local_err in all replication callbacks to prevent "errp == NULL"
p7: add missing qemu_iovec_destroy(xxx)
V17:
1. Rebase to the lastest codes 
p2: refactor backup_do_checkpoint addressed comments from Jeff Cody
p4: fix bugs in "drive_add buddy xxx" hmp commands
p6: add "since: 2.7"
p7: fix bug in replication_close(), add missing "qapi/error.h", add 
test-replication 
p8: add "since: 2.7"
V16:
1. Rebase to the newest codes
2. Address comments from Stefan & hailiang
p3: we don't need this patch now
p4: add "top-id" parameters for secondary
p6: fix NULL pointer in replication callbacks, remove unnecessary typedefs, 
add doc comments that explain the semantics of Replication
p7: Refactor AioContext for thread-safe, remove unnecessary get_top_bs()
*Note*: I'm working on replication testcase now, will send out in V17
V15:
1. Rebase to the newest codes
2. Fix typos and coding style addresed Eric's comments
3. Address Stefan's comments
   1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
   2) Update the message and description for [PATCH 4/9]
   3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
   4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
   5) Use BdrvChild instead of holding on to BlockDriverState * pointers
4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771  
5. Introduce replication_get_error_all to check replication status
6. Remove useless discard interface
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
   structure, also remove old block layer APIs
V13:
1. Rebase to the newest codes
2. Remove redundant marcos and semicolon in replication.c 
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest codes
2. Use backing reference to replcace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting blcok replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing refrence option according to Stefan Hajnoczi's 
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
   if there are too many I/O requests.
V4:
1. Introduce a new driver replication to avoid touch nbd and qcow2.
V3:
1: use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Changlong Xie (3):
  Backup: export interfaces for extra serialization
  Introduce new APIs to do replication operation
  tests: add unit test case for replication

Wen Congyang (7):
  unblock backup operations in backing

[Qemu-block] [PATCH v20 01/10] unblock backup operations in backing file

2016-06-07 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/block.c b/block.c
index 736432f..dcf63f4 100644
--- a/block.c
+++ b/block.c
@@ -1310,6 +1310,23 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *The target bs is new opened, and the source is top BDS
+ * 2. blockdev backup
+ *Both the source and the target are top BDSes.
+ * 3. internal backup(used for block replication)
+ *Both the source and the target are backing file
+ *
+ * In case 1 and 2, neither the source nor the target is the backing file.
+ * In case 3, we will block the top BDS, so there is only one block job
+ * for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
1.9.3

[Qemu-block] [PATCH v20 02/10] Backup: clear all bitmap when doing block checkpoint

2016-06-07 Thread Changlong Xie

From: Wen Congyang <we...@cn.fujitsu.com>

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
---
 block/backup.c   | 18 ++
 include/block/block_backup.h |  3 +++
 2 files changed, 21 insertions(+)
 create mode 100644 include/block/block_backup.h

diff --git a/block/backup.c b/block/backup.c
index feeb9f8..baf3936 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -17,6 +17,7 @@
 #include "block/block.h"
 #include "block/block_int.h"
 #include "block/blockjob.h"
+#include "block/block_backup.h"
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/ratelimit.h"
@@ -246,6 +247,23 @@ static void backup_abort(BlockJob *job)
 }
 }
 
+void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+int64_t len;
+
+assert(job->driver->job_type == BLOCK_JOB_TYPE_BACKUP);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+len = DIV_ROUND_UP(backup_job->common.len, backup_job->cluster_size);
+bitmap_zero(backup_job->done_bitmap, len);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
diff --git a/include/block/block_backup.h b/include/block/block_backup.h
new file mode 100644
index 000..3753bcb
--- /dev/null
+++ b/include/block/block_backup.h
@@ -0,0 +1,3 @@
+#include "block/block_int.h"
+
+void backup_do_checkpoint(BlockJob *job, Error **errp);
-- 
1.9.3

Re: [Qemu-block] [PATCH v19 08/10] Implement new driver for block replication

2016-06-06 Thread Changlong Xie


On 05/20/2016 03:36 PM, Changlong Xie wrote:

+if (!failover) {
+/*
+ * This BDS will be closed, and the job should be completed
+ * before the BDS is closed, because we will access hidden
+ * disk, secondary disk in backup_job_completed().
+ */
+if (s->secondary_disk->bs->job) {
+block_job_cancel_sync(s->secondary_disk->bs->job);
+}
+secondary_do_checkpoint(s, errp);
+s->replication_state = BLOCK_REPLICATION_DONE;
+aio_context_release(aio_context);
+return;
+}
+
+s->replication_state = BLOCK_REPLICATION_FAILOVER;
+if (s->secondary_disk->bs->job) {
+block_job_cancel(s->secondary_disk->bs->job);
+}


Since commit b6d2e599 "block: Convert block job core to BlockBackend", 
blockjob uses BB instead of bdrv_ref(), this introduces unexpected 
Segmentation fault with COLO.


In the below backtrace, you can see that. During failover, s->target was 
changed to an illegal value "0x1e1e1e1e1e1e1e1e" in bakup_complete.
Then the active commit job what also has a pointer that refer to 
s->target will use this illegal pointer. To avoid this, we should use 
"bloc_job_cancel_sync" to ensure backup job complete synchronously.


% MALLOC_PERTURB_=$(($RANDOM % 255 + 1))
% export MALLOC_PERTURB_
% gdb --args ./tests/test-replication
(gdb) b backup_complete
(gdb) r
(gdb) n
(gdb) n
(gdb) watch s->target
(gdb) c
Old value = (BlockBackend *) 0x55f1d990
New value = (BlockBackend *) 0x1e1e1e1e1e1e1e1e
0x75a811eb in memset () from /lib64/libc.so.6
(gdb) bt
#0  0x75a811eb in memset () from /lib64/libc.so.6
#1  0x75a7500e in _int_free () from /lib64/libc.so.6
#2  0x7705bf7f in g_free () from /lib64/libglib-2.0.so.0
#3  0x5557e924 in block_job_unref (job=0x55f1d630) at 
blockjob.c:124
#4  0x5557e9da in block_job_completed_single 
(job=0x55f1d630) at blockjob.c:143
#5  0x5557ecc8 in block_job_completed (job=0x55f1d630, 
ret=0) at blockjob.c:215
#6  0x555e6d49 in backup_complete (job=0x55f1d630, 
opaque=0x55f1dd50) at block/backup.c:325
#7  0x5557f5f4 in block_job_defer_to_main_loop_bh 
(opaque=0x596e1dc0) at blockjob.c:500

#8  0x555747d7 in aio_bh_call (bh=0x596e1c30) at async.c:66
#9  0x55574899 in aio_bh_poll (ctx=0x55ce2d60) at async.c:94
#10 0x55581d4d in aio_dispatch (ctx=0x55ce2d60) at 
aio-posix.c:308
#11 0x555823cb in aio_poll (ctx=0x55ce2d60, blocking=false) 
at aio-posix.c:479
#12 0x555d639b in bdrv_drain_poll (bs=0x55dec210) at 
block/io.c:190
#13 0x555d6566 in bdrv_drained_begin (bs=0x55dec210) at 
block/io.c:240
#14 0x55577261 in bdrv_child_cb_drained_begin 
(child=0x596e1ab0) at block.c:665
#15 0x555d5e81 in bdrv_parent_drained_begin (bs=0x55f0e130) 
at block/io.c:54
#16 0x555d652b in bdrv_drained_begin (bs=0x55f0e130) at 
block/io.c:232
#17 0x55577261 in bdrv_child_cb_drained_begin 
(child=0x596e1810) at block.c:665
#18 0x555d5e81 in bdrv_parent_drained_begin (bs=0x596d7030) 
at block/io.c:54
#19 0x555d652b in bdrv_drained_begin (bs=0x596d7030) at 
block/io.c:232
#20 0x55577261 in bdrv_child_cb_drained_begin 
(child=0x55df8830) at block.c:665
#21 0x555d5e81 in bdrv_parent_drained_begin (bs=0x55df0bb0) 
at block/io.c:54

#22 0x555d66ee in bdrv_drain_all () at block/io.c:301
#23 0x55579f9f in bdrv_reopen_multiple (bs_queue=0x55f1dd30, 
errp=0x7fffd768) at block.c:1953
#24 0x5557a169 in bdrv_reopen (bs=0x55df0bb0, 
bdrv_flags=24578, errp=0x7fffd8e0) at block.c:1994
#25 0x555d54a7 in commit_active_start (bs=0x55f0e130, 
base=0x55df0bb0, speed=0, on_error=BLOCKDEV_ON_ERROR_REPORT, 
cb=0x555e8b97 ,
opaque=0x55dec210, errp=0x7fffd8e0, auto_complete=true) at 
block/mirror.c:901
#26 0x555e8d98 in replication_stop (rs=0x5746f7b0, 
failover=true, errp=0x7fffd8e0) at block/replication.c:623
#27 0x555873d1 in replication_stop_all (failover=true, 
errp=0x7fffd928) at replication.c:98
#28 0x5557406d in test_secondary_start () at 
tests/test-replication.c:403
#29 0x7707a5e1 in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#30 0x7707a7a6 in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#31 0x7707a7a6 in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0

#32 0x7707ab1b in g_test_run_suite () from /lib64/libglib-2.0.so.0
#33 0x555746c8 in main (argc=1, argv=0x7fffdda8) at 
tests/test-replication.c:545

(gdb) d breakpoints
(gdb) c
Program received signal SIGSEGV, Segmentation fault.
0x555c7d6c in blk_bs (bl

Re: [Qemu-block] [PATCH v19 08/10] Implement new driver for block replication

2016-06-06 Thread Changlong Xie


On 05/20/2016 03:36 PM, Changlong Xie wrote:

+
+/*
+ * Must protect backup target if backup job was stopped/cancelled
+ * unexpectedly
+ */
+bdrv_ref(s->hidden_disk->bs);
+
+backup_start(s->secondary_disk->bs, s->hidden_disk->bs, 0,
+ MIRROR_SYNC_MODE_NONE, NULL, BLOCKDEV_ON_ERROR_REPORT,
+ BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
+ s, NULL, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+backup_job_cleanup(s);
+bdrv_unref(s->hidden_disk->bs);
+aio_context_release(aio_context);
+return;
+}
+break;
+default:
+aio_context_release(aio_context);
+abort();
+}


commit 5c438bc6 introduce BB for I/O, so we don't need protect backup 
target by ourself now.


-/*
- * Must protect backup target if backup job was stopped/cancelled
- * unexpectedly
- */
-bdrv_ref(s->hidden_disk->bs);
-
 backup_start(s->secondary_disk->bs, s->hidden_disk->bs, 0,
  MIRROR_SYNC_MODE_NONE, NULL, 
BLOCKDEV_ON_ERROR_REPORT,

  BLOCKDEV_ON_ERROR_REPORT, backup_job_completed,
@@ -508,7 +502,6 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,

 if (local_err) {
 error_propagate(errp, local_err);
 backup_job_cleanup(s);
-bdrv_unref(s->hidden_disk->bs);
 aio_context_release(aio_context);
 return;
 }

will update in next version.

Re: [Qemu-block] [PATCH v19 09/10] tests: add unit test case for replication

2016-05-31 Thread Changlong Xie


On 05/31/2016 01:34 AM, Stefan Hajnoczi wrote:

On Fri, May 20, 2016 at 03:36:19PM +0800, Changlong Xie wrote:

+/* primary */
+#define P_LOCAL_DISK "/tmp/p_local_disk.XX"
+#define P_COMMAND "driver=replication,mode=primary,node-name=xxx,"\
+  "file.driver=qcow2,file.file.filename="P_LOCAL_DISK
+
+/* secondary */
+#define S_LOCAL_DISK "/tmp/s_local_disk.XX"
+#define S_ACTIVE_DISK "/tmp/s_active_disk.XX"
+#define S_HIDDEN_DISK "/tmp/s_hidden_disk.XX"


Please use unique filenames so that multiple instances of the test can
run in parallel on a single machine.  mkstemp(3) can be used to do this.



will fix in next version.


+static void io_read(BlockDriverState *bs, long pattern, int64_t pattern_offset,
+int64_t pattern_count, int64_t offset, int64_t count,
+bool expect_failed)
+{
+char *buf;
+void *cmp_buf;
+int ret;
+
+/* 1. alloc pattern buffer */
+if (pattern) {
+cmp_buf = g_malloc(pattern_count);
+memset(cmp_buf, pattern, pattern_count);
+}
+
+/* 2. alloc read buffer */
+buf = qemu_blockalign(bs, count);
+memset(buf, 0xab, count);
+
+/* 3. do read */
+ret = bdrv_read(bs, offset >> 9, (uint8_t *)buf, count >> 9);
+
+/* 4. assert and compare buf */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+if (pattern) {
+g_assert(memcmp(buf + pattern_offset, cmp_buf, pattern_count) <= 
0);
+g_free(cmp_buf);


if pattern && expect_failed then cmp_buf is leaked.  Probably best to
initialize cmp_buf = NULL and have an unconditional g_free(cmp_buf) at
the end of the function to avoid leaks.



Yes, you are right.


+}
+}
+g_free(buf);


qemu_blockalign() memory is freed with qemu_vfree(), not g_free().



will fix


+static void test_primary_do_checkpoint(void)
+{
+BlockDriverState *bs;
+Error *local_err = NULL;
+
+bs = start_primary();
+
+replication_do_checkpoint_all(_err);
+g_assert(!local_err);
+
+teardown_primary(bs);
+}


Shouldn't replication_start_all() be called before
replication_do_checkpoint_all()?



It seems i missed it.


+int main(int argc, char **argv)
+{
+int ret;
+qemu_init_main_loop(_fatal);
+bdrv_init();
+
+do {} while (g_main_context_iteration(NULL, false));


Why is this necessary?


Will remove it.

Re: [Qemu-block] [PATCH v19 08/10] Implement new driver for block replication

2016-05-30 Thread Changlong Xie


On 05/31/2016 02:14 AM, Stefan Hajnoczi wrote:

On Fri, May 20, 2016 at 03:36:18PM +0800, Changlong Xie wrote:

+/* start backup job now */
+error_setg(>blocker,
+   "block device is in use by internal backup job");
+
+top_bs = bdrv_lookup_bs(s->top_id, s->top_id, errp);
+if (!top_bs || !check_top_bs(top_bs, bs)) {
+reopen_backing_file(s, false, NULL);
+aio_context_release(aio_context);
+return;
+}


Missing error_setg() with an error message when check_top_bs() fails.



Will add.

Thanks
-Xie

Re: [Qemu-block] [PATCH v19 00/10] Block replication for continuous checkpoints

2016-05-26 Thread Changlong Xie


Ping here : )

Hi fam, do you have time to help reviewing this patchset? Consider of we 
are in the same time zone what will speed up code reviewing process,

any feedback will be appreciated.

Thanks
-Xie

On 05/20/2016 03:36 PM, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints(for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v19

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v18

TODO:
1. Continuous block replication. It will be started after basic functions

Re: [Qemu-block] [PATCH v19 09/10] tests: add unit test case for replication

2016-05-26 Thread Changlong Xie


On 05/20/2016 03:36 PM, Changlong Xie wrote:

+static void io_write(BlockDriverState *bs, long pattern, int64_t pattern_count,
+ int64_t offset, int64_t count, bool expect_failed)
+{
+void *pattern_buf;


Should initialize as NULL to avoid below warnning:

tests/test-replication.c:104:15: error: ‘pattern_buf’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

 g_free(pattern_buf);

Will fix in next version.


+int ret;
+
+/* 1. alloc pattern buffer */
+if (pattern) {
+pattern_buf = g_malloc(pattern_count);
+memset(pattern_buf, pattern, pattern_count);
+}
+
+/* 2. do write */
+if (pattern) {
+ret = bdrv_write(bs, offset >> 9, (uint8_t *)pattern_buf, count >> 9);
+} else {
+ret = bdrv_write_zeroes(bs, offset >> 9, count >> 9, 0);
+}
+
+/* 3. assert */
+if (expect_failed) {
+g_assert(ret < 0);
+} else {
+g_assert(ret >= 0);
+g_free(pattern_buf);
+}
+}

1 2 3 >

1 - 100 of 218 matches

Mail list logo