from:"Wen Congyang"

Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Wen Congyang

On 11/09/2016 01:02 PM, Dave Young wrote:
> On 11/09/16 at 11:58am, Wen Congyang wrote:
>> On 11/09/2016 11:17 AM, Dave Young wrote:
>>> Drop qiaonuohan, seems the mail address is wrong..
>>>
>>> On 11/09/16 at 11:01am, Dave Young wrote:
>>>> Hi,
>>>>
>>>> The latest Linux kernel enables KASLR to randomize phys/virt memory
>>>> addresses. We made some effort to support kexec/kdump so that the crash
>>>> utility still works when the crashed kernel has KASLR enabled.
>>>>
>>>> But according to Dave Anderson virsh dump does not work, quoted messages
>>>> from Dave below:
>>>>
>>>> """
>>>> with virsh dump, there's no way of even knowing that KASLR
>>>> has randomized the kernel __START_KERNEL_map region, because there is no
>>>> virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
>>>> vmcoreinfo data to compare against the vmlinux file symbol value.
>>>> Unless virsh dump can export some basic virtual memory data, which
>>>> they say it can't, I don't see how KASLR can ever be supported.
>>>> """
>>>>
>>>> I assume virsh dump is using qemu guest memory dump facility so it
>>>> should be first addressed in qemu. Thus post this query to qemu devel
>>>> list. If this is not correct please let me know.
>>
>> IIRC, 'virsh dump --memory-only' uses dump-guest-memory, and 'virsh dump'
>> uses migration to dump.
> 
> Do they need different fixes? Dave, I guess you mean --memory-only, but
> could you clarify and confirm it?
> 
>>
>> I think I should study kaslr first...
> 
> Thanks for taking care of it.

Can you point me to the patch for kexec/kdump? I want to know what I need to do
for dump-guest-memory.

Thanks
Wen Congyang

> 
>>
>> Thanks
>> Wen Congyang
>>
>>>>
>>>> Could you qemu dump people make it work? Otherwise we cannot support virt dump
>>>> as long as KASLR is enabled. The latest Fedora kernel has enabled it on
>>>> x86_64.
>>>>
>>>> Thanks
>>>> Dave
>>>
>>>
>>>
>>
>>
>>
> 
> 
> .
>

Re: [Qemu-devel] virsh dump (qemu guest memory dump?): KASLR enabled linux guest support

2016-11-08 Thread Wen Congyang

On 11/09/2016 11:17 AM, Dave Young wrote:
> Drop qiaonuohan, seems the mail address is wrong..
> 
> On 11/09/16 at 11:01am, Dave Young wrote:
>> Hi,
>>
>> The latest Linux kernel enables KASLR to randomize phys/virt memory
>> addresses. We made some effort to support kexec/kdump so that the crash
>> utility still works when the crashed kernel has KASLR enabled.
>>
>> But according to Dave Anderson virsh dump does not work, quoted messages
>> from Dave below:
>>
>> """
>> with virsh dump, there's no way of even knowing that KASLR
>> has randomized the kernel __START_KERNEL_map region, because there is no
>> virtual address information -- e.g., like "SYMBOL(_stext)" in the kdump
>> vmcoreinfo data to compare against the vmlinux file symbol value.
>> Unless virsh dump can export some basic virtual memory data, which
>> they say it can't, I don't see how KASLR can ever be supported.
>> """
>>
>> I assume virsh dump is using qemu guest memory dump facility so it
>> should be first addressed in qemu. Thus post this query to qemu devel
>> list. If this is not correct please let me know.

IIRC, 'virsh dump --memory-only' uses dump-guest-memory, and 'virsh dump'
uses migration to dump.

I think I should study kaslr first...

Thanks
Wen Congyang

>>
>> Could you qemu dump people make it work? Otherwise we cannot support virt dump
>> as long as KASLR is enabled. The latest Fedora kernel has enabled it on
>> x86_64.
>>
>> Thanks
>> Dave
> 
> 
>

Re: [Qemu-devel] [PATCH] replication: interrupt failover if the main device is closed

2016-10-09 Thread Wen Congyang


At 2016/10/7 20:21, Paolo Bonzini wrote:

Without this change, there is a race condition in tests/test-replication.
Depending on how fast the failover job (active commit) runs, there is a
chance of two bad things happening:

1) replication_done can be called after the secondary has been closed
and hence when the BDRVReplicationState is not valid anymore.

2) two copies of the active disk are present during the
/replication/secondary/stop test (that test runs immediately after
/replication/secondary/start, which tests failover).  This causes the
corruption detector to fire.

Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>


This patch looks fine to me.
Reviewed-by: Wen Congyang <we...@cn.fujitsu.com>


---
  block/replication.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/block/replication.c b/block/replication.c
index 3bd1cf1..5231a00 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -133,6 +133,9 @@ static void replication_close(BlockDriverState *bs)
  if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
  replication_stop(s->rs, false, NULL);
  }
+if (s->replication_state == BLOCK_REPLICATION_FAILOVER) {
+block_job_cancel_sync(s->active_disk->bs->job);
+}

  if (s->mode == REPLICATION_MODE_SECONDARY) {
  g_free(s->top_id);

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-04-06 Thread Wen Congyang

On 04/01/2016 11:20 PM, Max Reitz wrote:
> On 31.03.2016 13:42, Alberto Garcia wrote:
>> On Wed 30 Mar 2016 05:07:15 PM CEST, Max Reitz wrote:
>>>> I also have another (not directly related) question: why not simply
>>>> use the node name when removing children? I understood that the idea
>>>> was that it's possible to have the same node attached twice to the
>>>> same Quorum, but can you actually do that? And what's the use case?
>>>
>>> What I like about using the child role name is that it automatically
>>> prevents you from specifying a node that is not a child of the given
>>> parent.
>>
>> Right, but checking if a node is not a child and returning an error is
>> very simple. And it doesn't require the user to keep track of the node
>> name *and* the child role name.
> 
> Yes. But I think that you need to know parent and child anyway if you
> want to modify (delete) an edge in the graph.
> 
> Also, it may be possible to have multiple parents per node. Actually, it
> is already possible because the BB-BDS relationship is modeled as a
> parent-child relationship. Thus, I'm not sure whether it would be
> sufficient to specify a single node if you want to delete a single edge.
> 
>> Unless I'm forgetting something this would be the first time we expose
>> the child role name in the API, that's why I'm wondering if it's
>> something worth doing.
> 
> Well, the roles are kind of exposed already. It's exactly what you
> specify in -drive or blockdev-add.
> 
>>> Which makes me notice that it might be a good idea to require the user
>>> to specify the child's role when adding a new child. In this version
>>> of this series (where only quorum is supported), the children are just
>>> inserted in numerical order (first free slot is taken first), but
>>> maybe the user wants to insert them in a different order.
>>
>> For the Quorum case it totally makes sense to let the user choose the
>> position of the new child.
>>
>> But for creating a Quorum array in the first place we don't require
>> that, the order is the one that the user provides, and the user does not
>> need to know about the child role names at that point.
> 
> Depends. If you create an empty quorum BDS and then add the children
> using the QAPI command introduced in this series, you are right. But if
> you add children along with creating the quorum BDS (be it via -drive or
> via blockdev-add), one has to specify the child role names.

I think the problem is that the child role name is wrong.

If we always attach the child at the tail, we can do it like this:
the child role name is children.XXX, where XXX is larger than the XXX
of any existing child role name.

For example:
Quorum has one child: children.1 (children.0 has been removed).
We add a new child; its role name is children.2, not children.0.

If we want to attach the child somewhere other than the tail, for example:
Quorum has two children, children.0 and children.1, and the new child should
come before children.1. In this case, we should rename children.1 to children.2,
and the new child's role name can be children.1. If we allow such usage, we
have to rename other children's role names when adding/deleting a child. It means
that the role names must be queried again after adding/deleting a child.
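
A minimal sketch of the tail-append naming rule, in plain C rather than
QEMU code. The helper name next_child_index() and the array-of-strings
representation are assumptions made only for illustration; quorum keeps its
children in its own structures.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of QEMU: return the numeric suffix to use
 * for a child appended at the tail, i.e. one more than the largest
 * "children.N" suffix currently in use (0 if there are no children).
 */
static unsigned next_child_index(const char *const *roles, size_t n)
{
    unsigned next = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned idx;

        /* Ignore role names that do not follow the "children.N" pattern. */
        if (sscanf(roles[i], "children.%u", &idx) == 1 && idx + 1 > next) {
            next = idx + 1;
        }
    }
    return next;
}
```

With only children.1 left, this yields 2, so the new child becomes
children.2 and a removed name is never reused.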

Thanks
Wen Congyang

> 
> Max
>

Re: [Qemu-devel] 'make check' failure on Fedora 23

2016-04-06 Thread Wen Congyang

On 04/06/2016 07:35 AM, Eric Blake wrote:
> Fedora 23 recently pushed acpica-tools.x86_64 20160318-1.fc23; with this
> installed, 'make check-qtest' (part of 'make check') now fails with:
> 
> GTESTER check-qtest-x86_64
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02S920fc706036c189c6183c50ae002dbd1
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02Sccc109c5192857eddb99203160e3f750
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02S4c7289cb7d625a50855ca9a69e2ba8dc
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02Sf7cb7c41ed3718deced4d57cde200401
> /home/eblake/qemu/tests/Makefile:638: recipe for target
> 'check-qtest-x86_64' failed
> 
> Reverting (# dnf downgrade --disablerepo=updates acpica-tools) to
> acpica-tools.x86_64 20150619-2.fc23 lets the tests pass again.  I don't
> know enough about the test or the tools to know if the problem is a
> change in expected output (the qemu test is not tolerant enough, in
> which case we should improve the test), or a regression in upstream
> acpica-tools (in which case I should file a BZ), but this should
> probably be addressed before 2.6.

iasl's output changed after this commit:
https://github.com/acpica/acpica/commit/1ecbb3d551255dab943f3bbe7e9da0145d154bba

Thanks
Wen Congyang

>

Re: [Qemu-devel] 'make check' failure on Fedora 23

2016-04-05 Thread Wen Congyang

On 04/06/2016 07:35 AM, Eric Blake wrote:
> Fedora 23 recently pushed acpica-tools.x86_64 20160318-1.fc23; with this
> installed, 'make check-qtest' (part of 'make check') now fails with:
> 
> GTESTER check-qtest-x86_64
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02S920fc706036c189c6183c50ae002dbd1
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02Sccc109c5192857eddb99203160e3f750
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02S4c7289cb7d625a50855ca9a69e2ba8dc
> **
> ERROR:tests/bios-tables-test.c:455:normalize_asl: assertion failed:
> (block_name)
> GTester: last random seed: R02Sf7cb7c41ed3718deced4d57cde200401
> /home/eblake/qemu/tests/Makefile:638: recipe for target
> 'check-qtest-x86_64' failed
> 
> Reverting (# dnf downgrade --disablerepo=updates acpica-tools) to
> acpica-tools.x86_64 20150619-2.fc23 lets the tests pass again.  I don't
> know enough about the test or the tools to know if the problem is a
> change in expected output (the qemu test is not tolerant enough, in
> which case we should improve the test), or a regression in upstream
> acpica-tools (in which case I should file a BZ), but this should
> probably be addressed before 2.6.
> 

In the function load_asl(), we use iasl to convert the AML file into an
ASL source code file.

I copied the iasl file from /tmp/ to another directory so that I could use
both the old and the new version of iasl to investigate.

Here is the difference:
@@ -18,7 +18,7 @@
  * Compiler ID  "BXPC"
  * Compiler Version 0x0001 (1)
  */
-DefinitionBlock ("t1.aml", "DSDT", 1, "BOCHS ", "BXPCDSDT", 0x0001)
+DefinitionBlock ("", "DSDT", 1, "BOCHS ", "BXPCDSDT", 0x00000001)
 {
 Scope (\)
 {

The newest iasl doesn't output the AML filename: "t1.aml"

The function normalize_asl() expects the output to contain the AML filename...
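
One way the test could tolerate both outputs is to blank the filename
argument before comparing. The following is only a sketch in plain C, not
the code in tests/bios-tables-test.c; the function name
strip_defblock_filename() is an assumption, and it does not deal with the
other change in the diff (0x0001 vs 0x00000001).

```c
#include <string.h>

/*
 * Hypothetical helper, not QEMU code: remove the filename inside
 * DefinitionBlock ("...", in place, so that old and new iasl output look
 * the same in that position. Returns 1 if a DefinitionBlock line was
 * found and normalized, 0 otherwise.
 */
static int strip_defblock_filename(char *asl)
{
    static const char marker[] = "DefinitionBlock (\"";
    char *start = strstr(asl, marker);
    char *end;

    if (!start) {
        return 0;
    }
    start += strlen(marker);      /* first character of the filename */
    end = strchr(start, '"');     /* closing quote of the filename */
    if (!end) {
        return 0;
    }
    /* Drop the filename but keep the closing quote and the rest. */
    memmove(start, end, strlen(end) + 1);
    return 1;
}
```

An empty filename, as emitted by the newer iasl, is left unchanged, so
neither version trips an assertion on a missing name.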

Thanks
Wen Congyang

Re: [Qemu-devel] [PATCH] filter-buffer: fix segfault while start qemu with status=off property

2016-04-01 Thread Wen Congyang

On 04/01/2016 04:24 PM, Hailiang Zhang wrote:
> On 2016/4/1 15:39, Jason Wang wrote:
>>
>>
>> On 04/01/2016 03:08 PM, zhanghailiang wrote:
>>> After commit 338d3f, we support 'status' property for filter object.
>>> The segfault can be triggered by starting qemu with 'status=off' property
>>> for filter, when the s->incoming_queue is NULL, we reference it directly
>>> in qemu_net_queue_flush().
>>>
>>> Let's check the value of 's->incoming_queue' before calling
>>> qemu_net_queue_flush().
>>>
>>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>>> ---
>>>   net/filter-buffer.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/net/filter-buffer.c b/net/filter-buffer.c
>>> index cc6bd94..79e2ce3 100644
>>> --- a/net/filter-buffer.c
>>> +++ b/net/filter-buffer.c
>>> @@ -34,7 +34,7 @@ static void filter_buffer_flush(NetFilterState *nf)
>>>   {
>>>   FilterBufferState *s = FILTER_BUFFER(nf);
>>>
>>> -if (!qemu_net_queue_flush(s->incoming_queue)) {
>>> +if (s->incoming_queue && !qemu_net_queue_flush(s->incoming_queue)) {
>>>   /* Unable to empty the queue, purge remaining packets */
>>>   qemu_net_queue_purge(s->incoming_queue, nf->netdev);
>>>   }
>>
>> We'd better handle this at generic layer and don't let a specific net
>> filter need to worry about this.
>>
>> Looks like the issue is we may trigger status_changed() too early (even
>> before the the filter was initialized).
>>
> 
> Yes ~
> 
>> How about not call status_changed() if the initialization is not done?
>>
> 
> But seems that it is difficult to confirm if the filter is initialized
> or not ...

If nfc->setup() is not called, nf->netdev is NULL.

Thanks
Wen Congyang

> 
>> .
>>
> 
> 
> 
> 
>

Re: [Qemu-devel] [PATCH] crypto: do an explicit check for nettle pbkdf functions

2016-03-29 Thread Wen Congyang

On 03/29/2016 10:50 PM, Daniel P. Berrange wrote:
> Support for the PBKDF functions in nettle was not introduced
> until version 2.6. Some distros QEMU targets have older
> versions and thus lack PBKDF support. Address this by doing
> a check in configure for the desired function and then skipping
> compilation of the nettle-pbkdf.o module
> 
> Reported-by: Wen Congyang <we...@cn.fujitsu.com>
> Signed-off-by: Daniel P. Berrange <berra...@redhat.com>

I built QEMU with this patch. It is OK now.

Thanks
Wen Congyang

> ---
>  configure| 16 
>  crypto/Makefile.objs |  4 ++--
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/configure b/configure
> index f4a03b8..2d78bcd 100755
> --- a/configure
> +++ b/configure
> @@ -308,6 +308,7 @@ gnutls=""
>  gnutls_hash=""
>  gnutls_rnd=""
>  nettle=""
> +nettle_kdf="no"
>  gcrypt=""
>  gcrypt_kdf="no"
>  vte=""
> @@ -2335,6 +2336,17 @@ if test "$nettle" != "no"; then
>  libs_tools="$nettle_libs $libs_tools"
>  QEMU_CFLAGS="$QEMU_CFLAGS $nettle_cflags"
>  nettle="yes"
> +
> +cat > $TMPC << EOF
> +#include 
> +int main(void) {
> + pbkdf2_hmac_sha256(8, NULL, 1000, 8, NULL, 8, NULL);
> + return 0;
> +}
> +EOF
> +if compile_prog "$nettle_cflags" "$nettle_libs" ; then
> +nettle_kdf=yes
> +fi
>  else
>  if test "$nettle" = "yes"; then
>  feature_not_found "nettle" "Install nettle devel"
> @@ -4746,6 +4758,7 @@ if test "$nettle" = "yes"; then
>  else
>  echo "nettle$nettle"
>  fi
> +echo "nettle kdf$nettle_kdf"
>  echo "libtasn1  $tasn1"
>  echo "VTE support   $vte"
>  echo "curses support$curses"
> @@ -5130,6 +5143,9 @@ fi
>  if test "$nettle" = "yes" ; then
>echo "CONFIG_NETTLE=y" >> $config_host_mak
>echo "CONFIG_NETTLE_VERSION_MAJOR=${nettle_version%%.*}" >> 
> $config_host_mak
> +  if test "$nettle_kdf" = "yes" ; then
> +echo "CONFIG_NETTLE_KDF=y" >> $config_host_mak
> +  fi
>  fi
>  if test "$tasn1" = "yes" ; then
>echo "CONFIG_TASN1=y" >> $config_host_mak
> diff --git a/crypto/Makefile.objs b/crypto/Makefile.objs
> index 9f2c87e..0737f48 100644
> --- a/crypto/Makefile.objs
> +++ b/crypto/Makefile.objs
> @@ -11,8 +11,8 @@ crypto-obj-y += secret.o
>  crypto-obj-$(CONFIG_GCRYPT) += random-gcrypt.o
>  crypto-obj-$(if $(CONFIG_GCRYPT),n,$(CONFIG_GNUTLS_RND)) += random-gnutls.o
>  crypto-obj-y += pbkdf.o
> -crypto-obj-$(CONFIG_NETTLE) += pbkdf-nettle.o
> -crypto-obj-$(if $(CONFIG_NETTLE),n,$(CONFIG_GCRYPT_KDF)) += pbkdf-gcrypt.o
> +crypto-obj-$(CONFIG_NETTLE_KDF) += pbkdf-nettle.o
> +crypto-obj-$(if $(CONFIG_NETTLE_KDF),n,$(CONFIG_GCRYPT_KDF)) += 
> pbkdf-gcrypt.o
>  crypto-obj-y += ivgen.o
>  crypto-obj-y += ivgen-essiv.o
>  crypto-obj-y += ivgen-plain.o
>

Re: [Qemu-devel] [RFC for-2.7 1/1] block/qapi: Add query-block-node-tree

2016-03-25 Thread Wen Congyang

On 03/25/2016 03:07 AM, Max Reitz wrote:
> This command returns the tree of BlockDriverStates under a given root
> node.
> 
> Every tree node is described by its node name and the connection of a
> parent node to its children additionally contains the role the child
> assumes.
> 
> A node's name can then be used e.g. in conjunction with
> query-named-block-nodes to get more information about the node.

I found another problem:

{'execute': 'query-block-node-tree', 'arguments': {'root-node': 'disk1' } }
{"return": {"children": [{"role": "children.1", "node": {"children": [{"role": 
"file", "node": {}}], "node-name": "test1"}}, {"role": "children.0", "node": 
{"children": [{"role": "file", "node": {}}]}}]}}

s->children[0] is children.0, and s->children[1] is children.1.
But we output them in reverse order. The reason is:

BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
                             BlockDriverState *child_bs,
                             const char *child_name,
                             const BdrvChildRole *child_role)
{
    BdrvChild *child = bdrv_root_attach_child(child_bs, child_name, child_role);
    QLIST_INSERT_HEAD(&parent_bs->children, child, next);
    return child;
}

We insert the new child at the head, not the tail...
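
The effect can be seen with a tiny stand-in list. This is a sketch only:
struct node, insert_head() and insert_tail() are illustrative names and do
not use QEMU's QLIST macros.

```c
#include <stddef.h>

/* Minimal singly linked list node standing in for a BdrvChild entry. */
struct node {
    int id;
    struct node *next;
};

/* Head insertion: the most recently added node is visited first. */
static void insert_head(struct node **head, struct node *n)
{
    n->next = *head;
    *head = n;
}

/* Tail insertion: nodes are visited in the order they were added. */
static void insert_tail(struct node **head, struct node *n)
{
    struct node **p = head;

    while (*p) {
        p = &(*p)->next;
    }
    n->next = NULL;
    *p = n;
}
```

Adding child 0 and then child 1 with insert_head() makes a walk from the
head visit 1 before 0, which matches the reversed order in the query
output; insert_tail() keeps 0 before 1.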

Thanks
Wen Congyang

> 
> Signed-off-by: Max Reitz <mre...@redhat.com>
> ---
>  block/qapi.c | 43 +++
>  qapi/block-core.json | 46 ++
>  qmp-commands.hx  | 38 ++
>  3 files changed, 127 insertions(+)
> 
> diff --git a/block/qapi.c b/block/qapi.c
> index 6a4869a..a35d32b 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -493,6 +493,49 @@ BlockInfoList *qmp_query_block(Error **errp)
>  return head;
>  }
>  
> +static BlockNodeTreeNode *qmp_query_block_node_tree_by_bs(BlockDriverState 
> *bs)
> +{
> +BlockNodeTreeNode *bntn;
> +BlockNodeTreeChildList **p_next;
> +BdrvChild *child;
> +
> +bntn = g_new0(BlockNodeTreeNode, 1);
> +
> +bntn->node_name = g_strdup(bdrv_get_node_name(bs));
> +bntn->has_node_name = bntn->node_name;
> +
> +p_next = &bntn->children;
> +QLIST_FOREACH(child, &bs->children, next) {
> +BlockNodeTreeChild *bntc;
> +
> +bntc = g_new(BlockNodeTreeChild, 1);
> +*bntc = (BlockNodeTreeChild){
> +.role = g_strdup(child->name),
> +.node = qmp_query_block_node_tree_by_bs(child->bs),
> +};
> +
> +*p_next = g_new0(BlockNodeTreeChildList, 1);
> +(*p_next)->value = bntc;
> +p_next = &(*p_next)->next;
> +}
> +
> +*p_next = NULL;
> +return bntn;
> +}
> +
> +BlockNodeTreeNode *qmp_query_block_node_tree(const char *root_node,
> + Error **errp)
> +{
> +BlockDriverState *bs;
> +
> +bs = bdrv_lookup_bs(root_node, root_node, errp);
> +if (!bs) {
> +return NULL;
> +}
> +
> +return qmp_query_block_node_tree_by_bs(bs);
> +}
> +
>  static bool next_query_bds(BlockBackend **blk, BlockDriverState **bs,
> bool query_nodes)
>  {
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index b1cf77d..754ccd6 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -470,6 +470,52 @@
>  
>  
>  ##
> +# @BlockNodeTreeNode:
> +#
> +# Describes a node in the block node graph.
> +#
> +# @node-name: If present, the node's name.
> +#
> +# @children:  List of the node's children.
> +#
> +# Since: 2.7
> +##
> +{ 'struct': 'BlockNodeTreeNode',
> +  'data': { '*node-name': 'str',
> +'children': ['BlockNodeTreeChild'] } }
> +
> +##
> +# @BlockNodeTreeChild:
> +#
> +# Describes a child node in the block node graph.
> +#
> +# @role: Role the child assumes for its parent, e.g. "file" or "backing".
> +#
> +# @node: The child node's BlockNodeTreeNode structure.
> +#
> +# Since: 2.7
> +##
> +{ 'struct': 'BlockNodeTreeChild',
> +  'data': { 'role': 'str',
> +'node': 'BlockNodeTreeNode' } }
> +
> +##
> +# @query-block-node-tree:
> +#
> +# Queries the tree of nodes under a given node in the block graph.
> +#
> +# @root-node: Node name or device name of the tree's root node.
> +#
> +# Returns: The root node's BlockNodeTreeNode structure.
> +#
> +# Since: 2.7
> +##
> +{ 'command': 'query-block-node-tree',
>

Re: [Qemu-devel] [RFC for-2.7 1/1] block/qapi: Add query-block-node-tree

2016-03-24 Thread Wen Congyang

On 03/25/2016 03:07 AM, Max Reitz wrote:
> This command returns the tree of BlockDriverStates under a given root
> node.
> 
> Every tree node is described by its node name and the connection of a
> parent node to its children additionally contains the role the child
> assumes.
> 
> A node's name can then be used e.g. in conjunction with
> query-named-block-nodes to get more information about the node.

I tested this patch, and it works.
{'execute': 'query-block-node-tree', 'arguments': {'root-node': 'disk1' } }
{"return": {"children": [{"role": "children.0", "node": {"children": [{"role": 
"file", "node": {"children": [], "node-name": "#block175"}}], "node-name": 
"#block267"}}], "node-name": "#block040"}}

Should we hide node names like "#blockxxx"?
If the bs doesn't have any children, should we output '"children": [], '?

Can we add a new parameter: depth? For example, if I only want to know the
names of the quorum's children, we can limit the depth, and the output will be
much clearer.
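
A rough sketch of what a depth limit would mean for the traversal, written
as standalone C rather than against the patch. struct tnode and
visit_limited() are made-up names, and "depth 0 means the root only" is
just one possible convention.

```c
#include <stddef.h>

/* Illustrative tree node; not the BlockNodeTreeNode type from the patch. */
struct tnode {
    const char *name;
    const struct tnode *const *children;
    size_t num_children;
};

/*
 * Count the nodes that would be reported when descending at most
 * max_depth levels below n. With max_depth == 0 only n itself is counted.
 */
static size_t visit_limited(const struct tnode *n, unsigned max_depth)
{
    size_t visited = 1;

    if (max_depth == 0) {
        return visited;
    }
    for (size_t i = 0; i < n->num_children; i++) {
        visited += visit_limited(n->children[i], max_depth - 1);
    }
    return visited;
}
```

With depth 1 on the quorum node, only the quorum and its direct children
would be listed, without the "file" nodes below them.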

Thanks
Wen Congyang

> 
> Signed-off-by: Max Reitz <mre...@redhat.com>
> ---
>  block/qapi.c | 43 +++
>  qapi/block-core.json | 46 ++
>  qmp-commands.hx  | 38 ++
>  3 files changed, 127 insertions(+)
> 
> diff --git a/block/qapi.c b/block/qapi.c
> index 6a4869a..a35d32b 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -493,6 +493,49 @@ BlockInfoList *qmp_query_block(Error **errp)
>  return head;
>  }
>  
> +static BlockNodeTreeNode *qmp_query_block_node_tree_by_bs(BlockDriverState 
> *bs)
> +{
> +BlockNodeTreeNode *bntn;
> +BlockNodeTreeChildList **p_next;
> +BdrvChild *child;
> +
> +bntn = g_new0(BlockNodeTreeNode, 1);
> +
> +bntn->node_name = g_strdup(bdrv_get_node_name(bs));
> +bntn->has_node_name = bntn->node_name;
> +
> +p_next = &bntn->children;
> +QLIST_FOREACH(child, &bs->children, next) {
> +BlockNodeTreeChild *bntc;
> +
> +bntc = g_new(BlockNodeTreeChild, 1);
> +*bntc = (BlockNodeTreeChild){
> +.role = g_strdup(child->name),
> +.node = qmp_query_block_node_tree_by_bs(child->bs),
> +};
> +
> +*p_next = g_new0(BlockNodeTreeChildList, 1);
> +(*p_next)->value = bntc;
> +p_next = &(*p_next)->next;
> +}
> +
> +*p_next = NULL;
> +return bntn;
> +}
> +
> +BlockNodeTreeNode *qmp_query_block_node_tree(const char *root_node,
> + Error **errp)
> +{
> +BlockDriverState *bs;
> +
> +bs = bdrv_lookup_bs(root_node, root_node, errp);
> +if (!bs) {
> +return NULL;
> +}
> +
> +return qmp_query_block_node_tree_by_bs(bs);
> +}
> +
>  static bool next_query_bds(BlockBackend **blk, BlockDriverState **bs,
> bool query_nodes)
>  {
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index b1cf77d..754ccd6 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -470,6 +470,52 @@
>  
>  
>  ##
> +# @BlockNodeTreeNode:
> +#
> +# Describes a node in the block node graph.
> +#
> +# @node-name: If present, the node's name.
> +#
> +# @children:  List of the node's children.
> +#
> +# Since: 2.7
> +##
> +{ 'struct': 'BlockNodeTreeNode',
> +  'data': { '*node-name': 'str',
> +'children': ['BlockNodeTreeChild'] } }
> +
> +##
> +# @BlockNodeTreeChild:
> +#
> +# Describes a child node in the block node graph.
> +#
> +# @role: Role the child assumes for its parent, e.g. "file" or "backing".
> +#
> +# @node: The child node's BlockNodeTreeNode structure.
> +#
> +# Since: 2.7
> +##
> +{ 'struct': 'BlockNodeTreeChild',
> +  'data': { 'role': 'str',
> +'node': 'BlockNodeTreeNode' } }
> +
> +##
> +# @query-block-node-tree:
> +#
> +# Queries the tree of nodes under a given node in the block graph.
> +#
> +# @root-node: Node name or device name of the tree's root node.
> +#
> +# Returns: The root node's BlockNodeTreeNode structure.
> +#
> +# Since: 2.7
> +##
> +{ 'command': 'query-block-node-tree',
> +  'data': { 'root-node': 'str' },
> +  'returns': 'BlockNodeTreeNode' }
> +
> +
> +##
>  # @BlockDeviceTimedStats:
>  #
>  # Statistics of a block device during a given interval of time.
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index 9e05365..5c404aa

Re: [Qemu-devel] [PATCH v16 0/8] Block replication for continuous checkpoints

2016-03-24 Thread Wen Congyang

Ping


On 03/11/2016 06:34 PM, Changlong Xie wrote:
> Block replication is a very important feature which is used for
> continuous checkpoints(for example: COLO).
> 
> You can get the detailed information about block replication from here:
> http://wiki.qemu.org/Features/BlockReplication
> 
> Usage:
> Please refer to docs/block-replication.txt
> 
> This patch series is based on the following patch series:
> 1. http://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg02319.html 
> 
> Patch status:
> 1. Acked patches: none 
> 2. Reviewed patches: patch 4
> 3. Updated patches: patch 7, 8
> 
> You can get the patch here:
> https://github.com/Pating/qemu/tree/changlox/block-replication-v16
> 
> You can get the patch with framework here:
> https://github.com/Pating/qemu/tree/changlox/colo_framework_v15
> 
> TODO:
> 1. Continuous block replication. It will be started after basic functions
>are accepted.
> 
> Changs Log:
> V16:
> 1. Rebase to the newest codes
> 2. Address comments from Stefan & hailiang
> p3: we don't need this patch now
> p4: add "top-id" parameters for secondary
> p6: fix NULL pointer in replication callbacks, remove unnecessary typedefs, 
> add doc comments that explain the semantics of Replication
> p7: Refactor AioContext for thread-safe, remove unnecessary get_top_bs()
> *Note*: I'm working on replication testcase now, will send out in V17
> V15:
> 1. Rebase to the newest codes
> 2. Fix typos and coding style addresed Eric's comments
> 3. Address Stefan's comments
>1) Make backup_do_checkpoint public, drop the changes on BlockJobDriver
>2) Update the message and description for [PATCH 4/9]
>3) Make replication_(start/stop/do_checkpoint)_all as global interfaces
>4) Introduce AioContext lock to protect start/stop/do_checkpoint callbacks
>5) Use BdrvChild instead of holding on to BlockDriverState * pointers
> 4. Clear BDRV_O_INACTIVE for hidden disk's open_flags since commit 09e0c771  
> 5. Introduce replication_get_error_all to check replication status
> 6. Remove useless discard interface
> V14:
> 1. Implement auto complete active commit
> 2. Implement active commit block job for replication.c
> 3. Address the comments from Stefan, add replication-specific API and data
>structure, also remove old block layer APIs
> V13:
> 1. Rebase to the newest codes
> 2. Remove redundant marcos and semicolon in replication.c 
> 3. Fix typos in block-replication.txt
> V12:
> 1. Rebase to the newest codes
> 2. Use backing reference to replcace 'allow-write-backing-file'
> V11:
> 1. Reopen the backing file when starting blcok replication if it is not
>opened in R/W mode
> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>when opening backing file
> 3. Block the top BDS so there is only one block job for the top BDS and
>its backing chain.
> V10:
> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>reference.
> 2. Address the comments from Eric Blake
> V9:
> 1. Update the error messages
> 2. Rebase to the newest qemu
> 3. Split child add/delete support. These patches are sent in another patchset.
> V8:
> 1. Address Alberto Garcia's comments
> V7:
> 1. Implement adding/removing quorum child. Remove the option non-connect.
> 2. Simplify the backing refrence option according to Stefan Hajnoczi's 
> suggestion
> V6:
> 1. Rebase to the newest qemu.
> V5:
> 1. Address the comments from Gong Lei
> 2. Speed the failover up. The secondary vm can take over very quickly even
>if there are too many I/O requests.
> V4:
> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
> V3:
> 1: use error_setg() instead of error_set()
> 2. Add a new block job API
> 3. Active disk, hidden disk and nbd target uses the same AioContext
> 4. Add a testcase to test new hbitmap API
> V2:
> 1. Redesign the secondary qemu(use image-fleecing)
> 2. Use Error objects to return error message
> 3. Address the comments from Max Reitz and Eric Blake
> 
> Changlong Xie (1):
>   Introduce new APIs to do replication operation
> 
> Wen Congyang (7):
>   unblock backup operations in backing file
>   Backup: clear all bitmap when doing block checkpoint
>   Link backup into block core
>   docs: block replication's description
>   auto complete active commit
>   Implement new driver for block replication
>   support replication driver in blockdev-add
> 
>  Makefile.objs  |   1 +
>  block.c|  18 ++
>  block/Makefile.objs|   3 +-
>  block/backup.c |  15 ++
>  block/mirror.c |  13 +-
>  block/replication.c| 6

[Qemu-devel] [PATCH] quorum: Implement bdrv_get_specific_info

2016-03-23 Thread Wen Congyang

The monitor command 'query-block' or 'info block' outputs the format-specific
information, so we can get each child's child-name after this patch. This is
useful for dynamic reconfiguration.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block/quorum.c   | 27 +++
 qapi/block-core.json | 15 ++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index da15465..afe6c3f 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1054,6 +1054,31 @@ static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
 bs->full_open_options = opts;
 }
 
+static ImageInfoSpecific *quorum_get_specific_info(BlockDriverState *bs)
+{
+int i;
+BDRVQuorumState *s = bs->opaque;
+ImageInfoSpecific *spec_info = g_new0(ImageInfoSpecific, 1);
+strList **next;
+
+*spec_info = (ImageInfoSpecific){
+.type = IMAGE_INFO_SPECIFIC_KIND_QUORUM,
+.u = {
+.quorum.data = g_new0(ImageInfoSpecificQuorum, 1),
+},
+};
+
+next = &spec_info->u.quorum.data->child_name;
+for (i = 0; i < s->num_children; i++) {
+*next = g_new0(strList, 1);
+(*next)->value = g_strdup(s->children[i]->name);
+(*next)->next = NULL;
+next = &(*next)->next;
+}
+
+return spec_info;
+}
+
 static BlockDriver bdrv_quorum = {
 .format_name= "quorum",
 .protocol_name  = "quorum",
@@ -1077,6 +1102,8 @@ static BlockDriver bdrv_quorum = {
 
 .is_filter  = true,
 .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+.bdrv_get_specific_info = quorum_get_specific_info,
 };
 
 static void bdrv_quorum_init(void)
diff --git a/qapi/block-core.json b/qapi/block-core.json
index b1cf77d..bd3e12d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -75,6 +75,18 @@
   } }
 
 ##
+# @ImageInfoSpecificQuorum:
+#
+# @child-name: List of child name
+#
+# Since: 2.7
+##
+{ 'struct': 'ImageInfoSpecificQuorum',
+  'data': {
+  'child-name': ['str']
+  } }
+
+##
 # @ImageInfoSpecific:
 #
 # A discriminated record of image format specific information structures.
@@ -85,7 +97,8 @@
 { 'union': 'ImageInfoSpecific',
   'data': {
   'qcow2': 'ImageInfoSpecificQCow2',
-  'vmdk': 'ImageInfoSpecificVmdk'
+  'vmdk': 'ImageInfoSpecificVmdk',
+  'quorum': 'ImageInfoSpecificQuorum'
   } }
 
 ##
-- 
2.5.5

Re: [Qemu-devel] [PULL v3 02/13] crypto: add support for PBKDF2 algorithm

2016-03-23 Thread Wen Congyang

General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see 
> <http://www.gnu.org/licenses/>.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "crypto/pbkdf.h"
> +#include "gcrypt.h"
> +
> +bool qcrypto_pbkdf2_supports(QCryptoHashAlgorithm hash)
> +{
> +switch (hash) {
> +case QCRYPTO_HASH_ALG_MD5:
> +case QCRYPTO_HASH_ALG_SHA1:
> +case QCRYPTO_HASH_ALG_SHA256:
> +return true;
> +default:
> +return false;
> +}
> +}
> +
> +int qcrypto_pbkdf2(QCryptoHashAlgorithm hash,
> +   const uint8_t *key, size_t nkey,
> +   const uint8_t *salt, size_t nsalt,
> +   unsigned int iterations,
> +   uint8_t *out, size_t nout,
> +   Error **errp)
> +{
> +static const int hash_map[QCRYPTO_HASH_ALG__MAX] = {
> +[QCRYPTO_HASH_ALG_MD5] = GCRY_MD_MD5,
> +[QCRYPTO_HASH_ALG_SHA1] = GCRY_MD_SHA1,
> +[QCRYPTO_HASH_ALG_SHA256] = GCRY_MD_SHA256,
> +};
> +int ret;
> +
> +if (hash >= G_N_ELEMENTS(hash_map) ||
> +hash_map[hash] == GCRY_MD_NONE) {
> +error_setg(errp, "Unexpected hash algorithm %d", hash);
> +return -1;
> +}
> +
> +ret = gcry_kdf_derive(key, nkey, GCRY_KDF_PBKDF2,
> +  hash_map[hash],
> +  salt, nsalt, iterations,
> +  nout, out);
> +if (ret != 0) {
> +error_setg(errp, "Cannot derive password: %s",
> +   gcry_strerror(ret));
> +return -1;
> +}
> +
> +return 0;
> +}
> diff --git a/crypto/pbkdf-nettle.c b/crypto/pbkdf-nettle.c
> new file mode 100644
> index 000..1aa7395
> --- /dev/null
> +++ b/crypto/pbkdf-nettle.c
> @@ -0,0 +1,65 @@
> +/*
> + * QEMU Crypto PBKDF support (Password-Based Key Derivation Function)
> + *
> + * Copyright (c) 2015-2016 Red Hat, Inc.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see 
> <http://www.gnu.org/licenses/>.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "crypto/pbkdf.h"
> +#include "nettle/pbkdf2.h"

I get the following build error:

  CC    crypto/pbkdf.o
  CC    crypto/pbkdf-nettle.o
  CC    crypto/ivgen.o
crypto/pbkdf-nettle.c:23:27: error: nettle/pbkdf2.h: No such file or directory
crypto/pbkdf-nettle.c: In function ‘qcrypto_pbkdf2’:
crypto/pbkdf-nettle.c:46: warning: implicit declaration of function 
‘pbkdf2_hmac_sha1’
crypto/pbkdf-nettle.c:46: warning: nested extern declaration of 
‘pbkdf2_hmac_sha1’
crypto/pbkdf-nettle.c:53: warning: implicit declaration of function 
‘pbkdf2_hmac_sha256’
crypto/pbkdf-nettle.c:53: warning: nested extern declaration of 
‘pbkdf2_hmac_sha256’
make: *** [crypto/pbkdf-nettle.o] Error 1
make: *** Waiting for unfinished jobs

rpm -qf /usr/include/nettle/
libnettle-devel-2.4-8.1.2

The nettle version is very old.
The OS is SUSE 11 SP3.

Thanks
Wen Congyang

Re: [Qemu-devel] [PATCH v2 1/1] Introduce "xen-load-devices-state"

2016-03-23 Thread Wen Congyang

On 03/23/2016 04:56 PM, Dr. David Alan Gilbert wrote:
> * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:
>> On 03/22/2016 08:22 PM, Dr. David Alan Gilbert wrote:
>>> * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:
>>>> From: Wen Congyang <we...@cn.fujitsu.com>
>>>>
>>>> Introduce a "xen-load-devices-state" QAPI command that can be used to
>>>> load the state of all devices, but not the RAM or the block devices of
>>>> the VM.
>>>>
>>>> We only have hmp commands savevm/loadvm, and qmp commands
>>>> xen-save-devices-state.
>>>
>>> Can you explain on Xen how the RAM actually gets loaded?
>>
>> Xen use xc(xen toolstack) to do RAM restore/save
>>
>>>
>>>>
>>>> We use this new command for COLO:
>>>> 1. suspend both primary vm and secondary vm
>>>> 2. sync the state
>>>> 3. resume both primary vm and secondary vm
>>>>
>>>> In such case, we need to update all devices' state in any time.
>>>>
>>>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>>>> Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
>>>> ---
>>>>  migration/savevm.c | 36 
>>>>  qapi-schema.json   | 14 ++
>>>>  qmp-commands.hx| 27 +++
>>>>  3 files changed, 77 insertions(+)
>>>>
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index 96e7db5..aaead12 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -50,6 +50,7 @@
>>>>  #include "qemu/iov.h"
>>>>  #include "block/snapshot.h"
>>>>  #include "block/qapi.h"
>>>> +#include "hw/xen/xen.h"
>>>>
>>>>
>>>>  #ifndef ETH_P_RARP
>>>> @@ -1768,6 +1769,12 @@ qemu_loadvm_section_start_full(QEMUFile *f, 
>>>> MigrationIncomingState *mis)
>>>>  return -EINVAL;
>>>>  }
>>>>
>>>> +/* Validate if it is a device's state */
>>>> +if (xen_enabled() && se->is_ram) {
>>>> +error_report("loadvm: %s RAM loading not allowed on Xen", idstr);
>>>> +return -EINVAL;
>>>> +}
>>>> +
>>>>  /* Add entry */
>>>>  le = g_malloc0(sizeof(*le));
>>>>
>>>> @@ -2077,6 +2084,35 @@ void qmp_xen_save_devices_state(const char 
>>>> *filename, Error **errp)
>>>>  }
>>>>  }
>>>>
>>>> +void qmp_xen_load_devices_state(const char *filename, Error **errp)
>>>> +{
>>>> +QEMUFile *f;
>>>> +int saved_vm_running;
>>>> +int ret;
>>>> +
>>>> +saved_vm_running = runstate_is_running();
>>>> +vm_stop(RUN_STATE_RESTORE_VM);
>>>> +
>>>> +f = qemu_fopen(filename, "rb");
>>>> +if (!f) {
>>>> +error_setg_file_open(errp, errno, filename);
>>>> +goto out;
>>>> +}
>>>> +
>>>> +migration_incoming_state_new(f);
>>>> +ret = qemu_loadvm_state(f);
>>>> +qemu_fclose(f);
>>>> +migration_incoming_state_destroy();
>>>> +if (ret < 0) {
>>>> +error_setg(errp, QERR_IO_ERROR);
>>>> +}
>>>> +
>>>> +out:
>>>> +if (saved_vm_running) {
>>>> +vm_start();
>>>> +}
>>>
>>> Does it ever happen that you had it running immediately
>>> before you did this command? Somehow you'd have to have loaded the RAM
>>
>> No, we suspend vm before running this command in xen to make sure the
>> condition that you discribed never happen.
> 
> OK, so then if the VM is suspended, what does the:
> 
> if (saved_vm_running) {
> vm_start();
> }
> 
> at the end of your routine do?

+saved_vm_running = runstate_is_running();
+vm_stop(RUN_STATE_RESTORE_VM);

This was copied from other code in QEMU.
I think we should return failure if the VM is running:
if (runstate_is_running()) {
error_setg(xxx);
return;
}
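
A minimal sketch of that check, with the QEMU pieces stubbed out so that it
compiles on its own. SketchError, sketch_error_set() and
sketch_load_devices_state() are hypothetical stand-ins for Error **errp,
error_setg() and qmp_xen_load_devices_state(); the error text is an
assumption too, and the real loading step is only a comment:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for QEMU's Error object. */
typedef struct SketchError {
    bool set;
    char msg[96];
} SketchError;

/* Hypothetical stand-in for error_setg(): record one message. */
static void sketch_error_set(SketchError *err, const char *msg)
{
    assert(err != NULL);
    err->set = true;
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
}

/*
 * Sketch of the proposed behaviour: refuse to load the device state
 * while the guest is running. vm_running stands in for the result of
 * runstate_is_running(). Returns 0 on success, -1 on refusal.
 */
static int sketch_load_devices_state(bool vm_running, SketchError *err)
{
    if (vm_running) {
        sketch_error_set(err,
                         "guest must be stopped before loading device state");
        return -1;
    }

    /* The real command would open the file and load the device state here. */
    return 0;
}
```

With this shape, saved_vm_running and the final vm_start() would probably no
longer be needed, because the command either refuses to run or leaves the
guest stopped.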

Thanks
Wen Congyang

> 
> Dave
> 
> 
>>
>> Thanks
>>  -Xie
>>
>>> at just the right point, and I don't see how that would happen if the guest
>>> w

Re: [Qemu-devel] [PATCH V4 1/2] net/filter-mirror: implement filter-redirector

2016-03-20 Thread Wen Congyang

On 03/17/2016 02:10 PM, Jason Wang wrote:
> 
> 
> On 03/16/2016 05:34 PM, Wen Congyang wrote:
>> On 03/16/2016 04:18 PM, Jason Wang wrote:
>>>>
>>>>
>>>> On 03/15/2016 06:03 PM, Zhang Chen wrote:
>>>>>> Filter-redirector is a netfilter plugin.
>>>>>> It gives qemu the ability to redirect net packet.
>>>>>> redirector can redirect filter's net packet to outdev.
>>>>>> and redirect indev's packet to filter.
>>>>>>
>>>>>>   filter
>>>>>> +
>>>>>> |
>>>>>> |
>>>>>> redirector  |
>>>>>>+--+
>>>>>>|| |
>>>>>>|| |
>>>>>>|| |
>>>>>>   indev +---+   +-->  outdev
>>>>>>|| |
>>>>>>|| |
>>>>>>|| |
>>>>>>+--+
>>>>>> |
>>>>>>         |
>>>>>> v
>>>>>>   filter
>>>>>>
>>>>>> usage:
>>>>>>
>>>>>> -netdev user,id=hn0
>>>>>> -chardev socket,id=s0,host=ip_primary,port=X,server,nowait
>>>>>> -chardev socket,id=s1,host=ip_primary,port=Y,server,nowait
>>>>>> -filter-redirector,id=r0,netdev=hn0,queue=tx/rx/all,indev=s0,outdev=s1
>>>>>>
>>>>>> Signed-off-by: Zhang Chen <zhangchen.f...@cn.fujitsu.com>
>>>>>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>>>>>> Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com>
>>>>>> ---
>>>>>>  net/filter-mirror.c | 236 
>>>>>> 
>>>>>>  qemu-options.hx |   9 ++
>>>>>>  vl.c|   3 +-
>>>>>>  3 files changed, 247 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/net/filter-mirror.c b/net/filter-mirror.c
>>>>>> index 1b1ec16..77ece41 100644
>>>>>> --- a/net/filter-mirror.c
>>>>>> +++ b/net/filter-mirror.c
>>>>>> @@ -26,12 +26,23 @@
>>>>>>  #define FILTER_MIRROR(obj) \
>>>>>>  OBJECT_CHECK(MirrorState, (obj), TYPE_FILTER_MIRROR)
>>>>>>  
>>>>>> +#define FILTER_REDIRECTOR(obj) \
>>>>>> +OBJECT_CHECK(MirrorState, (obj), TYPE_FILTER_REDIRECTOR)
>>>>>> +
>>>>>>  #define TYPE_FILTER_MIRROR "filter-mirror"
>>>>>> +#define TYPE_FILTER_REDIRECTOR "filter-redirector"
>>>>>> +#define REDIRECTOR_MAX_LEN NET_BUFSIZE
>>>>>>  
>>>>>>  typedef struct MirrorState {
>>>>>>  NetFilterState parent_obj;
>>>>>> +char *indev;
>>>>>>  char *outdev;
>>>>>> +CharDriverState *chr_in;
>>>>>>  CharDriverState *chr_out;
>>>>>> +int state; /* 0 = getting length, 1 = getting data */
>>>>>> +unsigned int index;
>>>>>> +unsigned int packet_len;
>>>>>> +uint8_t buf[REDIRECTOR_MAX_LEN];
>>>>>>  } MirrorState;
>>>>>>  
>>>>>>  static int filter_mirror_send(CharDriverState *chr_out,
>>>>>> @@ -68,6 +79,89 @@ err:
>>>>>>  return ret < 0 ? ret : -EIO;
>>>>>>  }
>>>>>>  
>>>>>> +static void
>>>>>> +redirector_to_filter(NetFilterState *nf, const uint8_t *buf, int len)
>>>>>> +{
>>>>>> +struct iovec iov = {
>>>>>> +.iov_base = (void *)buf,
>>>>>> +.iov_len = len,
>>>>>> +};
>>>>>> +
>>>>>> +if (nf->direction == NET_FILTER_DIRECTION_ALL ||
>>>>>> +nf->direction == NET_FILTER_DIRECTION_TX) {
>>>>>> +qemu_netfilter_pass_to_next(nf->netdev, 0, &iov, 1

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-03-19 Thread Wen Congyang

On 03/17/2016 05:48 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 03/17/2016 05:10 PM, Alberto Garcia wrote:
>>> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <we...@cn.fujitsu.com> 
>>> wrote:
>>>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>>>>   bool rewrite_corrupted;/* true if the driver must 
>>>>>>>> rewrite-on-read corrupted
>>>>>>>>   * block if Quorum is reached.
>>>>>>>>   */
>>>>>>>> +unsigned long *index_bitmap;
>>>>>>
>>>>>> Hi Berto
>>>>>>
>>>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>>>>> lastest one, as Kevin suggested we introduce
>>>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>>>>> why we need this two functions here.
>>>>>
>>>>> I'm sorry I missed this discussion earlier. Your code seems technically
>>>>> correct but I have several questions:
>>>>>
>>>>> - I read that one of the reasons for this change is that "In theory, the
>>>>>   same node could be attached twice to the same parent in different
>>>>>   roles.". Is there any example of that? What's the use case?
>>>>
>>>> Kevin may know the case.
>>>
>>> Kevin, do you have an example?
>>>
>>>>> - How do you obtain the child name?
>>>>
>>>> IIRC, the answer is no now. I think we can improve 'info block' output
>>>
>>> Okay, but then we should extend that first, otherwise this API cannot be
>>> used.
>>>
>>>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>>>>   understanding it correctly). However the position in the s->children
>>>>>   array will change because you do memmove() when you remove children.0
>>>>>   and then add it again to the end of the array.
>>>>>
>>>>>   Initial status:
>>>>>
>>>>> s->children[0] <--> "children.0" (hd0.qcow2)
>>>>> s->children[1] <--> "children.1" (hd1.qcow2)
>>>>>
>>>>>   children.0 (hd0.qcow2) is removed:
>>>>>
>>>>> s->children[0] <--> "children.1" (hd1.qcow2)
>>>>>
>>>>>   children.0 (hd0.qcow2) is added again:
>>>>>
>>>>> s->children[0] <--> "children.1" (hd1.qcow2)
>>>>> s->children[1] <--> "children.0" (hd0.qcow2)
>>>>
>>>> Yes, it is correct.
>>>>
>>>>>
>>>>>   Is this correct? Is this the indented behavior? Since you are reading
>>>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>>>>   children.1 was the secondary disk, it has just become the primary.
>>>>
>>>> Yes.
>>>
>>> And don't you need a way to control the order in which the disks must be
>>> read for COLO?
>>
>> I think in fifo mode, we should read the disk first that is added earlier.
>>
>> We don't need a way to control the order now.
> 
> Can you document fully how it's used in COLO then?

Do you mean document it in docs/block-replication.txt?

> We should have the failure modes documented, and how you'll use
> it after failover etc   Without that it's really difficult to tell
> if this naming is right.

For COLO, children.0 is the real disk and children.1 is the replication
driver. After a failure, the user removes children.1. If we want to
continue running COLO, we need to add a new children.1.

> The children.0 notation is really confusing in the way that Berto
> describes; I hit this a couple of months ago and it really doesn't
> make sense.

Do you mean: read from children.1 first, and then read from children.0 in
fifo mode? Yes, the behavior is very strange.
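
A small self-contained sketch of the behavior Berto describes, assuming the
children are kept in a plain array, removal closes the gap with memmove(),
and a re-added child goes to the end. SketchQuorum and the sketch_* helpers
are hypothetical names used only for illustration, not the quorum driver's
code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_CHILDREN 8

/* Hypothetical model of the children array: names kept in read order. */
typedef struct SketchQuorum {
    const char *children[SKETCH_MAX_CHILDREN];
    int num_children;
} SketchQuorum;

/* A new child is always appended at the end of the array. */
static int sketch_add_child(SketchQuorum *s, const char *name)
{
    if (s->num_children >= SKETCH_MAX_CHILDREN) {
        return -1;
    }
    s->children[s->num_children++] = name;
    return 0;
}

/* Remove a child by name and close the gap with memmove(). */
static int sketch_del_child(SketchQuorum *s, const char *name)
{
    int i;

    for (i = 0; i < s->num_children; i++) {
        if (strcmp(s->children[i], name) == 0) {
            memmove(&s->children[i], &s->children[i + 1],
                    (s->num_children - i - 1) * sizeof(s->children[0]));
            s->num_children--;
            return 0;
        }
    }
    return -1;
}

/* In FIFO mode the first array slot is the one that is read first. */
static const char *sketch_fifo_first(const SketchQuorum *s)
{
    return s->num_children > 0 ? s->children[0] : NULL;
}
```

Adding children.0 and children.1, then removing and re-adding children.0,
leaves children.1 in slot 0, so the name no longer says anything about the
read order.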

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Berto
>>>
>>>
>>> .
>>>
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-03-19 Thread Wen Congyang

On 03/16/2016 08:38 PM, Alberto Garcia wrote:
> On Mon 14 Mar 2016 07:02:08 AM CET, Changlong Xie <xiecl.f...@cn.fujitsu.com> 
> wrote:
> 
>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>   bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
>>>> corrupted
>>>>   * block if Quorum is reached.
>>>>   */
>>>> +unsigned long *index_bitmap;
>>
>> Hi Berto
>>
>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>> lastest one, as Kevin suggested we introduce
>> "child->child_name"(formart as "children.xxx"), this is the key cause
>> why we need this two functions here.
> 
> I'm sorry I missed this discussion earlier. Your code seems technically
> correct but I have several questions:
> 
> - I read that one of the reasons for this change is that "In theory, the
>   same node could be attached twice to the same parent in different
>   roles.". Is there any example of that? What's the use case?

Kevin may know the case.

> 
> - How do you obtain the child name?

IIRC, there is currently no way to get it. I think we can improve the 'info block' output

> 
> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>   and hd1.qcow2), then you remove children.0 and add it again, it will
>   keep the 'children.0' name (that's what the bitmap is for if I'm
>   understanding it correctly). However the position in the s->children
>   array will change because you do memmove() when you remove children.0
>   and then add it again to the end of the array.
> 
>   Initial status:
> 
> s->children[0] <--> "children.0" (hd0.qcow2)
> s->children[1] <--> "children.1" (hd1.qcow2)
> 
>   children.0 (hd0.qcow2) is removed:
> 
> s->children[0] <--> "children.1" (hd1.qcow2)
> 
>   children.0 (hd0.qcow2) is added again:
> 
> s->children[0] <--> "children.1" (hd1.qcow2)
> s->children[1] <--> "children.0" (hd0.qcow2)

Yes, it is correct.

> 
>   Is this correct? Is this the indented behavior? Since you are reading
>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>   children.1 was the secondary disk, it has just become the primary.

Yes.

Thanks
Wen Congyang

> 
> I also think that it would be great to have tests for this
> functionality, but they can be added later.
> 
> Thanks,
> 
> Berto
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-03-19 Thread Wen Congyang

On 03/17/2016 05:10 PM, Alberto Garcia wrote:
> On Thu 17 Mar 2016 02:22:40 AM CET, Wen Congyang <we...@cn.fujitsu.com> wrote:
>>>>>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>>>>>   bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
>>>>>> corrupted
>>>>>>   * block if Quorum is reached.
>>>>>>   */
>>>>>> +unsigned long *index_bitmap;
>>>>
>>>> Hi Berto
>>>>
>>>> *NOTE*, In the old version, we just used "bs->node_name", but in the
>>>> lastest one, as Kevin suggested we introduce
>>>> "child->child_name"(formart as "children.xxx"), this is the key cause
>>>> why we need this two functions here.
>>>
>>> I'm sorry I missed this discussion earlier. Your code seems technically
>>> correct but I have several questions:
>>>
>>> - I read that one of the reasons for this change is that "In theory, the
>>>   same node could be attached twice to the same parent in different
>>>   roles.". Is there any example of that? What's the use case?
>>
>> Kevin may know the case.
> 
> Kevin, do you have an example?
> 
>>> - How do you obtain the child name?
>>
>> IIRC, the answer is no now. I think we can improve 'info block' output
> 
> Okay, but then we should extend that first, otherwise this API cannot be
> used.
> 
>>> - I see that if you have children.0 and children.1 (let's say hd0.qcow2
>>>   and hd1.qcow2), then you remove children.0 and add it again, it will
>>>   keep the 'children.0' name (that's what the bitmap is for if I'm
>>>   understanding it correctly). However the position in the s->children
>>>   array will change because you do memmove() when you remove children.0
>>>   and then add it again to the end of the array.
>>>
>>>   Initial status:
>>>
>>> s->children[0] <--> "children.0" (hd0.qcow2)
>>> s->children[1] <--> "children.1" (hd1.qcow2)
>>>
>>>   children.0 (hd0.qcow2) is removed:
>>>
>>> s->children[0] <--> "children.1" (hd1.qcow2)
>>>
>>>   children.0 (hd0.qcow2) is added again:
>>>
>>> s->children[0] <--> "children.1" (hd1.qcow2)
>>> s->children[1] <--> "children.0" (hd0.qcow2)
>>
>> Yes, it is correct.
>>
>>>
>>>   Is this correct? Is this the indented behavior? Since you are reading
>>>   in FIFO mode, now hd1.qcow2 will always be read first, so if
>>>   children.1 was the secondary disk, it has just become the primary.
>>
>> Yes.
> 
> And don't you need a way to control the order in which the disks must be
> read for COLO?

I think that in FIFO mode we should read first from the disk that was added earlier.

We don't need a way to control the order for now.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-03-19 Thread Wen Congyang

On 03/17/2016 07:25 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 03/17/2016 06:07 PM, Alberto Garcia wrote:
>>> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>>>>> We should have the failure modes documented, and how you'll use it
>>>>> after failover etc Without that it's really difficult to tell if this
>>>>> naming is right.
>>>>
>>>> For COLO, children.0 is the real disk, children.1 is replication
>>>> driver.  After failure, children.1 will be removed by the user. If we
>>>> want to continue do COLO, we need add a new children.1 again.
>>>
>>> What if children.0 fails ?
>>
>> For COLO, reading from children.1 always fails. if children.0 fails, it
>> means that reading from the disk fails. The guest vm will see the I/O error.
> 
> How do we get that to cause a fail over before the guest detects it?
> If the primary's local disk (children.0) fails then if we can failover
> at that point then the guest carries running on the secondary without
> ever knowing about the failure.

COLO is not designed for that case. children.0 can itself be a quorum, so
you can add more than one real disk and get more reliability. Another
option is for the real disk to be external storage that has its own
replication solution.

COLO is designed for this case: the host crashes, the guest is still
alive after failover, and the client does not notice the event.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> Berto
>>>
>>>
>>> .
>>>
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-03-18 Thread Wen Congyang

On 03/17/2016 06:07 PM, Alberto Garcia wrote:
> On Thu 17 Mar 2016 10:56:09 AM CET, Wen Congyang wrote:
>>> We should have the failure modes documented, and how you'll use it
>>> after failover etc Without that it's really difficult to tell if this
>>> naming is right.
>>
>> For COLO, children.0 is the real disk, children.1 is replication
>> driver.  After failure, children.1 will be removed by the user. If we
>> want to continue do COLO, we need add a new children.1 again.
> 
> What if children.0 fails ?

For COLO, reading from children.1 always fails. If children.0 fails, it
means that reading from the disk fails. The guest VM will see the I/O error.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>

[Qemu-devel] [PATCH] quorum: add child name into filename

2016-03-18 Thread Wen Congyang

The monitor command 'query-block' or 'info block' outputs the filename, so
after this patch we can get each child's child-name. This is useful for
dynamic reconfiguration.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block/quorum.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index da15465..182766a 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1036,9 +1036,13 @@ static void quorum_refresh_filename(BlockDriverState 
*bs, QDict *options)
 
 children = qlist_new();
 for (i = 0; i < s->num_children; i++) {
-QINCREF(s->children[i]->bs->full_open_options);
-qlist_append_obj(children,
- QOBJECT(s->children[i]->bs->full_open_options));
+QDict *child_opts;
+const char *child_name = s->children[i]->name;
+
+child_opts = qdict_clone_shallow(s->children[i]->bs->full_open_options);
+qdict_put_obj(child_opts, "child-name",
+  QOBJECT(qstring_from_str(child_name)));
+qlist_append_obj(children, QOBJECT(child_opts));
 }
 
 opts = qdict_new();
-- 
2.5.0

Re: [Qemu-devel] [PATCH V4 1/2] net/filter-mirror: implement filter-redirector

2016-03-16 Thread Wen Congyang

On 03/16/2016 04:18 PM, Jason Wang wrote:
> 
> 
> On 03/15/2016 06:03 PM, Zhang Chen wrote:
>> Filter-redirector is a netfilter plugin.
>> It gives qemu the ability to redirect net packet.
>> redirector can redirect filter's net packet to outdev.
>> and redirect indev's packet to filter.
>>
>>   filter
>> +
>> |
>> |
>> redirector  |
>>+--+
>>|| |
>>|| |
>>|| |
>>   indev +---+   +-->  outdev
>>|| |
>>|| |
>>|| |
>>+--+
>> |
>> |
>> v
>>   filter
>>
>> usage:
>>
>> -netdev user,id=hn0
>> -chardev socket,id=s0,host=ip_primary,port=X,server,nowait
>> -chardev socket,id=s1,host=ip_primary,port=Y,server,nowait
>> -filter-redirector,id=r0,netdev=hn0,queue=tx/rx/all,indev=s0,outdev=s1
>>
>> Signed-off-by: Zhang Chen <zhangchen.f...@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>> Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com>
>> ---
>>  net/filter-mirror.c | 236 
>> 
>>  qemu-options.hx |   9 ++
>>  vl.c|   3 +-
>>  3 files changed, 247 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/filter-mirror.c b/net/filter-mirror.c
>> index 1b1ec16..77ece41 100644
>> --- a/net/filter-mirror.c
>> +++ b/net/filter-mirror.c
>> @@ -26,12 +26,23 @@
>>  #define FILTER_MIRROR(obj) \
>>  OBJECT_CHECK(MirrorState, (obj), TYPE_FILTER_MIRROR)
>>  
>> +#define FILTER_REDIRECTOR(obj) \
>> +OBJECT_CHECK(MirrorState, (obj), TYPE_FILTER_REDIRECTOR)
>> +
>>  #define TYPE_FILTER_MIRROR "filter-mirror"
>> +#define TYPE_FILTER_REDIRECTOR "filter-redirector"
>> +#define REDIRECTOR_MAX_LEN NET_BUFSIZE
>>  
>>  typedef struct MirrorState {
>>  NetFilterState parent_obj;
>> +char *indev;
>>  char *outdev;
>> +CharDriverState *chr_in;
>>  CharDriverState *chr_out;
>> +int state; /* 0 = getting length, 1 = getting data */
>> +unsigned int index;
>> +unsigned int packet_len;
>> +uint8_t buf[REDIRECTOR_MAX_LEN];
>>  } MirrorState;
>>  
>>  static int filter_mirror_send(CharDriverState *chr_out,
>> @@ -68,6 +79,89 @@ err:
>>  return ret < 0 ? ret : -EIO;
>>  }
>>  
>> +static void
>> +redirector_to_filter(NetFilterState *nf, const uint8_t *buf, int len)
>> +{
>> +struct iovec iov = {
>> +.iov_base = (void *)buf,
>> +.iov_len = len,
>> +};
>> +
>> +if (nf->direction == NET_FILTER_DIRECTION_ALL ||
>> +nf->direction == NET_FILTER_DIRECTION_TX) {
>> +qemu_netfilter_pass_to_next(nf->netdev, 0, &iov, 1, nf);
>> +}
>> +
>> +if (nf->direction == NET_FILTER_DIRECTION_ALL ||
>> +nf->direction == NET_FILTER_DIRECTION_RX) {
>> +qemu_netfilter_pass_to_next(nf->netdev->peer, 0, &iov, 1, nf);
>> + }
>> +}
>> +
>> +static int redirector_chr_can_read(void *opaque)
>> +{
>> +return REDIRECTOR_MAX_LEN;
>> +}
>> +
>> +static void redirector_chr_read(void *opaque, const uint8_t *buf, int size)
>> +{
>> +NetFilterState *nf = opaque;
>> +MirrorState *s = FILTER_REDIRECTOR(nf);
>> +unsigned int l;
>> +
>> +if (size == 0) {
>> +/* the peer is closed ? */
>> +return ;
>> +}
> 
> Looks like if you want to handle connection close, you need use event
> handler when calling qemu_chr_add_handlers().

In which case will we see size == 0 if we don't have an event handler?

For the redirector filter, I think we don't care whether the char device
is disconnected. Once the char device is ready again, we will continue
to read from it.

So I think we just need to add more comments here.
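
To illustrate why a zero-size read is harmless for the reassembly, here is a
self-contained sketch of a length-prefixed reader in the same style (4-byte
big-endian length, then payload). SketchReasm, sketch_chr_read() and
SKETCH_MAX_LEN are hypothetical names, and the handling of empty or oversized
lengths is an assumption, not what the patch does:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_LEN 2048

/* Hypothetical reassembly state, modelled on the fields in MirrorState. */
typedef struct SketchReasm {
    int state;               /* 0 = getting length, 1 = getting data */
    unsigned int index;      /* bytes collected so far in this state */
    unsigned int packet_len; /* payload length from the 4-byte header */
    unsigned int delivered;  /* number of complete packets seen */
    unsigned int last_len;   /* length of the last complete packet */
    uint8_t buf[SKETCH_MAX_LEN];
} SketchReasm;

static void sketch_chr_read(SketchReasm *s, const uint8_t *buf, int size)
{
    unsigned int l;

    /* size == 0 skips the loop, so partial state is simply kept. */
    while (size > 0) {
        if (s->state == 0) {
            /* Collect the 4-byte big-endian length header. */
            l = 4 - s->index;
            if (l > (unsigned int)size) {
                l = size;
            }
            memcpy(s->buf + s->index, buf, l);
            buf += l;
            size -= l;
            s->index += l;
            if (s->index == 4) {
                s->packet_len = ((uint32_t)s->buf[0] << 24) |
                                ((uint32_t)s->buf[1] << 16) |
                                ((uint32_t)s->buf[2] << 8) | s->buf[3];
                s->index = 0;
                if (s->packet_len > SKETCH_MAX_LEN) {
                    return; /* assumption: drop the rest of this read */
                }
                if (s->packet_len > 0) {
                    s->state = 1; /* empty packets are skipped */
                }
            }
        } else {
            /* Collect payload bytes until the packet is complete. */
            l = s->packet_len - s->index;
            if (l > (unsigned int)size) {
                l = size;
            }
            memcpy(s->buf + s->index, buf, l);
            buf += l;
            size -= l;
            s->index += l;
            if (s->index == s->packet_len) {
                s->delivered++;
                s->last_len = s->packet_len;
                s->index = 0;
                s->state = 0;
            }
        }
    }
}
```

Because the partial header or payload stays in the state structure, a read of
size 0 in the middle of a packet changes nothing, and the next non-empty read
continues where the previous one stopped.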

> 
>> +
>> +/* most of code is stolen from net_socket_send */
> 
> This comment seems redundant.
> 
>> +while (size > 0) {
>> +/* reassemble a packet from the network */
>> +switch (s->state) {
>> +case 0:
>> +l = 4 - s

Re: [Qemu-devel] [PATCH v12 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-03-15 Thread Wen Congyang

On 03/11/2016 08:21 PM, Alberto Garcia wrote:
> On Thu 10 Mar 2016 03:49:40 AM CET, Changlong Xie wrote:
>> @@ -81,6 +82,8 @@ typedef struct BDRVQuorumState {
>>  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
>> corrupted
>>  * block if Quorum is reached.
>>  */
>> +unsigned long *index_bitmap;
>> +int bsize;
>   [...]
>> +static int get_new_child_index(BDRVQuorumState *s)
>   [...]
>> +static void remove_child_index(BDRVQuorumState *s, int index)
>   [...]
> 
> Sorry if I missed a previous discussion, but why is this necessary?

Hi, Alberto Garcia

Do you have any more comments on this patch, or could you give an R-B?

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>

Re: [Qemu-devel] [PATCH V9 0/2] net/filter-mirror:add filter-mirror and unit test

2016-03-15 Thread Wen Congyang

On 03/15/2016 03:04 PM, Jason Wang wrote:
> 
> 
> On 03/15/2016 01:38 PM, Zhang Chen wrote:
>> Filter-mirror is a netfilter plugin.
>> It gives qemu the ability to mirror
>> packets to a chardev.
>>
>> v9:
>>  - add qmp("{ 'execute' : 'query-status'}")
>>before iov_send() and change pipe
>>to socket in test-filter-mirror.c
> 
> Want to merge the series, but it doesn't build on my laptop (and another
> machine).
> 
> CHK version_gen.h
>   CCnet/filter-mirror.o
> In file included from /home/devel/git/qemu/include/net/filter.h:12:0,
>  from net/filter-mirror.c:12:
> /home/devel/git/qemu/include/qom/object.h:300:39: error: unknown type
> name ‘Error’
>Error **errp);

I think he is not using the newest commit.
Since commit 2744d920, we should include qemu/osdep.h first.

Thanks
Wen Congyang

> 
>> v8:
>>  - The outdev of filter-mirror test changed
>>from -chardev socket to -chardev pipe
>>
>> v7:
>>  - fix mktemp() to mkstemp()
>>
>> v6:
>>  - Address Jason's comments.
>>
>> v5:
>>  - Address Jason's comments.
>>
>> v4:
>>  - Address Jason's comments.
>>
>> v3:
>>  - Add filter-mirror unit test according
>>to Jason's comments
>>  - Address zhanghailiang's comments.
>>  - Address Jason's comments.
>>
>> v2:
>>  - Address zhanghailiang's comments.
>>  - Address Eric Blake's comments.
>>  - Address Yang Hongyang's comments.
>>  - Address Dave's comments.
>>
>> v1:
>>  initial patch.
>>
>>
>> Zhang Chen (2):
>>   net/filter-mirror:Add filter-mirror
>>   tests/test-filter-mirror:add filter-mirror unit test
>>
>>  net/Makefile.objs  |   1 +
>>  net/filter-mirror.c| 181 
>> +
>>  qemu-options.hx|   5 ++
>>  tests/.gitignore   |   1 +
>>  tests/Makefile |   2 +
>>  tests/test-filter-mirror.c |  92 +++
>>  vl.c   |   3 +-
>>  7 files changed, 284 insertions(+), 1 deletion(-)
>>  create mode 100644 net/filter-mirror.c
>>  create mode 100644 tests/test-filter-mirror.c
>>
> 
> 
> 
> .
>

Re: [Qemu-devel] [PATCH] quorum: Fix crash in quorum_aio_cb()

2016-03-10 Thread Wen Congyang

On 03/10/2016 08:13 PM, Alberto Garcia wrote:
> quorum_aio_cb() emits the QUORUM_REPORT_BAD event if there's
> an I/O error in a Quorum child. However sacb->aiocb must be
> correctly initialized for this to happen. read_quorum_children() and
> read_fifo_child() are not doing this, which results in a QEMU crash.

If we use FIFO mode, we don't call quorum_report_bad() in quorum_aio_cb().
But it is OK to initialize sacb->aiocb for it.

> 
> Signed-off-by: Alberto Garcia <be...@igalia.com>
> Reviewed-by: Max Reitz <mre...@redhat.com>

Reviewed-by: Wen Congyang <we...@cn.fujitsu.com>

> ---
>  block/quorum.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index b9ba028..e640688 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -646,8 +646,9 @@ static BlockAIOCB *read_quorum_children(QuorumAIOCB *acb)
>  }
>  
>  for (i = 0; i < s->num_children; i++) {
> -bdrv_aio_readv(s->children[i]->bs, acb->sector_num, &acb->qcrs[i].qiov,
> -   acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
> +acb->qcrs[i].aiocb = bdrv_aio_readv(s->children[i]->bs, acb->sector_num,
> +&acb->qcrs[i].qiov, acb->nb_sectors,
> +quorum_aio_cb, &acb->qcrs[i]);
>  }
>  
>  return &acb->common;
> @@ -662,9 +663,10 @@ static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb)
>  qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
>  qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
>   acb->qcrs[acb->child_iter].buf);
> -bdrv_aio_readv(s->children[acb->child_iter]->bs, acb->sector_num,
> -   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> -   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
> +acb->qcrs[acb->child_iter].aiocb =
> +bdrv_aio_readv(s->children[acb->child_iter]->bs, acb->sector_num,
> +   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
> +   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
>  
>  return &acb->common;
>  }
>

Re: [Qemu-devel] [PATCH V8 1/2] net/filter-mirror:Add filter-mirror

2016-03-09 Thread Wen Congyang

On 03/09/2016 05:07 PM, Zhang Chen wrote:
> Filter-mirror is a netfilter plugin.
> It gives qemu the ability to mirror
> packets to a chardev.
> 
> usage:
> 
> -netdev tap,id=hn0
> -chardev socket,id=mirror0,host=ip_primary,port=X,server,nowait
> -filter-mirror,id=m0,netdev=hn0,queue=tx/rx/all,outdev=mirror0
> 
> Signed-off-by: Zhang Chen <zhangchen.f...@cn.fujitsu.com>
> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
> Reviewed-by: Yang Hongyang <hongyang.y...@easystack.cn>
> Reviewed-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> ---
>  net/Makefile.objs   |   1 +
>  net/filter-mirror.c | 181 
> 
>  qemu-options.hx |   5 ++
>  vl.c|   3 +-
>  4 files changed, 189 insertions(+), 1 deletion(-)
>  create mode 100644 net/filter-mirror.c
> 
> diff --git a/net/Makefile.objs b/net/Makefile.objs
> index 5fa2f97..b7c22fd 100644
> --- a/net/Makefile.objs
> +++ b/net/Makefile.objs
> @@ -15,3 +15,4 @@ common-obj-$(CONFIG_VDE) += vde.o
>  common-obj-$(CONFIG_NETMAP) += netmap.o
>  common-obj-y += filter.o
>  common-obj-y += filter-buffer.o
> +common-obj-y += filter-mirror.o
> diff --git a/net/filter-mirror.c b/net/filter-mirror.c
> new file mode 100644
> index 0000000..ee13d94
> --- /dev/null
> +++ b/net/filter-mirror.c
> @@ -0,0 +1,181 @@
> +/*
> + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2016 FUJITSU LIMITED
> + * Copyright (c) 2016 Intel Corporation
> + *
> + * Author: Zhang Chen <zhangchen.f...@cn.fujitsu.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "net/filter.h"
> +#include "net/net.h"
> +#include "qemu-common.h"
> +#include "qapi/qmp/qerror.h"
> +#include "qapi-visit.h"
> +#include "qom/object.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/error-report.h"
> +#include "trace.h"
> +#include "sysemu/char.h"
> +#include "qemu/iov.h"
> +#include "qemu/sockets.h"
> +
> +#define FILTER_MIRROR(obj) \
> +OBJECT_CHECK(MirrorState, (obj), TYPE_FILTER_MIRROR)
> +
> +#define TYPE_FILTER_MIRROR "filter-mirror"
> +
> +typedef struct MirrorState {
> +NetFilterState parent_obj;
> +char *outdev;
> +CharDriverState *chr_out;
> +} MirrorState;
> +
> +static int filter_mirror_send(NetFilterState *nf,
> +   const struct iovec *iov,
> +   int iovcnt)

Please fix the indentation of the continuation lines.

Thanks
Wen Congyang

> +{
> +MirrorState *s = FILTER_MIRROR(nf);
> +int ret = 0;
> +ssize_t size = 0;
> +uint32_t len =  0;
> +char *buf;
> +
> +size = iov_size(iov, iovcnt);
> +if (!size) {
> +return 0;
> +}
> +
> +len = htonl(size);
> +ret = qemu_chr_fe_write_all(s->chr_out, (uint8_t *)&len, sizeof(len));
> +if (ret != sizeof(len)) {
> +goto err;
> +}
> +
> +buf = g_malloc(size);
> +iov_to_buf(iov, iovcnt, 0, buf, size);
> +ret = qemu_chr_fe_write_all(s->chr_out, (uint8_t *)buf, size);
> +g_free(buf);
> +if (ret != size) {
> +goto err;
> +}
> +
> +return 0;
> +
> +err:
> +return ret < 0 ? ret : -EIO;
> +}
> +
> +static ssize_t filter_mirror_receive_iov(NetFilterState *nf,
> + NetClientState *sender,
> + unsigned flags,
> + const struct iovec *iov,
> + int iovcnt,
> + NetPacketSent *sent_cb)
> +{
> +int ret;
> +
> +ret = filter_mirror_send(nf, iov, iovcnt);
> +if (ret) {
> +error_report("filter_mirror_send failed(%s)", strerror(-ret));
> +}
> +
> +/*
> + * we don't hope this error interrupt the normal
> + * path of net packet, so we always return zero.
> + */
> +return 0;
> +}
> +
> +static void filter_mirror_cleanup(NetFilterState *nf)
> +{
> +MirrorState *s = FILTER_MIRROR(nf);
> +
> +if (s->chr_out) {
> +qemu_chr_fe_release(s->chr_out);
> +}
> +}
> +
> +static void filter_mirror_setup(NetFilterState *nf, Error **errp)
> +{
> +MirrorState *s = FILTER_MIRROR(nf);
> +
> +if (!s->outdev) {
> +error_setg(errp, "filter filter mirror needs 'outdev' "

Re: [Qemu-devel] [PULL 00/14] Net patches

2016-03-08 Thread Wen Congyang

On 03/09/2016 12:26 PM, Li Zhijian wrote:
> 
> 
> On 03/09/2016 09:36 AM, Wen Congyang wrote:
>> On 03/08/2016 05:54 PM, Peter Maydell wrote:
>>> On 8 March 2016 at 16:06, Zhang Chen <zhangchen.f...@cn.fujitsu.com> wrote:
>>>> I found the reason for this problem is that
>>>> unix_connect() have not connect to sock_path before iov_send().
>>>> It need time to establish connection. so can we fix it with usleep()
>>>> like this:
>>>>
>>>>  recv_sock = unix_connect(sock_path, NULL);
>>>>  g_assert_cmpint(recv_sock, !=, -1);
>>>> +usleep(1000);
>>>>
>>>>  ret = iov_send(send_sock[0], iov, 2, 0, sizeof(size) +
>>>> sizeof(send_buf));
>>>>  g_assert_cmpint(ret, ==, sizeof(send_buf) + sizeof(size));
>>>>  close(send_sock[0]);
>>>>
>>>>  ret = qemu_recv(recv_sock, &len, sizeof(len), 0);
>>>
>>> I would prefer it if we could find a way to fix this race
>>> reliably rather than just inserting a delay and hoping it
>>> is sufficient. Otherwise the test is likely to be unreliable
>>> if run on a heavily loaded or slow machine.
>>
>> Yes, but there is no way to know when tcp_chr_accept() is called. Add a event
>> to notify it?
>>
>> Thanks
>> Wen Congyang
>>
> 
> Hi, Jason, PMM
> As Congyang said, this is a bug in the testcase, not in filter-mirror.
> Maybe we should rework the testcase, for example
> - use -chardev pipe instead of -chardev socket, because we
>   intend to test the packet mirror function, not -chardev socket

I think it is OK to change it.

Thanks
Wen Congyang

> 
> How about that ?
> 
> 
>>>
>>> thanks
>>> -- PMM
>>>
>>>
>>>
>>
>>
>>
>>
>>
> .
>

Re: [Qemu-devel] [PULL 00/14] Net patches

2016-03-08 Thread Wen Congyang

On 03/08/2016 05:54 PM, Peter Maydell wrote:
> On 8 March 2016 at 16:06, Zhang Chen <zhangchen.f...@cn.fujitsu.com> wrote:
>> I found the reason for this problem is that
>> unix_connect() have not connect to sock_path before iov_send().
>> It need time to establish connection. so can we fix it with usleep()
>> like this:
>>
>> recv_sock = unix_connect(sock_path, NULL);
>> g_assert_cmpint(recv_sock, !=, -1);
>> +usleep(1000);
>>
>> ret = iov_send(send_sock[0], iov, 2, 0, sizeof(size) +
>> sizeof(send_buf));
>> g_assert_cmpint(ret, ==, sizeof(send_buf) + sizeof(size));
>> close(send_sock[0]);
>>
>> ret = qemu_recv(recv_sock, &len, sizeof(len), 0);
> 
> I would prefer it if we could find a way to fix this race
> reliably rather than just inserting a delay and hoping it
> is sufficient. Otherwise the test is likely to be unreliable
> if run on a heavily loaded or slow machine.

Yes, but there is no way to know when tcp_chr_accept() is called. Should we
add an event to notify the test?

Thanks
Wen Congyang

> 
> thanks
> -- PMM
> 
> 
>

Re: [Qemu-devel] [PULL 00/14] Net patches

2016-03-08 Thread Wen Congyang

On 03/08/2016 05:06 PM, Zhang Chen wrote:
> 
> 
> On 03/08/2016 03:56 PM, Jason Wang wrote:
>>
>> On 03/08/2016 03:50 PM, Wen Congyang wrote:
>>> On 03/08/2016 03:33 PM, Jason Wang wrote:
>>>> On 03/08/2016 12:51 PM, Peter Maydell wrote:
>>>>> On 7 March 2016 at 10:12, Jason Wang <jasow...@redhat.com> wrote:
>>>>>> The following changes since commit 
>>>>>> 1464ad45cd6cdeb0b5c1a54d3d3791396e47e52f:
>>>>>>
>>>>>>Merge remote-tracking branch 
>>>>>> 'remotes/armbru/tags/pull-qapi-2016-03-04' into staging (2016-03-06 
>>>>>> 11:53:27 +0000)
>>>>>>
>>>>>> are available in the git repository at:
>>>>>>
>>>>>>https://github.com/jasowang/qemu.git tags/net-pull-request
>>>>>>
>>>>>> for you to fetch changes up to a2f2e45c6edbba9e1961056fa77c696208b40c8e:
>>>>>>
>>>>>>net: check packet payload length (2016-03-07 10:15:48 +0800)
>>>>>>
>>>>>> 
>>>>>>
>>>>>> - a new netfilter implementation: mirror
>>>>>> - netfilter could be disabled and enabled through qom-set now
>>>>>> - fix netfilter crash when specifiying wrong parameters
>>>>>> - rocker switch now can allow user to specifiy world
>>>>>> - fix OOB access for ne2000
>>>>> Hi; I'm afraid this makes "make check" hang for me (Linux, x86-64):
>>>>>
>>>>> TEST: tests/test-netfilter... (pid=26854)
>>>>>/i386/netfilter/addremove_one:   OK
>>>>>/i386/netfilter/remove_netdev_one:   OK
>>>>>/i386/netfilter/addremove_multi: OK
>>>>>/i386/netfilter/remove_netdev_multi: OK
>>>>> PASS: tests/test-netfilter
>>>>> TEST: tests/test-filter-mirror... (pid=26858)
>>>>>/i386/netfilter/mirror:
>>>>>
>>>>> (consistently, every time I run make check, on the same test).
>>>>>
>>>>> thanks
>>>>> -- PMM
>>>> Sorry, it manages to pass on my machine before submitting the pull
>>>> request. But when I re-try this several times, it fails.
>>>>
>>>> This probably means we have bug in mirror implementation. Chen and
>>>> Congyang, please try to fix this bug and resubmit a new version of the
>>>> patch.
>>>>
>>>> Will drop mirror from this pull request and submit a V2.
>>> OK. what is the version of the kernel that you use?
>> 4.2 but probably unrelated. Gdb shows the test wait at recv().
> 
> Hi~ Jason.
> 
> I found the reason for this problem is that
> unix_connect() have not connect to sock_path before iov_send().

After unix_connect() returns, the connection is established.
The qemu char device calls tcp_chr_accept() only after the connection
is established. If we send data before tcp_chr_accept() is called,
the char device drops the data:

static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int len)
{
    TCPCharDriver *s = chr->opaque;
    if (s->connected) {
        ...
        return ret;
    } else {
        /* XXX: indicate an error ? */
        return len;
    }
}

We should wait until tcp_chr_accept() has been called before sending
data.
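
To illustrate the drop described above, here is a minimal, self-contained
sketch. FakeChardev, fake_chr_write() and fake_chr_accept() are hypothetical
names made up for this example and are not QEMU code; the sketch only models
the "return len without sending" branch, under the assumption that the real
write path behaves as in the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the char device state; not QEMU's struct. */
typedef struct FakeChardev {
    bool connected;        /* set once the accept callback has run */
    unsigned char buf[64]; /* bytes that really reached the peer */
    size_t used;           /* number of valid bytes in buf */
} FakeChardev;

/*
 * Models the behaviour quoted above: while not connected, the write
 * claims success (returns len) but the bytes are thrown away.
 */
static int fake_chr_write(FakeChardev *chr, const unsigned char *data, int len)
{
    assert(chr != NULL && data != NULL && len >= 0);

    if (!chr->connected) {
        return len; /* silently dropped, the caller cannot tell */
    }

    /* Clamp to the space left in the example buffer. */
    if ((size_t)len > sizeof(chr->buf) - chr->used) {
        len = (int)(sizeof(chr->buf) - chr->used);
    }
    memcpy(chr->buf + chr->used, data, (size_t)len);
    chr->used += (size_t)len;
    return len;
}

/* Hypothetical accept callback: only from now on do writes count. */
static void fake_chr_accept(FakeChardev *chr)
{
    chr->connected = true;
}
```

A write issued before fake_chr_accept() returns the full length although
nothing is stored, so the sender cannot detect the loss. That is why the
test hangs in recv().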

Thanks
Wen Congyang

> It need time to establish connection. so can we fix it with usleep()
> like this:
> 
> recv_sock = unix_connect(sock_path, NULL);
> g_assert_cmpint(recv_sock, !=, -1);
> +usleep(1000);
> 
> ret = iov_send(send_sock[0], iov, 2, 0, sizeof(size) + sizeof(send_buf));
> g_assert_cmpint(ret, ==, sizeof(send_buf) + sizeof(size));
> close(send_sock[0]);
> 
> ret = qemu_recv(recv_sock, &len, sizeof(len), 0);
> 
> 
> 
>>> Thanks
>>> Wen Congyang
>>>
>>>> Thanks
>>>>
>>>>
>>>> .
>>>>
>>>
>>
>>
>> .
>>
>

Re: [Qemu-devel] [PULL 00/14] Net patches

2016-03-07 Thread Wen Congyang

On 03/08/2016 03:33 PM, Jason Wang wrote:
> 
> 
> On 03/08/2016 12:51 PM, Peter Maydell wrote:
>> On 7 March 2016 at 10:12, Jason Wang <jasow...@redhat.com> wrote:
>>> The following changes since commit 1464ad45cd6cdeb0b5c1a54d3d3791396e47e52f:
>>>
>>>   Merge remote-tracking branch 'remotes/armbru/tags/pull-qapi-2016-03-04' 
>>> into staging (2016-03-06 11:53:27 +0000)
>>>
>>> are available in the git repository at:
>>>
>>>   https://github.com/jasowang/qemu.git tags/net-pull-request
>>>
>>> for you to fetch changes up to a2f2e45c6edbba9e1961056fa77c696208b40c8e:
>>>
>>>   net: check packet payload length (2016-03-07 10:15:48 +0800)
>>>
>>> 
>>>
>>> - a new netfilter implementation: mirror
>>> - netfilter could be disabled and enabled through qom-set now
>>> - fix netfilter crash when specifiying wrong parameters
>>> - rocker switch now can allow user to specifiy world
>>> - fix OOB access for ne2000
>> Hi; I'm afraid this makes "make check" hang for me (Linux, x86-64):
>>
>> TEST: tests/test-netfilter... (pid=26854)
>>   /i386/netfilter/addremove_one:   OK
>>   /i386/netfilter/remove_netdev_one:   OK
>>   /i386/netfilter/addremove_multi: OK
>>   /i386/netfilter/remove_netdev_multi: OK
>> PASS: tests/test-netfilter
>> TEST: tests/test-filter-mirror... (pid=26858)
>>   /i386/netfilter/mirror:
>>
>> (consistently, every time I run make check, on the same test).
>>
>> thanks
>> -- PMM
> 
> Sorry, it manages to pass on my machine before submitting the pull
> request. But when I re-try this several times, it fails.
> 
> This probably means we have bug in mirror implementation. Chen and
> Congyang, please try to fix this bug and resubmit a new version of the
> patch.
> 
> Will drop mirror from this pull request and submit a V2.

OK. Which kernel version are you using?

Thanks
Wen Congyang

> 
> Thanks
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v5 2/3] qmp event: Refactor QUORUM_REPORT_BAD

2016-02-24 Thread Wen Congyang

On 02/25/2016 12:59 AM, Eric Blake wrote:
> On 02/24/2016 03:11 AM, Changlong Xie wrote:
>> Introduce QuorumOpType, and make QUORUM_REPORT_BAD compatible
>> with it.
>>
>> Cc: Dr. David Alan Gilbert <dgilb...@redhat.com>
>> Cc: Wen Congyang <we...@cn.fujitsu.com>
>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>> Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
>> ---
> 
>> +++ b/docs/qmp-events.txt
>> @@ -307,6 +307,7 @@ Emitted to report a corruption of a Quorum file.
>>  
>>  Data:
>>  
>> +- "type":  Quorum operation type (json-string, optional)
> 
> I don't think 'type' needs to be optional, after all.  Just always
> output it.

If we always output the type for read/write operations, will old libvirt
ignore the read/write error events?

Thanks
Wen Congyang

> 
>>  - "error": Error message (json-string, optional)
>> Only present on failure.  This field contains a 
>> human-readable
>> error message.  There are no semantics other than that 
>> the
>> @@ -318,10 +319,17 @@ Data:
>>  
>>  Example:
>>  
>> +Read/Write operation:
>>  { "event": "QUORUM_REPORT_BAD",
>>   "data": { "node-name": "node0", "sector-num": 345435, "sectors-count": 
>> 5 },
>>   "timestamp": { "seconds": 1344522075, "microseconds": 745528 } }
> 
> and this example would then show "type":"read"
>

Re: [Qemu-devel] [PATCH v15 7/9] Introduce new APIs to do replication operation

2016-02-19 Thread Wen Congyang

On 02/19/2016 04:41 PM, Hailiang Zhang wrote:
> Hi,
> 
> On 2016/2/15 9:13, Wen Congyang wrote:
>> On 02/15/2016 08:57 AM, Hailiang Zhang wrote:
>>> On 2016/2/5 12:18, Changlong Xie wrote:
>>>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>>>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>>>> Signed-off-by: Gonglei <arei.gong...@huawei.com>
>>>> Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
>>>> ---
>>>>   Makefile.objs        |  1 +
>>>>   qapi/block-core.json | 13
>>>>   replication.c        | 94
>>>>   replication.h        | 53
>>>>4 files changed, 161 insertions(+)
>>>>create mode 100644 replication.c
>>>>create mode 100644 replication.h
>>>>
>>>> diff --git a/Makefile.objs b/Makefile.objs
>>>> index 06b95c7..a8c74b7 100644
>>>> --- a/Makefile.objs
>>>> +++ b/Makefile.objs
>>>> @@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
>>>>block-obj-$(CONFIG_WIN32) += aio-win32.o
>>>>block-obj-y += block/
>>>>block-obj-y += qemu-io-cmds.o
>>>> +block-obj-y += replication.o
>>>>
>>>>block-obj-m = block/
>>>>
>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>>> index 7e9e8fe..12362b8 100644
>>>> --- a/qapi/block-core.json
>>>> +++ b/qapi/block-core.json
>>>> @@ -2002,6 +2002,19 @@
>>>>'*read-pattern': 'QuorumReadPattern' } }
>>>>
>>>>##
>>>> +# @ReplicationMode
>>>> +#
>>>> +# An enumeration of replication modes.
>>>> +#
>>>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>>>> +#
>>>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>>>> +#
>>>> +# Since: 2.6
>>>> +##
>>>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>>>> +
>>>> +##
>>>># @BlockdevOptions
>>>>#
>>>># Options for creating a block device.
>>>> diff --git a/replication.c b/replication.c
>>>> new file mode 100644
>>>> index 0000000..e8ac2f0
>>>> --- /dev/null
>>>> +++ b/replication.c
>>>> @@ -0,0 +1,94 @@
>>>> +/*
>>>> + * Replication filter
>>>> + *
>>>> + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
>>>> + * Copyright (c) 2016 Intel Corporation
>>>> + * Copyright (c) 2016 FUJITSU LIMITED
>>>> + *
>>>> + * Author:
>>>> + *   Wen Congyang <we...@cn.fujitsu.com>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or 
>>>> later.
>>>> + * See the COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "replication.h"
>>>> +
>>>> +static QLIST_HEAD(, ReplicationState) replication_states;
>>>> +
>>>> +ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
>>>> +{
>>>> +ReplicationState *rs;
>>>> +
>>>> +rs = g_new0(ReplicationState, 1);
>>>> +rs->opaque = opaque;
>>>> +rs->ops = ops;
>>>> +QLIST_INSERT_HEAD(&replication_states, rs, node);
>>>> +
>>>> +return rs;
>>>> +}
>>>> +
>>>> +void replication_remove(ReplicationState *rs)
>>>> +{
>>>> +QLIST_REMOVE(rs, node);
>>>> +g_free(rs);
>>>> +}
>>>> +
>>>> +/*
>>>> + * The caller of the function *MUST* make sure vm stopped
>>>> + */
>>>> +void replication_start_all(ReplicationMode mode, Error **errp)
>>>> +{
>>>
>>> Is this public API is only used for block ?
>>> If yes, I'd like it with a 'block_' prefix.
>>
>> No, we hope it can be used for nic too.
>>
> 
> OK, i got why you designed these APIs, I like this idea that
> use the callback/notifier to notify the status of COLO for block/nic.
> 
> But let's do something more graceful.
> For COLO, we can consider it has four states:
> Prepare/start checkpoint(with VM stopped)/finish checkpo

Re: [Qemu-devel] [PATCH v2 1/1] quorum: Change vote rules for 64 bits hash

2016-02-19 Thread Wen Congyang

On 02/18/2016 11:16 PM, Alberto Garcia wrote:
> On Tue 16 Feb 2016 03:15:44 AM CET, Changlong Xie <xiecl.f...@cn.fujitsu.com> 
> wrote:
>> If quorum has two children(A, B). A do flush sucessfully, but B flush
>> failed.  We MUST choice A as winner rather than just pick anyone of
>> them. Otherwise the filesystem of guest will become read-only with
>> following errors:
>>
>> end_request: I/O error, dev vda, sector 11159960
>> Aborting journal on device vda3-8
>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort 
>> journal
>> EXT4-fs (vda3): Remounting filesystem read-only
> 
> Hi Xie,
> 
> Let's see if I'm getting this right:
> 
> - When Quorum flushes to disk, there's a vote among the return values of
>   the flush operations of its members, and the one that wins is the one
>   that Quorum returns.
> 
> - If there's a tie then Quorum choses the first result from the list of
>   winners.
> 
> - With your patch you want to give priority to the vote with result == 0
>   if there's any, so Quorum would return 0 (and succeed).
> 
> This seems to me like an ad-hoc fix for a particular use case. What if
> you have 3 members and two of them fail with the same error code? Would
> you still return 0 or the error code from the other two?

For example (case 1):
children.0 returns 0
children.1 returns -EIO
children.2 returns -EPIPE

In this case, quorum returns -EPIPE now (without this patch).

For example (case 2):
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0

In this case, quorum returns 0 now.

If two children return the same error and only one returns 0, this patch
doesn't change the behavior.

Back to your question: before this patch, quorum sometimes returns an error
and sometimes returns 0. In such a case, which is better? Always return 0 or
always return an error? In my opinion, we can always return 0 if we already
allow quorum to return 0 in case 2.
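
The rule I have in mind can be sketched as below. This is only a minimal
illustration, not the code in block/quorum.c: pick_flush_result() is a
hypothetical helper, and I assume the vote is simply "the value returned by
the most children wins, and on a tie 0 is preferred".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the code in block/quorum.c: pick the flush
 * result that quorum reports, given one return value per child.
 * The value with the most votes wins; on a tie, 0 (success) is preferred,
 * otherwise the tied value seen first in the array is kept.
 */
static int pick_flush_result(const int *rets, size_t n)
{
    int winner = 0;
    size_t winner_votes = 0;

    assert(rets != NULL && n > 0);

    for (size_t i = 0; i < n; i++) {
        size_t votes = 0;

        /* Count how many children returned the same value as child i. */
        for (size_t j = 0; j < n; j++) {
            if (rets[j] == rets[i]) {
                votes++;
            }
        }

        /* More votes always wins; equal votes only if the value is 0. */
        if (votes > winner_votes ||
            (votes == winner_votes && rets[i] == 0)) {
            winner = rets[i];
            winner_votes = votes;
        }
    }

    return winner;
}
```

With this rule, both three-way ties above give 0 regardless of the order of
the children, while two children returning the same error still outvote a
single 0.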

> 
> Also, is this only supposed to be used in FIFO mode? Your patch doesn't
> seem to make any distinction.

IIRC, FIFO mode only applies to read operations. This patch is not related to
FIFO mode.

Thanks
Wen Congyang

> 
> Thanks!
> 
> Berto
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v15 7/9] Introduce new APIs to do replication operation

2016-02-14 Thread Wen Congyang

On 02/15/2016 08:57 AM, Hailiang Zhang wrote:
> On 2016/2/5 12:18, Changlong Xie wrote:
>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>> Signed-off-by: Gonglei <arei.gong...@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
>> ---
>>   Makefile.objs        |  1 +
>>   qapi/block-core.json | 13
>>   replication.c        | 94
>>   replication.h        | 53
>>   4 files changed, 161 insertions(+)
>>   create mode 100644 replication.c
>>   create mode 100644 replication.h
>>
>> diff --git a/Makefile.objs b/Makefile.objs
>> index 06b95c7..a8c74b7 100644
>> --- a/Makefile.objs
>> +++ b/Makefile.objs
>> @@ -15,6 +15,7 @@ block-obj-$(CONFIG_POSIX) += aio-posix.o
>>   block-obj-$(CONFIG_WIN32) += aio-win32.o
>>   block-obj-y += block/
>>   block-obj-y += qemu-io-cmds.o
>> +block-obj-y += replication.o
>>
>>   block-obj-m = block/
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 7e9e8fe..12362b8 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2002,6 +2002,19 @@
>>   '*read-pattern': 'QuorumReadPattern' } }
>>
>>   ##
>> +# @ReplicationMode
>> +#
>> +# An enumeration of replication modes.
>> +#
>> +# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
>> +#
>> +# @secondary: Secondary mode, receive the vm's state from primary QEMU.
>> +#
>> +# Since: 2.6
>> +##
>> +{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
>> +
>> +##
>>   # @BlockdevOptions
>>   #
>>   # Options for creating a block device.
>> diff --git a/replication.c b/replication.c
>> new file mode 100644
>> index 0000000..e8ac2f0
>> --- /dev/null
>> +++ b/replication.c
>> @@ -0,0 +1,94 @@
>> +/*
>> + * Replication filter
>> + *
>> + * Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2016 Intel Corporation
>> + * Copyright (c) 2016 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <we...@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "replication.h"
>> +
>> +static QLIST_HEAD(, ReplicationState) replication_states;
>> +
>> +ReplicationState *replication_new(void *opaque, ReplicationOps *ops)
>> +{
>> +ReplicationState *rs;
>> +
>> +rs = g_new0(ReplicationState, 1);
>> +rs->opaque = opaque;
>> +rs->ops = ops;
>> +QLIST_INSERT_HEAD(&replication_states, rs, node);
>> +
>> +return rs;
>> +}
>> +
>> +void replication_remove(ReplicationState *rs)
>> +{
>> +QLIST_REMOVE(rs, node);
>> +g_free(rs);
>> +}
>> +
>> +/*
>> + * The caller of the function *MUST* make sure vm stopped
>> + */
>> +void replication_start_all(ReplicationMode mode, Error **errp)
>> +{
> 
> Is this public API is only used for block ?
> If yes, I'd like it with a 'block_' prefix.

No, we hope it can be used for nic too.

Thanks
Wen Congyang

> 
>> +ReplicationState *rs, *next;
>> +
>> +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
>> +if (rs->ops && rs->ops->start) {
>> +rs->ops->start(rs, mode, errp);
>> +}
>> +if (*errp != NULL) {
> 
> This is incorrect, you miss checking if errp is NULL,
> if errp is NULL, there will be an error that accessing memory at address 0x0.
> Same with other places in this patch.
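
A minimal sketch of the usual way to avoid this, assuming the common
"local error, then propagate" pattern. Everything prefixed with mock_ is a
hypothetical stand-in written for this example, not QEMU's Error API; only
the shape of the loop is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical minimal stand-in for an error object, for illustration. */
typedef struct Error {
    const char *msg;
} Error;

static Error *mock_error_new(const char *msg)
{
    Error *err = malloc(sizeof(*err));
    assert(err != NULL);
    err->msg = msg;
    return err;
}

/* Hand err to the caller if it asked for errors, otherwise discard it. */
static void mock_error_propagate(Error **dst, Error *err)
{
    if (err == NULL) {
        return;
    }
    if (dst != NULL && *dst == NULL) {
        *dst = err;
    } else {
        free(err);
    }
}

/* Hypothetical per-state start hook; reports an error when fail != 0. */
static void mock_start(int fail, Error **errp)
{
    if (fail) {
        mock_error_propagate(errp, mock_error_new("start failed"));
    }
}

/*
 * Run the hooks in order and stop at the first failure. A local Error
 * pointer is checked instead of *errp, so passing errp == NULL is safe.
 * Returns the number of hooks that were called.
 */
static int mock_start_all(const int *fail, size_t n, Error **errp)
{
    int called = 0;

    for (size_t i = 0; i < n; i++) {
        Error *local_err = NULL;

        mock_start(fail[i], &local_err);
        called++;
        if (local_err != NULL) {
            mock_error_propagate(errp, local_err);
            break;
        }
    }
    return called;
}
```

The loop never dereferences errp itself, so callers that do not care about
the error can pass NULL.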
> 
>> +return;
>> +}
>> +}
>> +}
>> +
>> +void replication_do_checkpoint_all(Error **errp)
>> +{
>> +ReplicationState *rs, *next;
>> +
>> +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
>> +if (rs->ops && rs->ops->checkpoint) {
>> +rs->ops->checkpoint(rs, errp);
>> +}
>> +if (*errp != NULL) {
>> +return;
> 
>> +}
>> +}
>> +}
>> +
>> +void replication_get_error_all(Error **errp)
>> +{
>> +ReplicationState *rs, *next;
>> +
>> +QLIST_FOREACH_SAFE(rs, &replication_states, node, next) {
>> +if (rs-&

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-02-04 Thread Wen Congyang

On 02/04/2016 05:07 PM, Dr. David Alan Gilbert wrote:
> * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:
>> On 02/01/2016 09:18 AM, Wen Congyang wrote:
>>> On 01/29/2016 06:47 PM, Dr. David Alan Gilbert wrote:
>>>> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>>>>> On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote:
>>>>>> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>>>>>>> On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote:
>>>>>>>> Hi,
>>>>>>>>   I've got a block error if I kill the secondary.
>>>>>>>>
>>>>>>>> Start both primary & secondary
>>>>>>>> kill -9 secondary qemu
>>>>>>>> x_colo_lost_heartbeat on primary
>>>>>>>>
>>>>>>>> The guest sees a block error and the ext4 root switches to read-only.
>>>>>>>>
>>>>>>>> I gdb'd the primary with a breakpoint on quorum_report_bad; see
>>>>>>>> backtrace below.
>>>>>>>> (This is based on colo-v2.4-periodic-mode of the framework
>>>>>>>> code with the block and network proxy merged in; so it could be my
>>>>>>>> merging but I don't think so ?)
>>>>>>>>
>>>>>>>>
>>>>>>>> (gdb) where
>>>>>>>> #0  quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, 
>>>>>>>> acb=0x7f2946cb3910, acb=0x7f2946cb3910)
>>>>>>>> at /root/colo/jan-2016/qemu/block/quorum.c:222
>>>>>>>> #1  0x7f2943b23058 in quorum_aio_cb (opaque=<optimized out>,
>>>>>>>> ret=<optimized out>)
>>>>>>>> at /root/colo/jan-2016/qemu/block/quorum.c:315
>>>>>>>> #2  0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at
>>>>>>>> /root/colo/jan-2016/qemu/block/io.c:2122
>>>>>>>> #3  0x7f2943ae777d in aio_bh_call (bh=<optimized out>) at
>>>>>>>> /root/colo/jan-2016/qemu/async.c:64
>>>>>>>> #4  aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at
>>>>>>>> /root/colo/jan-2016/qemu/async.c:92
>>>>>>>> #5  0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at
>>>>>>>> /root/colo/jan-2016/qemu/aio-posix.c:305
>>>>>>>> #6  0x7f2943ae756e in aio_ctx_dispatch (source=<optimized out>,
>>>>>>>> callback=<optimized out>,
>>>>>>>> user_data=<optimized out>) at /root/colo/jan-2016/qemu/async.c:231
>>>>>>>> #7  0x7f293b84a79a in g_main_context_dispatch () from
>>>>>>>> /lib64/libglib-2.0.so.0
>>>>>>>> #8  0x7f2943af3a00 in glib_pollfds_poll () at
>>>>>>>> /root/colo/jan-2016/qemu/main-loop.c:211
>>>>>>>> #9  os_host_main_loop_wait (timeout=<optimized out>) at
>>>>>>>> /root/colo/jan-2016/qemu/main-loop.c:256
>>>>>>>> #10 main_loop_wait (nonblocking=<optimized out>) at
>>>>>>>> /root/colo/jan-2016/qemu/main-loop.c:504
>>>>>>>> #11 0x7f29438529ee in main_loop () at
>>>>>>>> /root/colo/jan-2016/qemu/vl.c:1945
>>>>>>>> #12 main (argc=<optimized out>, argv=<optimized out>,
>>>>>>>> envp=<optimized out>) at /root/colo/jan-2016/qemu/vl.c:4707
>>>>>>>>
>>>>>>>> (gdb) p s->num_children
>>>>>>>> $1 = 2
>>>>>>>> (gdb) p acb->success_count
>>>>>>>> $2 = 0
>>>>>>>> (gdb) p acb->is_read
>>>>>>>> $5 = false
>>>>>>>
>>>>>>> Sorry for the late reply.
>>>>>>
>>>>>> No problem.
>>>>>>
>>>>>>> What is the value of acb->count?
>>>>>>
>>>>>> (gdb) p acb->count
>>>>>> $1 = 1
>>>>>
>>>>> Note, the count is 1, not 2. Writing to children.0 is in flight. If 
>>>>> writing to children.0 successes,
>>>>> the guest doesn't know this error.
>>>>>>> If secondary host is down, you should remove quorum's children.1. 
>>>>>>> Otherwise, you will get
>>>>>>> I/O error event.
>>>>>>
>>>>>> Is that safe?  If the secondary fails, do you always have time to issue 
>>&

Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-02-03 Thread Wen Congyang

On 02/03/2016 05:32 PM, Stefan Hajnoczi wrote:
> On Wed, Feb 03, 2016 at 09:29:15AM +0800, Wen Congyang wrote:
>> On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
>>> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>>>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>>>> Secondary becomes unresponsive?
>>>>>>
>>>>>> Actually, we knew this problem. But currently, there seems no better way 
>>>>>> to
>>>>>> resolve it. If you have any ideas?
>>>>>
>>>>> Is it possible to hold the checkpoint information and acknowledge the
>>>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>>>> Secondary guest activity to complete?
>>>>
>>>> There is no way to know that secondary becomes unresponsive.
>>>
>>> I meant whether it is necessary for the Secondary to vm_stop() and apply
>>> the checkpoint before acknowledging the checkpoint to the Primary?
>>
>> I don't understand this.
>> Here is the COLO checkpoint flow:
>>
>> Primary                                       Secondary
>> new checkpoint notice                  --->
>> vm_stop()                                     vm_stop()
>> vm state(device state, memory, cpu)    --->
>>                                               load state
>>                                        <---   done
>> vm_start()                                    vm_start()
> 
> If the Secondary's vm_stop() call blocks then the Primary is stuck too.
> 
> I was wondering whether the Secondary can do:
> 
> <---  done
>   vm_stop()
>   load state
> 
> It simply receives the checkpoint data into a buffer and immediately
> replies with "done".  vm_stop() and load state is only performed after
> sending "done".

The secondary vm is running, so we also need to restore the pages that are
dirtied by the secondary vm but not by the primary vm.
There are two ways to do it:
1. Cache all of the original memory in the secondary qemu.
2. Send the dirty pfn list to the primary qemu, and get those pages from it.

If we ack the checkpoint and then call vm_stop(), we can only choose 1, which
means that the secondary qemu uses more memory.
In COLO mode, we compare the socket output, and do a checkpoint if
the application-level data differs. If we ack the checkpoint and then
call vm_stop(), the client cannot get any more data until the secondary vm
is running again. So we still 'wait' for the secondary vm.


> 
> The advantage is that the Primary will not be delayed by the Secondary.
> It's an approach that doesn't block.
> 
> But perhaps it's a problem if the Secondary is slower than the Primary
> since the Secondary still needs to complete vm_stop() and load state
> before it can resume execution?
> 
>>>>> I think this really means falling back to microcheckpointing until the
>>>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>>>> would prevent vcpus from running and when the last pending I/O finishes
>>>>> the Secondary could apply the last checkpoint.  This approach does not
>>>>> block QEMU (the monitor, etc).
>>>>>
>>>>
>>>> If secondary host becomes unresponsive, it means that we cannot do
>>>> microcheckpointing.
>>>> We should do failover in this case.
>>>
>>> This is dangerous because it means that a delay/failure in the Secondary
>>> would cause the Primary to fail over to the broken Secondary.  All the
>>> more reason not to perform blocking operations on the Secondary in the
>>> checkpoint code path.
>>
>> If the secondary is broken, primary qemu will take over.
> 
> Does the Primary use a timeout between "new checkpoint notice" and
> Secondary's "done" so it can move on if the Secondary is unresponsive?

To hailiang:
IIRC, we don't use a timeout, but I think we can add one. In our design, there
is an external heartbeat that checks the primary and secondary status and
decides when to do a checkpoint.

Thanks
Wen Congyang


> 
> Stefan
>

Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-02-02 Thread Wen Congyang

On 02/02/2016 10:34 PM, Stefan Hajnoczi wrote:
> On Mon, Feb 01, 2016 at 09:13:36AM +0800, Wen Congyang wrote:
>> On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
>>> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>>>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>>>> in-flight I/O requests have completed.  What does the Primary do if the
>>>>> Secondary becomes unresponsive?
>>>>
>>>> Actually, we knew this problem. But currently, there seems no better way to
>>>> resolve it. If you have any ideas?
>>>
>>> Is it possible to hold the checkpoint information and acknowledge the
>>> checkpoint right away, without waiting for bdrv_drain_all() or any
>>> Secondary guest activity to complete?
>>
>> There is no way to know that secondary becomes unresponsive.
> 
> I meant whether it is necessary for the Secondary to vm_stop() and apply
> the checkpoint before acknowledging the checkpoint to the Primary?

I don't understand this.
Here is the COLO checkpoint flow:

Primary                                       Secondary
new checkpoint notice                  --->
vm_stop()                                     vm_stop()
vm state(device state, memory, cpu)    --->
                                              load state
                                       <---   done
vm_start()                                    vm_start()
> 
>>> I think this really means falling back to microcheckpointing until the
>>> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
>>> would prevent vcpus from running and when the last pending I/O finishes
>>> the Secondary could apply the last checkpoint.  This approach does not
>>> block QEMU (the monitor, etc).
>>>
>>
>> If secondary host becomes unresponsive, it means that we cannot do
>> microcheckpointing.
>> We should do failover in this case.
> 
> This is dangerous because it means that a delay/failure in the Secondary
> would cause the Primary to fail over to the broken Secondary.  All the
> more reason not to perform blocking operations on the Secondary in the
> checkpoint code path.

If the secondary is broken, primary qemu will take over.

Thanks
Wen Congyang

> 
> Stefan
>

Re: [Qemu-devel] [PATCH v14 5/8] docs: block replication's description

2016-02-02 Thread Wen Congyang

On 02/03/2016 11:35 AM, Eric Blake wrote:
> On 02/02/2016 08:18 PM, Changlong Xie wrote:
>> On 02/02/2016 11:20 PM, Eric Blake wrote:
>>> On 01/13/2016 02:18 AM, Changlong Xie wrote:
>>>> From: Wen Congyang <we...@cn.fujitsu.com>
>>>>
>>>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>>>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>>>> Signed-off-by: Gonglei <arei.gong...@huawei.com>
>>>> Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
>>>> ---
>>>>   docs/block-replication.txt | 229
> 
>>>> +== Usage ==
>>>> +Primary:
>>>> +  -drive
>>>> if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
>>>> + children.0.file.filename=1.raw,\
>>>> + children.0.driver=raw
>>>> +
>>>> +  Run qmp command in primary qemu:
>>>> +{ 'execute': 'human-monitor-command',
>>>> +  'arguments': {
>>>> +  'command-line': 'drive_add buddy
>>>> driver=replication,mode=primary,file.driver=nbd,file.host=,file.port=,file.export=colo1,node-name=nbd_client1,if=none'
>>>>
>>>
>>> Eww. We shouldn't ever have to pack a command line as a single QMP
>>> string that needs reparsing.  Instead, you should pass the information
>>> as a nested QMP dictionary, something like:
>>>
>>> 'arguments': {
>>>'remote-command': { 'command': 'drive_add', 'name': 'buddy',
>>>'driver': 'replication', 'mode': 'primary',
>>>'file': { 'driver': 'nbd', 'host': '',
>>
>> Hi Eric
>>
>> What is 'remote-command' here? I just tried below commands, but got some
>> errors.
> 
> Oh, I totally missed that this was using the existing
> 'human-monitor-command' to send an HMP command, instead of trying to
> send a formal QMP command.  I thought you were documenting a new QMP
> command.
> 
> Still, it would be nice to do this command using strict QMP (that would
> be via 'blockdev-add') rather than via HMP (an all-in-one text command
> crammed into the single 'command-line' argument).
> 
>>
>> 'blockdev-add' doesn't support 'nbd'. So we use 'drive_add' here, and
>> it's a hmp command. If i'm right, there seems just one way to execute
>> hmp commands in QMP:
> 
> For 2.6, we _really_ ought to get blockdev-add working for everything.
> We're running short on time, though :(
> 

If the qmp command 'blockdev-add' supports nbd in qemu-2.6, we will update
this document once it is supported.

Thanks
Wen Congyang

Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-01-31 Thread Wen Congyang

On 01/29/2016 11:46 PM, Stefan Hajnoczi wrote:
> On Fri, Jan 29, 2016 at 11:13:42AM +0800, Changlong Xie wrote:
>> On 01/28/2016 11:15 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jan 28, 2016 at 09:13:24AM +0800, Wen Congyang wrote:
>>>> On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>>> I'm concerned that the bdrv_drain_all() in vm_stop() can take a long
>>> time if the disk is slow/failing.  bdrv_drain_all() blocks until all
>>> in-flight I/O requests have completed.  What does the Primary do if the
>>> Secondary becomes unresponsive?
>>
>> Actually, we know about this problem, but currently there seems to be no
>> better way to resolve it. Do you have any ideas?
> 
> Is it possible to hold the checkpoint information and acknowledge the
> checkpoint right away, without waiting for bdrv_drain_all() or any
> Secondary guest activity to complete?

There is no way to know that the secondary has become unresponsive.

> 
> I think this really means falling back to microcheckpointing until the
> Secondary guest can checkpoint.  Instead of a blocking vm_stop() we
> would prevent vcpus from running and when the last pending I/O finishes
> the Secondary could apply the last checkpoint.  This approach does not
> block QEMU (the monitor, etc).
> 

If the secondary host becomes unresponsive, it means that we cannot do
microcheckpointing.
We should do failover in this case.

Thanks
Wen Congyang

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-01-31 Thread Wen Congyang

On 01/29/2016 06:47 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote:
>>> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>>>> On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote:
>>>>> Hi,
>>>>>   I've got a block error if I kill the secondary.
>>>>>
>>>>> Start both primary & secondary
>>>>> kill -9 secondary qemu
>>>>> x_colo_lost_heartbeat on primary
>>>>>
>>>>> The guest sees a block error and the ext4 root switches to read-only.
>>>>>
>>>>> I gdb'd the primary with a breakpoint on quorum_report_bad; see
>>>>> backtrace below.
>>>>> (This is based on colo-v2.4-periodic-mode of the framework
>>>>> code with the block and network proxy merged in; so it could be my
>>>>> merging but I don't think so ?)
>>>>>
>>>>>
>>>>> (gdb) where
>>>>> #0  quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, 
>>>>> acb=0x7f2946cb3910, acb=0x7f2946cb3910)
>>>>> at /root/colo/jan-2016/qemu/block/quorum.c:222
>>>>> #1  0x7f2943b23058 in quorum_aio_cb (opaque=, 
>>>>> ret=)
>>>>> at /root/colo/jan-2016/qemu/block/quorum.c:315
>>>>> #2  0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at 
>>>>> /root/colo/jan-2016/qemu/block/io.c:2122
>>>>> #3  0x7f2943ae777d in aio_bh_call (bh=) at 
>>>>> /root/colo/jan-2016/qemu/async.c:64
>>>>> #4  aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at 
>>>>> /root/colo/jan-2016/qemu/async.c:92
>>>>> #5  0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at 
>>>>> /root/colo/jan-2016/qemu/aio-posix.c:305
>>>>> #6  0x7f2943ae756e in aio_ctx_dispatch (source=, 
>>>>> callback=, 
>>>>> user_data=) at /root/colo/jan-2016/qemu/async.c:231
>>>>> #7  0x7f293b84a79a in g_main_context_dispatch () from 
>>>>> /lib64/libglib-2.0.so.0
>>>>> #8  0x7f2943af3a00 in glib_pollfds_poll () at 
>>>>> /root/colo/jan-2016/qemu/main-loop.c:211
>>>>> #9  os_host_main_loop_wait (timeout=) at 
>>>>> /root/colo/jan-2016/qemu/main-loop.c:256
>>>>> #10 main_loop_wait (nonblocking=) at 
>>>>> /root/colo/jan-2016/qemu/main-loop.c:504
>>>>> #11 0x7f29438529ee in main_loop () at 
>>>>> /root/colo/jan-2016/qemu/vl.c:1945
>>>>> #12 main (argc=, argv=, envp=) at /root/colo/jan-2016/qemu/vl.c:4707
>>>>>
>>>>> (gdb) p s->num_children
>>>>> $1 = 2
>>>>> (gdb) p acb->success_count
>>>>> $2 = 0
>>>>> (gdb) p acb->is_read
>>>>> $5 = false
>>>>
>>>> Sorry for the late reply.
>>>
>>> No problem.
>>>
>>>> What is the value of acb->count?
>>>
>>> (gdb) p acb->count
>>> $1 = 1
>>
>> Note, the count is 1, not 2. The write to children.0 is still in flight. If
>> the write to children.0 succeeds, the guest does not see this error.
>>>> If the secondary host is down, you should remove the quorum's children.1.
>>>> Otherwise, you will get an I/O error event.
>>>
>>> Is that safe?  If the secondary fails, do you always have time to issue the 
>>> command to
>>> remove the children.1  before the guest sees the error?
>>
>> We write to both children and expect the write to children.0 to succeed.
>> If it does, the guest does not see this error. You just get the I/O error
>> event.
> 
> I think children.0 is the disk, and that should be OK - so only the 
> children.1/replication should
> be failing - so in that case why do I see the error?

I don't know, and I will check the code.

> The 'node0' in the backtrace above is the name of the replication, so it does 
> look like the error
> is coming from the replication.

No, the backtrace just shows an I/O error event being reported to the
management application.

> 
>>> Anyway, I tried removing children.1 but it segfaults now, I guess the 
>>> replication is unhappy:
>>>
>>> (qemu) x_block_change colo-disk0 -d children.1
>>> (qemu) x_colo_lost_heartbeat 
>>
>> Hmm, you should not remove the child before failover. I will check it how to

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-01-29 Thread Wen Congyang

On 01/29/2016 06:07 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote:
>>> Hi,
>>>   I've got a block error if I kill the secondary.
>>>
>>> Start both primary & secondary
>>> kill -9 secondary qemu
>>> x_colo_lost_heartbeat on primary
>>>
>>> The guest sees a block error and the ext4 root switches to read-only.
>>>
>>> I gdb'd the primary with a breakpoint on quorum_report_bad; see
>>> backtrace below.
>>> (This is based on colo-v2.4-periodic-mode of the framework
>>> code with the block and network proxy merged in; so it could be my
>>> merging but I don't think so ?)
>>>
>>>
>>> (gdb) where
>>> #0  quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, 
>>> acb=0x7f2946cb3910, acb=0x7f2946cb3910)
>>> at /root/colo/jan-2016/qemu/block/quorum.c:222
>>> #1  0x7f2943b23058 in quorum_aio_cb (opaque=, 
>>> ret=)
>>> at /root/colo/jan-2016/qemu/block/quorum.c:315
>>> #2  0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at 
>>> /root/colo/jan-2016/qemu/block/io.c:2122
>>> #3  0x7f2943ae777d in aio_bh_call (bh=) at 
>>> /root/colo/jan-2016/qemu/async.c:64
>>> #4  aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at 
>>> /root/colo/jan-2016/qemu/async.c:92
>>> #5  0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at 
>>> /root/colo/jan-2016/qemu/aio-posix.c:305
>>> #6  0x7f2943ae756e in aio_ctx_dispatch (source=, 
>>> callback=, 
>>> user_data=) at /root/colo/jan-2016/qemu/async.c:231
>>> #7  0x7f293b84a79a in g_main_context_dispatch () from 
>>> /lib64/libglib-2.0.so.0
>>> #8  0x7f2943af3a00 in glib_pollfds_poll () at 
>>> /root/colo/jan-2016/qemu/main-loop.c:211
>>> #9  os_host_main_loop_wait (timeout=) at 
>>> /root/colo/jan-2016/qemu/main-loop.c:256
>>> #10 main_loop_wait (nonblocking=) at 
>>> /root/colo/jan-2016/qemu/main-loop.c:504
>>> #11 0x7f29438529ee in main_loop () at /root/colo/jan-2016/qemu/vl.c:1945
>>> #12 main (argc=, argv=, envp=) 
>>> at /root/colo/jan-2016/qemu/vl.c:4707
>>>
>>> (gdb) p s->num_children
>>> $1 = 2
>>> (gdb) p acb->success_count
>>> $2 = 0
>>> (gdb) p acb->is_read
>>> $5 = false
>>
>> Sorry for the late reply.
> 
> No problem.
> 
>> What is the value of acb->count?
> 
> (gdb) p acb->count
> $1 = 1

Note, the count is 1, not 2. The write to children.0 is still in flight. If
the write to children.0 succeeds, the guest does not see this error.

> 
>> If the secondary host is down, you should remove the quorum's children.1.
>> Otherwise, you will get an I/O error event.
> 
> Is that safe?  If the secondary fails, do you always have time to issue the 
> command to
> remove the children.1  before the guest sees the error?

We write to both children and expect the write to children.0 to succeed.
If it does, the guest does not see this error. You just get the I/O error
event.
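
To make the expected behaviour concrete, here is a minimal sketch of how a
write with vote-threshold=1 is decided. It is a simplified model, not the code
in block/quorum.c: WriteReq, write_child_done() and REQ_PENDING are
hypothetical names, and only the count/success_count idea is taken from the
gdb output above.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-request state. */
typedef struct {
    int count;          /* children that have completed so far */
    int success_count;  /* children that completed without error */
    int num_children;
    int threshold;      /* vote-threshold */
} WriteReq;

enum { REQ_PENDING = 1 };

/*
 * Record the completion of one child write.
 * Returns REQ_PENDING while other children are still in flight,
 * 0 if the guest sees success, or -5 (EIO) if the guest sees an error.
 * Every failed child bumps *error_events, which models the I/O error
 * event sent to the management application.
 */
int write_child_done(WriteReq *req, int ret, int *error_events)
{
    req->count++;
    if (ret < 0) {
        (*error_events)++;
    } else {
        req->success_count++;
    }
    if (req->count < req->num_children) {
        return REQ_PENDING;
    }
    return req->success_count >= req->threshold ? 0 : -5;
}
```

With two children and a threshold of 1, a failure on children.1 followed by a
success on children.0 yields one error event and a successful guest write in
this model.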

> 
> Anyway, I tried removing children.1 but it segfaults now, I guess the 
> replication is unhappy:
> 
> (qemu) x_block_change colo-disk0 -d children.1
> (qemu) x_colo_lost_heartbeat 

Hmm, you should not remove the child before failover. I will check how to
avoid this in the code.

> 
> 12973 Segmentation fault  (core dumped) 
> ./try/x86_64-softmmu/qemu-system-x86_64 -enable-kvm $console_param -S -boot c 
> -m 4080 -smp 4 -machine pc-i440fx-2.5,accel=kvm -name debug-threads=on -trace 
> events=trace-file -device virtio-rng-pci $block_param $net_param
> 
> #0  0x7f0a398a864c in bdrv_stop_replication (bs=0x7f0a3b0a8430, 
> failover=true, errp=0x7fff6a5c3420)
> at /root/colo/jan-2016/qemu/block.c:4426
> 
> (gdb) p drv
> $1 = (BlockDriver *) 0x5d2a
> 
>   it looks like the whole of bs is bogus.
> 
> #1  0x7f0a398d87f6 in quorum_stop_replication (bs=, 
> failover=, 
> errp=) at /root/colo/jan-2016/qemu/block/quorum.c:1213
> 
> (gdb) p s->replication_index
> $3 = 1
> 
> I guess quorum_del_child needs to stop replication before it removes the 
> child?

Yes, but in the newest version, quorum doesn't know about block replication,
and I think we should add a reference to the bs when starting block
replication.
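
A minimal sketch of that idea, assuming a plain reference counter. Node,
Replication and the helper functions below are hypothetical stand-ins for
BlockDriverState and the bdrv_ref()/bdrv_unref() pair, not the real
implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a BlockDriverState with a reference count. */
typedef struct {
    int refcnt;
} Node;

Node *node_new(void)
{
    Node *n = calloc(1, sizeof(*n));
    if (n) {
        n->refcnt = 1;   /* the creator holds the first reference */
    }
    return n;
}

void node_ref(Node *n)
{
    n->refcnt++;
}

/* Drop one reference. Returns 1 if the node was freed, 0 otherwise. */
int node_unref(Node *n)
{
    if (--n->refcnt == 0) {
        free(n);
        return 1;
    }
    return 0;
}

/* Hypothetical replication state that pins its bs while running. */
typedef struct {
    Node *bs;
} Replication;

void replication_start(Replication *r, Node *bs)
{
    node_ref(bs);        /* keep bs alive even if quorum drops the child */
    r->bs = bs;
}

/* Stop replication and release the pin. Returns 1 if bs was freed. */
int replication_stop(Replication *r)
{
    Node *bs = r->bs;
    r->bs = NULL;
    return node_unref(bs);
}
```

With the extra reference, removing the child from quorum before failover
leaves the bs valid until replication is stopped.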

Thanks
Wen Congyang

> (although it would have to be careful not to block on the dead nbd).
> 
> #2  0x7f0a398a8901 in bdrv_stop_replication_all 
> (failover=failover@entry=true, errp=

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-01-28 Thread Wen Congyang

On 01/27/2016 07:03 PM, Dr. David Alan Gilbert wrote:
> Hi,
>   I've got a block error if I kill the secondary.
> 
> Start both primary & secondary
> kill -9 secondary qemu
> x_colo_lost_heartbeat on primary
> 
> The guest sees a block error and the ext4 root switches to read-only.
> 
> I gdb'd the primary with a breakpoint on quorum_report_bad; see
> backtrace below.
> (This is based on colo-v2.4-periodic-mode of the framework
> code with the block and network proxy merged in; so it could be my
> merging but I don't think so ?)
> 
> 
> (gdb) where
> #0  quorum_report_bad (node_name=0x7f2946a0892c "node0", ret=-5, 
> acb=0x7f2946cb3910, acb=0x7f2946cb3910)
> at /root/colo/jan-2016/qemu/block/quorum.c:222
> #1  0x7f2943b23058 in quorum_aio_cb (opaque=, 
> ret=)
> at /root/colo/jan-2016/qemu/block/quorum.c:315
> #2  0x7f2943b311be in bdrv_co_complete (acb=0x7f2946cb3f60) at 
> /root/colo/jan-2016/qemu/block/io.c:2122
> #3  0x7f2943ae777d in aio_bh_call (bh=) at 
> /root/colo/jan-2016/qemu/async.c:64
> #4  aio_bh_poll (ctx=ctx@entry=0x7f2945b771d0) at 
> /root/colo/jan-2016/qemu/async.c:92
> #5  0x7f2943af5090 in aio_dispatch (ctx=0x7f2945b771d0) at 
> /root/colo/jan-2016/qemu/aio-posix.c:305
> #6  0x7f2943ae756e in aio_ctx_dispatch (source=, 
> callback=, 
> user_data=) at /root/colo/jan-2016/qemu/async.c:231
> #7  0x7f293b84a79a in g_main_context_dispatch () from 
> /lib64/libglib-2.0.so.0
> #8  0x7f2943af3a00 in glib_pollfds_poll () at 
> /root/colo/jan-2016/qemu/main-loop.c:211
> #9  os_host_main_loop_wait (timeout=) at 
> /root/colo/jan-2016/qemu/main-loop.c:256
> #10 main_loop_wait (nonblocking=) at 
> /root/colo/jan-2016/qemu/main-loop.c:504
> #11 0x7f29438529ee in main_loop () at /root/colo/jan-2016/qemu/vl.c:1945
> #12 main (argc=, argv=, envp=) 
> at /root/colo/jan-2016/qemu/vl.c:4707
> 
> (gdb) p s->num_children
> $1 = 2
> (gdb) p acb->success_count
> $2 = 0
> (gdb) p acb->is_read
> $5 = false

Sorry for the late reply.
What is the value of acb->count?

If the secondary host is down, you should remove the quorum's children.1.
Otherwise, you will get an I/O error event.

Thanks
Wen Congyang

> 
> (qemu) info block
> colo-disk0 (#block080): json:{"children": [{"driver": "raw", "file": 
> {"driver": "file", "filename": "/root/colo/bugzilla.raw"}}, {"driver": 
> "replication", "mode": "primary", "file": {"port": "8889", "host": "ibpair", 
> "driver": "nbd", "export": "colo-disk0"}}], "driver": "quorum", "blkverify": 
> false, "rewrite-corrupted": false, "vote-threshold": 1} (quorum)
> Cache mode:   writeback, direct
> 
> Dave
> 
> * Changlong Xie (xiecl.f...@cn.fujitsu.com) wrote:
>> Block replication is a very important feature which is used for
>> continuous checkpoints(for example: COLO).
>>
>> You can get the detailed information about block replication from here:
>> http://wiki.qemu.org/Features/BlockReplication
>>
>> Usage:
>> Please refer to docs/block-replication.txt
>>
>> This patch series is based on the following patch series:
>> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html
>>
>> You can get the patch here:
>> https://github.com/Pating/qemu/tree/changlox/block-replication-v13
>>
>> You can get the patch with framework here:
>> https://github.com/Pating/qemu/tree/changlox/colo_framework_v12
>>
>> TODO:
>> 1. Continuous block replication. It will be started after basic functions
>>are accepted.
>>
>> Change Log:
>> V13:
>> 1. Rebase to the newest code
>> 2. Remove redundant macros and semicolon in replication.c
>> 3. Fix typos in block-replication.txt
>> V12:
>> 1. Rebase to the newest code
>> 2. Use backing reference to replace 'allow-write-backing-file'
>> V11:
>> 1. Reopen the backing file when starting block replication if it is not
>>opened in R/W mode
>> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>>when opening backing file
>> 3. Block the top BDS so there is only one block job for the top BDS and
>>its backing chain.
>> V10:
>> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>>reference.
>> 2. Address the comments from Eric Blake
>> V9:
>> 1. Update the error messages
>> 2. Rebase to the newest qemu
>>

Re: [Qemu-devel] [PATCH v14 7/8] Implement new driver for block replication

2016-01-27 Thread Wen Congyang

On 01/27/2016 10:46 PM, Stefan Hajnoczi wrote:
> On Wed, Jan 13, 2016 at 05:18:31PM +0800, Changlong Xie wrote:
>> From: Wen Congyang <we...@cn.fujitsu.com>
>>
>> Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>> Signed-off-by: Gonglei <arei.gong...@huawei.com>
>> Signed-off-by: Changlong Xie <xiecl.f...@cn.fujitsu.com>
>> ---
>>  block/Makefile.objs  |   1 +
>>  block/replication-comm.c |  66 +
>>  block/replication.c  | 590 
>> +++
>>  include/block/replication-comm.h |  50 
>>  qapi/block-core.json |  13 +
>>  5 files changed, 720 insertions(+)
>>  create mode 100644 block/replication-comm.c
>>  create mode 100644 block/replication.c
>>  create mode 100644 include/block/replication-comm.h
>>
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index fa05f37..7037662 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
>>  block-obj-y += accounting.o
>>  block-obj-y += write-threshold.o
>>  block-obj-y += backup.o
>> +block-obj-y += replication-comm.o replication.o
>>  
>>  common-obj-y += stream.o
>>  common-obj-y += commit.o
>> diff --git a/block/replication-comm.c b/block/replication-comm.c
>> new file mode 100644
>> index 000..8af748b
>> --- /dev/null
>> +++ b/block/replication-comm.c
>> @@ -0,0 +1,66 @@
>> +/*
>> + * Replication Block filter
> 
> Is the start/stop/checkpoint callback interface only useful for block
> replication?
> 
> This seems like a generic interface for registering with COLO.  Other
> components (networking, etc) might also need start/stop/checkpoint
> callbacks.  If that's the case then this code should be outside block/
> and the brs->bs field should either be void *opaque or removed (the
> caller needs to use container_of()).

Yes, we will do it in the next version.
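
Something like the following sketch, assuming the generic state is embedded in
each user's own struct and recovered with container_of(). All names here
(ReplicationState, ReplicationOps, BlockUser and the functions) are
hypothetical and only illustrate the shape of the interface:

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct ReplicationState ReplicationState;

/* Callbacks a component registers; only checkpoint is shown here. */
typedef struct {
    void (*checkpoint)(ReplicationState *rs);
} ReplicationOps;

/* Generic state: no BlockDriverState pointer, no opaque pointer. */
struct ReplicationState {
    const ReplicationOps *ops;
    ReplicationState *next;
};

static ReplicationState *replication_states;

void replication_register(ReplicationState *rs, const ReplicationOps *ops)
{
    rs->ops = ops;
    rs->next = replication_states;
    replication_states = rs;
}

void replication_do_checkpoint_all(void)
{
    for (ReplicationState *rs = replication_states; rs; rs = rs->next) {
        if (rs->ops && rs->ops->checkpoint) {
            rs->ops->checkpoint(rs);
        }
    }
}

/* A block-layer user embeds the generic state in its own struct. */
typedef struct {
    int checkpoints_seen;
    ReplicationState rs;
} BlockUser;

void block_user_checkpoint(ReplicationState *rs)
{
    /* Recover the enclosing struct from the embedded member. */
    BlockUser *u = container_of(rs, BlockUser, rs);
    u->checkpoints_seen++;
}

const ReplicationOps block_user_ops = { block_user_checkpoint };
```

A networking component could register its own struct the same way, so the
registry itself would not need to live under block/.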

> 
>> + *
>> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
>> + * Copyright (c) 2015 Intel Corporation
>> + * Copyright (c) 2015 FUJITSU LIMITED
>> + *
>> + * Author:
>> + *   Wen Congyang <we...@cn.fujitsu.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "block/replication-comm.h"
>> +
>> +static QLIST_HEAD(, BlockReplicationState) block_replication_states;
>> +
>> +BlockReplicationState *block_replication_new(BlockDriverState *bs,
>> + BlockReplicationOps *ops)
>> +{
>> +BlockReplicationState *brs;
>> +
>> +brs = g_new0(BlockReplicationState, 1);
>> +brs->bs = bs;
>> +brs->ops = ops;
>> +QLIST_INSERT_HEAD(&block_replication_states, brs, node);
>> +
>> +return brs;
>> +}
>> +
>> +void block_replication_remove(BlockReplicationState *brs)
>> +{
>> +QLIST_REMOVE(brs, node);
>> +g_free(brs);
>> +}
>> +
>> +void block_replication_start_all(ReplicationMode mode, Error **errp)
>> +{
>> +BlockReplicationState *brs, *next;
>> +QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +if (brs->ops && brs->ops->start) {
>> +brs->ops->start(brs, mode, errp);
>> +}
>> +}
>> +}
>> +
>> +void block_replication_do_checkpoint_all(Error **errp)
>> +{
>> +BlockReplicationState *brs, *next;
>> +QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +if (brs->ops && brs->ops->checkpoint) {
>> +brs->ops->checkpoint(brs, errp);
>> +}
>> +}
>> +}
>> +
>> +void block_replication_stop_all(bool failover, Error **errp)
>> +{
>> +BlockReplicationState *brs, *next;
>> +QLIST_FOREACH_SAFE(brs, &block_replication_states, node, next) {
>> +if (brs->ops && brs->ops->stop) {
>> +brs->ops->stop(brs, failover, errp);
>> +}
>> +}
>> +}
>> diff --git a/block/replication.c b/block/replication.c
>> new file mode 100644
>> index 000..29c677a
>> --- /dev/null
>> +++ b/block/replication.c
>> @@ -0,0 +1,590 @@
>> +/*
>> + * Replication Block filter
>> + *
>> + * Copyright

Re: [Qemu-devel] COLO: how to flip a secondary to a primary?

2016-01-25 Thread Wen Congyang

On 01/26/2016 02:59 AM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> On 01/23/2016 03:35 AM, Dr. David Alan Gilbert wrote:
>>> Hi,
>>>   I've been looking at what's needed to add a new secondary after
>>> a primary failed; from the block side it doesn't look as hard
>>> as I'd expected, perhaps you can tell me if I'm missing something!
>>>
>>> The normal primary setup is:
>>>
>>>quorum
>>>   Real disk
>>>   nbd client
>>
>> quorum
>>real disk
>>replication
>>   nbd client
>>
>>>
>>> The normal secondary setup is:
>>>replication
>>>   active-disk
>>>   hidden-disk
>>>   Real-disk
>>
>> IIRC, we can do it like this:
>> quorum
>>replication
>>   active-disk
>>   hidden-disk
>>   real-disk
> 
> Yes.
> 
>>> With a couple of minor code hacks; I changed the secondary to be:
>>>
>>>quorum
>>>   replication
>>> active-disk
>>> hidden-disk
>>> Real-disk
>>>   dummy-disk
>>
>> after failover,
>> quorum
>>replication(old, mode is secondary)
>>  active-disk
>>  hidden-disk*
>>  real-disk*
>>replication(new, mode is primary)
>>  nbd-client
> 
> Do you need to keep the old secondary-replication?
> Does that just pass straight through?

Yes, the old secondary replication can keep working in the new mode.
For example, if we don't start COLO again after failover, we do nothing.

> 
>> In the newest version, we active commit active-disk to real-disk.
>> So it will be:
>> quorum
>>replication(old, mode is secondary)
>>  active-disk(it is real disk now)
>>replication(new, mode is primary)
>>  nbd-client
> 
> How does that active-commit work?  I didn't think you
> could change the real disk until you had the full checkpoint,
> since you don't know whether the primary or secondaries
> changes need to be written?

I start the active commit when doing failover. After failover, the primary's
changes made after the last checkpoint should be dropped (how do we cancel
the in-progress write ops?).

> 
>>> and then after the primary fails, I start a new secondary
>>> on another host and then on the old secondary do:
>>>
>>>   nbd_server_stop
>>>   stop
>>>   x_block_change top-quorum -d children.0 # deletes use of real 
>>> disk, leaves dummy
>>>   drive_del active-disk0
>>>   x_block_change top-quorum -a node-real-disk
>>>   x_block_change top-quorum -d children.1 # Seems to have deleted 
>>> the dummy?!, the disk is now child 0
>>>   drive_add buddy 
>>> driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
>>>   x_block_change top-quorum -a nbd-client
>>>   c
>>>   migrate_set_capability x-colo on
>>>   migrate -d -b tcp:ibpair:
>>>
>>> and I think that means what was the secondary, has the same disk
>>> structure as a normal primary.
>>> That's not quite happy yet, and I've not figured out why - but the
>>> order/structure of the block devices looks right?
>>>
>>> Notes:
>>>a) The dummy serves two purposes, 1) it works around the segfault
>>>   I reported in the other mail, 2) when I delete the real disk in the
>>>   first x_block_change it means the quorum still has 1 disk so doesn't
>>>   get upset.
>>
>> I don't understand the purpose 2.
> 
> quorum won't allow you to delete all its members ('The number of children 
> cannot be lower than the vote threshold 1')
> and it's very tricky getting the order correct with add/delete; for example
> I tried:
> 
> drive_add buddy 
> driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
> # gets children.1
> x_block_change top-quorum -a nbd-client
> # deletes the secondary replication
> x_block_change top-quorum -d children.0
> drive_del active-disk0

The active-disk0 contains some data, and you should not delete it.
If we do active-commit after failover, the active-disk0 is the real disk.

> # ends up as children.0 but in the 2nd slot
> x_block_change top-quorum -a node-real-disk
> 
> info block shows me:
> top-quorum (#block615): json:{"children": [
> {"driver":

Re: [Qemu-devel] [PATCH v9 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-01-24 Thread Wen Congyang

On 01/23/2016 04:02 AM, Dr. David Alan Gilbert wrote:
> * Alberto Garcia (be...@igalia.com) wrote:
>> On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <ebl...@redhat.com> wrote:
>>>>>> In general, what do you do to make sure that the data in a new Quorum
>>>>>> child is consistent with that of the rest of the array?
>>>>>
>>>>> Quorum can have more than one child when it starts. But we don't do
>>>>> the similar check. So I don't think we should do such check here.
>>>>
>>>> Yes, but when you start a VM you can verify in advance that all
>>>> members of the Quorum have the same data. If you do that on a running
>>>> VM how can you know if the new disk is consistent with the others?
>>>
>>> User error if it is not.  Just the same as it is user error if you
>>> request a shallow drive-mirror but the destination is not the same
>>> contents as the backing file.  I don't think qemu has to protect us
>>> from user error in this case.
>>
>> But the backing file is read-only so the user can guarantee that the
>> destination has the same data before the shallow mirror. How do you do
>> that in this case?
> 
> I think in the colo case they're relying on doing a block migrate
> to synchronise the remote disk prior to switching into colo mode.

Yes, we can do a block migration to sync the disk. After the migration has
finished, we stop block migration before starting COLO.

Thanks
Wen Congyang

> 
> Dave
> 
>> Berto
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-01-24 Thread Wen Congyang

On 01/22/2016 11:14 PM, Dr. David Alan Gilbert wrote:
> Hi,
>   I can trigger a segfault if I wire in the block replication together with
> a quorum instance; it only triggers with both of them present but,
> it looks like the problem is a disagreement about the number of quorum
> members;  I'm triggering this on the 'colo-v2.4-periodic-mode' branch
> that is posted in the colo-framework set that I think includes this set
> (from https://github.com/coloft/qemu.git).
> 
> To trigger:
> ./git/colo/jan-16/try/x86_64-softmmu/qemu-system-x86_64 -nographic -S
> 
> (qemu) drive_add 0 
> if=none,id=colo-disk0,file.filename=/home/localvms/bugzilla.raw,driver=raw,node-name=node0
> (qemu) drive_add 1 
> if=none,id=active-disk0,throttling.bps-total=7000,driver=replication,mode=secondary,file.driver=qcow2,file.file.filename=/run/colo-active-disk.qcow2,file.backing.driver=qcow2,file.backing.file.filename=/run/colo-hidden-disk.qcow2,file.backing.backing=colo-disk0
> (qemu) drive_add 2 
> if=none,id=top-quorum,driver=quorum,read-pattern=fifo,vote-threshold=1,children.0=active-disk0
> (qemu) device_add virtio-blk-pci,drive=top-quorum,addr=9
> 
> *** Error in `/root/colo/jan-2016/./try/x86_64-softmmu/qemu-system-x86_64': 
> free(): invalid pointer: 0x55a8fdf0 ***
> === Backtrace: =
> /lib64/libc.so.6(+0x7cfe1)[0x7110ffe1]
> /lib64/libglib-2.0.so.0(g_free+0xf)[0x71ecc36f]
> /root/colo/jan-2016/./try/x86_64-softmmu/qemu-system-x86_64
> Program received signal SIGABRT, Aborted.
> 0x710c85f7 in raise () from /lib64/libc.so.6
> (gdb) where
> #0  0x710c85f7 in raise () from /lib64/libc.so.6
> #1  0x710c9ce8 in abort () from /lib64/libc.so.6
> #2  0x71108317 in __libc_message () from /lib64/libc.so.6
> #3  0x7110ffe1 in _int_free () from /lib64/libc.so.6
> #4  0x71ecc36f in g_free () from /lib64/libglib-2.0.so.0
> #5  0x559dfdd7 in qemu_iovec_destroy (qiov=0x57815410) at 
> /root/colo/jan-2016/qemu/util/iov.c:378
> #6  0x55989cce in quorum_aio_finalize (acb=0x57815350) at 
> /root/colo/jan-2016/qemu/block/quorum.c:171
> 171   qemu_iovec_destroy(&acb->qcrs[i].qiov);
> (gdb) list
> 166   
> 167   if (acb->is_read) {
> 168   /* on the quorum case acb->child_iter == s->num_children - 1 */
> 169   for (i = 0; i <= acb->child_iter; i++) {
> 170   qemu_vfree(acb->qcrs[i].buf);
> 171   qemu_iovec_destroy(&acb->qcrs[i].qiov);
> 172   }
> 173   }
> 174   
> 175   g_free(acb->qcrs);
> (gdb) p acb->child_iter
> $1 = 1
> (gdb) p i
> $3 = 1

Thanks for your test. Can you give me the following information:
1. acb->ret's value
2. s->num_children

I think it is quorum's bug, and acb->ret is < 0.

Thanks
Wen Congyang

> 
> #7  0x5598afca in quorum_aio_cb (opaque=, ret=-5)
> at /root/colo/jan-2016/qemu/block/quorum.c:302
> #8  0x559990ee in bdrv_co_complete (acb=0x57815410) at 
> /root/colo/jan-2016/qemu/block/io.c:2122
> .
> 
> So I guess acb->child_iter is wrong, since we only have one child on that 
> quorum?
> and we're trying to do a destroy on the second child.
> 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v13 00/10] Block replication for continuous checkpoints

2016-01-24 Thread Wen Congyang

On 01/22/2016 11:14 PM, Dr. David Alan Gilbert wrote:
> Hi,
>   I can trigger a segfault if I wire in the block replication together with
> a quorum instance; it only triggers with both of them present but,
> it looks like the problem is a disagreement about the number of quorum
> members;  I'm triggering this on the 'colo-v2.4-periodic-mode' branch
> that is posted in the colo-framework set that I think includes this set
> (from https://github.com/coloft/qemu.git).
> 
> To trigger:
> ./git/colo/jan-16/try/x86_64-softmmu/qemu-system-x86_64 -nographic -S
> 
> (qemu) drive_add 0 
> if=none,id=colo-disk0,file.filename=/home/localvms/bugzilla.raw,driver=raw,node-name=node0
> (qemu) drive_add 1 
> if=none,id=active-disk0,throttling.bps-total=7000,driver=replication,mode=secondary,file.driver=qcow2,file.file.filename=/run/colo-active-disk.qcow2,file.backing.driver=qcow2,file.backing.file.filename=/run/colo-hidden-disk.qcow2,file.backing.backing=colo-disk0
> (qemu) drive_add 2 
> if=none,id=top-quorum,driver=quorum,read-pattern=fifo,vote-threshold=1,children.0=active-disk0
> (qemu) device_add virtio-blk-pci,drive=top-quorum,addr=9
> 
> *** Error in `/root/colo/jan-2016/./try/x86_64-softmmu/qemu-system-x86_64': 
> free(): invalid pointer: 0x55a8fdf0 ***
> === Backtrace: =
> /lib64/libc.so.6(+0x7cfe1)[0x7110ffe1]
> /lib64/libglib-2.0.so.0(g_free+0xf)[0x71ecc36f]
> /root/colo/jan-2016/./try/x86_64-softmmu/qemu-system-x86_64
> Program received signal SIGABRT, Aborted.
> 0x710c85f7 in raise () from /lib64/libc.so.6
> (gdb) where
> #0  0x710c85f7 in raise () from /lib64/libc.so.6
> #1  0x710c9ce8 in abort () from /lib64/libc.so.6
> #2  0x71108317 in __libc_message () from /lib64/libc.so.6
> #3  0x7110ffe1 in _int_free () from /lib64/libc.so.6
> #4  0x71ecc36f in g_free () from /lib64/libglib-2.0.so.0
> #5  0x559dfdd7 in qemu_iovec_destroy (qiov=0x57815410) at 
> /root/colo/jan-2016/qemu/util/iov.c:378
> #6  0x55989cce in quorum_aio_finalize (acb=0x57815350) at 
> /root/colo/jan-2016/qemu/block/quorum.c:171
> 171   qemu_iovec_destroy(&acb->qcrs[i].qiov);
> (gdb) list
> 166   
> 167   if (acb->is_read) {
> 168   /* on the quorum case acb->child_iter == s->num_children - 1 */
> 169   for (i = 0; i <= acb->child_iter; i++) {
> 170   qemu_vfree(acb->qcrs[i].buf);
> 171   qemu_iovec_destroy(&acb->qcrs[i].qiov);
> 172   }
> 173   }
> 174   
> 175   g_free(acb->qcrs);
> (gdb) p acb->child_iter
> $1 = 1
> (gdb) p i
> $3 = 1
> 
> #7  0x5598afca in quorum_aio_cb (opaque=, ret=-5)
> at /root/colo/jan-2016/qemu/block/quorum.c:302
> #8  0x559990ee in bdrv_co_complete (acb=0x57815410) at /root/colo/jan-2016/qemu/block/io.c:2122
> .
> 
> So I guess acb->child_iter is wrong, since we only have one child on that 
> quorum?
> and we're trying to do a destroy on the second child.

Can you try the following patch:
From 3f2c5ec288cd9a36afb392b4bba24029f3e9345a Mon Sep 17 00:00:00 2001
From: Wen Congyang <we...@cn.fujitsu.com>
Date: Mon, 25 Jan 2016 09:18:09 +0800
Subject: [PATCH] quorum: fix segfault when read fails in fifo mode

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block/quorum.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index a5ae4b8..0965277 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -295,6 +295,9 @@ static void quorum_aio_cb(void *opaque, int ret)
             quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
         }
         acb->vote_ret = ret;
+        if (ret < 0) {
+            acb->child_iter--;
+        }
         quorum_aio_finalize(acb);
         return;
     }
-- 
2.5.0
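
For anyone following along, here is a minimal standalone model of what the patch is meant to fix. It is only a sketch: `struct fifo_read`, `fifo_read_done()` and `slots_touched()` are made-up names, not QEMU code, and the model assumes the retry test increments `child_iter` before comparing it with the number of children, which is what the `child_iter = 1` value in the gdb output above suggests for a one-child quorum.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the FIFO-read state; not QEMU's QuorumAIOCB. */
struct fifo_read {
    int num_children;  /* number of quorum children */
    int child_iter;    /* index of the child currently being read */
};

/*
 * Model of the completion path. On failure, move to the next child if
 * there is one and report that another read was started (true).
 * Otherwise the request is finished (false). With apply_fix set,
 * child_iter is stepped back after the last child fails, as the patch
 * above does.
 */
static bool fifo_read_done(struct fifo_read *r, int ret, bool apply_fix)
{
    if (ret < 0 && ++r->child_iter < r->num_children) {
        return true;            /* retry on the next child */
    }
    if (ret < 0 && apply_fix) {
        r->child_iter--;        /* undo the increment that ran off the end */
    }
    return false;
}

/* Number of per-child slots a cleanup loop "i <= child_iter" would visit. */
static int slots_touched(const struct fifo_read *r)
{
    return r->child_iter + 1;
}
```

With one child and a failed read, the unpatched model leaves `child_iter` at 1, so the cleanup loop would visit two slots although only one exists. With the decrement it visits one.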



> 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] COLO: how to flip a secondary to a primary?

2016-01-24 Thread Wen Congyang

On 01/23/2016 03:35 AM, Dr. David Alan Gilbert wrote:
> Hi,
>   I've been looking at what's needed to add a new secondary after
> a primary failed; from the block side it doesn't look as hard
> as I'd expected, perhaps you can tell me if I'm missing something!
> 
> The normal primary setup is:
> 
>quorum
>   Real disk
>   nbd client

quorum
   real disk
   replication
  nbd client

> 
> The normal secondary setup is:
>replication
>   active-disk
>   hidden-disk
>   Real-disk

IIRC, we can do it like this:
quorum
   replication
  active-disk
  hidden-disk
  real-disk

> 
> With a couple of minor code hacks; I changed the secondary to be:
> 
>quorum
>   replication
> active-disk
> hidden-disk
> Real-disk
>   dummy-disk

after failover,
quorum
   replication(old, mode is secondary)
 active-disk
 hidden-disk*
 real-disk*
   replication(new, mode is primary)
 nbd-client

In the newest version, we do an active commit of active-disk into real-disk.
So it will be:
quorum
   replication(old, mode is secondary)
 active-disk(it is real disk now)
   replication(new, mode is primary)
 nbd-client

> 
> and then after the primary fails, I start a new secondary
> on another host and then on the old secondary do:
> 
>   nbd_server_stop
>   stop
>   x_block_change top-quorum -d children.0 # deletes use of real disk, 
> leaves dummy
>   drive_del active-disk0
>   x_block_change top-quorum -a node-real-disk
>   x_block_change top-quorum -d children.1 # Seems to have deleted the 
> dummy?!, the disk is now child 0
>   drive_add buddy 
> driver=replication,mode=primary,file.driver=nbd,file.host=ibpair,file.port=8889,file.export=colo-disk0,node-name=nbd-client,if=none,cache=none
>   x_block_change top-quorum -a nbd-client
>   c
>   migrate_set_capability x-colo on
>   migrate -d -b tcp:ibpair:
> 
> and I think that means what was the secondary, has the same disk
> structure as a normal primary.
> That's not quite happy yet, and I've not figured out why - but the
> order/structure of the block devices looks right?
> 
> Notes:
>a) The dummy serves two purposes, 1) it works around the segfault
>   I reported in the other mail, 2) when I delete the real disk in the
>   first x_block_change it means the quorum still has 1 disk so doesn't
>   get upset.

I don't understand purpose 2.

>b) I had to remove the restriction in quorum_start_replication
>   on which mode it would run in. 

IIRC, this check will be removed.

>c) I'm not really sure everything knows it's in secondary mode yet, and
>   I'm not convinced whether the replication is doing the right thing.
>d) The migrate -d -b   eventually fails on the destination, not worked out 
> why
>   yet.

Can you give me the error message?

>e) Adding/deleting children on quorum is hard having to use the 
> children.0/1
>   notation when you've added children using node names - it's worrying
>   which number is which; is there a way to give them a name?

No. I think we can improve the 'info block' output.

>f) I've not thought about the colo-proxy that much yet - I guess that
>   existing connections need to keep their sequence number offset but
>   new connections made by what is now the primary dont need to do anything
>   special.

Hailiang or Zhijian can answer this question.
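
Until they reply, a rough sketch of what "keeping the sequence number offset" could look like for an existing connection. Everything here is an assumption for illustration: `struct conn_offset`, `rewrite_seq()` and `rewrite_ack()` are hypothetical names, the sign convention of the offset is a guess, and the TCP checksum update that a real rewriter needs is left out. A new connection would simply use an offset of 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-connection record; the field name is illustrative. */
struct conn_offset {
    uint32_t seq_offset;   /* difference between the two ISNs, modulo 2^32 */
};

/* Segment sent by the guest: shift seq into the numbering the peer knows. */
static uint32_t rewrite_seq(const struct conn_offset *c, uint32_t seq)
{
    return seq - c->seq_offset;    /* unsigned arithmetic wraps modulo 2^32 */
}

/* Segment sent to the guest: shift ack back into the guest's numbering. */
static uint32_t rewrite_ack(const struct conn_offset *c, uint32_t ack)
{
    return ack + c->seq_offset;
}
```

The two functions are inverses of each other, so an ack for a rewritten seq maps back to the value the guest expects, including across the 32-bit wraparound.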

Thanks
Wen Congyang

> 
> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v14 0/8] Block replication for continuous checkpoints

2016-01-24 Thread Wen Congyang


Stefan: ping

Do you have time to review this patch series?

Thanks
Wen Congyang

At 2016/1/13 17:18, Changlong Xie wrote:

Block replication is a very important feature which is used for
continuous checkpoints (for example: COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-12/msg04570.html

You can get the patch here:
https://github.com/Pating/qemu/tree/changlox/block-replication-v14

You can get the patch with framework here:
https://github.com/Pating/qemu/tree/changlox/colo_framework_v13

TODO:
1. Continuous block replication. It will be started after basic functions
are accepted.

Change Log:
V14:
1. Implement auto complete active commit
2. Implement active commit block job for replication.c
3. Address the comments from Stefan, add replication-specific API and data
structure, also remove old block layer APIs
V13:
1. Rebase to the newest code
2. Remove redundant macros and semicolons in replication.c
3. Fix typos in block-replication.txt
V12:
1. Rebase to the newest code
2. Use backing reference to replace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting block replication if it is not
opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing reference option according to Stefan Hajnoczi's
suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed the failover up. The secondary vm can take over very quickly even
if there are too many I/O requests.
V4:
1. Introduce a new driver, replication, to avoid touching nbd and qcow2.
V3:
1. Use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target use the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu(use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (8):
   unblock backup operations in backing file
   Store parent BDS in BdrvChild
   Backup: clear all bitmap when doing block checkpoint
   Allow creating backup jobs when opening BDS
   docs: block replication's description
   auto complete active commit
   Implement new driver for block replication
   support replication driver in blockdev-add

  block.c                          |  19 ++
  block/Makefile.objs              |   3 +-
  block/backup.c                   |  14 +
  block/mirror.c                   |  13 +-
  block/replication-comm.c         |  66 +
  block/replication.c              | 590 +++
  blockdev.c                       |   2 +-
  blockjob.c                       |  11 +
  docs/block-replication.txt       | 229 +++
  include/block/block_int.h        |   4 +-
  include/block/blockjob.h         |  12 +
  include/block/replication-comm.h |  50
  qapi/block-core.json             |  33 ++-
  qemu-img.c                       |   2 +-
  14 files changed, 1038 insertions(+), 10 deletions(-)
  create mode 100644 block/replication-comm.c
  create mode 100644 block/replication.c
  create mode 100644 docs/block-replication.txt
  create mode 100644 include/block/replication-comm.h

Re: [Qemu-devel] [PATCH RFC 3/7] net/filter: Skip the disabled filter when delivering packets

2016-01-22 Thread Wen Congyang

On 01/22/2016 04:36 PM, zhanghailiang wrote:
> If the filter is disabled, don't go through it.
> 
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> ---
>  include/net/filter.h | 5 +
>  net/net.c| 4 
>  2 files changed, 9 insertions(+)
> 
> diff --git a/include/net/filter.h b/include/net/filter.h
> index 9ed5ec6..d797ee4 100644
> --- a/include/net/filter.h
> +++ b/include/net/filter.h
> @@ -74,6 +74,11 @@ ssize_t qemu_netfilter_pass_to_next(NetClientState *sender,
>  int iovcnt,
>  void *opaque);
>  
> +static inline bool qemu_need_skip_netfilter(NetFilterState *nf)
> +{
> +return nf->enabled ? false : true;
> +}
> +
>  void netfilter_print_info(NetFilterState *nf, char *output_str, int size);
>  
>  #endif /* QEMU_NET_FILTER_H */
> diff --git a/net/net.c b/net/net.c
> index 87de7c0..ec43105 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -581,6 +581,10 @@ static ssize_t filter_receive_iov(NetClientState *nc,
>  NetFilterState *nf = NULL;
>  
>  QTAILQ_FOREACH(nf, &nc->filters, next) {
> +/* Don't go through filter if it is off */
> +if (qemu_need_skip_netfilter(nf)) {
> +continue;
> +}
>  ret = qemu_netfilter_receive(nf, direction, sender, flags, iov,
>       iovcnt, sent_cb);
>  if (ret) {
> 

qemu_netfilter_pass_to_next() should also be updated.
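
To make the point concrete: when a filter hands a packet on to the rest of the chain, the disabled filters after it need to be skipped as well. A minimal sketch with a plain array, assuming nothing about the real function body (`struct filter` and `next_enabled_filter()` are hypothetical names, not QEMU's NetFilterState API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified filter node; not QEMU's NetFilterState. */
struct filter {
    const char *name;
    bool enabled;
};

/*
 * Return the index of the first enabled filter after position cur,
 * or -1 if none is left (the packet then leaves the filter chain).
 * Pass cur = -1 to start from the head of the chain.
 */
static int next_enabled_filter(const struct filter *chain, size_t n, int cur)
{
    for (size_t i = (size_t)(cur + 1); i < n; i++) {
        if (chain[i].enabled) {
            return (int)i;
        }
    }
    return -1;
}
```

The same skip test is then used on both paths: the initial walk over the filters and the hand-off from one filter to the next.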

Thanks
Wen Congyang

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-21 Thread Wen Congyang

On 01/22/2016 11:15 AM, Jason Wang wrote:
> 
> 
> On 01/20/2016 06:30 PM, Wen Congyang wrote:
>> On 01/20/2016 06:19 PM, Jason Wang wrote:
>>>>
>>>>
>>>> On 01/20/2016 06:01 PM, Wen Congyang wrote:
>>>>>> On 01/20/2016 02:54 PM, Jason Wang wrote:
>>>>>>>>
>>>>>>>> On 01/20/2016 11:29 AM, Zhang Chen wrote:
>>>>>>>>>>>> Sure.
>>>>>>>>>>>>
>>>>>>>>>>>> Two main comments/suggestions:
>>>>>>>>>>>>
>>>>>>>>>>>> - TCP analysis is missed in current version, maybe you point a git 
>>>>>>>>>>>> tree
>>>>>>>>>>>> (or another version of RFC) to me for a better understanding of the
>>>>>>>>>>>> design. (Just a skeleton for TCP should be sufficient to discuss).
>>>>>>>>>>>> - I prefer to make the code as reusable as possible. So it's 
>>>>>>>>>>>> better to
>>>>>>>>>>>> split/decouple the reusable parts from the codes. So a vague idea 
>>>>>>>>>>>> is:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Decouple the packet comparing from the netfilter. You've 
>>>>>>>>>>>> achieved
>>>>>>>>>>>> this 99% since the work has been done in a thread. Just let the 
>>>>>>>>>>>> thread
>>>>>>>>>>>> poll sockets directly, then the comparing have the possibility to 
>>>>>>>>>>>> be
>>>>>>>>>>>> reused by other kinds of dataplane.
>>>>>>>>>>>> 2) Implement traffic mirror/redirector as filter.
>>>>>>>>>>>> 3) Implement TCP seq rewriting as a filter.
>>>>>>>>>>>>
>>>>>>>>>>>> Then, in primary node, you need just a traffic mirror, which did:
>>>>>>>>>>>> - mirror ingress traffic to secondary node
>>>>>>>>>>>> - mirror outgress traffic to packet comparing thread
>>>>>>>>>>>>
>>>>>>>>>>>> And in secondadry node, you need two filters:
>>>>>>>>>>>> - A TCP seq rewriter which adjust tcp sequence number.
>>>>>>>>>>>> - A traffic redirector which redirect packet from a socket as 
>>>>>>>>>>>> ingress
>>>>>>>>>>>> traffic, and redirect outgress traffic to the socket which could be
>>>>>>>>>>>> polled by remote packet comparing thread.
>>>>>>>>>>>>   Thoughts?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> zhangchen
>>>>>>>>>>
>>>>>>>>>> Hi, Jason.
>>>>>>>>>> We consider your suggestion to split/decouple
>>>>>>>>>> the reusable parts from the codes.
>>>>>>>>>> Due to filter plugin are traversed one by one in order
>>>>>>>>>> we will split colo-proxy to three filters in each side.
>>>>>>>>>>
>>>>>>>>>> But in this plan,primary and secondary both have socket
>>>>>>>>>> server,startup is a problem.
>>>>>>>> I believe this issue could be solved by reusing socket chardev.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [ASCII diagram of the primary and secondary qemu filter chains, truncated in the archive]

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-21 Thread Wen Congyang

On 01/22/2016 02:21 PM, Jason Wang wrote:
> 
> 
> On 01/22/2016 01:56 PM, Wen Congyang wrote:
>> On 01/22/2016 01:41 PM, Jason Wang wrote:
>>>>
>>>>
>>>> On 01/22/2016 11:28 AM, Wen Congyang wrote:
>>>>>> On 01/22/2016 11:15 AM, Jason Wang wrote:
>>>>>>>>
>>>>>>>> On 01/20/2016 06:30 PM, Wen Congyang wrote:
>>>>>>>>>> On 01/20/2016 06:19 PM, Jason Wang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 01/20/2016 06:01 PM, Wen Congyang wrote:
>>>>>>>>>>>>>>>>>> On 01/20/2016 02:54 PM, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>>>>> On 01/20/2016 11:29 AM, Zhang Chen wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sure.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Two main comments/suggestions:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - TCP analysis is missed in current version, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> maybe you point a git tree
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (or another version of RFC) to me for a better 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> understanding of the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> design. (Just a skeleton for TCP should be 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sufficient to discuss).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - I prefer to make the code as reusable as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> possible. So it's better to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> split/decouple the reusable parts from the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> codes. So a vague idea is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Decouple the packet comparing from the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> netfilter. You've achieved
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this 99% since the work has been done in a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread. Just let the thread
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> poll sockets directly, then the comparing have 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the possibility to be
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> reused by other kinds of dataplane.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2) Implement traffic mirror/redirector as filter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3) Implement TCP seq rewriting as a filter.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Then, in primary node, you need just a traffic 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mirror,

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-21 Thread Wen Congyang

On 01/22/2016 01:41 PM, Jason Wang wrote:
> 
> 
> On 01/22/2016 11:28 AM, Wen Congyang wrote:
>> On 01/22/2016 11:15 AM, Jason Wang wrote:
>>>
>>> On 01/20/2016 06:30 PM, Wen Congyang wrote:
>>>> On 01/20/2016 06:19 PM, Jason Wang wrote:
>>>>>>
>>>>>> On 01/20/2016 06:01 PM, Wen Congyang wrote:
>>>>>>>> On 01/20/2016 02:54 PM, Jason Wang wrote:
>>>>>>>>>> On 01/20/2016 11:29 AM, Zhang Chen wrote:
>>>>>>>>>>>>>> Sure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Two main comments/suggestions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - TCP analysis is missed in current version, maybe you point a 
>>>>>>>>>>>>>> git tree
>>>>>>>>>>>>>> (or another version of RFC) to me for a better understanding of 
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> design. (Just a skeleton for TCP should be sufficient to 
>>>>>>>>>>>>>> discuss).
>>>>>>>>>>>>>> - I prefer to make the code as reusable as possible. So it's 
>>>>>>>>>>>>>> better to
>>>>>>>>>>>>>> split/decouple the reusable parts from the codes. So a vague 
>>>>>>>>>>>>>> idea is:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) Decouple the packet comparing from the netfilter. You've 
>>>>>>>>>>>>>> achieved
>>>>>>>>>>>>>> this 99% since the work has been done in a thread. Just let the 
>>>>>>>>>>>>>> thread
>>>>>>>>>>>>>> poll sockets directly, then the comparing have the possibility 
>>>>>>>>>>>>>> to be
>>>>>>>>>>>>>> reused by other kinds of dataplane.
>>>>>>>>>>>>>> 2) Implement traffic mirror/redirector as filter.
>>>>>>>>>>>>>> 3) Implement TCP seq rewriting as a filter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then, in primary node, you need just a traffic mirror, which did:
>>>>>>>>>>>>>> - mirror ingress traffic to secondary node
>>>>>>>>>>>>>> - mirror outgress traffic to packet comparing thread
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And in secondadry node, you need two filters:
>>>>>>>>>>>>>> - A TCP seq rewriter which adjust tcp sequence number.
>>>>>>>>>>>>>> - A traffic redirector which redirect packet from a socket as 
>>>>>>>>>>>>>> ingress
>>>>>>>>>>>>>> traffic, and redirect outgress traffic to the socket which could 
>>>>>>>>>>>>>> be
>>>>>>>>>>>>>> polled by remote packet comparing thread.
>>>>>>>>>>>>>>   Thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> zhangchen
>>>>>>>>>>>> Hi, Jason.
>>>>>>>>>>>> We consider your suggestion to split/decouple
>>>>>>>>>>>> the reusable parts from the codes.
>>>>>>>>>>>> Due to filter plugin are traversed one by one in order
>>>>>>>>>>>> we will split colo-proxy to three filters in each side.
>>>>>>>>>>>>
>>>>>>>>>>>> But in this plan,primary and secondary both have socket
>>>>>>>>>>

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-21 Thread Wen Congyang

On 01/22/2016 01:33 PM, Jason Wang wrote:
> 
> 
> On 01/20/2016 06:34 PM, Wen Congyang wrote:
>> On 01/20/2016 06:03 PM, Jason Wang wrote:
>>>
>>> On 01/20/2016 05:49 PM, Wen Congyang wrote:
>>>> On 01/20/2016 05:20 PM, Jason Wang wrote:
>>>>> On 01/20/2016 03:44 PM, Wen Congyang wrote:
>>>>>>>> ...
>>>>>>>> -chardev socket,id=comparer0,host=ip_primary,port=X,server,nowait
>>>>>>>> -chardev socket,id=comparer1,host=ip_primary,port=Y,server,nowait
>>>>>>>> -chardev socket,id=mirrorer0,host=ip_primary,port=Z,server,nowait
>>>>>>>> -netdev tap,id=hn0
>>>>>>>> -traffic-mirrorer netdev=hn0,id=t0,indev=comparer0,outdev=mirrorer0
>>>>>>>> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1
>>>>>>>> ...
>>>>>>>>
>>>>>>>> packet comparer compares the packets from two chardev: comparer0 and
>>>>>>>> comparer1.
>>>>>>>> traffic-mirrorer mirror tx to secondary node through chardev mirrorer0,
>>>>>>>> and mirror rx to packet comparer through chardev comparer0.
>>>>>>>>
>>>>>>>> In secondary node:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> -chardev socket,id=redirector0,host=ip_primary,port=Y
>>>>>>>> -chardev socket,id=redirector1,host=ip_primary,port=Z
>>>>>>>> -netdev tap,id=hn0
>>>>>>>> -traffic-redirector 
>>>>>>>> netdev=hn0,id,r0,indev=redirector0,outdev=redirector1
>>>>>>>> -colo-rewriter netdev=hn0,id=c0
>>>>>>>> ...
>>>>>>>>
>>>>>>>> traffic-redirector redirect the rx traffic from primary node through
>>>>>>>> redirector0 and redirect the tx traffic to promary node through 
>>>>>>>> redirector1.
>>>>>>>> colo-rewriter rewrite seq number as a normal netfilter.
>>>>>> What are traffic-mirrorer and colo-comparer, traffic-redirector, 
>>>>>> colo-rewriter?
>>>>>> A netfilter driver?
>>>>> traffic-mirrorer/redirector is a type of netfilter that just
>>>>> mirror/redirect packets between netdev and chardev (just the mirror
>>>>> client/sever and redirect client/sever in the above graph)
>>>>> colo-rewriter is a type of netfilter that did ack/seq adjust (just the
>>>>> TCP rewriter in the above graph)
>>>>> colo-comparer is a thread object that did packet comparing (similar to
>>>>> "compare" in the above graph but not a netfiler)
>>>> Thanks. I have another question:
>>>> IIRC, both rx and tx packets walk through all netfilter objects in the 
>>>> same order.
>>>>
>>>> tx packet(sent to the guest): we want that redirector hanldes it first
>>>> rx packet(sent from the guest): we want that colo-rewriter handles it first
>>>> Change the order or use two traffic-redirectors?
>>>>
>>>> Thanks
>>>> Wen Congyang
>>> Interesting question.
>>>
>>> Two redirectors sounds ok or maybe we can go through rx filters in a
>>> reverse order?
>> netdev <---> filter1 <---> filter2 <---> ... <---> emulated device <---> guest
>> So I think we can go through rx filters in a reverse order. But it changes
>> the behavior. So I am not sure if we can do it.
> 
> I think we can. Both dump and buffer filter does not require strict
> order, so it's a good time and change to do this.

OK, we will do it.
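
A small sketch of the idea, only to pin down what "reverse order" means here. `enum direction` and `visit_order()` are hypothetical names, and which direction walks the list backwards is an assumption of the sketch, not something taken from the QEMU code.

```c
#include <assert.h>
#include <stddef.h>

enum direction { DIR_TX, DIR_RX };  /* hypothetical names */

/*
 * Fill order[] with the indices of n filters in the order a packet would
 * visit them: front to back for tx, back to front for rx.
 */
static void visit_order(enum direction dir, size_t n, size_t *order)
{
    for (size_t i = 0; i < n; i++) {
        order[i] = (dir == DIR_TX) ? i : n - 1 - i;
    }
}
```

With the chain [redirector, rewriter], a tx packet sees the redirector first and an rx packet sees the rewriter first, so one redirector is enough.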

Thanks
Wen Congyang

> 
>>
>> Thanks
>> Wen Congyang
>>
>>>
>>> .
>>>
>>
>>
> 
> 
> 
> .
>

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-21 Thread Wen Congyang

On 01/22/2016 03:42 PM, Jason Wang wrote:
> 
> 
> On 01/22/2016 02:47 PM, Wen Congyang wrote:
>> On 01/22/2016 02:21 PM, Jason Wang wrote:
>>>
>>> On 01/22/2016 01:56 PM, Wen Congyang wrote:
>>>> On 01/22/2016 01:41 PM, Jason Wang wrote:
>>>>>>
>>>>>> On 01/22/2016 11:28 AM, Wen Congyang wrote:
>>>>>>>> On 01/22/2016 11:15 AM, Jason Wang wrote:
>>>>>>>>>> On 01/20/2016 06:30 PM, Wen Congyang wrote:
>>>>>>>>>>>> On 01/20/2016 06:19 PM, Jason Wang wrote:
>>>>>>>>>>>>>>>> On 01/20/2016 06:01 PM, Wen Congyang wrote:
>>>>>>>>>>>>>>>>>>>> On 01/20/2016 02:54 PM, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>>>>>>> On 01/20/2016 11:29 AM, Zhang Chen wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sure.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Two main comments/suggestions:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - TCP analysis is missed in current version, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> maybe you point a git tree
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> (or another version of RFC) to me for a better 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> understanding of the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> design. (Just a skeleton for TCP should be 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> sufficient to discuss).
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - I prefer to make the code as reusable as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> possible. So it's better to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> split/decouple the reusable parts from the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> codes. So a vague idea is:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1) Decouple the packet comparing from the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> netfilter. You've achieved
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> this 99% since the work has been done in a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thread. Just let the thread
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> poll sockets directly, then the comparing have 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the possibility to be

Re: [Qemu-devel] [PATCH v9 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2016-01-20 Thread Wen Congyang

On 01/20/2016 11:43 PM, Alberto Garcia wrote:
> On Fri 25 Dec 2015 10:22:55 AM CET, Changlong Xie wrote:
>> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>>  ret = -EINVAL;
>>  goto exit;
>>  }
>> -if (s->num_children < 2) {
>> +if (s->num_children < 1) {
>>  error_setg(&local_err,
>> -   "Number of provided children must be greater than 1");
>> +   "Number of provided children must be 1 or more");
>>  ret = -EINVAL;
>>  goto exit;
>>  }
> 
> I have a question: if you have a Quorum with just one member and you add
> a new one, how do you know if it has the same data as the existing one?
> 
> In general, what do you do to make sure that the data in a new Quorum
> child is consistent with that of the rest of the array?

Quorum can already have more than one child when it starts, and we don't do a
similar check there. So I don't think we should add such a check here.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-20 Thread Wen Congyang

On 01/20/2016 05:20 PM, Jason Wang wrote:
> 
> 
> On 01/20/2016 03:44 PM, Wen Congyang wrote:
>>>>
>>>> ...
>>>> -chardev socket,id=comparer0,host=ip_primary,port=X,server,nowait
>>>> -chardev socket,id=comparer1,host=ip_primary,port=Y,server,nowait
>>>> -chardev socket,id=mirrorer0,host=ip_primary,port=Z,server,nowait
>>>> -netdev tap,id=hn0
>>>> -traffic-mirrorer netdev=hn0,id=t0,indev=comparer0,outdev=mirrorer0
>>>> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1
>>>> ...
>>>>
>>>> packet comparer compares the packets from two chardev: comparer0 and
>>>> comparer1.
>>>> traffic-mirrorer mirror tx to secondary node through chardev mirrorer0,
>>>> and mirror rx to packet comparer through chardev comparer0.
>>>>
>>>> In secondary node:
>>>>
>>>> ...
>>>> -chardev socket,id=redirector0,host=ip_primary,port=Y
>>>> -chardev socket,id=redirector1,host=ip_primary,port=Z
>>>> -netdev tap,id=hn0
>>>> -traffic-redirector netdev=hn0,id,r0,indev=redirector0,outdev=redirector1
>>>> -colo-rewriter netdev=hn0,id=c0
>>>> ...
>>>>
>>>> traffic-redirector redirect the rx traffic from primary node through
>>>> redirector0 and redirect the tx traffic to promary node through 
>>>> redirector1.
>>>> colo-rewriter rewrite seq number as a normal netfilter.
>> What are traffic-mirrorer and colo-comparer, traffic-redirector, 
>> colo-rewriter?
>> A netfilter driver?
> 
> traffic-mirrorer/redirector is a type of netfilter that just
> mirror/redirect packets between netdev and chardev (just the mirror
> client/server and redirect client/server in the above graph)
> colo-rewriter is a type of netfilter that did ack/seq adjust (just the
> TCP rewriter in the above graph)
> colo-comparer is a thread object that did packet comparing (similar to
> "compare" in the above graph but not a netfilter)

Thanks. I have another question:
IIRC, both rx and tx packets walk through all netfilter objects in the same
order.

tx packet (sent to the guest): we want the redirector to handle it first
rx packet (sent from the guest): we want the colo-rewriter to handle it first
Should we change the order, or use two traffic-redirectors?

Thanks
Wen Congyang

> 
>>
>> If not, how to get the packet from the netdev, and send back the packet to
>> the netdev?
>>
>> Thanks
>> Wen Congyang
>>
> 
> 
> 
> .
>

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-20 Thread Wen Congyang

>> [ASCII diagram of the primary and secondary filter chains, garbled in the
>> archive. Recoverable labels: mirror server, adjust ack, adjust seq,
>> redirect client; tap; "guest receive" and "guest send".]
>>
>> NOTE: filter direction is rx/tx/all
>> rx: receive packets sent to the netdev
>> tx: receive packets sent by the netdev
>>
> 
> I still like to decouple comparer from netfilter. It have two obvious
> advantages:
> 
> - make it can be reused by other dataplane (e.g vhost)
> - secondary redirector could redirect rx to comparer on primary node
> directly which simplify the design.
> 
>>
>>
>>
>>
>> guest recv packet route
>>
>> primary
>> tap --> mirror client filter
>> mirror client will send packet to guest,at the
>> same time, copy and forward packet to secondary
>> mirror server.
>>
>> secondary
>> mirror server filter --> TCP rewriter
>> if recv packet is TCP packet,we will adjust ack
>> and update TCP checksum, then send to secondary
>> guest. else directly send to guest.
>>
>>
>> guest send packet route
>>
>> primary
>> guest --> redirect server filter
>> redirect server filter recv primary guest packet
>> but do nothing, just pass to next filter.
>>
>> redirect server filter --> compare filter
>> compare filter recv primary guest packet then
>> waiting scondary redirect packet to compare it.
>> if packet same,send primary packet and clear secondary
>> packet, else send primary packet and do
>> checkpoint.
>>
>> secondary
>> guest --> TCP rewriter filter
>> if the packet is TCP packet,we will adjust seq
>> and update TCP checksum. then send it to
>> redirect client filter. else directly send to
>> redirect client filter.
>>
>> redirect client filter --> redirect server filter
>> forward packet to primary
>>
>>
>> In the failover case (primary is down), the TCP rewriter will keep
>> servicing
>> the TCP connections that were established after the last checkpoint.
>>
>>
>>
>> How about this plan?
> 
> Sounds good.
> 
> And there's indeed no need to differ client/server by reusing the socket
> chardev. E.g:
> 
> In primary node:
> 
> ...
> -chardev socket,id=comparer0,host=ip_primary,port=X,server,nowait
> -chardev socket,id=comparer1,host=ip_primary,port=Y,server,nowait
> -chardev socket,id=mirrorer0,host=ip_primary,port=Z,server,nowait
> -netdev tap,id=hn0
> -traffic-mirrorer netdev=hn0,id=t0,indev=comparer0,outdev=mirrorer0
> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1

Why does the mirrorer have an indev? I think we can use a traffic-redirector
for that. The command line would be:
-netdev tap,id=hn0
-object traffic-mirrorer,id=f0,netdev=hn0,queue=tx,outdev=mirrorer0
-object traffic-redirector,id=f1,netdev=hn0,queue=rx,outdev=comparer0
-colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1,netdev=hn0
In the comparer thread, we can use qemu_net_queue_send_iov() to send
out the packet.

Also, we can merge the socket chardevs comparer1 and mirrorer0.
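
The queue=tx / queue=rx split in the command line above relies on each filter only seeing packets of its own direction. As a rough sketch of that matching rule (`enum queue_dir` and `filter_handles()` are hypothetical names that mirror the rx/tx/all option, not the actual QEMU code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names mirroring the queue=rx/tx/all option. */
enum queue_dir { QUEUE_RX, QUEUE_TX, QUEUE_ALL };

/* A filter sees a packet when its queue matches the packet's direction. */
static bool filter_handles(enum queue_dir filter_queue, enum queue_dir pkt_dir)
{
    return filter_queue == QUEUE_ALL || filter_queue == pkt_dir;
}
```

So the mirrorer with queue=tx never touches rx packets, and the redirector with queue=rx never touches tx packets.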

Thanks
Wen Congyang

> ...
> 
> packet comparer compares the packets from two chardev: comparer0 and
> comparer1.
> traffic-mirrorer mirror tx to secondary node through chardev mirrorer0,
> and mirror rx to packet comparer through chardev comparer0.
> 
> In secondary node:
> 
> ...
> -chardev socket,id=redirector0,host=ip_primary,port=Y
> -chardev socket,id=redirector1,host=ip_primary,port=Z
> -netdev tap,id=hn0
> -traffic-redirector netdev=hn0,id=r0,indev=redirector0,outdev=redirector1
> -colo-rewriter netdev=hn0,id=c0
> ...
> 
> traffic-redirector redirect the rx traffic from primary node through
> redirector0 and redirect the tx traffic to primary node through redirector1.
> colo-rewriter rewrite seq number as a normal netfilter.
> 
> 
> 
>>
>>
>>> .
>>>
>>
> 
> 
> 
> 
> .
>

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-20 Thread Wen Congyang

On 01/20/2016 06:03 PM, Jason Wang wrote:
> 
> 
> On 01/20/2016 05:49 PM, Wen Congyang wrote:
>> On 01/20/2016 05:20 PM, Jason Wang wrote:
>>>
>>> On 01/20/2016 03:44 PM, Wen Congyang wrote:
>>>>>> ...
>>>>>> -chardev socket,id=comparer0,host=ip_primary,port=X,server,nowait
>>>>>> -chardev socket,id=comparer1,host=ip_primary,port=Y,server,nowait
>>>>>> -chardev socket,id=mirrorer0,host=ip_primary,port=Z,server,nowait
>>>>>> -netdev tap,id=hn0
>>>>>> -traffic-mirrorer netdev=hn0,id=t0,indev=comparer0,outdev=mirrorer0
>>>>>> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1
>>>>>> ...
>>>>>>
>>>>>> packet comparer compares the packets from two chardev: comparer0 and
>>>>>> comparer1.
>>>>>> traffic-mirrorer mirror tx to secondary node through chardev mirrorer0,
>>>>>> and mirror rx to packet comparer through chardev comparer0.
>>>>>>
>>>>>> In secondary node:
>>>>>>
>>>>>> ...
>>>>>> -chardev socket,id=redirector0,host=ip_primary,port=Y
>>>>>> -chardev socket,id=redirector1,host=ip_primary,port=Z
>>>>>> -netdev tap,id=hn0
>>>>>> -traffic-redirector netdev=hn0,id,r0,indev=redirector0,outdev=redirector1
>>>>>> -colo-rewriter netdev=hn0,id=c0
>>>>>> ...
>>>>>>
>>>>>> traffic-redirector redirect the rx traffic from primary node through
>>>>>> redirector0 and redirect the tx traffic to promary node through 
>>>>>> redirector1.
>>>>>> colo-rewriter rewrite seq number as a normal netfilter.
>>>> What are traffic-mirrorer and colo-comparer, traffic-redirector, colo-rewriter?
>>>> A netfilter driver?
>>> traffic-mirrorer/redirector is a type of netfilter that just
>>> mirrors/redirects packets between netdev and chardev (just the mirror
>>> client/server and redirect client/server in the above graph)
>>> colo-rewriter is a type of netfilter that does the ack/seq adjustment (just
>>> the TCP rewriter in the above graph)
>>> colo-comparer is a thread object that does the packet comparing (similar to
>>> "compare" in the above graph but not a netfilter)
>> Thanks. I have another question:
>> IIRC, both rx and tx packets walk through all netfilter objects in the same order.
>>
>> tx packet (sent to the guest): we want the redirector to handle it first
>> rx packet (sent from the guest): we want colo-rewriter to handle it first
>> Should we change the order or use two traffic-redirectors?
>>
>> Thanks
>> Wen Congyang
> 
> Interesting question.
> 
> Two redirectors sounds ok or maybe we can go through rx filters in a
> reverse order?

netdev <---> filter1 <---> filter2 <---> ... <---> emulated device <---> guest
So I think we can go through rx filters in a reverse order. But it changes
the behavior. So I am not sure if we can do it.
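
To make the ordering question concrete, here is a minimal sketch of a chain walked forward for packets going to the guest and backward for packets coming from the guest. It is not QEMU code: Direction, Trace and walk_filters() are names made up for this illustration, and filter index 0 is assumed to be the one closest to the netdev.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the thread only talks about "tx" and "rx". */
typedef enum { DIR_TO_GUEST, DIR_FROM_GUEST } Direction;

#define MAX_TRACE 8

/* Records the order in which the filters saw one packet. */
typedef struct {
    int ids[MAX_TRACE];
    size_t len;
} Trace;

/*
 * Walk n filters, numbered 0..n-1 with 0 closest to the netdev.
 * Packets going to the guest visit 0, 1, ..., n-1.
 * Packets coming from the guest visit n-1, ..., 1, 0.
 */
static void walk_filters(size_t n, Direction dir, Trace *t)
{
    assert(n <= MAX_TRACE);
    t->len = 0;
    for (size_t i = 0; i < n; i++) {
        size_t idx = (dir == DIR_TO_GUEST) ? i : n - 1 - i;
        t->ids[t->len++] = (int)idx;
    }
}
```

With filter 0 as the redirector and filter 1 as colo-rewriter, a packet sent to the guest reaches the redirector first and a packet sent from the guest reaches colo-rewriter first, which is the order asked for above. Whether existing users depend on the current order is the open question.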

Thanks
Wen Congyang

> 
> 
> .
>

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-20 Thread Wen Congyang

On 01/20/2016 06:19 PM, Jason Wang wrote:
> 
> 
> On 01/20/2016 06:01 PM, Wen Congyang wrote:
>> On 01/20/2016 02:54 PM, Jason Wang wrote:
>>>
>>> On 01/20/2016 11:29 AM, Zhang Chen wrote:
>>>>> Sure.
>>>>>
>>>>> Two main comments/suggestions:
>>>>>
>>>>> - TCP analysis is missing in the current version, maybe you can point me
>>>>> to a git tree (or another version of the RFC) for a better understanding
>>>>> of the design. (Just a skeleton for TCP should be sufficient to discuss).
>>>>> - I prefer to make the code as reusable as possible. So it's better to
>>>>> split/decouple the reusable parts from the codes. So a vague idea is:
>>>>>
>>>>> 1) Decouple the packet comparing from the netfilter. You've achieved
>>>>> this 99% since the work has been done in a thread. Just let the thread
>>>>> poll sockets directly, then the comparing have the possibility to be
>>>>> reused by other kinds of dataplane.
>>>>> 2) Implement traffic mirror/redirector as filter.
>>>>> 3) Implement TCP seq rewriting as a filter.
>>>>>
>>>>> Then, in the primary node, you need just a traffic mirror, which does:
>>>>> - mirror ingress traffic to the secondary node
>>>>> - mirror egress traffic to the packet comparing thread
>>>>>
>>>>> And in the secondary node, you need two filters:
>>>>> - A TCP seq rewriter which adjusts the TCP sequence number.
>>>>> - A traffic redirector which redirects packets from a socket as ingress
>>>>> traffic, and redirects egress traffic to the socket which could be
>>>>> polled by the remote packet comparing thread.
>>>>>   Thoughts?
>>>>>
>>>>> Thanks
>>>>>
>>>>>> Thanks
>>>>>> zhangchen
>>>>
>>>> Hi, Jason.
>>>> We considered your suggestion to split/decouple
>>>> the reusable parts from the code.
>>>> Because filter plugins are traversed one by one, in order,
>>>> we will split colo-proxy into three filters on each side.
>>>>
>>>> But in this plan, both primary and secondary have a socket
>>>> server, so startup is a problem.
>>> I believe this issue could be solved by reusing socket chardev.
>>>
>>>>
>>>> [ASCII diagram: netfilter chains inside the primary qemu and the secondary
>>>> qemu, each between the guest and the netdev, with the filter execution
>>>> order; garbled in the archive and cut off here]

Re: [Qemu-devel] [RFC PATCH v2 00/10] Add colo-proxy based on netfilter

2016-01-19 Thread Wen Congyang

>> [ASCII diagram, garbled in the archive. Recoverable labels: mirror
>> client/server, adjust ack, adjust seq, redirect client/server, tx/rx/all,
>> "guest receive", "guest send", tap.]
>>
>> NOTE: filter direction is rx/tx/all
>> rx: receive packets sent to the netdev
>> tx: receive packets sent by the netdev
>>
>>
>>
>>
>>
>>
> 
> I would still like to decouple the comparer from netfilter. It has two obvious
> advantages:
> 
> - it can be reused by other dataplanes (e.g. vhost)
> - the secondary redirector could redirect rx to the comparer on the primary
> node directly, which simplifies the design.
> 
>>
>>
>>
>>
>> guest recv packet route
>>
>> primary
>>   tap --> mirror client filter
>>     The mirror client sends the packet to the guest and, at the
>>     same time, copies and forwards it to the secondary
>>     mirror server.
>>
>> secondary
>>   mirror server filter --> TCP rewriter
>>     If the received packet is a TCP packet, we adjust the ack
>>     and update the TCP checksum, then send it to the secondary
>>     guest. Otherwise we send it directly to the guest.
>>
>>
>> guest send packet route
>>
>> primary
>>   guest --> redirect server filter
>>     The redirect server filter receives the primary guest packet
>>     but does nothing, it just passes it to the next filter.
>>
>>   redirect server filter --> compare filter
>>     The compare filter receives the primary guest packet, then
>>     waits for the secondary redirect packet to compare it with.
>>     If the packets are the same, it sends the primary packet and
>>     clears the secondary packet; otherwise it sends the primary
>>     packet and does a checkpoint.
>>
>> secondary
>>   guest --> TCP rewriter filter
>>     If the packet is a TCP packet, we adjust the seq
>>     and update the TCP checksum, then send it to the
>>     redirect client filter. Otherwise we send it directly to the
>>     redirect client filter.
>>
>>   redirect client filter --> redirect server filter
>>     forward the packet to the primary
>>
>>
>> In the failover scenario (primary is down), the TCP rewriter will keep
>> servicing the TCP connections that were established after the last
>> checkpoint.
>>
>>
>>
>> How about this plan?
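
The compare step in the "guest send packet route" above can be sketched as follows. This is only an illustration of the rule "same: send primary and clear secondary; different: send primary and do a checkpoint". Packet, CompareResult and compare_packets() are hypothetical names, and a byte-for-byte comparison is assumed, which the real comparer may well refine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical packet view: a buffer and its length. */
typedef struct {
    const unsigned char *data;
    size_t len;
} Packet;

/* What the compare step decides for one primary/secondary pair. */
typedef struct {
    bool send_primary;     /* the primary packet is released in both cases */
    bool need_checkpoint;  /* set when the two packets differ */
} CompareResult;

/* Assumption: packets are equal only if they match byte for byte. */
static bool packet_equal(const Packet *a, const Packet *b)
{
    if (a->len != b->len) {
        return false;
    }
    return a->len == 0 || memcmp(a->data, b->data, a->len) == 0;
}

static CompareResult compare_packets(const Packet *pri, const Packet *sec)
{
    CompareResult r = { .send_primary = true, .need_checkpoint = false };

    if (!packet_equal(pri, sec)) {
        /* Outputs diverged: the secondary must be resynchronized. */
        r.need_checkpoint = true;
    }
    return r;
}
```

In both outcomes the client only ever sees the primary packet; the secondary packet is used for the comparison and then dropped.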
> 
> Sounds good.
> 
> And there's indeed no need to differ client/server by reusing the socket
> chardev. E.g:
> 
> In primary node:

Thanks for your suggestion.

> 
> ...
> -chardev socket,id=comparer0,host=ip_primary,port=X,server,nowait
> -chardev socket,id=comparer1,host=ip_primary,port=Y,server,nowait
> -chardev socket,id=mirrorer0,host=ip_primary,port=Z,server,nowait
> -netdev tap,id=hn0
> -traffic-mirrorer netdev=hn0,id=t0,indev=comparer0,outdev=mirrorer0
> -colo-comparer primary_traffic=comparer0,secondary_traffic=comparer1
> ...
> 
> packet comparer compares the packets from two chardev: comparer0 and
> comparer1.
> traffic-mirrorer mirror tx to secondary node through chardev mirrorer0,
> and mirror rx to packet comparer through chardev comparer0.
> 
> In secondary node:
> 
> ...
> -chardev socket,id=redirector0,host=ip_primary,port=Y
> -chardev socket,id=redirector1,host=ip_primary,port=Z
> -netdev tap,id=hn0
> -traffic-redirector netdev=hn0,id=r0,indev=redirector0,outdev=redirector1
> -colo-rewriter netdev=hn0,id=c0
> ...
> 
> traffic-redirector redirects the rx traffic from the primary node through
> redirector0 and redirects the tx traffic to the primary node through redirector1.
> colo-rewriter rewrites the seq number as a normal netfilter.

What are traffic-mirrorer and colo-comparer, traffic-redirector, colo-rewriter?
A netfilter driver?

If not, how do they get the packet from the netdev, and send the packet back
to the netdev?

Thanks
Wen Congyang

> 
> 
> 
>>
>>
>>> .
>>>
>>
> 
> 
> 
> 
> .
>

Re: [Qemu-devel] [PATCH v9 0/3] qapi: child add/delete support

2016-01-17 Thread Wen Congyang

Ping...

On 12/25/2015 05:22 PM, Changlong Xie wrote:
> If quorum's child is broken, we can use mirror job to replace it.
> But sometimes, the user only need to remove the broken child, and
> add it later when the problem is fixed.
> 
> ChangLog:
> v9:
> 1. Rebase to the newest codes
> 2. Remove redundant codes in quorum_add_child() and quorum_del_child()
> 3. Fix typos in qmp-commands.hx
> v8:
> 1. Rebase to the newest codes
> 2. Address the comments from Eric Blake
> v7:
> 1. Remove the qmp command x-blockdev-change's parameter operation according
>    to Kevin's comments.
> 2. Remove the hmp command.
> v6:
> 1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add
>    and x-blockdev-child-delete
> v5:
> 1. Address Eric Blake's comments
> v4:
> 1. Drop nbd driver's implementation. We can use human-monitor-command
>    to do it.
> 2. Rename the command name.
> v3:
> 1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is
>    created by the QMP command blockdev-add.
> 2. The driver NBD can support filename, path, host:port now.
> v2:
> 1. Use bdrv_get_device_or_node_name() instead of new function
>    bdrv_get_id_or_node_name()
> 2. Update the error message
> 3. Update the documents in block-core.json
> 
> Wen Congyang (3):
>   Add new block driver interface to add/delete a BDS's child
>   quorum: implement bdrv_add_child() and bdrv_del_child()
>   qmp: add monitor command to add/remove a child
> 
>  block.c   |  58 --
>  block/quorum.c| 122 +-
>  blockdev.c|  54 
>  include/block/block.h |   9 
>  include/block/block_int.h |   5 ++
>  qapi/block-core.json  |  23 +
>  qmp-commands.hx   |  47 ++
>  7 files changed, 312 insertions(+), 6 deletions(-)
>

Re: [Qemu-devel] [Patch v12 resend 08/10] Implement new driver for block replication

2016-01-03 Thread Wen Congyang

On 12/23/2015 05:47 PM, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2015 at 01:37:25PM +0800, Wen Congyang wrote:
>> +/*
>> + * Only write to active disk if the sectors have
>> + * already been allocated in active disk/hidden disk.
>> + */
>> +qemu_iovec_init(&hd_qiov, qiov->niov);
>> +while (remaining_sectors > 0) {
>> +ret = bdrv_is_allocated_above(top, base, sector_num,
>> +  remaining_sectors, &n);
> 
> There is a race condition here since multiple I/O requests can be in
> flight at the same time.   If two requests touch the same cluster
> between top->base then the result of these checks could be unreliable.

I don't think so. When we come here, primary qemu is gone, and failover is
done. We only write to the active disk if the sectors have already been
allocated in the active disk/hidden disk before failover. So if two requests
touch the same cluster, it is OK, because the return value of
bdrv_is_allocated_above() does not change.
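
A toy model of the loop under discussion, to show why the routing is stable: each run of sectors goes to the top (active/hidden) image if it was already allocated there, and to the base (secondary disk) otherwise, and writing to the base never changes the allocation map. allocated_run() and route_write() are made-up stand-ins working on a plain bool array, not the block layer API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TARGET_BASE = 0, TARGET_TOP = 1 };

/*
 * Toy stand-in for an allocation query: returns whether `sector` is
 * allocated and stores in *n how many consecutive sectors, at most `max`,
 * share that status. `max` must be at least 1.
 */
static bool allocated_run(const bool *alloc, size_t sector, size_t max,
                          size_t *n)
{
    bool status = alloc[sector];
    size_t len = 1;

    while (len < max && alloc[sector + len] == status) {
        len++;
    }
    *n = len;
    return status;
}

/*
 * Route a write of `count` sectors starting at `sector`. targets[i] receives
 * the destination of the i-th sector of the request. Returns the number of
 * separate write requests that would be issued.
 */
static size_t route_write(const bool *alloc, size_t sector, size_t count,
                          int *targets)
{
    size_t chunks = 0;
    size_t done = 0;

    while (count > 0) {
        size_t n;
        bool in_top = allocated_run(alloc, sector, count, &n);

        for (size_t i = 0; i < n; i++) {
            targets[done + i] = in_top ? TARGET_TOP : TARGET_BASE;
        }
        count -= n;
        sector += n;
        done += n;
        chunks++;
    }
    return chunks;
}
```

Because the map is only read here, two requests that overlap get the same answer for the same sector no matter which one runs first. This says nothing about the ordering of the data itself, which is what the CoMutex suggestion addresses.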

> 
> The simple but slow solution is to use a CoMutex to serialize requests.
> 
>> +if (ret < 0) {
>> +return ret;
>> +}
>> +
>> +qemu_iovec_reset(&hd_qiov);
>> +qemu_iovec_concat(&hd_qiov, qiov, bytes_done, n * 512);
>> +
>> +target = ret ? top : base;
>> +ret = bdrv_co_writev(target, sector_num, n, &hd_qiov);
>> +if (ret < 0) {
>> +return ret;
>> +}
>> +
>> +remaining_sectors -= n;
>> +sector_num += n;
>> +bytes_done += n * BDRV_SECTOR_SIZE;
>> +}
> 
> I think this can be replaced with an active commit block job that copies
> data down from the hidden/active disk to the secondary disk.  It is okay
> to keep writing to the secondary disk while the block job is running and
> then switch over to the secondary disk once it completes.

Yes, active commit is another choice. IIRC, I didn't use it because the mirror
job had some problems. They are fixed now (see
bdrv_drained_begin()/bdrv_drained_end() in the mirror job).
We will use the mirror job in the next version.

> 
>> +
>> +return 0;
>> +}
>> +
>> +static coroutine_fn int replication_co_discard(BlockDriverState *bs,
>> +   int64_t sector_num,
>> +   int nb_sectors)
>> +{
>> +BDRVReplicationState *s = bs->opaque;
>> +int ret;
>> +
>> +ret = replication_get_io_status(s);
>> +if (ret < 0) {
>> +return ret;
>> +}
>> +
>> +if (ret == 1) {
>> +/* It is secondary qemu and we are after failover */
>> +ret = bdrv_co_discard(s->secondary_disk, sector_num, nb_sectors);
> 
> What if the clusters are still allocated in the hidden/active disk?
> 

What does discard do? Does it drop the data that is allocated in the disk?
If so, I think I misunderstood it. I will fix it in the next version.

Thanks
Wen Congyang

Re: [Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description

2016-01-03 Thread Wen Congyang

On 12/23/2015 05:26 PM, Stefan Hajnoczi wrote:
> On Wed, Dec 02, 2015 at 01:31:46PM +0800, Wen Congyang wrote:
>> +== Failure Handling ==
>> +There are 6 internal errors when block replication is running:
>> +1. I/O error on primary disk
>> +2. Forwarding primary write requests failed
>> +3. Backup failed
>> +4. I/O error on secondary disk
>> +5. I/O error on active disk
>> +6. Making active disk or hidden disk empty failed
>> +In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
>> +4 and 6, we just report block replication's error to FT/HA manager (which
>> +decides when to do a new checkpoint, when to do failover).
>> +There is no internal error when doing failover.
> 
> Not sure this is true.
> 
> Below it says the following for failover: "We will flush the Disk buffer
> into Secondary Disk and stop block replication".  Flushing the disk
> buffer can result in I/O errors.  This means that failover operations
> are not guaranteed to succeed.

We don't use the mirror job now. We may use it in the next version.
Is there any way to learn about an I/O error while the mirror job is running?
By getting the job's status?

> 
> In practice I think this is similar to a successful failover followed by
> immediately getting I/O errors on the new Primary Disk.  It means that
> right after failover there is another failure and the system may not be
> able to continue.

Block replication is not designed for such a case. For example, we don't do
failover on the primary disk's failure. In such a case, we just report the
error to the disk layer (it is case 1 in the Failure Handling section above).

Sorry for the late reply. Your mail was sent on 2015-12-23, but I received
it on 2016-01-04.

> 
> So this really only matters in the case where there is a new Secondary
> ready after failover.  In that case the user might expect failover to
> continue to the new Secondary (Host 3):
> 
>    [X]        [X]
>   Host 1 <-> Host 2 <-> Host 3
>

Re: [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints

2016-01-03 Thread Wen Congyang

On 12/23/2015 06:04 PM, Stefan Hajnoczi wrote:
> On Thu, Dec 17, 2015 at 02:22:14PM +0800, Wen Congyang wrote:
>> Stefan:Ping...
>>
>> What about this feature? I have worked on it for about 1 year, but it is
>> still on the way...
> 
> The code still has TODOs.  What is the plan for supporting replication
> after failover?  This feature seems critical because anyone who wants FT
> won't be able to use this code unless it supports FT after the first
> failure.

We have implemented it based on an old version of qemu. To keep the logic
simple, we have not posted it yet. We will post it after this feature is merged
into qemu.

> 
> ---
> 
> Adding new block layer APIs that are replication-specific is not clean.
> Only the replication block driver cares about the start/stop/checkpoint
> interface.
> 
> It is cleaner to have a separate API and data structure for block
> replication.
> 
> The replication code should define its own BlockReplicationOps struct
> and allow objects to register themselves.  Then it's no longer necessary
> to modify the core block layer to forward start/stop/checkpoint calls.
> 
> Something like:
> 
> typedef struct BlockReplicationOps BlockReplicationOps;
> typedef struct BlockReplicationState {
> const BlockReplicationOps *ops;
> QLIST_ENTRY(BlockReplicationState) list;
> } BlockReplicationState;
> 
> typedef struct {
> void start(BlockReplicationState *brs, Error **errp);
> void stop(BlockReplicationState *brs, Error **errp);
> void checkpoint(BlockReplicationState *brs, Error **errp);
> } BlockReplicationOps;
> 
> static QLIST_HEAD(BlockReplicationState) block_replication_states;
> 
> void block_replication_add(BlockReplicationState *brs);
> void block_replication_remove(BlockReplicationState *brs);
> 
> The replication block driver would add/remove itself.  The quorum block
> driver probably doesn't need to be modified (I think in your current
> patches you modify it just to forward the start/stop/checkpoint calls to
> a particular quorum child).

Yes, that is the main purpose. We also do a check in the quorum driver:
we don't allow more than one child to support block replication.

Thanks
Wen Congyang

> 
> Stefan
>
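
For reference, a compilable sketch of the registration idea proposed above. It is a simplification under stated assumptions, not the proposed QEMU interface: the callbacks are written as function pointers, the Error **errp argument is left out, a plain singly linked list stands in for QLIST, and DemoDriver, demo_ops and block_replication_checkpoint_all() are hypothetical names added only to show how a driver would use it.

```c
#include <assert.h>
#include <stddef.h>

typedef struct BlockReplicationState BlockReplicationState;

/* Callbacks as function pointers; errp is omitted in this sketch. */
typedef struct BlockReplicationOps {
    void (*start)(BlockReplicationState *brs);
    void (*stop)(BlockReplicationState *brs);
    void (*checkpoint)(BlockReplicationState *brs);
} BlockReplicationOps;

struct BlockReplicationState {
    const BlockReplicationOps *ops;
    BlockReplicationState *next;    /* stand-in for QLIST_ENTRY */
};

/* Head of the list of registered objects. */
static BlockReplicationState *block_replication_states;

void block_replication_add(BlockReplicationState *brs)
{
    assert(brs != NULL && brs->ops != NULL);
    brs->next = block_replication_states;
    block_replication_states = brs;
}

void block_replication_remove(BlockReplicationState *brs)
{
    BlockReplicationState **p = &block_replication_states;

    while (*p != NULL) {
        if (*p == brs) {
            *p = brs->next;
            brs->next = NULL;
            return;
        }
        p = &(*p)->next;
    }
}

/* Hypothetical helper: forward a checkpoint to every registered object. */
void block_replication_checkpoint_all(void)
{
    BlockReplicationState *brs;

    for (brs = block_replication_states; brs != NULL; brs = brs->next) {
        if (brs->ops->checkpoint != NULL) {
            brs->ops->checkpoint(brs);
        }
    }
}

/* Hypothetical driver that only counts the checkpoints it receives. */
typedef struct DemoDriver {
    BlockReplicationState brs;      /* first member, so the cast is valid */
    int checkpoints;
} DemoDriver;

static void demo_checkpoint(BlockReplicationState *brs)
{
    ((DemoDriver *)brs)->checkpoints++;
}

static const BlockReplicationOps demo_ops = {
    .checkpoint = demo_checkpoint,
};
```

With this shape the core block layer does not need start/stop/checkpoint entry points: the replication driver registers itself when it is opened and removes itself when it is closed.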

Re: [Qemu-devel] [PATCH COLO-Frame v12 25/38] qmp event: Add event notification for COLO error

2015-12-22 Thread Wen Congyang

On 12/19/2015 06:02 PM, Markus Armbruster wrote:
> Copying qemu-block because this seems related to generalising block jobs
> to background jobs.
> 
> zhanghailiang <zhang.zhanghaili...@huawei.com> writes:
> 
>> If some errors happen during VM's COLO FT stage, it's important to notify the users
>> of this event. Together with 'colo_lost_heartbeat', users can intervene in COLO's
>> failover work immediately.
>> If users don't want to get involved in COLO's failover verdict,
>> it is still necessary to notify users that we exited COLO mode.
>>
>> Cc: Markus Armbruster <arm...@redhat.com>
>> Cc: Michael Roth <mdr...@linux.vnet.ibm.com>
>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>> Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com>
>> ---
>> v11:
>> - Fix several typos found by Eric
>>
>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>> ---
>>  docs/qmp-events.txt | 17 +
>>  migration/colo.c| 11 +++
>>  qapi-schema.json| 16 
>>  qapi/event.json | 17 +
>>  4 files changed, 61 insertions(+)
>>
>> diff --git a/docs/qmp-events.txt b/docs/qmp-events.txt
>> index d2f1ce4..19f68fc 100644
>> --- a/docs/qmp-events.txt
>> +++ b/docs/qmp-events.txt
>> @@ -184,6 +184,23 @@ Example:
>>  Note: The "ready to complete" status is always reset by a BLOCK_JOB_ERROR
>>  event.
>>  
>> +COLO_EXIT
>> +---------
>> +
>> +Emitted when VM finishes COLO mode due to some errors happening or
>> +at the request of users.
> 
> How would the event's recipient distinguish between "due to error" and
> "at the user's request"?
> 
>> +
>> +Data:
>> +
>> + - "mode": COLO mode, primary or secondary side (json-string)
>> + - "reason":  the exit reason, internal error or external request. (json-string)
>> + - "error": error message (json-string, optional)
>> +
>> +Example:
>> +
>> +{"timestamp": {"seconds": 2032141960, "microseconds": 417172},
>> + "event": "COLO_EXIT", "data": {"mode": "primary", "reason": "request" } }
>> +
> 
> Pardon my ignorance again...  Does "VM finishes COLO mode" means have
> some kind of COLO background job, and it just finished for whatever
> reason?
> 
> If yes, this COLO job could be an instance of the general background job
> concept we're trying to grow from the existing block job concept.
> 
> I'm not asking you to rebase your work onto the background job
> infrastructure, not least for the simple reason that it doesn't exist,
> yet.  But I think it would be fruitful to compare your COLO job
> management QMP interface with the one we have for block jobs.  Not only
> may that avoid unnecessary inconsistency, it could also help shape the
> general background job interface.

COLO is not a block job. If live migration is a background job, COLO
is also a background job.

> 
> Quick overview of the block job QMP interface:
> 
> * Commands to create a job: block-commit, block-stream, drive-mirror,
>   drive-backup.
> 
> * Get information on jobs: query-block-jobs
> 
> * Pause a job: block-job-pause
> 
> * Resume a job: block-job-resume
> 
> * Cancel a job: block-job-cancel
> 
> * Block job completion events: BLOCK_JOB_COMPLETED, BLOCK_JOB_CANCELLED
> 
> * Block job error event: BLOCK_JOB_ERROR
> 
> * Block job synchronous completion: event BLOCK_JOB_READY and command
>   block-job-complete

What is the background job infrastructure? Do you mean implementing all the
above interfaces for each background job?

Thanks
Wen Congyang

> 
>>  DEVICE_DELETED
>>  --------------
>>  
>> diff --git a/migration/colo.c b/migration/colo.c
>> index d1dd4e1..d06c14f 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -18,6 +18,7 @@
>>  #include "qemu/error-report.h"
>>  #include "qemu/sockets.h"
>>  #include "migration/failover.h"
>> +#include "qapi-event.h"
>>  
>>  /* colo buffer */
>>  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)
>> @@ -349,6 +350,11 @@ static void colo_process_checkpoint(MigrationState *s)
>>  out:
>>  if (ret < 0) {
>>  error_report("%s: %s", __func__, strerror(-ret));
>> +qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
>> +

Re: [Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints

2015-12-16 Thread Wen Congyang

Stefan:Ping...

What about this feature? I have worked on it for about 1 year, but it is
still on the way...

On 12/02/2015 01:31 PM, Wen Congyang wrote:
> Block replication is a very important feature which is used for
> continuous checkpoints(for example: COLO).
> 
> You can get the detailed information about block replication from here:
> http://wiki.qemu.org/Features/BlockReplication
> 
> Usage:
> Please refer to docs/block-replication.txt
> 
> This patch series is based on the following patch series:
> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
> 2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html
> 
> You can get the patch here:
> https://github.com/coloft/qemu/tree/wency/block-replication-v12
> 
> You can get the patch with framework here:
> https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2
> 
> TODO:
> 1. Continuous block replication. It will be started after basic functions
>    are accepted.
> 
> Change Log:
> V12:
> 1. Rebase to the newest codes
> 2. Use backing reference to replace 'allow-write-backing-file'
> V11:
> 1. Reopen the backing file when starting block replication if it is not
>    opened in R/W mode
> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>    when opening backing file
> 3. Block the top BDS so there is only one block job for the top BDS and
>    its backing chain.
> V10:
> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>    reference.
> 2. Address the comments from Eric Blake
> V9:
> 1. Update the error messages
> 2. Rebase to the newest qemu
> 3. Split child add/delete support. These patches are sent in another patchset.
> V8:
> 1. Address Alberto Garcia's comments
> V7:
> 1. Implement adding/removing quorum child. Remove the option non-connect.
> 2. Simplify the backing reference option according to Stefan Hajnoczi's
>    suggestion
> V6:
> 1. Rebase to the newest qemu.
> V5:
> 1. Address the comments from Gong Lei
> 2. Speed the failover up. The secondary vm can take over very quickly even
>    if there are too many I/O requests.
> V4:
> 1. Introduce a new driver replication to avoid touching nbd and qcow2.
> V3:
> 1. Use error_setg() instead of error_set()
> 2. Add a new block job API
> 3. Active disk, hidden disk and nbd target uses the same AioContext
> 4. Add a testcase to test new hbitmap API
> V2:
> 1. Redesign the secondary qemu(use image-fleecing)
> 2. Use Error objects to return error message
> 3. Address the comments from Max Reitz and Eric Blake
> 
> Wen Congyang (10):
>   unblock backup operations in backing file
>   Store parent BDS in BdrvChild
>   Backup: clear all bitmap when doing block checkpoint
>   Allow creating backup jobs when opening BDS
>   docs: block replication's description
>   Add new block driver interfaces to control block replication
>   quorum: implement block driver interfaces for block replication
>   Implement new driver for block replication
>   support replication driver in blockdev-add
>   Add a new API to start/stop replication, do checkpoint to all BDSes
> 
>  block.c| 145 
>  block/Makefile.objs|   3 +-
>  block/backup.c |  14 ++
>  block/quorum.c |  78 +++
>  block/replication.c| 549 +
>  blockjob.c |  11 +
>  docs/block-replication.txt | 227 +++
>  include/block/block.h  |   9 +
>  include/block/block_int.h  |  15 ++
>  include/block/blockjob.h   |  12 +
>  qapi/block-core.json   |  34 ++-
>  11 files changed, 1093 insertions(+), 4 deletions(-)
>  create mode 100644 block/replication.c
>  create mode 100644 docs/block-replication.txt
>

Re: [Qemu-devel] [PATCH] rcu: optimize rcu_read_lock

2015-12-16 Thread Wen Congyang

On 12/16/2015 07:32 PM, Paolo Bonzini wrote:
> rcu_read_lock cannot change rcu_gp_ongoing from true to false
> (the previous value of p_rcu_reader->ctr is zero), hence
> there is no need to check p_rcu_reader->waiting and wake up
> a concurrent synchronize_rcu.
> 
> While at it mark the wakeup as unlikely in rcu_read_unlock.
> 
> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>

Reviewed-by: Wen Congyang <we...@cn.fujitsu.com>

> ---
>  include/qemu/rcu.h | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
> index f6d1d56..7c7cca7 100644
> --- a/include/qemu/rcu.h
> +++ b/include/qemu/rcu.h
> @@ -88,10 +88,6 @@ static inline void rcu_read_lock(void)
>  
>  ctr = atomic_read(&rcu_gp_ctr);
>  atomic_xchg(&p_rcu_reader->ctr, ctr);
> -if (atomic_read(&p_rcu_reader->waiting)) {
> -atomic_set(&p_rcu_reader->waiting, false);
> -qemu_event_set(&rcu_gp_event);
> -}
>  }
>  
>  static inline void rcu_read_unlock(void)
> @@ -104,7 +100,7 @@ static inline void rcu_read_unlock(void)
>  }
>  
>  atomic_xchg(&p_rcu_reader->ctr, 0);
> -if (atomic_read(&p_rcu_reader->waiting)) {
> +if (unlikely(atomic_read(&p_rcu_reader->waiting))) {
>  atomic_set(&p_rcu_reader->waiting, false);
>  qemu_event_set(&rcu_gp_event);
>  }
>
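
The argument in the commit message can be checked on a simplified, single-threaded model. The sketch below is not the code from include/qemu/rcu.h: it uses C11 atomics, leaves out the memory barriers, and counts wakeups in a plain variable instead of setting an event. Reader, reader_lock(), reader_unlock() and start_grace_period() are names chosen for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RCU_GP_LOCKED 1UL
#define RCU_GP_CTR    2UL

/* Global grace-period counter; toggled by the writer side. */
static atomic_ulong rcu_gp_ctr = RCU_GP_LOCKED;

typedef struct {
    atomic_ulong ctr;       /* 0 while outside a read-side critical section */
    atomic_bool waiting;    /* set by a writer that waits for this reader */
    unsigned depth;         /* nesting level of the read-side section */
} Reader;

/* Stand-in for the event wakeup: just count how often it would fire. */
static int wakeups;

/* A reader blocks the grace period only if it is inside a critical
 * section that started before the counter was toggled. */
static bool rcu_gp_ongoing(Reader *r)
{
    unsigned long v = atomic_load(&r->ctr);

    return v != 0 && v != atomic_load(&rcu_gp_ctr);
}

static void reader_lock(Reader *r)
{
    if (r->depth++ > 0) {
        return;
    }
    /* ctr goes from 0 to the current counter: "ongoing" stays false,
     * so there is nobody to wake up here. */
    atomic_store(&r->ctr, atomic_load(&rcu_gp_ctr));
}

static void reader_unlock(Reader *r)
{
    assert(r->depth > 0);
    if (--r->depth > 0) {
        return;
    }
    /* ctr goes back to 0: "ongoing" may change from true to false. */
    atomic_store(&r->ctr, 0);
    if (atomic_load(&r->waiting)) {
        atomic_store(&r->waiting, false);
        wakeups++;
    }
}

/* Writer side, reduced to toggling the counter. */
static void start_grace_period(void)
{
    atomic_fetch_xor(&rcu_gp_ctr, RCU_GP_CTR);
}
```

In this model rcu_gp_ongoing() is false both right before and right after reader_lock(), so only reader_unlock() can release a waiting writer, which matches the reasoning given for dropping the check from rcu_read_lock().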

Re: [Qemu-devel] [PATCH COLO-Frame v12 01/38] configure: Add parameter for configure to enable/disable COLO support

2015-12-15 Thread Wen Congyang

On 12/15/2015 04:22 PM, zhanghailiang wrote:
> configure --enable-colo/--disable-colo to switch COLO
> support on/off.
> COLO support is On by default.
> 
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com>
> Signed-off-by: Gonglei <arei.gong...@huawei.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
> ---
> v11:
> - Turn COLO on in default (Eric's suggestion)
> 
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>

I think you forgot to remove this line.

Thanks
Wen Congyang

> ---
>  configure | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/configure b/configure
> index b9552fd..32e466f 100755
> --- a/configure
> +++ b/configure
> @@ -260,6 +260,7 @@ xfs=""
>  vhost_net="no"
>  vhost_scsi="no"
>  kvm="no"
> +colo="yes"
>  rdma=""
>  gprof="no"
>  debug_tcg="no"
> @@ -939,6 +940,10 @@ for opt do
>;;
>--enable-kvm) kvm="yes"
>;;
> +  --disable-colo) colo="no"
> +  ;;
> +  --enable-colo) colo="yes"
> +  ;;
>--disable-tcg-interpreter) tcg_interpreter="no"
>;;
>--enable-tcg-interpreter) tcg_interpreter="yes"
> @@ -1362,6 +1367,7 @@ disabled with --disable-FEATURE, default is enabled if available:
>    fdt             fdt device tree
>    bluez           bluez stack connectivity
>    kvm             KVM acceleration support
> +  colo            COarse-grain LOck-stepping VM for Non-stop Service
>    rdma            RDMA-based migration support
>    uuid            uuid support
>    vde             support for vde network
> @@ -4792,6 +4798,7 @@ echo "Linux AIO support $linux_aio"
>  echo "ATTR/XATTR support $attr"
>  echo "Install blobs $blobs"
>  echo "KVM support   $kvm"
> +echo "COLO support  $colo"
>  echo "RDMA support  $rdma"
>  echo "TCG interpreter   $tcg_interpreter"
>  echo "fdt support   $fdt"
> @@ -5381,6 +5388,10 @@ if have_backend "ftrace"; then
>  fi
>  echo "CONFIG_TRACE_FILE=$trace_file" >> $config_host_mak
>  
> +if test "$colo" = "yes"; then
> +  echo "CONFIG_COLO=y" >> $config_host_mak
> +fi
> +
>  if test "$rdma" = "yes" ; then
>echo "CONFIG_RDMA=y" >> $config_host_mak
>  fi
>

Re: [Qemu-devel] [PATCH v3 06/21] block: Exclude nested options only for children in append_open_options()

2015-12-14 Thread Wen Congyang

On 12/04/2015 09:35 PM, Kevin Wolf wrote:
> Some drivers have nested options (e.g. blkdebug rule arrays), which
> don't belong to a child node and shouldn't be removed. Don't remove all
> options with "." in their name, but check for the complete prefixes of
> actually existing child nodes.

I think we should have some way to get the child->name. For example, the
monitor command 'info block' or 'query-block' display it.

Thanks
Wen Congyang

> 
> Signed-off-by: Kevin Wolf <kw...@redhat.com>
> ---
>  block.c   | 20 
>  include/block/block_int.h |  1 +
>  2 files changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 73f0816..0dfff7a 100644
> --- a/block.c
> +++ b/block.c
> @@ -1101,11 +1101,13 @@ static int bdrv_fill_options(QDict **options, const char **pfilename,
>  
>  static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>  BlockDriverState *child_bs,
> +const char *child_name,
>  const BdrvChildRole *child_role)
>  {
>  BdrvChild *child = g_new(BdrvChild, 1);
>  *child = (BdrvChild) {
>  .bs = child_bs,
> +.name   = g_strdup(child_name),
>  .role   = child_role,
>  };
>  
> @@ -1119,6 +1121,7 @@ static void bdrv_detach_child(BdrvChild *child)
>  {
>  QLIST_REMOVE(child, next);
>  QLIST_REMOVE(child, next_parent);
> +g_free(child->name);
>  g_free(child);
>  }
>  
> @@ -1165,7 +1168,7 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
>  bs->backing = NULL;
>  goto out;
>  }
> -bs->backing = bdrv_attach_child(bs, backing_hd, &child_backing);
> +bs->backing = bdrv_attach_child(bs, backing_hd, "backing", &child_backing);
>  bs->open_flags &= ~BDRV_O_NO_BACKING;
>  pstrcpy(bs->backing_file, sizeof(bs->backing_file), backing_hd->filename);
>  pstrcpy(bs->backing_format, sizeof(bs->backing_format),
> @@ -1321,7 +1324,7 @@ BdrvChild *bdrv_open_child(const char *filename,
>  goto done;
>  }
>  
> -c = bdrv_attach_child(parent, bs, child_role);
> +c = bdrv_attach_child(parent, bs, bdref_key, child_role);
>  
>  done:
>  qdict_del(options, bdref_key);
> @@ -3951,13 +3954,22 @@ static bool append_open_options(QDict *d, BlockDriverState *bs)
>  {
>  const QDictEntry *entry;
>  QemuOptDesc *desc;
> +BdrvChild *child;
>  bool found_any = false;
> +const char *p;
>  
>  for (entry = qdict_first(bs->options); entry;
>   entry = qdict_next(bs->options, entry))
>  {
> -/* Only take options for this level */
> -if (strchr(qdict_entry_key(entry), '.')) {
> +/* Exclude options for children */
> +QLIST_FOREACH(child, &bs->children, next) {
> +if (strstart(qdict_entry_key(entry), child->name, &p)
> +&& (!*p || *p == '.'))
> +{
> +break;
> +}
> +}
> +if (child) {
>  continue;
>  }
>  
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 77dc165..7265247 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -351,6 +351,7 @@ extern const BdrvChildRole child_format;
>  
>  struct BdrvChild {
>  BlockDriverState *bs;
> +char *name;
>  const BdrvChildRole *role;
>  QLIST_ENTRY(BdrvChild) next;
>  QLIST_ENTRY(BdrvChild) next_parent;
>
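
The prefix test added to append_open_options() can be tried in isolation. In the sketch below starts_with() is a local stand-in written for this example (it is not QEMU's strstart()), and the child names and option keys are only sample values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Local helper: if str begins with prefix, store the remainder in *rest. */
static bool starts_with(const char *str, const char *prefix,
                        const char **rest)
{
    size_t n = strlen(prefix);

    if (strncmp(str, prefix, n) != 0) {
        return false;
    }
    if (rest != NULL) {
        *rest = str + n;
    }
    return true;
}

/* A key belongs to a child if it is exactly the child's name or the
 * name followed by a '.' and a nested option. */
static bool option_belongs_to_child(const char *key, const char *child_name)
{
    const char *p;

    return starts_with(key, child_name, &p) && (*p == '\0' || *p == '.');
}

/* True if the key belongs to any of the n named children. */
static bool option_is_for_any_child(const char *key,
                                    const char *const *children, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (option_belongs_to_child(key, children[i])) {
            return true;
        }
    }
    return false;
}
```

A dotted key that does not start with a child name, such as a driver's own nested rule array, is kept, while "file.filename" is excluded when "file" is a child. A key like "filename" is also kept, because the character after the "file" prefix is neither the end of the string nor a dot.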

Re: [Qemu-devel] [PATCH COLO-Frame v11 08/39] migration: Rename the'file' member of MigrationState

2015-12-09 Thread Wen Congyang

On 11/24/2015 05:25 PM, zhanghailiang wrote:
> Rename the 'file' member of MigrationState to 'to_dst_file'.
> 
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> Cc: Dr. David Alan Gilbert <dgilb...@redhat.com>
> ---
> v11:
> - Only rename 'file' member of MigrationState

You forgot to update migration/rdma.c.

Thanks
Wen Congyang

> ---
>  include/migration/migration.h |  2 +-
>  migration/exec.c  |  4 +--
>  migration/fd.c|  4 +--
>  migration/migration.c | 72 ++-
>  migration/postcopy-ram.c  |  6 ++--
>  migration/savevm.c|  2 +-
>  migration/tcp.c   |  4 +--
>  migration/unix.c  |  4 +--
>  8 files changed, 51 insertions(+), 47 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index a57a734..ba5bcec 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -140,7 +140,7 @@ struct MigrationState
>  size_t xfer_limit;
>  QemuThread thread;
>  QEMUBH *cleanup_bh;
> -QEMUFile *file;
> +QEMUFile *to_dst_file;
>  int parameters[MIGRATION_PARAMETER_MAX];
>  
>  int state;
> diff --git a/migration/exec.c b/migration/exec.c
> index 8406d2b..9037109 100644
> --- a/migration/exec.c
> +++ b/migration/exec.c
> @@ -36,8 +36,8 @@
>  
>  void exec_start_outgoing_migration(MigrationState *s, const char *command, 
> Error **errp)
>  {
> -s->file = qemu_popen_cmd(command, "w");
> -if (s->file == NULL) {
> +s->to_dst_file = qemu_popen_cmd(command, "w");
> +if (s->to_dst_file == NULL) {
>  error_setg_errno(errp, errno, "failed to popen the migration 
> target");
>  return;
>  }
> diff --git a/migration/fd.c b/migration/fd.c
> index 3e4bed0..9a9d6c5 100644
> --- a/migration/fd.c
> +++ b/migration/fd.c
> @@ -50,9 +50,9 @@ void fd_start_outgoing_migration(MigrationState *s, const 
> char *fdname, Error **
>  }
>  
>  if (fd_is_socket(fd)) {
> -s->file = qemu_fopen_socket(fd, "wb");
> +s->to_dst_file = qemu_fopen_socket(fd, "wb");
>  } else {
> -s->file = qemu_fdopen(fd, "wb");
> +s->to_dst_file = qemu_fdopen(fd, "wb");
>  }
>  
>  migrate_fd_connect(s);
> diff --git a/migration/migration.c b/migration/migration.c
> index 41eac0d..a4c690d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -834,7 +834,7 @@ static void migrate_fd_cleanup(void *opaque)
>  
>  flush_page_queue(s);
>  
> -if (s->file) {
> +if (s->to_dst_file) {
>  trace_migrate_fd_cleanup();
>  qemu_mutex_unlock_iothread();
>  if (s->migration_thread_running) {
> @@ -844,8 +844,8 @@ static void migrate_fd_cleanup(void *opaque)
>  qemu_mutex_lock_iothread();
>  
>  migrate_compress_threads_join();
> -qemu_fclose(s->file);
> -s->file = NULL;
> +qemu_fclose(s->to_dst_file);
> +s->to_dst_file = NULL;
>  }
>  
>  assert((s->state != MIGRATION_STATUS_ACTIVE) &&
> @@ -862,7 +862,7 @@ static void migrate_fd_cleanup(void *opaque)
>  void migrate_fd_error(MigrationState *s)
>  {
>  trace_migrate_fd_error();
> -assert(s->file == NULL);
> +assert(s->to_dst_file == NULL);
>  migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>MIGRATION_STATUS_FAILED);
>  notifier_list_notify(&migration_state_notifiers, s);
> @@ -871,7 +871,7 @@ void migrate_fd_error(MigrationState *s)
>  static void migrate_fd_cancel(MigrationState *s)
>  {
>  int old_state ;
> -QEMUFile *f = migrate_get_current()->file;
> +QEMUFile *f = migrate_get_current()->to_dst_file;
>  trace_migrate_fd_cancel();
>  
>  if (s->rp_state.from_dst_file) {
> @@ -942,7 +942,7 @@ MigrationState *migrate_init(const MigrationParams 
> *params)
>  s->bytes_xfer = 0;
>  s->xfer_limit = 0;
>  s->cleanup_bh = 0;
> -s->file = NULL;
> +s->to_dst_file = NULL;
>  s->state = MIGRATION_STATUS_NONE;
>  s->params = *params;
>  s->rp_state.from_dst_file = NULL;
> @@ -1122,8 +1122,9 @@ void qmp_migrate_set_speed(int64_t value, Error **errp)
>  
>  s = migrate_get_current();
>  s->bandwidth_limit = value;
> -if (s->file) {
> -qemu_file_set_rate_limit(s->file, s->bandwidth_limit / 
> XFER_LIMIT_RATIO);
> +if (s->to_dst_file) {
> +qemu_

Re: [Qemu-devel] [Patch v8 0/3] qapi: child add/delete support

2015-12-09 Thread Wen Congyang

Kevin: ping

On 11/27/2015 02:06 PM, Wen Congyang wrote:
> If a quorum child is broken, we can use a mirror job to replace it.
> But sometimes the user only needs to remove the broken child and
> add it back later, once the problem is fixed.
> 
> It is based on Kevin's child-name-related patch:
> http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
> 
> ChangeLog:
> v8:
> 1. Rebase to the newest code
> 2. Address the comments from Eric Blake
> v7:
> 1. Remove the qmp command x-blockdev-change's parameter operation according
>to Kevin's comments.
> 2. Remove the hmp command.
> v6:
> 1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add
>and x-blockdev-child-delete
> v5:
> 1. Address Eric Blake's comments
> v4:
> 1. drop nbd driver's implementation. We can use human-monitor-command
>to do it.
> 2. Rename the command name.
> v3:
> 1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is
>created by the QMP command blockdev-add.
> 2. The driver NBD can support filename, path, host:port now.
> v2:
> 1. Use bdrv_get_device_or_node_name() instead of new function
>bdrv_get_id_or_node_name()
> 2. Update the error message
> 3. Update the documents in block-core.json
> 
> Wen Congyang (3):
>   Add new block driver interface to add/delete a BDS's child
>   quorum: implement bdrv_add_child() and bdrv_del_child()
>   qmp: add monitor command to add/remove a child
> 
>  block.c   |  58 --
>  block/quorum.c| 124 
> +-
>  blockdev.c|  54 
>  include/block/block.h |   9 
>  include/block/block_int.h |   5 ++
>  qapi/block-core.json  |  23 +
>  qmp-commands.hx   |  47 ++
>  7 files changed, 314 insertions(+), 6 deletions(-)
>

Re: [Qemu-devel] [PATCH COLO-Frame v11 34/39] net/filter-buffer: Add default filter-buffer for each netdev

2015-12-02 Thread Wen Congyang

On 11/24/2015 05:25 PM, zhanghailiang wrote:
> We add a default filter-buffer to each netdev, which COLO or
> Micro-checkpoint will use to buffer the VM's packets. The default
> filter-buffer is named 'nop'.
> By default, it does not buffer any packets, so it has no side effects
> on the netdev.

No, filter-buffer doesn't support vhost, so if you add a default filter-buffer
to each netdev, you can't use vhost.

Thanks
Wen Congyang

> 
> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
> Cc: Jason Wang <jasow...@redhat.com>
> Cc: Yang Hongyang <hongyang.y...@easystack.cn>
> ---
> v11:
> - New patch
> ---
>  include/net/filter.h |  3 +++
>  net/filter-buffer.c  | 74 
> 
>  net/net.c|  8 ++
>  3 files changed, 85 insertions(+)
> 
> diff --git a/include/net/filter.h b/include/net/filter.h
> index 2deda36..01a7e90 100644
> --- a/include/net/filter.h
> +++ b/include/net/filter.h
> @@ -74,4 +74,7 @@ ssize_t qemu_netfilter_pass_to_next(NetClientState *sender,
>  int iovcnt,
>  void *opaque);
>  
> +void netdev_add_default_filter_buffer(const char *netdev_id,
> +  NetFilterDirection direction,
> +  Error **errp);
>  #endif /* QEMU_NET_FILTER_H */
> diff --git a/net/filter-buffer.c b/net/filter-buffer.c
> index 57be149..195af68 100644
> --- a/net/filter-buffer.c
> +++ b/net/filter-buffer.c
> @@ -14,6 +14,12 @@
>  #include "qapi/qmp/qerror.h"
>  #include "qapi-visit.h"
>  #include "qom/object.h"
> +#include "net/net.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qapi/qmp-output-visitor.h"
> +#include "qapi/qmp-input-visitor.h"
> +#include "monitor/monitor.h"
> +#include "qmp-commands.h"
>  
>  #define TYPE_FILTER_BUFFER "filter-buffer"
>  
> @@ -26,6 +32,8 @@ typedef struct FilterBufferState {
>  NetQueue *incoming_queue;
>  uint32_t interval;
>  QEMUTimer release_timer;
> +bool is_default;
> +bool enable_buffer;
>  } FilterBufferState;
>  
>  static void filter_buffer_flush(NetFilterState *nf)
> @@ -65,6 +73,10 @@ static ssize_t filter_buffer_receive_iov(NetFilterState 
> *nf,
>  {
>  FilterBufferState *s = FILTER_BUFFER(nf);
>  
> +/* Don't buffer any packets if the filter is not enabled */
> +if (!s->enable_buffer) {
> +return 0;
> +}
>  /*
>   * We return size when buffer a packet, the sender will take it as
>   * a already sent packet, so sent_cb should not be called later.
> @@ -102,6 +114,7 @@ static void filter_buffer_cleanup(NetFilterState *nf)
>  static void filter_buffer_setup(NetFilterState *nf, Error **errp)
>  {
>  FilterBufferState *s = FILTER_BUFFER(nf);
> +char *path = object_get_canonical_path_component(OBJECT(nf));
>  
>  /*
>   * We may want to accept zero interval when VM FT solutions like MC
> @@ -114,6 +127,7 @@ static void filter_buffer_setup(NetFilterState *nf, Error 
> **errp)
>  }
>  
>  s->incoming_queue = qemu_new_net_queue(qemu_netfilter_pass_to_next, nf);
> +s->is_default = !strcmp(path, "nop");
>  if (s->interval) {
>  timer_init_us(&s->release_timer, QEMU_CLOCK_VIRTUAL,
>filter_buffer_release_timer, nf);
> @@ -163,6 +177,66 @@ out:
>  error_propagate(errp, local_err);
>  }
>  
> +/*
> +* This will be used by COLO or MC FT, for which they will need
> +* to buffer the packets of VM's net devices, Here we add a default
> +* buffer filter for each netdev. The name of default buffer filter is
> +* 'nop'
> +*/
> +void netdev_add_default_filter_buffer(const char *netdev_id,
> +  NetFilterDirection direction,
> +  Error **errp)
> +{
> +QmpOutputVisitor *qov;
> +QmpInputVisitor *qiv;
> +Visitor *ov, *iv;
> +QObject *obj = NULL;
> +QDict *qdict;
> +void *dummy = NULL;
> +const char *id = "nop";
> +char *queue = g_strdup(NetFilterDirection_lookup[direction]);
> +NetClientState *nc = qemu_find_netdev(netdev_id);
> +Error *err = NULL;
> +
> +/* FIXME: Not support multiple queues */
> +if (!nc || nc->queue_index > 1) {
> +return;
> +}
> +qov = qmp_output_visitor_new();
> +ov = qmp_output_get_visitor(qov);
> +visit_start_struct(ov, &dummy, NULL, NULL, 0, &err);
> +if (err) {
> +

Re: [Qemu-devel] [PATCH COLO-Frame v11 34/39] net/filter-buffer: Add default filter-buffer for each netdev

2015-12-02 Thread Wen Congyang

On 12/03/2015 11:53 AM, Hailiang Zhang wrote:
> On 2015/12/3 9:17, Wen Congyang wrote:
>> On 11/24/2015 05:25 PM, zhanghailiang wrote:
>>> We add a default filter-buffer to each netdev, which COLO or
>>> Micro-checkpoint will use to buffer the VM's packets. The default
>>> filter-buffer is named 'nop'.
>>> By default, it does not buffer any packets, so it has no side effects
>>> on the netdev.
>>
>> No, filter-buffer doesn't support vhost, so if you add a default filter-buffer
>> to each netdev, you can't use vhost.
>>
> 
> Have you tested it ? Did the default filter-buffer break vhost ?
> It's not supposed to break vhost, I will look into it. Thanks.

Yes, I have tested it. When I want to start a normal vm with vhost, I get
the following error messages:

qemu-system-x86_64: -netdev tap,id=hn0,queues=1,vhost=on: Vhost is not supported

Thanks
Wen Congyang
> 
>> Thanks
>> Wen Congyang
>>
>>>
>>> Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
>>> Cc: Jason Wang <jasow...@redhat.com>
>>> Cc: Yang Hongyang <hongyang.y...@easystack.cn>
>>> ---
>>> v11:
>>> - New patch
>>> ---
>>>   include/net/filter.h |  3 +++
>>>   net/filter-buffer.c  | 74 
>>> 
>>>   net/net.c|  8 ++
>>>   3 files changed, 85 insertions(+)
>>>
>>> diff --git a/include/net/filter.h b/include/net/filter.h
>>> index 2deda36..01a7e90 100644
>>> --- a/include/net/filter.h
>>> +++ b/include/net/filter.h
>>> @@ -74,4 +74,7 @@ ssize_t qemu_netfilter_pass_to_next(NetClientState 
>>> *sender,
>>>   int iovcnt,
>>>   void *opaque);
>>>
>>> +void netdev_add_default_filter_buffer(const char *netdev_id,
>>> +  NetFilterDirection direction,
>>> +  Error **errp);
>>>   #endif /* QEMU_NET_FILTER_H */
>>> diff --git a/net/filter-buffer.c b/net/filter-buffer.c
>>> index 57be149..195af68 100644
>>> --- a/net/filter-buffer.c
>>> +++ b/net/filter-buffer.c
>>> @@ -14,6 +14,12 @@
>>>   #include "qapi/qmp/qerror.h"
>>>   #include "qapi-visit.h"
>>>   #include "qom/object.h"
>>> +#include "net/net.h"
>>> +#include "qapi/qmp/qdict.h"
>>> +#include "qapi/qmp-output-visitor.h"
>>> +#include "qapi/qmp-input-visitor.h"
>>> +#include "monitor/monitor.h"
>>> +#include "qmp-commands.h"
>>>
>>>   #define TYPE_FILTER_BUFFER "filter-buffer"
>>>
>>> @@ -26,6 +32,8 @@ typedef struct FilterBufferState {
>>>   NetQueue *incoming_queue;
>>>   uint32_t interval;
>>>   QEMUTimer release_timer;
>>> +bool is_default;
>>> +bool enable_buffer;
>>>   } FilterBufferState;
>>>
>>>   static void filter_buffer_flush(NetFilterState *nf)
>>> @@ -65,6 +73,10 @@ static ssize_t filter_buffer_receive_iov(NetFilterState 
>>> *nf,
>>>   {
>>>   FilterBufferState *s = FILTER_BUFFER(nf);
>>>
>>> +/* Don't buffer any packets if the filter is not enabled */
>>> +if (!s->enable_buffer) {
>>> +return 0;
>>> +}
>>>   /*
>>>* We return size when buffer a packet, the sender will take it as
>>>* a already sent packet, so sent_cb should not be called later.
>>> @@ -102,6 +114,7 @@ static void filter_buffer_cleanup(NetFilterState *nf)
>>>   static void filter_buffer_setup(NetFilterState *nf, Error **errp)
>>>   {
>>>   FilterBufferState *s = FILTER_BUFFER(nf);
>>> +char *path = object_get_canonical_path_component(OBJECT(nf));
>>>
>>>   /*
>>>* We may want to accept zero interval when VM FT solutions like MC
>>> @@ -114,6 +127,7 @@ static void filter_buffer_setup(NetFilterState *nf, 
>>> Error **errp)
>>>   }
>>>
>>>   s->incoming_queue = qemu_new_net_queue(qemu_netfilter_pass_to_next, 
>>> nf);
>>> +s->is_default = !strcmp(path, "nop");
>>>   if (s->interval) {
>>>   timer_init_us(&s->release_timer, QEMU_CLOCK_VIRTUAL,
>>> filter_buffer_

Re: [Qemu-devel] [Patch v12 00/10] Block replication for continuous checkpoints

2015-12-01 Thread Wen Congyang

On 12/01/2015 06:40 PM, Dr. David Alan Gilbert wrote:
> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>> Block replication is a very important feature which is used for
>> continuous checkpoints(for example: COLO).
>>
>> You can get the detailed information about block replication from here:
>> http://wiki.qemu.org/Features/BlockReplication
>>
>> Usage:
>> Please refer to docs/block-replication.txt
>>
>> This patch series is based on the following patch series:
>> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
>> 2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html
>>
>> You can get the patch here:
>> https://github.com/coloft/qemu/tree/wency/block-replication-v12
>>
>> You can get the patch with framework here:
>> https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2
> 
> Neither of these links work for me, and I see that  only messages 0..7 in the
> series hit the list.

I forgot to push it to github...
And I also received the messages 0..7, and I don't know what's wrong...

I will push it to github, and resend them.

Thanks
Wen Congyang

> 
> Dave
> 
>>
>> TODO:
>> 1. Continuous block replication. It will be started after basic functions
>>are accepted.
>>
>> Change Log:
>> V12:
>> 1. Rebase to the newest code
>> 2. Use backing reference to replace 'allow-write-backing-file'
>> V11:
>> 1. Reopen the backing file when starting block replication if it is not
>>opened in R/W mode
>> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>>when opening backing file
>> 3. Block the top BDS so there is only one block job for the top BDS and
>>its backing chain.
>> V10:
>> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>>reference.
>> 2. Address the comments from Eric Blake
>> V9:
>> 1. Update the error messages
>> 2. Rebase to the newest qemu
>> 3. Split child add/delete support. These patches are sent in another 
>> patchset.
>> V8:
>> 1. Address Alberto Garcia's comments
>> V7:
>> 1. Implement adding/removing quorum child. Remove the option non-connect.
>> 2. Simplify the backing reference option according to Stefan Hajnoczi's 
>> suggestion
>> V6:
>> 1. Rebase to the newest qemu.
>> V5:
>> 1. Address the comments from Gong Lei
>> 2. Speed the failover up. The secondary vm can take over very quickly even
>>if there are too many I/O requests.
>> V4:
>> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
>> V3:
>> 1: use error_setg() instead of error_set()
>> 2. Add a new block job API
>> 3. Active disk, hidden disk and nbd target uses the same AioContext
>> 4. Add a testcase to test new hbitmap API
>> V2:
>> 1. Redesign the secondary qemu(use image-fleecing)
>> 2. Use Error objects to return error message
>> 3. Address the comments from Max Reitz and Eric Blake
>>
>> Wen Congyang (10):
>>   unblock backup operations in backing file
>>   Store parent BDS in BdrvChild
>>   Backup: clear all bitmap when doing block checkpoint
>>   Allow creating backup jobs when opening BDS
>>   docs: block replication's description
>>   Add new block driver interfaces to control block replication
>>   quorum: implement block driver interfaces for block replication
>>   Implement new driver for block replication
>>   support replication driver in blockdev-add
>>   Add a new API to start/stop replication, do checkpoint to all BDSes
>>
>>  block.c| 145 
>>  block/Makefile.objs|   3 +-
>>  block/backup.c |  14 ++
>>  block/quorum.c |  78 +++
>>  block/replication.c| 549 
>> +
>>  blockjob.c |  11 +
>>  docs/block-replication.txt | 227 +++
>>  include/block/block.h  |   9 +
>>  include/block/block_int.h  |  15 ++
>>  include/block/blockjob.h   |  12 +
>>  qapi/block-core.json   |  34 ++-
>>  11 files changed, 1093 insertions(+), 4 deletions(-)
>>  create mode 100644 block/replication.c
>>  create mode 100644 docs/block-replication.txt
>>
>> -- 
>> 2.5.0
>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> 
> .
>

Re: [Qemu-devel] [Patch v12 00/10] Block replication for continuous checkpoints

2015-12-01 Thread Wen Congyang

On 12/01/2015 07:58 PM, Hailiang Zhang wrote:
> On 2015/12/1 18:40, Dr. David Alan Gilbert wrote:
>> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>>> Block replication is a very important feature which is used for
>>> continuous checkpoints(for example: COLO).
>>>
>>> You can get the detailed information about block replication from here:
>>> http://wiki.qemu.org/Features/BlockReplication
>>>
>>> Usage:
>>> Please refer to docs/block-replication.txt
>>>
>>> This patch series is based on the following patch series:
>>> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
>>> 2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html
>>>
>>> You can get the patch here:
>>> https://github.com/coloft/qemu/tree/wency/block-replication-v12
>>>
>>> You can get the patch with framework here:
>>> https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2
>>
>> Neither of these links work for me, and I see that  only messages 0..7 in the
>> series hit the list.
>>
> 
> Hi Dave,
> 
> You can refer to https://github.com/coloft/qemu/tree/colo-v2.2-periodic-mode,
> The block replication part in this link is also the newest version.

No, I removed one patch, and the usage has changed.

Thanks
Wen Congyang

> 
> Congyang has deleted this confusing branch; we will pay attention to this 
> in the next version.
> 
> Thanks,
> Hailiang
> 
>>
>>>
>>> TODO:
>>> 1. Continuous block replication. It will be started after basic functions
>>> are accepted.
>>>
>>> Change Log:
>>> V12:
>>> 1. Rebase to the newest code
>>> 2. Use backing reference to replace 'allow-write-backing-file'
>>> V11:
>>> 1. Reopen the backing file when starting block replication if it is not
>>> opened in R/W mode
>>> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>>> when opening backing file
>>> 3. Block the top BDS so there is only one block job for the top BDS and
>>> its backing chain.
>>> V10:
>>> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>>> reference.
>>> 2. Address the comments from Eric Blake
>>> V9:
>>> 1. Update the error messages
>>> 2. Rebase to the newest qemu
>>> 3. Split child add/delete support. These patches are sent in another 
>>> patchset.
>>> V8:
>>> 1. Address Alberto Garcia's comments
>>> V7:
>>> 1. Implement adding/removing quorum child. Remove the option non-connect.
>>> 2. Simplify the backing reference option according to Stefan Hajnoczi's 
>>> suggestion
>>> V6:
>>> 1. Rebase to the newest qemu.
>>> V5:
>>> 1. Address the comments from Gong Lei
>>> 2. Speed the failover up. The secondary vm can take over very quickly even
>>> if there are too many I/O requests.
>>> V4:
>>> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
>>> V3:
>>> 1: use error_setg() instead of error_set()
>>> 2. Add a new block job API
>>> 3. Active disk, hidden disk and nbd target uses the same AioContext
>>> 4. Add a testcase to test new hbitmap API
>>> V2:
>>> 1. Redesign the secondary qemu(use image-fleecing)
>>> 2. Use Error objects to return error message
>>> 3. Address the comments from Max Reitz and Eric Blake
>>>
>>> Wen Congyang (10):
>>>unblock backup operations in backing file
>>>Store parent BDS in BdrvChild
>>>Backup: clear all bitmap when doing block checkpoint
>>>Allow creating backup jobs when opening BDS
>>>docs: block replication's description
>>>Add new block driver interfaces to control block replication
>>>quorum: implement block driver interfaces for block replication
>>>Implement new driver for block replication
>>>support replication driver in blockdev-add
>>>Add a new API to start/stop replication, do checkpoint to all BDSes
>>>
>>>   block.c| 145 
>>>   block/Makefile.objs|   3 +-
>>>   block/backup.c |  14 ++
>>>   block/quorum.c |  78 +++
>>>   block/replication.c| 549 
>>> +
>>>   blockjob.c |  11 +
>>>   docs/block-replication.txt | 227 +++
>>>   include/block/block.h  |   9 +
>>>   include/block/block_int.h  |  15 ++
>>>   include/block/blockjob.h   |  12 +
>>>   qapi/block-core.json   |  34 ++-
>>>   11 files changed, 1093 insertions(+), 4 deletions(-)
>>>   create mode 100644 block/replication.c
>>>   create mode 100644 docs/block-replication.txt
>>>
>>> -- 
>>> 2.5.0
>>>
>>>
>>>
>> -- 
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>
>>
>> .
>>
> 
> 
> 
> 
> .
>

Re: [Qemu-devel] [Patch v12 00/10] Block replication for continuous checkpoints

2015-12-01 Thread Wen Congyang

On 12/02/2015 09:00 AM, Wen Congyang wrote:
> On 12/01/2015 06:40 PM, Dr. David Alan Gilbert wrote:
>> * Wen Congyang (we...@cn.fujitsu.com) wrote:
>>> Block replication is a very important feature which is used for
>>> continuous checkpoints(for example: COLO).
>>>
>>> You can get the detailed information about block replication from here:
>>> http://wiki.qemu.org/Features/BlockReplication
>>>
>>> Usage:
>>> Please refer to docs/block-replication.txt
>>>
>>> This patch series is based on the following patch series:
>>> 1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
>>> 2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html
>>>
>>> You can get the patch here:
>>> https://github.com/coloft/qemu/tree/wency/block-replication-v12
>>>
>>> You can get the patch with framework here:
>>> https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2
>>
>> Neither of these links work for me, and I see that  only messages 0..7 in the
>> series hit the list.
> 
> I forgot to push it to github...
> And I also received the messages 0..7, and I don't know what's wrong...

The reason is a bug in git send-email:
http://permalink.gmane.org/gmane.comp.version-control.git/274569

Thanks
Wen Congyang

> 
> I will push it to github, and resend them.
> 
> Thanks
> Wen Congyang
> 
>>
>> Dave
>>
>>>
>>> TODO:
>>> 1. Continuous block replication. It will be started after basic functions
>>>are accepted.
>>>
>>> Change Log:
>>> V12:
>>> 1. Rebase to the newest code
>>> 2. Use backing reference to replace 'allow-write-backing-file'
>>> V11:
>>> 1. Reopen the backing file when starting block replication if it is not
>>>opened in R/W mode
>>> 2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
>>>when opening backing file
>>> 3. Block the top BDS so there is only one block job for the top BDS and
>>>its backing chain.
>>> V10:
>>> 1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
>>>reference.
>>> 2. Address the comments from Eric Blake
>>> V9:
>>> 1. Update the error messages
>>> 2. Rebase to the newest qemu
>>> 3. Split child add/delete support. These patches are sent in another 
>>> patchset.
>>> V8:
>>> 1. Address Alberto Garcia's comments
>>> V7:
>>> 1. Implement adding/removing quorum child. Remove the option non-connect.
>>> 2. Simplify the backing reference option according to Stefan Hajnoczi's 
>>> suggestion
>>> V6:
>>> 1. Rebase to the newest qemu.
>>> V5:
>>> 1. Address the comments from Gong Lei
>>> 2. Speed the failover up. The secondary vm can take over very quickly even
>>>if there are too many I/O requests.
>>> V4:
>>> 1. Introduce a new driver replication to avoid touch nbd and qcow2.
>>> V3:
>>> 1: use error_setg() instead of error_set()
>>> 2. Add a new block job API
>>> 3. Active disk, hidden disk and nbd target uses the same AioContext
>>> 4. Add a testcase to test new hbitmap API
>>> V2:
>>> 1. Redesign the secondary qemu(use image-fleecing)
>>> 2. Use Error objects to return error message
>>> 3. Address the comments from Max Reitz and Eric Blake
>>>
>>> Wen Congyang (10):
>>>   unblock backup operations in backing file
>>>   Store parent BDS in BdrvChild
>>>   Backup: clear all bitmap when doing block checkpoint
>>>   Allow creating backup jobs when opening BDS
>>>   docs: block replication's description
>>>   Add new block driver interfaces to control block replication
>>>   quorum: implement block driver interfaces for block replication
>>>   Implement new driver for block replication
>>>   support replication driver in blockdev-add
>>>   Add a new API to start/stop replication, do checkpoint to all BDSes
>>>
>>>  block.c| 145 
>>>  block/Makefile.objs|   3 +-
>>>  block/backup.c |  14 ++
>>>  block/quorum.c |  78 +++
>>>  block/replication.c| 549 
>>> +
>>>  blockjob.c |  11 +
>>>  docs/block-replication.txt | 227 +++
>>>  include/block/block.h  |   9 +
>>>  include/block/block_int.h  |  15 ++
>>>  include/block/blockjob.h   |  12 +
>>>  qapi/block-core.json   |  34 ++-
>>>  11 files changed, 1093 insertions(+), 4 deletions(-)
>>>  create mode 100644 block/replication.c
>>>  create mode 100644 docs/block-replication.txt
>>>
>>> -- 
>>> 2.5.0
>>>
>>>
>>>
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>
>>
>> .
>>
> 
> 
> 
> 
> .
>

[Qemu-devel] [Patch v12 resend 05/10] docs: block replication's description

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 docs/block-replication.txt | 227 +
 1 file changed, 227 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..c7bad0e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,227 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied to the FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
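++----------------------+            +------------------------+
+|Primary Write Requests|            |Secondary Write Requests|
++----------------------+            +------------------------+
+          |                                       |
+          |                                      (4)
+          |                                       V
+          |                              /-------------\
+          |      Copy and Forward        |             |
+          |---------(1)----------+       | Disk Buffer |
+          |                      |       |             |
+          |                     (3)      \-------------/
+          |                 speculative      ^
+          |                write through    (2)
+          |                      |           |
+          V                      V           |
+   +--------------+           +----------------+
+   | Primary Disk |           | Secondary Disk |
+   +--------------+           +----------------+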
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary    2 filter
+     disk         ^                                                             virtio-blk
+                  |                                                                  ^
+                3 NBD  ------->  3 NBD                                               |
+                client    ||     server                                          2 filter
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
+--------'                 ||
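
The four workflow steps in the document above can be modelled in a few lines of C. This is only an illustrative sketch and not code from the series: SecondaryModel and its functions are hypothetical names, one int stands for a sector's contents, and the read path (buffer first, then disk) is an assumption about how the Secondary VM sees its disk.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_SECTORS 8

/* Hypothetical toy model of the Secondary side; one int per sector. */
typedef struct {
    int disk[NUM_SECTORS];       /* Secondary disk contents */
    int buffer[NUM_SECTORS];     /* Disk buffer contents */
    bool buffered[NUM_SECTORS];  /* true if buffer[] holds this sector */
} SecondaryModel;

/*
 * Steps 2 and 3: before a forwarded Primary write reaches the Secondary
 * disk, save the original sector in the Disk buffer, unless the buffer
 * already holds that sector (from a Secondary write or an earlier copy).
 */
static void primary_write(SecondaryModel *m, int sector, int value)
{
    if (!m->buffered[sector]) {
        m->buffer[sector] = m->disk[sector];
        m->buffered[sector] = true;
    }
    m->disk[sector] = value;
}

/* Step 4: Secondary writes go to the Disk buffer and overwrite its content. */
static void secondary_write(SecondaryModel *m, int sector, int value)
{
    m->buffer[sector] = value;
    m->buffered[sector] = true;
}

/* Assumed read path: the Secondary VM sees the buffer first, then the disk. */
static int secondary_read(const SecondaryModel *m, int sector)
{
    return m->buffered[sector] ? m->buffer[sector] : m->disk[sector];
}

/* At a checkpoint the buffered contents are dropped. */
static void checkpoint(SecondaryModel *m)
{
    memset(m->buffered, 0, sizeof(m->buffered));
}
```

In this model the Secondary VM keeps seeing its own view of a sector until the next checkpoint, after which it sees what the Primary last wrote.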

[Qemu-devel] [Patch v12 resend 06/10] Add new block driver interfaces to control block replication

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Cc: Luiz Capitulino <lcapitul...@redhat.com>
Cc: Michael Roth <mdr...@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
---
 block.c   | 43 +++
 include/block/block.h |  5 +
 include/block/block_int.h | 14 ++
 qapi/block-core.json  | 13 +
 4 files changed, 75 insertions(+)

diff --git a/block.c b/block.c
index 0a0468f..213bee8 100644
--- a/block.c
+++ b/block.c
@@ -4390,3 +4390,46 @@ void bdrv_del_child(BlockDriverState *parent_bs, 
BlockDriverState *child_bs,
 
 parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
 }
+
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+Error **errp)
+{
+BlockDriver *drv = bs->drv;
+
+if (drv && drv->bdrv_start_replication) {
+drv->bdrv_start_replication(bs, mode, errp);
+} else if (bs->file) {
+bdrv_start_replication(bs->file->bs, mode, errp);
+} else {
+error_setg(errp, "The BDS %s doesn't support starting block"
+   " replication", bs->filename);
+}
+}
+
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+BlockDriver *drv = bs->drv;
+
+if (drv && drv->bdrv_do_checkpoint) {
+drv->bdrv_do_checkpoint(bs, errp);
+} else if (bs->file) {
+bdrv_do_checkpoint(bs->file->bs, errp);
+} else {
+error_setg(errp, "The BDS %s doesn't support block checkpoint",
+   bs->filename);
+}
+}
+
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
+{
+BlockDriver *drv = bs->drv;
+
+if (drv && drv->bdrv_stop_replication) {
+drv->bdrv_stop_replication(bs, failover, errp);
+} else if (bs->file) {
+bdrv_stop_replication(bs->file->bs, failover, errp);
+} else {
+error_setg(errp, "The BDS %s doesn't support stopping block"
+   " replication", bs->filename);
+}
+}
diff --git a/include/block/block.h b/include/block/block.h
index 1d3b9c6..cd39d50 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -648,4 +648,9 @@ void bdrv_add_child(BlockDriverState *parent, 
BlockDriverState *child,
 void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
 Error **errp);
 
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+Error **errp);
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1f56046..a6aba8b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -307,6 +307,20 @@ struct BlockDriver {
 void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
Error **errp);
 
+void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode,
+   Error **errp);
+/* Drop Disk buffer when doing checkpoint. */
+void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp);
+/*
+ * After failover, we should flush Disk buffer into secondary disk
+ * and stop block replication.
+ *
+ * If the guest is shutdown, we should drop Disk buffer and stop
+ * block replication.
+ */
+void (*bdrv_stop_replication)(BlockDriverState *bs, bool failover,
+  Error **errp);
+
 QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index feb8da2..2c6bd3f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1925,6 +1925,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.5
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
-- 
2.5.0
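
The three wrappers added to block.c in this patch share one dispatch
pattern: call the driver's callback if it has one, otherwise recurse into
bs->file, otherwise report an error. Below is a minimal standalone sketch of
that pattern. Node, Driver and node_do_checkpoint are hypothetical stand-ins
for illustration only, not QEMU types, and a plain return code replaces the
Error **errp argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for BlockDriver / BlockDriverState. */
typedef struct Node Node;

typedef struct Driver {
    /* Optional callback; NULL means "not implemented by this driver". */
    int (*do_checkpoint)(Node *n);
} Driver;

struct Node {
    const Driver *drv;
    Node *file;          /* child node, like bs->file->bs; may be NULL */
    int checkpoints;     /* counts how often the callback ran */
};

/* Returns 0 on success, -1 if no node in the chain supports the call. */
static int node_do_checkpoint(Node *n)
{
    if (n->drv && n->drv->do_checkpoint) {
        return n->drv->do_checkpoint(n);
    } else if (n->file) {
        return node_do_checkpoint(n->file);
    }
    return -1;
}

/* Example callback: just record that a checkpoint happened. */
static int count_checkpoint(Node *n)
{
    n->checkpoints++;
    return 0;
}

static const Driver replication_like = { .do_checkpoint = count_checkpoint };
static const Driver plain_format = { .do_checkpoint = NULL };
```

With a format node stacked on a replication-like node, a call on the top
node reaches the child's callback. A node with neither a callback nor a
child gets the error path.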

[Qemu-devel] [Patch v12 resend 10/10] Add a new API to start/stop replication, do checkpoint to all BDSes

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block.c   | 83 +++
 include/block/block.h |  4 +++
 2 files changed, 87 insertions(+)

diff --git a/block.c b/block.c
index 213bee8..09ee7f1 100644
--- a/block.c
+++ b/block.c
@@ -4433,3 +4433,86 @@ void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
" replication", bs->filename);
 }
 }
+
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp)
+{
+BlockDriverState *bs = NULL, *temp = NULL;
+Error *local_err = NULL;
+
+while ((bs = bdrv_next(bs))) {
+if (!QLIST_EMPTY(&bs->parents)) {
+/* It is not top BDS */
+continue;
+}
+
+if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+continue;
+}
+
+bdrv_start_replication(bs, mode, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+goto fail;
+}
+}
+
+return;
+
+fail:
+while ((temp = bdrv_next(temp)) && bs != temp) {
+bdrv_stop_replication(temp, false, NULL);
+}
+}
+
+void bdrv_do_checkpoint_all(Error **errp)
+{
+BlockDriverState *bs = NULL;
+Error *local_err = NULL;
+
+while ((bs = bdrv_next(bs))) {
+if (!QLIST_EMPTY(&bs->parents)) {
+/* It is not top BDS */
+continue;
+}
+
+if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+continue;
+}
+
+bdrv_do_checkpoint(bs, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+void bdrv_stop_replication_all(bool failover, Error **errp)
+{
+BlockDriverState *bs = NULL;
+Error *local_err = NULL;
+
+while ((bs = bdrv_next(bs))) {
+if (!QLIST_EMPTY(&bs->parents)) {
+/* It is not top BDS */
+continue;
+}
+
+if (bdrv_is_read_only(bs) || !bdrv_is_inserted(bs)) {
+continue;
+}
+
+bdrv_stop_replication(bs, failover, &local_err);
+if (!errp) {
+/*
+ * The caller doesn't care the result, they just
+ * want to stop all block's replication.
+ */
+continue;
+}
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
diff --git a/include/block/block.h b/include/block/block.h
index cd39d50..39d246c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -653,4 +653,8 @@ void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
 void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
 void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
 
+void bdrv_start_replication_all(ReplicationMode mode, Error **errp);
+void bdrv_do_checkpoint_all(Error **errp);
+void bdrv_stop_replication_all(bool failover, Error **errp);
+
 #endif
-- 
2.5.0
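
bdrv_start_replication_all() above starts replication on each top BDS and,
if one of them fails, stops replication on the ones visited before it. A
minimal standalone sketch of that start-all-or-roll-back idea follows. Dev,
dev_start, dev_stop and start_all are hypothetical names, and unlike the
patch the sketch walks an array rather than the bdrv_next() list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device, standing in for a top-level BDS. */
typedef struct Dev {
    bool can_start;   /* whether start succeeds on this device */
    bool running;
} Dev;

static int dev_start(Dev *d)
{
    if (!d->can_start) {
        return -1;
    }
    d->running = true;
    return 0;
}

static void dev_stop(Dev *d)
{
    d->running = false;
}

/* Start every device; if one fails, stop those started earlier. */
static int start_all(Dev *devs, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        if (dev_start(&devs[i]) < 0) {
            for (j = 0; j < i; j++) {
                dev_stop(&devs[j]);
            }
            return -1;
        }
    }
    return 0;
}
```

The caller then sees either every device running or none, which is what
makes a failed start safe to retry.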

[Qemu-devel] [Patch v12 resend 00/10] Block replication for continuous checkpoints

2015-12-01 Thread Wen Congyang

Block replication is an important feature used for continuous
checkpoints (for example, COLO).

You can get the detailed information about block replication from here:
http://wiki.qemu.org/Features/BlockReplication

Usage:
Please refer to docs/block-replication.txt

This patch series is based on the following patch series:
1. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html
2. http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg06043.html

You can get the patch here:
https://github.com/coloft/qemu/tree/wency/block-replication-v12

You can get the patch with framework here:
https://github.com/coloft/qemu/tree/wency/colo_framework_v11.2

TODO:
1. Continuous block replication. It will be started after basic functions
   are accepted.

Change Log:
V12:
1. Rebase to the newest code
2. Use backing reference to replace 'allow-write-backing-file'
V11:
1. Reopen the backing file when starting block replication if it is not
   opened in R/W mode
2. Unblock BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET
   when opening backing file
3. Block the top BDS so there is only one block job for the top BDS and
   its backing chain.
V10:
1. Use blockdev-remove-medium and blockdev-insert-medium to replace backing
   reference.
2. Address the comments from Eric Blake
V9:
1. Update the error messages
2. Rebase to the newest qemu
3. Split child add/delete support. These patches are sent in another patchset.
V8:
1. Address Alberto Garcia's comments
V7:
1. Implement adding/removing quorum child. Remove the option non-connect.
2. Simplify the backing reference option according to Stefan Hajnoczi's
   suggestion
V6:
1. Rebase to the newest qemu.
V5:
1. Address the comments from Gong Lei
2. Speed up failover. The secondary vm can take over very quickly even
   if there are many I/O requests.
V4:
1. Introduce a new driver, replication, to avoid touching nbd and qcow2.
V3:
1. Use error_setg() instead of error_set()
2. Add a new block job API
3. Active disk, hidden disk and nbd target uses the same AioContext
4. Add a testcase to test new hbitmap API
V2:
1. Redesign the secondary qemu (use image-fleecing)
2. Use Error objects to return error message
3. Address the comments from Max Reitz and Eric Blake

Wen Congyang (10):
  unblock backup operations in backing file
  Store parent BDS in BdrvChild
  Backup: clear all bitmap when doing block checkpoint
  Allow creating backup jobs when opening BDS
  docs: block replication's description
  Add new block driver interfaces to control block replication
  quorum: implement block driver interfaces for block replication
  Implement new driver for block replication
  support replication driver in blockdev-add
  Add a new API to start/stop replication, do checkpoint to all BDSes

 block.c| 145 
 block/Makefile.objs|   3 +-
 block/backup.c |  14 ++
 block/quorum.c |  78 +++
 block/replication.c| 549 +
 blockjob.c |  11 +
 docs/block-replication.txt | 227 +++
 include/block/block.h  |   9 +
 include/block/block_int.h  |  15 ++
 include/block/blockjob.h   |  12 +
 qapi/block-core.json   |  34 ++-
 11 files changed, 1093 insertions(+), 4 deletions(-)
 create mode 100644 block/replication.c
 create mode 100644 docs/block-replication.txt

-- 
2.5.0

[Qemu-devel] [Patch v12 resend 03/10] Backup: clear all bitmap when doing block checkpoint

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/backup.c   | 14 ++
 blockjob.c   | 11 +++
 include/block/blockjob.h | 12 
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 3b39119..1ca102d 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job)
 }
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+error_setg(errp, "The backup job only supports block checkpoint in"
+   " sync=none mode");
+return;
+}
+
+hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
 .instance_size  = sizeof(BackupBlockJob),
 .job_type   = BLOCK_JOB_TYPE_BACKUP,
 .set_speed  = backup_set_speed,
 .iostatus_reset = backup_iostatus_reset,
+.do_checkpoint  = backup_do_checkpoint,
 .commit = backup_commit,
 .abort  = backup_abort,
 };
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
 QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
 block_job_txn_ref(txn);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+if (!job->driver->do_checkpoint) {
+error_setg(errp, "The job %s doesn't support block checkpoint",
+   BlockJobType_lookup[job->driver->job_type]);
+return;
+}
+
+job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
  * never both.
  */
 void (*abort)(BlockJob *job);
+
+/** Optional callback for job types that support checkpoint. */
+void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn);
  */
 void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.5.0
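
backup_do_checkpoint() above refuses to run unless the job is in sync=none
mode, and otherwise resets the job's bitmap. As I understand it, that bitmap
tracks which clusters the job has already copied, so resetting it makes the
next guest write copy them again. A minimal standalone sketch under that
assumption follows. CopyJob, SyncMode and copy_job_do_checkpoint are
hypothetical names, and a byte array stands in for QEMU's HBitmap.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins; QEMU's real types are BackupBlockJob and HBitmap. */
typedef enum { SYNC_MODE_FULL, SYNC_MODE_NONE } SyncMode;

typedef struct CopyJob {
    SyncMode sync_mode;
    unsigned char copied[16];   /* one flag per cluster already copied */
} CopyJob;

/* Returns 0 on success, -1 if the job is not in sync=none mode. */
static int copy_job_do_checkpoint(CopyJob *job)
{
    if (job->sync_mode != SYNC_MODE_NONE) {
        return -1;
    }
    /* Forget all copied clusters, so the next write copies them again. */
    memset(job->copied, 0, sizeof(job->copied));
    return 0;
}
```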

[Qemu-devel] [Patch v12 resend 01/10] unblock backup operations in backing file

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index bfc2be8..eaf479a 100644
--- a/block.c
+++ b/block.c
@@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
 /* Otherwise we won't be able to commit due to check in bdrv_commit */
 bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
 bs->backing_blocker);
+/*
+ * We do backup in 3 ways:
+ * 1. drive backup
+ *    The target bs is newly opened, and the source is the top BDS
+ * 2. blockdev backup
+ *    Both the source and the target are top BDSes.
+ * 3. internal backup (used for block replication)
+ *    Both the source and the target are backing files
+ *
+ * In cases 1 and 2, the backing file is neither the source nor
+ * the target.
+ * In case 3, we will block the top BDS, so there is only one block
+ * job for the top BDS and its backing chain.
+ */
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+bs->backing_blocker);
+bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+bs->backing_blocker);
 out:
 bdrv_refresh_limits(bs, NULL);
 }
-- 
2.5.0

[Qemu-devel] [Patch v12 resend 08/10] Implement new driver for block replication

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block/Makefile.objs |   1 +
 block/replication.c | 549 
 2 files changed, 550 insertions(+)
 create mode 100644 block/replication.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index fa05f37..94c1d03 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -23,6 +23,7 @@ block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
 block-obj-y += backup.o
+block-obj-y += replication.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/replication.c b/block/replication.c
new file mode 100644
index 000..c46c916
--- /dev/null
+++ b/block/replication.c
@@ -0,0 +1,549 @@
+/*
+ * Replication Block filter
+ *
+ * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+ * Copyright (c) 2015 Intel Corporation
+ * Copyright (c) 2015 FUJITSU LIMITED
+ *
+ * Author:
+ *   Wen Congyang <we...@cn.fujitsu.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block/block_int.h"
+#include "block/blockjob.h"
+#include "block/nbd.h"
+
+typedef struct BDRVReplicationState {
+ReplicationMode mode;
+int replication_state;
+BlockDriverState *active_disk;
+BlockDriverState *hidden_disk;
+BlockDriverState *secondary_disk;
+BlockDriverState *top_bs;
+Error *blocker;
+int orig_hidden_flags;
+int orig_secondary_flags;
+int error;
+} BDRVReplicationState;
+
+enum {
+BLOCK_REPLICATION_NONE, /* block replication is not started */
+BLOCK_REPLICATION_RUNNING,  /* block replication is running */
+BLOCK_REPLICATION_DONE, /* block replication is done(failover) */
+};
+
+#define COMMIT_CLUSTER_BITS 16
+#define COMMIT_CLUSTER_SIZE (1 << COMMIT_CLUSTER_BITS)
+#define COMMIT_SECTORS_PER_CLUSTER (COMMIT_CLUSTER_SIZE / BDRV_SECTOR_SIZE)
+
+static void replication_stop(BlockDriverState *bs, bool failover, Error **errp);
+
+#define REPLICATION_MODE "mode"
+static QemuOptsList replication_runtime_opts = {
+.name = "replication",
+.head = QTAILQ_HEAD_INITIALIZER(replication_runtime_opts.head),
+.desc = {
+{
+.name = REPLICATION_MODE,
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
+static int replication_open(BlockDriverState *bs, QDict *options,
+int flags, Error **errp)
+{
+int ret;
+BDRVReplicationState *s = bs->opaque;
+Error *local_err = NULL;
+QemuOpts *opts = NULL;
+const char *mode;
+
+ret = -EINVAL;
+opts = qemu_opts_create(&replication_runtime_opts, NULL, 0, &error_abort);
+qemu_opts_absorb_qdict(opts, options, &local_err);
+if (local_err) {
+goto fail;
+}
+
+mode = qemu_opt_get(opts, REPLICATION_MODE);
+if (!mode) {
+error_setg(&local_err, "Missing the option mode");
+goto fail;
+}
+
+if (!strcmp(mode, "primary")) {
+s->mode = REPLICATION_MODE_PRIMARY;
+} else if (!strcmp(mode, "secondary")) {
+s->mode = REPLICATION_MODE_SECONDARY;
+} else {
+error_setg(&local_err,
+   "The option mode's value should be primary or secondary");
+goto fail;
+}
+
+ret = 0;
+
+fail:
+qemu_opts_del(opts);
+/* propagate error */
+if (local_err) {
+error_propagate(errp, local_err);
+}
+return ret;
+}
+
+static void replication_close(BlockDriverState *bs)
+{
+BDRVReplicationState *s = bs->opaque;
+
+if (s->replication_state == BLOCK_REPLICATION_RUNNING) {
+replication_stop(bs, false, NULL);
+}
+}
+
+static int64_t replication_getlength(BlockDriverState *bs)
+{
+return bdrv_getlength(bs->file->bs);
+}
+
+static int replication_get_io_status(BDRVReplicationState *s)
+{
+switch (s->replication_state) {
+case BLOCK_REPLICATION_NONE:
+return -EIO;
+case BLOCK_REPLICATION_RUNNING:
+return 0;
+case BLOCK_REPLICATION_DONE:
+return s->mode == REPLICATION_MODE_PRIMARY ? -EIO : 1;
+default:
+abort();
+}
+}
+
+static int replication_return_value(BDRVReplicationState *s, int ret)
+{
+if (s->mode == REPLICATION_MODE_SECONDARY) {
+return ret;
+}
+
+if (ret < 0) {
+s->error = ret;
+ret = 0;
+}
+
+return ret;
+}
+
+static coroutine_fn int replication_co_readv(BlockDriverState *bs,
+ int64_t sector_num,
+ int remaining_sectors,
+ QEMUIOVector *q

[Qemu-devel] [Patch v12 resend 09/10] support replication driver in blockdev-add

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
---
 qapi/block-core.json | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2c6bd3f..acc9f8d 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -219,7 +219,7 @@
 #   'qcow2', 'raw', 'tftp', 'vdi', 'vmdk', 'vpc', 'vvfat'
 #   2.2: 'archipelago' added, 'cow' dropped
 #   2.3: 'host_floppy' deprecated
-#   2.5: 'host_floppy' dropped
+#   2.5: 'host_floppy' dropped, 'replication' added
 #
 # @backing_file: #optional the name of the backing file (for copy-on-write)
 #
@@ -1492,6 +1492,7 @@
 # Drivers that are supported in block device operations.
 #
 # @host_device, @host_cdrom: Since 2.1
+# @replication: Since 2.5
 #
 # Since: 2.0
 ##
@@ -1499,8 +1500,8 @@
   'data': [ 'archipelago', 'blkdebug', 'blkverify', 'bochs', 'cloop',
 'dmg', 'file', 'ftp', 'ftps', 'host_cdrom', 'host_device',
 'http', 'https', 'null-aio', 'null-co', 'parallels',
-'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'tftp', 'vdi', 'vhdx',
-'vmdk', 'vpc', 'vvfat' ] }
+'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'replication',
+'tftp', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsBase
@@ -1938,6 +1939,19 @@
 { 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
 
 ##
+# @BlockdevOptionsReplication
+#
+# Driver specific block device options for replication
+#
+# @mode: the replication mode
+#
+# Since: 2.5
+##
+{ 'struct': 'BlockdevOptionsReplication',
+  'base': 'BlockdevOptionsGenericFormat',
+  'data': { 'mode': 'ReplicationMode'  } }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
@@ -1974,6 +1988,7 @@
   'quorum': 'BlockdevOptionsQuorum',
   'raw':'BlockdevOptionsGenericFormat',
 # TODO rbd: Wait for structured options
+  'replication':'BlockdevOptionsReplication',
 # TODO sheepdog: Wait for structured options
 # TODO ssh: Should take InetSocketAddress for 'host'?
   'tftp':   'BlockdevOptionsFile',
-- 
2.5.0

[Qemu-devel] [Patch v12 resend 02/10] Store parent BDS in BdrvChild

2015-12-01 Thread Wen Congyang

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block.c   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index eaf479a..0a0468f 100644
--- a/block.c
+++ b/block.c
@@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 BdrvChild *child = g_new(BdrvChild, 1);
 *child = (BdrvChild) {
 .bs = child_bs,
+.parent = parent_bs,
 .name   = g_strdup(child_name),
 .role   = child_role,
 };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ea20d12..1f56046 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -357,6 +357,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
 BlockDriverState *bs;
+BlockDriverState *parent;
 char *name;
 const BdrvChildRole *role;
 QLIST_ENTRY(BdrvChild) next;
-- 
2.5.0

[Qemu-devel] [Patch v12 resend 07/10] quorum: implement block driver interfaces for block replication

2015-12-01 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Alberto Garcia <be...@igalia.com>
---
 block/quorum.c | 78 ++
 1 file changed, 78 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index b7df14b..6fa54f3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -85,6 +85,8 @@ typedef struct BDRVQuorumState {
 int bsize;
 
 QuorumReadPattern read_pattern;
+
+int replication_index; /* store which child supports block replication */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -949,6 +951,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 s->bsize = s->num_children;
 
 g_free(opened);
+s->replication_index = -1;
 goto exit;
 
 close_exit:
@@ -1148,6 +1151,77 @@ static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
 bs->full_open_options = opts;
 }
 
+static void quorum_start_replication(BlockDriverState *bs, ReplicationMode mode,
+ Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+int count = 0, i, index;
+Error *local_err = NULL;
+
+/*
+ * TODO: support REPLICATION_MODE_SECONDARY if we allow secondary
+ * QEMU becoming primary QEMU.
+ */
+if (mode != REPLICATION_MODE_PRIMARY) {
+error_setg(errp, "The replication mode for quorum should be 'primary'");
+return;
+}
+
+if (s->read_pattern != QUORUM_READ_PATTERN_FIFO) {
+error_setg(errp, "Block replication needs read pattern 'fifo'");
+return;
+}
+
+for (i = 0; i < s->num_children; i++) {
+bdrv_start_replication(s->children[i]->bs, mode, &local_err);
+if (local_err) {
+error_free(local_err);
+local_err = NULL;
+} else {
+count++;
+index = i;
+}
+}
+
+if (count == 0) {
+error_setg(errp, "No child supports block replication");
+} else if (count > 1) {
+for (i = 0; i < s->num_children; i++) {
+bdrv_stop_replication(s->children[i]->bs, false, NULL);
+}
+error_setg(errp, "Too many children support block replication");
+} else {
+s->replication_index = index;
+}
+}
+
+static void quorum_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+
+if (s->replication_index < 0) {
+error_setg(errp, "Block replication is not running");
+return;
+}
+
+bdrv_do_checkpoint(s->children[s->replication_index]->bs, errp);
+}
+
+static void quorum_stop_replication(BlockDriverState *bs, bool failover,
+Error **errp)
+{
+BDRVQuorumState *s = bs->opaque;
+
+if (s->replication_index < 0) {
+error_setg(errp, "Block replication is not running");
+return;
+}
+
+bdrv_stop_replication(s->children[s->replication_index]->bs, failover,
+  errp);
+s->replication_index = -1;
+}
+
 static BlockDriver bdrv_quorum = {
 .format_name= "quorum",
 .protocol_name  = "quorum",
@@ -1174,6 +1248,10 @@ static BlockDriver bdrv_quorum = {
 
 .is_filter  = true,
 .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
+
+.bdrv_start_replication = quorum_start_replication,
+.bdrv_do_checkpoint = quorum_do_checkpoint,
+.bdrv_stop_replication  = quorum_stop_replication,
 };
 
 static void bdrv_quorum_init(void)
-- 
2.5.0
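
quorum_start_replication() above accepts the configuration only when exactly
one child supports block replication, and remembers that child's index. A
minimal standalone sketch of the selection rule follows.
pick_replication_child and its return codes are hypothetical. The patch
itself probes each child by calling bdrv_start_replication() on it, whereas
the sketch takes the result of that probing as an input array.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return the index of the single child that supports replication,
 * -1 if none does, or -2 if more than one does.
 */
static int pick_replication_child(const bool *supports, int num_children)
{
    int i, count = 0, index = -1;

    for (i = 0; i < num_children; i++) {
        if (supports[i]) {
            count++;
            index = i;
        }
    }

    if (count == 0) {
        return -1;
    }
    if (count > 1) {
        return -2;
    }
    return index;
}
```

Initialising index to -1 also avoids reading an unset variable when no child
qualifies.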

[Qemu-devel] [Patch v12 resend 04/10] Allow creating backup jobs when opening BDS

2015-12-01 Thread Wen Congyang

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
2.5.0

Re: [Qemu-devel] [RFC PATCH 1/9] Init colo-proxy object based on netfilter

2015-11-29 Thread Wen Congyang

On 11/27/2015 08:27 PM, Zhang Chen wrote:
> From: zhangchen <zhangchen.f...@cn.fujitsu.com>
> 
> add colo-proxy in vl.c and qemu-options.hx
> 
> Signed-off-by: zhangchen <zhangchen.f...@cn.fujitsu.com>
> ---
>  qemu-options.hx | 4 
>  vl.c| 3 ++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 949db7f..5e6f1e3 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -3666,6 +3666,10 @@ queue @var{all|rx|tx} is an option that can be applied 
> to any netfilter.
>  @option{tx}: the filter is attached to the transmit queue of the netdev,
>   where it will receive packets sent by the netdev.
>  
> +@item -object 
> colo-proxy,id=@var{id},netdev=@var{netdevid},port=@var{t},addr=@var{ip:port},mode=@var{primary|secondary}[,queue=@var{all|rx|tx}]

1. queue *MUST* be all for the colo-proxy filter.
2. The option port should be removed.
3. The option addr is a socket address. The format can be host:port, or fd.

> +
> +colo-proxy

Add more description here.

Thanks
Wen Congyang

> +
>  @item -object 
> filter-dump,id=@var{id},netdev=@var{dev},file=@var{filename}][,maxlen=@var{len}]
>  
>  Dump the network traffic on netdev @var{dev} to the file specified by
> diff --git a/vl.c b/vl.c
> index f5f7c3f..9037743 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2774,7 +2774,8 @@ static bool object_create_initial(const char *type)
>   * they depend on netdevs already existing
>   */
>  if (g_str_equal(type, "filter-buffer") ||
> -g_str_equal(type, "filter-dump")) {
> +g_str_equal(type, "filter-dump") ||
> +g_str_equal(type, "colo-proxy")) {
>  return false;
>  }
>  
>

Re: [Qemu-devel] [RFC PATCH 3/9] colo-proxy: add colo-proxy framework

2015-11-29 Thread Wen Congyang

On 11/27/2015 08:27 PM, Zhang Chen wrote:
> From: zhangchen 
> 
> Colo-proxy is a plugin of qemu netfilter
> like filter-buffer and dump
> 
> Signed-off-by: zhangchen 
> ---
>  net/Makefile.objs |   1 +
>  net/colo-proxy.c  | 139 
> ++
>  net/colo-proxy.h  |  63 +
>  3 files changed, 203 insertions(+)
>  create mode 100644 net/colo-proxy.c
>  create mode 100644 net/colo-proxy.h
> 
> diff --git a/net/Makefile.objs b/net/Makefile.objs
> index 5fa2f97..95670f2 100644
> --- a/net/Makefile.objs
> +++ b/net/Makefile.objs
> @@ -15,3 +15,4 @@ common-obj-$(CONFIG_VDE) += vde.o
>  common-obj-$(CONFIG_NETMAP) += netmap.o
>  common-obj-y += filter.o
>  common-obj-y += filter-buffer.o
> +common-obj-y += colo-proxy.o
> diff --git a/net/colo-proxy.c b/net/colo-proxy.c
> new file mode 100644
> index 000..98c2699
> --- /dev/null
> +++ b/net/colo-proxy.c
> @@ -0,0 +1,139 @@
> +/*
> + * COarse-grain LOck-stepping Virtual Machines for Non-stop Service (COLO)
> + * (a.k.a. Fault Tolerance or Continuous Replication)
> + *
> + * Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
> + * Copyright (c) 2015 FUJITSU LIMITED
> + * Copyright (c) 2015 Intel Corporation
> + *
> + * Author: Zhang Chen 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or
> + * later.  See the COPYING file in the top-level directory.
> + */
> +
> +#include "colo-proxy.h"
> +
> +#define __DEBUG__
> +
> +#ifdef __DEBUG__
> +#define DEBUG(format, ...) printf(format, ##__VA_ARGS__)
> +#else
> +#define DEBUG(format, ...)
> +#endif
> +
> +
> +static ssize_t colo_proxy_receive_iov(NetFilterState *nf,
> + NetClientState *sender,
> + unsigned flags,
> + const struct iovec *iov,
> + int iovcnt,
> + NetPacketSent *sent_cb)
> +{
> +/*
> + * We return size when buffer a packet, the sender will take it as
> + * a already sent packet, so sent_cb should not be called later.
> + *
> + */
> +ColoProxyState *s = FILTER_COLO_PROXY(nf);
> +if (s->colo_mode == COLO_PRIMARY_MODE) {
> + /* colo_proxy_primary_handler */
> +} else {
> + /* colo_proxy_primary_handler */
> +}
> +return iov_size(iov, iovcnt);
> +}
> +
> +static void colo_proxy_cleanup(NetFilterState *nf)
> +{
> + /* cleanup */
> +}
> +
> +
> +static void colo_proxy_setup(NetFilterState *nf, Error **errp)
> +{
> +ColoProxyState *s = FILTER_COLO_PROXY(nf);
> +if (!s->addr) {
> +error_setg(errp, "filter colo_proxy needs 'addr' \
> + property set!");
> +return;
> +}
> +
> +if (nf->direction != NET_FILTER_DIRECTION_ALL) {
> +printf("colo need queue all packet,\

s/need/needs/

> +please startup colo-proxy with queue=all\n");
> +return;
> +}
> +
> +s->sockfd = -1;
> +s->has_failover = false;
> +colo_do_checkpoint = false;
> +g_queue_init(&s->unprocessed_connections);
> +
> +if (!strcmp(mode, PRIMARY_MODE)) {
> +s->colo_mode = COLO_PRIMARY_MODE;
> +} else if (!strcmp(mode, SECONDARY_MODE)) {
> +s->colo_mode = COLO_SECONDARY_MODE;
> +} else {
> +error_setg(errp, QERR_INVALID_PARAMETER_VALUE, "mode",
> +"primary or secondary");
> +return;
> +}
> +}
> +
> +static void colo_proxy_class_init(ObjectClass *oc, void *data)
> +{
> +NetFilterClass *nfc = NETFILTER_CLASS(oc);
> +
> +nfc->setup = colo_proxy_setup;
> +nfc->cleanup = colo_proxy_cleanup;
> +nfc->receive_iov = colo_proxy_receive_iov;
> +}
> +
> +static char *colo_proxy_get_mode(Object *obj, Error **errp)
> +{
> +return g_strdup(mode);
> +}
> +
> +static void colo_proxy_set_mode(Object *obj, const char *value, Error **errp)
> +{
> +g_free(mode);
> +mode = g_strdup(value);
> +}
> +
> +static char *colo_proxy_get_addr(Object *obj, Error **errp)
> +{
> +ColoProxyState *s = FILTER_COLO_PROXY(obj);
> +
> +return g_strdup(s->addr);
> +}
> +
> +static void colo_proxy_set_addr(Object *obj, const char *value, Error **errp)
> +{
> +ColoProxyState *s = FILTER_COLO_PROXY(obj);
> +g_free(s->addr);
> +s->addr = g_strdup(value);

You can parse the address here, so that a format error is found as early
as possible.
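
For example, a minimal sketch of such an early check for the host:port form.
parse_host_port is a hypothetical helper written only for illustration. It
is not a QEMU function, it does not handle the fd form, and real code would
more likely reuse QEMU's existing socket address parsing.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "host:port" into its parts.  Returns 0 on success, -1 on a
 * malformed string.  Hypothetical helper, not a QEMU function.
 */
static int parse_host_port(const char *addr, char *host, size_t host_len,
                           int *port)
{
    const char *colon = strrchr(addr, ':');
    char *end;
    long val;
    size_t len;

    if (!colon || colon == addr) {
        return -1;              /* no ':' or empty host */
    }

    len = (size_t)(colon - addr);
    if (len >= host_len) {
        return -1;              /* host does not fit the buffer */
    }

    if (!isdigit((unsigned char)colon[1])) {
        return -1;              /* port missing or not starting with a digit */
    }

    errno = 0;
    val = strtol(colon + 1, &end, 10);
    if (errno || *end != '\0' || val < 1 || val > 65535) {
        return -1;              /* trailing junk or port out of range */
    }

    memcpy(host, addr, len);
    host[len] = '\0';
    *port = (int)val;
    return 0;
}
```

The setter could call this and report failure through errp, so a bad value
is rejected when the property is set rather than when the filter is set up.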

> +}
> +
> +static void colo_proxy_init(Object *obj)
> +{
> +object_property_add_str(obj, "mode", colo_proxy_get_mode,
> +colo_proxy_set_mode, NULL);
> +object_property_add_str(obj, "addr", colo_proxy_get_addr,
> +colo_proxy_set_addr, NULL);
> +}
> +
> +static const TypeInfo colo_proxy_info = {
> +.name = TYPE_FILTER_COLO_PROXY,
> +.parent =

[Qemu-devel] [Patch v12 05/10] docs: block replication's description

2015-11-26 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 docs/block-replication.txt | 227 +
 1 file changed, 227 insertions(+)
 create mode 100644 docs/block-replication.txt

diff --git a/docs/block-replication.txt b/docs/block-replication.txt
new file mode 100644
index 000..c7bad0e
--- /dev/null
+++ b/docs/block-replication.txt
@@ -0,0 +1,227 @@
+Block replication
+
+Copyright Fujitsu, Corp. 2015
+Copyright (c) 2015 Intel Corporation
+Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+Block replication is used for continuous checkpoints. It is designed
+for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
+It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
+where the Secondary VM is not running.
+
+This document gives an overview of block replication's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoints. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Workflow ==
+The following is the image of block replication workflow:
+
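+         +----------------------+            +------------------------+
+         |Primary Write Requests|            |Secondary Write Requests|
+         +----------------------+            +------------------------+
+                   |                                       |
+                   |                                      (4)
+                   |                                       V
+                   |                              /-------------\
+                   |      Copy and Forward        |             |
+                   |---------(1)----------+       | Disk Buffer |
+                   |                      |       |             |
+                   |                     (3)      \-------------/
+                   |                 speculative      ^
+                   |                write through    (2)
+                   |                      |           |
+                   V                      V           |
+            +--------------+           +----------------+
+            | Primary Disk |           | Secondary Disk |
+            +--------------+           +----------------+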
+
+1) Primary write requests will be copied and forwarded to Secondary
+   QEMU.
+2) Before Primary write requests are written to Secondary disk, the
+   original sector content will be read from Secondary disk and
+   buffered in the Disk buffer, but it will not overwrite the existing
+   sector content (it could be from either "Secondary Write Requests" or
+   previous COW of "Primary Write Requests") in the Disk buffer.
+3) Primary write requests will be written to Secondary disk.
+4) Secondary write requests will be buffered in the Disk buffer and it
+   will overwrite the existing sector content in the buffer.
+
+== Architecture ==
+We are going to implement block replication from many basic
+blocks that are already in QEMU.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary    2 filter
+     disk         ^                                                             virtio-blk
+                  |                                                                  ^
+                3 NBD  ------->  3 NBD                                               |
+                client    ||     server                                          2 filter
+                          ||        ^                                                ^
+--------.                 ||        |                                                |
+Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
+--------'                 ||
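
The four workflow steps in the document above can be sketched in plain C.
This is only an illustration, not the code from the series: the
SecondaryNode type, the tiny sector size, and the "read the buffer first"
rule in secondary_read() are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NB_SECTORS  8   /* hypothetical tiny disk */
#define SECTOR_SIZE 4   /* hypothetical sector size, kept small for brevity */

typedef struct {
    char secondary_disk[NB_SECTORS][SECTOR_SIZE];
    char buffer[NB_SECTORS][SECTOR_SIZE];
    bool buffered[NB_SECTORS];   /* sectors currently held in the Disk buffer */
} SecondaryNode;

/* Steps 2 and 3: a forwarded Primary write arrives on the Secondary side. */
static void primary_write(SecondaryNode *n, int sector, const char *data)
{
    assert(sector >= 0 && sector < NB_SECTORS);
    if (!n->buffered[sector]) {
        /* Step 2: save the original content; never overwrite the buffer. */
        memcpy(n->buffer[sector], n->secondary_disk[sector], SECTOR_SIZE);
        n->buffered[sector] = true;
    }
    /* Step 3: write through to the Secondary disk. */
    memcpy(n->secondary_disk[sector], data, SECTOR_SIZE);
}

/* Step 4: a Secondary write goes to the buffer and overwrites it. */
static void secondary_write(SecondaryNode *n, int sector, const char *data)
{
    assert(sector >= 0 && sector < NB_SECTORS);
    memcpy(n->buffer[sector], data, SECTOR_SIZE);
    n->buffered[sector] = true;
}

/* Assumed read path of the Secondary VM: buffer first, then the disk. */
static const char *secondary_read(const SecondaryNode *n, int sector)
{
    assert(sector >= 0 && sector < NB_SECTORS);
    return n->buffered[sector] ? n->buffer[sector]
                               : n->secondary_disk[sector];
}

/* At checkpoint time the buffered content is dropped. */
static void do_checkpoint(SecondaryNode *n)
{
    memset(n->buffered, 0, sizeof(n->buffered));
}
```

Until the next checkpoint the Secondary VM keeps seeing its own view of a
sector, while the Secondary disk already holds the Primary's content, so
dropping the buffer is enough to bring both sides back in sync.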

[Qemu-devel] [Patch v12 06/10] Add new block driver interfaces to control block replication

2015-11-26 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Cc: Luiz Capitulino <lcapitul...@redhat.com>
Cc: Michael Roth <mdr...@linux.vnet.ibm.com>
Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
---
 block.c   | 43 +++
 include/block/block.h |  5 +
 include/block/block_int.h | 14 ++
 qapi/block-core.json  | 13 +
 4 files changed, 75 insertions(+)

diff --git a/block.c b/block.c
index 0a0468f..213bee8 100644
--- a/block.c
+++ b/block.c
@@ -4390,3 +4390,46 @@ void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
 
     parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
 }
+
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_start_replication) {
+        drv->bdrv_start_replication(bs, mode, errp);
+    } else if (bs->file) {
+        bdrv_start_replication(bs->file->bs, mode, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support starting block"
+                   " replication", bs->filename);
+    }
+}
+
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_do_checkpoint) {
+        drv->bdrv_do_checkpoint(bs, errp);
+    } else if (bs->file) {
+        bdrv_do_checkpoint(bs->file->bs, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support block checkpoint",
+                   bs->filename);
+    }
+}
+
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp)
+{
+    BlockDriver *drv = bs->drv;
+
+    if (drv && drv->bdrv_stop_replication) {
+        drv->bdrv_stop_replication(bs, failover, errp);
+    } else if (bs->file) {
+        bdrv_stop_replication(bs->file->bs, failover, errp);
+    } else {
+        error_setg(errp, "The BDS %s doesn't support stopping block"
+                   " replication", bs->filename);
+    }
+}
diff --git a/include/block/block.h b/include/block/block.h
index 1d3b9c6..cd39d50 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -648,4 +648,9 @@ void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
 void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
                     Error **errp);
 
+void bdrv_start_replication(BlockDriverState *bs, ReplicationMode mode,
+                            Error **errp);
+void bdrv_do_checkpoint(BlockDriverState *bs, Error **errp);
+void bdrv_stop_replication(BlockDriverState *bs, bool failover, Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1f56046..a6aba8b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -307,6 +307,20 @@ struct BlockDriver {
     void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
                            Error **errp);
 
+    void (*bdrv_start_replication)(BlockDriverState *bs, ReplicationMode mode,
+                                   Error **errp);
+    /* Drop Disk buffer when doing checkpoint. */
+    void (*bdrv_do_checkpoint)(BlockDriverState *bs, Error **errp);
+    /*
+     * After failover, we should flush Disk buffer into secondary disk
+     * and stop block replication.
+     *
+     * If the guest is shut down, we should drop Disk buffer and stop
+     * block replication.
+     */
+    void (*bdrv_stop_replication)(BlockDriverState *bs, bool failover,
+                                  Error **errp);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index feb8da2..2c6bd3f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1925,6 +1925,19 @@
 '*read-pattern': 'QuorumReadPattern' } }
 
 ##
+# @ReplicationMode
+#
+# An enumeration of replication modes.
+#
+# @primary: Primary mode, the vm's state will be sent to secondary QEMU.
+#
+# @secondary: Secondary mode, receive the vm's state from primary QEMU.
+#
+# Since: 2.5
+##
+{ 'enum' : 'ReplicationMode', 'data' : [ 'primary', 'secondary' ] }
+
+##
 # @BlockdevOptions
 #
 # Options for creating a block device.
-- 
2.5.0
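
The three new bdrv_*() entry points in this patch share one dispatch
rule: use the driver's callback if it has one, otherwise try the node's
"file" child, otherwise report an error. A minimal standalone sketch of
that rule follows. The Node and Driver types are hypothetical stand-ins
for BlockDriverState and BlockDriver, and a return code replaces the
Error **errp convention of the real code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MODE_PRIMARY, MODE_SECONDARY } Mode;

typedef struct Node Node;

typedef struct {
    /* Optional: NULL when the driver has no replication support. */
    int (*start_replication)(Node *n, Mode mode);
} Driver;

struct Node {
    const char *name;
    const Driver *drv;
    Node *file;     /* lower-level child, may be NULL */
    int started;    /* set by the demo callback below */
};

/* Demo callback: just records that replication was started on this node. */
static int demo_start(Node *n, Mode mode)
{
    (void)mode;
    n->started = 1;
    return 0;
}

/* Returns 0 on success, -1 if no node in the file chain supports it. */
static int start_replication(Node *n, Mode mode)
{
    assert(n != NULL);
    if (n->drv && n->drv->start_replication) {
        return n->drv->start_replication(n, mode);
    } else if (n->file) {
        /* Walk down the chain, as the patch does via bs->file->bs. */
        return start_replication(n->file, mode);
    }
    return -1;
}
```

With this rule a format node that knows nothing about replication can sit
on top of a node that does, and the call still reaches the right driver.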

[Qemu-devel] [Patch v12 01/10] unblock backup operations in backing file

2015-11-26 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/block.c b/block.c
index bfc2be8..eaf479a 100644
--- a/block.c
+++ b/block.c
@@ -1275,6 +1275,24 @@ void bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd)
     /* Otherwise we won't be able to commit due to check in bdrv_commit */
     bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_COMMIT_TARGET,
                     bs->backing_blocker);
+    /*
+     * We do backup in 3 ways:
+     * 1. drive backup
+     *    The target bs is newly opened, and the source is the top BDS
+     * 2. blockdev backup
+     *    Both the source and the target are top BDSes.
+     * 3. internal backup (used for block replication)
+     *    Both the source and the target are backing files
+     *
+     * In cases 1 and 2, the backing file is neither the source nor
+     * the target.
+     * In case 3, we will block the top BDS, so there is only one block
+     * job for the top BDS and its backing chain.
+     */
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_SOURCE,
+                    bs->backing_blocker);
+    bdrv_op_unblock(backing_hd, BLOCK_OP_TYPE_BACKUP_TARGET,
+                    bs->backing_blocker);
 out:
     bdrv_refresh_limits(bs, NULL);
 }
-- 
2.5.0

[Qemu-devel] [Patch v12 02/10] Store parent BDS in BdrvChild

2015-11-26 Thread Wen Congyang

We need to access the parent BDS to get the root BDS.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
---
 block.c   | 1 +
 include/block/block_int.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/block.c b/block.c
index eaf479a..0a0468f 100644
--- a/block.c
+++ b/block.c
@@ -1204,6 +1204,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
         .bs = child_bs,
+        .parent = parent_bs,
         .name   = g_strdup(child_name),
         .role   = child_role,
     };
diff --git a/include/block/block_int.h b/include/block/block_int.h
index ea20d12..1f56046 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -357,6 +357,7 @@ extern const BdrvChildRole child_format;
 
 struct BdrvChild {
     BlockDriverState *bs;
+    BlockDriverState *parent;
     char *name;
     const BdrvChildRole *role;
     QLIST_ENTRY(BdrvChild) next;
-- 
2.5.0

[Qemu-devel] [Patch v12 03/10] Backup: clear all bitmap when doing block checkpoint

2015-11-26 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/backup.c   | 14 ++
 blockjob.c   | 11 +++
 include/block/blockjob.h | 12 
 3 files changed, 37 insertions(+)

diff --git a/block/backup.c b/block/backup.c
index 3b39119..1ca102d 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -253,11 +253,25 @@ static void backup_abort(BlockJob *job)
     }
 }
 
+static void backup_do_checkpoint(BlockJob *job, Error **errp)
+{
+    BackupBlockJob *backup_job = container_of(job, BackupBlockJob, common);
+
+    if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) {
+        error_setg(errp, "The backup job only supports block checkpoint in"
+                   " sync=none mode");
+        return;
+    }
+
+    hbitmap_reset_all(backup_job->bitmap);
+}
+
 static const BlockJobDriver backup_job_driver = {
     .instance_size  = sizeof(BackupBlockJob),
     .job_type       = BLOCK_JOB_TYPE_BACKUP,
     .set_speed      = backup_set_speed,
     .iostatus_reset = backup_iostatus_reset,
+    .do_checkpoint  = backup_do_checkpoint,
     .commit         = backup_commit,
     .abort          = backup_abort,
 };
diff --git a/blockjob.c b/blockjob.c
index 80adb9d..0c8edfe 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -533,3 +533,14 @@ void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job)
     QLIST_INSERT_HEAD(&txn->jobs, job, txn_list);
     block_job_txn_ref(txn);
 }
+
+void block_job_do_checkpoint(BlockJob *job, Error **errp)
+{
+    if (!job->driver->do_checkpoint) {
+        error_setg(errp, "The job %s doesn't support block checkpoint",
+                   BlockJobType_lookup[job->driver->job_type]);
+        return;
+    }
+
+    job->driver->do_checkpoint(job, errp);
+}
diff --git a/include/block/blockjob.h b/include/block/blockjob.h
index d84ccd8..abdba7c 100644
--- a/include/block/blockjob.h
+++ b/include/block/blockjob.h
@@ -70,6 +70,9 @@ typedef struct BlockJobDriver {
      * never both.
      */
     void (*abort)(BlockJob *job);
+
+    /** Optional callback for job types that support checkpoint. */
+    void (*do_checkpoint)(BlockJob *job, Error **errp);
 } BlockJobDriver;
 
 /**
@@ -443,4 +446,13 @@ void block_job_txn_unref(BlockJobTxn *txn);
  */
 void block_job_txn_add_job(BlockJobTxn *txn, BlockJob *job);
 
+/**
+ * block_job_do_checkpoint:
+ * @job: The job.
+ * @errp: Error object.
+ *
+ * Do block checkpoint on the specified job.
+ */
+void block_job_do_checkpoint(BlockJob *job, Error **errp);
+
 #endif
-- 
2.5.0
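
The checkpoint hook added by this patch is an optional per-driver
callback, and the backup implementation only accepts it in sync=none
mode. Below is a small standalone sketch of that shape. Job, JobDriver
and the byte array standing in for the HBitmap are hypothetical, and the
sketch assumes the bitmap marks clusters that have already been copied,
so that clearing it makes every cluster eligible for copy-on-write again
after the checkpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SYNC_FULL, SYNC_NONE } SyncMode;

typedef struct Job Job;

typedef struct {
    const char *job_type;
    int (*do_checkpoint)(Job *job);   /* optional, may be NULL */
} JobDriver;

struct Job {
    const JobDriver *driver;
    SyncMode sync_mode;
    unsigned char copied[16];   /* stand-in for the bitmap: 1 = copied */
};

/* Backup-style checkpoint: only valid in sync=none mode. */
static int backup_checkpoint(Job *job)
{
    if (job->sync_mode != SYNC_NONE) {
        return -1;
    }
    memset(job->copied, 0, sizeof(job->copied));
    return 0;
}

/* Generic entry point: fail cleanly when the job type has no hook. */
static int job_do_checkpoint(Job *job)
{
    assert(job != NULL && job->driver != NULL);
    if (!job->driver->do_checkpoint) {
        return -1;
    }
    return job->driver->do_checkpoint(job);
}
```

Keeping the hook optional means job types other than backup need no
change; they simply reject the request.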

[Qemu-devel] [Patch v8 3/3] qmp: add monitor command to add/remove a child

2015-11-26 Thread Wen Congyang

The new QMP command is named x-blockdev-change. For now it only adds or
removes a quorum child; it does not support every kind of child, every kind
of operation, or every block driver, so it is marked experimental.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 blockdev.c   | 54 
 qapi/block-core.json | 23 ++
 qmp-commands.hx  | 47 +
 3 files changed, 124 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 2b076fb..7d8a2b4 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3836,6 +3836,60 @@ out:
     aio_context_release(aio_context);
 }
 
+static BlockDriverState *bdrv_find_child(BlockDriverState *parent_bs,
+                                         const char *child_name)
+{
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &parent_bs->children, next) {
+        if (strcmp(child->name, child_name) == 0) {
+            return child->bs;
+        }
+    }
+
+    return NULL;
+}
+
+void qmp_x_blockdev_change(const char *parent, bool has_child,
+                           const char *child, bool has_node,
+                           const char *node, Error **errp)
+{
+    BlockDriverState *parent_bs, *child_bs = NULL, *new_bs = NULL;
+
+    parent_bs = bdrv_lookup_bs(parent, parent, errp);
+    if (!parent_bs) {
+        return;
+    }
+
+    if (has_child == has_node) {
+        if (has_child) {
+            error_setg(errp, "The parameters child and node are in conflict");
+        } else {
+            error_setg(errp, "Either child or node should be specified");
+        }
+        return;
+    }
+
+    if (has_child) {
+        child_bs = bdrv_find_child(parent_bs, child);
+        if (!child_bs) {
+            error_setg(errp, "Node '%s' doesn't have child %s",
+                       parent, child);
+            return;
+        }
+        bdrv_del_child(parent_bs, child_bs, errp);
+    }
+
+    if (has_node) {
+        new_bs = bdrv_find_node(node);
+        if (!new_bs) {
+            error_setg(errp, "Node '%s' not found", node);
+            return;
+        }
+        bdrv_add_child(parent_bs, new_bs, errp);
+    }
+}
+
 BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 {
     BlockJobInfoList *head = NULL, **p_next = &head;
diff --git a/qapi/block-core.json b/qapi/block-core.json
index a07b13f..feb8da2 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2400,3 +2400,26 @@
 ##
 { 'command': 'block-set-write-threshold',
   'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
+
+##
+# @x-blockdev-change
+#
+# Dynamically reconfigure the block driver state graph. It can be used
+# to add, remove, insert or replace a block driver state. Currently only
+# the Quorum driver implements this feature to add or remove its child.
+# This is useful to fix a broken quorum child.
+#
+# @parent: the id or name of the node that will be changed.
+#
+# @child: #optional the name of the child that will be deleted.
+#
+# @node: #optional the name of the node that will be added.
+#
+# Note: this command is experimental, and its API is not stable.
+#
+# Since: 2.6
+##
+{ 'command': 'x-blockdev-change',
+  'data' : { 'parent': 'str',
+             '*child': 'str',
+             '*node': 'str' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 9d8b42f..9b49d51 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -4285,6 +4285,53 @@ Example:
 EQMP
 
     {
+        .name       = "x-blockdev-change",
+        .args_type  = "parent:B,child:B?,node:B?",
+        .mhandler.cmd_new = qmp_marshal_x_blockdev_change,
+    },
+
+SQMP
+x-blockdev-change
+-----------------
+
+Dynamically reconfigure the block driver state graph. It can be used to
+add, remove, insert, or replace a block driver state. Currently only
+the Quorum driver implements this feature to add and remove its child.
+This is useful to fix a broken quorum child.
+
+Arguments:
+- "parent": the id or node name of the node that will be changed (json-string)
+- "child": the name of the child that will be deleted (json-string, optional)
+- "node": the node-name of the new node that will be added (json-string, optional)
+
+Note: this command is experimental, and not a stable API. It doesn't
+support all kinds of operations, all kinds of children, nor all block
+drivers.
+
+Example:
+
+Add a new node to a quorum
+-> { "execute": "blockdev-add",
+    "arguments": { "options": { "driver": "raw",
+                                "node-name": "new_node",
+                                "id": "test_new_node",
+                                "file": { "driver": "file",
+

[Qemu-devel] [Patch v8 1/3] Add new block driver interface to add/delete a BDS's child

2015-11-26 Thread Wen Congyang

In some cases, we want to take a quorum child offline, and take
another child online.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Alberto Garcia <be...@igalia.com>
---
 block.c   | 50 +++
 include/block/block.h |  5 +
 include/block/block_int.h |  5 +
 3 files changed, 60 insertions(+)

diff --git a/block.c b/block.c
index 60ff84f..255a36e 100644
--- a/block.c
+++ b/block.c
@@ -4321,3 +4321,53 @@ void bdrv_refresh_filename(BlockDriverState *bs)
         QDECREF(json);
     }
 }
+
+/*
+ * Hot add/remove a BDS's child. So the user can take a child offline when
+ * it is broken and take a new child online
+ */
+void bdrv_add_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+                    Error **errp)
+{
+
+    if (!parent_bs->drv || !parent_bs->drv->bdrv_add_child) {
+        error_setg(errp, "The node %s doesn't support adding a child",
+                   bdrv_get_device_or_node_name(parent_bs));
+        return;
+    }
+
+    if (!QLIST_EMPTY(&child_bs->parents)) {
+        error_setg(errp, "The node %s already has a parent",
+                   child_bs->node_name);
+        return;
+    }
+
+    parent_bs->drv->bdrv_add_child(parent_bs, child_bs, errp);
+}
+
+void bdrv_del_child(BlockDriverState *parent_bs, BlockDriverState *child_bs,
+                    Error **errp)
+{
+    BdrvChild *child;
+
+    if (!parent_bs->drv || !parent_bs->drv->bdrv_del_child) {
+        error_setg(errp, "The node %s doesn't support removing a child",
+                   bdrv_get_device_or_node_name(parent_bs));
+        return;
+    }
+
+    QLIST_FOREACH(child, &parent_bs->children, next) {
+        if (child->bs == child_bs) {
+            break;
+        }
+    }
+
+    if (!child) {
+        error_setg(errp, "The node %s is not a child of %s",
+                   bdrv_get_device_or_node_name(child_bs),
+                   bdrv_get_device_or_node_name(parent_bs));
+        return;
+    }
+
+    parent_bs->drv->bdrv_del_child(parent_bs, child_bs, errp);
+}
diff --git a/include/block/block.h b/include/block/block.h
index d9b380c..06d3369 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -639,4 +639,9 @@ void bdrv_drained_begin(BlockDriverState *bs);
  */
 void bdrv_drained_end(BlockDriverState *bs);
 
+void bdrv_add_child(BlockDriverState *parent, BlockDriverState *child,
+                    Error **errp);
+void bdrv_del_child(BlockDriverState *parent, BlockDriverState *child,
+                    Error **errp);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 6d7bd3b..ea20d12 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -302,6 +302,11 @@ struct BlockDriver {
      */
     void (*bdrv_drain)(BlockDriverState *bs);
 
+    void (*bdrv_add_child)(BlockDriverState *parent, BlockDriverState *child,
+                           Error **errp);
+    void (*bdrv_del_child)(BlockDriverState *parent, BlockDriverState *child,
+                           Error **errp);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
-- 
2.5.0
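
The two checks that bdrv_add_child() and bdrv_del_child() perform before
calling into the driver can be shown in isolation: a node that already
has a parent cannot be added, and a node that is not a child of the
given parent cannot be removed. The sketch below is a simplification
under stated assumptions: Node, its fixed-size kids array and the
can_hotplug flag are hypothetical and replace the BdrvChild list and the
driver callbacks of the real code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KIDS 4   /* hypothetical limit, only for the example */

typedef struct Node Node;
struct Node {
    const char *name;
    Node *parent;          /* NULL while the node is unattached */
    Node *kids[MAX_KIDS];
    int nb_kids;
    int can_hotplug;       /* stands in for the add/del driver callbacks */
};

/* Returns 0 on success, -1 if the request must be refused. */
static int add_child(Node *parent, Node *child)
{
    assert(parent != NULL && child != NULL);
    if (!parent->can_hotplug) {
        return -1;                 /* driver doesn't support adding */
    }
    if (child->parent) {
        return -1;                 /* node already has a parent */
    }
    if (parent->nb_kids == MAX_KIDS) {
        return -1;
    }
    parent->kids[parent->nb_kids++] = child;
    child->parent = parent;
    return 0;
}

static int del_child(Node *parent, Node *child)
{
    assert(parent != NULL && child != NULL);
    if (!parent->can_hotplug) {
        return -1;                 /* driver doesn't support removing */
    }
    for (int i = 0; i < parent->nb_kids; i++) {
        if (parent->kids[i] == child) {
            /* Fill the hole with the last entry and shrink the array. */
            parent->kids[i] = parent->kids[--parent->nb_kids];
            child->parent = NULL;
            return 0;
        }
    }
    return -1;                     /* not a child of this parent */
}
```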

[Qemu-devel] [Patch v8 2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

2015-11-26 Thread Wen Congyang

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
---
 block.c   |   8 ++--
 block/quorum.c| 124 +-
 include/block/block.h |   4 ++
 3 files changed, 130 insertions(+), 6 deletions(-)

diff --git a/block.c b/block.c
index 255a36e..bfc2be8 100644
--- a/block.c
+++ b/block.c
@@ -1196,10 +1196,10 @@ static int bdrv_fill_options(QDict **options, const char *filename,
     return 0;
 }
 
-static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-                                    BlockDriverState *child_bs,
-                                    const char *child_name,
-                                    const BdrvChildRole *child_role)
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role)
 {
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
diff --git a/block/quorum.c b/block/quorum.c
index 2810e37..b7df14b 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -23,6 +23,7 @@
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
 #include "crypto/hash.h"
+#include "qemu/bitmap.h"
 
 #define HASH_LENGTH 32
 
@@ -80,6 +81,8 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+    unsigned long *index_bitmap;
+    int bsize;
 
     QuorumReadPattern read_pattern;
 } BDRVQuorumState;
@@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
         ret = -EINVAL;
         goto exit;
     }
-    if (s->num_children < 2) {
+    if (s->num_children < 1) {
         error_setg(&local_err,
-                   "Number of provided children must be greater than 1");
+                   "Number of provided children must be 1 or more");
         ret = -EINVAL;
         goto exit;
     }
@@ -926,6 +929,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     /* allocate the children array */
     s->children = g_new0(BdrvChild *, s->num_children);
     opened = g_new0(bool, s->num_children);
+    s->index_bitmap = bitmap_new(s->num_children);
 
     for (i = 0; i < s->num_children; i++) {
         char indexstr[32];
@@ -941,6 +945,8 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 
         opened[i] = true;
     }
+    bitmap_set(s->index_bitmap, 0, s->num_children);
+    s->bsize = s->num_children;
 
     g_free(opened);
     goto exit;
@@ -997,6 +1003,117 @@ static void quorum_attach_aio_context(BlockDriverState *bs,
     }
 }
 
 
+static int get_new_child_index(BDRVQuorumState *s)
+{
+    int index;
+
+    index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+    if (index < s->bsize) {
+        return index;
+    }
+
+    if ((s->bsize % BITS_PER_LONG) == 0) {
+        s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize,
+                                             s->bsize + 1);
+    }
+
+    return s->bsize++;
+}
+
+static void remove_child_index(BDRVQuorumState *s, int index)
+{
+    int last_index;
+    long new_len;
+
+    assert(index < s->bsize);
+
+    clear_bit(index, s->index_bitmap);
+    if (index < s->bsize - 1) {
+        /*
+         * The last bit is always set, and we don't clear
+         * the last bit.
+         */
+        return;
+    }
+
+    last_index = find_last_bit(s->index_bitmap, s->bsize);
+    if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) {
+        s->bsize = last_index + 1;
+        return;
+    }
+
+    new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long);
+    s->index_bitmap = g_realloc(s->index_bitmap, new_len);
+    s->bsize = last_index + 1;
+}
+
+static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    BdrvChild *child;
+    char indexstr[32];
+    int index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+    int ret;
+
+    index = get_new_child_index(s);
+    ret = snprintf(indexstr, 32, "children.%d", index);
+    if (ret < 0 || ret >= 32) {
+        error_setg(errp, "cannot generate child name");
+        return;
+    }
+
+    bdrv_drain(bs);
+
+    assert(s->num_children <= INT_MAX / sizeof(BdrvChild *));
+    if (s->num_children == INT_MAX / sizeof(BdrvChild *)) {
+        error_setg(errp, "Too many children");
+        return;
+    }
+    s->children = g_

[Qemu-devel] [Patch v8 0/3] qapi: child add/delete support

2015-11-26 Thread Wen Congyang

If a quorum child is broken, we can use a mirror job to replace it.
But sometimes the user only needs to remove the broken child and
add it back later once the problem is fixed.

This series is based on Kevin's child name related patch:
http://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg04949.html

ChangeLog:
v8:
1. Rebase to the latest code
2. Address the comments from Eric Blake
v7:
1. Remove the qmp command x-blockdev-change's parameter operation according
   to Kevin's comments.
2. Remove the hmp command.
v6:
1. Use a single qmp command x-blockdev-change to replace x-blockdev-child-add
   and x-blockdev-child-delete
v5:
1. Address Eric Blake's comments
v4:
1. drop nbd driver's implementation. We can use human-monitor-command
   to do it.
2. Rename the command name.
v3:
1. Don't open BDS in bdrv_add_child(). Use the existing BDS which is
   created by the QMP command blockdev-add.
2. The driver NBD can support filename, path, host:port now.
v2:
1. Use bdrv_get_device_or_node_name() instead of new function
   bdrv_get_id_or_node_name()
2. Update the error message
3. Update the documents in block-core.json

Wen Congyang (3):
  Add new block driver interface to add/delete a BDS's child
  quorum: implement bdrv_add_child() and bdrv_del_child()
  qmp: add monitor command to add/remove a child

 block.c   |  58 --
 block/quorum.c| 124 +-
 blockdev.c|  54 
 include/block/block.h |   9 
 include/block/block_int.h |   5 ++
 qapi/block-core.json  |  23 +
 qmp-commands.hx   |  47 ++
 7 files changed, 314 insertions(+), 6 deletions(-)

-- 
2.5.0
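
The quorum patch in this series names each child "children.<index>" and
reuses the lowest free index when a child is added after another one was
removed. A rough standalone sketch of that allocation idea follows. It
is not the patch's code: ChildIndexes, the fixed MAX_CHILDREN cap and
the function names are assumptions for the example, whereas the patch
keeps a growable bitmap in BDRVQuorumState.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_CHILDREN 64   /* hypothetical cap; the patch grows its bitmap */

typedef struct {
    bool used[MAX_CHILDREN];
} ChildIndexes;

/* Claim the lowest free index, or return -1 when all are in use. */
static int claim_child_index(ChildIndexes *c)
{
    for (int i = 0; i < MAX_CHILDREN; i++) {
        if (!c->used[i]) {
            c->used[i] = true;
            return i;
        }
    }
    return -1;
}

/* Give an index back so that a later add can reuse it. */
static void release_child_index(ChildIndexes *c, int index)
{
    assert(index >= 0 && index < MAX_CHILDREN && c->used[index]);
    c->used[index] = false;
}

/* Format the child name as "children.<index>"; -1 if it doesn't fit. */
static int child_name(char *buf, size_t len, int index)
{
    int ret = snprintf(buf, len, "children.%d", index);
    return (ret < 0 || (size_t)ret >= len) ? -1 : 0;
}
```

Reusing the freed index means a child that was removed while broken can
come back under the same name, for example children.1, once it is fixed.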

[Qemu-devel] [Patch v12 04/10] Allow creating backup jobs when opening BDS

2015-11-26 Thread Wen Congyang

When opening BDS, we need to create backup jobs for
image-fleecing.

Signed-off-by: Wen Congyang <we...@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Gonglei <arei.gong...@huawei.com>
Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Jeff Cody <jc...@redhat.com>
---
 block/Makefile.objs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..fa05f37 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -22,10 +22,10 @@ block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
 block-obj-y += write-threshold.o
+block-obj-y += backup.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
-common-obj-y += backup.o
 
 iscsi.o-cflags := $(LIBISCSI_CFLAGS)
 iscsi.o-libs   := $(LIBISCSI_LIBS)
-- 
2.5.0
