[Qemu-devel] [PATCH] MAINTAINERS: update block/sheepdog maintainers

2018-10-12 Thread Liu Yuan
From: Liu Yuan 

E-mail to Hitoshi Mitake, one of the block/sheepdog maintainers, bounces:

: unknown user: "mitake.hitoshi"

No current address is known, so just remove the entry.

Signed-off-by: Liu Yuan 
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3275cc6..6670c16 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2007,7 +2007,6 @@ F: block/rbd.c
 T: git git://github.com/codyprime/qemu-kvm-jtc.git block
 
 Sheepdog
-M: Hitoshi Mitake 
 M: Liu Yuan 
 M: Jeff Cody 
 L: qemu-bl...@nongnu.org
-- 
2.7.4




Re: [Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status

2014-07-20 Thread Liu Yuan
On Fri, Jul 18, 2014 at 03:35:57PM +0200, Paolo Bonzini wrote:
> On 17/07/2014 13:50, Liu Yuan wrote:
> > - allow drive-mirror to create sprase mirror on images like qcow2
> > - allow qemu-img map to work as expected on quorum driver
> > 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index ebf5c71..f0d0a98 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -780,6 +780,21 @@ static coroutine_fn int 
> > quorum_co_flush(BlockDriverState *bs)
> >  return result;
> >  }
> >  
> > +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState 
> > *bs,
> > +   int64_t sector_num,
> > +   int nb_sectors,
> > +   int *pnum)
> > +{
> > +BDRVQuorumState *s = bs->opaque;
> > +BlockDriverState *child_bs = s->bs[0];
> > +
> > +if (child_bs->drv->bdrv_co_get_block_status)
> > +return child_bs->drv->bdrv_co_get_block_status(child_bs, 
> > sector_num,
> > +   nb_sectors, pnum);
> 
> Is this "if" necessary?

Yes, otherwise bdrv_get_block_status() will be called multiple times and the
result for qcow2 won't be the desired value in my test. Or should we simply call
bdrv_get_block_status() plus some tricks?

> 
> > +return bdrv_get_block_status(child_bs, sector_num, nb_sectors, pnum);
> 
> Also, the definition of BDRV_BLOCK_OFFSET_VALID explicitly refers to
> bs->file, so you probably have to exclude it from the result as well as
> BDRV_BLOCK_RAW.
> 

Thanks for the reminder.

Yuan



Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking

2014-08-07 Thread Liu Yuan
On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote:
> The update is required for supporting iSCSI multipath. It doesn't
> affect behavior of QEMU driver but adding a new field to vdi request
> struct is required.
> 
> Cc: Kevin Wolf 
> Cc: Stefan Hajnoczi 
> Cc: Liu Yuan 
> Cc: MORITA Kazutaka 
> Signed-off-by: Hitoshi Mitake 
> ---
>  block/sheepdog.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 8d9350c..36f76f0 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -103,6 +103,9 @@
>  #define SD_INODE_SIZE (sizeof(SheepdogInode))
>  #define CURRENT_VDI_ID 0
>  
> +#define LOCK_TYPE_NORMAL 1
> +#define LOCK_TYPE_SHARED 2  /* for iSCSI multipath */

How about

#define LOCK_TYPE_NORMAL 0
#define LOCK_TYPE_SHARED 1

Then we don't need this patch. Since QEMU won't make use of multipath in the
near future, we should avoid adding multipath-related code to the QEMU driver.

Thanks
Yuan

> +
>  typedef struct SheepdogReq {
>  uint8_t proto_ver;
>  uint8_t opcode;
> @@ -166,7 +169,8 @@ typedef struct SheepdogVdiReq {
>  uint8_t copy_policy;
>  uint8_t reserved[2];
>  uint32_t snapid;
> -uint32_t pad[3];
> +uint32_t type;
> +uint32_t pad[2];
>  } SheepdogVdiReq;
>  
>  typedef struct SheepdogVdiRsp {
> @@ -1090,6 +1094,7 @@ static int find_vdi_name(BDRVSheepdogState *s, const 
> char *filename,
>  memset(&hdr, 0, sizeof(hdr));
>  if (lock) {
>  hdr.opcode = SD_OP_LOCK_VDI;
> +hdr.type = LOCK_TYPE_NORMAL;
>  } else {
>  hdr.opcode = SD_OP_GET_VDI_INFO;
>  }
> @@ -1793,6 +1798,7 @@ static void sd_close(BlockDriverState *bs)
>  memset(&hdr, 0, sizeof(hdr));
>  
>  hdr.opcode = SD_OP_RELEASE_VDI;
> +hdr.type = LOCK_TYPE_NORMAL;
>  hdr.base_vdi_id = s->inode.vdi_id;
>  wlen = strlen(s->name) + 1;
>  hdr.data_length = wlen;
> -- 
> 1.8.3.2
> 



Re: [Qemu-devel] [PATCH 2/2] sheepdog: improve error handling for a case of failed lock

2014-08-07 Thread Liu Yuan
On Thu, Aug 07, 2014 at 04:28:40PM +0900, Hitoshi Mitake wrote:
> Recently, sheepdog revived its VDI locking functionality. This patch
> updates sheepdog driver of QEMU for this feature:
> 
> 1. Improve error message when QEMU fails to acquire lock of
> VDI. Current sheepdog driver prints an error message "VDI isn't
> locked" when it fails to acquire lock. It is a little bit confusing
> because the mesage says VDI isn't locked but it is actually locked by
> other VM. This patch modifies this confusing message.
> 
> 2. Change error code for a case of failed locking. -EBUSY is a
> suitable one.
> 
> Reported-by: Valerio Pachera 
> Cc: Kevin Wolf 
> Cc: Stefan Hajnoczi 
> Cc: Liu Yuan 
> Cc: MORITA Kazutaka 
> Signed-off-by: Hitoshi Mitake 
> ---
>  block/sheepdog.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 36f76f0..0b3f86d 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -1112,9 +1112,13 @@ static int find_vdi_name(BDRVSheepdogState *s, const 
> char *filename,
>  
>  if (rsp->result != SD_RES_SUCCESS) {
>  error_setg(errp, "cannot get vdi info, %s, %s %" PRIu32 " %s",
> +   rsp->result == SD_RES_VDI_NOT_LOCKED ?

I'm puzzled by this check.

We use SD_RES_VDI_LOCKED to indicate that the VDI is already locked, no?

> +   "VDI is already locked by other VM" :
> sd_strerror(rsp->result), filename, snapid, tag);
>  if (rsp->result == SD_RES_NO_VDI) {
>  ret = -ENOENT;
> +} else if (rsp->result == SD_RES_VDI_NOT_LOCKED) {
> +ret = -EBUSY;
>  } else {
>  ret = -EIO;
>  }

It would be better to use a switch statement to handle the result.
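
For reference, a switch-based version of the error mapping might look like the
sketch below. This is only an illustration of the suggestion, not the actual
patch; the SD_RES_* values are declared locally as placeholders rather than
taken from the real protocol headers.

    #include <errno.h>

    /* Placeholder result codes, for illustration only. */
    enum { SD_RES_SUCCESS, SD_RES_NO_VDI, SD_RES_VDI_NOT_LOCKED };

    static int sd_result_to_errno(int result)
    {
        switch (result) {
        case SD_RES_SUCCESS:
            return 0;
        case SD_RES_NO_VDI:
            return -ENOENT;        /* VDI does not exist */
        case SD_RES_VDI_NOT_LOCKED:
            return -EBUSY;         /* lock could not be acquired */
        default:
            return -EIO;           /* any other sheep error */
        }
    }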

Thanks
Yuan



[Qemu-devel] [PATCH] cluster/zookeeper: add log information for zk auto-reconnect

2014-08-07 Thread Liu Yuan
Reported-by: Valerio Pachera 
Signed-off-by: Liu Yuan 
---
 sheep/group.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sheep/group.c b/sheep/group.c
index 06a80bd..08e3884 100644
--- a/sheep/group.c
+++ b/sheep/group.c
@@ -979,7 +979,7 @@ static int send_join_request(void)
 {
struct sd_node *n = &sys->this_node;
 
-   sd_info("%s", node_to_str(n));
+   sd_info("%s going to rejoin the cluster", node_to_str(n));
return sys->cdrv->join(n, &sys->cinfo, sizeof(sys->cinfo));
 }
 
-- 
1.9.1




Re: [Qemu-devel] [PATCH 2/2] sheepdog: improve error handling for a case of failed lock

2014-08-07 Thread Liu Yuan
On Fri, Aug 08, 2014 at 03:17:59PM +0900, Hitoshi Mitake wrote:
> At Fri, 8 Aug 2014 13:31:39 +0800,
> Liu Yuan wrote:
> > 
> > On Thu, Aug 07, 2014 at 04:28:40PM +0900, Hitoshi Mitake wrote:
> > > Recently, sheepdog revived its VDI locking functionality. This patch
> > > updates sheepdog driver of QEMU for this feature:
> > > 
> > > 1. Improve error message when QEMU fails to acquire lock of
> > > VDI. Current sheepdog driver prints an error message "VDI isn't
> > > locked" when it fails to acquire lock. It is a little bit confusing
> > > because the mesage says VDI isn't locked but it is actually locked by
> > > other VM. This patch modifies this confusing message.
> > > 
> > > 2. Change error code for a case of failed locking. -EBUSY is a
> > > suitable one.
> > > 
> > > Reported-by: Valerio Pachera 
> > > Cc: Kevin Wolf 
> > > Cc: Stefan Hajnoczi 
> > > Cc: Liu Yuan 
> > > Cc: MORITA Kazutaka 
> > > Signed-off-by: Hitoshi Mitake 
> > > ---
> > >  block/sheepdog.c | 4 
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > > index 36f76f0..0b3f86d 100644
> > > --- a/block/sheepdog.c
> > > +++ b/block/sheepdog.c
> > > @@ -1112,9 +1112,13 @@ static int find_vdi_name(BDRVSheepdogState *s, 
> > > const char *filename,
> > >  
> > >  if (rsp->result != SD_RES_SUCCESS) {
> > >  error_setg(errp, "cannot get vdi info, %s, %s %" PRIu32 " %s",
> > > +   rsp->result == SD_RES_VDI_NOT_LOCKED ?
> > 
> > I'm puzzled by this check.
> > 
> > we use SD_RES_VDI_LOCKED to indicate vid is already locked, no?
> 
> We use SD_RES_VDI_NOT_LOCKED for indicating locking by this VM fails.
> 
> > 
> > > +   "VDI is already locked by other VM" :

But this message says it was locked by another VM, and we have SD_RES_VDI_LOCKED
for that case.

We need to fix the sheep daemon to return SD_RES_VDI_LOCKED for the already-locked
case and SD_RES_VDI_NOT_LOCKED for other internal sheep errors.

> > > sd_strerror(rsp->result), filename, snapid, tag);
> > >  if (rsp->result == SD_RES_NO_VDI) {
> > >  ret = -ENOENT;
> > > +} else if (rsp->result == SD_RES_VDI_NOT_LOCKED) {
> > > +ret = -EBUSY;
> > >  } else {
> > >  ret = -EIO;
> > >  }
> > 
> > It is better to use switch case to handle the result.
> 
> using switch statement in this case only increases a number of lines
> of code:
> 
> Current change:
> if (rsp->result == SD_RES_NO_VDI) {
> ret = -ENOENT;
> } else if (rsp->result == SD_RES_VDI_NOT_LOCKED) {
> ...
> 
> Change with switch:
> switch (rsp->result) {
>   case SD_RES_NO_VDI:
> ret = -ENOENT;
>   break;
>   case SD_RES_VDI_NOT_LOCKED:
> ...
> 
> The change with switch statement requires one more line for break;. I
> think if statement is suitable for this case.

If you insist on 'if-else' over switch-case, that is fine with me. But I'd suggest
switch-case because it looks cleaner and is easier to understand when we have more
than two branches.

Thanks
Yuan



Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking

2014-08-08 Thread Liu Yuan
On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote:
> At Fri, 8 Aug 2014 13:20:39 +0800,
> Liu Yuan wrote:
> > 
> > On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote:
> > > The update is required for supporting iSCSI multipath. It doesn't
> > > affect behavior of QEMU driver but adding a new field to vdi request
> > > struct is required.
> > > 
> > > Cc: Kevin Wolf 
> > > Cc: Stefan Hajnoczi 
> > > Cc: Liu Yuan 
> > > Cc: MORITA Kazutaka 
> > > Signed-off-by: Hitoshi Mitake 
> > > ---
> > >  block/sheepdog.c | 8 +++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > > index 8d9350c..36f76f0 100644
> > > --- a/block/sheepdog.c
> > > +++ b/block/sheepdog.c
> > > @@ -103,6 +103,9 @@
> > >  #define SD_INODE_SIZE (sizeof(SheepdogInode))
> > >  #define CURRENT_VDI_ID 0
> > >  
> > > +#define LOCK_TYPE_NORMAL 1
> > > +#define LOCK_TYPE_SHARED 2  /* for iSCSI multipath */
> > 
> > How about
> > 
> > #define LOCK_TYPE_NORMAL 0
> > #define LOCK_TYPE_SHARED 1
> > 
> > Then we don't need this patch. Since qemu won't make use of multipath for 
> > the
> > near future, we should avoid adding stuff related to multipath to qemu 
> > driver.
> 
> (Cc-ing current Kazutaka-san's address)
> 
> I think this isn't a good idea. Because it means that sheep has an
> assumption about padding field of the request data struct. This sort
> of workaround can cause hard to find problems in the future.
> 

The point is, how do we keep backward compatibility? E.g., old QEMU against a
current sheep daemon with the lock feature. The old QEMU will send 0 instead of 1
to the sheep daemon, and depending on how you handle the invalid value, this might
cause problems.

Suppose one old QEMU and one current QEMU try to start the same image A. The old
QEMU will send an invalid 0 to the sheep daemon. We shouldn't deny it, because it
can run with old sheep and should run with new sheep too. Then image A isn't
locked, due to the invalid 0 value. Then the current QEMU sends a correct LOCK
request and will wrongly set up the image.

Starting with 0 instead of 1 is, in my opinion, reasonable to keep backward
compatibility.
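
To make the zero-padding argument concrete, here is a small self-contained
sketch. The struct is a trimmed-down stand-in for SheepdogVdiReq (the real one
has more fields, see the quoted patch), and the LOCK_TYPE_* values shown are
the suggested ones, not the ones in the posted patch.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Trimmed-down stand-in for the tail of SheepdogVdiReq. */
    typedef struct {
        uint32_t snapid;
        uint32_t type;     /* new field, carved out of the old pad[3] */
        uint32_t pad[2];
    } vdi_req_tail;

    #define LOCK_TYPE_NORMAL 0   /* suggested: matches the old zero-filled pad */
    #define LOCK_TYPE_SHARED 1   /* iSCSI multipath */

    int main(void)
    {
        vdi_req_tail old_qemu;

        /* Old QEMU zero-fills the whole header and never sets 'type'. */
        memset(&old_qemu, 0, sizeof(old_qemu));

        /* With LOCK_TYPE_NORMAL == 0 the old request is already a valid
         * normal lock request; with 1/2 it arrives with an invalid type. */
        printf("old QEMU sends type=%u -> %s\n", (unsigned)old_qemu.type,
               old_qemu.type == LOCK_TYPE_NORMAL ? "normal lock" : "invalid");
        return 0;
    }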

Thanks
Yuan



Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking

2014-08-10 Thread Liu Yuan
On Mon, Aug 11, 2014 at 11:17:33AM +0900, Hitoshi Mitake wrote:
> At Fri, 8 Aug 2014 15:49:37 +0800,
> Liu Yuan wrote:
> > 
> > On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote:
> > > At Fri, 8 Aug 2014 13:20:39 +0800,
> > > Liu Yuan wrote:
> > > > 
> > > > On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote:
> > > > > The update is required for supporting iSCSI multipath. It doesn't
> > > > > affect behavior of QEMU driver but adding a new field to vdi request
> > > > > struct is required.
> > > > > 
> > > > > Cc: Kevin Wolf 
> > > > > Cc: Stefan Hajnoczi 
> > > > > Cc: Liu Yuan 
> > > > > Cc: MORITA Kazutaka 
> > > > > Signed-off-by: Hitoshi Mitake 
> > > > > ---
> > > > >  block/sheepdog.c | 8 +++-
> > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > > > > index 8d9350c..36f76f0 100644
> > > > > --- a/block/sheepdog.c
> > > > > +++ b/block/sheepdog.c
> > > > > @@ -103,6 +103,9 @@
> > > > >  #define SD_INODE_SIZE (sizeof(SheepdogInode))
> > > > >  #define CURRENT_VDI_ID 0
> > > > >  
> > > > > +#define LOCK_TYPE_NORMAL 1
> > > > > +#define LOCK_TYPE_SHARED 2  /* for iSCSI multipath */
> > > > 
> > > > How about
> > > > 
> > > > #define LOCK_TYPE_NORMAL 0
> > > > #define LOCK_TYPE_SHARED 1
> > > > 
> > > > Then we don't need this patch. Since qemu won't make use of multipath 
> > > > for the
> > > > near future, we should avoid adding stuff related to multipath to qemu 
> > > > driver.
> > > 
> > > (Cc-ing current Kazutaka-san's address)
> > > 
> > > I think this isn't a good idea. Because it means that sheep has an
> > > assumption about padding field of the request data struct. This sort
> > > of workaround can cause hard to find problems in the future.
> > > 
> > 
> > The point is, how to keep backward compatibilty? E.g, old QEMU with present
> > sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the 
> > sheep
> > daemon and based on how you handle the invalid value, these might cause 
> > problems.
> > 
> > Suppose we have 1 old QEMU and 1 present QEMU who try to start the same 
> > image A.
> > Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because 
> > it
> > can run with old sheep and should run on new sheep too. Then this image A 
> > isn't
> > locked due to invalid 0 value. Then present QEMU send correct LOCK signal 
> > and
> > will wrongly set up the image.
> > 
> > Start with 0 instead of 1, in my option, is reasonable to keep backward
> > compatibility.
> 
> I don't think so. Because the backward compatibility of the locking
> functionality is already broken in the far past.
> 
> As Meng repoted in the sheepdog mailing list, old QEMU can issue
> locking request to sheep with vid == 0:
> http://lists.wpkg.org/pipermail/sheepdog/2014-August/015438.html

I don't really understand why we can't start with 0 and keep backward
compatibility. By the way, I think the link has nothing to do with QEMU; it is
a bug in sheep.

Locking has two states: one is lock and the other is unlock.

We choose 0 to mean 'lock' the VDI and 1 to 'unlock' the VDI.

So both old and new QEMU issue 0 to lock the image and a 'release' request to
unlock the image. What is in the way?

> 
> Even we set the default lock type as 0, the old QEMU cannot issue a
> correct locking request. 

why?

> I'll post a patch for incrementing protocol
> version number later. But before doing that, I also want to clean
> DISCARD request. Because this request cannot work with snapshot (not
> only with the new GC algorithm. The old naive algorithm cannot work
> with the DISCARD request).

If you remove discard, what about users who run new QEMU with old sheep, which
uses the old algorithm, and who expect discard to work?

I don't think removing an operation from the protocol is a good idea. If the
sheep daemon can't support discard, you can simply have it return success
without doing anything.

Thanks
Yuan




Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking

2014-08-10 Thread Liu Yuan
On Mon, Aug 11, 2014 at 11:34:56AM +0800, Liu Yuan wrote:
> On Mon, Aug 11, 2014 at 11:17:33AM +0900, Hitoshi Mitake wrote:
> > At Fri, 8 Aug 2014 15:49:37 +0800,
> > Liu Yuan wrote:
> > > 
> > > On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote:
> > > > At Fri, 8 Aug 2014 13:20:39 +0800,
> > > > Liu Yuan wrote:
> > > > > 
> > > > > On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote:
> > > > > > The update is required for supporting iSCSI multipath. It doesn't
> > > > > > affect behavior of QEMU driver but adding a new field to vdi request
> > > > > > struct is required.
> > > > > > 
> > > > > > Cc: Kevin Wolf 
> > > > > > Cc: Stefan Hajnoczi 
> > > > > > Cc: Liu Yuan 
> > > > > > Cc: MORITA Kazutaka 
> > > > > > Signed-off-by: Hitoshi Mitake 
> > > > > > ---
> > > > > >  block/sheepdog.c | 8 +++-
> > > > > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > > > > > index 8d9350c..36f76f0 100644
> > > > > > --- a/block/sheepdog.c
> > > > > > +++ b/block/sheepdog.c
> > > > > > @@ -103,6 +103,9 @@
> > > > > >  #define SD_INODE_SIZE (sizeof(SheepdogInode))
> > > > > >  #define CURRENT_VDI_ID 0
> > > > > >  
> > > > > > +#define LOCK_TYPE_NORMAL 1
> > > > > > +#define LOCK_TYPE_SHARED 2  /* for iSCSI multipath */
> > > > > 
> > > > > How about
> > > > > 
> > > > > #define LOCK_TYPE_NORMAL 0
> > > > > #define LOCK_TYPE_SHARED 1
> > > > > 
> > > > > Then we don't need this patch. Since qemu won't make use of multipath 
> > > > > for the
> > > > > near future, we should avoid adding stuff related to multipath to 
> > > > > qemu driver.
> > > > 
> > > > (Cc-ing current Kazutaka-san's address)
> > > > 
> > > > I think this isn't a good idea. Because it means that sheep has an
> > > > assumption about padding field of the request data struct. This sort
> > > > of workaround can cause hard to find problems in the future.
> > > > 
> > > 
> > > The point is, how to keep backward compatibilty? E.g, old QEMU with 
> > > present
> > > sheep daemon with lock features. Then QEMU will send 0 instead of 1 to 
> > > the sheep
> > > daemon and based on how you handle the invalid value, these might cause 
> > > problems.
> > > 
> > > Suppose we have 1 old QEMU and 1 present QEMU who try to start the same 
> > > image A.
> > > Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it 
> > > because it
> > > can run with old sheep and should run on new sheep too. Then this image A 
> > > isn't
> > > locked due to invalid 0 value. Then present QEMU send correct LOCK signal 
> > > and
> > > will wrongly set up the image.
> > > 
> > > Start with 0 instead of 1, in my option, is reasonable to keep backward
> > > compatibility.
> > 
> > I don't think so. Because the backward compatibility of the locking
> > functionality is already broken in the far past.
> > 
> > As Meng repoted in the sheepdog mailing list, old QEMU can issue
> > locking request to sheep with vid == 0:
> > http://lists.wpkg.org/pipermail/sheepdog/2014-August/015438.html
> 
> I don't really understand why we can't start with 0 and can't keep backward
> compatibility. By the way, I think the link has nothing to do with qemu, it is
> a bug in sheep.
> 
> locking has two state, one is lock and the other unlock.
> 
> We choose 0 to mean 'lock' the vdi and 1 to 'unlock' the vdi.
> 
> So both old and new QEMU issue 0 to lock the image and 'release' request to
> unlock the image. What is in the way?
> 
> > 
> > Even we set the default lock type as 0, the old QEMU cannot issue a
> > correct locking request. 
> 
> why?
> 
> > I'll post a patch for incrementing protocol
> > version number later. But before doing that, I also want to clean
> > DISCARD request. Because this request cannot work with snapshot (not
> > only with the new GC algorithm. The old naive algorithm cannot work
> > with the DISCARD request).
> 
> If you remove discard, what if users run new qemu with old sheep, which make
> use of old algorithm and people expect discard to work?

Okay, you mean discard can't work with snapshots but should work without them,
so users who don't use snapshots can expect it to work. As stated in my last
email, you can handle this problem transparently without modifying the protocol:

QEMU --> discard[offset, length] --> sheep

sheep:
  if req is on a snapshot
     return success;
  else
     return sheep_handle_discard();
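
A minimal self-contained sketch of that transparent handling is below. All the
names here (discard_req, on_snapshot, do_discard) are hypothetical placeholders
for this illustration, not the real sheep daemon API.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SD_RES_SUCCESS 0   /* placeholder value for this sketch */

    struct discard_req {
        uint32_t vid;
        uint64_t offset;
        uint64_t length;
        bool on_snapshot;
    };

    /* Stand-in for the real discard path. */
    static int do_discard(const struct discard_req *req)
    {
        printf("discarding %llu bytes at offset %llu of vid %u\n",
               (unsigned long long)req->length,
               (unsigned long long)req->offset, (unsigned)req->vid);
        return SD_RES_SUCCESS;
    }

    /* Report success without doing anything when discard cannot be honoured
     * (e.g. the VDI is a snapshot), so QEMU never needs a protocol change. */
    static int handle_discard(const struct discard_req *req)
    {
        if (req->on_snapshot) {
            return SD_RES_SUCCESS;
        }
        return do_discard(req);
    }

    int main(void)
    {
        struct discard_req r = { .vid = 1, .offset = 0, .length = 4096,
                                 .on_snapshot = false };
        return handle_discard(&r) == SD_RES_SUCCESS ? 0 : 1;
    }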

Thanks
Yuan




Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support

2014-08-11 Thread Liu Yuan
On Thu, Jul 17, 2014 at 01:18:56PM +0800, Liu Yuan wrote:
> This patch adds single read pattern to quorum driver and quorum vote is 
> default
> pattern.
> 
> For now we do a quorum vote on all the reads, it is designed for unreliable
> underlying storage such as non-redundant NFS to make sure data integrity at 
> the
> cost of the read performance.
> 
> For some use cases as following:
> 
> VM
>   --
>   ||
>   vv
>   AB
> 
> Both A and B has hardware raid storage to justify the data integrity on its 
> own.
> So it would help performance if we do a single read instead of on all the 
> nodes.
> Further, if we run VM on either of the storage node, we can make a local read
> request for better performance.
> 
> This patch generalize the above 2 nodes case in the N nodes. That is,
> 
> vm -> write to all the N nodes, read just one of them. If single read fails, 
> we
> try to read next node in FIFO order specified by the startup command.
> 
> The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> functionality in the single device/node failure for now. But compared with 
> DRBD
> we still have some advantages over it:
> 
> - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> storage. And if A crashes, we need to restart all the VMs on node B. But for
> practice case, we can't because B might not have enough resources to setup 20 
> VMs
> at once. So if we run our 20 VMs with quorum driver, and scatter the 
> replicated
> images over the data center, we can very likely restart 20 VMs without any
> resource problem.
> 
> After all, I think we can build a more powerful replicated image functionality
> on quorum and block jobs(block mirror) to meet various High Availibility 
> needs.
> 
> E.g, Enable single read pattern on 2 children,
> 
> -drive driver=quorum,children.0.file.filename=0.qcow2,\
> children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> 
> [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> 
> Cc: Benoit Canet 
> Cc: Eric Blake 
> Cc: Kevin Wolf 
> Cc: Stefan Hajnoczi 
> Signed-off-by: Liu Yuan 
> ---
>  block/quorum.c | 179 
> +
>  1 file changed, 131 insertions(+), 48 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index d5ee9c0..ebf5c71 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -24,6 +24,7 @@
>  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
>  #define QUORUM_OPT_BLKVERIFY  "blkverify"
>  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
>  
>  /* This union holds a vote hash value */
>  typedef union QuorumVoteValue {
> @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
>  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> corrupted
>  * block if Quorum is reached.
>  */
> +
> +QuorumReadPattern read_pattern;
>  } BDRVQuorumState;
>  
>  typedef struct QuorumAIOCB QuorumAIOCB;
> @@ -117,6 +120,7 @@ struct QuorumAIOCB {
>  
>  bool is_read;
>  int vote_ret;
> +int child_iter; /* which child to read in fifo pattern */
>  };
>  
>  static bool quorum_vote(QuorumAIOCB *acb);
> @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
>  
>  if (acb->is_read) {
>  for (i = 0; i < s->num_children; i++) {
> -qemu_vfree(acb->qcrs[i].buf);
> -qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +if (i <= acb->child_iter) {
> +qemu_vfree(acb->qcrs[i].buf);
> +qemu_iovec_destroy(&acb->qcrs[i].qiov);
> +}
>  }
>  }
>  
> @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
>  quorum_aio_finalize(acb);
>  }
>  
> +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
> +
> +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
> +{
> +int i;
> +assert(dest->niov == source->niov);
> +assert(dest->size == source->size);
> +for (i = 0; i < source->niov; i++) {
> +assert(dest->iov[i].iov_len == source->iov[i].iov_len);
> +memcpy(dest->iov[i].iov_base,
> +   source->iov[i].iov_base,
> +   source->iov[i].iov_len);
> +}
> +}
> +
>  static void quorum_aio_cb(void *opaque, int ret)
>  {
>  

Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support

2014-08-11 Thread Liu Yuan
On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote:
> The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> > VM
> >   --
> >   ||
> >   vv
> >   AB
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its 
> > own.
> > So it would help performance if we do a single read instead of on all the 
> > nodes.
> > Further, if we run VM on either of the storage node, we can make a local 
> > read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read 
> > fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with 
> > DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 
> > 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image 
> > functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility 
> > needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet 
> > Cc: Eric Blake 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 179 
> > +
> >  1 file changed, 131 insertions(+), 48 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index d5ee9c0..ebf5c71 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -24,6 +24,7 @@
> >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> >  #define QUORUM_OPT_BLKVERIFY  "blkverify"
> >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> >  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> > corrupted
> >  * block if Quorum is reached.
> >  */
> > +
> > +QuorumReadPattern read_pattern;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> >  
> >  bool is_read;
> >  int vote_ret;
> > +int child_iter; /* which child to read in fifo pattern */
> >  };
> >  
> >  static bool quorum_vote(QuorumAIOCB *acb);
> > @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >  
> >  if (acb->is_read) {
> >  for (i = 0; i < s->num_children; i++) {
> > -qemu_vfree(acb->qcrs[i].buf);
> > -qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +if (i <= acb->child_iter) {
> > +qemu_vfree(acb->qcrs[i].buf);
> > +qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +}
> >  }
> >  }
> >  
> > @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int 
> > ret)
> >  quorum_aio_finalize(acb);
> >  }
> >  
> > +st

Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support

2014-08-14 Thread Liu Yuan
On Tue, Aug 12, 2014 at 10:41:28AM +0800, Liu Yuan wrote:
> On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote:
> > The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> > > This patch adds single read pattern to quorum driver and quorum vote is 
> > > default
> > > pattern.
> > > 
> > > For now we do a quorum vote on all the reads, it is designed for 
> > > unreliable
> > > underlying storage such as non-redundant NFS to make sure data integrity 
> > > at the
> > > cost of the read performance.
> > > 
> > > For some use cases as following:
> > > 
> > > VM
> > >   --
> > >   ||
> > >   vv
> > >   AB
> > > 
> > > Both A and B has hardware raid storage to justify the data integrity on 
> > > its own.
> > > So it would help performance if we do a single read instead of on all the 
> > > nodes.
> > > Further, if we run VM on either of the storage node, we can make a local 
> > > read
> > > request for better performance.
> > > 
> > > This patch generalize the above 2 nodes case in the N nodes. That is,
> > > 
> > > vm -> write to all the N nodes, read just one of them. If single read 
> > > fails, we
> > > try to read next node in FIFO order specified by the startup command.
> > > 
> > > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > > functionality in the single device/node failure for now. But compared 
> > > with DRBD
> > > we still have some advantages over it:
> > > 
> > > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > > storage. And if A crashes, we need to restart all the VMs on node B. But 
> > > for
> > > practice case, we can't because B might not have enough resources to 
> > > setup 20 VMs
> > > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > > replicated
> > > images over the data center, we can very likely restart 20 VMs without any
> > > resource problem.
> > > 
> > > After all, I think we can build a more powerful replicated image 
> > > functionality
> > > on quorum and block jobs(block mirror) to meet various High Availibility 
> > > needs.
> > > 
> > > E.g, Enable single read pattern on 2 children,
> > > 
> > > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > > 
> > > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > > 
> > > Cc: Benoit Canet 
> > > Cc: Eric Blake 
> > > Cc: Kevin Wolf 
> > > Cc: Stefan Hajnoczi 
> > > Signed-off-by: Liu Yuan 
> > > ---
> > >  block/quorum.c | 179 
> > > +
> > >  1 file changed, 131 insertions(+), 48 deletions(-)
> > > 
> > > diff --git a/block/quorum.c b/block/quorum.c
> > > index d5ee9c0..ebf5c71 100644
> > > --- a/block/quorum.c
> > > +++ b/block/quorum.c
> > > @@ -24,6 +24,7 @@
> > >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> > >  #define QUORUM_OPT_BLKVERIFY  "blkverify"
> > >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> > > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> > >  
> > >  /* This union holds a vote hash value */
> > >  typedef union QuorumVoteValue {
> > > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> > >  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> > > corrupted
> > >  * block if Quorum is reached.
> > >  */
> > > +
> > > +QuorumReadPattern read_pattern;
> > >  } BDRVQuorumState;
> > >  
> > >  typedef struct QuorumAIOCB QuorumAIOCB;
> > > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> > >  
> > >  bool is_read;
> > >  int vote_ret;
> > > +int child_iter; /* which child to read in fifo pattern */
> > >  };
> > >  
> > >  static bool quorum_vote(QuorumAIOCB *acb);
> > > @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> > >  
> > >  if (acb->is_read) {
> > >  for (i

Re: [Qemu-devel] [PATCH v3 0/2] sheepdog driver update related to VDI locking feature

2014-08-14 Thread Liu Yuan
On Mon, Aug 11, 2014 at 02:43:44PM +0900, Hitoshi Mitake wrote:
> Recently, sheepdog revived VDI locking functionality. This patchset
> updates sheepdog driver of QEMU for this feature.
> 
> v3:
>  - keep backward compatibility
> 
> v2:
>  - don't handle SD_RES_VDI_NOT_LOCKED as a special case   
> 
> Hitoshi Mitake (2):
>   sheepdog: adopting protocol update for VDI locking
>   sheepdog: improve error handling for a case of failed lock
> 
>  block/sheepdog.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 

Reviewed-by: Liu Yuan 



Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support

2014-08-14 Thread Liu Yuan
On Thu, Aug 14, 2014 at 01:09:32PM +0200, Benoît Canet wrote:
> The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote :
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> > VM
> >   --
> >   ||
> >   vv
> >   AB
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its 
> > own.
> > So it would help performance if we do a single read instead of on all the 
> > nodes.
> > Further, if we run VM on either of the storage node, we can make a local 
> > read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read 
> > fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with 
> > DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 
> > 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image 
> > functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility 
> > needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet 
> > Cc: Eric Blake 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 179 
> > +
> >  1 file changed, 131 insertions(+), 48 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index d5ee9c0..ebf5c71 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -24,6 +24,7 @@
> >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> >  #define QUORUM_OPT_BLKVERIFY  "blkverify"
> >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> >  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> > corrupted
> >  * block if Quorum is reached.
> >  */
> > +
> > +QuorumReadPattern read_pattern;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> >  
> >  bool is_read;
> >  int vote_ret;
> > +int child_iter; /* which child to read in fifo pattern */
> >  };
> >  
> >  static bool quorum_vote(QuorumAIOCB *acb);
> > @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >  
> >  if (acb->is_read) {
> >  for (i = 0; i < s->num_children; i++) {
> > -qemu_vfree(acb->qcrs[i].buf);
> > -qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +if (i <= acb->child_iter) {
> > +qemu_vfree(acb->qcrs[i].buf);
> > +qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +}
> 
> This seems convoluted.
> 
> What about ?
>   /* on the quorum case acb->child_iter == s->num_children - 1 */
>   for (i = 0; i <= acb->child_iter; i++) {
>   qemu_vfree(acb->qcrs[i].buf);
>   qemu_iovec_destroy(&acb->qcrs[i].qiov);
>   }
>   
> >  }

Sounds good. I'll update this in the v5 patch.

Thanks
Yuan



[Qemu-devel] [PATCH v5 0/2] add read-pattern for block quorum

2014-08-14 Thread Liu Yuan
v5:
 - simplify a for loop in quorum_aio_finalize()

v4:
 - swap the patch order
 - update comment for fifo pattern in qapi
 - use qapi enumeration in quorum driver instead of manual parsing

v3:
 - separate patch into two, one for quorum and one for qapi for easier review
 - add enumeration for quorum read pattern
 - remove unrelated blank line fix from this patch set

v2:
 - rename single as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for fifo pattern

This patch adds a single-read pattern to the quorum driver; quorum vote is the
default pattern.

For now we do a quorum vote on all reads. This is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider the following use case:

             VM
          --------
          |      |
          v      v
          A      B

Both A and B have hardware RAID storage to guarantee data integrity on their
own. So it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

VM -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD [1], though it lacks auto-sync
functionality after a single device/node failure for now. But compared with
DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. So
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

In the end, I think we can build more powerful replicated-image functionality
on top of quorum and block jobs (block mirror) to meet various high-availability
needs.

E.g., enable the single-read pattern on 2 children:
  
-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Eric Blake 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
Liu Yuan (2):
  qapi: add read-pattern enum for quorum
  block/quorum: add simple read pattern support

 block/quorum.c   | 176 +--
 qapi/block-core.json |  20 +-
 2 files changed, 148 insertions(+), 48 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support

2014-08-14 Thread Liu Yuan
This patch adds a single-read pattern to the quorum driver; quorum vote is the
default pattern.

For now we do a quorum vote on all reads. This is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider the following use case:

             VM
          --------
          |      |
          v      v
          A      B

Both A and B have hardware RAID storage to guarantee data integrity on their
own. So it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

VM -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD [1], though it lacks auto-sync
functionality after a single device/node failure for now. But compared with
DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. So
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

In the end, I think we can build more powerful replicated-image functionality
on top of quorum and block jobs (block mirror) to meet various high-availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Eric Blake 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 176 ++---
 1 file changed, 129 insertions(+), 47 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1235d7c 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY  "blkverify"
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+
+QuorumReadPattern read_pattern;
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +120,7 @@ struct QuorumAIOCB {
 
 bool is_read;
 int vote_ret;
+int child_iter; /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -153,7 +157,8 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 acb->common.cb(acb->common.opaque, ret);
 
 if (acb->is_read) {
-for (i = 0; i < s->num_children; i++) {
+/* on the quorum case acb->child_iter == s->num_children - 1 */
+for (i = 0; i <= acb->child_iter; i++) {
 qemu_vfree(acb->qcrs[i].buf);
 qemu_iovec_destroy(&acb->qcrs[i].qiov);
 }
@@ -256,6 +261,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+int i;
+assert(dest->niov == source->niov);
+assert(dest->size == source->size);
+for (i = 0; i < source->niov; i++) {
+assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+memcpy(dest->iov[i].iov_base,
+   source->iov[i].iov_base,
+   source->iov[i].iov_len);
+}
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -263,6 +283,21 @@ static void quorum_aio_cb(void *opaque, int ret)
 BDRVQuorumState *s = acb->common.bs->opaque;
 bool rewrite = false;
 
+if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
+/* We try to read next child in FIFO order if we fail to read */
+if (ret < 0 && ++acb->child_iter < s->num_children) {
+read_fifo_child(acb);
+return;
+}
+
+if (ret == 0) {
+quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+}
+acb->vote_ret = ret;
+quorum_aio_finalize(acb);
+return;
+}
+
 sacb->ret =

[Qemu-devel] [PATCH v5 1/2] qapi: add read-pattern enum for quorum

2014-08-14 Thread Liu Yuan
Cc: Eric Blake 
Reviewed-by: Eric Blake 
Signed-off-by: Liu Yuan 
---
 qapi/block-core.json | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..42033d9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,19 @@
 'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+#
+# An enumeration of quorum read patterns.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read only from the first child that has not failed
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1411,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 # (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#(Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
 'children': [ 'BlockdevRef' ],
-'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+'vote-threshold': 'int',
+'*rewrite-corrupted': 'bool',
+'*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1




Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver

2014-09-10 Thread Liu Yuan
On Sun, Sep 07, 2014 at 05:12:31PM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> > This patch set mainly add mainly two logics to implement device recover
> > - notify qourum driver of the broken states from the child driver(s)
> > - dirty track and sync the device after it is repaired
> > 
> > Thus quorum allow VMs to continue while some child devices are broken and 
> > when
> > the child devices are repaired and return back, we sync dirty bits during
> > downtime to keep data consistency.
> > 
> > The recovery logic is based on the driver state bitmap and will sync the 
> > dirty
> > bits with a timeslice window in a coroutine in this prtimive implementation.
> > 
> > Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> > (similary to DRBD)
> > 
> > + denote device sync iteration
> > - IO on a single device
> > = IO on two devices
> > 
> >   sync complete, release dirty bitmap
> >  ^
> >  |
> >   -++==
> >  | |
> >  | v
> >  |   device repaired and begin to sync
> >  v
> >device broken, create a dirty bitmap
> > 
> >   This sync logic can take care of nested broken problem, that devices are
> >   broken while in sync. We just start a sync process after the devices are
> >   repaired again and switch the devices from broken to sound only when the 
> > sync
> >   completes.
> > 
> > For read-pattern=quorum mode, it enjoys the recovery logic without any 
> > problem.
> > 
> > Todo:
> > - use aio interface to sync data (multiple transfer in one go)
> > - dynamic slice window to control sync bandwidth more smoothly
> > - add auto-reconnection mechanism to other protocol (if not support yet)
> > - add tests
> > 
> > Cc: Eric Blake 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > 
> > Liu Yuan (8):
> >   block/quorum: initialize qcrs.aiocb for read
> >   block: add driver operation callbacks
> >   block/sheepdog: propagate disconnect/reconnect events to upper driver
> >   block/quorum: add quorum_aio_release() helper
> >   quorum: fix quorum_aio_cancel()
> >   block/quorum: add broken state to BlockDriverState
> >   block: add two helpers
> >   quorum: add basic device recovery logic
> > 
> >  block.c   |  17 +++
> >  block/quorum.c| 324 
> > +-
> >  block/sheepdog.c  |   9 ++
> >  include/block/block.h |   9 ++
> >  include/block/block_int.h |   6 +
> >  trace-events  |   5 +
> >  6 files changed, 336 insertions(+), 34 deletions(-)
> > 
> > -- 
> > 1.9.1
> > 
> 
> Hi liu,
> 
> Had you noticed that your series conflict with one of Fam's series in the 
> quorum cancel
> function fix patch ?

Not yet, thanks for the reminder.

> Could you find an arrangement with Fam so the two patches don't collide 
> anymore ?
> 
> Do you intend to respin your series ?

Yes, I'll rebase and post v2 later, before any further reviews.

Thanks
Yuan



Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver

2014-09-10 Thread Liu Yuan
On Wed, Sep 03, 2014 at 12:19:14AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote :
> 
> Liu,
> 
> Do you think this could work with qcow2 file backed by NFS servers ?

It depends on which client we use.

If we use the Linux NFS client via the 'posix file' protocol, I guess we need to
hack raw-posix.c and insert reconnect/disconnect callbacks.

I'm also exploring the possibility of using QEMU's NFS client, block/nfs.c.

Either way, this should work with all the protocols given the proper hacks.

Thanks
Yuan



[Qemu-devel] [PATCH] sheepdog: fix NULL dereference in sd_create

2014-06-16 Thread Liu Yuan
The following command

qemu-img create -f qcow2 sheepdog:test 20g

will cause a core dump because aio_context is NULL in sd_create(). We should
initialize it with qemu_get_aio_context() to avoid the NULL dereference.

Cc: qemu-devel@nongnu.org
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/sheepdog.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 1fa1939..47a8b5a 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -1761,6 +1761,7 @@ static int sd_create(const char *filename, 
QEMUOptionParameter *options,
 bdrv_unref(bs);
 }
 
+s->aio_context = qemu_get_aio_context();
 ret = do_sd_create(s, &vid, 0, errp);
 if (ret) {
 goto out;
-- 
1.8.1.2




Re: [Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support

2014-08-17 Thread Liu Yuan
On Fri, Aug 15, 2014 at 03:59:04PM +0200, Benoît Canet wrote:
> The Friday 15 Aug 2014 à 13:05:17 (+0800), Liu Yuan wrote :
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> > VM
> >   --
> >   ||
> >   vv
> >   AB
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its 
> > own.
> > So it would help performance if we do a single read instead of on all the 
> > nodes.
> > Further, if we run VM on either of the storage node, we can make a local 
> > read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read 
> > fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with 
> > DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 
> > 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image 
> > functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility 
> > needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet 
> > Cc: Eric Blake 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 176 
> > ++---
> >  1 file changed, 129 insertions(+), 47 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index d5ee9c0..1235d7c 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -24,6 +24,7 @@
> >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> >  #define QUORUM_OPT_BLKVERIFY  "blkverify"
> >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> >  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> > corrupted
> >  * block if Quorum is reached.
> >  */
> > +
> > +QuorumReadPattern read_pattern;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> >  
> >  bool is_read;
> >  int vote_ret;
> > +int child_iter; /* which child to read in fifo pattern */
> 
> I don't understand what "fifo pattern" could mean for a bunch of disk
> as they are not forming a queue.

The naming isn't 100% accurate, but as in Eric's comment (see below), both FIFO
and round-robin can be used as two different patterns.

> Maybe round-robin is more suitable but your code does not implement
> round-robin since it will alway start from the first disk.
> 
> Your code is scanning the disks set it's a scan pattern.
> 
> That said is it a problem that the first disk will be accessed more often 
> than the other ?

As my commit log documented, the purpose of the read pattern I added is to
speed up reads compared with quorum's original read pattern. The use case is
clear (I hope so), and you can take DRBD as a good example of why we need it. Of
course we are far away fr

[Qemu-devel] [PATCH v6 1/2] qapi: add read-pattern enum for quorum

2014-08-18 Thread Liu Yuan
Cc: Eric Blake 
Reviewed-by: Eric Blake 
Reviewed-by: Benoît Canet 
Signed-off-by: Liu Yuan 
---
 qapi/block-core.json | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..42033d9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,19 @@
 'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+#
+# An enumeration of quorum read patterns.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read only from the first child that has not failed
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1411,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 # (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#(Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
 'children': [ 'BlockdevRef' ],
-'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+'vote-threshold': 'int',
+'*rewrite-corrupted': 'bool',
+'*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1




[Qemu-devel] [PATCH v6 0/2] add read-pattern for block quorum

2014-08-18 Thread Liu Yuan
v6:
 - fix an unused warning introduced by the last version

v5:
 - simplify a for loop in quorum_aio_finalize()

v4:
 - swap the patch order
 - update comment for fifo pattern in qapi
 - use qapi enumeration in quorum driver instead of manual parsing

v3:
 - separate patch into two, one for quorum and one for qapi for easier review
 - add enumeration for quorum read pattern
 - remove unrelated blank line fix from this patch set

v2:
 - rename single as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for fifo pattern

This patch adds single read pattern to quorum driver and quorum vote is default
pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

For use cases like the following:

            VM
         ---------
         |       |
         v       v
         A       B

Both A and B have hardware RAID storage to guarantee data integrity on their own,
so it would help performance if we did a single read instead of reading from all
the nodes. Further, if we run the VM on either of the storage nodes, we can issue
a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all the N nodes, read from just one of them. If the single read
fails, we try to read the next node in the FIFO order specified by the startup
command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality after a single device/node failure for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one (assume A) of 2 nodes' DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. But in practice
we can't, because B might not have enough resources to set up 20 VMs at once. So
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any resource
problem.

In the end, I think we can build more powerful replicated image functionality
on top of quorum and block jobs (block mirror) to meet various high availability
needs.

E.g., enable the single read pattern on 2 children:
  
-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
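
The fifo fallback itself boils down to very little code. A minimal standalone
sketch of the idea (plain C, no QEMU types; every name here is made up for
illustration, the real logic lives in quorum_aio_cb()/read_fifo_child()):

    #include <stdio.h>

    #define NUM_CHILDREN 3

    /* Pretend to read from child i; children 0 and 1 are "down" here. */
    static int child_read(int i)
    {
        return (i < 2) ? -1 : 0;            /* -1 = failure, 0 = success */
    }

    /*
     * fifo pattern: always start from the first child and only fall
     * through to the next one when the current read fails.
     */
    static int fifo_read(void)
    {
        int i;

        for (i = 0; i < NUM_CHILDREN; i++) {
            if (child_read(i) == 0) {
                printf("read served by child %d\n", i);
                return 0;
            }
            printf("child %d failed, trying the next one\n", i);
        }
        return -1;                          /* all children failed */
    }

    int main(void)
    {
        return fifo_read() < 0;
    }

The real driver does the same walk asynchronously, resubmitting the request to
the next child from the completion callback instead of looping.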

Cc: Benoit Canet 
Cc: Eric Blake 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Liu Yuan (2):
  qapi: add read-pattern enum for quorum
  block/quorum: add simple read pattern support

 block/quorum.c   | 177 +--
 qapi/block-core.json |  20 +-
 2 files changed, 148 insertions(+), 49 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v6 2/2] block/quorum: add simple read pattern support

2014-08-18 Thread Liu Yuan
This patch adds a single read pattern to the quorum driver; quorum vote is the
default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

For use cases like the following:

            VM
         ---------
         |       |
         v       v
         A       B

Both A and B have hardware RAID storage to guarantee data integrity on their own,
so it would help performance if we did a single read instead of reading from all
the nodes. Further, if we run the VM on either of the storage nodes, we can issue
a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all the N nodes, read from just one of them. If the single read
fails, we try to read the next node in the FIFO order specified by the startup
command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality after a single device/node failure for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one (assume A) of 2 nodes' DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. But in practice
we can't, because B might not have enough resources to set up 20 VMs at once. So
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any resource
problem.

In the end, I think we can build more powerful replicated image functionality
on top of quorum and block jobs (block mirror) to meet various high availability
needs.

E.g., enable the single read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Eric Blake 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 177 +
 1 file changed, 129 insertions(+), 48 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..c5f1b35 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY  "blkverify"
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+
+QuorumReadPattern read_pattern;
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +120,7 @@ struct QuorumAIOCB {
 
 bool is_read;
 int vote_ret;
+int child_iter; /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -143,7 +147,6 @@ static AIOCBInfo quorum_aiocb_info = {
 
 static void quorum_aio_finalize(QuorumAIOCB *acb)
 {
-BDRVQuorumState *s = acb->common.bs->opaque;
 int i, ret = 0;
 
 if (acb->vote_ret) {
@@ -153,7 +156,8 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 acb->common.cb(acb->common.opaque, ret);
 
 if (acb->is_read) {
-for (i = 0; i < s->num_children; i++) {
+/* on the quorum case acb->child_iter == s->num_children - 1 */
+for (i = 0; i <= acb->child_iter; i++) {
 qemu_vfree(acb->qcrs[i].buf);
 qemu_iovec_destroy(&acb->qcrs[i].qiov);
 }
@@ -256,6 +260,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+int i;
+assert(dest->niov == source->niov);
+assert(dest->size == source->size);
+for (i = 0; i < source->niov; i++) {
+assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+memcpy(dest->iov[i].iov_base,
+   source->iov[i].iov_base,
+   source->iov[i].iov_len);
+}
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -263,6 +282,21 @@ static void quorum_aio_cb(void *opaque, int ret)
 BDRVQuorumState *s = acb->common.bs->opaque;
 bool rewrite = false;
 
+if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
+/* We try to read next child in FIFO order if we fail to read */
+if (ret < 0 && ++acb->child_iter < s->num_children) {
+read_fifo_child(acb);
+return;
+}
+
+ 

Re: [Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support

2014-08-21 Thread Liu Yuan
On Mon, Aug 18, 2014 at 01:59:28PM +0800, Liu Yuan wrote:
> On Fri, Aug 15, 2014 at 03:59:04PM +0200, Benoît Canet wrote:
> > The Friday 15 Aug 2014 à 13:05:17 (+0800), Liu Yuan wrote :
> > > This patch adds single read pattern to quorum driver and quorum vote is 
> > > default
> > > pattern.
> > > 
> > > For now we do a quorum vote on all the reads, it is designed for 
> > > unreliable
> > > underlying storage such as non-redundant NFS to make sure data integrity 
> > > at the
> > > cost of the read performance.
> > > 
> > > For some use cases as following:
> > > 
> > > VM
> > >   --
> > >   ||
> > >   vv
> > >   AB
> > > 
> > > Both A and B has hardware raid storage to justify the data integrity on 
> > > its own.
> > > So it would help performance if we do a single read instead of on all the 
> > > nodes.
> > > Further, if we run VM on either of the storage node, we can make a local 
> > > read
> > > request for better performance.
> > > 
> > > This patch generalize the above 2 nodes case in the N nodes. That is,
> > > 
> > > vm -> write to all the N nodes, read just one of them. If single read 
> > > fails, we
> > > try to read next node in FIFO order specified by the startup command.
> > > 
> > > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > > functionality in the single device/node failure for now. But compared 
> > > with DRBD
> > > we still have some advantages over it:
> > > 
> > > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > > storage. And if A crashes, we need to restart all the VMs on node B. But 
> > > for
> > > practice case, we can't because B might not have enough resources to 
> > > setup 20 VMs
> > > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > > replicated
> > > images over the data center, we can very likely restart 20 VMs without any
> > > resource problem.
> > > 
> > > After all, I think we can build a more powerful replicated image 
> > > functionality
> > > on quorum and block jobs(block mirror) to meet various High Availibility 
> > > needs.
> > > 
> > > E.g, Enable single read pattern on 2 children,
> > > 
> > > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > > 
> > > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > > 
> > > Cc: Benoit Canet 
> > > Cc: Eric Blake 
> > > Cc: Kevin Wolf 
> > > Cc: Stefan Hajnoczi 
> > > Signed-off-by: Liu Yuan 
> > > ---
> > >  block/quorum.c | 176 
> > > ++---
> > >  1 file changed, 129 insertions(+), 47 deletions(-)
> > > 
> > > diff --git a/block/quorum.c b/block/quorum.c
> > > index d5ee9c0..1235d7c 100644
> > > --- a/block/quorum.c
> > > +++ b/block/quorum.c
> > > @@ -24,6 +24,7 @@
> > >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> > >  #define QUORUM_OPT_BLKVERIFY  "blkverify"
> > >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> > > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> > >  
> > >  /* This union holds a vote hash value */
> > >  typedef union QuorumVoteValue {
> > > @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
> > >  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> > > corrupted
> > >  * block if Quorum is reached.
> > >  */
> > > +
> > > +QuorumReadPattern read_pattern;
> > >  } BDRVQuorumState;
> > >  
> > >  typedef struct QuorumAIOCB QuorumAIOCB;
> > > @@ -117,6 +120,7 @@ struct QuorumAIOCB {
> > >  
> > >  bool is_read;
> > >  int vote_ret;
> > > +int child_iter; /* which child to read in fifo pattern */
> > 
> > I don't understand what "fifo pattern" could mean for a bunch of disk
> > as they are not forming a queue.
> 
> Naming isn't 100% accurate but as in Eric's comment (see below), both FIFO and
> Round-Robin can be us

Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block quorum

2014-08-24 Thread Liu Yuan
On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote:
> v6:
>  - fix a unused warning introduced by last version
> 

Hi Stefan and Kevin,

   Benoît Canet has added a Reviewed-by tag to both patches; could one of you
pick up this patch set?

Thanks
Yuan



Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block quorum

2014-08-28 Thread Liu Yuan
On Mon, Aug 25, 2014 at 10:46:21AM +0800, Liu Yuan wrote:
> On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote:
> > v6:
> >  - fix a unused warning introduced by last version
> > 
> 
> Hi Stefan and Kevin,
> 
>Benoît Canet has added Reviewed-by tag to both patches, could one of you 
> pick
> this patch set?
> 

ping



[Qemu-devel] [PATCH] sheepdog: fix a core dump while do auto-reconnecting

2014-08-28 Thread Liu Yuan
We should reinit local_err as NULL inside the while loop, or g_free() will report
corruption and abort QEMU when the sheepdog driver tries reconnecting.

qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been 
called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion 
`!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/sheepdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 3af8743..a3a19b1 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -712,7 +712,6 @@ static void coroutine_fn send_pending_req(BDRVSheepdogState 
*s, uint64_t oid)
 
 static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
-Error *local_err = NULL;
 BDRVSheepdogState *s = opaque;
 AIOReq *aio_req, *next;
 BlockDriverState *bs = s->bs;
@@ -731,6 +730,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 
 /* Try to reconnect the sheepdog server every one second. */
 while (s->fd < 0) {
+Error *local_err = NULL;
 s->fd = get_sheep_fd(s, &local_err);
 if (s->fd < 0) {
 DPRINTF("Wait for connection to be established\n");
-- 
1.9.1




[Qemu-devel] [PATCH v2] sheepdog: fix a core dump while do auto-reconnecting

2014-08-28 Thread Liu Yuan
We should reinit local_err as NULL inside the while loop, or g_free() will report
corruption and abort QEMU when the sheepdog driver tries reconnecting.

This was broken in commit 356b4ca.

qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been 
called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion 
`!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)
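
For reference, the shape of the bug, shown with an illustrative error type rather
than QEMU's real Error API (all names below are made up for the example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative error object, standing in for Error. */
    typedef struct Err { char *msg; } Err;

    static void set_err(Err **errp, const char *msg)
    {
        Err *e = malloc(sizeof(*e));
        e->msg = strdup(msg);
        *errp = e;
    }

    static void report_and_free(Err *e)
    {
        fprintf(stderr, "%s\n", e->msg);
        free(e->msg);
        free(e);
    }

    /* Fails twice, then "connects". */
    static int try_connect(Err **errp)
    {
        static int attempts;

        if (attempts++ < 2) {
            set_err(errp, "Failed to connect to socket: Connection refused");
            return -1;
        }
        return 3;                            /* pretend fd */
    }

    int main(void)
    {
        int fd = -1;

        while (fd < 0) {
            Err *local_err = NULL;           /* reset on every iteration */

            fd = try_connect(&local_err);
            if (fd < 0) {
                report_and_free(local_err);
                /* If local_err were declared outside the loop, it would keep
                 * pointing at freed memory here, and the next failed attempt
                 * would corrupt the heap. */
            }
        }
        return 0;
    }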

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Reviewed-by: Markus Armbruster 
Signed-off-by: Liu Yuan 
---
 block/sheepdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 12cbd9d..53c24d6 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -712,7 +712,6 @@ static void coroutine_fn send_pending_req(BDRVSheepdogState 
*s, uint64_t oid)
 
 static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
-Error *local_err = NULL;
 BDRVSheepdogState *s = opaque;
 AIOReq *aio_req, *next;
 
@@ -727,6 +726,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 
 /* Try to reconnect the sheepdog server every one second. */
 while (s->fd < 0) {
+Error *local_err = NULL;
 s->fd = get_sheep_fd(s, &local_err);
 if (s->fd < 0) {
 DPRINTF("Wait for connection to be established\n");
-- 
1.9.1




Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block quorum

2014-08-28 Thread Liu Yuan
On Thu, Aug 28, 2014 at 11:33:08AM +0100, Stefan Hajnoczi wrote:
> On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote:
> > v6:
> >  - fix a unused warning introduced by last version
> > 
> > v5:
> >  - simplify a for loop in quorum_aio_finalize()
> > 
> > v4:
> >  - swap the patch order
> >  - update comment for fifo pattern in qaip
> >  - use qapi enumeration in quorum driver instead of manual parsing
> > 
> > v3:
> >  - separate patch into two, one for quorum and one for qapi for easier 
> > review
> >  - add enumeration for quorum read pattern
> >  - remove unrelated blank line fix from this patch set
> > 
> > v2:
> >  - rename single as 'fifo'
> >  - rename read_all_children as read_quorum_children
> >  - fix quorum_aio_finalize() for fifo pattern
> > 
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> > VM
> >   --
> >   ||
> >   vv
> >   AB
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its 
> > own.
> > So it would help performance if we do a single read instead of on all the 
> > nodes.
> > Further, if we run VM on either of the storage node, we can make a local 
> > read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read 
> > fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with 
> > DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 
> > 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image 
> > functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility 
> > needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> >   
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet 
> > Cc: Eric Blake 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Liu Yuan (2):
> >   qapi: add read-pattern enum for quorum
> >   block/quorum: add simple read pattern support
> > 
> >  block/quorum.c   | 177 
> > +--
> >  qapi/block-core.json |  20 +-
> >  2 files changed, 148 insertions(+), 49 deletions(-)
> 
> I dropped the \n from the error_setg() error message while merging.
> Please do not use \n with error_setg().

Thanks for your fix.
 
> Please extend the quorum qemu-iotests to cover the new fifo read
> pattern.  You can send the tests as a separate patch series.
> 

Okay, will do later.

Yuan



[Qemu-devel] [PATCH] block: kill tail whitespace in block.c

2014-08-31 Thread Liu Yuan
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index e9380f6..c12b8de 100644
--- a/block.c
+++ b/block.c
@@ -2239,7 +2239,7 @@ int bdrv_commit(BlockDriverState *bs)
 
 if (!drv)
 return -ENOMEDIUM;
-
+
 if (!bs->backing_hd) {
 return -ENOTSUP;
 }
-- 
1.9.1




[Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read

2014-09-01 Thread Liu Yuan
This is required by quorum_aio_cancel()

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index af48e8c..5866bca 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -653,8 +653,10 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB 
*acb)
 }
 
 for (i = 0; i < s->num_children; i++) {
-bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
-   acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
+acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+&acb->qcrs[i].qiov,
+acb->nb_sectors, quorum_aio_cb,
+&acb->qcrs[i]);
 }
 
 return &acb->common;
@@ -663,15 +665,14 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB 
*acb)
 static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
 {
 BDRVQuorumState *s = acb->common.bs->opaque;
-
-acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
- acb->qiov->size);
-qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
-qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
- acb->qcrs[acb->child_iter].buf);
-bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
-   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
-   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+int i = acb->child_iter;
+
+acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
+qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
+qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
+acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+&acb->qcrs[i].qiov, acb->nb_sectors,
+quorum_aio_cb, &acb->qcrs[i]);
 
 return &acb->common;
 }
-- 
1.9.1




[Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper

2014-09-01 Thread Liu Yuan
Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 5866bca..9e056d6 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -130,6 +130,12 @@ struct QuorumAIOCB {
 
 static bool quorum_vote(QuorumAIOCB *acb);
 
+static void quorum_aio_release(QuorumAIOCB *acb)
+{
+g_free(acb->qcrs);
+qemu_aio_release(acb);
+}
+
 static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 {
 QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
@@ -141,8 +147,7 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 bdrv_aio_cancel(acb->qcrs[i].aiocb);
 }
 
-g_free(acb->qcrs);
-qemu_aio_release(acb);
+quorum_aio_release(acb);
 }
 
 static AIOCBInfo quorum_aiocb_info = {
@@ -168,8 +173,7 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 }
 }
 
-g_free(acb->qcrs);
-qemu_aio_release(acb);
+quorum_aio_release(acb);
 }
 
 static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
-- 
1.9.1




[Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver

2014-09-01 Thread Liu Yuan
This patch set mainly adds two pieces of logic to implement device recovery:
- notify the quorum driver of broken states from the child driver(s)
- track dirty blocks and sync the device after it is repaired

Thus quorum allows VMs to continue while some child devices are broken, and when
the child devices are repaired and come back, we sync the dirty bits accumulated
during the downtime to keep data consistency.

The recovery logic is based on the driver state bitmap and will sync the dirty
bits with a timeslice window in a coroutine in this primitive implementation.

Simple graph about 2 children with threshold=1 and read-pattern=fifo:
(similar to DRBD)

+ denote device sync iteration
- IO on a single device
= IO on two devices

                                       sync complete, release dirty bitmap
                                                      ^
                                                      |
  ====--------------------------++++++++++++++++++++++======================
      |                         |
      |                         v
      |              device repaired and begin to sync
      v
  device broken, create a dirty bitmap

  This sync logic can take care of the nested-broken problem, where devices
  break again while a sync is in progress. We just start a new sync process after
  the devices are repaired and switch them from broken to sound only when the
  sync completes.

For read-pattern=quorum mode, it enjoys the recovery logic without any problem.
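
A stripped-down standalone model of that sync loop (plain C, no QEMU bitmap or
coroutine types; the real code budgets by time via SLICE_TIME, this model budgets
by a sector count just to keep it short, and all names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NB_SECTORS        64
    #define SECTORS_PER_SLICE 8        /* stand-in for the SLICE_TIME budget */

    static bool dirty[NB_SECTORS];     /* stand-in for the dirty bitmap */

    /* Copy one dirty sector from a sound child to the repaired one. */
    static void copy_sector(int i)
    {
        dirty[i] = false;              /* clear the bit once it is in sync */
    }

    /* One slice: sync at most SECTORS_PER_SLICE dirty sectors. */
    static int sync_slice(void)
    {
        int i, done = 0;

        for (i = 0; i < NB_SECTORS && done < SECTORS_PER_SLICE; i++) {
            if (dirty[i]) {
                copy_sector(i);
                done++;
            }
        }
        return done;                   /* 0 means the bitmap is clean */
    }

    int main(void)
    {
        int slices = 0;

        memset(dirty, true, sizeof(dirty));  /* child was down: all dirty */

        /* The real code yields to the guest between slices; here we just
         * loop.  Writes landing while we sync would simply set bits again. */
        while (sync_slice() > 0) {
            slices++;
        }
        printf("sync finished after %d slices, child back in service\n",
               slices);
        return 0;
    }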

Todo:
- use aio interface to sync data (multiple transfer in one go)
- dynamic slice window to control sync bandwidth more smoothly
- add auto-reconnection mechanism to other protocol (if not support yet)
- add tests

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 

Liu Yuan (8):
  block/quorum: initialize qcrs.aiocb for read
  block: add driver operation callbacks
  block/sheepdog: propagate disconnect/reconnect events to upper driver
  block/quorum: add quorum_aio_release() helper
  quorum: fix quorum_aio_cancel()
  block/quorum: add broken state to BlockDriverState
  block: add two helpers
  quorum: add basic device recovery logic

 block.c   |  17 +++
 block/quorum.c| 324 +-
 block/sheepdog.c  |   9 ++
 include/block/block.h |   9 ++
 include/block/block_int.h |   6 +
 trace-events  |   5 +
 6 files changed, 336 insertions(+), 34 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()

2014-09-01 Thread Liu Yuan
For a fifo read pattern, we only have one running aio (in some other cases some
children of the num_children BDS could be disabled, reducing the number of
requests needed), so we need to check whether .aiocb is NULL before calling
bdrv_aio_cancel() to avoid a segfault.

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 9e056d6..b9eeda3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -144,7 +144,9 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 
 /* cancel all callbacks */
 for (i = 0; i < s->num_children; i++) {
-bdrv_aio_cancel(acb->qcrs[i].aiocb);
+if (acb->qcrs[i].aiocb) {
+bdrv_aio_cancel(acb->qcrs[i].aiocb);
+}
 }
 
 quorum_aio_release(acb);
-- 
1.9.1




[Qemu-devel] [PATCH 2/8] block: add driver operation callbacks

2014-09-01 Thread Liu Yuan
Driver operations are defined as callbacks passed from upper block drivers to
lower drivers and are supposed to be called by the lower drivers.

Request handling (queuing, submitting, etc.) is done in the protocol tier of the
block layer, and connection states are also maintained down there. Driver
operations are supposed to notify the upper tier (such as quorum) of state
changes.

For now only two operations are added:

driver_disconnect: called when connection is off
driver_reconnect: called when connection is on after disconnection

These are used to notify the upper tier of the connection state.
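
Put differently, it is just a pair of function pointers that the upper driver
hands down and the protocol driver invokes when its connection changes. A
standalone sketch of the pattern (illustrative names only, not the real
BlockDrvOps/BlockDriverState definitions):

    #include <stdio.h>

    /* Callbacks handed down by the upper driver (e.g. quorum). */
    typedef struct NotifyOps {
        void (*disconnect)(void *opaque);
        void (*reconnect)(void *opaque);
    } NotifyOps;

    /* State of a protocol driver (e.g. sheepdog). */
    typedef struct Protocol {
        const NotifyOps *ops;
        void *opaque;
    } Protocol;

    static void set_notify_ops(Protocol *p, const NotifyOps *ops, void *opaque)
    {
        p->ops = ops;
        p->opaque = opaque;
    }

    /* Called by the protocol driver when it loses or regains its connection. */
    static void connection_lost(Protocol *p)
    {
        if (p->ops && p->ops->disconnect) {
            p->ops->disconnect(p->opaque);
        }
    }

    static void connection_back(Protocol *p)
    {
        if (p->ops && p->ops->reconnect) {
            p->ops->reconnect(p->opaque);
        }
    }

    /* The upper driver only flips its own state in the hooks. */
    static void upper_disconnect(void *opaque)
    {
        (void)opaque;
        puts("child marked broken, stop issuing requests to it");
    }

    static void upper_reconnect(void *opaque)
    {
        (void)opaque;
        puts("child repaired, start syncing dirty blocks");
    }

    int main(void)
    {
        static const NotifyOps ops = { upper_disconnect, upper_reconnect };
        Protocol sd = { NULL, NULL };

        set_notify_ops(&sd, &ops, NULL);
        connection_lost(&sd);                /* e.g. socket dropped */
        connection_back(&sd);                /* reconnection succeeded */
        return 0;
    }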

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block.c   | 7 +++
 include/block/block.h | 7 +++
 include/block/block_int.h | 3 +++
 3 files changed, 17 insertions(+)

diff --git a/block.c b/block.c
index c12b8de..22eb3e4 100644
--- a/block.c
+++ b/block.c
@@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const 
BlockDevOps *ops,
 bs->dev_opaque = opaque;
 }
 
+void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
+  void *opaque)
+{
+bs->drv_ops = ops;
+bs->drv_opaque = opaque;
+}
+
 static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load)
 {
 if (bs->dev_ops && bs->dev_ops->change_media_cb) {
diff --git a/include/block/block.h b/include/block/block.h
index 8f4ad16..a61eaf0 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -82,6 +82,11 @@ typedef struct BlockDevOps {
 void (*resize_cb)(void *opaque);
 } BlockDevOps;
 
+typedef struct BlockDrvOps {
+void (*driver_reconnect)(BlockDriverState *bs);
+void (*driver_disconnect)(BlockDriverState *bs);
+} BlockDrvOps;
+
 typedef enum {
 BDRV_REQ_COPY_ON_READ = 0x1,
 BDRV_REQ_ZERO_WRITE   = 0x2,
@@ -234,6 +239,8 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev);
 void *bdrv_get_attached_dev(BlockDriverState *bs);
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
   void *opaque);
+void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
+  void *opaque);
 void bdrv_dev_eject_request(BlockDriverState *bs, bool force);
 bool bdrv_dev_has_removable_media(BlockDriverState *bs);
 bool bdrv_dev_is_tray_open(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 2334895..9fdec7f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -319,6 +319,9 @@ struct BlockDriverState {
 const BlockDevOps *dev_ops;
 void *dev_opaque;
 
+const BlockDrvOps *drv_ops;
+void *drv_opaque;
+
 AioContext *aio_context; /* event loop used for fd handlers, timers, etc */
 
 char filename[1024];
-- 
1.9.1




[Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic

2014-09-01 Thread Liu Yuan
For some configurations, quorum allows VMs to continue while some child devices
are broken, and when the child devices are repaired and come back, we need to
sync the dirty bits accumulated during the downtime to keep data consistency.

The recovery logic is based on the driver state bitmap and will sync the dirty
bits with a timeslice window in a coroutine in this primitive implementation.

Simple graph about 2 children with threshold=1 and read-pattern=fifo:

+ denote device sync iteration
- IO on a single device
= IO on two devices

                                       sync complete, release dirty bitmap
                                                      ^
                                                      |
  ====--------------------------++++++++++++++++++++++======================
      |                         |
      |                         v
      |              device repaired and begin to sync
      v
  device broken, create a dirty bitmap

  This sync logic can take care of the nested-broken problem, where devices
  break again while a sync is in progress. We just start a new sync process after
  the devices are repaired and switch them from broken to sound only when the
  sync completes.

For read-pattern=quorum mode, it enjoys the recovery logic without any problem.

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 189 -
 trace-events   |   5 ++
 2 files changed, 191 insertions(+), 3 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 7b07e35..ffd7c2d 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -23,6 +23,7 @@
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
+#include "trace.h"
 
 #define HASH_LENGTH 32
 
@@ -31,6 +32,10 @@
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
 #define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
+#define SLICE_TIME  1ULL /* 100 ms */
+#define CHUNK_SIZE  (1 << 20) /* 1M */
+#define SECTORS_PER_CHUNK   (CHUNK_SIZE >> BDRV_SECTOR_BITS)
+
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
 char h[HASH_LENGTH];   /* SHA-256 hash */
@@ -64,6 +69,7 @@ typedef struct QuorumVotes {
 
 /* the following structure holds the state of one quorum instance */
 typedef struct BDRVQuorumState {
+BlockDriverState *mybs;/* Quorum block driver base state */
 BlockDriverState **bs; /* children BlockDriverStates */
 int num_children;  /* children count */
 int threshold; /* if less than threshold children reads gave the
@@ -82,6 +88,10 @@ typedef struct BDRVQuorumState {
 */
 
 QuorumReadPattern read_pattern;
+BdrvDirtyBitmap *dirty_bitmap;
+uint8_t *sync_buf;
+HBitmapIter hbi;
+int64_t sector_num;
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, 
QEMUIOVector *source)
 }
 }
 
-static int next_fifo_child(QuorumAIOCB *acb)
+static int get_good_child(BDRVQuorumState *s, int iter)
 {
-BDRVQuorumState *s = acb->common.bs->opaque;
 int i;
 
-for (i = acb->child_iter; i < s->num_children; i++) {
+for (i = iter; i < s->num_children; i++) {
 if (!s->bs[i]->broken) {
 break;
 }
@@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb)
 return i;
 }
 
+static int next_fifo_child(QuorumAIOCB *acb)
+{
+BDRVQuorumState *s = acb->common.bs->opaque;
+
+return get_good_child(s, acb->child_iter);
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt)
 return -EINVAL;
 }
 
+static void sync_prepare(BDRVQuorumState *qs, int64_t *num)
+{
+int64_t nb, total = bdrv_nb_sectors(qs->mybs);
+
+qs->sector_num = hbitmap_iter_next(&qs->hbi);
+/* Wrap around if previous bits get dirty while syncing */
+if (qs->sector_num < 0) {
+bdrv_dirty_iter_init(qs->mybs, qs->dirty_bitmap, &qs->hbi);
+qs->sector_num = hbitmap_iter_next(&qs->hbi);
+assert(qs->sector_num >= 0);
+}
+
+for (nb = 1; nb < SECTORS_PER_CHUNK && qs->sector_num + nb < total;
+ nb++) {
+if (!bdrv_get_dirty(qs->mybs, qs->dirty_bitmap, qs->sector_num + nb)) {
+break;
+}
+}
+*num = nb;
+}
+
+static void sync_finish(BDRVQuorumState *qs, int64_t num)
+{
+int64_t i;
+
+for (i = 0; i < num; i++) {
+/* We need to advance the iterator manually */
+hbitmap_iter_next(&qs->hbi);
+}
+bdrv_reset_dirty(qs->mybs, qs->sector_num, num);
+}
+
+static int quorum_sync_iteration(BDRVQuorumState *qs, BlockDriverState *target)
+{
+BlockDriverState *source;
+QEM

[Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver

2014-09-01 Thread Liu Yuan
This is the reference usage of how we propagate the connection state to the upper tier.

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/sheepdog.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 53c24d6..9c0fc49 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
 BDRVSheepdogState *s = opaque;
 AIOReq *aio_req, *next;
+BlockDriverState *bs = s->bs;
+
+if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
+bs->drv_ops->driver_disconnect(bs);
+}
 
 aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
 close(s->fd);
@@ -756,6 +761,10 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
 resend_aioreq(s, aio_req);
 }
+
+if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
+bs->drv_ops->driver_reconnect(bs);
+}
 }
 
 /*
-- 
1.9.1




[Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState

2014-09-01 Thread Liu Yuan
This allows the VM to continue running even if some devices are broken, given a
proper configuration.

We mark the device broken when the protocol tier notifies us of some broken
state(s) of the device, such as disconnection, via driver operations. We can also
reset the device to sound when the protocol tier is repaired.

Originally .threshold controls how we decide the success of a read/write,
returning failure only if the success count of the read/write is less than the
.threshold specified by users. But it doesn't record the states of the underlying
devices and will impact performance a bit in some cases.

For example, say we have 3 children and .threshold is set to 2. If one of the
devices is broken, we should still return success and continue to run the VM. But
for every IO operation, we will blindly send requests to the broken device.

By storing the broken state in the driver state, we can avoid sending requests to
broken devices and resend requests to the repaired ones once broken is set back
to false.

This is especially useful for network-based protocols such as sheepdog, which has
an auto-reconnection mechanism and will never report EIO if the connection is
broken, but just stores the requests in its local queue and waits to resend them.
Without the broken state, a quorum request will not come back until the connection
is re-established. So we have to skip the broken devices to allow the VM to
continue running with network-backed children (glusterfs, nfs, sheepdog, etc).

With the combination of read-pattern and threshold, we can easily mimic the DRBD
behavior with the following configuration:

 read-pattern=fifo,threshold=1 with two children.
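
The bookkeeping amounts to a per-child broken flag checked before issuing I/O,
with the result judged against the threshold rather than num_children. A
standalone sketch (plain C, illustrative names only, not the actual quorum code):

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_CHILDREN 3
    #define THRESHOLD    2

    static bool broken[NUM_CHILDREN];

    /* Issue a write only to children that are currently sound. */
    static int quorum_write(void)
    {
        int i, issued = 0, success = 0;

        for (i = 0; i < NUM_CHILDREN; i++) {
            if (broken[i]) {
                continue;          /* do not waste a request on a dead child */
            }
            issued++;
            success++;             /* assume the sound children succeed */
        }
        printf("issued=%d success=%d\n", issued, success);

        /* Success is judged against the threshold, so losing one child out
         * of three still lets the VM continue. */
        return success >= THRESHOLD ? 0 : -1;
    }

    int main(void)
    {
        broken[2] = true;          /* protocol tier reported a disconnect */
        if (quorum_write() == 0) {
            puts("write OK, VM keeps running");
        }

        broken[2] = false;         /* repaired; the sync logic catches it up */
        return 0;
    }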

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c| 102 ++
 include/block/block_int.h |   3 ++
 2 files changed, 87 insertions(+), 18 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index b9eeda3..7b07e35 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -120,6 +120,7 @@ struct QuorumAIOCB {
 int rewrite_count;  /* number of replica to rewrite: count down to
  * zero once writes are fired
  */
+int issued_count;   /* actual read&write issued count */
 
 QuorumVotes votes;
 
@@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 if (acb->is_read) {
 /* on the quorum case acb->child_iter == s->num_children - 1 */
 for (i = 0; i <= acb->child_iter; i++) {
-qemu_vfree(acb->qcrs[i].buf);
-qemu_iovec_destroy(&acb->qcrs[i].qiov);
+if (acb->qcrs[i].buf) {
+qemu_vfree(acb->qcrs[i].buf);
+qemu_iovec_destroy(&acb->qcrs[i].qiov);
+}
 }
 }
 
@@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
 acb->count = 0;
 acb->success_count = 0;
 acb->rewrite_count = 0;
+acb->issued_count = 0;
 acb->votes.compare = quorum_sha256_compare;
 QLIST_INIT(&acb->votes.vote_list);
 acb->is_read = false;
@@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, 
QEMUIOVector *source)
 }
 }
 
+static int next_fifo_child(QuorumAIOCB *acb)
+{
+BDRVQuorumState *s = acb->common.bs->opaque;
+int i;
+
+for (i = acb->child_iter; i < s->num_children; i++) {
+if (!s->bs[i]->broken) {
+break;
+}
+}
+if (i == s->num_children) {
+return -1;
+}
+return i;
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret)
 BDRVQuorumState *s = acb->common.bs->opaque;
 bool rewrite = false;
 
+if (ret < 0) {
+s->bs[acb->child_iter]->broken = true;
+}
+
 if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
 /* We try to read next child in FIFO order if we fail to read */
-if (ret < 0 && ++acb->child_iter < s->num_children) {
-read_fifo_child(acb);
-return;
+if (ret < 0) {
+acb->child_iter = next_fifo_child(acb);
+if (acb->child_iter > 0) {
+read_fifo_child(acb);
+return;
+}
 }
 
 if (ret == 0) {
@@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret)
 } else {
 quorum_report_bad(acb, sacb->aiocb->bs->node_name, ret);
 }
-assert(acb->count <= s->num_children);
-assert(acb->success_count <= s->num_children);
-if (acb->count < s->num_children) {
+assert(acb->count <= acb->issued_count);
+assert(acb->success_count <= acb->issued_count);
+if (acb->

[Qemu-devel] [PATCH 7/8] block: add two helpers

2014-09-01 Thread Liu Yuan
These helpers are needed by the quorum device sync logic added later.

Cc: Eric Blake 
Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block.c   | 10 ++
 include/block/block.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/block.c b/block.c
index 22eb3e4..2e2f1d9 100644
--- a/block.c
+++ b/block.c
@@ -2145,6 +2145,16 @@ void *bdrv_get_attached_dev(BlockDriverState *bs)
 return bs->dev;
 }
 
+BlockDriverState *bdrv_get_file(BlockDriverState *bs)
+{
+return bs->file;
+}
+
+const char *bdrv_get_filename(BlockDriverState *bs)
+{
+return bs->filename;
+}
+
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
   void *opaque)
 {
diff --git a/include/block/block.h b/include/block/block.h
index a61eaf0..1e116cc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -237,6 +237,8 @@ int bdrv_attach_dev(BlockDriverState *bs, void *dev);
 void bdrv_attach_dev_nofail(BlockDriverState *bs, void *dev);
 void bdrv_detach_dev(BlockDriverState *bs, void *dev);
 void *bdrv_get_attached_dev(BlockDriverState *bs);
+BlockDriverState *bdrv_get_file(BlockDriverState *bs);
+const char *bdrv_get_filename(BlockDriverState *bs);
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
   void *opaque);
 void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
-- 
1.9.1




Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote :
> > Driver operations are defined as callbacks passed from block upper drivers 
> > to
> > lower drivers and are supposed to be called by lower drivers.
> > 
> > Requests handling(queuing, submitting, etc.) are done in protocol tier in 
> > the
> > block layer and connection states are also maintained down there. Driver
> > operations are supposed to notify the upper tier (such as quorum) of the 
> > states
> > changes.
> > 
> > For now only two operation are added:
> > 
> > driver_disconnect: called when connection is off
> > driver_reconnect: called when connection is on after disconnection
> > 
> > Which are used to notify upper tier of the connection state.
> > 
> > Cc: Eric Blake 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block.c   | 7 +++
> >  include/block/block.h | 7 +++
> >  include/block/block_int.h | 3 +++
> >  3 files changed, 17 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index c12b8de..22eb3e4 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const 
> > BlockDevOps *ops,
> >  bs->dev_opaque = opaque;
> >  }
> >  
> > +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> > +  void *opaque)
> > +{
> > +bs->drv_ops = ops;
> > +bs->drv_opaque = opaque;
> 
> We need to be very carefull of the mix between these fields and the infamous
> bdrv_swap function.
> 
> Also I don't know if "driver operations" is the right name since the 
> BlockDriver structure's
> callback could also be named "driver operations".
> 

BlockDriverState has a "device operation" concept for callbacks from devices, so
I chose "driver operation". Any suggestion for a better name?

Thanks
Yuan



Re: [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 10:31:47AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:09 (+0800), Liu Yuan wrote :
> > This is the reference usage how we propagate connection state to upper tier.
> > 
> > Cc: Eric Blake 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/sheepdog.c | 9 +
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > index 53c24d6..9c0fc49 100644
> > --- a/block/sheepdog.c
> > +++ b/block/sheepdog.c
> > @@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void 
> > *opaque)
> >  {
> >  BDRVSheepdogState *s = opaque;
> >  AIOReq *aio_req, *next;
> > +BlockDriverState *bs = s->bs;
> > +
> > +if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
> > +bs->drv_ops->driver_disconnect(bs);
> > +}
> 
> Since this sequence will be strictly the same for all the implementation
> could we create a bdrv_signal_disconnect(bs); in the block layer to make this
> code generic ?

I'm not sure whether other protocol drivers can have the same auto-reconnection
logic. For simplicity, let's keep it as is in the patch. Later, when the
implementations for other protocols are more fleshed out, we can make a better
decision.

Thanks
Yuan



Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :
> > For a fifo read pattern, we only have one running aio
> 
> >(possible other cases that has less number than num_children in the future)
> I have trouble understanding this part of the commit message could you try
> to clarify it ?

Until this patch, we had two cases: read a single child or read all children. But
a later patch allows VMs to continue if some devices are down. So the discrete
numbers 1 and N become a range [1, N]; that is, the number of running requests
can be anywhere from 1 to N.

Thanks
Yuan



Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 10:57:43AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote :
> > This allow VM continues to process even if some devices are broken meanwhile
> > with proper configuration.
> > 
> > We mark the device broken when the protocol tier notify back some broken
> > state(s) of the device, such as diconnection via driver operations. We could
> > also reset the device as sound when the protocol tier is repaired.
> > 
> > Origianlly .threshold controls how we should decide the success of 
> > read/write
> > and return the failure only if the success count of read/write is less than
> > .threshold specified by users. But it doesn't record the states of 
> > underlying
> > states and will impact performance a bit in some cases.
> > 
> > For example, we have 3 children and .threshold is set 2. If one of the 
> > devices
> > broken, we should still return success and continue to run VM. But for every
> > IO operations, we will blindly send the requests to the broken device.
> > 
> > To store broken state into driver state we can save requests to borken 
> > devices
> > and resend the requests to the repaired ones by setting broken as false.
> > 
> > This is especially useful for network based protocol such as sheepdog, which
> > has a auto-reconnection mechanism and will never report EIO if the 
> > connection
> > is broken but just store the requests to its local queue and wait for 
> > resending.
> > Without broken state, quorum request will not come back until the 
> > connection is
> > re-established. So we have to skip the broken deivces to allow VM to 
> > continue
> > running with networked backed child (glusterfs, nfs, sheepdog, etc).
> > 
> > With the combination of read-pattern and threshold, we can easily mimic the 
> > DRVD
> > behavior with following configuration:
> > 
> >  read-pattern=fifo,threshold=1 will two children.
> > 
> > Cc: Eric Blake 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c| 102 
> > ++
> >  include/block/block_int.h |   3 ++
> >  2 files changed, 87 insertions(+), 18 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index b9eeda3..7b07e35 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -120,6 +120,7 @@ struct QuorumAIOCB {
> >  int rewrite_count;  /* number of replica to rewrite: count 
> > down to
> >   * zero once writes are fired
> >   */
> > +int issued_count;   /* actual read&write issued count */
> >  
> >  QuorumVotes votes;
> >  
> > @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
> >  if (acb->is_read) {
> >  /* on the quorum case acb->child_iter == s->num_children - 1 */
> >  for (i = 0; i <= acb->child_iter; i++) {
> > -qemu_vfree(acb->qcrs[i].buf);
> > -qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +if (acb->qcrs[i].buf) {
> > +qemu_vfree(acb->qcrs[i].buf);
> > +qemu_iovec_destroy(&acb->qcrs[i].qiov);
> > +}
> >  }
> >  }
> >  
> > @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
> >  acb->count = 0;
> >  acb->success_count = 0;
> >  acb->rewrite_count = 0;
> > +acb->issued_count = 0;
> >  acb->votes.compare = quorum_sha256_compare;
> >  QLIST_INIT(&acb->votes.vote_list);
> >  acb->is_read = false;
> > @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, 
> > QEMUIOVector *source)
> >  }
> >  }
> >  
> > +static int next_fifo_child(QuorumAIOCB *acb)
> > +{
> > +BDRVQuorumState *s = acb->common.bs->opaque;
> > +int i;
> > +
> > +for (i = acb->child_iter; i < s->num_children; i++) {
> > +if (!s->bs[i]->broken) {
> > +break;
> > +}
> > +}
> > +if (i == s->num_children) {
> > +return -1;
> > +}
> > +return i;
> > +}
> > +
> >  static void quorum_aio_cb(void *opaque, int ret)
> >  {
> >  QuorumChildRequest *sacb = opaque;
> > @@ -293,11 +313

Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 11:28:22AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 17:19:19 (+0800), Liu Yuan wrote :
> > On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote:
> > > The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote :
> > > > Driver operations are defined as callbacks passed from block upper 
> > > > drivers to
> > > > lower drivers and are supposed to be called by lower drivers.
> > > > 
> > > > Requests handling(queuing, submitting, etc.) are done in protocol tier 
> > > > in the
> > > > block layer and connection states are also maintained down there. Driver
> > > > operations are supposed to notify the upper tier (such as quorum) of 
> > > > the states
> > > > changes.
> > > > 
> > > > For now only two operation are added:
> > > > 
> > > > driver_disconnect: called when connection is off
> > > > driver_reconnect: called when connection is on after disconnection
> > > > 
> > > > Which are used to notify upper tier of the connection state.
> > > > 
> > > > Cc: Eric Blake 
> > > > Cc: Benoit Canet 
> > > > Cc: Kevin Wolf 
> > > > Cc: Stefan Hajnoczi 
> > > > Signed-off-by: Liu Yuan 
> > > > ---
> > > >  block.c   | 7 +++
> > > >  include/block/block.h | 7 +++
> > > >  include/block/block_int.h | 3 +++
> > > >  3 files changed, 17 insertions(+)
> > > > 
> > > > diff --git a/block.c b/block.c
> > > > index c12b8de..22eb3e4 100644
> > > > --- a/block.c
> > > > +++ b/block.c
> > > > @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, 
> > > > const BlockDevOps *ops,
> > > >  bs->dev_opaque = opaque;
> > > >  }
> > > >  
> > > > +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
> > > > +  void *opaque)
> > > > +{
> > > > +bs->drv_ops = ops;
> > > > +bs->drv_opaque = opaque;
> > > 
> > > We need to be very carefull of the mix between these fields and the 
> > > infamous
> > > bdrv_swap function.
> > > 
> > > Also I don't know if "driver operations" is the right name since the 
> > > BlockDriver structure's
> > > callback could also be named "driver operations".
> > > 
> > 
> > BlockDrvierState has a "device operation" for callbacks from devices. So I
> > choose "driver operation". So any sugguestion for better name?
> 
> From what I see in this series the job of these callbacks is to send a 
> message or a signal to
> the upper BDS.
> 
> Also the name must reflect it goes from the child to the parent.
> 
> child_signals ?
> child_messages ?
> 

As far as I can see, putting 'child' in the name would make it too quorum-centric.
Since these are operations in BlockDriverState, we need to keep them as generic
as we can.

These operations [here we mean disconnect() and reconnect(), but probably later
someone will add more operations] are passed from the 'upper driver' to the
protocol driver [in the code we call the protocol the 'file' driver, a narrow
name too], so I chose to name them 'driver operations'. If we could rename 'file'
to 'protocol', covering file, nfs, sheepdog, etc., such as

bdrv_create_file -> bdrv_create_protocol
bs.file -> bs.protocol

then 'driver operations' here would sound better.

Thanks
Yuan



Re: [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 11:37:20AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 15:43:14 (+0800), Liu Yuan wrote :
> > For some configuration, quorum allow VMs to continue while some child 
> > devices
> > are broken and when the child devices are repaired and return back, we need 
> > to
> > sync dirty bits during downtime to keep data consistency.
> > 
> > The recovery logic is based on the driver state bitmap and will sync the 
> > dirty
> > bits with a timeslice window in a coroutine in this prtimive implementation.
> > 
> > Simple graph about 2 children with threshold=1 and read-pattern=fifo:
> > 
> > + denote device sync iteration
> > - IO on a single device
> > = IO on two devices
> > 
> >   sync complete, release dirty bitmap
> >  ^
> >  |
> >   -++==
> >  | |
> >  | v
> >  |   device repaired and begin to sync
> >  v
> >device broken, create a dirty bitmap
> > 
> >   This sync logic can take care of nested broken problem, that devices are
> >   broken while in sync. We just start a sync process after the devices are
> >   repaired again and switch the devices from broken to sound only when the 
> > sync
> >   completes.
> > 
> > For read-pattern=quorum mode, it enjoys the recovery logic without any 
> > problem.
> > 
> > Cc: Eric Blake 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 189 
> > -
> >  trace-events   |   5 ++
> >  2 files changed, 191 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index 7b07e35..ffd7c2d 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -23,6 +23,7 @@
> >  #include "qapi/qmp/qlist.h"
> >  #include "qapi/qmp/qstring.h"
> >  #include "qapi-event.h"
> > +#include "trace.h"
> >  
> >  #define HASH_LENGTH 32
> >  
> > @@ -31,6 +32,10 @@
> >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> >  #define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> > +#define SLICE_TIME  1ULL /* 100 ms */
> > +#define CHUNK_SIZE  (1 << 20) /* 1M */
> > +#define SECTORS_PER_CHUNK   (CHUNK_SIZE >> BDRV_SECTOR_BITS)
> > +
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> >  char h[HASH_LENGTH];   /* SHA-256 hash */
> > @@ -64,6 +69,7 @@ typedef struct QuorumVotes {
> >  
> >  /* the following structure holds the state of one quorum instance */
> >  typedef struct BDRVQuorumState {
> > +BlockDriverState *mybs;/* Quorum block driver base state */
> >  BlockDriverState **bs; /* children BlockDriverStates */
> >  int num_children;  /* children count */
> >  int threshold; /* if less than threshold children reads gave 
> > the
> > @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState {
> >  */
> >  
> >  QuorumReadPattern read_pattern;
> > +BdrvDirtyBitmap *dirty_bitmap;
> > +uint8_t *sync_buf;
> > +HBitmapIter hbi;
> > +int64_t sector_num;
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
> > @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, 
> > QEMUIOVector *source)
> >  }
> >  }
> >  
> > -static int next_fifo_child(QuorumAIOCB *acb)
> > +static int get_good_child(BDRVQuorumState *s, int iter)
> >  {
> > -BDRVQuorumState *s = acb->common.bs->opaque;
> >  int i;
> >  
> > -for (i = acb->child_iter; i < s->num_children; i++) {
> > +for (i = iter; i < s->num_children; i++) {
> >  if (!s->bs[i]->broken) {
> >  break;
> >  }
> > @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb)
> >  return i;
> >  }
> >  
> > +static int next_fifo_child(QuorumAIOCB *acb)
> > +{
> > +BDRVQuorumState *s = acb->common.bs->opaque;
> > +
> > +return get_good_child(s, acb->child_iter);
> > +}
> > +
> >  static void quorum_aio_cb(void *opaque, int

Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()

2014-09-01 Thread Liu Yuan
On Mon, Sep 01, 2014 at 11:32:04AM +0200, Benoît Canet wrote:
> The Monday 01 Sep 2014 à 17:26:09 (+0800), Liu Yuan wrote :
> > On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote:
> > > The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :
> > > > For a fifo read pattern, we only have one running aio
> > > 
> > > >(possible other cases that has less number than num_children in the 
> > > >future)
> > > I have trouble understanding this part of the commit message could you try
> > > to clarify it ?
> > 
> > Until this patch, we have two cases, read single or read all. But later 
> > patch
> > allow VMs to continue if some devices are down. So the discrete number 1 
> > and N
> > becomes a range [1, N], that is possible running requests are from 1 to N.
> 
> Why not 
> 
> (In some other cases some children of the num_children BDS could be disabled
> reducing the number of requests needed) ?
> 

Sounds better, I'll take yours, thanks!

Yuan



Re: [Qemu-devel] [PATCH 16/22] sheepdog: implement .bdrv_detach/attach_aio_context()

2014-05-05 Thread Liu Yuan
On Thu, May 01, 2014 at 04:54:40PM +0200, Stefan Hajnoczi wrote:
> Drop the assumption that we're using the main AioContext.  Convert
> qemu_aio_set_fd_handler() to aio_set_fd_handler() and qemu_aio_wait() to
> aio_poll().
> 
> The .bdrv_detach/attach_aio_context() interfaces also need to be
> implemented to move the socket fd handler from the old to the new
> AioContext.
> 
> Cc: MORITA Kazutaka 
> Cc: Liu Yuan 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  block/sheepdog.c | 118 
> +--
>  1 file changed, 80 insertions(+), 38 deletions(-)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 0eb33ee..4727fc1 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -314,6 +314,7 @@ struct SheepdogAIOCB {
>  
>  typedef struct BDRVSheepdogState {
>  BlockDriverState *bs;
> +AioContext *aio_context;
>  
>  SheepdogInode inode;
>  
> @@ -496,7 +497,7 @@ static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
>  sd_finish_aiocb(acb);
>  return;
>  }
> -qemu_aio_wait();
> +aio_poll(s->aio_context, true);
>  }
>  }
>  
> @@ -582,6 +583,7 @@ static void restart_co_req(void *opaque)
>  
>  typedef struct SheepdogReqCo {
>  int sockfd;
> +AioContext *aio_context;
>  SheepdogReq *hdr;
>  void *data;
>  unsigned int *wlen;
> @@ -602,14 +604,14 @@ static coroutine_fn void do_co_req(void *opaque)
>  unsigned int *rlen = srco->rlen;
>  
>  co = qemu_coroutine_self();
> -qemu_aio_set_fd_handler(sockfd, NULL, restart_co_req, co);
> +aio_set_fd_handler(srco->aio_context, sockfd, NULL, restart_co_req, co);
>  
>  ret = send_co_req(sockfd, hdr, data, wlen);
>  if (ret < 0) {
>  goto out;
>  }
>  
> -qemu_aio_set_fd_handler(sockfd, restart_co_req, NULL, co);
> +aio_set_fd_handler(srco->aio_context, sockfd, restart_co_req, NULL, co);
>  
>  ret = qemu_co_recv(sockfd, hdr, sizeof(*hdr));
>  if (ret != sizeof(*hdr)) {
> @@ -634,18 +636,19 @@ static coroutine_fn void do_co_req(void *opaque)
>  out:
>  /* there is at most one request for this sockfd, so it is safe to
>   * set each handler to NULL. */
> -qemu_aio_set_fd_handler(sockfd, NULL, NULL, NULL);
> +aio_set_fd_handler(srco->aio_context, sockfd, NULL, NULL, NULL);
>  
>  srco->ret = ret;
>  srco->finished = true;
>  }
>  
> -static int do_req(int sockfd, SheepdogReq *hdr, void *data,
> -  unsigned int *wlen, unsigned int *rlen)
> +static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
> +  void *data, unsigned int *wlen, unsigned int *rlen)
>  {
>  Coroutine *co;
>  SheepdogReqCo srco = {
>  .sockfd = sockfd,
> +.aio_context = aio_context,
>  .hdr = hdr,
>  .data = data,
>  .wlen = wlen,
> @@ -660,7 +663,7 @@ static int do_req(int sockfd, SheepdogReq *hdr, void 
> *data,
>  co = qemu_coroutine_create(do_co_req);
>  qemu_coroutine_enter(co, &srco);
>  while (!srco.finished) {
> -qemu_aio_wait();
> +aio_poll(aio_context, true);
>  }
>  }
>  
> @@ -712,7 +715,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
>  BDRVSheepdogState *s = opaque;
>  AIOReq *aio_req, *next;
>  
> -qemu_aio_set_fd_handler(s->fd, NULL, NULL, NULL);
> +aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
>  close(s->fd);
>  s->fd = -1;
>  
> @@ -923,7 +926,7 @@ static int get_sheep_fd(BDRVSheepdogState *s)
>  return fd;
>  }
>  
> -qemu_aio_set_fd_handler(fd, co_read_response, NULL, s);
> +aio_set_fd_handler(s->aio_context, fd, co_read_response, NULL, s);
>  return fd;
>  }
>  
> @@ -1093,7 +1096,7 @@ static int find_vdi_name(BDRVSheepdogState *s, const 
> char *filename,
>  hdr.snapid = snapid;
>  hdr.flags = SD_FLAG_CMD_WRITE;
>  
> -ret = do_req(fd, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
> +ret = do_req(fd, s->aio_context, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
>  if (ret) {
>  goto out;
>  }
> @@ -1173,7 +1176,8 @@ static void coroutine_fn 
> add_aio_request(BDRVSheepdogState *s, AIOReq *aio_req,
>  
>  qemu_co_mutex_lock(&s->lock);
>  s->co_send = qemu_coroutine_self();
> -qemu_aio_set_fd_handler(s->fd, co_read_response, co_write_request, s);
> +aio_set_fd_handler(s->aio_context, s->fd,
> +   co_re
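
The quoted diff is cut off by the archive before the new callbacks themselves.
As an illustration of the pattern the commit message describes (moving the
socket fd handler from the old AioContext to the new one), the two callbacks
typically look roughly like the sketch below; this is not the applied patch,
just the shape implied by the hunks above.

static void sd_detach_aio_context(BlockDriverState *bs)
{
    BDRVSheepdogState *s = bs->opaque;

    /* stop watching the sheep socket in the old AioContext */
    aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
}

static void sd_attach_aio_context(BlockDriverState *bs, AioContext *new_context)
{
    BDRVSheepdogState *s = bs->opaque;

    s->aio_context = new_context;
    /* resume watching the sheep socket in the new AioContext */
    aio_set_fd_handler(new_context, s->fd, co_read_response, NULL, s);
}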

Re: [Qemu-devel] [PATCH v2 1/2] sheepdog: fix vdi object update after live snapshot

2014-06-03 Thread Liu Yuan
On Tue, Jun 03, 2014 at 01:54:21PM +0900, Hitoshi Mitake wrote:
> sheepdog driver should decide a write request is COW or not based on
> inode object which is active when the write request is issued.
> 
> Cc: Kevin Wolf 
> Cc: Stefan Hajnoczi 
> Cc: Liu Yuan 
> Cc: MORITA Kazutaka 
> Signed-off-by: Hitoshi Mitake 
> ---
>  block/sheepdog.c |   40 +++-
>  1 files changed, 23 insertions(+), 17 deletions(-)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index 4ecbf5f..637e57f 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -282,6 +282,7 @@ typedef struct AIOReq {
>  unsigned int data_len;
>  uint8_t flags;
>  uint32_t id;
> +bool create;
>  
>  QLIST_ENTRY(AIOReq) aio_siblings;
>  } AIOReq;
> @@ -404,7 +405,7 @@ static const char * sd_strerror(int err)
>  
>  static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, SheepdogAIOCB *acb,
>  uint64_t oid, unsigned int data_len,
> -uint64_t offset, uint8_t flags,
> +uint64_t offset, uint8_t flags, bool 
> create,
>  uint64_t base_oid, unsigned int 
> iov_offset)
>  {
>  AIOReq *aio_req;
> @@ -418,6 +419,7 @@ static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, 
> SheepdogAIOCB *acb,
>  aio_req->data_len = data_len;
>  aio_req->flags = flags;
>  aio_req->id = s->aioreq_seq_num++;
> +aio_req->create = create;
>  
>  acb->nr_pending++;
>  return aio_req;
> @@ -664,8 +666,8 @@ static int do_req(int sockfd, SheepdogReq *hdr, void 
> *data,
>  }
>  
>  static void coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq 
> *aio_req,
> -   struct iovec *iov, int niov, bool create,
> -   enum AIOCBState aiocb_type);
> + struct iovec *iov, int niov,
> + enum AIOCBState aiocb_type);
>  static void coroutine_fn resend_aioreq(BDRVSheepdogState *s, AIOReq 
> *aio_req);
>  static int reload_inode(BDRVSheepdogState *s, uint32_t snapid, const char 
> *tag);
>  static int get_sheep_fd(BDRVSheepdogState *s, Error **errp);
> @@ -698,7 +700,7 @@ static void coroutine_fn 
> send_pending_req(BDRVSheepdogState *s, uint64_t oid)
>  /* move aio_req from pending list to inflight one */
>  QLIST_REMOVE(aio_req, aio_siblings);
>  QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
> -add_aio_request(s, aio_req, acb->qiov->iov, acb->qiov->niov, false,
> +add_aio_request(s, aio_req, acb->qiov->iov, acb->qiov->niov,
>  acb->aiocb_type);
>  }
>  }
> @@ -797,7 +799,7 @@ static void coroutine_fn aio_read_response(void *opaque)
>  }
>  idx = data_oid_to_idx(aio_req->oid);
>  
> -if (s->inode.data_vdi_id[idx] != s->inode.vdi_id) {
> +if (aio_req->create) {
>  /*
>   * If the object is newly created one, we need to update
>   * the vdi object (metadata object).  min_dirty_data_idx
> @@ -1117,8 +1119,8 @@ out:
>  }
>  
>  static void coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq 
> *aio_req,
> -   struct iovec *iov, int niov, bool create,
> -   enum AIOCBState aiocb_type)
> + struct iovec *iov, int niov,
> + enum AIOCBState aiocb_type)
>  {
>  int nr_copies = s->inode.nr_copies;
>  SheepdogObjReq hdr;
> @@ -1129,6 +1131,7 @@ static void coroutine_fn 
> add_aio_request(BDRVSheepdogState *s, AIOReq *aio_req,
>  uint64_t offset = aio_req->offset;
>  uint8_t flags = aio_req->flags;
>  uint64_t old_oid = aio_req->base_oid;
> +bool create = aio_req->create;
>  
>  if (!nr_copies) {
>  error_report("bug");
> @@ -1315,6 +1318,7 @@ static bool check_simultaneous_create(BDRVSheepdogState 
> *s, AIOReq *aio_req)
>  DPRINTF("simultaneous create to %" PRIx64 "\n", aio_req->oid);
>  aio_req->flags = 0;
>  aio_req->base_oid = 0;
> +aio_req->create = false;
>  QLIST_REMOVE(aio_req, aio_siblings);
>  QLIST_INSERT_HEAD(&s->pending_aio_head, aio_req, aio_siblings);
>  return true;
> @@ -1327,7 +1331,8 @@ static bool check_simultaneous_create(BDRVSheepdogState 
> *s, AIOReq *aio_req)
>  s

Re: [Qemu-devel] [PATCH v2 1/2] sheepdog: fix vdi object update after live snapshot

2014-06-03 Thread Liu Yuan
On Tue, Jun 03, 2014 at 11:58:21PM +0900, Hitoshi Mitake wrote:
> On Tue, Jun 3, 2014 at 9:41 PM, Liu Yuan  wrote:
> > On Tue, Jun 03, 2014 at 01:54:21PM +0900, Hitoshi Mitake wrote:
> >> sheepdog driver should decide a write request is COW or not based on
> >> inode object which is active when the write request is issued.
> >>
> >> Cc: Kevin Wolf 
> >> Cc: Stefan Hajnoczi 
> >> Cc: Liu Yuan 
> >> Cc: MORITA Kazutaka 
> >> Signed-off-by: Hitoshi Mitake 
> >> ---
> >>  block/sheepdog.c |   40 +++-
> >>  1 files changed, 23 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/block/sheepdog.c b/block/sheepdog.c
> >> index 4ecbf5f..637e57f 100644
> >> --- a/block/sheepdog.c
> >> +++ b/block/sheepdog.c
> >> @@ -282,6 +282,7 @@ typedef struct AIOReq {
> >>  unsigned int data_len;
> >>  uint8_t flags;
> >>  uint32_t id;
> >> +bool create;
> >>
> >>  QLIST_ENTRY(AIOReq) aio_siblings;
> >>  } AIOReq;
> >> @@ -404,7 +405,7 @@ static const char * sd_strerror(int err)
> >>
> >>  static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, SheepdogAIOCB 
> >> *acb,
> >>  uint64_t oid, unsigned int data_len,
> >> -uint64_t offset, uint8_t flags,
> >> +uint64_t offset, uint8_t flags, bool 
> >> create,
> >>  uint64_t base_oid, unsigned int 
> >> iov_offset)
> >>  {
> >>  AIOReq *aio_req;
> >> @@ -418,6 +419,7 @@ static inline AIOReq *alloc_aio_req(BDRVSheepdogState 
> >> *s, SheepdogAIOCB *acb,
> >>  aio_req->data_len = data_len;
> >>  aio_req->flags = flags;
> >>  aio_req->id = s->aioreq_seq_num++;
> >> +aio_req->create = create;
> >>
> >>  acb->nr_pending++;
> >>  return aio_req;
> >> @@ -664,8 +666,8 @@ static int do_req(int sockfd, SheepdogReq *hdr, void 
> >> *data,
> >>  }
> >>
> >>  static void coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq 
> >> *aio_req,
> >> -   struct iovec *iov, int niov, bool create,
> >> -   enum AIOCBState aiocb_type);
> >> + struct iovec *iov, int niov,
> >> + enum AIOCBState aiocb_type);
> >>  static void coroutine_fn resend_aioreq(BDRVSheepdogState *s, AIOReq 
> >> *aio_req);
> >>  static int reload_inode(BDRVSheepdogState *s, uint32_t snapid, const char 
> >> *tag);
> >>  static int get_sheep_fd(BDRVSheepdogState *s, Error **errp);
> >> @@ -698,7 +700,7 @@ static void coroutine_fn 
> >> send_pending_req(BDRVSheepdogState *s, uint64_t oid)
> >>  /* move aio_req from pending list to inflight one */
> >>  QLIST_REMOVE(aio_req, aio_siblings);
> >>  QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
> >> -add_aio_request(s, aio_req, acb->qiov->iov, acb->qiov->niov, 
> >> false,
> >> +add_aio_request(s, aio_req, acb->qiov->iov, acb->qiov->niov,
> >>  acb->aiocb_type);
> >>  }
> >>  }
> >> @@ -797,7 +799,7 @@ static void coroutine_fn aio_read_response(void 
> >> *opaque)
> >>  }
> >>  idx = data_oid_to_idx(aio_req->oid);
> >>
> >> -if (s->inode.data_vdi_id[idx] != s->inode.vdi_id) {
> >> +if (aio_req->create) {
> >>  /*
> >>   * If the object is newly created one, we need to update
> >>   * the vdi object (metadata object).  min_dirty_data_idx
> >> @@ -1117,8 +1119,8 @@ out:
> >>  }
> >>
> >>  static void coroutine_fn add_aio_request(BDRVSheepdogState *s, AIOReq 
> >> *aio_req,
> >> -   struct iovec *iov, int niov, bool create,
> >> -   enum AIOCBState aiocb_type)
> >> + struct iovec *iov, int niov,
> >> + enum AIOCBState aiocb_type)
> >>  {
> >>  int nr_copies = s->inode.nr_copies;
> >>  SheepdogObjReq hdr;
> >> @@ -1129,6 +1131,7 @@ sta

Re: [Qemu-devel] [sheepdog] [PATCH v3 0/2] bugfixes of sheepdog driver for a case of live snapshot

2014-06-05 Thread Liu Yuan
On Fri, Jun 06, 2014 at 01:35:10PM +0900, Hitoshi Mitake wrote:
> Current sheepdog driver has two problems in a mechanism of inode
> object reloading for live snapshot. It causes inconsistent state of
> snapshot volumes. A new GC algorithm implemented in sheepdog exposes
> the problems. This patchset contains bugfixes for them.
> 
> v3:
>  - update commit log
> 
> v2:
>  - corrrect wrong spelling in the commit log of 2nd patch
> 
> Hitoshi Mitake (2):
>   sheepdog: fix vdi object update after live snapshot
>   sheepdog: reload only header in a case of live snapshot
> 
>  block/sheepdog.c |   49 +
>  1 files changed, 29 insertions(+), 20 deletions(-)
> 
> -- 
> sheepdog mailing list
> sheep...@lists.wpkg.org
> http://lists.wpkg.org/mailman/listinfo/sheepdog

This series looks good to me.

Acked-by: Liu Yuan 



[Qemu-devel] [PATCH] configure: make libnfs not_found message more user friendly

2014-07-09 Thread Liu Yuan
Cc: Kevin Wolf 
Signed-off-by: Liu Yuan 
---
 configure | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure b/configure
index 7dd43fd..684fcdf 100755
--- a/configure
+++ b/configure
@@ -3996,7 +3996,7 @@ if test "$libnfs" != "no" ; then
 LIBS="$LIBS $libnfs_libs"
   else
 if test "$libnfs" = "yes" ; then
-  feature_not_found "libnfs"
+  feature_not_found "libnfs" "libnfs (>=1.9.3) development files required 
to compile libnfs"
 fi
 libnfs="no"
   fi
-- 
1.9.1




[Qemu-devel] [PATCH] block/quorum: add simple read pattern support

2014-07-10 Thread Liu Yuan
This patch adds a single-read pattern to the quorum driver; quorum vote remains
the default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider use cases such as the following:

         VM
       -------
       |     |
       v     v
       A     B

Both A and B have hardware RAID storage to guarantee data integrity on their
own, so it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality for single device/node failures for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. But
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

After all, I think we can build more powerful replicated-image functionality on
top of quorum and block jobs (block mirror) to meet various High Availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 174 +
 1 file changed, 125 insertions(+), 49 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..2f18755 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY  "blkverify"
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_SINGLE 1
+int read_pattern;  /* single: read a single child and try first one
+* first. If error, try next child in an
+* FIFO order specifed by command line.
+* Return error if no child read succeeds.
+* quorum: read all the children and do a quorum
+* vote on reads.
+*/
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
 bool is_read;
 int vote_ret;
+int child_iter; /* which child to read in single pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -256,6 +268,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_single_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+int i;
+assert(dest->niov == source->niov);
+assert(dest->size == source->size);
+for (i = 0; i < source->niov; i++) {
+assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+memcpy(dest->iov[i].iov_base,
+   source->iov[i].iov_base,
+   source->iov[i].iov_len);
+}
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -263,6 +290,21 @@ static void quorum_aio_cb(void *opaque, int ret)
 BDRVQuorumState *s = acb->common.bs->opaque;
 bool rewrite = false;
 
+if (acb->is_read && s->read_pattern == READ_PATTERN_SINGLE) {
+/* We try to read next child in FIFO order if we fail to read */
+if (ret < 0 && ++acb->child_iter < s->num_children) {
+read_single_child(acb);
+return;
+}
+
+if (ret == 0) {
+quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+}
+acb->vote_ret = ret;
+quorum_aio_finalize(acb);
+   

[Qemu-devel] [PATCH] block/quorum: make quorum_getlength error message user friendly

2014-07-11 Thread Liu Yuan
When starting the quorum driver with 2 differently sized images, we get:

qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh total \
sector count: Input/output error

EIO would confuse users. With this patch, the error message becomes:

Children images are not in the same size
qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh total \
sector count: Invalid argument

Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 2f18755..51437ad 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -736,7 +736,8 @@ static int64_t quorum_getlength(BlockDriverState *bs)
 return value;
 }
 if (value != result) {
-return -EIO;
+error_printf("Children images are not in the same size\n");
+return -EINVAL;
 }
 }
 
-- 
1.9.1




[Qemu-devel] [PATCH v2] configure: make libnfs not_found message user friendly

2014-07-11 Thread Liu Yuan
Cc: Kevin Wolf 
Signed-off-by: Liu Yuan 
---
 configure | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure b/configure
index 7dd43fd..78e7baf 100755
--- a/configure
+++ b/configure
@@ -3996,7 +3996,7 @@ if test "$libnfs" != "no" ; then
 LIBS="$LIBS $libnfs_libs"
   else
 if test "$libnfs" = "yes" ; then
-  feature_not_found "libnfs"
+  feature_not_found "libnfs" "Install libnfs-devel >= 1.9.3"
 fi
 libnfs="no"
   fi
-- 
1.9.1




Re: [Qemu-devel] [PATCH] block/quorum: add simple read pattern support

2014-07-11 Thread Liu Yuan
On Fri, Jul 11, 2014 at 12:56:53PM +0200, Benoît Canet wrote:
> The Friday 11 Jul 2014 à 11:01:22 (+0800), Liu Yuan wrote :
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> > For some use cases as following:
> > 
> > VM
> >   --
> >   ||
> >   vv
> >   AB
> > 
> > Both A and B has hardware raid storage to justify the data integrity on its 
> > own.
> > So it would help performance if we do a single read instead of on all the 
> > nodes.
> > Further, if we run VM on either of the storage node, we can make a local 
> > read
> > request for better performance.
> > 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read 
> > fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with 
> > DRBD
> > we still have some advantages over it:
> > 
> > - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
> > storage. And if A crashes, we need to restart all the VMs on node B. But for
> > practice case, we can't because B might not have enough resources to setup 
> > 20 VMs
> > at once. So if we run our 20 VMs with quorum driver, and scatter the 
> > replicated
> > images over the data center, we can very likely restart 20 VMs without any
> > resource problem.
> > 
> > After all, I think we can build a more powerful replicated image 
> > functionality
> > on quorum and block jobs(block mirror) to meet various High Availibility 
> > needs.
> > 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 174 
> > +
> >  1 file changed, 125 insertions(+), 49 deletions(-)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index d5ee9c0..2f18755 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -24,6 +24,7 @@
> >  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
> >  #define QUORUM_OPT_BLKVERIFY  "blkverify"
> >  #define QUORUM_OPT_REWRITE"rewrite-corrupted"
> > +#define QUORUM_OPT_READ_PATTERN   "read-pattern"
> >  
> >  /* This union holds a vote hash value */
> >  typedef union QuorumVoteValue {
> > @@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
> >  bool rewrite_corrupted;/* true if the driver must rewrite-on-read 
> > corrupted
> >  * block if Quorum is reached.
> >  */
> > +
> > +#define READ_PATTERN_QUORUM 0 /* default */
> > +#define READ_PATTERN_SINGLE 1
> 
> Why not making these choices an enum ?

It is just a matter of taste. For now we only have 2 constants, and the bonus of
having the #defines here is that we can put them right next to 'read_pattern',
which makes the code easier to read.

> Would READ_PATTERN_SINGLE be more accuratelly described by 
> READ_PATTERN_ROUND_ROBIN ?

How about FIFO? Since 'quorum' is a single word...

> More generaly would s/single/round-robin/ be more descriptive.

ditto.

> 
> > +int read_pattern;  /* single: read a single child and try first one
> > +* first. If error, try next child in an
> > +* FIFO order specifed by command line.
> > +* Return error if no child read 
> > succeeds.
> > +* quorum: read all the children and do a quorum
> > +* vote on reads.
> > +*/
> >  } BDRVQuorumState;
> >  
> >  typedef struct QuorumAIOCB QuorumAIOCB;
&

Re: [Qemu-devel] [PATCH] block/quorum: add simple read pattern support

2014-07-11 Thread Liu Yuan
On Fri, Jul 11, 2014 at 08:06:36AM -0600, Eric Blake wrote:
> On 07/10/2014 09:01 PM, Liu Yuan wrote:
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> 
> > E.g, Enable single read pattern on 2 children,
> > 
> > -drive driver=quorum,children.0.file.filename=0.qcow2,\
> > children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1
> > 
> > [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
> > 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 174 
> > +
> >  1 file changed, 125 insertions(+), 49 deletions(-)
> 
> That covers the command line, but what about the *.json changes to allow
> QMP to select this mode when hot-plugging?

Good catch, thanks for the reminder.

Yuan



Re: [Qemu-devel] [PATCH] block/quorum: make quorum_getlength error message user friendly

2014-07-11 Thread Liu Yuan
On Fri, Jul 11, 2014 at 12:26:58PM +0200, Kevin Wolf wrote:
> Am 11.07.2014 um 09:34 hat Liu Yuan geschrieben:
> > When start quorum driver with 2 different sized images, we get:
> > 
> > qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh 
> > total \
> > sector count: Input/output error
> > 
> > EIO would confuse users. With this patch, the error message goes like
> > 
> > Children images are not in the same size
> > qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh 
> > total \
> > sector count: Invalid argument
> > 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> 
> I think it would be better to do this check in quorum_open() instead of
> relying on bdrv_open() to perform a bdrv_getlength(). As a bonus,
> quorum_open() has an Error argument, so we wouldn't have to use
> something like error_printf(), which doesn't work well for QMP commands.
> 
> Kevin

Makes sense. I'll do it after we settle my 'add simple read pattern' patch,
since the two conflict.

Yuan
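
For reference, moving the check into quorum_open() could look roughly like the
sketch below. It is a minimal sketch only: the helper name is made up, and it
assumes the existing BDRVQuorumState fields s->bs[] and s->num_children; the
point is that an Error can be set and propagated instead of being printed.

static int quorum_check_children_size(BDRVQuorumState *s, Error **errp)
{
    int64_t first, len;
    int i;

    first = bdrv_getlength(s->bs[0]);
    if (first < 0) {
        error_setg_errno(errp, -first, "Could not get size of first child");
        return first;
    }

    for (i = 1; i < s->num_children; i++) {
        len = bdrv_getlength(s->bs[i]);
        if (len < 0) {
            error_setg_errno(errp, -len, "Could not get size of child %d", i);
            return len;
        }
        if (len != first) {
            error_setg(errp, "Children images are not in the same size");
            return -EINVAL;
        }
    }

    return 0;
}

quorum_open() would call this once all children are opened and fail with the
propagated errp, which also keeps the message usable over QMP.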



[Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support

2014-07-14 Thread Liu Yuan
This patch adds a single-read pattern to the quorum driver; quorum vote remains
the default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider use cases such as the following:

         VM
       -------
       |     |
       v     v
       A     B

Both A and B have hardware RAID storage to guarantee data integrity on their
own, so it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality for single device/node failures for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. But
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

After all, I think we can build more powerful replicated-image functionality on
top of quorum and block jobs (block mirror) to meet various High Availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 v2:
 - rename single as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for fifo pattern

 block.c  |   2 +-
 block/quorum.c   | 181 +--
 qapi/block-core.json |   7 +-
 3 files changed, 138 insertions(+), 52 deletions(-)

diff --git a/block.c b/block.c
index 8800a6b..3dfa27c 100644
--- a/block.c
+++ b/block.c
@@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
 
 if (!drv)
 return -ENOMEDIUM;
-
+
 if (!bs->backing_hd) {
 return -ENOTSUP;
 }
diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY  "blkverify"
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO   1
+int read_pattern;  /* fifo: read a single child and try first one
+* first. If error, try next child in an
+* FIFO order specifed by command line.
+* Return error if no child read succeeds.
+* quorum: read all the children and do a quorum
+* vote on reads.
+*/
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
 bool is_read;
 int vote_ret;
+int child_iter; /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
 if (acb->is_read) {
 for (i = 0; i < s->num_children; i++) {
-qemu_vfree(acb->qcrs[i].buf);
-qemu_iovec_destroy(&acb->qcrs[i].qiov);
+if (i <= acb->child_iter) {
+qemu_vfree(acb->qcrs[i].buf);
+qemu_iovec_destroy(&acb->qcrs[i].qiov);
+}
 }
 }
 
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+int i;
+assert(dest->niov == source->niov);
+assert(dest->size == source->size);
+for (i = 0; i < source->niov; i++) {
+assert(dest->iov[i].iov_len == sourc

Re: [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support

2014-07-14 Thread Liu Yuan
On Mon, Jul 14, 2014 at 09:33:59PM -0600, Eric Blake wrote:
> On 07/14/2014 09:19 PM, Liu Yuan wrote:
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> 
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > 
> > vm -> write to all the N nodes, read just one of them. If single read 
> > fails, we
> > try to read next node in FIFO order specified by the startup command.
> > 
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with 
> > DRBD
> > we still have some advantages over it:
> > 
> 
> > +++ b/block.c
> > @@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
> >  
> >  if (!drv)
> >  return -ENOMEDIUM;
> > -
> > +
> >  if (!bs->backing_hd) {
> >  return -ENOTSUP;
> >  }
> 
> While this whitespace cleanup is nice, it doesn't belong in this patch,
> when there is no other change to this unrelated file.
> 
> 
> > +++ b/qapi/block-core.json
> > @@ -1398,12 +1398,17 @@
> >  # @rewrite-corrupted: #optional rewrite corrupted data when quorum is 
> > reached
> >  # (Since 2.1)
> >  #
> > +# @read-pattern: #optional choose quorum or fifo pattern for read
> > +#set to quorum by default (Since 2.2)
> > +#
> >  # Since: 2.0
> >  ##
> >  { 'type': 'BlockdevOptionsQuorum',
> >'data': { '*blkverify': 'bool',
> >  'children': [ 'BlockdevRef' ],
> > -'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
> > +'vote-threshold': 'int',
> > +'*rewrite-corrupted': 'bool',
> > +'*read-pattern': 'str' } }
> 
> Raw strings that encode a finite set of values are bad for type-safety.
>  Please add an enum:
> 
> { 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
> 
> then use '*read-pattern': 'QuorumReadPattern'

Does that imply a change for the startup command line too? Sorry, I'm not
familiar with block-core.json.

With your suggestion, how do we pass read-pattern via QMP?

 read-pattern: fifo
 or 
 read-pattern: quorum


> 
> Should we offer multiple modes in addition to 'quorum'?  For example, I
> could see a difference between 'fifo' (favor read from the first quorum
> member always, unless it fails, good when the first member is local and
> other member is remote) and 'round-robin' (evenly distribute reads; each
> read goes to the next available quorum member, good when all members are
> equally distant).

I guess so. 'round-robin' in your context would benefit use cases seeking
better read load balancing.

Thanks
Yuan



[Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum

2014-07-14 Thread Liu Yuan
Cc: Eric Blake 
Signed-off-by: Liu Yuan 
---
 qapi/block-core.json | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..22491bc 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,18 @@
 'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+# An enumeration of quorum read pattern.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read a single child and try next child in FIFO order if read fails.
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1410,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 # (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#(Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
 'children': [ 'BlockdevRef' ],
-'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+'vote-threshold': 'int',
+'*rewrite-corrupted': 'bool',
+'*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1
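
For readers unfamiliar with the qapi machinery referred to elsewhere in this
archive: an enum like the one above makes the generator emit roughly the
following C declarations (illustrative only; the exact output depends on the
qapi generator version in the tree).

/* roughly what the generated qapi-types.h provides for the enum above */
typedef enum QuorumReadPattern {
    QUORUM_READ_PATTERN_QUORUM = 0,
    QUORUM_READ_PATTERN_FIFO = 1,
    QUORUM_READ_PATTERN_MAX = 2,
} QuorumReadPattern;

/* string names indexed by the enum values, usable for option parsing */
extern const char *QuorumReadPattern_lookup[];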




[Qemu-devel] [PATCH v3 1/2] block/quorum: add simple read pattern support

2014-07-14 Thread Liu Yuan
This patch adds a single-read pattern to the quorum driver; quorum vote remains
the default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider use cases such as the following:

         VM
       -------
       |     |
       v     v
       A     B

Both A and B have hardware RAID storage to guarantee data integrity on their
own, so it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality for single device/node failures for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. But
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

After all, I think we can build more powerful replicated-image functionality on
top of quorum and block jobs (block mirror) to meet various High Availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 181 +
 1 file changed, 131 insertions(+), 50 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY  "blkverify"
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO   1
+int read_pattern;  /* fifo: read a single child and try first one
+* first. If error, try next child in an
+* FIFO order specifed by command line.
+* Return error if no child read succeeds.
+* quorum: read all the children and do a quorum
+* vote on reads.
+*/
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
 bool is_read;
 int vote_ret;
+int child_iter; /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
 if (acb->is_read) {
 for (i = 0; i < s->num_children; i++) {
-qemu_vfree(acb->qcrs[i].buf);
-qemu_iovec_destroy(&acb->qcrs[i].qiov);
+if (i <= acb->child_iter) {
+qemu_vfree(acb->qcrs[i].buf);
+qemu_iovec_destroy(&acb->qcrs[i].qiov);
+}
 }
 }
 
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+int i;
+assert(dest->niov == source->niov);
+assert(dest->size == source->size);
+for (i = 0; i < source->niov; i++) {
+assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+memcpy(dest->iov[i].iov_base,
+   source->iov[i].iov_base,
+   source->iov[i].iov_len);
+}
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -263,6 +292,21 @@ static void quorum_aio_cb(void *opaque, int ret)
 BDRVQuorumState *s = acb->common.bs->opaque;
 bool rewrite = false;
 
+if (acb->is_read && s->read_pattern ==

[Qemu-devel] [PATCH v3 0/2] add read-pattern for block quorum

2014-07-14 Thread Liu Yuan
v3:
 - separate patch into two, one for quorum and one for qapi for easier review
 - add enumeration for quorum read pattern
 - remove unrelated blank line fix from this patch set

v2:
 - rename single as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for fifo pattern

This patch adds a single-read pattern to the quorum driver; quorum vote remains
the default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider use cases such as the following:

         VM
       -------
       |     |
       v     v
       A     B

Both A and B have hardware RAID storage to guarantee data integrity on their
own, so it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality for single device/node failures for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. But
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

After all, I think we can build more powerful replicated-image functionality on
top of quorum and block jobs (block mirror) to meet various High Availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Liu Yuan (2):
  block/quorum: add simple read pattern support
  qapi: add read-pattern support for quorum

 block/quorum.c   | 181 +--
 qapi/block-core.json |  19 +-
 2 files changed, 149 insertions(+), 51 deletions(-)

-- 
1.9.1




Re: [Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum

2014-07-15 Thread Liu Yuan
On Tue, Jul 15, 2014 at 06:22:14AM -0600, Eric Blake wrote:
> On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > Cc: Eric Blake 
> > Signed-off-by: Liu Yuan 
> > ---
> >  qapi/block-core.json | 19 ++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index e378653..22491bc 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -1384,6 +1384,18 @@
> >  'raw': 'BlockdevRef' } }
> >  
> >  ##
> > +# @QuorumReadPattern
> > +# An enumeration of quorum read pattern.
> 
> s/pattern/patterns/
> 
> > +#
> > +# @quorum: read all the children and do a quorum vote on reads
> > +#
> > +# @fifo: read a single child and try next child in FIFO order if read 
> > fails.
> 
> Inconsistent on whether you are ending a line with '.'.  I'm not quite
> sure if this reads well; maybe:
> 
> @fifo: read only from the first child that has not failed
> 
> I also wonder if 'single' might be a better name than 'fifo', especially
> if we later add a 'round-robin' for read load balancing.

Both fifo and round-robin in our context indicate a single read.

But either one looks okay to me. We need to reach agreement on the name before I
send the next version: single, fifo, or anything else you suggest. Benoit,
Kevin, Stefan?

Thanks
Yuan



Re: [Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum

2014-07-15 Thread Liu Yuan
On Wed, Jul 16, 2014 at 10:22:20AM +0800, Liu Yuan wrote:
> On Tue, Jul 15, 2014 at 06:22:14AM -0600, Eric Blake wrote:
> > On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > > Cc: Eric Blake 
> > > Signed-off-by: Liu Yuan 
> > > ---
> > >  qapi/block-core.json | 19 ++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > > index e378653..22491bc 100644
> > > --- a/qapi/block-core.json
> > > +++ b/qapi/block-core.json
> > > @@ -1384,6 +1384,18 @@
> > >  'raw': 'BlockdevRef' } }
> > >  
> > >  ##
> > > +# @QuorumReadPattern
> > > +# An enumeration of quorum read pattern.
> > 
> > s/pattern/patterns/
> > 
> > > +#
> > > +# @quorum: read all the children and do a quorum vote on reads
> > > +#
> > > +# @fifo: read a single child and try next child in FIFO order if read 
> > > fails.
> > 
> > Inconsistent on whether you are ending a line with '.'.  I'm not quite
> > sure if this reads well; maybe:
> > 
> > @fifo: read only from the first child that has not failed
> > 
> > I also wonder if 'single' might be a better name than 'fifo', especially
> > if we later add a 'round-robin' for read load balancing.
> 
> Both fifo and round-robin in our context indicate a single read.
> 
> but either one looks okay to me. We need to reach agreement on the name 
> before I
> send the next version, single or fifo or any other suggested, Benoit, Kevin,
> Stefan?

Cc'ing the others mentioned.



Re: [Qemu-devel] [PATCH v3 1/2] block/quorum: add simple read pattern support

2014-07-15 Thread Liu Yuan
On Tue, Jul 15, 2014 at 06:14:32AM -0600, Eric Blake wrote:
> On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > This patch adds single read pattern to quorum driver and quorum vote is 
> > default
> > pattern.
> > 
> > For now we do a quorum vote on all the reads, it is designed for unreliable
> > underlying storage such as non-redundant NFS to make sure data integrity at 
> > the
> > cost of the read performance.
> > 
> 
> >  */
> > +
> > +#define READ_PATTERN_QUORUM 0 /* default */
> > +#define READ_PATTERN_FIFO   1
> > +int read_pattern;  /* fifo: read a single child and try first one
> > +* first. If error, try next child in an
> > +* FIFO order specifed by command line.
> > +* Return error if no child read 
> > succeeds.
> > +* quorum: read all the children and do a quorum
> > +* vote on reads.
> > +*/
> 
> Please swap the order of your two patches, or squash them back into one.
>  If you define the qapi enum first, then you don't need to open-code
> these #defines; instead, the qapi code will generate a typedef that puts
> the C enum QuorumReadPattern into play, so that this declaration becomes
> 'QuorumReadPattern read_pattern;'...
> 
> 
> > @@ -263,6 +292,21 @@ static void quorum_aio_cb(void *opaque, int ret)
> >  BDRVQuorumState *s = acb->common.bs->opaque;
> >  bool rewrite = false;
> >  
> > +if (acb->is_read && s->read_pattern == READ_PATTERN_FIFO) {
> 
> ...and code like this can compare against the constant
> QUORUM_READ_PATTERN_FIFO (which will be generated for you from the .json
> file)...
> 
> 
> > +read_pattern = qemu_opt_get(opts, QUORUM_OPT_READ_PATTERN);
> > +if (read_pattern) {
> > +if (strcmp(read_pattern, "fifo") == 0) {
> > +s->read_pattern = READ_PATTERN_FIFO;
> > +} else if (strcmp(read_pattern, "quorum") == 0) {
> > +s->read_pattern = READ_PATTERN_QUORUM;
> > +} else {
> > +error_setg(&local_err,
> > +   "Please set read-pattern as fifo or quorum\n");
> > +ret = -EINVAL;
> > +goto exit;
> > +}
> 
> ...and instead of open-coding this, you should use parse_enum_option().
> Also, by using the generated list here, if you later add a third read
> pattern, then this parser code doesn't have to change, but already
> automatically covers all valid options encoded in the qapi list.
> 

That's awesome, thanks for your pointers, Eric.

Yuan
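
Whether via parse_enum_option() or a small loop over the generated table, the
idea is to key off the qapi enum rather than hard-coded strings. A minimal
sketch, assuming the generated QuorumReadPattern_lookup[] table and the
QUORUM_READ_PATTERN_MAX sentinel (the helper name is made up, not necessarily
what was merged):

/* map the "read-pattern" option string to the generated enum value,
 * defaulting to quorum when the option is absent */
static int parse_read_pattern(const char *opt)
{
    int i;

    if (!opt) {
        return QUORUM_READ_PATTERN_QUORUM;
    }

    for (i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
        if (!strcmp(opt, QuorumReadPattern_lookup[i])) {
            return i;
        }
    }

    return -EINVAL;
}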



[Qemu-devel] [PATCH v4 0/2] add read-pattern for block quorum

2014-07-16 Thread Liu Yuan
v4:
 - swap the patch order
 - update comment for fifo pattern in qapi
 - use qapi enumeration in quorum driver instead of manual parsing

v3:
 - separate patch into two, one for quorum and one for qapi for easier review
 - add enumeration for quorum read pattern
 - remove unrelated blank line fix from this patch set

v2:
 - rename single as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for fifo pattern

This patch adds a single-read pattern to the quorum driver; quorum vote remains
the default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider use cases such as the following:

         VM
       -------
       |     |
       v     v
       A     B

Both A and B have hardware RAID storage to guarantee data integrity on their
own, so it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality for single device/node failures for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. But
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

After all, I think we can build more powerful replicated-image functionality on
top of quorum and block jobs (block mirror) to meet various High Availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Liu Yuan (2):
  qapi: add read-pattern support for quorum
  block/quorum: add simple read pattern support

 block/quorum.c   | 181 +--
 qapi/block-core.json |  19 +-
 2 files changed, 149 insertions(+), 51 deletions(-)

-- 
1.9.1




[Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support

2014-07-16 Thread Liu Yuan
This patch adds a single-read pattern to the quorum driver; quorum vote remains
the default pattern.

For now we do a quorum vote on all the reads. It is designed for unreliable
underlying storage, such as non-redundant NFS, to ensure data integrity at the
cost of read performance.

Consider use cases such as the following:

         VM
       -------
       |     |
       v     v
       A     B

Both A and B have hardware RAID storage to guarantee data integrity on their
own, so it would help performance if we did a single read instead of reading
from all the nodes. Further, if we run the VM on either of the storage nodes,
we can issue a local read request for better performance.

This patch generalizes the above 2-node case to N nodes. That is,

vm -> write to all N nodes, read from just one of them. If the single read
fails, we try the next node in the FIFO order specified by the startup command.

The 2-node case is very similar to DRBD[1], though it lacks auto-sync
functionality for single device/node failures for now. But compared with DRBD
we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of a 2-node DRBD-backed
storage. If A crashes, we need to restart all the VMs on node B. In practice we
can't, because B might not have enough resources to set up 20 VMs at once. But
if we run our 20 VMs with the quorum driver and scatter the replicated images
over the data center, we can very likely restart the 20 VMs without any
resource problem.

After all, I think we can build more powerful replicated-image functionality on
top of quorum and block jobs (block mirror) to meet various High Availability
needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet 
Cc: Eric Blake 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 179 +
 1 file changed, 131 insertions(+), 48 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..ebf5c71 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY  "blkverify"
 #define QUORUM_OPT_REWRITE"rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,8 @@ typedef struct BDRVQuorumState {
 bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
 * block if Quorum is reached.
 */
+
+QuorumReadPattern read_pattern;
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +120,7 @@ struct QuorumAIOCB {
 
 bool is_read;
 int vote_ret;
+int child_iter; /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
 if (acb->is_read) {
 for (i = 0; i < s->num_children; i++) {
-qemu_vfree(acb->qcrs[i].buf);
-qemu_iovec_destroy(&acb->qcrs[i].qiov);
+if (i <= acb->child_iter) {
+qemu_vfree(acb->qcrs[i].buf);
+qemu_iovec_destroy(&acb->qcrs[i].qiov);
+}
 }
 }
 
@@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+int i;
+assert(dest->niov == source->niov);
+assert(dest->size == source->size);
+for (i = 0; i < source->niov; i++) {
+assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+memcpy(dest->iov[i].iov_base,
+   source->iov[i].iov_base,
+   source->iov[i].iov_len);
+}
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
 QuorumChildRequest *sacb = opaque;
@@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret)
 BDRVQuorumState *s = acb->common.bs->opaque;
 bool rewrite = false;
 
+if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
+/* We try to read next child in FIFO order if we fail to read */
+if (ret < 0 && ++acb->child_iter < s->num_children) {
+read_fifo_child(acb);
+return;
+}
+
+if (ret == 0) {
+quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+}
+acb->vote_ret = ret;
+quorum_aio_finalize(acb);
+return;
+}
+
 sacb->ret = r

[Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum

2014-07-16 Thread Liu Yuan
Cc: Eric Blake 
Signed-off-by: Liu Yuan 
---
 qapi/block-core.json | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..42033d9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,19 @@
 'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+#
+# An enumeration of quorum read patterns.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read only from the first child that has not failed
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1411,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 # (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#(Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
 'children': [ 'BlockdevRef' ],
-'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+'vote-threshold': 'int',
+'*rewrite-corrupted': 'bool',
+'*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1




[Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status

2014-07-17 Thread Liu Yuan
- allow drive-mirror to create a sparse mirror on images like qcow2
- allow qemu-img map to work as expected on the quorum driver

Cc: Benoit Canet 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/quorum.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block/quorum.c b/block/quorum.c
index ebf5c71..f0d0a98 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -780,6 +780,21 @@ static coroutine_fn int quorum_co_flush(BlockDriverState 
*bs)
 return result;
 }
 
+static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs,
+   int64_t sector_num,
+   int nb_sectors,
+   int *pnum)
+{
+BDRVQuorumState *s = bs->opaque;
+BlockDriverState *child_bs = s->bs[0];
+
+if (child_bs->drv->bdrv_co_get_block_status)
+return child_bs->drv->bdrv_co_get_block_status(child_bs, sector_num,
+   nb_sectors, pnum);
+
+return bdrv_get_block_status(child_bs, sector_num, nb_sectors, pnum);
+}
+
 static bool quorum_recurse_is_first_non_filter(BlockDriverState *bs,
BlockDriverState *candidate)
 {
@@ -1038,6 +1053,7 @@ static BlockDriver bdrv_quorum = {
 .bdrv_close = quorum_close,
 
 .bdrv_co_flush_to_disk  = quorum_co_flush,
+.bdrv_co_get_block_status   = quorum_co_get_block_status,
 
 .bdrv_getlength = quorum_getlength,
 
-- 
1.9.1




Re: [Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status

2014-07-17 Thread Liu Yuan
On Fri, Jul 18, 2014 at 11:11:31AM +0800, Fam Zheng wrote:
> On Thu, 07/17 19:50, Liu Yuan wrote:
> > - allow drive-mirror to create sprase mirror on images like qcow2
> > - allow qemu-img map to work as expected on quorum driver
> > 
> > Cc: Benoit Canet 
> > Cc: Kevin Wolf 
> > Cc: Stefan Hajnoczi 
> > Signed-off-by: Liu Yuan 
> > ---
> >  block/quorum.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/block/quorum.c b/block/quorum.c
> > index ebf5c71..f0d0a98 100644
> > --- a/block/quorum.c
> > +++ b/block/quorum.c
> > @@ -780,6 +780,21 @@ static coroutine_fn int 
> > quorum_co_flush(BlockDriverState *bs)
> >  return result;
> >  }
> >  
> > +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState 
> > *bs,
> > +   int64_t sector_num,
> > +   int nb_sectors,
> > +   int *pnum)
> > +{
> > +BDRVQuorumState *s = bs->opaque;
> > +BlockDriverState *child_bs = s->bs[0];
> 
> Should we consider other children?
> 

No, the other children are identical because we always write to all of them
synchronously, and at open time we check the length.

Thanks
Yuan



Re: [Qemu-devel] [PATCH v2] configure: make libnfs not_found message user friendly

2014-07-18 Thread Liu Yuan
On Tue, Jul 15, 2014 at 04:16:14PM +0200, Stefan Hajnoczi wrote:
> On Mon, Jul 14, 2014 at 11:06:08AM +0800, Hu Tao wrote:
> > On Sat, Jul 12, 2014 at 11:17:40AM +0800, Liu Yuan wrote:
> > > Cc: Kevin Wolf 
> > > Signed-off-by: Liu Yuan 
> > > ---
> > >  configure | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/configure b/configure
> > > index 7dd43fd..78e7baf 100755
> > > --- a/configure
> > > +++ b/configure
> > > @@ -3996,7 +3996,7 @@ if test "$libnfs" != "no" ; then
> > >  LIBS="$LIBS $libnfs_libs"
> > >else
> > >  if test "$libnfs" = "yes" ; then
> > > -  feature_not_found "libnfs"
> > > +  feature_not_found "libnfs" "Install libnfs-devel >= 1.9.3"
> > >  fi
> > >  libnfs="no"
> > >fi
> > > -- 
> > > 1.9.1
> > > 
> > 
> > I already had a patch that fixed all similar cases:
> > http://lists.nongnu.org/archive/html/qemu-devel/2014-06/msg06411.html.
> 
> Thanks, I have applied Hu's patch to my block-next tree because it
> includes this and other cases.
> 
> Please CC me on block layer patches.  I co-maintain the block layer with
> Kevin and am merging patches this week.

Umm, from git log I saw that block layer patches touching this file were mostly
sent to Kevin, so I thought I should only send it to Kevin. I'll include you
from now on, as I did before.

Yuan



[Qemu-devel] [Bug] Segmentation fault with QEMU v1.1.0 running Sheepdog block driver

2012-06-03 Thread Liu Yuan
Hi list,
  This segfault was hit while installing RHEL 6 on a Sheepdog cluster.


Program terminated with signal 11, Segmentation fault.
#0  0x7f777fc30359 in send_pending_req (s=0x7f77818a2160,
oid=71271095331718270, id=1858) at block/sheepdog.c:629
629 QLIST_FOREACH_SAFE(aio_req, &s->outstanding_aio_head,
(gdb) where
#0  0x7f777fc30359 in send_pending_req (s=0x7f77818a2160,
oid=71271095331718270, id=1858) at block/sheepdog.c:629
#1  0x7f777fc3056b in aio_read_response (opaque=0x7f77818a2160) at
block/sheepdog.c:718
#2  0x7f777fc54fa8 in coroutine_trampoline (i0=-2117283968,
i1=32631) at coroutine-ucontext.c:129
#3  0x7f777c3f4f40 in ?? () from /lib/libc.so.6
#4  0x7fffdb28af60 in ?? ()
#5  0x in ?? ()

Thanks,
Yuan



[Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2012-12-19 Thread Liu Yuan
From: Liu Yuan 

Sheepdog supports both writeback and writethrough writes but does not yet support
DIRECTIO semantics, which bypass the cache completely even if the Sheepdog daemon
is set up with its cache enabled.

Assuming the cache is enabled on the Sheepdog daemon side, the new cache controls are

cache=writeback # enable writeback semantics for writes
cache=writethrough # enable writethrough semantics for writes
cache='directsync | none | off' # disable the cache completely

Cc: MORITA Kazutaka 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 block/sheepdog.c |   67 ++
 1 file changed, 37 insertions(+), 30 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index ceabc00..134329a 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -36,7 +36,8 @@
 
 #define SD_FLAG_CMD_WRITE0x01
 #define SD_FLAG_CMD_COW  0x02
-#define SD_FLAG_CMD_CACHE0x04
+#define SD_FLAG_CMD_CACHE0x04 /* Writeback mode for cache */
+#define SD_FLAG_CMD_DIRECT   0x08 /* Don't use cache */
 
 #define SD_RES_SUCCESS   0x00 /* Success */
 #define SD_RES_UNKNOWN   0x01 /* Unknown error */
@@ -293,7 +294,7 @@ typedef struct BDRVSheepdogState {
 
 char name[SD_MAX_VDI_LEN];
 bool is_snapshot;
-bool cache_enabled;
+uint32_t cache_flags;
 
 char *addr;
 char *port;
@@ -977,8 +978,8 @@ static int coroutine_fn add_aio_request(BDRVSheepdogState 
*s, AIOReq *aio_req,
 hdr.flags = SD_FLAG_CMD_WRITE | flags;
 }
 
-if (s->cache_enabled) {
-hdr.flags |= SD_FLAG_CMD_CACHE;
+if (s->cache_flags) {
+hdr.flags |= s->cache_flags;
 }
 
 hdr.oid = oid;
@@ -1023,7 +1024,7 @@ static int coroutine_fn add_aio_request(BDRVSheepdogState 
*s, AIOReq *aio_req,
 
 static int read_write_object(int fd, char *buf, uint64_t oid, int copies,
  unsigned int datalen, uint64_t offset,
- bool write, bool create, bool cache)
+ bool write, bool create, uint32_t cache_flags)
 {
 SheepdogObjReq hdr;
 SheepdogObjRsp *rsp = (SheepdogObjRsp *)&hdr;
@@ -1047,9 +1048,7 @@ static int read_write_object(int fd, char *buf, uint64_t 
oid, int copies,
 hdr.opcode = SD_OP_READ_OBJ;
 }
 
-if (cache) {
-hdr.flags |= SD_FLAG_CMD_CACHE;
-}
+hdr.flags |= cache_flags;
 
 hdr.oid = oid;
 hdr.data_length = datalen;
@@ -1072,18 +1071,19 @@ static int read_write_object(int fd, char *buf, 
uint64_t oid, int copies,
 }
 
 static int read_object(int fd, char *buf, uint64_t oid, int copies,
-   unsigned int datalen, uint64_t offset, bool cache)
+   unsigned int datalen, uint64_t offset,
+   uint32_t cache_flags)
 {
 return read_write_object(fd, buf, oid, copies, datalen, offset, false,
- false, cache);
+ false, cache_flags);
 }
 
 static int write_object(int fd, char *buf, uint64_t oid, int copies,
 unsigned int datalen, uint64_t offset, bool create,
-bool cache)
+uint32_t cache_flags)
 {
 return read_write_object(fd, buf, oid, copies, datalen, offset, true,
- create, cache);
+ create, cache_flags);
 }
 
 static int sd_open(BlockDriverState *bs, const char *filename, int flags)
@@ -1118,12 +1118,19 @@ static int sd_open(BlockDriverState *bs, const char 
*filename, int flags)
 goto out;
 }
 
-s->cache_enabled = true;
-s->flush_fd = connect_to_sdog(s->addr, s->port);
-if (s->flush_fd < 0) {
-error_report("failed to connect");
-ret = s->flush_fd;
-goto out;
+if (flags & BDRV_O_NOCACHE) {
+s->cache_flags = SD_FLAG_CMD_DIRECT;
+} else if (flags & BDRV_O_CACHE_WB) {
+s->cache_flags = SD_FLAG_CMD_CACHE;
+}
+
+if (s->cache_flags != SD_FLAG_CMD_DIRECT) {
+s->flush_fd = connect_to_sdog(s->addr, s->port);
+if (s->flush_fd < 0) {
+error_report("failed to connect");
+ret = s->flush_fd;
+goto out;
+}
 }
 
 if (snapid || tag[0] != '\0') {
@@ -1140,7 +1147,7 @@ static int sd_open(BlockDriverState *bs, const char 
*filename, int flags)
 
 buf = g_malloc(SD_INODE_SIZE);
 ret = read_object(fd, buf, vid_to_vdi_oid(vid), 0, SD_INODE_SIZE, 0,
-  s->cache_enabled);
+  s->cache_flags);
 
 closesocket(fd);
 
@@ -1387,7 +1394,7 @@ static void sd_close(BlockDriverState *bs)
 
 qemu_aio_set_fd_handler(s->fd, NULL, NULL, NULL, NULL);
 closesocket(s->fd);
-if (s->cache_enabled) {
+if (s->cache_flags) {
 closesocket(s->flush_fd);
 }
 g_fre

Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2012-12-25 Thread Liu Yuan
On 12/25/2012 03:47 PM, MORITA Kazutaka wrote:
> I wonder if we should disable cache when cache=none.  Many management
> frontend uses cache=none by default but, I think, users still expect
> that data is cached (e.g. by disk write cache when a raw format is
> used).  cache=none only means that the host page cache is not used for
> VM disk IO.
> 
> In that sense,
> 

I am fine with adapting the option to these semantics.

>> > @@ -1118,12 +1118,19 @@ static int sd_open(BlockDriverState *bs, const 
>> > char *filename, int flags)
>> >  goto out;
>> >  }
>> >  
>> > -s->cache_enabled = true;
>> > -s->flush_fd = connect_to_sdog(s->addr, s->port);
>> > -if (s->flush_fd < 0) {
>> > -error_report("failed to connect");
>> > -ret = s->flush_fd;
>> > -goto out;
>> > +if (flags & BDRV_O_NOCACHE) {
>> > +s->cache_flags = SD_FLAG_CMD_DIRECT;
>> > +} else if (flags & BDRV_O_CACHE_WB) {
> 'else' should be removed, and
> 

It seems it should not be removed. We need this 'else if' to handle the writethrough flag.

>> > +s->cache_flags = SD_FLAG_CMD_CACHE;
>> > +}
>> > +
>> > +if (s->cache_flags != SD_FLAG_CMD_DIRECT) {
> should be 's->cache_flags == SD_FLAG_CMD_CACHE'?  Do we need to send a
> flush request when cache=writethourgh?

Looks good to me. I'll change it in v2.

Thanks,
Yuan




Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2012-12-25 Thread Liu Yuan
On 12/25/2012 03:47 PM, MORITA Kazutaka wrote:
> I wonder if we should disable cache when cache=none.  Many management
> frontend uses cache=none by default but, I think, users still expect
> that data is cached (e.g. by disk write cache when a raw format is
> used).  cache=none only means that the host page cache is not used for
> VM disk IO.
> 
> In that sense,

Well, I found that setting cache=directsync still includes 'BDRV_O_CACHE_WB' in the flags.
Is this a bug in current master? If not, my current scheme will be the
only way to bypass the sheepdog cache.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-03 Thread Liu Yuan
On 12/25/2012 04:45 PM, Liu Yuan wrote:
> Well, I found setting cache=directsync will contain 'BDRV_O_CACHE_WB'.
> Is this a bug for current master? If no, my current scheme will be the
> only way to bypass cache of sheepdog.

Ping. Can anyone confirm whether it is a bug that 'cache=directsync' passes
'BDRV_O_CACHE_WB' in the flags?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-04 Thread Liu Yuan
On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:
> Hi Yuan,
> BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).
> 
> BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
> means whether the disk cache is writethrough or writeback.
> 
> In other words, BDRV_O_NOCACHE is a host performance tweak while
> BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
> protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
> because it is implemented in block.c (see bdrv_co_do_writev() where QEMU
> will flush when after each write when !bs->enable_write_cache).

Hi Stefan,

  Thanks for your explanation. But after more investigation, I find
myself more confused:

  flags passed from block layer
  {writeback, writethrough}   0x2042
  {directsync, off, none}     0x2062
  {unsafe}                    0x2242

So an underlying driver like Sheepdog can't depend on the 'flags' passed to
.bdrv_file_open() to choose the right semantics (this was possible in
old QEMU, IIRC).

If we can't rely on the 'flags' to learn the user's cache preference,
could you point me to how to implement tristate cache control for a network
block driver like Sheepdog? For example, I want to implement the following
semantics:
cache=writeback|none|off # enable writeback semantics for write
cache=writethrough # enable writethrough semantics for write
cache=directsync # disable cache completely

Thanks,
Yuan






Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-04 Thread Liu Yuan
On 01/05/2013 12:40 PM, Liu Yuan wrote:
> On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:
>> Hi Yuan,
>> BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).
>>
>> BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
>> means whether the disk cache is writethrough or writeback.
>>
>> In other words, BDRV_O_NOCACHE is a host performance tweak while
>> BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
>> protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
>> because it is implemented in block.c (see bdrv_co_do_writev() where QEMU
>> will flush when after each write when !bs->enable_write_cache).
> 
> Hi Stefan,
> 
>   Thanks for your explanation. But after more investigation, I find
> myself more confused:
> 
>   flags passed from block layer
>   {writeback, writethrough}   0x2042
>   {directsync, off, none}     0x2062
>   {unsafe}                    0x2242
> 
> So underlying driver like Sheepdog can't depend on 'flags' passed from
> .bdrv_file_open() to choose the right semantics (This was possible for
> old QEMU IIRC).
> 
> If we can't rely on the 'flags' to get the cache indications of users,
> would you point me how to implement tristate cache control for network
> block driver like Sheepdog? For e.g, I want to implement following
> semantics:
> cache=writeback|none|off # enable writeback semantics for write
> cache=writethrough # enable writethrough semantics for write
> cache=directsync # disable cache completely
> 
> Thanks,
> Yuan
> 

I tried old QEMU, v1.1.0 and v1.1.2, and they still worked as I
expected. So I guess the generic block layer changed a bit and the
meaning of 'flags' is now different from the old code, which did allow
block drivers to interpret the 'flags' passed to bdrv_file_open().

With the current upstream code, it seems that BDRV_O_CACHE_WB is always
enabled, which makes 'flags' completely unusable for block drivers to learn
what the user requested with the 'cache=' field.

So is there another means block drivers can rely on to interpret the
cache semantics?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-04 Thread Liu Yuan
On 01/05/2013 01:29 PM, Liu Yuan wrote:
> On 01/05/2013 12:40 PM, Liu Yuan wrote:
>> On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:
>>> Hi Yuan,
>>> BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).
>>>
>>> BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
>>> means whether the disk cache is writethrough or writeback.
>>>
>>> In other words, BDRV_O_NOCACHE is a host performance tweak while
>>> BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
>>> protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
>>> because it is implemented in block.c (see bdrv_co_do_writev() where QEMU
>>> will flush when after each write when !bs->enable_write_cache).
>>
>> Hi Stefan,
>>
>>   Thanks for your explanation. But after more investigation, I find
>> myself more confused:
>>
>>   flags passed from block layer
>>   {writeback, writethrough}   0x2042
>>   {directsync, off, none}     0x2062
>>   {unsafe}                    0x2242
>>
>> So underlying driver like Sheepdog can't depend on 'flags' passed from
>> .bdrv_file_open() to choose the right semantics (This was possible for
>> old QEMU IIRC).
>>
>> If we can't rely on the 'flags' to get the cache indications of users,
>> would you point me how to implement tristate cache control for network
>> block driver like Sheepdog? For e.g, I want to implement following
>> semantics:
>> cache=writeback|none|off # enable writeback semantics for write
>> cache=writethrough # enable writethrough semantics for write
>> cache=directsync # disable cache completely
>>
>> Thanks,
>> Yuan
>>
> 
> I tried the old QEMU and v1.1.0 and v1.1.2, they still worked as I
> expected. So I guess generic block layer got changed a bit and the
> 'flags' meaning turned different than old code, which did indeed allow
> block drivers to interpret the 'flags' passed from bdrv_file_open().
> 
> With the current upstream code, it seems that BDRV_O_CACHE_WB is always
> enabled and makes 'flags' completely unusable for block drivers to get
> the indications of user by specifying 'cache=' field.
> 
> So is there other means to allow block drivers to rely on, in order to
> interpret the 'cache semantics'?
> 

I found the commit e1e9b0ac 'always open drivers in writeback mode'.
This is really undesirable for network block drivers such as Sheepdog,
which implement their own cache mechanism supporting
writeback/writethrough/directio behavior and want to interpret
these flags on their own.

Is there any means for block drivers to get the semantics of 'cache=xxx'
from the user?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-07 Thread Liu Yuan
On 01/07/2013 08:31 PM, Stefan Hajnoczi wrote:
> On Sat, Jan 05, 2013 at 03:56:11PM +0800, Liu Yuan wrote:
>> On 01/05/2013 01:29 PM, Liu Yuan wrote:
>>> On 01/05/2013 12:40 PM, Liu Yuan wrote:
>>>> On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:
>>>>> Hi Yuan,
>>>>> BDRV_O_NOCACHE means bypass host page cache (O_DIRECT).
>>>>>
>>>>> BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that
>>>>> means whether the disk cache is writethrough or writeback.
>>>>>
>>>>> In other words, BDRV_O_NOCACHE is a host performance tweak while
>>>>> BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.  A
>>>>> protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB
>>>>> because it is implemented in block.c (see bdrv_co_do_writev() where QEMU
>>>>> will flush when after each write when !bs->enable_write_cache).
>>>>
>>>> Hi Stefan,
>>>>
>>>>   Thanks for your explanation. But after more investigation, I find
>>>> myself more confused:
>>>>
>>>>   flags passed from block layer
>>>>   {writeback, writethrough}   0x2042
>>>>   {directsync, off, none}     0x2062
>>>>   {unsafe}                    0x2242
>>>>
>>>> So underlying driver like Sheepdog can't depend on 'flags' passed from
>>>> .bdrv_file_open() to choose the right semantics (This was possible for
>>>> old QEMU IIRC).
>>>>
>>>> If we can't rely on the 'flags' to get the cache indications of users,
>>>> would you point me how to implement tristate cache control for network
>>>> block driver like Sheepdog? For e.g, I want to implement following
>>>> semantics:
>>>> cache=writeback|none|off # enable writeback semantics for write
>>>> cache=writethrough # enable writethrough semantics for write
>>>> cache=directsync # disable cache completely
>>>>
>>>> Thanks,
>>>> Yuan
>>>>
>>>
>>> I tried the old QEMU and v1.1.0 and v1.1.2, they still worked as I
>>> expected. So I guess generic block layer got changed a bit and the
>>> 'flags' meaning turned different than old code, which did indeed allow
>>> block drivers to interpret the 'flags' passed from bdrv_file_open().
>>>
>>> With the current upstream code, it seems that BDRV_O_CACHE_WB is always
>>> enabled and makes 'flags' completely unusable for block drivers to get
>>> the indications of user by specifying 'cache=' field.
>>>
>>> So is there other means to allow block drivers to rely on, in order to
>>> interpret the 'cache semantics'?
>>>
>>
>> I found the commit:e1e9b0ac 'always open drivers in writeback mode'.
>> This is really undesired for network block drivers such as Sheepdog,
>> which implement its own cache mechanism that support
>> writeback/writethrough/directio behavior and then want to interpret
>> these flags on its own.
>>
>> Is there any means for block drivers to get the semantics of 'cache=xxx'
>> from users?
> 
> Hi Yuan,
> Please explain what cache semantics the sheepdog server supports so I
> can understand better what you are trying to achieve.
> 

Hi Stefan,

Sheepdog supports writethrough, writeback, and directio (cache bypass) modes, and
I want to implement the following semantics:
cache=writeback|none|off # enable writeback semantics for write
cache=writethrough # enable writethrough semantics for write
cache=directsync # disable cache completely

Thanks,
Yuan
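
A minimal sketch of the tristate interpretation being asked for here, reusing
the SD_FLAG_CMD_DIRECT/SD_FLAG_CMD_CACHE bits from the v1 patch earlier in this
thread (the helper name is made up for illustration; as the rest of the thread
shows, this mapping no longer works on current master because BDRV_O_CACHE_WB
is always set):

static uint32_t sd_cache_flags_from_open_flags(int flags)
{
    if (flags & BDRV_O_NOCACHE) {
        return SD_FLAG_CMD_DIRECT;   /* cache=directsync: bypass the sheep cache */
    }
    if (flags & BDRV_O_CACHE_WB) {
        return SD_FLAG_CMD_CACHE;    /* cache=writeback|none|off: sheep writeback cache */
    }
    return 0;                        /* cache=writethrough: read cache only, no flush needed */
}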



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-07 Thread Liu Yuan
On 01/07/2013 09:23 PM, Kevin Wolf wrote:
> No, and in theory they shouldn't have to care.
> 
> Why do you want to handle writethrough semantics in the block driver
> rather than letting qemu care for sending the right flushes?

Because Sheepdog itself provides the client-side cache implementation, and
we need a means to get the cache hint from the user when starting up the VM.
This cache supports:
 writethrough: provide a read cache for this VM, which is always
consistent with the cluster data
 writeback: provide a writeback cache for this VM, which only flushes the
dirty bits when the VM sends a 'flush' request
 directio: disable the cache completely for this VM

Old QEMU (before v1.2) allowed block drivers to see the cache flags, so
Sheepdog could interpret them on its own to choose the right cache
mode for each VM. Doesn't QEMU need to keep backward compatibility?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-08 Thread Liu Yuan
On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
> Otherwise use sheepdog writeback and let QEMU block.c decide when to
> flush.  Never use sheepdog writethrough because it's redundant here.

I don't get it. What do you mean by 'redundant'? If we use virtio with the
sheepdog block driver, how can we specify writethrough mode for the Sheepdog
cache? Here 'writethrough' means using a pure read cache, which doesn't
need flushes at all.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-08 Thread Liu Yuan
On 01/08/2013 06:00 PM, Kevin Wolf wrote:
> Am 08.01.2013 10:45, schrieb Liu Yuan:
>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
>>> flush.  Never use sheepdog writethrough because it's redundant here.
>>
>> I don't get it. What do you mean by 'redundant'? If we use virtio &
>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
>> cache? Here 'writethrough' means use a pure read cache, which doesn't
>> need flush at all.
> 
> A writethrough cache is equivalent to a write-back cache where each
> write is followed by a flush. qemu makes sure to send these flushes, so
> there is no need use Sheepdog's writethrough mode.
> 

Implementing writethrough as writeback + flush will cause considerable
overhead for a network block device like Sheepdog: a single write request
will be executed as two requests, write + flush. This also explains why
I saw a regression in write performance: old QEMU could issue multiple
write requests in one go, but now the requests are sent one by one (even
with cache=writeback set), which makes Sheepdog write performance drop a
lot. Is it possible to issue multiple requests in one go as old QEMU did?

It seems hard to restore the old semantics of the cache flags due to the
new design of the QEMU block layer. So would you accept adding a 'flags'
field to BlockDriverState which carries the user's cache flags, to keep
backward compatibility?

Thanks,
Yuan
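
In outline, the emulation that creates the extra request is the one Stefan
described earlier in this thread for bdrv_co_do_writev(); a simplified sketch
of that logic (not verbatim block.c code):

ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);

if (ret == 0 && !bs->enable_write_cache) {
    /* writethrough emulation: the generic layer appends a flush after
     * every successful write, which is the second request Sheepdog sees */
    ret = bdrv_co_flush(bs);
}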



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-08 Thread Liu Yuan
On 01/08/2013 06:51 PM, Kevin Wolf wrote:
> Am 08.01.2013 11:39, schrieb Liu Yuan:
>> On 01/08/2013 06:00 PM, Kevin Wolf wrote:
>>> Am 08.01.2013 10:45, schrieb Liu Yuan:
>>>> On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote:
>>>>> Otherwise use sheepdog writeback and let QEMU block.c decide when to
>>>>> flush.  Never use sheepdog writethrough because it's redundant here.
>>>>
>>>> I don't get it. What do you mean by 'redundant'? If we use virtio &
>>>> sheepdog block driver, how can we specify writethrough mode for Sheepdog
>>>> cache? Here 'writethrough' means use a pure read cache, which doesn't
>>>> need flush at all.
>>>
>>> A writethrough cache is equivalent to a write-back cache where each
>>> write is followed by a flush. qemu makes sure to send these flushes, so
>>> there is no need use Sheepdog's writethrough mode.
>>
>> Implement writethrough as writeback + flush will cause considerable
>> overhead for network block device like Sheepdog: a single write request
>> will be executed as two requests: write + flush
> 
> Yeah, maybe we should have some kind of a FUA flag with write requests
> instead of sending a separate flush.
> 
>> This also explains why
>> I saw a regression about write performance: Old QEMU can issue multiple
>> write requests in one go, but now the requests are sent one by one (even
>> with cache=writeback set), which makes Sheepdog write performance drop a
>> lot. Is it possible to issue multiple requests in one go as old QEMU does?
> 
> Huh? We didn't change anything to that respect, or at least not that I'm
> aware of. qemu always only had single-request bdrv_co_writev, so if
> anything that batching must have happened inside Sheepdog code? Do you
> know what makes it not batch requests any more?
> 

QEMU v1.1.x works well with batched write requests. The Sheepdog block
driver doesn't do any batching trick as far as I know; it just sends requests
as they are fed to it. There are no noticeable changes between v1.1.x and
current master with regard to sheepdog.c.

To detail the different behavior, as seen from the Sheepdog daemon which
receives the requests from QEMU:
 old: can receive many requests at virtually the same time
and handle them concurrently
 now: only receives one request, handles it, replies and gets another.

So I think the problem is that current QEMU waits for the write response
before sending another request.

>> It seems it is hard to restore into old semantics of cache flags due to
>> new design of QEMU block layer. So will you accept that adding a 'flags'
>> into BlockDriverState which carry the 'cache flags' from user to keep
>> backward compatibility?
> 
> No, going back to the old behaviour would break guest-toggled WCE.
> 

Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't
support it, no? And I think there is a huge number of virtio-blk users.

I didn't mean to break WCE. What I meant is to keep backward
compatibility. For example, the Sheepdog driver could make use of this dedicated
cache flags field to implement its own cache control without affecting other
drivers at all.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-08 Thread Liu Yuan
On 01/08/2013 07:19 PM, Kevin Wolf wrote:
> I can't see a reason why it would do that. Can you bisect this?
> 

Sure, bisecting it is on my schedule, but I can't promise a deadline.

 >>> It seems it is hard to restore into old semantics of cache flags due to
 >>> new design of QEMU block layer. So will you accept that adding a 
 >>> 'flags'
 >>> into BlockDriverState which carry the 'cache flags' from user to keep
 >>> backward compatibility?
>>> >>
>>> >> No, going back to the old behaviour would break guest-toggled WCE.
>>> >>
>> > 
>> > Guest-toggled WCE only works with IDE and seems that virtio-blk doesn't
>> > support it, no? And I think there are huge virtio-blk users.
> It works with virtio-blk and SCSI as well.
> 

Okay, I see the code does support WCE, but it requires the guest kernel to
support it. For a kernel that doesn't support it, there is no way to
disable the write cache, no?

>> > I didn't meant to break WCE. What I meant is to allow backward
>> > compatibility. For e.g, Sheepdog driver can make use of this dedicated
>> > cache flags to implement its own cache control and doesn't affect other
>> > drivers at all.
> How would you do it? With a WCE that changes during runtime the idea of
> a flag that is passed to bdrv_open() and stays valid as long as the
> BlockDriverState exists doesn't match reality any more.

I am not sure if I get it right. What I meant is to let Sheepdog
control the cache based on the 'cache flags' at startup and ignore WCE at run
time.

So you mean, if I choose a writeback cache at startup, and then the guest
disables it via WCE, the block layer will never send a flush request down
to the Sheepdog driver?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-08 Thread Liu Yuan
On 01/08/2013 08:12 PM, Kevin Wolf wrote:
> Ok, thanks. It would be good if it was before the hard freeze for 1.4.
> 

Oops, sorry. It is a false alarm. Last time I was running the latest QEMU on
tmpfs, which serviced the requests too fast.

>  It seems it is hard to restore into old semantics of cache 
>  flags due to
>  new design of QEMU block layer. So will you accept that 
>  adding a 'flags'
>  into BlockDriverState which carry the 'cache flags' from 
>  user to keep
>  backward compatibility?
>>> >>
>>> >> No, going back to the old behaviour would break guest-toggled 
>>> >> WCE.
>>> >>
> 
>  Guest-toggled WCE only works with IDE and seems that virtio-blk 
>  doesn't
>  support it, no? And I think there are huge virtio-blk users.
>>> >> It works with virtio-blk and SCSI as well.
>>> >>
>> > 
>> > Okay, I see the code indeed support WCE but it requires Guest kernel to
>> > support it. For the kernel doesn't support it, there is no way to
>> > disable write cache, no?
> With Linux guests, it's possible for SCSI. I'm not sure about
> virtio-blk, but you might be right that you can't manually change it there.
> 
>  I didn't meant to break WCE. What I meant is to allow backward
>  compatibility. For e.g, Sheepdog driver can make use of this 
>  dedicated
>  cache flags to implement its own cache control and doesn't affect 
>  other
>  drivers at all.
>>> >> How would you do it? With a WCE that changes during runtime the idea of
>>> >> a flag that is passed to bdrv_open() and stays valid as long as the
>>> >> BlockDriverState exists doesn't match reality any more.
>> > 
>> > I am not sure if I get it right. What I meant is allow Sheepdog to
>> > control cache on the 'cache flags' at startup and ignore WCE on the run
>> > time.
> If you start with cache=writeback and then the guests switches WCE off
> and you ignore that, then you're causing data corruption in the case of
> a crash. This is not an option.
> 
>> > So you mean, if I choose witeback cache at startup, and then Guest
>> > disable it via WCE, then block layer will never send flush request down
>> > to Sheepdog driver?
> Yes, this is the problematic case. Currently the qemu block layer makes
> sure to send flushes, but if you disable that logic for Sheepdog, you
> would get broken behaviour in this case.

Maybe not, on second thought. See the following combination:

   cache flags     WCE toggled and resulting behavior
   writethrough    writethrough
   writeback       writethrough (writeback + flush, as expected)

'cache flags' means the 'cache=xxx' given at startup, and WCE is toggled on the
fly in the guest (supposing the guest kernel supports WCE control).

So the result is *not* broken. If we set cache=writethrough for
sheepdog, then WCE won't take any effect because the 'flush' request will be
ignored by the Sheepdog driver. And with cache=writeback, WCE does disable
the write cache and actually turns it into a writethrough cache by sending
a flush request after every write.

To conclude, letting Sheepdog interpret the cache flags won't cause trouble even
with the current guest WCE feature; the difference is that if we set
cache=writethrough, the guest can't change it via WCE toggling. Is this
behavior acceptable?

Thanks,
Yuan
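
In code terms, the first row of the table above simply means the flush callback
becomes a no-op when the image was opened writethrough; a rough sketch using
the cache_flags field and SD_FLAG_CMD_CACHE bit from the patch in this thread
(not the code that was eventually merged, and do_sd_flush_request() is only a
stand-in for the driver's existing flush path):

static int coroutine_fn sd_co_flush_to_disk(BlockDriverState *bs)
{
    BDRVSheepdogState *s = bs->opaque;

    if (s->cache_flags != SD_FLAG_CMD_CACHE) {
        /* writethrough or direct mode: the sheep cache holds no dirty
         * data, so guest/WCE-triggered flushes can succeed immediately */
        return 0;
    }

    /* otherwise forward the flush to the sheep daemon */
    return do_sd_flush_request(s);
}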



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-08 Thread Liu Yuan
On 01/08/2013 09:18 PM, Liu Yuan wrote:
> So the result is *not* broken. If we set cache=writethrough for
> sheepdog, then WCE won't take any effect because 'flush' request will be
> ignored by Sheepdog driver. 

The merit of this 'writethrough' is that a write request will never be
turned into two requests (write + flush) in the Sheepdog driver.

Thanks,
Yuan




Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 06:25 PM, Paolo Bonzini wrote:
> But why is it useful to force-disable writeback caching?  Do you have
> any performance numbers?

This is not related to performance. What I want is to allow users to
choose the Sheepdog cache mode (writethrough/writeback/directio) from cache=xxx.

Obviously writeback mode shows much better write performance than the
writethrough cache in Sheepdog. NOTE: the writethrough cache in Sheepdog is
a read-only cache which is always consistent with the cluster data and thus needs
no flush at all. It is there to boost read performance for read-intensive
guests on Sheepdog images.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 06:40 PM, Paolo Bonzini wrote:
> 1) how slower is QEMU's emulated-writethrough mode for writes, due to
> the extra requests?
> 

I'll collect some numbers on it.

> 2) how slower is QEMU's writeback mode for reads, due to the different
> structure of the cache?

Sorry, I don't get your question.

I didn't say QEMU's writeback mode is slow for reads, did I?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 06:46 PM, Liu Yuan wrote:
> I'll collect some numbers on it.

Well, when I use hdparm -W 0 /dev/vdx to choose the writethrough cache
in the Guest (virtio disk), it says:

test@vm1:~$ sudo hdparm -W 0 /dev/vdb

/dev/vdb:
 setting drive write-caching to 0 (off)
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(setcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device

Do I need extra flags to enable WCE?

Both the guest and host kernels are 3.7, and I start QEMU with the following command:

$ qemu-system-x86_64 --enable-kvm -drive
file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024
-drive file=sheepdog:test,if=virtio,cache=writeback

Thanks,
Yuan





Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 06:46 PM, Liu Yuan wrote:
>> 1) how slower is QEMU's emulated-writethrough mode for writes, due to
>> > the extra requests?
>> > 
> I'll collect some numbers on it.
> 

Okay, I got some numbers. I ran three sheep daemons on the same host to
emulate a 3-node cluster; the Sheepdog image has 3 copies, and I put the
Sheepdog client cache and Sheepdog backend storage on tmpfs.

Guest and host are both Linux 3.7. I start QEMU with the following command:

$ qemu-system-x86_64 --enable-kvm -drive
file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024
-drive file=sheepdog:test,if=virtio,cache=writeback

I ran 'dd if=/dev/urandom of=/dev/vdb bs=1M count=100
oflag=direct' 5 times and took the average:

emulated (write + flush)   old impl (single write)
13.3 MB/s                  13.7 MB/s

boost percentage: (13.7 - 13.3)/13.3 = 3%.

The boost is not big, but if we ran QEMU and the sheep daemon on
separate boxes, we would expect a bigger boost because the overhead of the
extra 'flush' request increases.

Besides performance, I think backward compatibility is more important:
  1 if we run an old host kernel (quite possible for a long-running
server) which doesn't support WCE, then we will never have a chance to
choose a writethrough cache for the guest OS with new QEMU (most users tend
to update user-space tools to get bug fixes)
  2 upper-layer software which relies on 'cache=xxx' to choose the
cache mode will find its assumption broken with new QEMU.

And my proposal (adding another field to BlockDriverState so drivers can
interpret the cache flags themselves) works well with the current block layer.

Sheepdog driver behavior:
   cache flags     WCE toggled and resulting behavior
   writethrough    writethrough
   writeback       writethrough (writeback + flush)

We can see that the writeback behavior under guest WCE toggling is the
same as expected. The difference is that if we set cache=writethrough, the
guest can't change it via WCE toggling.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 08:07 PM, Liu Yuan wrote:
> $ qemu-system-x86_64 --enable-kvm -drive
> file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024
> -drive file=sheepdog:test,if=virtio,cache=writeback

To emulate the old implementation, I modified the sheepdog block driver as follows:

diff --git a/block/sheepdog.c b/block/sheepdog.c
index e821746..321a426 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -1118,6 +1118,7 @@ static int sd_open(BlockDriverState *bs, const
char *filename, int flags)
 goto out;
 }

+if (0) {
 s->cache_enabled = true;
 s->flush_fd = connect_to_sdog(s->addr, s->port);
 if (s->flush_fd < 0) {
@@ -1125,6 +1126,7 @@ static int sd_open(BlockDriverState *bs, const
char *filename, int flags)
 ret = s->flush_fd;
 goto out;
 }
+}

 if (snapid || tag[0] != '\0') {
 dprintf("%" PRIx32 " snapshot inode was open.\n", vid);




Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 08:07 PM, Liu Yuan wrote:
> and my proposal (adding another field to BlockDriverState to allow
> self-interpretation cache flag) can work well with current block layer:
> 
> Sheepdog driver behavior:
>cache flagsWCE toggled and resulting behavior
>writethrough   writethrough
>writeback  writetrhough (writeback + flush)
> 
> We can see that the writeback behavior for Guest-WCE toggling is the
> same as expected. The difference is that if we set cache=writethrough,
> guest can't change it via WCE toggling.

I am wondering if the above scheme could be applied to the block layer, so that
'cache=xxx' acts as a stronger control over WCE toggling. Then we
wouldn't need the extra field.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 08:42 PM, Kevin Wolf wrote:
> Am 09.01.2013 13:07, schrieb Liu Yuan:
>> Besides performance, I think backward compatibility is more important:
>>   1 if we run a old kernel host (quite possible for a long running
>> server) which doesn't support WCE, then we will never have a chance to
>> choose writethrough cache for guest OS against new QEMU (most users tend
>> to update user space tools to exclude bugs)
> 
> How does the host kernel even play a role in this context?
> 

Oops, my reasoning was broken. Okay, a guest OS which doesn't support WCE can't
change the cache type, as we discussed.

>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>> cache mode will fail its assumption against new QEMU.
> 
> Which assumptions do you mean? As far as I can say the behaviour hasn't
> changed, except possibly for the performance.
> 

Users set 'cache=writethrough' to export only a writethrough cache
to the guest, but with new QEMU they will actually get a writeback cache by
default.

>> We can see that the writeback behavior for Guest-WCE toggling is the
>> same as expected. The difference is that if we set cache=writethrough,
>> guest can't change it via WCE toggling.
> 
> Then you need to make sure to communicate this to the guest. I'm not
> convinced that adding extra logic to each device for this is a good idea.
> 

We don't need to communicate with the guest. I think 'cache=xxx' means
what kind of cache the user *expects* to export to the guest OS. So if
cache=writethrough is set, the guest OS shouldn't be able to turn it into a
writeback cache magically. It is as if I bought a disk with a 'writethrough'
cache built in; I wouldn't expect it to turn into a disk with a writeback
cache under the hood, which could possibly lose data when a power outage
happens.

So guest WCE should only work when the user provides us a writeback-capable disk.

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-09 Thread Liu Yuan
On 01/09/2013 11:10 PM, Paolo Bonzini wrote:
> Il 09/01/2013 14:04, Liu Yuan ha scritto:
>>>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>>>> cache mode will fail its assumption against new QEMU.
>>>
>>> Which assumptions do you mean? As far as I can say the behaviour hasn't
>>> changed, except possibly for the performance.
>>
>> When users set 'cache=writethrough' to export only a writethrough cache
>> to Guest, but with new QEMU, it will actually get a writeback cache as
>> default.
> 
> They get a writeback cache implementation-wise, but they get a
> writethrough cache safety-wise.  How the cache is implemented doesn't
> matter, as long as it "looks like" a writethrough cache.
> 

> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
> images used to be opened with O_DSYNC and that splits each write into
> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
> flushes are created.  Old QEMU changes it in the kernel, new QEMU
> changes it in userspace.
> 
>> We don't need to communicate to the guest. I think 'cache=xxx' means
>> what kind of cache the users *expect* to export to Guest OS. So if
>> cache=writethrough set, Guest OS couldn't turn it to writeback cache
>> magically. This is like I bought a disk with 'writethrough' cache
>> built-in, I didn't expect that it turned to be a disk with writeback
>> cache under the hood which could possible lose data when power outage
>> happened.
> 
> It's not by magic.  It's by explicitly requesting the disk to do this.
> 
> Perhaps it's a bug that the cache mode is not reset when the machine is
> reset.  I haven't checked that, but it would be a valid complaint.
> 

Ah, I didn't get the current implementation right. I tried the 3.7 kernel
and it works as expected (cache=writethrough results in a 'writethrough'
cache in the guest).

It looks fine to me to emulate writethrough as writeback + flush, since
the performance drop isn't big, even though sheepdog itself supports a true
writethrough cache (no flush).

I am going to send v2 of the directio patch for the sheepdog driver.

Thanks,
Yuan



[Qemu-devel] [PATCH v2] sheepdog: implement direct write semantics

2013-01-10 Thread Liu Yuan
From: Liu Yuan 

Sheepdog supports both writeback and writethrough writes but does not yet support
DIRECTIO semantics, which bypass the cache completely even if the Sheepdog daemon
is set up with its cache enabled.

Assuming the cache is enabled on the Sheepdog daemon side, the new cache controls are

cache=writeback # enable writeback semantics for writes
cache=writethrough # enable emulated writethrough semantics for writes
cache=directsync # disable the cache completely

Guest WCE toggling at run time to switch between writeback and writethrough is
also supported.

Cc: MORITA Kazutaka 
Cc: Kevin Wolf 
Cc: Stefan Hajnoczi 
Signed-off-by: Liu Yuan 
---
 v2: adapt to the current emulated writethrough cache setting

 block/sheepdog.c |   70 +++---
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index e821746..c51507e 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -36,7 +36,8 @@
 
 #define SD_FLAG_CMD_WRITE0x01
 #define SD_FLAG_CMD_COW  0x02
-#define SD_FLAG_CMD_CACHE0x04
+#define SD_FLAG_CMD_CACHE0x04 /* Writeback mode for cache */
+#define SD_FLAG_CMD_DIRECT   0x08 /* Don't use cache */
 
 #define SD_RES_SUCCESS   0x00 /* Success */
 #define SD_RES_UNKNOWN   0x01 /* Unknown error */
@@ -293,7 +294,7 @@ typedef struct BDRVSheepdogState {
 
 char name[SD_MAX_VDI_LEN];
 bool is_snapshot;
-bool cache_enabled;
+uint32_t cache_flags;
 
 char *addr;
 char *port;
@@ -977,8 +978,8 @@ static int coroutine_fn add_aio_request(BDRVSheepdogState 
*s, AIOReq *aio_req,
 hdr.flags = SD_FLAG_CMD_WRITE | flags;
 }
 
-if (s->cache_enabled) {
-hdr.flags |= SD_FLAG_CMD_CACHE;
+if (s->cache_flags) {
+hdr.flags |= s->cache_flags;
 }
 
 hdr.oid = oid;
@@ -1023,7 +1024,7 @@ static int coroutine_fn add_aio_request(BDRVSheepdogState 
*s, AIOReq *aio_req,
 
 static int read_write_object(int fd, char *buf, uint64_t oid, int copies,
  unsigned int datalen, uint64_t offset,
- bool write, bool create, bool cache)
+ bool write, bool create, uint32_t cache_flags)
 {
 SheepdogObjReq hdr;
 SheepdogObjRsp *rsp = (SheepdogObjRsp *)&hdr;
@@ -1047,9 +1048,7 @@ static int read_write_object(int fd, char *buf, uint64_t 
oid, int copies,
 hdr.opcode = SD_OP_READ_OBJ;
 }
 
-if (cache) {
-hdr.flags |= SD_FLAG_CMD_CACHE;
-}
+hdr.flags |= cache_flags;
 
 hdr.oid = oid;
 hdr.data_length = datalen;
@@ -1072,18 +1071,19 @@ static int read_write_object(int fd, char *buf, 
uint64_t oid, int copies,
 }
 
 static int read_object(int fd, char *buf, uint64_t oid, int copies,
-   unsigned int datalen, uint64_t offset, bool cache)
+   unsigned int datalen, uint64_t offset,
+   uint32_t cache_flags)
 {
 return read_write_object(fd, buf, oid, copies, datalen, offset, false,
- false, cache);
+ false, cache_flags);
 }
 
 static int write_object(int fd, char *buf, uint64_t oid, int copies,
 unsigned int datalen, uint64_t offset, bool create,
-bool cache)
+uint32_t cache_flags)
 {
 return read_write_object(fd, buf, oid, copies, datalen, offset, true,
- create, cache);
+ create, cache_flags);
 }
 
 static int sd_open(BlockDriverState *bs, const char *filename, int flags)
@@ -1118,12 +1118,22 @@ static int sd_open(BlockDriverState *bs, const char 
*filename, int flags)
 goto out;
 }
 
-s->cache_enabled = true;
-s->flush_fd = connect_to_sdog(s->addr, s->port);
-if (s->flush_fd < 0) {
-error_report("failed to connect");
-ret = s->flush_fd;
-goto out;
+/*
+ * QEMU block layer emulates writethrough cache as 'writeback + flush', so
+ * we always set SD_FLAG_CMD_CACHE (writeback cache) as default.
+ */
+s->cache_flags = SD_FLAG_CMD_CACHE;
+if (flags & BDRV_O_NOCACHE) {
+s->cache_flags = SD_FLAG_CMD_DIRECT;
+}
+
+if (s->cache_flags == SD_FLAG_CMD_CACHE) {
+s->flush_fd = connect_to_sdog(s->addr, s->port);
+if (s->flush_fd < 0) {
+error_report("failed to connect");
+ret = s->flush_fd;
+goto out;
+}
 }
 
 if (snapid || tag[0] != '\0') {
@@ -1140,7 +1150,7 @@ static int sd_open(BlockDriverState *bs, const char 
*filename, int flags)
 
 buf = g_malloc(SD_INODE_SIZE);
 ret = read_object(fd, buf, vid_to_vdi_oid(vid), 0, SD_INODE_SIZE, 0,
-  s->cache_enabled);
+  s->cache_flags);
 
 closesocket(fd);
 
@@ -1387,

Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-11 Thread Liu Yuan
On 01/11/2013 03:52 PM, MORITA Kazutaka wrote:
> At Thu, 10 Jan 2013 13:38:16 +0800,
> Liu Yuan wrote:
>>
>> On 01/09/2013 11:10 PM, Paolo Bonzini wrote:
>>> Il 09/01/2013 14:04, Liu Yuan ha scritto:
>>>>>>   2 The upper layer software which relies on the 'cache=xxx' to choose
>>>>>> cache mode will fail its assumption against new QEMU.
>>>>>
>>>>> Which assumptions do you mean? As far as I can say the behaviour hasn't
>>>>> changed, except possibly for the performance.
>>>>
>>>> When users set 'cache=writethrough' to export only a writethrough cache
>>>> to Guest, but with new QEMU, it will actually get a writeback cache as
>>>> default.
>>>
>>> They get a writeback cache implementation-wise, but they get a
>>> writethrough cache safety-wise.  How the cache is implemented doesn't
>>> matter, as long as it "looks like" a writethrough cache.
>>>
>>
>>> In fact, consider a local disk that doesn't support FUA.  In old QEMU,
>>> images used to be opened with O_DSYNC and that splits each write into
>>> WRITE+FLUSH, just like new QEMU.  All that changes is _where_ the
>>> flushes are created.  Old QEMU changes it in the kernel, new QEMU
>>> changes it in userspace.
>>>
>>>> We don't need to communicate to the guest. I think 'cache=xxx' means
>>>> what kind of cache the users *expect* to export to Guest OS. So if
>>>> cache=writethrough set, Guest OS couldn't turn it to writeback cache
>>>> magically. This is like I bought a disk with 'writethrough' cache
>>>> built-in, I didn't expect that it turned to be a disk with writeback
>>>> cache under the hood which could possible lose data when power outage
>>>> happened.
>>>
>>> It's not by magic.  It's by explicitly requesting the disk to do this.
>>>
>>> Perhaps it's a bug that the cache mode is not reset when the machine is
>>> reset.  I haven't checked that, but it would be a valid complaint.
>>>
>>
>> Ah I didn't get the current implementation right. I tried the 3.7 kernel
>> and it works as expected (cache=writethrough result in a 'writethrough'
>> cache in the guest).
>>
>> It looks fine to me to emulate writethrough as writeback + flush, since
>> the profermance drop isn't big, though sheepdog itself support true
>> writethrough cache (no flush).
> 
> Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests
> when bdrv_enable_write_cache() is false?  Then the requests behave
> like FUA writes and we can safely omit succeeding flush requests.
> 

Let's first implement the emulated writethrough cache and then look for a
method to play tricks and implement a true writethrough cache.
That would bring complexity; for example, before we drop SD_FLAG_CMD_CACHE, we
need to flush beforehand. What's more, bdrv_enable_write_cache() is always
true for now; I guess this needs a generic block layer change to get the
writethrough/writeback hints.

Thanks,
Yuan
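
For reference, the FUA-style idea suggested above would amount to something
like the following fragment in the write path (a sketch only; it assumes a bs
pointer is reachable from the request, and as noted it ignores data already
sitting dirty in the sheep cache):

/* drop the writeback bit for this request when the guest has disabled
 * its write cache, so the sheep writes it through and no separate
 * flush is needed afterwards */
uint32_t flags = s->cache_flags;

if (flags == SD_FLAG_CMD_CACHE && !bdrv_enable_write_cache(bs)) {
    flags = 0;
}
hdr.flags |= flags;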




Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-11 Thread Liu Yuan
On 01/11/2013 05:00 PM, Paolo Bonzini wrote:
> That's not correct.  Try hdparm with an IDE disk, or sysfs with a SCSI
> disk (recent kernels will also support sysfs with an IDE or virtio-blk
> disk) and you'll see that it changes.

Okay, at least at startup, even with cache=writethrough set, I saw
bdrv_enable_write_cache() return true from sd_open(). So you mean this
will be reset later (after sd_open())?

Thanks,
Yuan



Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics

2013-01-11 Thread Liu Yuan
On 01/11/2013 05:34 PM, Paolo Bonzini wrote:
> It is always true for the protocol.  It is not always true for the format.
> 

So you mean bdrv_enable_write_cache() always returns true for Sheepdog at
all times?

Thanks,
Yuan


