Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread MORITA Kazutaka
At Mon, 24 May 2010 14:16:32 -0500,
Anthony Liguori wrote:
> 
> On 05/24/2010 06:56 AM, Avi Kivity wrote:
> > On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:
> >>
> >>> The server would be local and talk over a unix domain socket, perhaps
> >>> anonymous.
> >>>
> >>> nbd has other issues though, such as requiring a copy and no support
> >>> for metadata operations such as snapshot and file size extension.
> >>>
> >> Sorry, my explanation was unclear.  I'm not sure how running servers
> >> on localhost can solve the problem.
> >
> > The local server can convert from the local (nbd) protocol to the 
> > remote (sheepdog, ceph) protocol.
> >
> >> What I wanted to say was that we cannot specify the VM image.  With
> >> the nbd protocol, the command line arguments are as follows:
> >>
> >>   $ qemu nbd:hostname:port
> >>
> >> As this syntax shows, with the nbd protocol the client cannot pass
> >> the VM image name to the server.
> >
> > We would extend it to allow it to connect to a unix domain socket:
> >
> >   qemu nbd:unix:/path/to/socket
> 
> nbd is a no-go because it only supports a single, synchronous I/O 
> operation at a time and has no mechanism for extensibility.
> 
> If we go this route, I think two options are worth considering.  The 
> first would be a purely socket based approach where we just accepted the 
> extra copy.
> 
> The other potential approach would be shared memory based.  We export 
> all guest ram as shared memory along with a small bounce buffer pool.  
> We would then use a ring queue (potentially even using virtio-blk) and 
> an eventfd for notification.
> 
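
For reference, the eventfd notification part of such an approach might
look roughly like the sketch below; the shared-memory request ring
itself is hypothetical and left out, and none of this is from an actual
patch:

    /* Sketch: eventfd-based kick between qemu and a local server.
     * notify_fd would be created with eventfd(0, 0) and handed to the
     * peer process; the ring of request descriptors is not shown. */
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    static int notify_fd;

    static void ring_kick(void)
    {
        uint64_t one = 1;
        /* wake the peer after publishing a request descriptor */
        (void)write(notify_fd, &one, sizeof(one));
    }

    static void ring_wait(void)
    {
        uint64_t cnt;
        /* blocks until the peer writes to the eventfd */
        (void)read(notify_fd, &cnt, sizeof(cnt));
    }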

The shared memory approach assumes that there is a local server that
can talk to the storage system.  But Ceph doesn't require a local
server, and Sheepdog will be extended to support VMs running outside
the storage system.  We could run a local daemon that works only as a
proxy, but that doesn't look like a clean approach to me.  So I think
a socket-based approach is the right way to go.

BTW, is a common interface really required?  Sheepdog replicates data
differently from Ceph, so I don't think it is possible to define a
common protocol, as Christian says.

Regards,

Kazutaka

> > The server at the other end would associate the socket with a filename 
> > and forward it to the server using the remote protocol.
> >
> > However, I don't think nbd would be a good protocol.  My preference 
> > would be for a plugin API, or for a new local protocol that uses 
> > splice() to avoid copies.
> 
> I think a good shared memory implementation would be preferable to 
> plugins.  I think it's worth attempting to do a plugin interface for the 
> block layer but I strongly suspect it would not be sufficient.
> 
> I would not want to see plugins that interacted with BlockDriverState 
> directly, for instance.  We change it far too often.  Our main loop 
> functions are also not terribly stable so I'm not sure how we would 
> handle that (unless we forced all block plugins to be in a separate thread).
> 



[Qemu-devel] Re: [PATCH] add support for protocol driver create_options

2010-05-25 Thread MORITA Kazutaka
At Tue, 25 May 2010 15:43:17 +0200,
Kevin Wolf wrote:
> 
> On 24.05.2010 08:34, MORITA Kazutaka wrote:
> > At Fri, 21 May 2010 18:57:36 +0200,
> > Kevin Wolf wrote:
> >>
> >> On 20.05.2010 07:36, MORITA Kazutaka wrote:
> >>> +
> >>> +/*
> >>> + * Append an option list (list) to an option list (dest).
> >>> + *
> >>> + * If dest is NULL, a new copy of list is created.
> >>> + *
> >>> + * Returns a pointer to the first element of dest (or the newly 
> >>> allocated copy)
> >>> + */
> >>> +QEMUOptionParameter *append_option_parameters(QEMUOptionParameter *dest,
> >>> +    QEMUOptionParameter *list)
> >>> +{
> >>> +    size_t num_options, num_dest_options;
> >>> +
> >>> +    num_options = count_option_parameters(dest);
> >>> +    num_dest_options = num_options;
> >>> +
> >>> +    num_options += count_option_parameters(list);
> >>> +
> >>> +    dest = qemu_realloc(dest, (num_options + 1) * sizeof(QEMUOptionParameter));
> >>> +
> >>> +    while (list && list->name) {
> >>> +        if (get_option_parameter(dest, list->name) == NULL) {
> >>> +            dest[num_dest_options++] = *list;
> >>
> >> You need to add a dest[num_dest_options].name = NULL; here. Otherwise
> >> the next loop iteration works on uninitialized memory and possibly an
> >> unterminated list. I got a segfault for that reason.
> >>
> > 
> > I forgot to add it, sorry.
> > Fixed version is below.
> > 
> > Thanks,
> > 
> > Kazutaka
> > 
> > ==
> > This patch enables protocol drivers to use create options which are
> > not supported by the format.  For example, protocol drivers can use
> > a backing_file option with the raw format.
> > 
> > Signed-off-by: MORITA Kazutaka 
> 
> $ ./qemu-img create -f qcow2 -o cluster_size=4k /tmp/test.qcow2 4G
> Unknown option 'cluster_size'
> qemu-img: Invalid options for file format 'qcow2'.
> 
> I think you added another num_dest_options++ which shouldn't be there.
> 

Sorry again.  I wrongly added `dest[num_dest_options++].name = NULL;'
instead of `dest[num_dest_options].name = NULL;'.

Thanks,

Kazutaka

==
This patch enables protocol drivers to use create options which are
not supported by the format.  For example, protocol drivers can use
a backing_file option with the raw format.

Signed-off-by: MORITA Kazutaka 
---
 block.c   |7 +++
 block.h   |1 +
 qemu-img.c|   49 ++---
 qemu-option.c |   53 ++---
 qemu-option.h |2 ++
 5 files changed, 86 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index 6e7766a..f881f10 100644
--- a/block.c
+++ b/block.c
@@ -56,7 +56,6 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
                         uint8_t *buf, int nb_sectors);
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
                          const uint8_t *buf, int nb_sectors);
-static BlockDriver *find_protocol(const char *filename);
 
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
@@ -210,7 +209,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options)
 {
     BlockDriver *drv;
 
-    drv = find_protocol(filename);
+    drv = bdrv_find_protocol(filename);
     if (drv == NULL) {
         drv = bdrv_find_format("file");
     }
@@ -283,7 +282,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
 return drv;
 }
 
-static BlockDriver *find_protocol(const char *filename)
+BlockDriver *bdrv_find_protocol(const char *filename)
 {
 BlockDriver *drv1;
 char protocol[128];
@@ -478,7 +477,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags)
     BlockDriver *drv;
     int ret;
 
-    drv = find_protocol(filename);
+    drv = bdrv_find_protocol(filename);
     if (!drv) {
         return -ENOENT;
     }
diff --git a/block.h b/block.h
index 24efeb6..9034ebb 100644
--- a/block.h
+++ b/block.h
@@ -54,6 +54,7 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
+BlockDriver *bdrv_find_protocol(const char *filename);
 BlockDriver *bdrv_find_format(const char *format_name);
 BlockDriver *bdrv_find_whitelisted_format(const char *format_name);
 int bdrv_create(BlockDriver *drv, const char* filename,
diff --git a/qemu-img.c b/qemu-img.c
index cb007b7..ea091f0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -252,8 +

Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm

2010-05-25 Thread MORITA Kazutaka
At Tue, 25 May 2010 10:12:53 -0700 (PDT),
Sage Weil wrote:
> 
> On Tue, 25 May 2010, Avi Kivity wrote:
> > > What's the reason for not having these drivers upstream? Do we gain
> > > anything by hiding them from our users and requiring them to install the
> > > drivers separately from somewhere else?
> > >
> > 
> > Six months.
> 
> FWIW, we (Ceph) aren't complaining about the 6 month lag time (and I don't 
> think the Sheepdog guys are either).
> 
I agree.  We aren't complaining about it.

> From our perspective, the current BlockDriver abstraction is ideal, as it 
> represents the reality of qemu's interaction with storage.  Any 'external' 
> interface will be inferior to that in one way or another.  But either way, 
> we are perfectly willing to work with you all to keep in sync with any 
> future BlockDriver API improvements.  It is worth our time investment even 
> if the API is less stable.
> 
I agree.

> The ability to dynamically load a shared object using the existing api 
> would make development a bit easier, but I'm not convinced it's better
> for users.  I think having ceph and sheepdog upstream with qemu will serve 
> end users best, and we at least are willing to spend the time to help 
> maintain that code in qemu.git.
> 
I agree.

Regards,

Kazutaka



[Qemu-devel] [RFC PATCH v4 2/3] block: call the snapshot handlers of the protocol drivers

2010-05-27 Thread MORITA Kazutaka
When snapshot handlers are not defined in the format driver, it is
better to call those of the protocol driver.  This enables us to
implement snapshot support in the protocol driver.

We need to call the bdrv_close() and bdrv_open() handlers of the
format driver before and after the bdrv_snapshot_goto() call of the
protocol, because the contents of the block driver state may need to
change after loading the vmstate.

Signed-off-by: MORITA Kazutaka 
---
 block.c |   61 +++--
 1 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index da0dc47..cf80dbf 100644
--- a/block.c
+++ b/block.c
@@ -1697,9 +1697,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_save_vmstate)
-        return -ENOTSUP;
-    return drv->bdrv_save_vmstate(bs, buf, pos, size);
+    if (drv->bdrv_save_vmstate)
+        return drv->bdrv_save_vmstate(bs, buf, pos, size);
+    if (bs->file)
+        return bdrv_save_vmstate(bs->file, buf, pos, size);
+    return -ENOTSUP;
 }
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
@@ -1708,9 +1710,11 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_load_vmstate)
-        return -ENOTSUP;
-    return drv->bdrv_load_vmstate(bs, buf, pos, size);
+    if (drv->bdrv_load_vmstate)
+        return drv->bdrv_load_vmstate(bs, buf, pos, size);
+    if (bs->file)
+        return bdrv_load_vmstate(bs->file, buf, pos, size);
+    return -ENOTSUP;
 }
 
 void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
@@ -1734,20 +1738,37 @@ int bdrv_snapshot_create(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_create)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_create(bs, sn_info);
+    if (drv->bdrv_snapshot_create)
+        return drv->bdrv_snapshot_create(bs, sn_info);
+    if (bs->file)
+        return bdrv_snapshot_create(bs->file, sn_info);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_goto(BlockDriverState *bs,
                        const char *snapshot_id)
 {
     BlockDriver *drv = bs->drv;
+    int ret, open_ret;
+
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_goto)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_goto(bs, snapshot_id);
+    if (drv->bdrv_snapshot_goto)
+        return drv->bdrv_snapshot_goto(bs, snapshot_id);
+
+    if (bs->file) {
+        drv->bdrv_close(bs);
+        ret = bdrv_snapshot_goto(bs->file, snapshot_id);
+        open_ret = drv->bdrv_open(bs, bs->open_flags);
+        if (open_ret < 0) {
+            bdrv_delete(bs->file);
+            bs->drv = NULL;
+            return open_ret;
+        }
+        return ret;
+    }
+
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
@@ -1755,9 +1776,11 @@ int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_delete)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_delete(bs, snapshot_id);
+    if (drv->bdrv_snapshot_delete)
+        return drv->bdrv_snapshot_delete(bs, snapshot_id);
+    if (bs->file)
+        return bdrv_snapshot_delete(bs->file, snapshot_id);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_list(BlockDriverState *bs,
@@ -1766,9 +1789,11 @@ int bdrv_snapshot_list(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_list)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_list(bs, psn_info);
+    if (drv->bdrv_snapshot_list)
+        return drv->bdrv_snapshot_list(bs, psn_info);
+    if (bs->file)
+        return bdrv_snapshot_list(bs->file, psn_info);
+    return -ENOTSUP;
 }
 
 #define NB_SUFFIXES 4
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH v4 0/3] Sheepdog: distributed storage system for QEMU

2010-05-27 Thread MORITA Kazutaka
Hi all,

This patch adds a block driver for the Sheepdog distributed storage
system.  Please consider it for inclusion.

I applied the review comments to the 2nd patch (thanks Kevin!).  The
other patches are unchanged from the previous version.


Changes from v3 to v4 are:
 - fix error handling in bdrv_snapshot_goto.

Changes from v2 to v3 are:

 - add drv->bdrv_close() and drv->bdrv_open() before and after
   bdrv_snapshot_goto() call of the protocol.
 - address the review comments on the sheepdog driver code.

Changes from v1 to v2 are:

 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add new patch to call the snapshot handler of the protocol

Thanks,

Kazutaka


MORITA Kazutaka (3):
  close all the block drivers before the qemu process exits
  block: call the snapshot handlers of the protocol drivers
  block: add sheepdog driver for distributed storage support

 Makefile.objs|2 +-
 block.c  |   70 ++-
 block.h  |1 +
 block/sheepdog.c | 1835 ++
 vl.c |1 +
 5 files changed, 1890 insertions(+), 19 deletions(-)
 create mode 100644 block/sheepdog.c




[Qemu-devel] [RFC PATCH v4 1/3] close all the block drivers before the qemu process exits

2010-05-27 Thread MORITA Kazutaka
This patch calls the close handler of the block driver before the qemu
process exits.

This is necessary because the sheepdog block driver releases the lock
on VM images in its close handler.

Signed-off-by: MORITA Kazutaka 
---
 block.c |9 +
 block.h |1 +
 vl.c|1 +
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 24c63f6..da0dc47 100644
--- a/block.c
+++ b/block.c
@@ -646,6 +646,15 @@ void bdrv_close(BlockDriverState *bs)
 }
 }
 
+void bdrv_close_all(void)
+{
+    BlockDriverState *bs;
+
+    QTAILQ_FOREACH(bs, &bdrv_states, list) {
+        bdrv_close(bs);
+    }
+}
+
 void bdrv_delete(BlockDriverState *bs)
 {
 /* remove from list, if necessary */
diff --git a/block.h b/block.h
index 756670d..25744b1 100644
--- a/block.h
+++ b/block.h
@@ -123,6 +123,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 /* Ensure contents are flushed to disk.  */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
+void bdrv_close_all(void);
 
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
diff --git a/vl.c b/vl.c
index 7121cd0..8ffe36f 100644
--- a/vl.c
+++ b/vl.c
@@ -1992,6 +1992,7 @@ static void main_loop(void)
 vm_stop(r);
 }
 }
+bdrv_close_all();
 pause_all_vcpus();
 }
 
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-05-27 Thread MORITA Kazutaka
Sheepdog is a distributed storage system for QEMU.  It provides highly
available, block-level storage volumes to VMs, like Amazon EBS.  This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka 
---
 Makefile.objs|2 +-
 block/sheepdog.c | 1835 ++
 2 files changed, 1836 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index 1a942e5..527a754 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 0000000..68545e8
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1835 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include 
+#include 
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost:7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
+#define SD_OP_READ_OBJ   0x02
+#define SD_OP_WRITE_OBJ  0x03
+
+#define SD_OP_NEW_VDI        0x11
+#define SD_OP_LOCK_VDI       0x12
+#define SD_OP_RELEASE_VDI    0x13
+#define SD_OP_GET_VDI_INFO   0x14
+#define SD_OP_READ_VDIS      0x15
+
+#define SD_FLAG_CMD_WRITE    0x01
+#define SD_FLAG_CMD_COW      0x02
+
+#define SD_RES_SUCCESS       0x00 /* Success */
+#define SD_RES_UNKNOWN       0x01 /* Unknown error */
+#define SD_RES_NO_OBJ        0x02 /* No object found */
+#define SD_RES_EIO           0x03 /* I/O error */
+#define SD_RES_VDI_EXIST     0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
+#define SD_RES_VDI_LOCKED    0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI        0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
+#define SD_RES_VDI_READ      0x0A /* Cannot read requested vdi */
+#define SD_RES_VDI_WRITE     0x0B /* Cannot write requested vdi */
+#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
+#define SD_RES_BASE_VDI_WRITE   0x0D /* Cannot write base vdi */
+#define SD_RES_NO_TAG        0x0E /* Requested tag is not found */
+#define SD_RES_STARTUP       0x0F /* Sheepdog is on starting up */
+#define SD_RES_VDI_NOT_LOCKED   0x10 /* Vdi is not locked */
+#define SD_RES_SHUTDOWN      0x11 /* Sheepdog is shutting down */
+#define SD_RES_NO_MEM        0x12 /* Cannot allocate memory */
+#define SD_RES_FULL_VDI      0x13 /* we already have the maximum vdis */
+#define SD_RES_VER_MISMATCH  0x14 /* Protocol version mismatch */
+#define SD_RES_NO_SPACE      0x15 /* Server has no room for new objects */
+#define SD_RES_WAIT_FOR_FORMAT  0x16 /* Sheepdog is waiting for a format operation */
+#define SD_RES_WAIT_FOR_JOIN    0x17 /* Sheepdog is waiting for other nodes joining */
+#define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
+
+/*
+ * Object ID rules
+ *
+ *  0 - 19 (20 bits): data object space
+ * 20 - 31 (12 bits): reserved data object space
+ * 32 - 55 (24 bits): vdi object space
+ * 56 - 59 ( 4 bits): reserved vdi object space
+ * 60 - 63 ( 4 bits): object type identifier space
+ */
+
+#define VDI_SPACE_SHIFT   32
+#define VDI_BIT (UINT64_C(1) << 63)
+#define VMSTATE_BIT (UINT64_C(1) << 62)
+#define MAX_DATA_OBJS (1ULL << 20)
+#define MAX_CHILDREN 1024
+#define SD_MAX_VDI_LEN 256
+#define SD_NR_VDIS   (1U << 24)
+#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)
+
+#define SD_INODE_SIZE (sizeof(SheepdogInode))
+#define CURRENT_VDI_ID 0
+
+typedef struct SheepdogReq {
+   uint8_t proto_ver;
+   uint8_t opcode;
+   uint16_t flags;

[Qemu-devel] Re: [RFC PATCH v4 0/3] Sheepdog: distributed storage system for QEMU

2010-06-03 Thread MORITA Kazutaka
At Wed, 02 Jun 2010 12:49:02 +0200,
Kevin Wolf wrote:
> 
> On 28.05.2010 04:44, MORITA Kazutaka wrote:
> > Hi all,
> > 
> > This patch adds a block driver for Sheepdog distributed storage
> > system.  Please consider for inclusion.
> 
> Hint for next time: You should remove the RFC from the subject line if
> you think the patch is ready for inclusion. Otherwise I might miss this
> and think you only want comments on it.
> 

Thanks for the advice. I'll do so the next time.

> > MORITA Kazutaka (3):
> >   close all the block drivers before the qemu process exits
> >   block: call the snapshot handlers of the protocol drivers
> >   block: add sheepdog driver for distributed storage support
> 
> Thanks, I have applied the first two patches to the block branch, they
> look good to me. I'll send some comments for the third one (though it's
> only coding style until now).
> 

Thanks a lot.

Kazutaka



[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-03 Thread MORITA Kazutaka
At Tue, 01 Jun 2010 09:58:04 -0500,
Chris Krumme wrote:

Thanks for your comments!
> 
> On 05/27/2010 09:44 PM, MORITA Kazutaka wrote:
> > Sheepdog is a distributed storage system for QEMU. It provides highly

> > +
> > +static int connect_to_sdog(const char *addr)
> > +{
> > +   char buf[64];
> > +   char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];
> > +   char name[256], *p;
> > +   int fd, ret;
> > +   struct addrinfo hints, *res, *res0;
> > +   int port = 0;
> > +
> > +   if (!addr) {
> > +   addr = SD_DEFAULT_ADDR;
> > +   }
> > +
> > +   strcpy(name, addr);
> >
> 
> Can strlen(addr) be > sizeof(name)?
> 

Yes, we should check the length of addr; otherwise a long address
would overflow the buffer.
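
A sketch of such a check (illustrative only; the actual fix in v5 may
differ in details):

    /* Sketch: reject addresses that do not fit into name[] instead of
     * overflowing it. */
    if (strlen(addr) >= sizeof(name)) {
        error_report("server address too long, %s\n", addr);
        return -1;
    }
    strcpy(name, addr);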

> > +
> > +   p = name;
> > +   while (*p) {
> > +   if (*p == ':') {
> > +   *p++ = '\0';
> >
> 
> May also need to check for p > name + sizeof(name).
> 

The string is NUL-terminated, so the scan cannot run past the end of
name; the check is not required, I think.

> > +   break;
> > +   } else {
> > +   p++;
> > +   }
> > +   }
> > +
> > +   if (*p == '\0') {
> > +   error_report("cannot find a port number, %s\n", name);
> > +   return -1;
> > +   }
> > +   port = strtol(p, NULL, 10);
> >
> 
> Are negative numbers valid here?
> 

No. It is better to use strtoul.
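
For example (a sketch, not the final code; note that strtoul() silently
accepts a leading '-' and wraps around, so it is the range check below
that actually rejects negative input):

    char *endp;
    unsigned long val = strtoul(p, &endp, 10);

    if (endp == p || *endp != '\0' || val == 0 || val > 65535) {
        error_report("invalid port number, %s\n", p);
        return -1;
    }
    port = val;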


> > +
> > +static int parse_vdiname(BDRVSheepdogState *s, const char *filename,
> > +char *vdi, int vdi_len, uint32_t *snapid)
> > +{
> > +   char *p, *q;
> > +   int nr_sep;
> > +
> > +   p = q = strdup(filename);
> > +
> > +   if (!p) {
> >
> 
> I think Qemu has a version of strdup that will not return NULL.
> 

Yes. We can use qemu_strdup here.
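
I.e. (a sketch):

    /* qemu_strdup() aborts on allocation failure instead of returning
     * NULL, so the explicit failure branch can go away. */
    p = q = qemu_strdup(filename);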


> > +
> > +/* TODO: error cleanups */
> > +static int sd_open(BlockDriverState *bs, const char *filename, int flags)
> > +{
> > +   int ret, fd;
> > +   uint32_t vid = 0;
> > +   BDRVSheepdogState *s = bs->opaque;
> > +   char vdi[256];
> > +   uint32_t snapid;
> > +   int for_snapshot = 0;
> > +   char *buf;
> > +
> > +   strstart(filename, "sheepdog:", (const char **)&filename);
> > +
> > +   buf = qemu_malloc(SD_INODE_SIZE);
> > +
> > +   memset(vdi, 0, sizeof(vdi));
> > +   if (parse_vdiname(s, filename, vdi, sizeof(vdi), &snapid) < 0) {
> > +   goto out;
> > +   }
> > +   s->fd = get_sheep_fd(s);
> > +   if (s->fd < 0) {
> >
> 
> buf is not freed, goto out maybe.
> 

Yes, we should goto out here.
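
Something along these lines (a sketch of the cleanup path only):

    s->fd = get_sheep_fd(s);
    if (s->fd < 0) {
        ret = -1;
        goto out;           /* fall through to the cleanup below */
    }
    /* ... rest of sd_open() ... */
out:
    qemu_free(buf);
    return ret;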


> > +
> > +static int do_sd_create(const char *addr, char *filename, char *tag,
> > +   int64_t total_sectors, uint32_t base_vid,
> > +   uint32_t *vdi_id, int snapshot)
> > +{
> > +   SheepdogVdiReq hdr;
> > +   SheepdogVdiRsp *rsp = (SheepdogVdiRsp *)&hdr;
> > +   int fd, ret;
> > +   unsigned int wlen, rlen = 0;
> > +   char buf[SD_MAX_VDI_LEN];
> > +
> > +   fd = connect_to_sdog(addr);
> > +   if (fd < 0) {
> > +   return -1;
> > +   }
> > +
> > +   strncpy(buf, filename, SD_MAX_VDI_LEN);
> > +
> > +   memset(&hdr, 0, sizeof(hdr));
> > +   hdr.opcode = SD_OP_NEW_VDI;
> > +   hdr.base_vdi_id = base_vid;
> > +
> > +   wlen = SD_MAX_VDI_LEN;
> > +
> > +   hdr.flags = SD_FLAG_CMD_WRITE;
> > +   hdr.snapid = snapshot;
> > +
> > +   hdr.data_length = wlen;
> > +   hdr.vdi_size = total_sectors * 512;
> >
> 
> There is another patch on the list changing 512 to a define for sector size.
> 

OK. We'll define SECTOR_SIZE.
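
I.e. something like the following, which is what the later patches in
this thread end up doing:

    #define SECTOR_SIZE 512

    hdr.vdi_size = total_sectors * SECTOR_SIZE;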


> > +
> > +   ret = do_req(fd, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
> > +
> > +   close(fd);
> > +
> > +   if (ret) {
> > +   return -1;
> > +   }
> > +
> > +   if (rsp->result != SD_RES_SUCCESS) {
> > +   error_report("%s, %s\n", sd_strerror(rsp->result), filename);
> > +   return -1;
> > +   }
> > +
> > +   if (vdi_id) {
> > +   *vdi_id = rsp->vdi_id;
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> > +static int sd_create(const char *filename, QEMUOptionParameter *options)
> > +{
> > +   int ret;
> > +   uint32_t vid = 0;
> > +   int64_t total_sectors = 0;
> > +   char *backing_file = NULL;
> > +

[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-03 Thread MORITA Kazutaka
At Wed, 02 Jun 2010 15:55:42 +0200,
Kevin Wolf wrote:
> 
> On 28.05.2010 04:44, MORITA Kazutaka wrote:
> > Sheepdog is a distributed storage system for QEMU.  It provides highly
> > available, block-level storage volumes to VMs, like Amazon EBS.  This
> > patch adds a qemu block driver for Sheepdog.
> > 
> > Sheepdog features are:
> > - No node in the cluster is special (no metadata node, no control
> >   node, etc)
> > - Linear scalability in performance and capacity
> > - No single point of failure
> > - Autonomous management (zero configuration)
> > - Useful volume management support such as snapshot and cloning
> > - Thin provisioning
> > - Autonomous load balancing
> > 
> > More details are available at the project site:
> > http://www.osrg.net/sheepdog/
> > 
> > Signed-off-by: MORITA Kazutaka 
> > ---
> >  Makefile.objs|2 +-
> >  block/sheepdog.c | 1835 ++
> >  2 files changed, 1836 insertions(+), 1 deletions(-)
> >  create mode 100644 block/sheepdog.c
> 
> One general thing: The code uses some mix of spaces and tabs for
> indentation, with the greatest part using tabs. According to
> CODING_STYLE it should consistently use four spaces instead.
> 

OK.  I'll fix the indentation according to CODING_STYLE.


> > +
> > +typedef struct SheepdogInode {
> > +   char name[SD_MAX_VDI_LEN];
> > +   uint64_t ctime;
> > +   uint64_t snap_ctime;
> > +   uint64_t vm_clock_nsec;
> > +   uint64_t vdi_size;
> > +   uint64_t vm_state_size;
> > +   uint16_t copy_policy;
> > +   uint8_t  nr_copies;
> > +   uint8_t  block_size_shift;
> > +   uint32_t snap_id;
> > +   uint32_t vdi_id;
> > +   uint32_t parent_vdi_id;
> > +   uint32_t child_vdi_id[MAX_CHILDREN];
> > +   uint32_t data_vdi_id[MAX_DATA_OBJS];
> 
> Wow, this is a huge array. :-)
> 
> So Sheepdog has a fixed limit of 16 TB, right?
> 

MAX_DATA_OBJS is (1 << 20), and the size of an object is 4 MB.  So the
limit on the Sheepdog image size is 2^20 * 4 MB = 4 TB.

These values are hard-coded, and I guess they should be configurable.


> 
> > +} SheepdogInode;
> > +

> > +
> > +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
> > +{
> > +   SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
> > +
> > +   acb->canceled = 1;
> > +}
> 
> Does this provide the right semantics? You haven't really cancelled the
> request, but you pretend to. So you actually complete the request in the
> background and then throw the return code away.
> 
> I seem to remember that posix-aio-compat.c waits at this point for
> completion of the requests, calls the callbacks and only afterwards
> returns from aio_cancel when no more requests are in flight.
> 
> Or if you can really cancel requests, it would be the best option, of
> course.
> 

Sheepdog cannot cancel requests which have already been sent to the
servers.  So, as you say, we pretend to cancel the requests without
waiting for them to complete.  However, is there any situation where
pretending to cancel causes problems in practice?

To wait for completion of the requests here, we may need to create
another thread for processing I/O, like posix-aio-compat.c does.


> > +
> > +static int do_send_recv(int sockfd, struct iovec *iov, int len, int offset,
> > +   int write)
> 
> I've spent at least 15 minutes figuring out what this function does. I
> think I've got it now more or less, but I've come to the conclusion that
> this code needs more comments.
> 
> I'd suggest to add a header comment to all non-trivial functions and
> maybe somewhere on the top a general description of how things work.
> 
> As far as I understood now, there are basically two parts of request
> handling:
> 
> 1. The request is sent to the server. Its AIOCB is saved in a list in
> the BDRVSheepdogState. It doesn't pass a callback or anything for the
> completion.
> 
> 2. aio_read_response is registered as a fd handler to the sheepdog
> connection. When the server responds, it searches the right AIOCB in the
> list and the second part of request handling starts.
> 
> do_send_recv is the function that is used to do all communication with
> the server. The iov stuff looks like it's only used for some data, but
> seems this is not true - it's also used for the metadata of the protocol.
> 
> Did I understand it right so far?
> 

Yes, exactly.  I'll add comments to make the code more readable.
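
For instance, an overview comment along these lines (the wording is a
sketch; the final version may differ):

    /*
     * AIO request flow:
     *
     * 1. sd_aio_readv/writev queue the request as a QEMU Bottom Half.
     *
     * 2. sd_readv_writev_bh_cb, the BH callback, sends the request to
     *    the server and links it into outstanding_list in the
     *    BDRVSheepdogState; no completion callback is passed along.
     *
     * 3. aio_read_response, the fd handler on the sheepdog connection,
     *    looks up the matching request in the list when the response
     *    arrives and completes the AIOCB.
     */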


> > +{
> > +   struct msghdr msg;
> > +   int ret, diff;
> > +
> >

[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-06 Thread MORITA Kazutaka
At Fri, 04 Jun 2010 13:04:00 +0200,
Kevin Wolf wrote:
> 
> On 03.06.2010 18:23, MORITA Kazutaka wrote:
> >>> +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
> >>> +{
> >>> + SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
> >>> +
> >>> + acb->canceled = 1;
> >>> +}
> >>
> >> Does this provide the right semantics? You haven't really cancelled the
> >> request, but you pretend to. So you actually complete the request in the
> >> background and then throw the return code away.
> >>
> >> I seem to remember that posix-aio-compat.c waits at this point for
> >> completion of the requests, calls the callbacks and only afterwards
> >> returns from aio_cancel when no more requests are in flight.
> >>
> >> Or if you can really cancel requests, it would be the best option, of
> >> course.
> >>
> > 
> > Sheepdog cannot cancel requests which have already been sent to the
> > servers.  So, as you say, we pretend to cancel the requests without
> > waiting for them to complete.  However, is there any situation where
> > pretending to cancel causes problems in practice?
> 
> I'm not sure how often it would happen in practice, but if the guest OS
> thinks the old value is on disk when in fact the new one is, this could
> lead to corruption. I think if it can happen, even without evidence that
> it actually does, it's already relevant enough.
> 

I agree.

> > To wait for completion of the requests here, we may need to create
> > another thread for processing I/O, like posix-aio-compat.c does.
> 
> I don't think you need a thread to get the same behaviour, you just need
> to call the fd handlers like in the main loop. It would probably be the
> first driver doing this, though, and it's not an often used code path,
> so it might be a bad idea.
> 
> Maybe it's reasonable to just complete the request with -EIO? This way
> the guest couldn't make any assumption about the data written. On the
> other hand, it could be unhappy about failed requests, but that's
> probably better than corruption.
> 

Completing with -EIO looks good to me.  Thanks for the advice.
I'll send an updated patch tomorrow.
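
A sketch of that idea (assuming the response path then checks a
'canceled' flag before touching the guest buffer):

    static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
    {
        SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;

        /* Sheepdog cannot stop a request that is already on the wire,
         * so fail it towards the guest and ignore the real result. */
        acb->common.cb(acb->common.opaque, -EIO);
        acb->canceled = 1;
    }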

Regards,

Kazutaka



[Qemu-devel] [PATCH v5] block: add sheepdog driver for distributed storage support

2010-06-07 Thread MORITA Kazutaka
Sheepdog is a distributed storage system for QEMU.  It provides highly
available, block-level storage volumes to VMs, like Amazon EBS.  This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka 
---
Changes from v4 to v5 are:
 - address the review comments on the sheepdog driver (Thanks Kevin, Chris!)
 -- fix coding style
 -- fix aio_cancel handling
 -- fix an overflow bug in copying the hostname
 -- add comments to the non-trivial functions
 - remove already applied patches from the patchset

Changes from v3 to v4 are:
 - fix error handling in bdrv_snapshot_goto.

Changes from v2 to v3 are:

 - add drv->bdrv_close() and drv->bdrv_open() before and after
   bdrv_snapshot_goto() call of the protocol.
 - address the review comments on the sheepdog driver code.

Changes from v1 to v2 are:

 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add new patch to call the snapshot handler of the protocol

 Makefile.objs|2 +-
 block/sheepdog.c | 1905 ++
 2 files changed, 1906 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index 54dec26..070db8a 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 0000000..a9477a5
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1905 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include 
+#include 
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost"
+#define SD_DEFAULT_PORT "7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
+#define SD_OP_READ_OBJ   0x02
+#define SD_OP_WRITE_OBJ  0x03
+
+#define SD_OP_NEW_VDI        0x11
+#define SD_OP_LOCK_VDI       0x12
+#define SD_OP_RELEASE_VDI    0x13
+#define SD_OP_GET_VDI_INFO   0x14
+#define SD_OP_READ_VDIS      0x15
+
+#define SD_FLAG_CMD_WRITE    0x01
+#define SD_FLAG_CMD_COW      0x02
+
+#define SD_RES_SUCCESS       0x00 /* Success */
+#define SD_RES_UNKNOWN       0x01 /* Unknown error */
+#define SD_RES_NO_OBJ        0x02 /* No object found */
+#define SD_RES_EIO           0x03 /* I/O error */
+#define SD_RES_VDI_EXIST     0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
+#define SD_RES_VDI_LOCKED    0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI        0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
+#define SD_RES_VDI_READ      0x0A /* Cannot read requested vdi */
+#define SD_RES_VDI_WRITE     0x0B /* Cannot write requested vdi */
+#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
+#define SD_RES_BASE_VDI_WRITE   0x0D /* Cannot write base vdi */
+#define SD_RES_NO_TAG        0x0E /* Requested tag is not found */
+#define SD_RES_STARTUP       0x0F /* Sheepdog is on starting up */
+#define SD_RES_VDI_NOT_LOCKED   0x10 /* Vdi is not locked */
+#define SD_RES_SHUTDOWN      0x11 /* Sheepdog is shutting down */
+#define SD_RES_NO_MEM        0x12 /* Cannot allocate memory */
+#define SD_RES_FULL_VDI      0x13 /* we already have the maximum vdis */
+#define SD_RES_VER_MISMATCH  0x14 /* Protocol version mismatch */
+#define SD_RES_NO_SPACE      0x15 /* Server has no room for new objects */
+#define SD_RES_WAIT_FOR_FORMAT  0x16 /* Waiting for a format operation */
+#define SD_RES_WAIT_FOR_JOIN    0x17 /* Waiting for other nodes joining */
+#define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
+
+/*
+ * Object ID rule

Re: [Qemu-devel] [PATCH v4] savevm: Really verify if a drive supports snapshots

2010-06-07 Thread MORITA Kazutaka
At Fri,  4 Jun 2010 16:35:59 -0300,
Miguel Di Ciurcio Filho wrote:
> 
> Both bdrv_can_snapshot() and bdrv_has_snapshot() do not work as advertised.
> 
> First issue: Their names imply different purposes, but they do the same
> thing and have exactly the same code. Maybe copied and pasted and forgotten?
> bdrv_has_snapshot() is called in various places for actually checking
> whether there are snapshots or not.
> 
> Second issue: the way bdrv_can_snapshot() verifies whether a block driver
> supports snapshots or not does not catch all cases. E.g.: a raw image.
> 
> So when do_savevm() is called, first thing it does is to set a global
> BlockDriverState to save the VM memory state calling get_bs_snapshots().
> 
> static BlockDriverState *get_bs_snapshots(void)
> {
>     BlockDriverState *bs;
>     DriveInfo *dinfo;
> 
>     if (bs_snapshots)
>         return bs_snapshots;
>     QTAILQ_FOREACH(dinfo, &drives, next) {
>         bs = dinfo->bdrv;
>         if (bdrv_can_snapshot(bs))
>             goto ok;
>     }
>     return NULL;
>  ok:
>     bs_snapshots = bs;
>     return bs;
> }
> 
> bdrv_can_snapshot() may return a BlockDriverState that does not support
> snapshots and do_savevm() goes on.
> 
> Later on in do_savevm(), we find:
> 
> QTAILQ_FOREACH(dinfo, &drives, next) {
>     bs1 = dinfo->bdrv;
>     if (bdrv_has_snapshot(bs1)) {
>         /* Write VM state size only to the image that contains the state */
>         sn->vm_state_size = (bs == bs1 ? vm_state_size : 0);
>         ret = bdrv_snapshot_create(bs1, sn);
>         if (ret < 0) {
>             monitor_printf(mon, "Error while creating snapshot on '%s'\n",
>                            bdrv_get_device_name(bs1));
>         }
>     }
> }
> 
> bdrv_has_snapshot(bs1) is not checking whether the device supports or has
> snapshots, as explained above. Only in bdrv_snapshot_create() is the device
> actually checked for snapshot support.
> 
> So, in cases where the first device supports snapshots and the second does
> not, the snapshot on the first will happen anyway. I believe this is not
> good behavior. It should be an all-or-nothing process.
> 
> This patch addresses these issues by making bdrv_can_snapshot() actually do
> what it must do, and enforces better tests to avoid errors in the middle of
> do_savevm(). bdrv_has_snapshot() is removed and replaced by
> bdrv_can_snapshot() where appropriate.
> 
> bdrv_can_snapshot() was moved from savevm.c to block.c. It makes more
> sense to me.
> 
> The loadvm_state() function was updated too, to enforce that when loading
> a VM at least all writable devices must support snapshots.
> 
> Signed-off-by: Miguel Di Ciurcio Filho 
> ---
>  block.c  |   11 +++
>  block.h  |1 +
>  savevm.c |   58 --
>  3 files changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/block.c b/block.c
> index cd70730..ace3cdb 100644
> --- a/block.c
> +++ b/block.c
> @@ -1720,6 +1720,17 @@ void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
>  /**/
>  /* handling of snapshots */
>  
> +int bdrv_can_snapshot(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +    if (!drv || !drv->bdrv_snapshot_create || bdrv_is_removable(bs) ||
> +        bdrv_is_read_only(bs)) {
> +        return 0;
> +    }
> +
> +    return 1;
> +}
> +

The underlying protocol could support snapshots, so I think we should
check against bs->file too.

--- a/block.c
+++ b/block.c
@@ -1671,6 +1671,9 @@ int bdrv_can_snapshot(BlockDriverState *bs)
     BlockDriver *drv = bs->drv;
     if (!drv || !drv->bdrv_snapshot_create || bdrv_is_removable(bs) ||
         bdrv_is_read_only(bs)) {
+        if (bs->file) {
+            return bdrv_can_snapshot(bs->file);
+        }
         return 0;
     }
 
Regards,

Kazutaka



[Qemu-devel] Re: [PATCH v5] block: add sheepdog driver for distributed storage support

2010-06-15 Thread MORITA Kazutaka
At Tue, 15 Jun 2010 10:24:14 +0200,
Kevin Wolf wrote:
> 
> On 14.06.2010 21:48, MORITA Kazutaka wrote:
> >> 3) qemu-io aio_read/write doesn't seem to work well with it. I only get
> >> the result of the AIO request when I exit qemu-io. This may be a qemu-io
> >> problem or a Sheepdog one. We need to look into this, qemu-io is
> >> important for testing and debugging (particularly for qemu-iotests)
> >>
> > Sheepdog receives responses from the server in the fd handler on the
> > socket connection.  But, while qemu-io executes aio_read/aio_write, it
> > doesn't call qemu_aio_wait(), and the fd handler isn't invoked at all.
> > This seems to be the reason for the problem.
> > 
> > I'm not sure whether this is a qemu-io problem or a Sheepdog one.  If
> > it is a qemu-io problem, we need to call qemu_aio_wait() somewhere in
> > command_loop(), I guess.  If it is a Sheepdog problem, we need to
> > consider another mechanism to receive responses...
> 
> Not sure either.
> 
> I think posix-aio-compat needs fd handlers to be called, too, and it
> kind of works. I'm saying "kind of" because after an aio_read/write
> command qemu-io exits (it doesn't with Sheepdog). And when exiting there
> is a qemu_aio_wait(), so this explains why you get a result there.
> 
> I guess it's a bug in the posix-aio-compat case rather than with Sheepdog.
> 
It seems that fgets() is interrupted by a signal in fetchline() and
qemu-io exits.

BTW, I think we should call the fd handlers when user input is idle
and the fds become ready.  I'll send the patch later.

> The good news is that if qemu-iotests works with only one aio_read/write
> command before qemu-io exits, it's going to work with Sheepdog, too.
> 
Great!


Thanks,

Kazutaka



[Qemu-devel] [PATCH 1/2] qemu-io: retry fgets() when errno is EINTR

2010-06-15 Thread MORITA Kazutaka
posix-aio-compat sends a signal in aio operations, so we should
consider that fgets() could be interrupted here.

Signed-off-by: MORITA Kazutaka 
---
 cmd.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/cmd.c b/cmd.c
index 2336334..460df92 100644
--- a/cmd.c
+++ b/cmd.c
@@ -272,7 +272,10 @@ fetchline(void)
return NULL;
printf("%s", get_prompt());
fflush(stdout);
+again:
if (!fgets(line, MAXREADLINESZ, stdin)) {
+   if (errno == EINTR)
+   goto again;
free(line);
return NULL;
}
-- 
1.5.6.5




[Qemu-devel] [PATCH 2/2] qemu-io: check registered fds in command_loop()

2010-06-15 Thread MORITA Kazutaka
Some block drivers use an aio handler and run their I/O completion
routines in it.  However, the handler is not invoked if we only do
aio_read/write, because the registered fds are not checked at all.

This patch registers a command processing function as an fd handler
for standard input, and calls qemu_aio_wait() in command_loop().  Any
other handlers can then be invoked while user input is idle.

Signed-off-by: MORITA Kazutaka 
---
 cmd.c |   53 +++--
 1 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/cmd.c b/cmd.c
index 460df92..2b66e24 100644
--- a/cmd.c
+++ b/cmd.c
@@ -24,6 +24,7 @@
 #include 
 
 #include "cmd.h"
+#include "qemu-aio.h"
 
 #define _(x)   x   /* not gettext support yet */
 
@@ -149,6 +150,37 @@ add_args_command(
args_func = af;
 }
 
+static char *get_prompt(void);
+
+static void do_command(void *opaque)
+{
+   int c;
+   int *done = opaque;
+   char*input;
+   char**v;
+   const cmdinfo_t *ct;
+
+   if ((input = fetchline()) == NULL) {
+   *done = 1;
+   return;
+   }
+   v = breakline(input, &c);
+   if (c) {
+   ct = find_command(v[0]);
+   if (ct)
+   *done = command(ct, c, v);
+   else
+   fprintf(stderr, _("command \"%s\" not found\n"),
+   v[0]);
+   }
+   doneline(input, v);
+
+   if (*done == 0) {
+   printf("%s", get_prompt());
+   fflush(stdout);
+   }
+}
+
 void
 command_loop(void)
 {
@@ -186,20 +218,15 @@ command_loop(void)
free(cmdline);
return;
}
+
+   printf("%s", get_prompt());
+   fflush(stdout);
+
+   qemu_aio_set_fd_handler(STDIN_FILENO, do_command, NULL, NULL, NULL, &done);
while (!done) {
-   if ((input = fetchline()) == NULL)
-   break;
-   v = breakline(input, &c);
-   if (c) {
-   ct = find_command(v[0]);
-   if (ct)
-   done = command(ct, c, v);
-   else
-   fprintf(stderr, _("command \"%s\" not found\n"),
-   v[0]);
-   }
-   doneline(input, v);
+   qemu_aio_wait();
}
+   qemu_aio_set_fd_handler(STDIN_FILENO, NULL, NULL, NULL, NULL, NULL);
 }
 
 /* from libxcmd/input.c */
@@ -270,8 +297,6 @@ fetchline(void)
 
if (!line)
return NULL;
-   printf("%s", get_prompt());
-   fflush(stdout);
 again:
if (!fgets(line, MAXREADLINESZ, stdin)) {
if (errno == EINTR)
-- 
1.5.6.5




[Qemu-devel] [PATCH 0/2] qemu-io: fix aio_read/write problems

2010-06-15 Thread MORITA Kazutaka
Hi,

This patchset fixes the following qemu-io problems:

 - Qemu-io exits suddenly when we do aio_read/write to drivers which
   use posix-aio-compat.

 - We cannot get the results of aio_read/write if we don't do other
   operations.  This problem occurs when the block driver uses a fd
   handler to get I/O completion.

Thanks,

Kazutaka


MORITA Kazutaka (2):
  qemu-io: retry fgets() when errno is EINTR
  qemu-io: check registered fds in command_loop()

 cmd.c |   56 ++--
 1 files changed, 42 insertions(+), 14 deletions(-)




Re: [Qemu-devel] Re: [PATCH 1/2] qemu-io: retry fgets() when errno is EINTR

2010-06-16 Thread MORITA Kazutaka
At Wed, 16 Jun 2010 13:04:47 +0200,
Kevin Wolf wrote:
> 
> On 15.06.2010 19:53, MORITA Kazutaka wrote:
> > posix-aio-compat sends a signal in aio operations, so we should
> > consider that fgets() could be interrupted here.
> > 
> > Signed-off-by: MORITA Kazutaka 
> > ---
> >  cmd.c |3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/cmd.c b/cmd.c
> > index 2336334..460df92 100644
> > --- a/cmd.c
> > +++ b/cmd.c
> > @@ -272,7 +272,10 @@ fetchline(void)
> > return NULL;
> > printf("%s", get_prompt());
> > fflush(stdout);
> > +again:
> > if (!fgets(line, MAXREADLINESZ, stdin)) {
> > +   if (errno == EINTR)
> > +   goto again;
> > free(line);
> > return NULL;
> > }
> 
> This looks like a loop replaced by goto (and braces are missing). What
> about this instead?
> 
> do {
> ret = fgets(...)
> } while (ret == NULL && errno == EINTR)
> 
> if (ret == NULL) {
>fail
> }
> 

I agree.

However, it seems that my second patch has already solved the
problem.  We register the readline routine as an aio handler now, so
fgets() does not block and cannot return with EINTR.

This patch is no longer needed, sorry.

Thanks,

Kazutaka



Re: [Qemu-devel] Re: [PATCH 1/2] qemu-io: retry fgets() when errno is EINTR

2010-06-17 Thread MORITA Kazutaka
At Thu, 17 Jun 2010 18:18:18 +0100,
Jamie Lokier wrote:
> 
> Kevin Wolf wrote:
> > On 16.06.2010 18:52, MORITA Kazutaka wrote:
> > > At Wed, 16 Jun 2010 13:04:47 +0200,
> > > Kevin Wolf wrote:
> > >>
> > >> On 15.06.2010 19:53, MORITA Kazutaka wrote:
> > >>> posix-aio-compat sends a signal in aio operations, so we should
> > >>> consider that fgets() could be interrupted here.
> > >>>
> > >>> Signed-off-by: MORITA Kazutaka 
> > >>> ---
> > >>>  cmd.c |3 +++
> > >>>  1 files changed, 3 insertions(+), 0 deletions(-)
> > >>>
> > >>> diff --git a/cmd.c b/cmd.c
> > >>> index 2336334..460df92 100644
> > >>> --- a/cmd.c
> > >>> +++ b/cmd.c
> > >>> @@ -272,7 +272,10 @@ fetchline(void)
> > >>> return NULL;
> > >>> printf("%s", get_prompt());
> > >>> fflush(stdout);
> > >>> +again:
> > >>> if (!fgets(line, MAXREADLINESZ, stdin)) {
> > >>> +   if (errno == EINTR)
> > >>> +   goto again;
> > >>> free(line);
> > >>> return NULL;
> > >>> }
> > >>
> > >> This looks like a loop replaced by goto (and braces are missing). What
> > >> about this instead?
> > >>
> > >> do {
> > >> ret = fgets(...)
> > >> } while (ret == NULL && errno == EINTR)
> > >>
> > >> if (ret == NULL) {
> > >>fail
> > >> }
> > >>
> > > 
> > > I agree.
> > > 
> > > However, it seems that my second patch has already solved the
> > > problem.  We register the readline routine as an aio handler now, so
> > > fgets() does not block and cannot return with EINTR.
> > > 
> > > This patch is no longer needed, sorry.
> > 
> > Good point. Thanks for having a look.
> 
> Anyway, are you sure stdio functions can be interrupted with EINTR?
> Linus reminds us that some stdio functions have to retry internally
> anyway:
> 
> http://comments.gmane.org/gmane.comp.version-control.git/18285
> 

I think it is a separate problem whether fgets() retries internally
when a read system call is interrupted.  We should handle EINTR if the
system call can set it.  I think read() doesn't return EINTR if it
doesn't block on Linux, but that may not be true on other operating
systems.

I'm sending the fixed patch.  I'm not sure it is really needed, but it
doesn't hurt anyway.

=
posix-aio-compat sends a signal in aio operations, so we should
consider that fgets() could be interrupted here.

Signed-off-by: MORITA Kazutaka 
---
 cmd.c |   14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/cmd.c b/cmd.c
index aee2a38..733bacd 100644
--- a/cmd.c
+++ b/cmd.c
@@ -293,14 +293,18 @@ fetchline(void)
 char *
 fetchline(void)
 {
-   char*p, *line = malloc(MAXREADLINESZ);
+   char*p, *line = malloc(MAXREADLINESZ), *ret;
 
if (!line)
return NULL;
-   if (!fgets(line, MAXREADLINESZ, stdin)) {
-   free(line);
-   return NULL;
-   }
+do {
+ret = fgets(line, MAXREADLINESZ, stdin);
+} while (ret == NULL && errno == EINTR);
+
+if (ret == NULL) {
+free(line);
+return NULL;
+}
p = line + strlen(line);
if (p != line && p[-1] == '\n')
p[-1] = '\0';
-- 
1.5.6.5




Re: [Qemu-devel] [PATCH] get rid of private bitmap functions in block/sheepdog.c, use generic ones

2011-03-14 Thread MORITA Kazutaka
On Thu, Mar 10, 2011 at 11:03 PM, Michael Tokarev  wrote:
> qemu now has generic bitmap functions,
> so don't redefine them in sheepdog.c,
> use common header instead.  A small cleanup.
>
> Here's only one function which is actually
> used in sheepdog and gets replaced with
> a generic one (simplified):
>
> - static inline int test_bit(int nr, const volatile unsigned long *addr)
> + static inline int test_bit(int nr, const unsigned long *addr)
>  {
> -  return ((1UL << (nr % BITS_PER_LONG))
>            & (((unsigned long*)addr)[nr / BITS_PER_LONG])) != 0;
> +  return 1UL & (addr[nr / BITS_PER_LONG] >> (nr & (BITS_PER_LONG-1)));
>  }
>
> The body is equivalent, but the argument is not: there's
> "volatile" in there.  Why it is used for - I'm not sure.
>
> Signed-off-by: Michael Tokarev 

Looks good.  Thanks!

Acked-by: MORITA Kazutaka 



[Qemu-devel] [PATCH 0/3] sheepdog: fix aio related issues

2011-03-29 Thread MORITA Kazutaka
This patchset fixes the Sheepdog AIO problems pointed out in:
  http://lists.gnu.org/archive/html/qemu-devel/2011-02/msg02495.html
  http://lists.gnu.org/archive/html/qemu-devel/2011-02/msg02474.html

Thanks,

Kazutaka


MORITA Kazutaka (3):
  sheepdog: make send/recv operations non-blocking
  sheepdog: allow cancellation of I/Os which are not processed yet
  sheepdog: avoid accessing a buffer of the canceled I/O request

 block/sheepdog.c |  462 +++---
 1 files changed, 334 insertions(+), 128 deletions(-)




[Qemu-devel] [PATCH 3/3] sheepdog: avoid accessing a buffer of the canceled I/O request

2011-03-29 Thread MORITA Kazutaka
We cannot access the buffer of a canceled I/O request, because its
AIOCB callback has already been called and the buffer is no longer valid.

Signed-off-by: MORITA Kazutaka 
---
 block/sheepdog.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index ed98701..6f60721 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -79,6 +79,7 @@
 #define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)
 #define SD_MAX_VDI_SIZE (SD_DATA_OBJ_SIZE * MAX_DATA_OBJS)
 #define SECTOR_SIZE 512
+#define BUF_SIZE 4096
 
 #define SD_INODE_SIZE (sizeof(SheepdogInode))
 #define CURRENT_VDI_ID 0
@@ -900,8 +901,15 @@ static void aio_read_response(void *opaque)
 }
 conn_state = C_IO_DATA;
 case C_IO_DATA:
-ret = do_readv(fd, acb->qiov->iov, aio_req->data_len - done,
-   aio_req->iov_offset + done);
+if (acb->canceled) {
+char tmp_buf[BUF_SIZE];
+int len = MIN(aio_req->data_len - done, sizeof(tmp_buf));
+
+ret = do_read(fd, tmp_buf, len, 0);
+} else {
+ret = do_readv(fd, acb->qiov->iov, aio_req->data_len - done,
+   aio_req->iov_offset + done);
+}
 if (ret < 0) {
 error_report("failed to get the data, %s\n", strerror(errno));
 conn_state = C_IO_CLOSED;
-- 
1.5.6.5




[Qemu-devel] [PATCH 2/3] sheepdog: allow cancellation of I/Os which are not processed yet

2011-03-29 Thread MORITA Kazutaka
We can cancel I/O requests safely if they have not been sent to the servers yet.

Signed-off-by: MORITA Kazutaka 
---
 block/sheepdog.c |   37 +
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index cedf806..ed98701 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -421,6 +421,43 @@ static void sd_finish_aiocb(SheepdogAIOCB *acb)
 static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
 SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
+BDRVSheepdogState *s = blockacb->bs->opaque;
+AIOReq *areq, *next, *oldest_send_req = NULL;
+
+if (acb->bh) {
+/*
+ * sd_readv_writev_bh_cb() is not called yet, so we can
+ * release this safely
+ */
+qemu_bh_delete(acb->bh);
+acb->bh = NULL;
+qemu_aio_release(acb);
+return;
+}
+
+QLIST_FOREACH(areq, &s->outstanding_aio_head, outstanding_aio_siblings) {
+if (areq->state == AIO_SEND_OBJREQ) {
+oldest_send_req = areq;
+}
+}
+
+QLIST_FOREACH_SAFE(areq, &s->outstanding_aio_head,
+   outstanding_aio_siblings, next) {
+if (areq->state == AIO_RECV_OBJREQ) {
+continue;
+}
+if (areq->state == AIO_SEND_OBJREQ && areq == oldest_send_req) {
+/* the oldest AIO_SEND_OBJREQ request could be being sent */
+continue;
+}
+free_aio_req(s, areq);
+}
+
+if (QLIST_EMPTY(&acb->aioreq_head)) {
+/* there is no outstanding request */
+qemu_aio_release(acb);
+return;
+}
 
 /*
  * Sheepdog cannot cancel the requests which are already sent to
-- 
1.5.6.5




[Qemu-devel] [PATCH 1/3] sheepdog: make send/recv operations non-blocking

2011-03-29 Thread MORITA Kazutaka
This patch avoids retrying send/recv in the AIO path when the sheepdog
connection is not ready for the operation.

Signed-off-by: MORITA Kazutaka 
---
 block/sheepdog.c |  417 +-
 1 files changed, 289 insertions(+), 128 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index a54e0de..cedf806 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -242,6 +242,19 @@ static inline int is_snapshot(struct SheepdogInode *inode)
 
 typedef struct SheepdogAIOCB SheepdogAIOCB;
 
+enum ConnectionState {
+C_IO_HEADER,
+C_IO_DATA,
+C_IO_END,
+C_IO_CLOSED,
+};
+
+enum AIOReqState {
+    AIO_PENDING,     /* not ready for sending this request */
+    AIO_SEND_OBJREQ, /* send this request */
+    AIO_RECV_OBJREQ, /* receive a result of this request */
+};
+
 typedef struct AIOReq {
 SheepdogAIOCB *aiocb;
 unsigned int iov_offset;
@@ -253,6 +266,9 @@ typedef struct AIOReq {
 uint8_t flags;
 uint32_t id;
 
+enum AIOReqState state;
+struct SheepdogObjReq hdr;
+
 QLIST_ENTRY(AIOReq) outstanding_aio_siblings;
 QLIST_ENTRY(AIOReq) aioreq_siblings;
 } AIOReq;
@@ -348,12 +364,14 @@ static const char * sd_strerror(int err)
  * 1. In the sd_aio_readv/writev, read/write requests are added to the
  *QEMU Bottom Halves.
  *
- * 2. In sd_readv_writev_bh_cb, the callbacks of BHs, we send the I/O
- *requests to the server and link the requests to the
- *outstanding_list in the BDRVSheepdogState.  we exits the
- *function without waiting for receiving the response.
+ * 2. In sd_readv_writev_bh_cb, the callbacks of BHs, we set up the
+ *I/O requests to the server and link the requests to the
+ *outstanding_list in the BDRVSheepdogState.
+ *
+ * 3. We send the request in aio_send_request, the fd handler to the
+ *sheepdog connection.
  *
- * 3. We receive the response in aio_read_response, the fd handler to
+ * 4. We receive the response in aio_read_response, the fd handler to
  *the sheepdog connection.  If metadata update is needed, we send
  *the write request to the vdi object in sd_write_done, the write
  *completion function.  The AIOCB callback is not called until all
@@ -377,8 +395,6 @@ static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, SheepdogAIOCB *acb,
 aio_req->flags = flags;
 aio_req->id = s->aioreq_seq_num++;
 
-QLIST_INSERT_HEAD(&s->outstanding_aio_head, aio_req,
-  outstanding_aio_siblings);
 QLIST_INSERT_HEAD(&acb->aioreq_head, aio_req, aioreq_siblings);
 
 return aio_req;
@@ -640,20 +656,17 @@ static int do_readv_writev(int sockfd, struct iovec *iov, int len,
 again:
 ret = do_send_recv(sockfd, iov, len, iov_offset, write);
 if (ret < 0) {
-if (errno == EINTR || errno == EAGAIN) {
+if (errno == EINTR) {
 goto again;
 }
+if (errno == EAGAIN) {
+return 0;
+}
 error_report("failed to recv a rsp, %s\n", strerror(errno));
-return 1;
-}
-
-iov_offset += ret;
-len -= ret;
-if (len) {
-goto again;
+return -errno;
 }
 
-return 0;
+return ret;
 }
 
 static int do_readv(int sockfd, struct iovec *iov, int len, int iov_offset)
@@ -666,30 +679,30 @@ static int do_writev(int sockfd, struct iovec *iov, int len, int iov_offset)
 return do_readv_writev(sockfd, iov, len, iov_offset, 1);
 }
 
-static int do_read_write(int sockfd, void *buf, int len, int write)
+static int do_read_write(int sockfd, void *buf, int len, int skip, int write)
 {
 struct iovec iov;
 
 iov.iov_base = buf;
-iov.iov_len = len;
+iov.iov_len = len + skip;
 
-return do_readv_writev(sockfd, &iov, len, 0, write);
+return do_readv_writev(sockfd, &iov, len, skip, write);
 }
 
-static int do_read(int sockfd, void *buf, int len)
+static int do_read(int sockfd, void *buf, int len, int skip)
 {
-return do_read_write(sockfd, buf, len, 0);
+return do_read_write(sockfd, buf, len, skip, 0);
 }
 
-static int do_write(int sockfd, void *buf, int len)
+static int do_write(int sockfd, void *buf, int len, int skip)
 {
-return do_read_write(sockfd, buf, len, 1);
+return do_read_write(sockfd, buf, len, skip, 1);
 }
 
 static int send_req(int sockfd, SheepdogReq *hdr, void *data,
 unsigned int *wlen)
 {
-int ret;
+int ret, done = 0;
 struct iovec iov[2];
 
 iov[0].iov_base = hdr;
@@ -700,19 +713,23 @@ static int send_req(int sockfd, SheepdogReq *hdr, void *data,
 iov[1].iov_len = *wlen;
 }
 
-ret = do_writev(sockfd, iov, sizeof(*hdr) + *wlen, 0);
-if (ret) {
-error_report("failed to send a req, %s\n", strerror(errno));
-ret = -1;
+while (done < sizeof(*hdr) + *wlen) {
+ret = do_writev(sockfd, iov, sizeof(*hdr) + *wlen - done, done);
+if (ret <

[Qemu-devel] [PATCH] sheepdog: support creating images on remote hosts

2011-01-27 Thread MORITA Kazutaka
This patch parses the input filename in sd_create(), and enables us
to specify a target server on which to create sheepdog images.

Signed-off-by: MORITA Kazutaka 
---
 block/sheepdog.c |   17 ++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index e62820a..a54e0de 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -1294,12 +1294,23 @@ static int do_sd_create(char *filename, int64_t vdi_size,
 static int sd_create(const char *filename, QEMUOptionParameter *options)
 {
 int ret;
-uint32_t vid = 0;
+uint32_t vid = 0, base_vid = 0;
 int64_t vdi_size = 0;
 char *backing_file = NULL;
+BDRVSheepdogState s;
+char vdi[SD_MAX_VDI_LEN], tag[SD_MAX_VDI_TAG_LEN];
+uint32_t snapid;
 
 strstart(filename, "sheepdog:", (const char **)&filename);
 
+memset(&s, 0, sizeof(s));
+memset(vdi, 0, sizeof(vdi));
+memset(tag, 0, sizeof(tag));
+if (parse_vdiname(&s, filename, vdi, &snapid, tag) < 0) {
+error_report("invalid filename\n");
+return -EINVAL;
+}
+
 while (options && options->name) {
 if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
 vdi_size = options->value.n;
@@ -1338,11 +1349,11 @@ static int sd_create(const char *filename, QEMUOptionParameter *options)
 return -EINVAL;
 }
 
-vid = s->inode.vdi_id;
+base_vid = s->inode.vdi_id;
 bdrv_delete(bs);
 }
 
-return do_sd_create((char *)filename, vdi_size, vid, NULL, 0, NULL, NULL);
+return do_sd_create((char *)vdi, vdi_size, base_vid, &vid, 0, s.addr, s.port);
 }
 
 static void sd_close(BlockDriverState *bs)
-- 
1.5.6.5




[Qemu-devel] [PATCH] Documentation: add Sheepdog disk images

2011-02-07 Thread MORITA Kazutaka
Signed-off-by: MORITA Kazutaka 
---
 qemu-doc.texi |   52 
 1 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/qemu-doc.texi b/qemu-doc.texi
index 22a8663..86e017c 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -407,6 +407,7 @@ snapshots.
 * host_drives::   Using host drives
 * disk_images_fat_images::Virtual FAT disk images
 * disk_images_nbd::   NBD access
+* disk_images_sheepdog::  Sheepdog disk images
 @end menu
 
 @node disk_images_quickstart
@@ -630,6 +631,57 @@ qemu -cdrom nbd:localhost:exportname=debian-500-ppc-netinst
 qemu -cdrom nbd:localhost:exportname=openSUSE-11.1-ppc-netinst
 @end example
 
+@node disk_images_sheepdog
+@subsection Sheepdog disk images
+
+Sheepdog is a distributed storage system for QEMU.  It provides highly
+available block level storage volumes that can be attached to
+QEMU-based virtual machines.
+
+You can create a Sheepdog disk image with the command:
+@example
+qemu-img create sheepdog:@var{image} @var{size}
+@end example
+where @var{image} is the Sheepdog image name and @var{size} is its
+size.
+
+To import an existing disk image @var{filename} into Sheepdog, you can
+use the convert command.
+@example
+qemu-img convert @var{filename} sheepdog:@var{image}
+@end example
+
+You can boot from the Sheepdog disk image with the command:
+@example
+qemu sheepdog:@var{image}
+@end example
+
+You can also create a snapshot of a Sheepdog image, as with qcow2.
+@example
+qemu-img snapshot -c @var{tag} sheepdog:@var{image}
+@end example
+where @var{tag} is the tag name of the newly created snapshot.
+
+To boot from the Sheepdog snapshot, specify the tag name of the
+snapshot.
+@example
+qemu sheepdog:@var{image}:@var{tag}
+@end example
+
+You can create a cloned image from an existing snapshot.
+@example
+qemu-img create -b sheepdog:@var{base}:@var{tag} sheepdog:@var{image}
+@end example
+where @var{base} is the image name of the source snapshot and @var{tag}
+is its tag name.
+
+If the Sheepdog daemon isn't running on the local host, you need to
+specify one of the Sheepdog servers to connect to.
+@example
+qemu-img create sheepdog:@var{hostname}:@var{port}:@var{image} @var{size}
+qemu sheepdog:@var{hostname}:@var{port}:@var{image}
+@end example
+
 @node pcsys_network
 @section Network emulation
 
-- 
1.5.6.5




Re: [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface

2011-02-22 Thread MORITA Kazutaka
At Mon, 21 Feb 2011 17:48:49 +0100,
Kevin Wolf wrote:
> 
> > On 21.02.2011 17:31, Nicholas Thomas wrote:
> > Hi again,
> > 
> > Thanks for looking through the patches. I'm just going through and
> > making the suggested changes now. I've also got qemu-nbd and block/nbd.c
> > working over IPv6 :) - hopefully I'll be able to provide patches in a
> > couple of days. Just a few questions about some of the changes...
> > 
> > Canceled requests: 
> >>> +
> >>> +
> >>> +static void nbd_aio_cancel(BlockDriverAIOCB *blockacb)
> >>> +{
> >>> +NBDAIOCB *acb = (NBDAIOCB *)blockacb;
> >>> +
> >>> +/*
> >>> + * We cannot cancel the requests which are already sent to
> >>> + * the servers, so we just complete the request with -EIO here.
> >>> + */
> >>> +acb->common.cb(acb->common.opaque, -EIO);
> >>> +acb->canceled = 1;
> >>> +}
> >>
> >> I think you need to check for acb->canceled before you write to the
> >> associated buffer when receiving the reply for a read request. The
> >> buffer might not exist any more after the request is cancelled.
> > 
> > I "borrowed" this code from block/sheepdog.c (along with a fair few
> > other bits ;) ) - which doesn't seem to do any special checking for
> > cancelled write requests. So if there is a potential SIGSEGV here, I
> > guess sheepdog is also vulnerable.
> 
> Right, now that you mention it, I seem to remember this from Sheepdog. I
> think I had a discussion with Stefan and he convinced me that we could
> get away with it in Sheepdog because of some condition that Sheepdog
> meets. Not sure any more what that condition was and if it applies to NBD.
> 
> Was it that Sheepdog has a bounce buffer for all requests?

Sheepdog doesn't use a bounce buffer for any requests, and to me, it
seems that Sheepdog also needs to check acb->canceled before reading
the response of a read request...
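
Something like this guard in aio_read_response() would do, I think (an
untested sketch; bounce_buf is hypothetical, the other names follow
block/sheepdog.c):

if (!acb->canceled) {
    ret = do_readv(fd, acb->qiov->iov, rsp.data_length,
                   aio_req->iov_offset);
} else {
    /* the caller's buffer may already be gone, so drain the payload
     * into a private bounce buffer to keep the stream in sync */
    ret = do_read(fd, bounce_buf, rsp.data_length);
}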


> >>> +static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs,
> >>> +int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> >>> +BlockDriverCompletionFunc *cb, void *opaque)
> >>> +{
> >>> [...]
> >>> +for (i = 0; i < qiov->niov; i++) {
> >>> +memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len);
> >>> +}
> >>
> >> qemu_iovec_memset?
> >>
> >> What is this even for? Aren't these zeros overwritten anyway?
> > 
> > Again, present in sheepdog - but it does seem to work fine without it.
> > I'll remove it from NBD.
> 
> Maybe Sheepdog reads only partially from the server if blocks are
> unallocated or something.

Yes, exactly.


Thanks,

Kazutaka



[Qemu-devel] [PATCH v2] qemu-io: check registered fds in command_loop()

2010-06-20 Thread MORITA Kazutaka
Some block drivers use an aio handler and run their I/O completion
routines in it.  However, the handler is never invoked if we only do
aio_read/write, because the registered fds are not checked at all.

This patch registers an aio handler for stdin which checks whether we
can read a command without blocking, and calls qemu_aio_wait() in
command_loop().  Any other handlers can be invoked while user input is
idle.

Signed-off-by: MORITA Kazutaka 
---

It seems that the QEMU aio implementation doesn't allow calling
qemu_aio_wait() from within an aio handler, so the previous patch is
broken.

This patch only checks in the aio handler that stdin is ready for
reading a line, and invokes the command from command_loop().

I think this also fixes the problem which occurs in qemu-iotests.
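
Condensed, the new command_loop() boils down to this pattern (a sketch
of the patch below, not the literal code):

while (!done) {
    if (!prompted) {
        printf("%s", get_prompt());
        fflush(stdout);
        /* arm a handler that fires when stdin becomes readable */
        qemu_aio_set_fd_handler(STDIN_FILENO, prep_fetchline, NULL, NULL,
                                NULL, &fetchable);
        prompted = 1;
    }

    qemu_aio_wait();    /* completion routines of other fds run here */

    if (!fetchable) {
        continue;       /* woken up by some other fd, not by stdin */
    }

    /* ... fetchline() and command dispatch as before ... */
    prompted = 0;
    fetchable = 0;
}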

 cmd.c |   33 ++---
 1 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/cmd.c b/cmd.c
index 2336334..db2c9c4 100644
--- a/cmd.c
+++ b/cmd.c
@@ -24,6 +24,7 @@
 #include 
 
 #include "cmd.h"
+#include "qemu-aio.h"
 
 #define _(x)   x   /* not gettext support yet */
 
@@ -149,10 +150,20 @@ add_args_command(
args_func = af;
 }
 
+static void prep_fetchline(void *opaque)
+{
+int *fetchable = opaque;
+
+qemu_aio_set_fd_handler(STDIN_FILENO, NULL, NULL, NULL, NULL, NULL);
+*fetchable = 1;
+}
+
+static char *get_prompt(void);
+
 void
 command_loop(void)
 {
-   int c, i, j = 0, done = 0;
+   int c, i, j = 0, done = 0, fetchable = 0, prompted = 0;
char    *input;
char    **v;
const cmdinfo_t *ct;
@@ -186,7 +197,21 @@ command_loop(void)
free(cmdline);
return;
}
+
while (!done) {
+if (!prompted) {
+printf("%s", get_prompt());
+fflush(stdout);
+qemu_aio_set_fd_handler(STDIN_FILENO, prep_fetchline, NULL, NULL,
+NULL, &fetchable);
+prompted = 1;
+}
+
+qemu_aio_wait();
+
+if (!fetchable) {
+continue;
+}
if ((input = fetchline()) == NULL)
break;
v = breakline(input, &c);
@@ -199,7 +224,11 @@ command_loop(void)
v[0]);
}
doneline(input, v);
+
+prompted = 0;
+fetchable = 0;
}
+qemu_aio_set_fd_handler(STDIN_FILENO, NULL, NULL, NULL, NULL, NULL);
 }
 
 /* from libxcmd/input.c */
@@ -270,8 +299,6 @@ fetchline(void)
 
if (!line)
return NULL;
-   printf("%s", get_prompt());
-   fflush(stdout);
if (!fgets(line, MAXREADLINESZ, stdin)) {
free(line);
return NULL;
-- 
1.5.6.5




[Qemu-devel] [PATCH] qemu-img: avoid calling exit(1) to release resources properly

2010-06-20 Thread MORITA Kazutaka
This patch removes exit(1) from error(), and properly releases
resources such as block drivers and allocated memory.

For testing the Sheepdog block driver with qemu-iotests, it is
necessary to call bdrv_delete() before the program exits, because the
driver releases the lock on VM images in its close handler.

Signed-off-by: MORITA Kazutaka 
---
 qemu-img.c |  235 +++-
 1 files changed, 184 insertions(+), 51 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index ea091f0..fe8a577 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -39,14 +39,13 @@ typedef struct img_cmd_t {
 /* Default to cache=writeback as data integrity is not important for qemu-tcg. */
 #define BDRV_O_FLAGS BDRV_O_CACHE_WB
 
-static void QEMU_NORETURN error(const char *fmt, ...)
+static void error(const char *fmt, ...)
 {
 va_list ap;
 va_start(ap, fmt);
 fprintf(stderr, "qemu-img: ");
 vfprintf(stderr, fmt, ap);
 fprintf(stderr, "\n");
-exit(1);
 va_end(ap);
 }
 
@@ -197,57 +196,76 @@ static BlockDriverState *bdrv_new_open(const char *filename,
 char password[256];
 
 bs = bdrv_new("");
-if (!bs)
+if (!bs) {
 error("Not enough memory");
+goto fail;
+}
 if (fmt) {
 drv = bdrv_find_format(fmt);
-if (!drv)
+if (!drv) {
 error("Unknown file format '%s'", fmt);
+goto fail;
+}
 } else {
 drv = NULL;
 }
 if (bdrv_open(bs, filename, flags, drv) < 0) {
 error("Could not open '%s'", filename);
+goto fail;
 }
 if (bdrv_is_encrypted(bs)) {
 printf("Disk image '%s' is encrypted.\n", filename);
-if (read_password(password, sizeof(password)) < 0)
+if (read_password(password, sizeof(password)) < 0) {
 error("No password given");
-if (bdrv_set_key(bs, password) < 0)
+goto fail;
+}
+if (bdrv_set_key(bs, password) < 0) {
 error("invalid password");
+goto fail;
+}
 }
 return bs;
+fail:
+if (bs) {
+bdrv_delete(bs);
+}
+return NULL;
 }
 
-static void add_old_style_options(const char *fmt, QEMUOptionParameter *list,
+static int add_old_style_options(const char *fmt, QEMUOptionParameter *list,
 int flags, const char *base_filename, const char *base_fmt)
 {
 if (flags & BLOCK_FLAG_ENCRYPT) {
 if (set_option_parameter(list, BLOCK_OPT_ENCRYPT, "on")) {
 error("Encryption not supported for file format '%s'", fmt);
+return -1;
 }
 }
 if (flags & BLOCK_FLAG_COMPAT6) {
 if (set_option_parameter(list, BLOCK_OPT_COMPAT6, "on")) {
 error("VMDK version 6 not supported for file format '%s'", fmt);
+return -1;
 }
 }
 
 if (base_filename) {
 if (set_option_parameter(list, BLOCK_OPT_BACKING_FILE, base_filename)) {
 error("Backing file not supported for file format '%s'", fmt);
+return -1;
 }
 }
 if (base_fmt) {
 if (set_option_parameter(list, BLOCK_OPT_BACKING_FMT, base_fmt)) {
 error("Backing file format not supported for file format '%s'", 
fmt);
+return -1;
 }
 }
+return 0;
 }
 
 static int img_create(int argc, char **argv)
 {
-int c, ret, flags;
+int c, ret = 0, flags;
 const char *fmt = "raw";
 const char *base_fmt = NULL;
 const char *filename;
@@ -293,12 +311,16 @@ static int img_create(int argc, char **argv)
 
 /* Find driver and parse its options */
 drv = bdrv_find_format(fmt);
-if (!drv)
+if (!drv) {
 error("Unknown file format '%s'", fmt);
+return 1;
+}
 
 proto_drv = bdrv_find_protocol(filename);
-if (!proto_drv)
+if (!proto_drv) {
 error("Unknown protocol '%s'", filename);
+return 1;
+}
 
 create_options = append_option_parameters(create_options,
   drv->create_options);
@@ -307,7 +329,7 @@ static int img_create(int argc, char **argv)
 
 if (options && !strcmp(options, "?")) {
 print_option_help(create_options);
-return 0;
+goto out;
 }
 
 /* Create parameter list with default values */
@@ -319,6 +341,8 @@ static int img_create(int argc, char **argv)
 param = parse_option_parameters(options, create_options, param);
 if (param == NULL) {
 error("Invalid options for file format '%s'.", fmt);
+ret = -1;
+goto out;
 }
 }
 
@@ -328,7 +352,10 @@ static int

[Qemu-devel] [PATCH v6] block: add sheepdog driver for distributed storage support

2010-06-20 Thread MORITA Kazutaka
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS.  This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka 
---

I've addressed the comments and tested with a version of qemu-iotests
hacked to support Sheepdog.

This version changes the inode data format to support snapshot tag
names, so to test this patch, please pull the latest sheepdog server
code.
  git://sheepdog.git.sourceforge.net/gitroot/sheepdog/sheepdog next

Sheepdog passes almost all testcases for the raw format, but fails
the following ones:
 - 005: Sheepdog cannot support images larger than 4 TB, so it fails
to create a 5 TB image.
 - 012: Sheepdog images are not files, so we cannot make them
read-only with chmod.

Thanks,
Kazutaka


Changes from v5 to v6 are:
 - support a snapshot name
 - support resizing images (stretching only) to pass a qemu-iotests check
 - fix compile errors on the WIN32 environment
 - initialize an array to avoid a valgrind warning
 - remove an aio handler when it is no longer needed

Changes from v4 to v5 are:
 - address the comments on the sheepdog driver (Thanks Kevin, Chris!)
 -- fix coding style issues
 -- fix aio_cancel handling
 -- fix an overflow bug in copying the hostname
 -- add comments to the non-trivial functions
 - remove already applied patches from the patchset

Changes from v3 to v4 are:
 - fix error handling in bdrv_snapshot_goto.

Changes from v2 to v3 are:

 - add drv->bdrv_close() and drv->bdrv_open() before and after
   the bdrv_snapshot_goto() call of the protocol.
 - address the review comments on the sheepdog driver code.

Changes from v1 to v2 are:

 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add new patch to call the snapshot handler of the protocol


 Makefile.objs|2 +-
 block/sheepdog.c | 2036 ++
 2 files changed, 2037 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index 2bfb6d1..4c37182 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..69a2494
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,2036 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifdef _WIN32
+#include 
+#include 
+#include 
+#else
+#include 
+#include 
+
+#define closesocket(s) close(s)
+#endif
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "qemu_socket.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost"
+#define SD_DEFAULT_PORT "7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
+#define SD_OP_READ_OBJ   0x02
+#define SD_OP_WRITE_OBJ  0x03
+
+#define SD_OP_NEW_VDI        0x11
+#define SD_OP_LOCK_VDI       0x12
+#define SD_OP_RELEASE_VDI    0x13
+#define SD_OP_GET_VDI_INFO   0x14
+#define SD_OP_READ_VDIS      0x15
+
+#define SD_FLAG_CMD_WRITE    0x01
+#define SD_FLAG_CMD_COW      0x02
+
+#define SD_RES_SUCCESS   0x00 /* Success */
+#define SD_RES_UNKNOWN   0x01 /* Unknown error */
+#define SD_RES_NO_OBJ0x02 /* No object found */
+#define SD_RES_EIO   0x03 /* I/O error */
+#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
+#define SD_RES_VDI_LOCKED    0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI        0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
+#define SD_RES_VDI_READ  0x0A /* Ca

Re: [Qemu-devel] [PATCH 1/2] qemu-img check: Distinguish different kinds of errors

2010-07-06 Thread MORITA Kazutaka
At Fri,  2 Jul 2010 19:14:59 +0200,
Kevin Wolf wrote:
> 
> People think that their images are corrupted when in fact there are just some
> leaked clusters. Differentiating several error cases should make the messages
> more comprehensible.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block.c|   10 ++--
>  block.h|   10 -
>  qemu-img.c |   62 +--
>  3 files changed, 63 insertions(+), 19 deletions(-)
> 
> diff --git a/block.c b/block.c
> index dd6dd76..b0ceef0 100644
> --- a/block.c
> +++ b/block.c
> @@ -710,15 +710,19 @@ DeviceState *bdrv_get_attached(BlockDriverState *bs)
>  /*
>   * Run consistency checks on an image
>   *
> - * Returns the number of errors or -errno when an internal error occurs
> + * Returns 0 if the check could be completed (it doesn't mean that the image is
> + * free of errors) or -errno when an internal error occured. The results of the
> + * check are stored in res.
>   */
> -int bdrv_check(BlockDriverState *bs)
> +int bdrv_check(BlockDriverState *bs, BdrvCheckResult *res)
>  {
>  if (bs->drv->bdrv_check == NULL) {
>  return -ENOTSUP;
>  }
>  
> -return bs->drv->bdrv_check(bs);
> +memset(res, 0, sizeof(*res));
> +res->corruptions = bs->drv->bdrv_check(bs);
> +return res->corruptions < 0 ? res->corruptions : 0;
>  }
>  
>  /* commit COW file into the raw image */
> diff --git a/block.h b/block.h
> index 3d03b3e..c2a7e4c 100644
> --- a/block.h
> +++ b/block.h
> @@ -74,7 +74,6 @@ void bdrv_close(BlockDriverState *bs);
>  int bdrv_attach(BlockDriverState *bs, DeviceState *qdev);
>  void bdrv_detach(BlockDriverState *bs, DeviceState *qdev);
>  DeviceState *bdrv_get_attached(BlockDriverState *bs);
> -int bdrv_check(BlockDriverState *bs);
>  int bdrv_read(BlockDriverState *bs, int64_t sector_num,
>uint8_t *buf, int nb_sectors);
>  int bdrv_write(BlockDriverState *bs, int64_t sector_num,
> @@ -97,6 +96,15 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>  const char *backing_file, const char *backing_fmt);
>  void bdrv_register(BlockDriver *bdrv);
>  
> +
> +typedef struct BdrvCheckResult {
> +int corruptions;
> +int leaks;
> +int check_errors;
> +} BdrvCheckResult;
> +
> +int bdrv_check(BlockDriverState *bs, BdrvCheckResult *res);
> +
>  /* async block I/O */
>  typedef struct BlockDriverAIOCB BlockDriverAIOCB;
>  typedef void BlockDriverCompletionFunc(void *opaque, int ret);
> diff --git a/qemu-img.c b/qemu-img.c
> index 700af21..1782ac9 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -425,11 +425,20 @@ out:
>  return 0;
>  }
>  
> +/*
> + * Checks an image for consistency. Exit codes:
> + *
> + * 0 - Check completed, image is good
> + * 1 - Check not completed because of internal errors
> + * 2 - Check completed, image is corrupted
> + * 3 - Check completed, image has leaked clusters, but is good otherwise
> + */
>  static int img_check(int argc, char **argv)
>  {
>  int c, ret;
>  const char *filename, *fmt;
>  BlockDriverState *bs;
> +BdrvCheckResult result;
>  
>  fmt = NULL;
>  for(;;) {
> @@ -453,28 +462,51 @@ static int img_check(int argc, char **argv)
>  if (!bs) {
>  return 1;
>  }
> -ret = bdrv_check(bs);
> -switch(ret) {
> -case 0:
> -printf("No errors were found on the image.\n");
> -break;
> -case -ENOTSUP:
> +ret = bdrv_check(bs, &result);
> +
> +if (ret == -ENOTSUP) {
>  error("This image format does not support checks");
> -break;
> -default:
> -if (ret < 0) {
> -error("An error occurred during the check");
> -} else {
> -printf("%d errors were found on the image.\n", ret);
> +return 1;

Is it okay to call bdrv_delete(bs) before return?  It is necessary for
the sheepdog driver to pass qemu-iotests.

Kazutaka


--- a/qemu-img.c
+++ b/qemu-img.c
@@ -466,6 +466,7 @@ static int img_check(int argc, char **argv)
 
 if (ret == -ENOTSUP) {
 error("This image format does not support checks");
+bdrv_delete(bs);
 return 1;
 }
 



[Qemu-devel] [PATCH] sheepdog: fix compile error on systems without TCP_CORK

2010-07-06 Thread MORITA Kazutaka
WIN32 is not the only system which doesn't have TCP_CORK (e.g. OS X
doesn't have it either).

Signed-off-by: MORITA Kazutaka 
---

Betts, I think this patch fixes the compile error.  Can you try this
one?
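
For reference, the resulting fallback presumably reads as follows (a
sketch; only the preprocessor guard actually changes in the hunk
below):

#if !defined(SOL_TCP) || !defined(TCP_CORK)

static int set_cork(int fd, int v)
{
    /* no TCP_CORK on this platform (WIN32, OS X, ...), so do nothing */
    return 0;
}

#else

static int set_cork(int fd, int v)
{
    return setsockopt(fd, SOL_TCP, TCP_CORK, &v, sizeof(v));
}

#endif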

 block/sheepdog.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 69a2494..81aa564 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -889,7 +889,7 @@ static int aio_flush_request(void *opaque)
 return !QLIST_EMPTY(&s->outstanding_aio_head);
 }
 
-#ifdef _WIN32
+#if !defined(SOL_TCP) || !defined(TCP_CORK)
 
 static int set_cork(int fd, int v)
 {
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH 1/2] close all the block drivers before the qemu process exits

2010-05-12 Thread MORITA Kazutaka
This patch calls the close handler of the block driver before the qemu
process exits.

This is necessary because the sheepdog block driver releases the lock
on VM images in its close handler.

Signed-off-by: MORITA Kazutaka 
---
 block.c   |   11 +++
 block.h   |1 +
 monitor.c |1 +
 vl.c  |1 +
 4 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 7326bfe..a606820 100644
--- a/block.c
+++ b/block.c
@@ -526,6 +526,17 @@ void bdrv_close(BlockDriverState *bs)
 }
 }
 
+void bdrv_close_all(void)
+{
+BlockDriverState *bs, *n;
+
+for (bs = bdrv_first, n = bs->next; bs; bs = n, n = bs ? bs->next : NULL) {
+if (bs && bs->drv && bs->drv->bdrv_close) {
+bs->drv->bdrv_close(bs);
+}
+}
+}
+
 void bdrv_delete(BlockDriverState *bs)
 {
 BlockDriverState **pbs;
diff --git a/block.h b/block.h
index fa51ddf..1a1293a 100644
--- a/block.h
+++ b/block.h
@@ -123,6 +123,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 /* Ensure contents are flushed to disk.  */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
+void bdrv_close_all(void);
 
 int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
int *pnum);
diff --git a/monitor.c b/monitor.c
index 17e59f5..44bfe83 100644
--- a/monitor.c
+++ b/monitor.c
@@ -845,6 +845,7 @@ static void do_info_cpu_stats(Monitor *mon)
  */
 static void do_quit(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
+bdrv_close_all();
 exit(0);
 }
 
diff --git a/vl.c b/vl.c
index 77677e8..65160ed 100644
--- a/vl.c
+++ b/vl.c
@@ -4205,6 +4205,7 @@ static void main_loop(void)
 vm_stop(r);
 }
 }
+bdrv_close_all();
 pause_all_vcpus();
 }
 
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU

2010-05-12 Thread MORITA Kazutaka
Hi all,

This patch adds a block driver for the Sheepdog distributed storage
system.  Please consider it for inclusion.

Sheepdog is a distributed storage system for QEMU.  It provides highly
available block level storage volumes to VMs like Amazon EBS.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site [1] and my previous
post about sheepdog [2].

We have implemented the essential parts of the sheepdog features, and
believe the API between Sheepdog and QEMU is finalized.

Any comments or suggestions would be greatly appreciated.


Here are examples:

$ qemu-img create -f sheepdog vol1 256G # create images

$ qemu --drive format=sheepdog,file=vol1# start up a VM

$ qemu-img snapshot -c name sheepdog:vol1   # create a snapshot

$ qemu-img snapshot -l sheepdog:vol1# list snapshots
ID   TAG   VM SIZE   DATE                  VM CLOCK
1           0         2010-05-06 02:29:29   00:00:00.000
2           0         2010-05-06 02:29:55   00:00:00.000

$ qemu --drive format=sheepdog,file=vol1:1  # start up from a snapshot

$ qemu-img create -b sheepdog:vol1:1 -f sheepdog vol2   # clone images


Thanks,

Kazutaka

[1] http://www.osrg.net/sheepdog/

[2] http://lists.nongnu.org/archive/html/qemu-devel/2009-10/msg01773.html


MORITA Kazutaka (2):
  close all the block drivers before the qemu process exits
  block: add sheepdog driver for distributed storage support

 Makefile |2 +-
 block.c  |   14 +-
 block.h  |1 +
 block/sheepdog.c | 1828 ++
 monitor.c|1 +
 vl.c |1 +
 6 files changed, 1845 insertions(+), 2 deletions(-)
 create mode 100644 block/sheepdog.c




[Qemu-devel] [RFC PATCH 2/2] block: add sheepdog driver for distributed storage support

2010-05-12 Thread MORITA Kazutaka
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS.  This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka 
---
 Makefile |2 +-
 block.c  |3 +-
 block/sheepdog.c | 1828 ++
 3 files changed, 1831 insertions(+), 2 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile b/Makefile
index c1fa08c..d03cda1 100644
--- a/Makefile
+++ b/Makefile
@@ -97,7 +97,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o
+block-nested-y += parallels.o nbd.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block.c b/block.c
index a606820..ab00f3f 100644
--- a/block.c
+++ b/block.c
@@ -307,7 +307,8 @@ static BlockDriver *find_image_format(const char *filename)
 
 drv = find_protocol(filename);
 /* no need to test disk image formats for vvfat */
-if (drv && strcmp(drv->format_name, "vvfat") == 0)
+if (drv && (!strcmp(drv->format_name, "vvfat") ||
+!strcmp(drv->format_name, "sheepdog")))
 return drv;
 
 ret = bdrv_file_open(&bs, filename, BDRV_O_RDONLY);
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..7c07a52
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1828 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include 
+#include 
+
+#include "qemu-common.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost:7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
+#define SD_OP_READ_OBJ   0x02
+#define SD_OP_WRITE_OBJ  0x03
+
+#define SD_OP_NEW_VDI        0x11
+#define SD_OP_LOCK_VDI       0x12
+#define SD_OP_RELEASE_VDI    0x13
+#define SD_OP_GET_VDI_INFO   0x14
+#define SD_OP_READ_VDIS      0x15
+
+#define SD_FLAG_CMD_WRITE    0x01
+#define SD_FLAG_CMD_COW      0x02
+
+#define SD_RES_SUCCESS   0x00 /* Success */
+#define SD_RES_UNKNOWN   0x01 /* Unknown error */
+#define SD_RES_NO_OBJ        0x02 /* No object found */
+#define SD_RES_EIO   0x03 /* I/O error */
+#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
+#define SD_RES_VDI_LOCKED    0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI        0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
+#define SD_RES_VDI_READ  0x0A /* Cannot read requested vdi */
+#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */
+#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
+#define SD_RES_BASE_VDI_WRITE   0x0D /* Cannot write base vdi */
+#define SD_RES_NO_TAG        0x0E /* Requested tag is not found */
+#define SD_RES_STARTUP   0x0F /* Sheepdog is on starting up */
+#define SD_RES_VDI_NOT_LOCKED   0x10 /* Vdi is not locked */
+#define SD_RES_SHUTDOWN  0x11 /* Sheepdog is shutting down */
+#define SD_RES_NO_MEM        0x12 /* Cannot allocate memory */
+#define SD_RES_FULL_VDI  0x13 /* we already have the maximum vdis */
+#define SD_RES_VER_MISMATCH  0x14 /* Protocol version mismatch */
+#define SD_RES_NO_SPACE  0x15 /* Server has no room for new objects */
+#define SD_RES_WAIT_FOR_FORMAT  0x16 /* Sheepdog is waiting for a format operation */
+#define SD_RES_WAIT_FOR_JOIN    0x17 /* Sheepdog is waiting for other nodes joining */
+#define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
+
+/*
+ * Object ID rules
+ *
+ *  0 - 19 (20 bits): data object space
+ * 20 - 31 (12 bits): reserved data object space
+ * 32 - 55 (24 bits): vdi object space
+ * 56 - 59 ( 4 bits): reserved vdi object space
+ * 60 - 63 ( 4 bits): object type identifier space
+ */
+
+#define VDI_SPACE_SH
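
The object ID rules above translate into helpers along these lines (a
sketch derived from the layout comment; VDI_SPACE_SHIFT is 32 and
VDI_BIT is the type bit 63, per the defines in the v2 repost of this
patch):

/* data object: vdi id in bits 32-55, object index in bits 0-19 */
static inline uint64_t vid_to_data_oid(uint32_t vid, uint32_t idx)
{
    return ((uint64_t)vid << VDI_SPACE_SHIFT) | idx;
}

/* vdi (metadata) object: same vdi id, with the vdi type bit set */
static inline uint64_t vid_to_vdi_oid(uint32_t vid)
{
    return VDI_BIT | ((uint64_t)vid << VDI_SPACE_SHIFT);
}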

[Qemu-devel] Re: [RFC PATCH 1/2] close all the block drivers before the qemu process exits

2010-05-12 Thread MORITA Kazutaka
At Thu, 13 May 2010 05:16:35 +0900,
MORITA Kazutaka wrote:
> 
> On 2010/05/12 23:28, Avi Kivity wrote:
> > On 05/12/2010 01:46 PM, MORITA Kazutaka wrote:
> >> This patch calls the close handler of the block driver before the qemu
> >> process exits.
> >>
> >> This is necessary because the sheepdog block driver releases the lock
> >> of VM images in the close handler.
> >>
> >>
> > 
> > How do you handle abnormal termination?
> > 
> 
> In the case, we need to release the lock manually, unfortunately.
> Sheepdog admin tool has a command to do that.
> 

More precisely, if qemu goes down together with its host machine, we
detect the qemu failure and release the lock.  This is because Sheepdog
currently assumes that all the qemu processes are inside the sheepdog
cluster, and remembers where they are running.  When machine failures
happen, sheepdog releases the locks of the VMs on those machines (we
use corosync to check which machines are alive).

If the qemu process exits abnormally while its host machine stays
alive, the sheepdog daemon on the host needs to detect the qemu
failure.  However, that feature is not implemented yet.  We are
thinking of checking the socket connection between qemu and the
sheepdog daemon to detect the failure.  Currently, we need to release
the lock manually with the admin tool in this case.

Thanks,

Kazutaka



[Qemu-devel] Re: [RFC PATCH 1/2] close all the block drivers before the qemu process exits

2010-05-12 Thread MORITA Kazutaka
On 2010/05/12 23:28, Avi Kivity wrote:
> On 05/12/2010 01:46 PM, MORITA Kazutaka wrote:
>> This patch calls the close handler of the block driver before the qemu
>> process exits.
>>
>> This is necessary because the sheepdog block driver releases the lock
>> of VM images in the close handler.
>>
>>
> 
> How do you handle abnormal termination?
> 

In that case, we need to release the lock manually, unfortunately.
The Sheepdog admin tool has a command to do that.

Thanks,

Kazutaka



Re: [Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU

2010-05-12 Thread MORITA Kazutaka
On 2010/05/12 20:38, Kevin Wolf wrote:
> On 12.05.2010 12:46, MORITA Kazutaka wrote:
>> Hi all,
>>
>> This patch adds a block driver for Sheepdog distributed storage
>> system.  Please consider for inclusion.
>>
>> Sheepdog is a distributed storage system for QEMU.  It provides highly
>> available block level storage volumes to VMs like Amazon EBS.
>>
>> Sheepdog features are:
>> - No node in the cluster is special (no metadata node, no control
>>   node, etc)
>> - Linear scalability in performance and capacity
>> - No single point of failure
>> - Autonomous management (zero configuration)
>> - Useful volume management support such as snapshot and cloning
>> - Thin provisioning
>> - Autonomous load balancing
>>
>> The more details are available at the project site [1] and my previous
>> post about sheepdog [2].
>>
>> We have implemented the essential part of sheepdog features, and
>> believe the API between Sheepdog and QEMU is finalized.
>>
>> Any comments or suggestions would be greatly appreciated.
> 
> These patches don't apply, neither on git master nor on the block
> branch. Please rebase them on git://repo.or.cz/qemu/kevin.git block for
> the next submission.
> 

Ok, I'll rebase them and resend later. Sorry for the inconvenience.

> I'll have a closer look at your code later, but one thing I noticed is
> that the new block driver is something in between a protocol and a
> format driver (just like vvfat, which should stop doing so, too). I
> think it ought to be a real protocol with the raw format driver on top
> (or any other format - I don't see a reason why this should be
> restricted to raw).
> 
> The one thing that is unusual about it as a protocol driver is that it
> supports snapshots. However, while it is the first one, supporting
> snapshots in protocols is a thing that could be generally useful to
> support (for example thinking of a LVM protocol, which was discussed in
> the past).
> 

Agreed.  I'll modify the sheepdog driver patch to be a protocol driver,
and remove the unnecessary format check from my patch.

> So in block.c we could check if the format driver supports snapshots,
> and if it doesn't we try again with the underlying protocol. Not sure
> yet what we would do when both format and protocol do support snapshots
> (qcow2 on sheepdog/LVM/...), but that's a detail.
> 

Thanks,

Kazutaka



Re: [Qemu-devel] [RFC PATCH 1/2] close all the block drivers before the qemu process exits

2010-05-12 Thread MORITA Kazutaka
On 2010/05/12 23:01, Christoph Hellwig wrote:
> On Wed, May 12, 2010 at 07:46:52PM +0900, MORITA Kazutaka wrote:
>> This patch calls the close handler of the block driver before the qemu
>> process exits.
>>
>> This is necessary because the sheepdog block driver releases the lock
>> of VM images in the close handler.
>>
>> Signed-off-by: MORITA Kazutaka 
> 
> Looks good in principle, except that bdrv_first is gone and has been
> replaced with a real list in the meantime, so this won't even apply.
> 

Thank you for your comment.
I'll rebase and resend the updated version in the next few days.

Thanks,

Kazutaka



Re: [Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU

2010-05-13 Thread MORITA Kazutaka
At Thu, 13 May 2010 04:46:46 +0900,
MORITA Kazutaka wrote:
> 
> On 2010/05/12 20:38, Kevin Wolf wrote:
> > I'll have a closer look at your code later, but one thing I noticed is
> > that the new block driver is something in between a protocol and a
> > format driver (just like vvfat, which should stop doing so, too). I
> > think it ought to be a real protocol with the raw format driver on top
> > (or any other format - I don't see a reason why this should be
> > restricted to raw).
> > 
> > The one thing that is unusual about it as a protocol driver is that it
> > supports snapshots. However, while it is the first one, supporting
> > snapshots in protocols is a thing that could be generally useful to
> > support (for example thinking of a LVM protocol, which was discussed in
> > the past).
> > 
> 
> I agreed.  I'll modify the sheepdog driver patch as a protocol driver one,
> and remove unnecessary format check from my patch.
> 
> > So in block.c we could check if the format driver supports snapshots,
> > and if it doesn't we try again with the underlying protocol. Not sure
> > yet what we would do when both format and protocol do support snapshots
> > (qcow2 on sheepdog/LVM/...), but that's a detail.
> > 
> 

To support snapshots in a protocol, I'd like to call the handlers of
the protocol driver from the following functions in block.c:

bdrv_snapshot_create
bdrv_snapshot_goto
bdrv_snapshot_delete
bdrv_snapshot_list
bdrv_save_vmstate
bdrv_load_vmstate

Is it okay?
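
Concretely, each of these would gain a fallback to bs->file, along
these lines (a sketch for bdrv_snapshot_create; the others would
follow the same pattern):

int bdrv_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
{
    BlockDriver *drv = bs->drv;

    if (!drv)
        return -ENOMEDIUM;
    if (drv->bdrv_snapshot_create)
        return drv->bdrv_snapshot_create(bs, sn_info);
    if (bs->file)   /* fall back to the underlying protocol driver */
        return bdrv_snapshot_create(bs->file, sn_info);
    return -ENOTSUP;
}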

In the case where both the format and protocol drivers support
snapshots, I think it is better to call the format driver handler,
because qcow2 is well known as a format with snapshot support, so when
users use qcow2 they expect to get qcow2 snapshots.

There is another problem with making the sheepdog driver a protocol:
how do we deal with protocol-specific create_options?

For example, sheepdog supports cloning images as a format driver:

  $ qemu-img create -f sheepdog dst -b sheepdog:src

But if the sheepdog driver is a protocol, an error will occur.

  $ qemu-img create sheepdog:dst -b sheepdog:src
  Unknown option 'backing_file'
  qemu-img: Backing file not supported for file format 'raw'

This is because the raw format doesn't support a backing_file option.
To support protocol-specific create_options, the protocol driver needs
to parse whatever arguments the format driver cannot parse.
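
One possible direction (just a sketch, reusing the option helpers that
qemu-img.c already has): append the protocol driver's create_options
to the format driver's list before parsing, so that an option such as
backing_file is accepted by whichever driver knows it:

create_options = append_option_parameters(create_options,
                                          drv->create_options);
create_options = append_option_parameters(create_options,
                                          proto_drv->create_options);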

If my suggestions are okay, I'd like to prepare the patches.

Regards,

Kazutaka



[Qemu-devel] [RFC PATCH v2 1/3] close all the block drivers before the qemu process exits

2010-05-14 Thread MORITA Kazutaka
This patch calls the close handler of the block driver before the qemu
process exits.

This is necessary because the sheepdog block driver releases the lock
on VM images in its close handler.

Signed-off-by: MORITA Kazutaka 
---
 block.c |9 +
 block.h |1 +
 vl.c|1 +
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index c134c2b..988a94a 100644
--- a/block.c
+++ b/block.c
@@ -641,6 +641,15 @@ void bdrv_close(BlockDriverState *bs)
 }
 }
 
+void bdrv_close_all(void)
+{
+BlockDriverState *bs;
+
+QTAILQ_FOREACH(bs, &bdrv_states, list) {
+bdrv_close(bs);
+}
+}
+
 void bdrv_delete(BlockDriverState *bs)
 {
 /* remove from list, if necessary */
diff --git a/block.h b/block.h
index 278259c..531e802 100644
--- a/block.h
+++ b/block.h
@@ -121,6 +121,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 /* Ensure contents are flushed to disk.  */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
+void bdrv_close_all(void);
 
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
diff --git a/vl.c b/vl.c
index 85bcc84..5ce7807 100644
--- a/vl.c
+++ b/vl.c
@@ -2007,6 +2007,7 @@ static void main_loop(void)
 exit(0);
 }
 }
+bdrv_close_all();
 pause_all_vcpus();
 }
 
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH v2 2/3] block: call the snapshot handlers of the protocol drivers

2010-05-14 Thread MORITA Kazutaka
When the snapshot handlers of the format driver are not defined, it
is better to call the ones of the protocol driver.

This enables us to implement snapshot support in the protocol driver.

Signed-off-by: MORITA Kazutaka 
---
 block.c |   48 ++--
 1 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index 988a94a..d1866be 100644
--- a/block.c
+++ b/block.c
@@ -1689,9 +1691,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_save_vmstate)
-return -ENOTSUP;
-return drv->bdrv_save_vmstate(bs, buf, pos, size);
+if (drv->bdrv_save_vmstate)
+return drv->bdrv_save_vmstate(bs, buf, pos, size);
+if (bs->file)
+return bdrv_save_vmstate(bs->file, buf, pos, size);
+return -ENOTSUP;
 }
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
@@ -1700,9 +1702,11 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_load_vmstate)
-return -ENOTSUP;
-return drv->bdrv_load_vmstate(bs, buf, pos, size);
+if (drv->bdrv_load_vmstate)
+return drv->bdrv_load_vmstate(bs, buf, pos, size);
+if (bs->file)
+return bdrv_load_vmstate(bs->file, buf, pos, size);
+return -ENOTSUP;
 }
 
 void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
@@ -1726,9 +1730,11 @@ int bdrv_snapshot_create(BlockDriverState *bs,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_create)
-return -ENOTSUP;
-return drv->bdrv_snapshot_create(bs, sn_info);
+if (drv->bdrv_snapshot_create)
+return drv->bdrv_snapshot_create(bs, sn_info);
+if (bs->file)
+return bdrv_snapshot_create(bs->file, sn_info);
+return -ENOTSUP;
 }
 
 int bdrv_snapshot_goto(BlockDriverState *bs,
@@ -1737,9 +1743,11 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_goto)
-return -ENOTSUP;
-return drv->bdrv_snapshot_goto(bs, snapshot_id);
+if (drv->bdrv_snapshot_goto)
+return drv->bdrv_snapshot_goto(bs, snapshot_id);
+if (bs->file)
+return bdrv_snapshot_goto(bs->file, snapshot_id);
+return -ENOTSUP;
 }
 
 int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
@@ -1747,9 +1755,11 @@ int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_delete)
-return -ENOTSUP;
-return drv->bdrv_snapshot_delete(bs, snapshot_id);
+if (drv->bdrv_snapshot_delete)
+return drv->bdrv_snapshot_delete(bs, snapshot_id);
+if (bs->file)
+return bdrv_snapshot_delete(bs->file, snapshot_id);
+return -ENOTSUP;
 }
 
 int bdrv_snapshot_list(BlockDriverState *bs,
@@ -1758,9 +1768,11 @@ int bdrv_snapshot_list(BlockDriverState *bs,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_list)
-return -ENOTSUP;
-return drv->bdrv_snapshot_list(bs, psn_info);
+if (drv->bdrv_snapshot_list)
+return drv->bdrv_snapshot_list(bs, psn_info);
+if (bs->file)
+return bdrv_snapshot_list(bs->file, psn_info);
+return -ENOTSUP;
 }
 
 #define NB_SUFFIXES 4
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH v2 0/3] Sheepdog: distributed storage system for QEMU

2010-05-14 Thread MORITA Kazutaka
Hi all,

This patch adds a block driver for the Sheepdog distributed storage
system.

Changes from v1 to v2 are:

 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add new patch to call the snapshot handler of the protocol

One issue still remains: qemu-img parses command line options with
the `create_options' of the format handler, so we cannot use
protocol-specific options.

In this version, sheepdog needs to be used as a format driver when we
want to use sheepdog-specific options.

e.g. Create clone image vol2 from vol1

  $ qemu-img create -b sheepdog:vol1:1 -f sheepdog vol2


Thanks,

Kazutaka


MORITA Kazutaka (3):
  close all the block drivers before the qemu process exits
  block: call the snapshot handlers of the protocol drivers
  block: add sheepdog driver for distributed storage support

 Makefile.objs|2 +-
 block.c  |   57 ++-
 block.h  |1 +
 block/sheepdog.c | 1831 ++
 vl.c |1 +
 5 files changed, 1873 insertions(+), 19 deletions(-)
 create mode 100644 block/sheepdog.c




[Qemu-devel] [RFC PATCH v2 3/3] block: add sheepdog driver for distributed storage support

2010-05-14 Thread MORITA Kazutaka
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS.  This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka 
---
 Makefile.objs|2 +-
 block/sheepdog.c | 1831 ++
 2 files changed, 1832 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index ecdd53e..6edbc57 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..adf3a71
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1831 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include 
+#include 
+
+#include "qemu-common.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost:7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
+#define SD_OP_READ_OBJ   0x02
+#define SD_OP_WRITE_OBJ  0x03
+
+#define SD_OP_NEW_VDI        0x11
+#define SD_OP_LOCK_VDI       0x12
+#define SD_OP_RELEASE_VDI    0x13
+#define SD_OP_GET_VDI_INFO   0x14
+#define SD_OP_READ_VDIS      0x15
+
+#define SD_FLAG_CMD_WRITE    0x01
+#define SD_FLAG_CMD_COW      0x02
+
+#define SD_RES_SUCCESS   0x00 /* Success */
+#define SD_RES_UNKNOWN   0x01 /* Unknown error */
+#define SD_RES_NO_OBJ        0x02 /* No object found */
+#define SD_RES_EIO   0x03 /* I/O error */
+#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
+#define SD_RES_VDI_LOCKED    0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI        0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
+#define SD_RES_VDI_READ  0x0A /* Cannot read requested vdi */
+#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */
+#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
+#define SD_RES_BASE_VDI_WRITE   0x0D /* Cannot write base vdi */
+#define SD_RES_NO_TAG        0x0E /* Requested tag is not found */
+#define SD_RES_STARTUP   0x0F /* Sheepdog is on starting up */
+#define SD_RES_VDI_NOT_LOCKED   0x10 /* Vdi is not locked */
+#define SD_RES_SHUTDOWN  0x11 /* Sheepdog is shutting down */
+#define SD_RES_NO_MEM        0x12 /* Cannot allocate memory */
+#define SD_RES_FULL_VDI  0x13 /* we already have the maximum vdis */
+#define SD_RES_VER_MISMATCH  0x14 /* Protocol version mismatch */
+#define SD_RES_NO_SPACE  0x15 /* Server has no room for new objects */
+#define SD_RES_WAIT_FOR_FORMAT  0x16 /* Sheepdog is waiting for a format operation */
+#define SD_RES_WAIT_FOR_JOIN    0x17 /* Sheepdog is waiting for other nodes joining */
+#define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
+
+/*
+ * Object ID rules
+ *
+ *  0 - 19 (20 bits): data object space
+ * 20 - 31 (12 bits): reserved data object space
+ * 32 - 55 (24 bits): vdi object space
+ * 56 - 59 ( 4 bits): reserved vdi object space
+ * 60 - 63 ( 4 bits): object type identifier space
+ */
+
+#define VDI_SPACE_SHIFT   32
+#define VDI_BIT (UINT64_C(1) << 63)
+#define VMSTATE_BIT (UINT64_C(1) << 62)
+#define MAX_DATA_OBJS (1ULL << 20)
+#define MAX_CHILDREN 1024
+#define SD_MAX_VDI_LEN 256
+#define SD_NR_VDIS   (1U << 24)
+#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)
+
+#define SD_INODE_SIZE (sizeof(struct sd_inode))
+#define CURRENT_VDI_ID 0
+
+struct sd_req {
+   uint8_t proto_ver;
+   uint8_t opcode;
+   uint16_t    flags;
+   uint32_t    epoch;
+   uint32_

Re: [Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU

2010-05-14 Thread MORITA Kazutaka
At Fri, 14 May 2010 10:32:26 +0200,
Kevin Wolf wrote:
> 
> Am 13.05.2010 16:03, schrieb MORITA Kazutaka:
> > To support snapshot in a protocol, I'd like to call the hander of the
> > protocol driver in the following functions in block.c:
> > 
> > bdrv_snapshot_create
> > bdrv_snapshot_goto
> > bdrv_snapshot_delete
> > bdrv_snapshot_list
> > bdrv_save_vmstate
> > bdrv_load_vmstate
> > 
> > Is it okay?
> 
> Yes, I think this is the way to go.
> 
Done.

> > In the case both format and protocol drivers support snapshots, I
> > think it is better to call the format driver handler.  Because qcow2
> > is well known as a snapshot support format, so when users use qcow2,
> > they expect to get snapshot with qcow2.
> 
> I agree.
> 
Done.

> > There is another problem to make the sheepdog driver be a protocol;
> > how to deal with protocol specific create_options?
> > 
> > For example, sheepdog supports cloning images as a format driver:
> > 
> >   $ qemu-img create -f sheepdog dst -b sheepdog:src
> > 
> > But if the sheepdog driver is a protocol, error will occur.
> > 
> >   $ qemu-img create sheepdog:dst -b sheepdog:src
> >   Unknown option 'backing_file'
> >   qemu-img: Backing file not supported for file format 'raw'
> > 
> > It is because the raw format doesn't support a backing_file option.
> > To support the protocol specific create_options, if the format driver
> > cannot parse some of the arguments, the protocol driver need to parse
> > them.
> 
> That's actually a good point. Yes, I think it makes a lot of sense to
> allow parameters to be passed to the protocol driver.
> 

Okay. But it seemed to require many changes to the qemu-img parser, so I didn't
do it in the patchset I sent just now.

> Also, I've never tried to create an image over a protocol other than
> file. As far as I know, raw is the only format for which it should work
> right now (at least in theory). As we're going forward, I'm planning to
> convert the other drivers, too.
> 

I see. Thank you for the explanations.


Regards,

Kazutaka



[Qemu-devel] [RFC PATCH v3 2/3] block: call the snapshot handlers of the protocol drivers

2010-05-17 Thread MORITA Kazutaka
When snapshot handlers are not defined in the format driver, it is
better to call the ones of the protocol driver.  This enables us to
implement snapshot support in the protocol driver.

We need to call the bdrv_close() and bdrv_open() handlers of the
format driver before and after the bdrv_snapshot_goto() call of the
protocol.  This is because the contents of the block driver state may
need to be changed after loading vmstate.

Signed-off-by: MORITA Kazutaka 
---
 block.c |   61 +++--
 1 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index f3bf3f2..c987e57 100644
--- a/block.c
+++ b/block.c
@@ -1683,9 +1683,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_save_vmstate)
-return -ENOTSUP;
-return drv->bdrv_save_vmstate(bs, buf, pos, size);
+if (drv->bdrv_save_vmstate)
+return drv->bdrv_save_vmstate(bs, buf, pos, size);
+if (bs->file)
+return bdrv_save_vmstate(bs->file, buf, pos, size);
+return -ENOTSUP;
 }
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
@@ -1694,9 +1696,11 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_load_vmstate)
-return -ENOTSUP;
-return drv->bdrv_load_vmstate(bs, buf, pos, size);
+if (drv->bdrv_load_vmstate)
+return drv->bdrv_load_vmstate(bs, buf, pos, size);
+if (bs->file)
+return bdrv_load_vmstate(bs->file, buf, pos, size);
+return -ENOTSUP;
 }
 
 void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
@@ -1720,20 +1724,37 @@ int bdrv_snapshot_create(BlockDriverState *bs,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_create)
-return -ENOTSUP;
-return drv->bdrv_snapshot_create(bs, sn_info);
+if (drv->bdrv_snapshot_create)
+return drv->bdrv_snapshot_create(bs, sn_info);
+if (bs->file)
+return bdrv_snapshot_create(bs->file, sn_info);
+return -ENOTSUP;
 }
 
 int bdrv_snapshot_goto(BlockDriverState *bs,
const char *snapshot_id)
 {
 BlockDriver *drv = bs->drv;
+int ret, open_ret;
+
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_goto)
-return -ENOTSUP;
-return drv->bdrv_snapshot_goto(bs, snapshot_id);
+if (drv->bdrv_snapshot_goto)
+return drv->bdrv_snapshot_goto(bs, snapshot_id);
+
+if (bs->file) {
+drv->bdrv_close(bs);
+ret = bdrv_snapshot_goto(bs->file, snapshot_id);
+open_ret = drv->bdrv_open(bs, bs->open_flags);
+if (open_ret < 0) {
+bdrv_delete(bs);
+bs->drv = NULL;
+return open_ret;
+}
+return ret;
+}
+
+return -ENOTSUP;
 }
 
 int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
@@ -1741,9 +1762,11 @@ int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_delete)
-return -ENOTSUP;
-return drv->bdrv_snapshot_delete(bs, snapshot_id);
+if (drv->bdrv_snapshot_delete)
+return drv->bdrv_snapshot_delete(bs, snapshot_id);
+if (bs->file)
+return bdrv_snapshot_delete(bs->file, snapshot_id);
+return -ENOTSUP;
 }
 
 int bdrv_snapshot_list(BlockDriverState *bs,
@@ -1752,9 +1775,11 @@ int bdrv_snapshot_list(BlockDriverState *bs,
 BlockDriver *drv = bs->drv;
 if (!drv)
 return -ENOMEDIUM;
-if (!drv->bdrv_snapshot_list)
-return -ENOTSUP;
-return drv->bdrv_snapshot_list(bs, psn_info);
+if (drv->bdrv_snapshot_list)
+return drv->bdrv_snapshot_list(bs, psn_info);
+if (bs->file)
+return bdrv_snapshot_list(bs->file, psn_info);
+return -ENOTSUP;
 }
 
 #define NB_SUFFIXES 4
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH v3 1/3] close all the block drivers before the qemu process exits

2010-05-17 Thread MORITA Kazutaka
This patch calls the close handler of the block driver before the qemu
process exits.

This is necessary because the sheepdog block driver releases the lock
of VM images in the close handler.

Signed-off-by: MORITA Kazutaka 
---
 block.c |9 +
 block.h |1 +
 vl.c|1 +
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index bfe46e3..f3bf3f2 100644
--- a/block.c
+++ b/block.c
@@ -636,6 +636,15 @@ void bdrv_close(BlockDriverState *bs)
 }
 }
 
+void bdrv_close_all(void)
+{
+BlockDriverState *bs;
+
+QTAILQ_FOREACH(bs, &bdrv_states, list) {
+bdrv_close(bs);
+}
+}
+
 void bdrv_delete(BlockDriverState *bs)
 {
 /* remove from list, if necessary */
diff --git a/block.h b/block.h
index 278259c..531e802 100644
--- a/block.h
+++ b/block.h
@@ -121,6 +121,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 /* Ensure contents are flushed to disk.  */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
+void bdrv_close_all(void);
 
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
diff --git a/vl.c b/vl.c
index 85bcc84..5ce7807 100644
--- a/vl.c
+++ b/vl.c
@@ -2007,6 +2007,7 @@ static void main_loop(void)
 exit(0);
 }
 }
+bdrv_close_all();
 pause_all_vcpus();
 }
 
-- 
1.5.6.5




[Qemu-devel] [RFC PATCH v3 0/3] Sheepdog: distributed storage system for QEMU

2010-05-17 Thread MORITA Kazutaka
Hi all,

This patch adds a block driver for Sheepdog distributed storage
system.

Changes from v2 to v3 are:

 - add drv->bdrv_close() and drv->bdrv_open() before and after
   bdrv_snapshot_goto() call of the protocol.
 - address the review comments on the sheepdog driver code.
   I'll send the details in the reply to the review mail.

Changes from v1 to v2 are:

 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add new patch to call the snapshot handler of the protocol

If this patchset is okay, I'll work on the image creation problem of the
protocol driver.

Thanks,

Kazutaka


MORITA Kazutaka (3):
  close all the block drivers before the qemu process exits
  block: call the snapshot handlers of the protocol drivers
  block: add sheepdog driver for distributed storage support

 Makefile.objs|2 +-
 block.c  |   70 ++-
 block.h  |1 +
 block/sheepdog.c | 1845 ++
 vl.c |1 +
 5 files changed, 1900 insertions(+), 19 deletions(-)
 create mode 100644 block/sheepdog.c




[Qemu-devel] [RFC PATCH v3 3/3] block: add sheepdog driver for distributed storage support

2010-05-17 Thread MORITA Kazutaka
Sheepdog is a distributed storage system for QEMU. It provides highly
available block level storage volumes to VMs like Amazon EBS.  This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control
  node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka 
---
 Makefile.objs|2 +-
 block/sheepdog.c | 1845 ++
 2 files changed, 1846 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index ecdd53e..6edbc57 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..4672f00
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1845 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include 
+#include 
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost:7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
+#define SD_OP_READ_OBJ   0x02
+#define SD_OP_WRITE_OBJ  0x03
+
+#define SD_OP_NEW_VDI        0x11
+#define SD_OP_LOCK_VDI       0x12
+#define SD_OP_RELEASE_VDI    0x13
+#define SD_OP_GET_VDI_INFO   0x14
+#define SD_OP_READ_VDIS      0x15
+
+#define SD_FLAG_CMD_WRITE    0x01
+#define SD_FLAG_CMD_COW  0x02
+
+#define SD_RES_SUCCESS   0x00 /* Success */
+#define SD_RES_UNKNOWN   0x01 /* Unknown error */
+#define SD_RES_NO_OBJ        0x02 /* No object found */
+#define SD_RES_EIO   0x03 /* I/O error */
+#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR  0x06 /* System error */
+#define SD_RES_VDI_LOCKED    0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI        0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI   0x09 /* No base vdi found */
+#define SD_RES_VDI_READ  0x0A /* Cannot read requested vdi */
+#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */
+#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
+#define SD_RES_BASE_VDI_WRITE   0x0D /* Cannot write base vdi */
+#define SD_RES_NO_TAG        0x0E /* Requested tag is not found */
+#define SD_RES_STARTUP   0x0F /* Sheepdog is on starting up */
+#define SD_RES_VDI_NOT_LOCKED   0x10 /* Vdi is not locked */
+#define SD_RES_SHUTDOWN  0x11 /* Sheepdog is shutting down */
+#define SD_RES_NO_MEM        0x12 /* Cannot allocate memory */
+#define SD_RES_FULL_VDI  0x13 /* we already have the maximum vdis */
+#define SD_RES_VER_MISMATCH  0x14 /* Protocol version mismatch */
+#define SD_RES_NO_SPACE  0x15 /* Server has no room for new objects */
+#define SD_RES_WAIT_FOR_FORMAT  0x16 /* Sheepdog is waiting for a format operation */
+#define SD_RES_WAIT_FOR_JOIN    0x17 /* Sheepdog is waiting for other nodes joining */
+#define SD_RES_JOIN_FAILED   0x18 /* Target node had failed to join sheepdog */
+
+/*
+ * Object ID rules
+ *
+ *  0 - 19 (20 bits): data object space
+ * 20 - 31 (12 bits): reserved data object space
+ * 32 - 55 (24 bits): vdi object space
+ * 56 - 59 ( 4 bits): reserved vdi object space
+ * 60 - 63 ( 4 bits): object type identifier space
+ */
+
+#define VDI_SPACE_SHIFT   32
+#define VDI_BIT (UINT64_C(1) << 63)
+#define VMSTATE_BIT (UINT64_C(1) << 62)
+#define MAX_DATA_OBJS (1ULL << 20)
+#define MAX_CHILDREN 1024
+#define SD_MAX_VDI_LEN 256
+#define SD_NR_VDIS   (1U << 24)
+#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)
+
+#define SD_INODE_SIZE (sizeof(SheepdogInode))
+#define CURRENT_VDI_ID 0
+
+typedef struct SheepdogReq {
+   uint8_t proto_ver;
+   uint8_t opcode;
+   uint16_t    flags;

[Qemu-devel] Re: [RFC PATCH v2 3/3] block: add sheepdog driver for distributed storage support

2010-05-17 Thread MORITA Kazutaka
Hi,

Thank you very much for the reviewing!

At Fri, 14 May 2010 13:08:06 +0200,
Kevin Wolf wrote:

> > +
> > +struct sd_req {
> > +   uint8_t proto_ver;
> > +   uint8_t opcode;
> > +   uint16_t    flags;
> > +   uint32_t    epoch;
> > +   uint32_t    id;
> > +   uint32_t    data_length;
> > +   uint32_t    opcode_specific[8];
> > +};
> 
> CODING_STYLE says that structs should be typedefed and their names
> should be in CamelCase. So something like this:
> 
> typedef struct SheepdogReq {
> ...
> } SheepdogReq;
> 
> (Or, if your prefer, SDReq; but with things like SDAIOCB I think it
> becomes hard to read)
> 

I see.  I'll use Sheepdog as a prefix, like SheepdogReq.


> > +/*
> > +
> > +#undef eprintf
> > +#define eprintf(fmt, args...)                                          \
> > +do {                                                                   \
> > +   fprintf(stderr, "%s %d: " fmt, __func__, __LINE__, ##args); \
> > +} while (0)
> 
> What about using error_report() instead of fprintf? Though it should be
> the same currently.
> 

Yes, using common helper functions is better.  I replaced all the
printf calls.
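
For illustration, the change at one of the call sites quoted below
would look roughly like this (a sketch, not a hunk from the actual v3
patch; error_report() comes from qemu-error.h and appends its own
newline):

-    eprintf("too many requests\n");
+    error_report("too many requests");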


> > +
> > +   for (i = 0; i < ARRAY_SIZE(errors); ++i)
> > +   if (errors[i].err == err)
> > +   return errors[i].desc;
> 
> CODING_STYLE requires braces here.
> 

I fixed all the missing braces.


> > +
> > +   return "Invalid error code";
> > +}
> > +
> > +static inline int before(uint32_t seq1, uint32_t seq2)
> > +{
> > +return (int32_t)(seq1 - seq2) < 0;
> > +}
> > +
> > +static inline int after(uint32_t seq1, uint32_t seq2)
> > +{
> > +   return (int32_t)(seq2 - seq1) < 0;
> > +}
> 
> These functions look strange... Is the difference to seq1 < seq2 that
> the cast introduces intentional? (after(0x0, 0xabcdefff) == 1)
> 
> If yes, why is this useful? This needs a comment. If no, why even bother
> to have this function instead of directly using < or > ?
> 

These functions are used to compare sequence numbers that can wrap
around; see, for example, linux/net/tcp.h in the Linux kernel.

Anyway, sheepdog doesn't use these functions, so I removed them.
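
For readers unfamiliar with the trick, here is a minimal standalone
sketch (not part of the patch) showing why the cast matters once a
32-bit sequence counter wraps:

#include <stdint.h>
#include <stdio.h>

static inline int before(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) < 0;
}

int main(void)
{
    /* After wrap-around, 0x00000001 logically follows 0xffffffff. */
    printf("%d\n", before(0xffffffffu, 0x00000001u)); /* prints 1 */
    printf("%d\n", 0xffffffffu < 0x00000001u);        /* prints 0 */
    return 0;
}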


> > +   if (snapid)
> > +   dprintf("%" PRIx32 " non current inode was open.\n", vid);
> > +   else
> > +   s->is_current = 1;
> > +
> > +   fd = connect_to_sdog(s->addr);
> 
> I wonder why you need to open another connection here instead of using
> s->fd. This pattern repeats at least in the snapshot functions, so I'm
> sure it's there for a reason. Maybe add a comment?
> 

We can use s->fd only for AIO read/write operations.  This is because
the block driver may be waiting for a response from the server, so we
cannot send other requests on that descriptor without risking
receiving the wrong data.

I added the comment to get_sheep_fd().
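
The comment amounts to something like the following sketch
(connect_to_sdog() is the helper mentioned above; do_req() stands in
for a synchronous request/reply round trip and is illustrative, not
necessarily the real function name):

static int do_control_req(BDRVSheepdogState *s, SheepdogReq *req,
                          void *data, unsigned int *rlen)
{
    int fd, ret;

    /*
     * s->fd is reserved for AIO read/write traffic.  A synchronous
     * request sent on it could have its reply interleaved with AIO
     * replies, so open a short-lived connection instead.
     */
    fd = connect_to_sdog(s->addr);
    if (fd < 0) {
        return -EIO;
    }

    ret = do_req(fd, req, data, rlen);  /* send request, wait for reply */

    close(fd);
    return ret;
}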


> > +
> > +   iov.iov_base = &s->inode;
> > +   iov.iov_len = sizeof(s->inode);
> > +   aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
> > +   data_len, offset, 0, 0, offset);
> > +   if (!aio_req) {
> > +   eprintf("too many requests\n");
> > +   acb->ret = -EIO;
> > +   goto out;
> > +   }
> 
> Randomly failing requests is probably not a good idea. The guest might
> decide that the disk/file system is broken and stop using it. Can't you
> use a list like in AIOPool, so you can dynamically add new requests as
> needed?
> 

I agree.  In the v3 patch, AIO requests are allocated dynamically, and
all the requests are linked to the outstanding_aio_head in the
BDRVSheepdogState.
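
A hedged sketch of what that change amounts to, using QEMU's
qemu-queue.h list macros (the field and type names follow this
discussion; the real v3 code may differ in detail):

typedef struct AIOReq {
    /* ... per-request bookkeeping: oid, offset, length, ... */
    QLIST_ENTRY(AIOReq) aio_siblings;
} AIOReq;

typedef struct BDRVSheepdogState {
    /* ... */
    QLIST_HEAD(, AIOReq) outstanding_aio_head;
} BDRVSheepdogState;

/*
 * Allocate a request dynamically instead of failing when a fixed pool
 * is exhausted; qemu_malloc() aborts instead of returning NULL.
 */
static AIOReq *alloc_aio_req(BDRVSheepdogState *s)
{
    AIOReq *aio_req = qemu_malloc(sizeof(*aio_req));

    QLIST_INSERT_HEAD(&s->outstanding_aio_head, aio_req, aio_siblings);
    return aio_req;
}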


> > +
> > +static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
> > +{
> > +   struct bdrv_sd_state *s = bs->opaque;
> > +   struct bdrv_sd_state *old_s;
> > +   char vdi[SD_MAX_VDI_LEN];
> > +   char *buf = NULL;
> > +   uint32_t vid;
> > +   uint32_t snapid = 0;
> > +   int ret = -ENOENT, fd;
> > +
> > +   old_s = qemu_malloc(sizeof(struct bdrv_sd_state));
> > +   if (!old_s) {
> 
> qemu_malloc never returns NULL.
> 

I removed all the NULL checks.


> > +
> > +BlockDriver bdrv_sheepdog = {
> > +   .format_name = "sheepdog",
> > +   .protocol_name = "sheepdog",
> > +   .instance_size = sizeof(struct bdrv_sd_state),
> > +   .bdrv_file_open = sd_open,
> > +   .bdrv_close = sd_close,
> > +   .bdrv_create = sd_create,
> > +
> > +   .bdrv_aio_readv = sd_aio_readv,
> > +   .bdrv_aio_writev = sd_aio_writev,
> > +
> > +   .bdrv_snapshot_create = sd_snapshot_create,
> > +   .bdrv_snapshot_goto = sd_snapshot_goto,
> > +   .bdrv_snapshot_delete = sd_snapshot_delete,
> > +   .bdrv_snapshot_list = sd_snapshot_list,
> > +
> > +   .bdrv_save_vmstate = sd_save_vmstate,
> > +   .bdrv_load_vmstate = sd_load_vmstate,
> > +
> > +   .create_options = sd_create_options,
> > +};
> 
> Please align the = to the same column, at least in each block.
> 

I have aligned in the v3 patch.


Thanks,

Kazutaka



[Qemu-devel] Re: [RFC PATCH v3 2/3] block: call the snapshot handlers of the protocol drivers

2010-05-17 Thread MORITA Kazutaka
At Mon, 17 May 2010 13:08:08 +0200,
Kevin Wolf wrote:
> 
> Am 17.05.2010 12:19, schrieb MORITA Kazutaka:
> >  
> >  int bdrv_snapshot_goto(BlockDriverState *bs,
> > const char *snapshot_id)
> >  {
> >  BlockDriver *drv = bs->drv;
> > +int ret, open_ret;
> > +
> >  if (!drv)
> >  return -ENOMEDIUM;
> > -if (!drv->bdrv_snapshot_goto)
> > -return -ENOTSUP;
> > -return drv->bdrv_snapshot_goto(bs, snapshot_id);
> > +if (drv->bdrv_snapshot_goto)
> > +return drv->bdrv_snapshot_goto(bs, snapshot_id);
> > +
> > +if (bs->file) {
> > +drv->bdrv_close(bs);
> > +ret = bdrv_snapshot_goto(bs->file, snapshot_id);
> > +open_ret = drv->bdrv_open(bs, bs->open_flags);
> > +if (open_ret < 0) {
> > +bdrv_delete(bs);
> 
> I think you mean bs->file here.
> 
> Kevin

This error comes from re-opening the format driver, so what we should
delete here is not bs->file but bs, isn't it?  If we fail to open bs
here, the drive no longer seems usable.

Regards,

Kazutaka

> > +bs->drv = NULL;
> > +return open_ret;
> > +}
> > +return ret;
> > +}
> > +
> > +return -ENOTSUP;
> >  }



Re: [Qemu-devel] Re: [RFC PATCH v3 2/3] block: call the snapshot handlers of the protocol drivers

2010-05-17 Thread MORITA Kazutaka
On Mon, May 17, 2010 at 9:20 PM, Kevin Wolf  wrote:
> Am 17.05.2010 14:19, schrieb MORITA Kazutaka:
>> At Mon, 17 May 2010 13:08:08 +0200,
>> Kevin Wolf wrote:
>>>
>>> Am 17.05.2010 12:19, schrieb MORITA Kazutaka:
>>>>
>>>>  int bdrv_snapshot_goto(BlockDriverState *bs,
>>>>                         const char *snapshot_id)
>>>>  {
>>>>      BlockDriver *drv = bs->drv;
>>>> +    int ret, open_ret;
>>>> +
>>>>      if (!drv)
>>>>          return -ENOMEDIUM;
>>>> -    if (!drv->bdrv_snapshot_goto)
>>>> -        return -ENOTSUP;
>>>> -    return drv->bdrv_snapshot_goto(bs, snapshot_id);
>>>> +    if (drv->bdrv_snapshot_goto)
>>>> +        return drv->bdrv_snapshot_goto(bs, snapshot_id);
>>>> +
>>>> +    if (bs->file) {
>>>> +        drv->bdrv_close(bs);
>>>> +        ret = bdrv_snapshot_goto(bs->file, snapshot_id);
>>>> +        open_ret = drv->bdrv_open(bs, bs->open_flags);
>>>> +        if (open_ret < 0) {
>>>> +            bdrv_delete(bs);
>>>
>>> I think you mean bs->file here.
>>>
>>> Kevin
>>
>> This is an error of re-opening the format driver, so what we should
>> delete here is not bs->file but bs, isn't it?  If we failed to open bs
>> here, the drive doesn't seem to work anymore.
>
> But bdrv_delete means basically free it. This is almost guaranteed to
> lead to crashes because that BlockDriverState is still in use in other
> places.
>
> One additional case of use after free is in the very next line:
>
>>>> +            bs->drv = NULL;
>
> You can't do that when bs is freed, obviously. But I think just setting
> bs->drv to NULL without bdrv_deleting it before is the better way. It
> will fail any requests (with -ENOMEDIUM), but can't produce crashes.
> This is also what bdrv_commit does in such situations.
>
> In this state, we don't access the underlying file any more, so we could
> delete bs->file - this is why I thought you actually meant to do that.
>

I'm sorry for the confusion.  I understand what we should do here.
I'll fix it for the next post.

Thanks,

Kazutaka
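
For reference, a sketch of the corrected error path along the lines
Kevin describes (based on the v3 hunk quoted above; the actual fix
appeared in a later revision):

    if (bs->file) {
        drv->bdrv_close(bs);
        ret = bdrv_snapshot_goto(bs->file, snapshot_id);
        open_ret = drv->bdrv_open(bs, bs->open_flags);
        if (open_ret < 0) {
            /*
             * Do not bdrv_delete(bs): it is still referenced elsewhere.
             * Clearing bs->drv makes subsequent requests fail with
             * -ENOMEDIUM, as bdrv_commit does in similar situations.
             */
            bs->drv = NULL;
            return open_ret;
        }
        return ret;
    }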



[Qemu-devel] [PATCH] add support for protocol driver create_options

2010-05-19 Thread MORITA Kazutaka
This patch enables protocol drivers to use their create options which
are not supported by the format.  For example, protocol drivers can use
a backing_file option with raw format.

Signed-off-by: MORITA Kazutaka 
---
 block.c   |7 +++
 block.h   |1 +
 qemu-img.c|   49 ++---
 qemu-option.c |   52 +---
 qemu-option.h |2 ++
 5 files changed, 85 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index 48d8468..0ab9424 100644
--- a/block.c
+++ b/block.c
@@ -56,7 +56,6 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
 uint8_t *buf, int nb_sectors);
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
  const uint8_t *buf, int nb_sectors);
-static BlockDriver *find_protocol(const char *filename);
 
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
@@ -210,7 +209,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options)
 {
 BlockDriver *drv;
 
-drv = find_protocol(filename);
+drv = bdrv_find_protocol(filename);
 if (drv == NULL) {
 drv = bdrv_find_format("file");
 }
@@ -283,7 +282,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
 return drv;
 }
 
-static BlockDriver *find_protocol(const char *filename)
+BlockDriver *bdrv_find_protocol(const char *filename)
 {
 BlockDriver *drv1;
 char protocol[128];
@@ -469,7 +468,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags)
 BlockDriver *drv;
 int ret;
 
-drv = find_protocol(filename);
+drv = bdrv_find_protocol(filename);
 if (!drv) {
 return -ENOENT;
 }
diff --git a/block.h b/block.h
index 24efeb6..9034ebb 100644
--- a/block.h
+++ b/block.h
@@ -54,6 +54,7 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
+BlockDriver *bdrv_find_protocol(const char *filename);
 BlockDriver *bdrv_find_format(const char *format_name);
 BlockDriver *bdrv_find_whitelisted_format(const char *format_name);
 int bdrv_create(BlockDriver *drv, const char* filename,
diff --git a/qemu-img.c b/qemu-img.c
index d3c30a7..8ae7184 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -252,8 +252,8 @@ static int img_create(int argc, char **argv)
 const char *base_fmt = NULL;
 const char *filename;
 const char *base_filename = NULL;
-BlockDriver *drv;
-QEMUOptionParameter *param = NULL;
+BlockDriver *drv, *proto_drv;
+QEMUOptionParameter *param = NULL, *create_options = NULL;
 char *options = NULL;
 
 flags = 0;
@@ -286,33 +286,42 @@ static int img_create(int argc, char **argv)
 }
 }
 
+/* Get the filename */
+if (optind >= argc)
+help();
+filename = argv[optind++];
+
 /* Find driver and parse its options */
 drv = bdrv_find_format(fmt);
 if (!drv)
 error("Unknown file format '%s'", fmt);
 
+proto_drv = bdrv_find_protocol(filename);
+if (!proto_drv)
+error("Unknown protocol '%s'", filename);
+
+create_options = append_option_parameters(create_options,
+  drv->create_options);
+create_options = append_option_parameters(create_options,
+  proto_drv->create_options);
+
 if (options && !strcmp(options, "?")) {
-print_option_help(drv->create_options);
+print_option_help(create_options);
 return 0;
 }
 
 /* Create parameter list with default values */
-param = parse_option_parameters("", drv->create_options, param);
+param = parse_option_parameters("", create_options, param);
 set_option_parameter_int(param, BLOCK_OPT_SIZE, -1);
 
 /* Parse -o options */
 if (options) {
-param = parse_option_parameters(options, drv->create_options, param);
+param = parse_option_parameters(options, create_options, param);
 if (param == NULL) {
 error("Invalid options for file format '%s'.", fmt);
 }
 }
 
-/* Get the filename */
-if (optind >= argc)
-help();
-filename = argv[optind++];
-
 /* Add size to parameters */
 if (optind < argc) {
 set_option_parameter(param, BLOCK_OPT_SIZE, argv[optind++]);
@@ -362,6 +371,7 @@ static int img_create(int argc, char **argv)
 puts("");
 
 ret = bdrv_create(drv, filename, param);
+free_option_parameters(create_options);
 free_option_parameters(param);
 
 if (ret < 0) {
@@ -543,14 +553,14 @@ static int img_convert(int argc, char **argv)
 {
 int c, ret, n, n1, bs_n, bs_i, flags, cluster_size, cluster_sectors;
 const char *fmt, *

[Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-20 Thread MORITA Kazutaka

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
- Tolerance against network failure
* Zero configuration (newly added machines will join the cluster 
automatically)
- Autonomous load balancing
* Snapshot
- Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
- Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:

http://www.osrg.net/sheepdog/

Note that the code is still in an early stage.
There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recovery
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

We hope to find people interested in working together.
Enjoy!


Here are examples:

- create images

$ kvm-img create -f sheepdog "Alice's Disk" 256G
$ kvm-img create -f sheepdog "Bob's Disk" 256G

- list images

$ shepherd info -t vdi
   4 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:17:18, tag: 0, current
   8 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:29:20, tag: 0, current

- start up a virtual machine

$ kvm --drive format=sheepdog,file="Alice's Disk"

- create a snapshot

$ kvm-img snapshot -c name sheepdog:"Alice's Disk"

- clone from a snapshot

$ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"


Thanks.

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp





[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Hello,

Does the following patch work for you?

diff --git a/sheep/work.c b/sheep/work.c
index 4df8dc0..45f362d 100644
--- a/sheep/work.c
+++ b/sheep/work.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#define _LINUX_FCNTL_H
 #include <linux/signalfd.h>

 #include "list.h"


On Wed, Oct 21, 2009 at 5:45 PM, Nikolai K. Bochev
 wrote:
> Hello,
>
> I am getting the following error trying to compile sheepdog on Ubuntu 9.10 ( 
> 2.6.31-14 x64 ) :
>
> cd shepherd; make
> make[1]: Entering directory 
> `/home/shiny/Packages/sheepdog-2009102101/shepherd'
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c 
> -o shepherd.o
> shepherd.c: In function ‘main’:
> shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break 
> strict-aliasing rules
> shepherd.c:300: note: initialized from here
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c 
> -o treeview.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
> ../lib/event.c -o ../lib/event.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
> ../lib/net.c -o ../lib/net.o
> ../lib/net.c: In function ‘write_object’:
> ../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE 
> ../lib/logger.c -o ../lib/logger.o
> cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o 
> shepherd -lncurses -lcrypto
> make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
> cd sheep; make
> make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o 
> sheep.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o 
> store.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o 
> net.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o 
> work.o
> In file included from /usr/include/asm/fcntl.h:1,
>                 from /usr/include/linux/fcntl.h:4,
>                 from /usr/include/linux/signalfd.h:13,
>                 from work.c:31:
> /usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’
> /usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’
> make[1]: *** [work.o] Error 1
> make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
> make: *** [all] Error 2
>
> I have all the required libs installed. Patching and compiling qemu-kvm went 
> flawless.
>
> - Original Message -
> From: "MORITA Kazutaka" 
> To: k...@vger.kernel.org, qemu-devel@nongnu.org, linux-fsde...@vger.kernel.org
> Sent: Wednesday, October 21, 2009 8:13:47 AM
> Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
>
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.
>
> The following list describes the features of Sheepdog.
>
>     * Linear scalability in performance and capacity
>     * No single point of failure
>     * Redundant architecture (data is written to multiple nodes)
>     - Tolerance against network failure
>     * Zero configuration (newly added machines will join the cluster 
> automatically)
>     - Autonomous load balancing
>     * Snapshot
>     - Online snapshot from qemu-monitor
>     * Clone from a snapshot volume
>     * Thin provisioning
>     - Amazon EBS API support (to use from a Eucalyptus instance)
>
> (* = current features, - = on our todo list)
>
> More details and download links are here:
>
> http://www.osrg.net/sheepdog/
>
> Note that the code is still in an early stage.
> There are some critical TODO items:
>
>     - VM image deletion support
>     - Support architectures other than X86_64
>     - Data recoverys
>     - Free space management
>     - Guarantee reliability and availability under heavy load
>     - Performance improvement
>     - Reclaim unused blocks
>     - More documentation
>
> We hope finding people interested in working together.
> Enjoy!
>
>
> Here are examples:
>
> - create images
>
> $ kvm-img create -f sheepdog "Alice's Disk" 256G
> $ kvm-img create -f sheepdog "Bob's Disk" 256G
>
> - list images
>
> $ shepherd info -t vdi
>    4

[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
We use JGroups (Java library) for reliable multicast communication in
our cluster manager daemon. We don't worry about the performance much
since the cluster manager daemon is not involved in the I/O path. We
might think about moving to corosync if it is more stable than
JGroups.

On Wed, Oct 21, 2009 at 6:08 PM, Dietmar Maurer  wrote:
> Quite interesting. But would it be possible to use corosync for the cluster 
> communication? The point is that we need corosync anyways for pacemaker, it 
> is written in C (high performance) and seem to implement the feature you need?
>
>> -Original Message-
>> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
>> Behalf Of MORITA Kazutaka
>> Sent: Mittwoch, 21. Oktober 2009 07:14
>> To: k...@vger.kernel.org; qemu-devel@nongnu.org; linux-
>> fsde...@vger.kernel.org
>> Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
>>
>> Hi everyone,
>>
>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>> highly available block level storage volumes to VMs like Amazon EBS.
>> Sheepdog supports advanced volume management features such as snapshot,
>> cloning, and thin provisioning. Sheepdog runs on several tens or
>> hundreds
>> of nodes, and the architecture is fully symmetric; there is no central
>> node such as a meta-data server.
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp




[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity  wrote:
> On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
>>
>> Hi everyone,
>>
>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>> highly available block level storage volumes to VMs like Amazon EBS.
>> Sheepdog supports advanced volume management features such as snapshot,
>> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
>> of nodes, and the architecture is fully symmetric; there is no central
>> node such as a meta-data server.
>
> Very interesting!  From a very brief look at the code, it looks like the
> sheepdog block format driver is a network client that is able to access
> highly available images, yes?

Yes. Sheepdog is a simple key-value storage system that
consists of multiple nodes (a bit similar to Amazon Dynamo, I guess).

The qemu Sheepdog driver (client) divides a VM image into fixed-size
objects and stores them on the key-value storage system.
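
A rough sketch of the offset-to-object mapping this implies, using
constants from the v3 patch (SD_DATA_OBJ_SIZE is 4 MB; the helper name
and signature here are illustrative, not necessarily those in
block/sheepdog.c):

#include <stdint.h>

#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)   /* 4 MB per data object */
#define VDI_SPACE_SHIFT  32

/*
 * Map a byte offset within a VM image (vdi) to the object holding it,
 * following the object ID layout in the patch: vdi id in bits 32-55,
 * data object index in the low bits.
 */
static uint64_t data_oid_for_offset(uint32_t vid, uint64_t image_offset)
{
    uint64_t idx = image_offset / SD_DATA_OBJ_SIZE;

    return ((uint64_t)vid << VDI_SPACE_SHIFT) | idx;
}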

> If so, is it reasonable to compare this to a cluster file system setup (like
> GFS) with images as files on this filesystem?  The difference would be that
> clustering is implemented in userspace in sheepdog, but in the kernel for a
> clustering filesystem.

I think that the major difference between sheepdog and cluster file
systems such as the Google File System, pNFS, etc. is the interface
between the clients and the storage system.

> How is load balancing implemented?  Can you move an image transparently
> while a guest is running?  Will an image be moved closer to its guest?

Sheepdog uses consistent hashing to decide where objects are stored;
I/O load is balanced across the nodes.  When a new node is added or an
existing node is removed, the hash ring changes and the data is moved
across nodes automatically and transparently.

We plan to implement a mechanism to distribute the data not randomly
but intelligently; we could use machine load, the locations of VMs, etc.
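
A toy illustration of the placement rule (not Sheepdog's actual code,
which is more involved): nodes and objects hash onto the same ring, an
object lands on the first node at or after its position, and adding or
removing a node only remaps the objects in that node's arc.

#include <stdint.h>

/* Hash a 64-bit id onto the ring (64-bit FNV-1a; good enough here). */
static uint64_t ring_hash(uint64_t x)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    int i;

    for (i = 0; i < 8; i++) {
        h ^= (x >> (i * 8)) & 0xff;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Return the index of the node that stores the given object. */
static int obj_to_node(uint64_t oid, const uint64_t node_ids[], int nr_nodes)
{
    uint64_t pos = ring_hash(oid);
    uint64_t best_pos = 0, first_pos = 0;
    int i, best = -1, first = 0;

    for (i = 0; i < nr_nodes; i++) {
        uint64_t npos = ring_hash(node_ids[i]);

        if (i == 0 || npos < first_pos) {   /* lowest position overall */
            first = i;
            first_pos = npos;
        }
        if (npos >= pos && (best < 0 || npos < best_pos)) {
            best = i;                       /* first node at/after pos */
            best_pos = npos;
        }
    }
    return best >= 0 ? best : first;        /* wrap around the ring */
}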

> Can you stripe an image across nodes?

Yes, a VM image is divided into multiple objects, and they are
stored across the nodes.

> Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but only to one VM at a time;
multiple running VMs cannot access the same VM image.

> What about fault tolerance - storing an image redundantly on multiple nodes?

Yes, all objects are replicated to multiple nodes.


-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp




Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Fri, Oct 23, 2009 at 8:10 PM, Alexander Graf  wrote:
>
> On 23.10.2009, at 12:41, MORITA Kazutaka wrote:
>
> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity  wrote:
>
> How is load balancing implemented?  Can you move an image transparently
>
> while a guest is running?  Will an image be moved closer to its guest?
>
> Sheepdog uses consistent hashing to decide where objects store; I/O
> load is balanced across the nodes. When a new node is added or the
> existing node is removed, the hash table changes and the data
> automatically and transparently are moved over nodes.
>
> We plan to implement a mechanism to distribute the data not randomly
> but intelligently; we could use machine load, the locations of VMs, etc.
>
> What exactly does balanced mean? Can it cope with individual nodes having
> more disk space than others?

I mean that objects are uniformly distributed over the nodes by the
hash function.  Distribution using free disk space information is one
of our TODOs.

> Do you support multiple guests accessing the same image?
>
> A VM image can be attached to any VMs but one VM at a time; multiple
> running VMs cannot access to the same VM image.
>
> What about read-only access? Imagine you'd have 5 kvm instances each
> accessing it using -snapshot.

By creating new clone images from an existing snapshot image, you can
do something similar.  Sheepdog can create a clone image instantly.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp




Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
Hi,

Thanks for many comments.

Sheepdog git trees are created.

  Sheepdog server
git://sheepdog.git.sourceforge.net/gitroot/sheepdog/sheepdog

  Sheepdog client
git://sheepdog.git.sourceforge.net/gitroot/sheepdog/qemu-kvm

Please try!

On Wed, Oct 21, 2009 at 2:13 PM, MORITA Kazutaka
 wrote:
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.
>
> The following list describes the features of Sheepdog.
>
>    * Linear scalability in performance and capacity
>    * No single point of failure
>    * Redundant architecture (data is written to multiple nodes)
>    - Tolerance against network failure
>    * Zero configuration (newly added machines will join the cluster
> automatically)
>    - Autonomous load balancing
>    * Snapshot
>    - Online snapshot from qemu-monitor
>    * Clone from a snapshot volume
>    * Thin provisioning
>    - Amazon EBS API support (to use from a Eucalyptus instance)
>
> (* = current features, - = on our todo list)
>
> More details and download links are here:
>
> http://www.osrg.net/sheepdog/
>
> Note that the code is still in an early stage.
> There are some critical TODO items:
>
>    - VM image deletion support
>    - Support architectures other than X86_64
>    - Data recoverys
>    - Free space management
>    - Guarantee reliability and availability under heavy load
>    - Performance improvement
>    - Reclaim unused blocks
>    - More documentation
>
> We hope finding people interested in working together.
> Enjoy!
>
>
> Here are examples:
>
> - create images
>
> $ kvm-img create -f sheepdog "Alice's Disk" 256G
> $ kvm-img create -f sheepdog "Bob's Disk" 256G
>
> - list images
>
> $ shepherd info -t vdi
>   4 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15
> 16:17:18, tag:        0, current
>   8 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15
> 16:29:20, tag:        0, current
>
> - start up a virtual machine
>
> $ kvm --drive format=sheepdog,file="Alice's Disk"
>
> - create a snapshot
>
> $ kvm-img snapshot -c name sheepdog:"Alice's Disk"
>
> - clone from a snapshot
>
> $ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"
>
>
> Thanks.
>
> --
> MORITA, Kazutaka
>
> NTT Cyber Space Labs
> OSS Computing Project
> Kernel Group
> E-mail: morita.kazut...@lab.ntt.co.jp
>
>
>
>



-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp




Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-23 Thread MORITA Kazutaka
On Sat, Oct 24, 2009 at 4:45 AM, Javier Guerra  wrote:
> On Fri, Oct 23, 2009 at 2:39 PM, MORITA Kazutaka
>  wrote:
>> Thanks for many comments.
>>
>> Sheepdog git trees are created.
>
> great!
>
> is there any client (no matter how crude) besides the patched
> KVM/Qemu?  it would make it far easier to hack around...

No, there isn't. Sorry.
I think we should provide a test client as soon as possible.

-- 
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp




[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-25 Thread MORITA Kazutaka

On 2009/10/25 17:51, Dietmar Maurer wrote:

Do you support multiple guests accessing the same image?

A VM image can be attached to any VMs but one VM at a time; multiple
running VMs cannot access to the same VM image.


I guess this is a problem when you want to do live migrations?


Yes, because Sheepdog locks a VM image when it is opened.
To avoid this problem, locking must be delayed until migration has
completed.  This is also a TODO item.

--
MORITA Kazutaka







[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM

2009-10-27 Thread MORITA Kazutaka

On 2009/10/21 14:13, MORITA Kazutaka wrote:

Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides
highly available block level storage volumes to VMs like Amazon EBS.
Sheepdog supports advanced volume management features such as snapshot,
cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
of nodes, and the architecture is fully symmetric; there is no central
node such as a meta-data server.


We added some pages to Sheepdog website:

 Design: http://www.osrg.net/sheepdog/design.html
 FAQ   : http://www.osrg.net/sheepdog/faq.html

Sheepdog mailing list is also ready to use (thanks for Tomasz)

 Subscribe/Unsubscribe/Preferences
   http://lists.wpkg.org/mailman/listinfo/sheepdog
 Archive
   http://lists.wpkg.org/pipermail/sheepdog/

We are always looking for developers or users interested in
participating in Sheepdog project!

Thanks.

MORITA Kazutaka



