Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
At Mon, 24 May 2010 14:16:32 -0500, Anthony Liguori wrote:
> 
> On 05/24/2010 06:56 AM, Avi Kivity wrote:
> > On 05/24/2010 02:42 PM, MORITA Kazutaka wrote:
> >>
> >>> The server would be local and talk over a unix domain socket, perhaps
> >>> anonymous.
> >>>
> >>> nbd has other issues though, such as requiring a copy and no support for
> >>> metadata operations such as snapshot and file size extension.
> >>>
> >> Sorry, my explanation was unclear. I'm not sure how running servers
> >> on localhost can solve the problem.
> >
> > The local server can convert from the local (nbd) protocol to the
> > remote (sheepdog, ceph) protocol.
> >
> >> What I wanted to say was that we cannot specify the image of a VM. With
> >> the nbd protocol, command line arguments are as follows:
> >>
> >> $ qemu nbd:hostname:port
> >>
> >> As this syntax shows, with the nbd protocol the client cannot pass the VM
> >> image name to the server.
> >
> > We would extend it to allow it to connect to a unix domain socket:
> >
> > qemu nbd:unix:/path/to/socket
> 
> nbd is a no-go because it only supports a single, synchronous I/O
> operation at a time and has no mechanism for extensibility.
> 
> If we go this route, I think two options are worth considering. The
> first would be a purely socket based approach where we just accepted the
> extra copy.
> 
> The other potential approach would be shared memory based. We export
> all guest ram as shared memory along with a small bounce buffer pool.
> We would then use a ring queue (potentially even using virtio-blk) and
> an eventfd for notification.

The shared memory approach assumes that there is a local server that can
talk to the storage system. But Ceph doesn't require a local server, and
Sheepdog will be extended to support VMs running outside the storage
system. We could run a local daemon that works only as a proxy, but I
don't think that is a clean approach. So I think a socket based approach
is the right way to go.
BTW, is it required to design a common interface? The way Sheepdog
replicates data is different from Ceph's, so I don't think it is
possible to define a common protocol, as Christian says.

Regards,

Kazutaka

> > The server at the other end would associate the socket with a filename
> > and forward it to the server using the remote protocol.
> >
> > However, I don't think nbd would be a good protocol. My preference
> > would be for a plugin API, or for a new local protocol that uses
> > splice() to avoid copies.
> 
> I think a good shared memory implementation would be preferable to
> plugins. I think it's worth attempting to do a plugin interface for the
> block layer but I strongly suspect it would not be sufficient.
> 
> I would not want to see plugins that interacted with BlockDriverState
> directly, for instance. We change it far too often. Our main loop
> functions are also not terribly stable so I'm not sure how we would
> handle that (unless we forced all block plugins to be in a separate thread).
[Qemu-devel] Re: [PATCH] add support for protocol driver create_options
At Tue, 25 May 2010 15:43:17 +0200, Kevin Wolf wrote:
> 
> Am 24.05.2010 08:34, schrieb MORITA Kazutaka:
> > At Fri, 21 May 2010 18:57:36 +0200,
> > Kevin Wolf wrote:
> >>
> >> Am 20.05.2010 07:36, schrieb MORITA Kazutaka:
> >>> +
> >>> +/*
> >>> + * Append an option list (list) to an option list (dest).
> >>> + *
> >>> + * If dest is NULL, a new copy of list is created.
> >>> + *
> >>> + * Returns a pointer to the first element of dest (or the newly allocated copy)
> >>> + */
> >>> +QEMUOptionParameter *append_option_parameters(QEMUOptionParameter *dest,
> >>> +    QEMUOptionParameter *list)
> >>> +{
> >>> +    size_t num_options, num_dest_options;
> >>> +
> >>> +    num_options = count_option_parameters(dest);
> >>> +    num_dest_options = num_options;
> >>> +
> >>> +    num_options += count_option_parameters(list);
> >>> +
> >>> +    dest = qemu_realloc(dest, (num_options + 1) * sizeof(QEMUOptionParameter));
> >>> +
> >>> +    while (list && list->name) {
> >>> +        if (get_option_parameter(dest, list->name) == NULL) {
> >>> +            dest[num_dest_options++] = *list;
> >>
> >> You need to add a dest[num_dest_options].name = NULL; here. Otherwise
> >> the next loop iteration works on uninitialized memory and possibly an
> >> unterminated list. I got a segfault for that reason.
> >>
> >
> > I forgot to add it, sorry.
> > Fixed version is below.
> >
> > Thanks,
> >
> > Kazutaka
> >
> > ==
> > This patch enables protocol drivers to use their create options which
> > are not supported by the format. For example, protocol drivers can use
> > a backing_file option with raw format.
> >
> > Signed-off-by: MORITA Kazutaka
> 
> $ ./qemu-img create -f qcow2 -o cluster_size=4k /tmp/test.qcow2 4G
> Unknown option 'cluster_size'
> qemu-img: Invalid options for file format 'qcow2'.
> 
> I think you added another num_dest_options++ which shouldn't be there.

Sorry again. I wrongly added `dest[num_dest_options++].name = NULL;'
instead of `dest[num_dest_options].name = NULL;'.
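The invariant behind Kevin's fix is that the destination list must stay NULL-terminated after every element copied into it, because the duplicate check scans the destination on each iteration. A standalone sketch of the corrected loop, using plain C stand-ins for the qemu-option types (illustrative only, not the actual qemu-option.c code):

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for QEMUOptionParameter: a list is an array
 * terminated by an element whose name is NULL. */
typedef struct Option {
    const char *name;
    const char *value;
} Option;

static size_t count_options(const Option *list)
{
    size_t n = 0;
    while (list && list[n].name) {
        n++;
    }
    return n;
}

static const Option *find_option(const Option *list, const char *name)
{
    for (size_t i = 0; list && list[i].name; i++) {
        if (!strcmp(list[i].name, name)) {
            return &list[i];
        }
    }
    return NULL;
}

/* Append entries of 'list' to 'dest', skipping duplicates.  The key point
 * from the review: terminate 'dest' again after *each* copied element
 * (and without a stray ++), so find_option() never scans past the
 * initialized part of the array. */
static Option *append_options(Option *dest, const Option *list)
{
    size_t num = count_options(dest);
    size_t total = num + count_options(list);

    dest = realloc(dest, (total + 1) * sizeof(Option));
    dest[num].name = NULL;               /* terminate before scanning */

    for (; list && list->name; list++) {
        if (!find_option(dest, list->name)) {
            dest[num++] = *list;
            dest[num].name = NULL;       /* keep list terminated; no ++ here */
        }
    }
    return dest;
}
```

With the `++` misplaced onto the terminating assignment, the terminator lands one slot too far and a stale entry survives between the last copied element and it, which is exactly the qcow2 `cluster_size` breakage seen above.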
Thanks, Kazutaka == This patch enables protocol drivers to use their create options which are not supported by the format. For example, protcol drivers can use a backing_file option with raw format. Signed-off-by: MORITA Kazutaka --- block.c |7 +++ block.h |1 + qemu-img.c| 49 ++--- qemu-option.c | 53 ++--- qemu-option.h |2 ++ 5 files changed, 86 insertions(+), 26 deletions(-) diff --git a/block.c b/block.c index 6e7766a..f881f10 100644 --- a/block.c +++ b/block.c @@ -56,7 +56,6 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num, uint8_t *buf, int nb_sectors); static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num, const uint8_t *buf, int nb_sectors); -static BlockDriver *find_protocol(const char *filename); static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); @@ -210,7 +209,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options) { BlockDriver *drv; -drv = find_protocol(filename); +drv = bdrv_find_protocol(filename); if (drv == NULL) { drv = bdrv_find_format("file"); } @@ -283,7 +282,7 @@ static BlockDriver *find_hdev_driver(const char *filename) return drv; } -static BlockDriver *find_protocol(const char *filename) +BlockDriver *bdrv_find_protocol(const char *filename) { BlockDriver *drv1; char protocol[128]; @@ -478,7 +477,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags) BlockDriver *drv; int ret; -drv = find_protocol(filename); +drv = bdrv_find_protocol(filename); if (!drv) { return -ENOENT; } diff --git a/block.h b/block.h index 24efeb6..9034ebb 100644 --- a/block.h +++ b/block.h @@ -54,6 +54,7 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data); void bdrv_init(void); void bdrv_init_with_whitelist(void); +BlockDriver *bdrv_find_protocol(const char *filename); BlockDriver *bdrv_find_format(const char *format_name); BlockDriver *bdrv_find_whitelisted_format(const char *format_name); int bdrv_create(BlockDriver *drv, const char* filename, 
diff --git a/qemu-img.c b/qemu-img.c index cb007b7..ea091f0 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -252,8 +
Re: [Qemu-devel] [RFC PATCH 1/1] ceph/rbd block driver for qemu-kvm
At Tue, 25 May 2010 10:12:53 -0700 (PDT), Sage Weil wrote: > > On Tue, 25 May 2010, Avi Kivity wrote: > > > What's the reason for not having these drivers upstream? Do we gain > > > anything by hiding them from our users and requiring them to install the > > > drivers separately from somewhere else? > > > > > > > Six months. > > FWIW, we (Ceph) aren't complaining about the 6 month lag time (and I don't > think the Sheepdog guys are either). > I agree. We aren't complaining about it. > From our perspective, the current BlockDriver abstraction is ideal, as it > represents the reality of qemu's interaction with storage. Any 'external' > interface will be inferior to that in one way or another. But either way, > we are perfectly willing to work with you to all to keep in sync with any > future BlockDriver API improvements. It is worth our time investment even > if the API is less stable. > I agree. > The ability to dynamically load a shared object using the existing api > would make development a bit easier, but I'm not convinced it's better for > for users. I think having ceph and sheepdog upstream with qemu will serve > end users best, and we at least are willing to spend the time to help > maintain that code in qemu.git. > I agree. Regards, Kazutaka
[Qemu-devel] [RFC PATCH v4 2/3] block: call the snapshot handlers of the protocol drivers
When snapshot handlers are not defined in the format driver, it is better to call the ones of the protocol driver. This enables us to implement snapshot support in the protocol driver. We need to call bdrv_close() and bdrv_open() handlers of the format driver before and after bdrv_snapshot_goto() call of the protocol. It is because the contents of the block driver state may need to be changed after loading vmstate. Signed-off-by: MORITA Kazutaka --- block.c | 61 +++-- 1 files changed, 43 insertions(+), 18 deletions(-) diff --git a/block.c b/block.c index da0dc47..cf80dbf 100644 --- a/block.c +++ b/block.c @@ -1697,9 +1697,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf, BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; -if (!drv->bdrv_save_vmstate) -return -ENOTSUP; -return drv->bdrv_save_vmstate(bs, buf, pos, size); +if (drv->bdrv_save_vmstate) +return drv->bdrv_save_vmstate(bs, buf, pos, size); +if (bs->file) +return bdrv_save_vmstate(bs->file, buf, pos, size); +return -ENOTSUP; } int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf, @@ -1708,9 +1710,11 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf, BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; -if (!drv->bdrv_load_vmstate) -return -ENOTSUP; -return drv->bdrv_load_vmstate(bs, buf, pos, size); +if (drv->bdrv_load_vmstate) +return drv->bdrv_load_vmstate(bs, buf, pos, size); +if (bs->file) +return bdrv_load_vmstate(bs->file, buf, pos, size); +return -ENOTSUP; } void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event) @@ -1734,20 +1738,37 @@ int bdrv_snapshot_create(BlockDriverState *bs, BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; -if (!drv->bdrv_snapshot_create) -return -ENOTSUP; -return drv->bdrv_snapshot_create(bs, sn_info); +if (drv->bdrv_snapshot_create) +return drv->bdrv_snapshot_create(bs, sn_info); +if (bs->file) +return bdrv_snapshot_create(bs->file, sn_info); +return -ENOTSUP; } int bdrv_snapshot_goto(BlockDriverState *bs, const 
char *snapshot_id) { BlockDriver *drv = bs->drv; +int ret, open_ret; + if (!drv) return -ENOMEDIUM; -if (!drv->bdrv_snapshot_goto) -return -ENOTSUP; -return drv->bdrv_snapshot_goto(bs, snapshot_id); +if (drv->bdrv_snapshot_goto) +return drv->bdrv_snapshot_goto(bs, snapshot_id); + +if (bs->file) { +drv->bdrv_close(bs); +ret = bdrv_snapshot_goto(bs->file, snapshot_id); +open_ret = drv->bdrv_open(bs, bs->open_flags); +if (open_ret < 0) { +bdrv_delete(bs->file); +bs->drv = NULL; +return open_ret; +} +return ret; +} + +return -ENOTSUP; } int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id) @@ -1755,9 +1776,11 @@ int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id) BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; -if (!drv->bdrv_snapshot_delete) -return -ENOTSUP; -return drv->bdrv_snapshot_delete(bs, snapshot_id); +if (drv->bdrv_snapshot_delete) +return drv->bdrv_snapshot_delete(bs, snapshot_id); +if (bs->file) +return bdrv_snapshot_delete(bs->file, snapshot_id); +return -ENOTSUP; } int bdrv_snapshot_list(BlockDriverState *bs, @@ -1766,9 +1789,11 @@ int bdrv_snapshot_list(BlockDriverState *bs, BlockDriver *drv = bs->drv; if (!drv) return -ENOMEDIUM; -if (!drv->bdrv_snapshot_list) -return -ENOTSUP; -return drv->bdrv_snapshot_list(bs, psn_info); +if (drv->bdrv_snapshot_list) +return drv->bdrv_snapshot_list(bs, psn_info); +if (bs->file) +return bdrv_snapshot_list(bs->file, psn_info); +return -ENOTSUP; } #define NB_SUFFIXES 4 -- 1.5.6.5
[Qemu-devel] [RFC PATCH v4 0/3] Sheepdog: distributed storage system for QEMU
Hi all,

This patch adds a block driver for the Sheepdog distributed storage
system. Please consider it for inclusion.

I have addressed the comments on the 2nd patch (thanks Kevin!). The
remaining patches are unchanged from the previous version.

Changes from v3 to v4 are:
 - fix error handling in bdrv_snapshot_goto.

Changes from v2 to v3 are:
 - add drv->bdrv_close() and drv->bdrv_open() before and after the
   bdrv_snapshot_goto() call of the protocol.
 - address the review comments on the sheepdog driver code.

Changes from v1 to v2 are:
 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add a new patch to call the snapshot handler of the protocol

Thanks,

Kazutaka

MORITA Kazutaka (3):
  close all the block drivers before the qemu process exits
  block: call the snapshot handlers of the protocol drivers
  block: add sheepdog driver for distributed storage support

 Makefile.objs    |    2 +-
 block.c          |   70 ++-
 block.h          |    1 +
 block/sheepdog.c | 1835 ++
 vl.c             |    1 +
 5 files changed, 1890 insertions(+), 19 deletions(-)
 create mode 100644 block/sheepdog.c
[Qemu-devel] [RFC PATCH v4 1/3] close all the block drivers before the qemu process exits
This patch calls the close handler of the block driver before the qemu process exits. This is necessary because the sheepdog block driver releases the lock of VM images in the close handler. Signed-off-by: MORITA Kazutaka --- block.c |9 + block.h |1 + vl.c|1 + 3 files changed, 11 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index 24c63f6..da0dc47 100644 --- a/block.c +++ b/block.c @@ -646,6 +646,15 @@ void bdrv_close(BlockDriverState *bs) } } +void bdrv_close_all(void) +{ +BlockDriverState *bs; + +QTAILQ_FOREACH(bs, &bdrv_states, list) { +bdrv_close(bs); +} +} + void bdrv_delete(BlockDriverState *bs) { /* remove from list, if necessary */ diff --git a/block.h b/block.h index 756670d..25744b1 100644 --- a/block.h +++ b/block.h @@ -123,6 +123,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, /* Ensure contents are flushed to disk. */ void bdrv_flush(BlockDriverState *bs); void bdrv_flush_all(void); +void bdrv_close_all(void); int bdrv_has_zero_init(BlockDriverState *bs); int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors, diff --git a/vl.c b/vl.c index 7121cd0..8ffe36f 100644 --- a/vl.c +++ b/vl.c @@ -1992,6 +1992,7 @@ static void main_loop(void) vm_stop(r); } } +bdrv_close_all(); pause_all_vcpus(); } -- 1.5.6.5
[Qemu-devel] [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
Sheepdog is a distributed storage system for QEMU. It provides highly
available block-level storage volumes to VMs, like Amazon EBS. This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka
---
 Makefile.objs    |    2 +-
 block/sheepdog.c | 1835 ++
 2 files changed, 1836 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index 1a942e5..527a754 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..68545e8
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1835 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include
+#include
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block_int.h"
+
+#define SD_PROTO_VER 0x01
+
+#define SD_DEFAULT_ADDR "localhost:7000"
+
+#define SD_OP_CREATE_AND_WRITE_OBJ 0x01
+#define SD_OP_READ_OBJ 0x02
+#define SD_OP_WRITE_OBJ 0x03
+
+#define SD_OP_NEW_VDI 0x11
+#define SD_OP_LOCK_VDI 0x12
+#define SD_OP_RELEASE_VDI 0x13
+#define SD_OP_GET_VDI_INFO 0x14
+#define SD_OP_READ_VDIS 0x15
+
+#define SD_FLAG_CMD_WRITE 0x01
+#define SD_FLAG_CMD_COW 0x02
+
+#define SD_RES_SUCCESS 0x00 /* Success */
+#define SD_RES_UNKNOWN 0x01 /* Unknown error */
+#define SD_RES_NO_OBJ 0x02 /* No object found */
+#define SD_RES_EIO 0x03 /* I/O error */
+#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */
+#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */
+#define SD_RES_SYSTEM_ERROR 0x06 /* System error */
+#define SD_RES_VDI_LOCKED 0x07 /* Vdi is locked */
+#define SD_RES_NO_VDI 0x08 /* No vdi found */
+#define SD_RES_NO_BASE_VDI 0x09 /* No base vdi found */
+#define SD_RES_VDI_READ 0x0A /* Cannot read requested vdi */
+#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */
+#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */
+#define SD_RES_BASE_VDI_WRITE 0x0D /* Cannot write base vdi */
+#define SD_RES_NO_TAG 0x0E /* Requested tag is not found */
+#define SD_RES_STARTUP 0x0F /* Sheepdog is on starting up */
+#define SD_RES_VDI_NOT_LOCKED 0x10 /* Vdi is not locked */
+#define SD_RES_SHUTDOWN 0x11 /* Sheepdog is shutting down */
+#define SD_RES_NO_MEM 0x12 /* Cannot allocate memory */
+#define SD_RES_FULL_VDI 0x13 /* we already have the maximum vdis */
+#define SD_RES_VER_MISMATCH 0x14 /* Protocol version mismatch */
+#define SD_RES_NO_SPACE 0x15 /* Server has no room for new objects */
+#define SD_RES_WAIT_FOR_FORMAT 0x16 /* Sheepdog is waiting for a format operation */
+#define SD_RES_WAIT_FOR_JOIN 0x17 /* Sheepdog is waiting for other nodes joining */
+#define SD_RES_JOIN_FAILED 0x18 /* Target node had failed to join sheepdog */
+
+/*
+ * Object ID rules
+ *
+ *  0 - 19 (20 bits): data object space
+ * 20 - 31 (12 bits): reserved data object space
+ * 32 - 55 (24 bits): vdi object space
+ * 56 - 59 ( 4 bits): reserved vdi object space
+ * 60 - 63 ( 4 bits): object type identifier space
+ */
+
+#define VDI_SPACE_SHIFT 32
+#define VDI_BIT (UINT64_C(1) << 63)
+#define VMSTATE_BIT (UINT64_C(1) << 62)
+#define MAX_DATA_OBJS (1ULL << 20)
+#define MAX_CHILDREN 1024
+#define SD_MAX_VDI_LEN 256
+#define SD_NR_VDIS (1U << 24)
+#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)
+
+#define SD_INODE_SIZE (sizeof(SheepdogInode))
+#define CURRENT_VDI_ID 0
+
+typedef struct SheepdogReq {
+    uint8_t proto_ver;
+    uint8_t opcode;
+    uint16_t flags;
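The object ID layout described in the comment above can be captured in a few helpers. This is a sketch derived from the bit ranges and defines shown in the patch; the helper names (`vid_to_vdi_oid` and friends) are illustrative and not necessarily those used in sheepdog.c:

```c
#include <assert.h>
#include <stdint.h>

#define VDI_SPACE_SHIFT 32
#define VDI_BIT (UINT64_C(1) << 63)
#define MAX_DATA_OBJS (UINT64_C(1) << 20)

/* Bits 60-63 mark the object type (VDI_BIT for vdi objects), bits 32-55
 * hold the 24-bit vdi id, and bits 0-19 index a data object within one
 * vdi, following the layout in the patch's comment. */
static uint64_t vid_to_vdi_oid(uint32_t vid)
{
    return VDI_BIT | ((uint64_t)vid << VDI_SPACE_SHIFT);
}

static uint64_t vid_to_data_oid(uint32_t vid, uint64_t idx)
{
    assert(idx < MAX_DATA_OBJS);
    return ((uint64_t)vid << VDI_SPACE_SHIFT) | idx;
}

/* Recover the vdi id from any object id: shift out the data-object
 * index, then mask down to the 24-bit vdi object space. */
static uint32_t oid_to_vid(uint64_t oid)
{
    return (oid >> VDI_SPACE_SHIFT) & ((UINT32_C(1) << 24) - 1);
}
```

The 24-bit mask in `oid_to_vid` matches SD_NR_VDIS being (1U << 24): the vdi space and the type bits both live above the data-object index.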
[Qemu-devel] Re: [RFC PATCH v4 0/3] Sheepdog: distributed storage system for QEMU
At Wed, 02 Jun 2010 12:49:02 +0200, Kevin Wolf wrote: > > Am 28.05.2010 04:44, schrieb MORITA Kazutaka: > > Hi all, > > > > This patch adds a block driver for Sheepdog distributed storage > > system. Please consider for inclusion. > > Hint for next time: You should remove the RFC from the subject line if > you think the patch is ready for inclusion. Otherwise I might miss this > and think you only want comments on it. > Thanks for the advice. I'll do so the next time. > > MORITA Kazutaka (3): > > close all the block drivers before the qemu process exits > > block: call the snapshot handlers of the protocol drivers > > block: add sheepdog driver for distributed storage support > > Thanks, I have applied the first two patches to the block branch, they > look good to me. I'll send some comments for the third one (though it's > only coding style until now). > Thanks a lot. Kazutaka
[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
At Tue, 01 Jun 2010 09:58:04 -0500, Chris Krumme wrote:

Thanks for your comments!

> On 05/27/2010 09:44 PM, MORITA Kazutaka wrote:
> > Sheepdog is a distributed storage system for QEMU. It provides highly
> > +
> > +static int connect_to_sdog(const char *addr)
> > +{
> > +    char buf[64];
> > +    char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];
> > +    char name[256], *p;
> > +    int fd, ret;
> > +    struct addrinfo hints, *res, *res0;
> > +    int port = 0;
> > +
> > +    if (!addr) {
> > +        addr = SD_DEFAULT_ADDR;
> > +    }
> > +
> > +    strcpy(name, addr);
> 
> Can strlen(addr) be > sizeof(name)?

Yes, we should check the length of addr. This could cause an overflow.

> > +
> > +    p = name;
> > +    while (*p) {
> > +        if (*p == ':') {
> > +            *p++ = '\0';
> 
> May also need to check for p > name + sizeof(name).

p should be NULL-terminated, so the check is not required, I think.

> > +            break;
> > +        } else {
> > +            p++;
> > +        }
> > +    }
> > +
> > +    if (*p == '\0') {
> > +        error_report("cannot find a port number, %s\n", name);
> > +        return -1;
> > +    }
> > +    port = strtol(p, NULL, 10);
> 
> Are negative numbers valid here?

No. It is better to use strtoul.

> > +
> > +static int parse_vdiname(BDRVSheepdogState *s, const char *filename,
> > +                         char *vdi, int vdi_len, uint32_t *snapid)
> > +{
> > +    char *p, *q;
> > +    int nr_sep;
> > +
> > +    p = q = strdup(filename);
> > +
> > +    if (!p) {
> 
> I think Qemu has a version of strdup that will not return NULL.

Yes. We can use qemu_strdup here.
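The overflow Chris points out comes from strcpy()ing an arbitrary-length address into the fixed 256-byte `name` buffer. A minimal bounded-copy sketch; qemu's own pstrcpy() serves this purpose and is what the real fix would presumably use, so `safe_strcpy` here is just illustrative:

```c
#include <stdio.h>
#include <string.h>

/* Copy at most buf_size-1 bytes and always nul-terminate, similar to
 * qemu's pstrcpy().  Note strncpy() alone is not enough: it does not
 * terminate the destination when the source fills the buffer. */
static void safe_strcpy(char *buf, size_t buf_size, const char *str)
{
    if (buf_size == 0) {
        return;
    }
    /* snprintf truncates and terminates for us. */
    snprintf(buf, buf_size, "%s", str);
}
```

With this, an oversized `addr` gets truncated instead of running past `name[256]`; the caller can additionally reject addresses longer than the buffer up front if truncation is not acceptable.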
> > + > > +/* TODO: error cleanups */ > > +static int sd_open(BlockDriverState *bs, const char *filename, int flags) > > +{ > > + int ret, fd; > > + uint32_t vid = 0; > > + BDRVSheepdogState *s = bs->opaque; > > + char vdi[256]; > > + uint32_t snapid; > > + int for_snapshot = 0; > > + char *buf; > > + > > + strstart(filename, "sheepdog:", (const char **)&filename); > > + > > + buf = qemu_malloc(SD_INODE_SIZE); > > + > > + memset(vdi, 0, sizeof(vdi)); > > + if (parse_vdiname(s, filename, vdi, sizeof(vdi),&snapid)< 0) { > > + goto out; > > + } > > + s->fd = get_sheep_fd(s); > > + if (s->fd< 0) { > > > > buf is not freed, goto out maybe. > Yes, we should goto out here. > > + > > +static int do_sd_create(const char *addr, char *filename, char *tag, > > + int64_t total_sectors, uint32_t base_vid, > > + uint32_t *vdi_id, int snapshot) > > +{ > > + SheepdogVdiReq hdr; > > + SheepdogVdiRsp *rsp = (SheepdogVdiRsp *)&hdr; > > + int fd, ret; > > + unsigned int wlen, rlen = 0; > > + char buf[SD_MAX_VDI_LEN]; > > + > > + fd = connect_to_sdog(addr); > > + if (fd< 0) { > > + return -1; > > + } > > + > > + strncpy(buf, filename, SD_MAX_VDI_LEN); > > + > > + memset(&hdr, 0, sizeof(hdr)); > > + hdr.opcode = SD_OP_NEW_VDI; > > + hdr.base_vdi_id = base_vid; > > + > > + wlen = SD_MAX_VDI_LEN; > > + > > + hdr.flags = SD_FLAG_CMD_WRITE; > > + hdr.snapid = snapshot; > > + > > + hdr.data_length = wlen; > > + hdr.vdi_size = total_sectors * 512; > > > > There is another patch on the list changing 512 to a define for sector size. > OK. We'll define SECTOR_SIZE. 
> > + > > + ret = do_req(fd, (SheepdogReq *)&hdr, buf,&wlen,&rlen); > > + > > + close(fd); > > + > > + if (ret) { > > + return -1; > > + } > > + > > + if (rsp->result != SD_RES_SUCCESS) { > > + error_report("%s, %s\n", sd_strerror(rsp->result), filename); > > + return -1; > > + } > > + > > + if (vdi_id) { > > + *vdi_id = rsp->vdi_id; > > + } > > + > > + return 0; > > +} > > + > > +static int sd_create(const char *filename, QEMUOptionParameter *options) > > +{ > > + int ret; > > + uint32_t vid = 0; > > + int64_t total_sectors = 0; > > + char *backing_file = NULL; > > + > &
[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
At Wed, 02 Jun 2010 15:55:42 +0200, Kevin Wolf wrote:
> 
> Am 28.05.2010 04:44, schrieb MORITA Kazutaka:
> > Sheepdog is a distributed storage system for QEMU. It provides highly
> > available block level storage volumes to VMs like Amazon EBS. This
> > patch adds a qemu block driver for Sheepdog.
> >
> > Sheepdog features are:
> > - No node in the cluster is special (no metadata node, no control
> >   node, etc)
> > - Linear scalability in performance and capacity
> > - No single point of failure
> > - Autonomous management (zero configuration)
> > - Useful volume management support such as snapshot and cloning
> > - Thin provisioning
> > - Autonomous load balancing
> >
> > The more details are available at the project site:
> > http://www.osrg.net/sheepdog/
> >
> > Signed-off-by: MORITA Kazutaka
> > ---
> >  Makefile.objs    |    2 +-
> >  block/sheepdog.c | 1835 ++
> >  2 files changed, 1836 insertions(+), 1 deletions(-)
> >  create mode 100644 block/sheepdog.c
> 
> One general thing: The code uses some mix of spaces and tabs for
> indentation, with the greatest part using tabs. According to
> CODING_STYLE it should consistently use four spaces instead.

OK. I'll fix the indentation according to CODING_STYLE.

> > +
> > +typedef struct SheepdogInode {
> > +    char name[SD_MAX_VDI_LEN];
> > +    uint64_t ctime;
> > +    uint64_t snap_ctime;
> > +    uint64_t vm_clock_nsec;
> > +    uint64_t vdi_size;
> > +    uint64_t vm_state_size;
> > +    uint16_t copy_policy;
> > +    uint8_t nr_copies;
> > +    uint8_t block_size_shift;
> > +    uint32_t snap_id;
> > +    uint32_t vdi_id;
> > +    uint32_t parent_vdi_id;
> > +    uint32_t child_vdi_id[MAX_CHILDREN];
> > +    uint32_t data_vdi_id[MAX_DATA_OBJS];
> 
> Wow, this is a huge array. :-)
> 
> So Sheepdog has a fixed limit of 16 TB, right?

MAX_DATA_OBJS is (1 << 20), and the size of an object is 4 MB. So the
limit of the Sheepdog image size is 4 TB. These values are hard-coded,
and I guess they should be configurable.
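The corrected figure follows directly from the constants in the patch: 2^20 data objects per vdi, each 2^22 bytes (4 MB), gives 2^42 bytes, i.e. 4 TB rather than 16 TB. A quick check of the arithmetic:

```c
#include <stdint.h>

/* Constants as defined in the sheepdog patch. */
#define MAX_DATA_OBJS    (UINT64_C(1) << 20)  /* data objects per vdi */
#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22)  /* 4 MB per data object */

/* Maximum image size = number of data objects times object size. */
static uint64_t sd_max_vdi_size(void)
{
    return MAX_DATA_OBJS * SD_DATA_OBJ_SIZE;
}
```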
> 
> > +} SheepdogInode;
> > +
> > +
> > +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
> > +{
> > +    SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
> > +
> > +    acb->canceled = 1;
> > +}
> 
> Does this provide the right semantics? You haven't really cancelled the
> request, but you pretend to. So you actually complete the request in the
> background and then throw the return code away.
> 
> I seem to remember that posix-aio-compat.c waits at this point for
> completion of the requests, calls the callbacks and only afterwards
> returns from aio_cancel when no more requests are in flight.
> 
> Or if you can really cancel requests, it would be the best option, of
> course.

Sheepdog cannot cancel requests that have already been sent to the
servers. So, as you say, we pretend to cancel the requests without
waiting for them to complete. However, is there any situation where
pretending to cancel causes problems in practice?

To wait for the requests to complete here, we may need to create
another thread for processing I/O, like posix-aio-compat.c does.

> > +
> > +static int do_send_recv(int sockfd, struct iovec *iov, int len, int offset,
> > +                        int write)
> 
> I've spent at least 15 minutes figuring out what this function does. I
> think I've got it now more or less, but I've come to the conclusion that
> this code needs more comments.
> 
> I'd suggest to add a header comment to all non-trivial functions and
> maybe somewhere on the top a general description of how things work.
> 
> As far as I understood now, there are basically two parts of request
> handling:
> 
> 1. The request is sent to the server. Its AIOCB is saved in a list in
> the BDRVSheepdogState. It doesn't pass a callback or anything for the
> completion.
> 
> 2. aio_read_response is registered as a fd handler to the sheepdog
> connection. When the server responds, it searches the right AIOCB in the
> list and the second part of request handling starts.
> 
> do_send_recv is the function that is used to do all communication with
> the server. The iov stuff looks like it's only used for some data, but
> seems this is not true - it's also used for the metadata of the protocol.
> 
> Did I understand it right so far?

Yes, exactly. I'll add comments to make the code more readable.

> > +{
> > +    struct msghdr msg;
> > +    int ret, diff;
> >
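The part of do_send_recv that puzzled the reviewer is transferring an iovec starting at an arbitrary byte offset. A self-contained sketch of just that offset bookkeeping; `iov_skip` is a hypothetical helper for illustration, not code from the patch:

```c
#include <stddef.h>
#include <sys/uio.h>

/* Find which element of 'iov' contains byte 'offset', and how far into
 * that element the byte lies.  do_send_recv needs this so a partial
 * sendmsg()/recvmsg() can resume mid-iovec on the next call. */
static int iov_skip(const struct iovec *iov, int iovcnt,
                    size_t offset, size_t *inner)
{
    int i;

    for (i = 0; i < iovcnt && offset >= iov[i].iov_len; i++) {
        offset -= iov[i].iov_len;   /* consume whole elements */
    }
    *inner = offset;                /* remainder within element i */
    return i;
}
```

The real function would then build a `struct msghdr` whose `msg_iov` starts at element `i`, with the first element's base advanced by `*inner`, before calling sendmsg() or recvmsg().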
[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
At Fri, 04 Jun 2010 13:04:00 +0200, Kevin Wolf wrote: > > Am 03.06.2010 18:23, schrieb MORITA Kazutaka: > >>> +static void sd_aio_cancel(BlockDriverAIOCB *blockacb) > >>> +{ > >>> + SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb; > >>> + > >>> + acb->canceled = 1; > >>> +} > >> > >> Does this provide the right semantics? You haven't really cancelled the > >> request, but you pretend to. So you actually complete the request in the > >> background and then throw the return code away. > >> > >> I seem to remember that posix-aio-compat.c waits at this point for > >> completion of the requests, calls the callbacks and only afterwards > >> returns from aio_cancel when no more requests are in flight. > >> > >> Or if you can really cancel requests, it would be the best option, of > >> course. > >> > > > > Sheepdog cannot cancel the requests which are already sent to the > > servers. So, as you say, we pretend to cancel the requests without > > waiting for completion of them. However, are there any situation > > where pretending to cancel causes problems in practice? > > I'm not sure how often it would happen in practice, but if the guest OS > thinks the old value is on disk when in fact the new one is, this could > lead to corruption. I think if it can happen, even without evidence that > it actually does, it's already relevant enough. > I agree. > > To wait for completion of the requests here, we may need to create > > another thread for processing I/O like posix-aio-compat.c. > > I don't think you need a thread to get the same behaviour, you just need > to call the fd handlers like in the main loop. It would probably be the > first driver doing this, though, and it's not an often used code path, > so it might be a bad idea. > > Maybe it's reasonable to just complete the request with -EIO? This way > the guest couldn't make any assumption about the data written. On the > other hand, it could be unhappy about failed requests, but that's > probably better than corruption. 
> Completing with -EIO looks good to me. Thanks for the advice. I'll send an updated patch tomorrow. Regards, Kazutaka
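The behaviour agreed on above can be sketched as follows, with simplified hypothetical stand-ins for the qemu AIO types; the actual sd_aio_cancel in the v5 patch may differ in detail:

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the qemu AIO callback types (illustrative). */
typedef void BlockCompletionFunc(void *opaque, int ret);

typedef struct SheepdogAIOCB {
    BlockCompletionFunc *cb;
    void *opaque;
    int canceled;
} SheepdogAIOCB;

/* Example completion callback that records the result. */
static int last_ret;
static void store_ret(void *opaque, int ret)
{
    (void)opaque;
    last_ret = ret;
}

/* Rather than silently discarding the result of a request that is
 * already in flight, complete it with -EIO: the guest then cannot
 * assume either the old or the new data reached the disk. */
static void sd_aio_cancel(SheepdogAIOCB *acb)
{
    acb->canceled = 1;
    acb->cb(acb->opaque, -EIO);
}
```

When the real response later arrives from the server, the fd handler would see `canceled` set and drop the result instead of completing the request a second time.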
[Qemu-devel] [PATCH v5] block: add sheepdog driver for distributed storage support
Sheepdog is a distributed storage system for QEMU. It provides highly
available block-level storage volumes to VMs, like Amazon EBS. This
patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site:
http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka
---
Changes from v4 to v5 are:
 - address the comments on the sheepdog driver (Thanks Kevin, Chris!)
   -- fix coding style issues
   -- fix aio_cancel handling
   -- fix an overflow bug in copying the hostname
   -- add comments to the non-trivial functions
 - remove already applied patches from the patchset

Changes from v3 to v4 are:
 - fix error handling in bdrv_snapshot_goto.

Changes from v2 to v3 are:
 - add drv->bdrv_close() and drv->bdrv_open() before and after the
   bdrv_snapshot_goto() call of the protocol.
 - address the review comments on the sheepdog driver code.
Changes from v1 to v2 are:
 - rebase onto git://repo.or.cz/qemu/kevin.git block
 - modify the sheepdog driver as a protocol driver
 - add new patch to call the snapshot handler of the protocol

 Makefile.objs    |    2 +-
 block/sheepdog.c | 1905 ++
 2 files changed, 1906 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index 54dec26..070db8a 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..a9477a5
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1905 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */ +#include +#include + +#include "qemu-common.h" +#include "qemu-error.h" +#include "block_int.h" + +#define SD_PROTO_VER 0x01 + +#define SD_DEFAULT_ADDR "localhost" +#define SD_DEFAULT_PORT "7000" + +#define SD_OP_CREATE_AND_WRITE_OBJ 0x01 +#define SD_OP_READ_OBJ 0x02 +#define SD_OP_WRITE_OBJ 0x03 + +#define SD_OP_NEW_VDI0x11 +#define SD_OP_LOCK_VDI 0x12 +#define SD_OP_RELEASE_VDI0x13 +#define SD_OP_GET_VDI_INFO 0x14 +#define SD_OP_READ_VDIS 0x15 + +#define SD_FLAG_CMD_WRITE0x01 +#define SD_FLAG_CMD_COW 0x02 + +#define SD_RES_SUCCESS 0x00 /* Success */ +#define SD_RES_UNKNOWN 0x01 /* Unknown error */ +#define SD_RES_NO_OBJ0x02 /* No object found */ +#define SD_RES_EIO 0x03 /* I/O error */ +#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */ +#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */ +#define SD_RES_SYSTEM_ERROR 0x06 /* System error */ +#define SD_RES_VDI_LOCKED0x07 /* Vdi is locked */ +#define SD_RES_NO_VDI0x08 /* No vdi found */ +#define SD_RES_NO_BASE_VDI 0x09 /* No base vdi found */ +#define SD_RES_VDI_READ 0x0A /* Cannot read requested vdi */ +#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */ +#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */ +#define SD_RES_BASE_VDI_WRITE 0x0D /* Cannot write base vdi */ +#define SD_RES_NO_TAG0x0E /* Requested tag is not found */ +#define SD_RES_STARTUP 0x0F /* Sheepdog is on starting up */ +#define SD_RES_VDI_NOT_LOCKED 0x10 /* Vdi is not locked */ +#define SD_RES_SHUTDOWN 0x11 /* Sheepdog is shutting down */ +#define SD_RES_NO_MEM0x12 /* Cannot allocate memory */ +#define SD_RES_FULL_VDI 0x13 /* we already have the maximum vdis */ +#define SD_RES_VER_MISMATCH 0x14 /* Protocol version mismatch */ +#define SD_RES_NO_SPACE 0x15 /* Server has no room for new objects */ +#define SD_RES_WAIT_FOR_FORMAT 0x16 /* Waiting for a format operation */ +#define SD_RES_WAIT_FOR_JOIN0x17 /* Waiting for other nodes joining */ +#define SD_RES_JOIN_FAILED 0x18 /* Target node had failed to join 
sheepdog */ + +/* + * Object ID rule
Re: [Qemu-devel] [PATCH v4] savevm: Really verify if a drive supports snapshots
At Fri, 4 Jun 2010 16:35:59 -0300,
Miguel Di Ciurcio Filho wrote:
> 
> Both bdrv_can_snapshot() and bdrv_has_snapshot() do not work as advertised.
> 
> First issue: Their names imply different purposes, but they do the same thing
> and have exactly the same code. Maybe copied and pasted and forgotten?
> bdrv_has_snapshot() is called in various places to actually check whether
> there are snapshots or not.
> 
> Second issue: the way bdrv_can_snapshot() verifies whether a block driver
> supports snapshots does not catch all cases. E.g.: a raw image.
> 
> So when do_savevm() is called, the first thing it does is to set a global
> BlockDriverState to save the VM memory state, calling get_bs_snapshots().
> 
> static BlockDriverState *get_bs_snapshots(void)
> {
>     BlockDriverState *bs;
>     DriveInfo *dinfo;
> 
>     if (bs_snapshots)
>         return bs_snapshots;
>     QTAILQ_FOREACH(dinfo, &drives, next) {
>         bs = dinfo->bdrv;
>         if (bdrv_can_snapshot(bs))
>             goto ok;
>     }
>     return NULL;
> ok:
>     bs_snapshots = bs;
>     return bs;
> }
> 
> bdrv_can_snapshot() may return a BlockDriverState that does not support
> snapshots and do_savevm() goes on.
> 
> Later on in do_savevm(), we find:
> 
> QTAILQ_FOREACH(dinfo, &drives, next) {
>     bs1 = dinfo->bdrv;
>     if (bdrv_has_snapshot(bs1)) {
>         /* Write VM state size only to the image that contains the state */
>         sn->vm_state_size = (bs == bs1 ? vm_state_size : 0);
>         ret = bdrv_snapshot_create(bs1, sn);
>         if (ret < 0) {
>             monitor_printf(mon, "Error while creating snapshot on '%s'\n",
>                            bdrv_get_device_name(bs1));
>         }
>     }
> }
> 
> bdrv_has_snapshot(bs1) is not checking whether the device supports or has
> snapshots, as explained above. Only in bdrv_snapshot_create() is the device
> actually checked for snapshot support.
> 
> So, in cases where the first device supports snapshots and the second does
> not, the snapshot on the first will happen anyway. I believe this is not a
> good behavior. It should be an all or nothing process.
> 
> This patch addresses these issues by making bdrv_can_snapshot() actually do
> what it must do and enforces better tests to avoid errors in the middle of
> do_savevm(). bdrv_has_snapshot() is removed and replaced by
> bdrv_can_snapshot() where appropriate.
> 
> bdrv_can_snapshot() was moved from savevm.c to block.c. It makes more sense
> to me.
> 
> The loadvm_state() function was updated too to enforce that when loading a VM
> at least all writable devices must support snapshots too.
> 
> Signed-off-by: Miguel Di Ciurcio Filho
> ---
>  block.c  | 11 +++
>  block.h  |  1 +
>  savevm.c | 58 --
>  3 files changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/block.c b/block.c
> index cd70730..ace3cdb 100644
> --- a/block.c
> +++ b/block.c
> @@ -1720,6 +1720,17 @@ void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
>  /**/
>  /* handling of snapshots */
>  
> +int bdrv_can_snapshot(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +    if (!drv || !drv->bdrv_snapshot_create || bdrv_is_removable(bs) ||
> +        bdrv_is_read_only(bs)) {
> +        return 0;
> +    }
> +
> +    return 1;
> +}
> +

The underlying protocol could support snapshots, so I think we should check
against bs->file too.

--- a/block.c
+++ b/block.c
@@ -1671,6 +1671,9 @@ int bdrv_can_snapshot(BlockDriverState *bs)
     BlockDriver *drv = bs->drv;
     if (!drv || !drv->bdrv_snapshot_create || bdrv_is_removable(bs) ||
         bdrv_is_read_only(bs)) {
+        if (bs->file) {
+            return bdrv_can_snapshot(bs->file);
+        }
         return 0;
     }

Regards,

Kazutaka
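The layering argument can be sketched outside of qemu's code base; below is a toy model of the format-over-protocol chain (all type and field names are invented for illustration, not qemu's), showing why a format layer that cannot snapshot should still defer to the protocol layer underneath it:

```c
#include <stddef.h>

typedef struct Driver {
    int has_snapshot_create;    /* stand-in for drv->bdrv_snapshot_create */
} Driver;

typedef struct State {
    const Driver *drv;
    struct State *file;         /* underlying protocol layer, may be NULL */
    int read_only;
} State;

/* Walk down the layer chain: a layer that cannot snapshot by itself may
 * still sit on top of a protocol (e.g. sheepdog) that can. */
static int can_snapshot(const State *bs)
{
    if (!bs->drv || !bs->drv->has_snapshot_create || bs->read_only) {
        if (bs->file) {
            return can_snapshot(bs->file);
        }
        return 0;
    }
    return 1;
}
```

With a raw format driver on top of a snapshot-capable protocol, the recursive check succeeds where the single-layer check would have refused.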
[Qemu-devel] Re: [PATCH v5] block: add sheepdog driver for distributed storage support
At Tue, 15 Jun 2010 10:24:14 +0200, Kevin Wolf wrote: > > Am 14.06.2010 21:48, schrieb MORITA Kazutaka: > >> 3) qemu-io aio_read/write doesn't seem to work well with it. I only get > >> the result of the AIO request when I exit qemu-io. This may be a qemu-io > >> problem or a Sheepdog one. We need to look into this, qemu-io is > >> important for testing and debugging (particularly for qemu-iotests) > >> > > Sheepdog receives responses from the server in the fd handler to the > > socket connection. But, while qemu-io executes aio_read/aio_write, it > > doesn't call qemu_aio_wait() and the fd handler isn't invoked at all. > > This seems to be the reason of the problem. > > > > I'm not sure this is a qemu-io problem or a Sheepdog one. If it is a > > qemu-io problem, we need to call qemu_aio_wait() somewhere in the > > command_loop(), I guess. If it is a Sheepdog problem, we need to > > consider another mechanism to receive responses... > > Not sure either. > > I think posix-aio-compat needs fd handlers to be called, too, and it > kind of works. I'm saying "kind of" because after an aio_read/write > command qemu-io exits (it doesn't with Sheepdog). And when exiting there > is a qemu_aio_wait(), so this explains why you get a result there. > > I guess it's a bug in the posix-aio-compat case rather than with Sheepdog. > It seems that fgets() is interrupted by a signal in fetchline() and qemu-io exits. BTW, I think we should call the fd handlers when user input is idle and the fds become ready. I'll send the patch later. > The good news is that if qemu-iotests works with only one aio_read/write > command before qemu-io exits, it's going to work with Sheepdog, too. > Great! Thanks, Kazutaka
[Qemu-devel] [PATCH 1/2] qemu-io: retry fgets() when errno is EINTR
posix-aio-compat sends a signal in aio operations, so we should consider
that fgets() could be interrupted here.

Signed-off-by: MORITA Kazutaka
---
 cmd.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/cmd.c b/cmd.c
index 2336334..460df92 100644
--- a/cmd.c
+++ b/cmd.c
@@ -272,7 +272,10 @@ fetchline(void)
         return NULL;
     printf("%s", get_prompt());
     fflush(stdout);
+again:
     if (!fgets(line, MAXREADLINESZ, stdin)) {
+        if (errno == EINTR)
+            goto again;
         free(line);
         return NULL;
     }
-- 
1.5.6.5
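The retry-on-EINTR pattern the patch applies is the standard one for any slow system call; a standalone sketch (the helper name `xread` is illustrative, not part of the patch or qemu):

```c
#include <errno.h>
#include <unistd.h>

/* Retry read() when a signal interrupts it before any data arrives.
 * Without this, a completion signal (as sent by posix-aio-compat)
 * makes the call fail spuriously with errno == EINTR. */
static ssize_t xread(int fd, void *buf, size_t count)
{
    ssize_t ret;

    do {
        ret = read(fd, buf, count);
    } while (ret < 0 && errno == EINTR);

    return ret;
}
```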
[Qemu-devel] [PATCH 2/2] qemu-io: check registered fds in command_loop()
Some block drivers use an aio handler and do I/O completion routines in it. However, the handler is not invoked if we only do aio_read/write, because registered fds are not checked at all. This patch registers a command processing function as a fd handler to STDIO, and calls qemu_aio_wait() in command_loop(). Any other handlers can be invoked when user input is idle. Signed-off-by: MORITA Kazutaka --- cmd.c | 53 +++-- 1 files changed, 39 insertions(+), 14 deletions(-) diff --git a/cmd.c b/cmd.c index 460df92..2b66e24 100644 --- a/cmd.c +++ b/cmd.c @@ -24,6 +24,7 @@ #include #include "cmd.h" +#include "qemu-aio.h" #define _(x) x /* not gettext support yet */ @@ -149,6 +150,37 @@ add_args_command( args_func = af; } +static char *get_prompt(void); + +static void do_command(void *opaque) +{ + int c; + int *done = opaque; + char*input; + char**v; + const cmdinfo_t *ct; + + if ((input = fetchline()) == NULL) { + *done = 1; + return; + } + v = breakline(input, &c); + if (c) { + ct = find_command(v[0]); + if (ct) + *done = command(ct, c, v); + else + fprintf(stderr, _("command \"%s\" not found\n"), + v[0]); + } + doneline(input, v); + + if (*done == 0) { + printf("%s", get_prompt()); + fflush(stdout); + } +} + void command_loop(void) { @@ -186,20 +218,15 @@ command_loop(void) free(cmdline); return; } + + printf("%s", get_prompt()); + fflush(stdout); + + qemu_aio_set_fd_handler(STDIN_FILENO, do_command, NULL, NULL, NULL, &done); while (!done) { - if ((input = fetchline()) == NULL) - break; - v = breakline(input, &c); - if (c) { - ct = find_command(v[0]); - if (ct) - done = command(ct, c, v); - else - fprintf(stderr, _("command \"%s\" not found\n"), - v[0]); - } - doneline(input, v); + qemu_aio_wait(); } + qemu_aio_set_fd_handler(STDIN_FILENO, NULL, NULL, NULL, NULL, NULL); } /* from libxcmd/input.c */ @@ -270,8 +297,6 @@ fetchline(void) if (!line) return NULL; - printf("%s", get_prompt()); - fflush(stdout); again: if (!fgets(line, MAXREADLINESZ, stdin)) { if (errno == 
EINTR) -- 1.5.6.5
[Qemu-devel] [PATCH 0/2] qemu-io: fix aio_read/write problems
Hi, This patchset fixes the following qemu-io problems: - Qemu-io exits suddenly when we do aio_read/write to drivers which use posix-aio-compat. - We cannot get the results of aio_read/write if we don't do other operations. This problem occurs when the block driver uses a fd handler to get I/O completion. Thanks, Kazutaka MORITA Kazutaka (2): qemu-io: retry fgets() when errno is EINTR qemu-io: check registered fds in command_loop() cmd.c | 56 ++-- 1 files changed, 42 insertions(+), 14 deletions(-)
Re: [Qemu-devel] Re: [PATCH 1/2] qemu-io: retry fgets() when errno is EINTR
At Wed, 16 Jun 2010 13:04:47 +0200,
Kevin Wolf wrote:
> 
> Am 15.06.2010 19:53, schrieb MORITA Kazutaka:
> > posix-aio-compat sends a signal in aio operations, so we should
> > consider that fgets() could be interrupted here.
> > 
> > Signed-off-by: MORITA Kazutaka
> > ---
> >  cmd.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > diff --git a/cmd.c b/cmd.c
> > index 2336334..460df92 100644
> > --- a/cmd.c
> > +++ b/cmd.c
> > @@ -272,7 +272,10 @@ fetchline(void)
> >          return NULL;
> >      printf("%s", get_prompt());
> >      fflush(stdout);
> > +again:
> >      if (!fgets(line, MAXREADLINESZ, stdin)) {
> > +        if (errno == EINTR)
> > +            goto again;
> >          free(line);
> >          return NULL;
> >      }
> 
> This looks like a loop replaced by goto (and braces are missing). What
> about this instead?
> 
> do {
>     ret = fgets(...)
> } while (ret == NULL && errno == EINTR)
> 
> if (ret == NULL) {
>     fail
> }
> 

I agree.

However, it seems that my second patch has already solved the problem. We
register these readline routines as an aio handler now, so fgets() does not
block and cannot return with EINTR.

This patch is no longer needed, sorry.

Thanks,

Kazutaka
Re: [Qemu-devel] Re: [PATCH 1/2] qemu-io: retry fgets() when errno is EINTRg
At Thu, 17 Jun 2010 18:18:18 +0100,
Jamie Lokier wrote:
> 
> Kevin Wolf wrote:
> > Am 16.06.2010 18:52, schrieb MORITA Kazutaka:
> > > At Wed, 16 Jun 2010 13:04:47 +0200,
> > > Kevin Wolf wrote:
> > >>
> > >> Am 15.06.2010 19:53, schrieb MORITA Kazutaka:
> > >>> posix-aio-compat sends a signal in aio operations, so we should
> > >>> consider that fgets() could be interrupted here.
> > >>>
> > >>> Signed-off-by: MORITA Kazutaka
> > >>> ---
> > >>>  cmd.c |    3 +++
> > >>>  1 files changed, 3 insertions(+), 0 deletions(-)
> > >>>
> > >>> diff --git a/cmd.c b/cmd.c
> > >>> index 2336334..460df92 100644
> > >>> --- a/cmd.c
> > >>> +++ b/cmd.c
> > >>> @@ -272,7 +272,10 @@ fetchline(void)
> > >>>          return NULL;
> > >>>      printf("%s", get_prompt());
> > >>>      fflush(stdout);
> > >>> +again:
> > >>>      if (!fgets(line, MAXREADLINESZ, stdin)) {
> > >>> +        if (errno == EINTR)
> > >>> +            goto again;
> > >>>          free(line);
> > >>>          return NULL;
> > >>>      }
> > >>
> > >> This looks like a loop replaced by goto (and braces are missing). What
> > >> about this instead?
> > >>
> > >> do {
> > >>     ret = fgets(...)
> > >> } while (ret == NULL && errno == EINTR)
> > >>
> > >> if (ret == NULL) {
> > >>     fail
> > >> }
> > >>
> > >
> > > I agree.
> > >
> > > However, it seems that my second patch have already solved the
> > > problem. We register this readline routines as an aio handler now, so
> > > fgets() does not block and cannot return with EINTR.
> > >
> > > This patch looks no longer needed, sorry.
> > 
> > Good point. Thanks for having a look.
> 
> Anyway, are you sure stdio functions can be interrupted with EINTR?
> Linus reminds us that some stdio functions have to retry internally
> anyway:
> 
> http://comments.gmane.org/gmane.comp.version-control.git/18285
> 

I think it is a separate question whether fgets() retries internally when a
read system call is interrupted; we should handle EINTR whenever the
underlying system call can fail with it.

I think a read() doesn't return with EINTR if it doesn't block on Linux, but
that may not be true on other operating systems. I'm sending the fixed patch.
I'm not sure this patch is really needed, but it doesn't hurt anyway.

=
posix-aio-compat sends a signal in aio operations, so we should
consider that fgets() could be interrupted here.

Signed-off-by: MORITA Kazutaka
---
 cmd.c | 14 +-
 1 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/cmd.c b/cmd.c
index aee2a38..733bacd 100644
--- a/cmd.c
+++ b/cmd.c
@@ -293,14 +293,18 @@ fetchline(void)
 char *
 fetchline(void)
 {
-    char *p, *line = malloc(MAXREADLINESZ);
+    char *p, *line = malloc(MAXREADLINESZ), *ret;
 
     if (!line)
         return NULL;
-    if (!fgets(line, MAXREADLINESZ, stdin)) {
-        free(line);
-        return NULL;
-    }
+    do {
+        ret = fgets(line, MAXREADLINESZ, stdin);
+    } while (ret == NULL && errno == EINTR);
+
+    if (ret == NULL) {
+        free(line);
+        return NULL;
+    }
     p = line + strlen(line);
     if (p != line && p[-1] == '\n')
         p[-1] = '\0';
-- 
1.5.6.5
Re: [Qemu-devel] [PATCH] get rid of private bitmap functions in block/sheepdog.c, use generic ones
On Thu, Mar 10, 2011 at 11:03 PM, Michael Tokarev wrote: > qemu now has generic bitmap functions, > so don't redefine them in sheepdog.c, > use common header instead. A small cleanup. > > Here's only one function which is actually > used in sheepdog and gets replaced with > a generic one (simplified): > > - static inline int test_bit(int nr, const volatile unsigned long *addr) > + static inline int test_bit(int nr, const unsigned long *addr) > { > - return ((1UL << (nr % BITS_PER_LONG)) > & ((unsigned long*)addr)[nr / BITS_PER_LONG])) != 0; > + return 1UL & (addr[nr / BITS_PER_LONG] >> (nr & (BITS_PER_LONG-1))); > } > > The body is equivalent, but the argument is not: there's > "volatile" in there. Why it is used for - I'm not sure. > > Signed-off-by: Michael Tokarev Looks good. Thanks! Acked-by: MORITA Kazutaka
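For reference, generic bit helpers of the kind the patch switches to typically look like this (a standalone sketch following the simplified `test_bit` quoted above, not the exact qemu header):

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit 'nr' in a bitmap laid out as an array of unsigned long. */
static inline void set_bit(int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Test bit 'nr'; equivalent to the sheepdog-local version, minus the
 * 'volatile' qualifier that the generic variant drops. */
static inline int test_bit(int nr, const unsigned long *addr)
{
    return 1UL & (addr[nr / BITS_PER_LONG] >> (nr & (BITS_PER_LONG - 1)));
}
```

Since `BITS_PER_LONG` is a power of two, `nr % BITS_PER_LONG` and `nr & (BITS_PER_LONG - 1)` compute the same bit offset, so the two bodies really are equivalent.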
[Qemu-devel] [PATCH 0/3] sheepdog: fix aio related issues
This patchset fixes the Sheepdog AIO problems pointed out in:

http://lists.gnu.org/archive/html/qemu-devel/2011-02/msg02495.html
http://lists.gnu.org/archive/html/qemu-devel/2011-02/msg02474.html

Thanks,

Kazutaka

MORITA Kazutaka (3):
  sheepdog: make send/recv operations non-blocking
  sheepdog: allow cancellation of I/Os which are not processed yet
  sheepdog: avoid accessing a buffer of the canceled I/O request

 block/sheepdog.c | 462 +++---
 1 files changed, 334 insertions(+), 128 deletions(-)
[Qemu-devel] [PATCH 3/3] sheepdog: avoid accessing a buffer of the canceled I/O request
We cannot access the buffer of the canceled I/O request because its AIOCB callback is already called and the buffer is not valid. Signed-off-by: MORITA Kazutaka --- block/sheepdog.c | 12 ++-- 1 files changed, 10 insertions(+), 2 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index ed98701..6f60721 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -79,6 +79,7 @@ #define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22) #define SD_MAX_VDI_SIZE (SD_DATA_OBJ_SIZE * MAX_DATA_OBJS) #define SECTOR_SIZE 512 +#define BUF_SIZE 4096 #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 @@ -900,8 +901,15 @@ static void aio_read_response(void *opaque) } conn_state = C_IO_DATA; case C_IO_DATA: -ret = do_readv(fd, acb->qiov->iov, aio_req->data_len - done, - aio_req->iov_offset + done); +if (acb->canceled) { +char tmp_buf[BUF_SIZE]; +int len = MIN(aio_req->data_len - done, sizeof(tmp_buf)); + +ret = do_read(fd, tmp_buf, len, 0); +} else { +ret = do_readv(fd, acb->qiov->iov, aio_req->data_len - done, + aio_req->iov_offset + done); +} if (ret < 0) { error_report("failed to get the data, %s\n", strerror(errno)); conn_state = C_IO_CLOSED; -- 1.5.6.5
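The bounce-buffer drain in the patch keeps the connection in sync: the response payload must still be consumed from the socket even though the caller's buffer is gone. A standalone sketch of the same pattern (names are illustrative; the real code also tracks a running offset across partial reads):

```c
#include <stdio.h>
#include <unistd.h>

#define SCRATCH_SIZE 4096

/* Read and discard 'remaining' bytes from fd, chunk by chunk, so the
 * byte stream stays aligned with the next response header even though
 * the original destination buffer no longer exists. */
static int drain_bytes(int fd, size_t remaining)
{
    char scratch[SCRATCH_SIZE];

    while (remaining > 0) {
        size_t len = remaining < sizeof(scratch) ? remaining : sizeof(scratch);
        ssize_t ret = read(fd, scratch, len);
        if (ret <= 0) {
            return -1;
        }
        remaining -= ret;
    }
    return 0;
}
```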
[Qemu-devel] [PATCH 2/3] sheepdog: allow cancellation of I/Os which are not processed yet
We can cancel I/O requests safely if they are not sent to the servers. Signed-off-by: MORITA Kazutaka --- block/sheepdog.c | 37 + 1 files changed, 37 insertions(+), 0 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index cedf806..ed98701 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -421,6 +421,43 @@ static void sd_finish_aiocb(SheepdogAIOCB *acb) static void sd_aio_cancel(BlockDriverAIOCB *blockacb) { SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb; +BDRVSheepdogState *s = blockacb->bs->opaque; +AIOReq *areq, *next, *oldest_send_req = NULL; + +if (acb->bh) { +/* + * sd_readv_writev_bh_cb() is not called yet, so we can + * release this safely + */ +qemu_bh_delete(acb->bh); +acb->bh = NULL; +qemu_aio_release(acb); +return; +} + +QLIST_FOREACH(areq, &s->outstanding_aio_head, outstanding_aio_siblings) { +if (areq->state == AIO_SEND_OBJREQ) { +oldest_send_req = areq; +} +} + +QLIST_FOREACH_SAFE(areq, &s->outstanding_aio_head, + outstanding_aio_siblings, next) { +if (areq->state == AIO_RECV_OBJREQ) { +continue; +} +if (areq->state == AIO_SEND_OBJREQ && areq == oldest_send_req) { +/* the oldest AIO_SEND_OBJREQ request could be being sent */ +continue; +} +free_aio_req(s, areq); +} + +if (QLIST_EMPTY(&acb->aioreq_head)) { +/* there is no outstanding request */ +qemu_aio_release(acb); +return; +} /* * Sheepdog cannot cancel the requests which are already sent to -- 1.5.6.5
[Qemu-devel] [PATCH 1/3] sheepdog: make send/recv operations non-blocking
This patch avoids retrying send/recv in AIO path when the sheepdog connection is not ready for the operation. Signed-off-by: MORITA Kazutaka --- block/sheepdog.c | 417 +- 1 files changed, 289 insertions(+), 128 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index a54e0de..cedf806 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -242,6 +242,19 @@ static inline int is_snapshot(struct SheepdogInode *inode) typedef struct SheepdogAIOCB SheepdogAIOCB; +enum ConnectionState { +C_IO_HEADER, +C_IO_DATA, +C_IO_END, +C_IO_CLOSED, +}; + +enum AIOReqState { +AIO_PENDING,/* not ready for sending this request */ +AIO_SEND_OBJREQ,/* send this request */ +AIO_RECV_OBJREQ,/* receive a result of this request */ +}; + typedef struct AIOReq { SheepdogAIOCB *aiocb; unsigned int iov_offset; @@ -253,6 +266,9 @@ typedef struct AIOReq { uint8_t flags; uint32_t id; +enum AIOReqState state; +struct SheepdogObjReq hdr; + QLIST_ENTRY(AIOReq) outstanding_aio_siblings; QLIST_ENTRY(AIOReq) aioreq_siblings; } AIOReq; @@ -348,12 +364,14 @@ static const char * sd_strerror(int err) * 1. In the sd_aio_readv/writev, read/write requests are added to the *QEMU Bottom Halves. * - * 2. In sd_readv_writev_bh_cb, the callbacks of BHs, we send the I/O - *requests to the server and link the requests to the - *outstanding_list in the BDRVSheepdogState. we exits the - *function without waiting for receiving the response. + * 2. In sd_readv_writev_bh_cb, the callbacks of BHs, we set up the + *I/O requests to the server and link the requests to the + *outstanding_list in the BDRVSheepdogState. + * + * 3. We send the request in aio_send_request, the fd handler to the + *sheepdog connection. * - * 3. We receive the response in aio_read_response, the fd handler to + * 4. We receive the response in aio_read_response, the fd handler to *the sheepdog connection. If metadata update is needed, we send *the write request to the vdi object in sd_write_done, the write *completion function. 
The AIOCB callback is not called until all @@ -377,8 +395,6 @@ static inline AIOReq *alloc_aio_req(BDRVSheepdogState *s, SheepdogAIOCB *acb, aio_req->flags = flags; aio_req->id = s->aioreq_seq_num++; -QLIST_INSERT_HEAD(&s->outstanding_aio_head, aio_req, - outstanding_aio_siblings); QLIST_INSERT_HEAD(&acb->aioreq_head, aio_req, aioreq_siblings); return aio_req; @@ -640,20 +656,17 @@ static int do_readv_writev(int sockfd, struct iovec *iov, int len, again: ret = do_send_recv(sockfd, iov, len, iov_offset, write); if (ret < 0) { -if (errno == EINTR || errno == EAGAIN) { +if (errno == EINTR) { goto again; } +if (errno == EAGAIN) { +return 0; +} error_report("failed to recv a rsp, %s\n", strerror(errno)); -return 1; -} - -iov_offset += ret; -len -= ret; -if (len) { -goto again; +return -errno; } -return 0; +return ret; } static int do_readv(int sockfd, struct iovec *iov, int len, int iov_offset) @@ -666,30 +679,30 @@ static int do_writev(int sockfd, struct iovec *iov, int len, int iov_offset) return do_readv_writev(sockfd, iov, len, iov_offset, 1); } -static int do_read_write(int sockfd, void *buf, int len, int write) +static int do_read_write(int sockfd, void *buf, int len, int skip, int write) { struct iovec iov; iov.iov_base = buf; -iov.iov_len = len; +iov.iov_len = len + skip; -return do_readv_writev(sockfd, &iov, len, 0, write); +return do_readv_writev(sockfd, &iov, len, skip, write); } -static int do_read(int sockfd, void *buf, int len) +static int do_read(int sockfd, void *buf, int len, int skip) { -return do_read_write(sockfd, buf, len, 0); +return do_read_write(sockfd, buf, len, skip, 0); } -static int do_write(int sockfd, void *buf, int len) +static int do_write(int sockfd, void *buf, int len, int skip) { -return do_read_write(sockfd, buf, len, 1); +return do_read_write(sockfd, buf, len, skip, 1); } static int send_req(int sockfd, SheepdogReq *hdr, void *data, unsigned int *wlen) { -int ret; +int ret, done = 0; struct iovec iov[2]; iov[0].iov_base = hdr; @@ 
-700,19 +713,23 @@ static int send_req(int sockfd, SheepdogReq *hdr, void *data, iov[1].iov_len = *wlen; } -ret = do_writev(sockfd, iov, sizeof(*hdr) + *wlen, 0); -if (ret) { -error_report("failed to send a req, %s\n", strerror(errno)); -ret = -1; +while (done < sizeof(*hdr) + *wlen) { +ret = do_writev(sockfd, iov, sizeof(*hdr) + *wlen - done, done); +if (ret <
[Qemu-devel] [PATCH] sheepdog: support creating images on remote hosts
This patch parses the input filename in sd_create(), and enables us specifying a target server to create sheepdog images. Signed-off-by: MORITA Kazutaka --- block/sheepdog.c | 17 ++--- 1 files changed, 14 insertions(+), 3 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index e62820a..a54e0de 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1294,12 +1294,23 @@ static int do_sd_create(char *filename, int64_t vdi_size, static int sd_create(const char *filename, QEMUOptionParameter *options) { int ret; -uint32_t vid = 0; +uint32_t vid = 0, base_vid = 0; int64_t vdi_size = 0; char *backing_file = NULL; +BDRVSheepdogState s; +char vdi[SD_MAX_VDI_LEN], tag[SD_MAX_VDI_TAG_LEN]; +uint32_t snapid; strstart(filename, "sheepdog:", (const char **)&filename); +memset(&s, 0, sizeof(s)); +memset(vdi, 0, sizeof(vdi)); +memset(tag, 0, sizeof(tag)); +if (parse_vdiname(&s, filename, vdi, &snapid, tag) < 0) { +error_report("invalid filename\n"); +return -EINVAL; +} + while (options && options->name) { if (!strcmp(options->name, BLOCK_OPT_SIZE)) { vdi_size = options->value.n; @@ -1338,11 +1349,11 @@ static int sd_create(const char *filename, QEMUOptionParameter *options) return -EINVAL; } -vid = s->inode.vdi_id; +base_vid = s->inode.vdi_id; bdrv_delete(bs); } -return do_sd_create((char *)filename, vdi_size, vid, NULL, 0, NULL, NULL); +return do_sd_create((char *)vdi, vdi_size, base_vid, &vid, 0, s.addr, s.port); } static void sd_close(BlockDriverState *bs) -- 1.5.6.5
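The filename being parsed has the rough shape `[host:port:]vdiname`; a simplified standalone parser (this is not the driver's parse_vdiname() — the buffer sizes are invented and snapshot tags are ignored — though the defaults mirror SD_DEFAULT_ADDR/SD_DEFAULT_PORT from the driver):

```c
#include <stdio.h>
#include <string.h>

/* Split "host:port:vdiname" into its parts, defaulting to the local
 * daemon when only a vdi name is given.  Simplified sketch: no
 * snapshot tag handling, fixed caller-provided buffer sizes. */
static int parse_spec(const char *spec, char *host, char *port, char *vdi)
{
    const char *colon = strchr(spec, ':');

    if (colon && strchr(colon + 1, ':')) {
        /* host:port:vdiname */
        if (sscanf(spec, "%63[^:]:%15[^:]:%255s", host, port, vdi) != 3) {
            return -1;
        }
    } else {
        /* vdiname only: fall back to the defaults */
        strcpy(host, "localhost");
        strcpy(port, "7000");
        snprintf(vdi, 256, "%s", spec);
    }
    return 0;
}
```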
[Qemu-devel] [PATCH] Documentation: add Sheepdog disk images
Signed-off-by: MORITA Kazutaka
---
 qemu-doc.texi | 52 
 1 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/qemu-doc.texi b/qemu-doc.texi
index 22a8663..86e017c 100644
--- a/qemu-doc.texi
+++ b/qemu-doc.texi
@@ -407,6 +407,7 @@ snapshots.
 * host_drives::               Using host drives
 * disk_images_fat_images::    Virtual FAT disk images
 * disk_images_nbd::           NBD access
+* disk_images_sheepdog::      Sheepdog disk images
 @end menu
 
 @node disk_images_quickstart
@@ -630,6 +631,57 @@ qemu -cdrom nbd:localhost:exportname=debian-500-ppc-netinst
 qemu -cdrom nbd:localhost:exportname=openSUSE-11.1-ppc-netinst
 @end example
 
+@node disk_images_sheepdog
+@subsection Sheepdog disk images
+
+Sheepdog is a distributed storage system for QEMU. It provides highly
+available block level storage volumes that can be attached to
+QEMU-based virtual machines.
+
+You can create a Sheepdog disk image with the command:
+@example
+qemu-img create sheepdog:@var{image} @var{size}
+@end example
+where @var{image} is the Sheepdog image name and @var{size} is its
+size.
+
+To import the existing @var(unknown) to Sheepdog, you can use the
+convert command.
+@example
+qemu-img convert @var(unknown) sheepdog:@var{image}
+@end example
+
+You can boot from the Sheepdog disk image with the command:
+@example
+qemu sheepdog:@var{image}
+@end example
+
+You can also create a snapshot of the Sheepdog image, just as with qcow2.
+@example
+qemu-img snapshot -c @var{tag} sheepdog:@var{image}
+@end example
+where @var{tag} is a tag name of the newly created snapshot.
+
+To boot from a Sheepdog snapshot, specify the tag name of the
+snapshot.
+@example
+qemu sheepdog:@var{image}:@var{tag}
+@end example
+
+You can create a cloned image from an existing snapshot.
+@example
+qemu-img create -b sheepdog:@var{base}:@var{tag} sheepdog:@var{image}
+@end example
+where @var{base} is an image name of the source snapshot and @var{tag}
+is its tag name.
+ +If the Sheepdog daemon doesn't run on the local host, you need to +specify one of the Sheepdog servers to connect to. +@example +qemu-img create sheepdog:@var{hostname}:@var{port}:@var{image} @var{size} +qemu sheepdog:@var{hostname}:@var{port}:@var{image} +@end example + @node pcsys_network @section Network emulation -- 1.5.6.5
Re: [Qemu-devel] Re: [PATCH 3/3] block/nbd: Make the NBD block device use the AIO interface
At Mon, 21 Feb 2011 17:48:49 +0100, Kevin Wolf wrote: > > Am 21.02.2011 17:31, schrieb Nicholas Thomas: > > Hi again, > > > > Thanks for looking through the patches. I'm just going through and > > making the suggested changes now. I've also got qemu-nbd and block/nbd.c > > working over IPv6 :) - hopefully I'll be able to provide patches in a > > couple of days. Just a few questions about some of the changes... > > > > Canceled requests: > >>> + > >>> + > >>> +static void nbd_aio_cancel(BlockDriverAIOCB *blockacb) > >>> +{ > >>> +NBDAIOCB *acb = (NBDAIOCB *)blockacb; > >>> + > >>> +/* > >>> + * We cannot cancel the requests which are already sent to > >>> + * the servers, so we just complete the request with -EIO here. > >>> + */ > >>> +acb->common.cb(acb->common.opaque, -EIO); > >>> +acb->canceled = 1; > >>> +} > >> > >> I think you need to check for acb->canceled before you write to the > >> associated buffer when receiving the reply for a read request. The > >> buffer might not exist any more after the request is cancelled. > > > > I "borrowed" this code from block/sheepdog.c (along with a fair few > > other bits ;) ) - which doesn't seem to do any special checking for > > cancelled write requests. So if there is a potential SIGSEGV here, I > > guess sheepdog is also vulnerable. > > Right, now that you mention it, I seem to remember this from Sheepdog. I > think I had a discussion with Stefan and he convinced me that we could > get away with it in Sheepdog because of some condition that Sheepdog > meets. Not sure any more what that condition was and if it applies to NBD. > > Was it that Sheepdog has a bounce buffer for all requests? Sheepdog doesn't use a bounce buffer for any requests, and to me, it seems that Sheepdog also needs to check acb->canceled before reading the response of a read request... 
> >>> +static BlockDriverAIOCB *nbd_aio_readv(BlockDriverState *bs, > >>> +int64_t sector_num, QEMUIOVector *qiov, int nb_sectors, > >>> +BlockDriverCompletionFunc *cb, void *opaque) > >>> +{ > >>> [...] > >>> +for (i = 0; i < qiov->niov; i++) { > >>> +memset(qiov->iov[i].iov_base, 0, qiov->iov[i].iov_len); > >>> +} > >> > >> qemu_iovec_memset? > >> > >> What is this even for? Aren't these zeros overwritten anyway? > > > > Again, present in sheepdog - but it does seem to work fine without it. > > I'll remove it from NBD. > > Maybe Sheepdog reads only partially from the server if blocks are > unallocated or something. Yes, exactly. Thanks, Kazutaka
[Qemu-devel] [PATCH v2] qemu-io: check registered fds in command_loop()
Some block drivers use an aio handler and do I/O completion routines in it.
However, the handler is not invoked if we only do aio_read/write, because
registered fds are not checked at all.

This patch registers an aio handler for STDIN which checks whether we can
read a command without blocking, and calls qemu_aio_wait() in
command_loop(). Any other handlers can be invoked when user input is idle.

Signed-off-by: MORITA Kazutaka
---
It seems that the QEMU aio implementation doesn't allow calling
qemu_aio_wait() in the aio handler, so the previous patch is broken. This
patch only checks that STDIN is ready to read a line in the aio handler, and
invokes a command in command_loop(). I think this also fixes the problem
which occurs in qemu-iotests.

 cmd.c | 33 ++---
 1 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/cmd.c b/cmd.c
index 2336334..db2c9c4 100644
--- a/cmd.c
+++ b/cmd.c
@@ -24,6 +24,7 @@
 #include 
 #include "cmd.h"
+#include "qemu-aio.h"
 
 #define _(x) x /* not gettext support yet */
 
@@ -149,10 +150,20 @@ add_args_command(
     args_func = af;
 }
 
+static void prep_fetchline(void *opaque)
+{
+    int *fetchable = opaque;
+
+    qemu_aio_set_fd_handler(STDIN_FILENO, NULL, NULL, NULL, NULL, NULL);
+    *fetchable = 1;
+}
+
+static char *get_prompt(void);
+
 void
 command_loop(void)
 {
-    int c, i, j = 0, done = 0;
+    int c, i, j = 0, done = 0, fetchable = 0, prompted = 0;
     char *input;
     char **v;
     const cmdinfo_t *ct;
@@ -186,7 +197,21 @@ command_loop(void)
         free(cmdline);
         return;
     }
+
     while (!done) {
+        if (!prompted) {
+            printf("%s", get_prompt());
+            fflush(stdout);
+            qemu_aio_set_fd_handler(STDIN_FILENO, prep_fetchline, NULL, NULL,
+                                    NULL, &fetchable);
+            prompted = 1;
+        }
+
+        qemu_aio_wait();
+
+        if (!fetchable) {
+            continue;
+        }
         if ((input = fetchline()) == NULL)
             break;
         v = breakline(input, &c);
@@ -199,7 +224,11 @@ command_loop(void)
                 v[0]);
         }
         doneline(input, v);
+
+        prompted = 0;
+        fetchable = 0;
     }
+    qemu_aio_set_fd_handler(STDIN_FILENO, NULL, NULL, NULL, NULL, NULL);
 }
 
 /* from libxcmd/input.c */
@@ -270,8 +299,6 @@ fetchline(void) if (!line) return NULL; - printf("%s", get_prompt()); - fflush(stdout); if (!fgets(line, MAXREADLINESZ, stdin)) { free(line); return NULL; -- 1.5.6.5
[Qemu-devel] [PATCH] qemu-img: avoid calling exit(1) to release resources properly
This patch removes exit(1) from error(), and properly releases resources such as a block driver and an allocated memory. For testing the Sheepdog block driver with qemu-iotests, it is necessary to call bdrv_delete() before the program exits. Because the driver releases the lock of VM images in the close handler. Signed-off-by: MORITA Kazutaka --- qemu-img.c | 235 +++- 1 files changed, 184 insertions(+), 51 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index ea091f0..fe8a577 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -39,14 +39,13 @@ typedef struct img_cmd_t { /* Default to cache=writeback as data integrity is not important for qemu-tcg. */ #define BDRV_O_FLAGS BDRV_O_CACHE_WB -static void QEMU_NORETURN error(const char *fmt, ...) +static void error(const char *fmt, ...) { va_list ap; va_start(ap, fmt); fprintf(stderr, "qemu-img: "); vfprintf(stderr, fmt, ap); fprintf(stderr, "\n"); -exit(1); va_end(ap); } @@ -197,57 +196,76 @@ static BlockDriverState *bdrv_new_open(const char *filename, char password[256]; bs = bdrv_new(""); -if (!bs) +if (!bs) { error("Not enough memory"); +goto fail; +} if (fmt) { drv = bdrv_find_format(fmt); -if (!drv) +if (!drv) { error("Unknown file format '%s'", fmt); +goto fail; +} } else { drv = NULL; } if (bdrv_open(bs, filename, flags, drv) < 0) { error("Could not open '%s'", filename); +goto fail; } if (bdrv_is_encrypted(bs)) { printf("Disk image '%s' is encrypted.\n", filename); -if (read_password(password, sizeof(password)) < 0) +if (read_password(password, sizeof(password)) < 0) { error("No password given"); -if (bdrv_set_key(bs, password) < 0) +goto fail; +} +if (bdrv_set_key(bs, password) < 0) { error("invalid password"); +goto fail; +} } return bs; +fail: +if (bs) { +bdrv_delete(bs); +} +return NULL; } -static void add_old_style_options(const char *fmt, QEMUOptionParameter *list, +static int add_old_style_options(const char *fmt, QEMUOptionParameter *list, int flags, const char *base_filename, const char *base_fmt) { if (flags 
& BLOCK_FLAG_ENCRYPT) { if (set_option_parameter(list, BLOCK_OPT_ENCRYPT, "on")) { error("Encryption not supported for file format '%s'", fmt); +return -1; } } if (flags & BLOCK_FLAG_COMPAT6) { if (set_option_parameter(list, BLOCK_OPT_COMPAT6, "on")) { error("VMDK version 6 not supported for file format '%s'", fmt); +return -1; } } if (base_filename) { if (set_option_parameter(list, BLOCK_OPT_BACKING_FILE, base_filename)) { error("Backing file not supported for file format '%s'", fmt); +return -1; } } if (base_fmt) { if (set_option_parameter(list, BLOCK_OPT_BACKING_FMT, base_fmt)) { error("Backing file format not supported for file format '%s'", fmt); +return -1; } } +return 0; } static int img_create(int argc, char **argv) { -int c, ret, flags; +int c, ret = 0, flags; const char *fmt = "raw"; const char *base_fmt = NULL; const char *filename; @@ -293,12 +311,16 @@ static int img_create(int argc, char **argv) /* Find driver and parse its options */ drv = bdrv_find_format(fmt); -if (!drv) +if (!drv) { error("Unknown file format '%s'", fmt); +return 1; +} proto_drv = bdrv_find_protocol(filename); -if (!proto_drv) +if (!proto_drv) { error("Unknown protocol '%s'", filename); +return 1; +} create_options = append_option_parameters(create_options, drv->create_options); @@ -307,7 +329,7 @@ static int img_create(int argc, char **argv) if (options && !strcmp(options, "?")) { print_option_help(create_options); -return 0; +goto out; } /* Create parameter list with default values */ @@ -319,6 +341,8 @@ static int img_create(int argc, char **argv) param = parse_option_parameters(options, create_options, param); if (param == NULL) { error("Invalid options for file format '%s'.", fmt); +ret = -1; +goto out; } } @@ -328,7 +352,10 @@ static int
[Qemu-devel] [PATCH v6] block: add sheepdog driver for distributed storage support
Sheepdog is a distributed storage system for QEMU. It provides highly available block level storage volumes to VMs like Amazon EBS. This patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site: http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka
---
I've addressed the comments and tested with qemu-iotests, which is hacked for Sheepdog. This version changes the inode data format to support a snapshot tag name, so to test this patch, please pull the latest sheepdog server code:

git://sheepdog.git.sourceforge.net/gitroot/sheepdog/sheepdog next

Sheepdog passes almost all testcases against the raw format, but failed in the following ones:
- 005: Sheepdog cannot support an image larger than 4 TB, so it failed to create a 5 TB image.
- 012: Sheepdog images are not files, so we cannot make them read-only with chmod.

Thanks,

Kazutaka

Changes from v5 to v6 are:
- support a snapshot name
- support resizing images (stretching only) to pass a qemu-iotests check
- fix compile errors on the WIN32 environment
- initialize an array to avoid a valgrind warning
- remove an aio handler when it is no longer needed

Changes from v4 to v5 are:
- address the comments to the sheepdog driver (Thanks Kevin, Chris!)
-- fix a coding style
-- fix aio_cancel handling
-- fix an overflow bug in copying hostname
-- add comments to the non-trivial functions
- remove already applied patches from the patchset

Changes from v3 to v4 are:
- fix error handling in bdrv_snapshot_goto.

Changes from v2 to v3 are:
- add drv->bdrv_close() and drv->bdrv_open() before and after bdrv_snapshot_goto() call of the protocol.
- address the review comments on the sheepdog driver code. Changes from v1 to v2 are: - rebase onto git://repo.or.cz/qemu/kevin.git block - modify the sheepdog driver as a protocol driver - add new patch to call the snapshot handler of the protocol Makefile.objs|2 +- block/sheepdog.c | 2036 ++ 2 files changed, 2037 insertions(+), 1 deletions(-) create mode 100644 block/sheepdog.c diff --git a/Makefile.objs b/Makefile.objs index 2bfb6d1..4c37182 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o -block-nested-y += parallels.o nbd.o blkdebug.o +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o block-nested-$(CONFIG_CURL) += curl.o diff --git a/block/sheepdog.c b/block/sheepdog.c new file mode 100644 index 000..69a2494 --- /dev/null +++ b/block/sheepdog.c @@ -0,0 +1,2036 @@ +/* + * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
+ */ +#ifdef _WIN32 +#include +#include +#include +#else +#include +#include + +#define closesocket(s) close(s) +#endif + +#include "qemu-common.h" +#include "qemu-error.h" +#include "qemu_socket.h" +#include "block_int.h" + +#define SD_PROTO_VER 0x01 + +#define SD_DEFAULT_ADDR "localhost" +#define SD_DEFAULT_PORT "7000" + +#define SD_OP_CREATE_AND_WRITE_OBJ 0x01 +#define SD_OP_READ_OBJ 0x02 +#define SD_OP_WRITE_OBJ 0x03 + +#define SD_OP_NEW_VDI0x11 +#define SD_OP_LOCK_VDI 0x12 +#define SD_OP_RELEASE_VDI0x13 +#define SD_OP_GET_VDI_INFO 0x14 +#define SD_OP_READ_VDIS 0x15 + +#define SD_FLAG_CMD_WRITE0x01 +#define SD_FLAG_CMD_COW 0x02 + +#define SD_RES_SUCCESS 0x00 /* Success */ +#define SD_RES_UNKNOWN 0x01 /* Unknown error */ +#define SD_RES_NO_OBJ0x02 /* No object found */ +#define SD_RES_EIO 0x03 /* I/O error */ +#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */ +#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */ +#define SD_RES_SYSTEM_ERROR 0x06 /* System error */ +#define SD_RES_VDI_LOCKED0x07 /* Vdi is locked */ +#define SD_RES_NO_VDI0x08 /* No vdi found */ +#define SD_RES_NO_BASE_VDI 0x09 /* No base vdi found */ +#define SD_RES_VDI_READ 0x0A /* Ca
Re: [Qemu-devel] [PATCH 1/2] qemu-img check: Distinguish different kinds of errors
At Fri, 2 Jul 2010 19:14:59 +0200, Kevin Wolf wrote: > > People think that their images are corrupted when in fact there are just some > leaked clusters. Differentiating several error cases should make the messages > more comprehensible. > > Signed-off-by: Kevin Wolf > --- > block.c| 10 ++-- > block.h| 10 - > qemu-img.c | 62 +-- > 3 files changed, 63 insertions(+), 19 deletions(-) > > diff --git a/block.c b/block.c > index dd6dd76..b0ceef0 100644 > --- a/block.c > +++ b/block.c > @@ -710,15 +710,19 @@ DeviceState *bdrv_get_attached(BlockDriverState *bs) > /* > * Run consistency checks on an image > * > - * Returns the number of errors or -errno when an internal error occurs > + * Returns 0 if the check could be completed (it doesn't mean that the image > is > + * free of errors) or -errno when an internal error occured. The results of > the > + * check are stored in res. > */ > -int bdrv_check(BlockDriverState *bs) > +int bdrv_check(BlockDriverState *bs, BdrvCheckResult *res) > { > if (bs->drv->bdrv_check == NULL) { > return -ENOTSUP; > } > > -return bs->drv->bdrv_check(bs); > +memset(res, 0, sizeof(*res)); > +res->corruptions = bs->drv->bdrv_check(bs); > +return res->corruptions < 0 ? 
res->corruptions : 0; > } > > /* commit COW file into the raw image */ > diff --git a/block.h b/block.h > index 3d03b3e..c2a7e4c 100644 > --- a/block.h > +++ b/block.h > @@ -74,7 +74,6 @@ void bdrv_close(BlockDriverState *bs); > int bdrv_attach(BlockDriverState *bs, DeviceState *qdev); > void bdrv_detach(BlockDriverState *bs, DeviceState *qdev); > DeviceState *bdrv_get_attached(BlockDriverState *bs); > -int bdrv_check(BlockDriverState *bs); > int bdrv_read(BlockDriverState *bs, int64_t sector_num, >uint8_t *buf, int nb_sectors); > int bdrv_write(BlockDriverState *bs, int64_t sector_num, > @@ -97,6 +96,15 @@ int bdrv_change_backing_file(BlockDriverState *bs, > const char *backing_file, const char *backing_fmt); > void bdrv_register(BlockDriver *bdrv); > > + > +typedef struct BdrvCheckResult { > +int corruptions; > +int leaks; > +int check_errors; > +} BdrvCheckResult; > + > +int bdrv_check(BlockDriverState *bs, BdrvCheckResult *res); > + > /* async block I/O */ > typedef struct BlockDriverAIOCB BlockDriverAIOCB; > typedef void BlockDriverCompletionFunc(void *opaque, int ret); > diff --git a/qemu-img.c b/qemu-img.c > index 700af21..1782ac9 100644 > --- a/qemu-img.c > +++ b/qemu-img.c > @@ -425,11 +425,20 @@ out: > return 0; > } > > +/* > + * Checks an image for consistency. 
Exit codes: > + * > + * 0 - Check completed, image is good > + * 1 - Check not completed because of internal errors > + * 2 - Check completed, image is corrupted > + * 3 - Check completed, image has leaked clusters, but is good otherwise > + */ > static int img_check(int argc, char **argv) > { > int c, ret; > const char *filename, *fmt; > BlockDriverState *bs; > +BdrvCheckResult result; > > fmt = NULL; > for(;;) { > @@ -453,28 +462,51 @@ static int img_check(int argc, char **argv) > if (!bs) { > return 1; > } > -ret = bdrv_check(bs); > -switch(ret) { > -case 0: > -printf("No errors were found on the image.\n"); > -break; > -case -ENOTSUP: > +ret = bdrv_check(bs, &result); > + > +if (ret == -ENOTSUP) { > error("This image format does not support checks"); > -break; > -default: > -if (ret < 0) { > -error("An error occurred during the check"); > -} else { > -printf("%d errors were found on the image.\n", ret); > +return 1; Is it okay to call bdrv_delete(bs) before return? It is necessary for the sheepdog driver to pass qemu-iotests. Kazutaka --- a/qemu-img.c +++ b/qemu-img.c @@ -466,6 +466,7 @@ static int img_check(int argc, char **argv) if (ret == -ENOTSUP) { error("This image format does not support checks"); +bdrv_delete(bs); return 1; }
[Qemu-devel] [PATCH] sheepdog: fix compile error on systems without TCP_CORK
WIN32 is not the only system that lacks TCP_CORK (OS X, for example, doesn't have it either).

Signed-off-by: MORITA Kazutaka
---
Betts, I think this patch fixes the compile error. Can you try this one?

 block/sheepdog.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 69a2494..81aa564 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -889,7 +889,7 @@ static int aio_flush_request(void *opaque)
     return !QLIST_EMPTY(&s->outstanding_aio_head);
 }
 
-#ifdef _WIN32
+#if !defined(SOL_TCP) || !defined(TCP_CORK)
 static int set_cork(int fd, int v)
 {
-- 
1.5.6.5
[Qemu-devel] [RFC PATCH 1/2] close all the block drivers before the qemu process exits
This patch calls the close handler of the block driver before the qemu process exits. This is necessary because the sheepdog block driver releases the lock of VM images in the close handler. Signed-off-by: MORITA Kazutaka --- block.c | 11 +++ block.h |1 + monitor.c |1 + vl.c |1 + 4 files changed, 14 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index 7326bfe..a606820 100644 --- a/block.c +++ b/block.c @@ -526,6 +526,17 @@ void bdrv_close(BlockDriverState *bs) } } +void bdrv_close_all(void) +{ +BlockDriverState *bs, *n; + +for (bs = bdrv_first, n = bs->next; bs; bs = n, n = bs ? bs->next : NULL) { +if (bs && bs->drv && bs->drv->bdrv_close) { +bs->drv->bdrv_close(bs); +} +} +} + void bdrv_delete(BlockDriverState *bs) { BlockDriverState **pbs; diff --git a/block.h b/block.h index fa51ddf..1a1293a 100644 --- a/block.h +++ b/block.h @@ -123,6 +123,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, /* Ensure contents are flushed to disk. */ void bdrv_flush(BlockDriverState *bs); void bdrv_flush_all(void); +void bdrv_close_all(void); int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum); diff --git a/monitor.c b/monitor.c index 17e59f5..44bfe83 100644 --- a/monitor.c +++ b/monitor.c @@ -845,6 +845,7 @@ static void do_info_cpu_stats(Monitor *mon) */ static void do_quit(Monitor *mon, const QDict *qdict, QObject **ret_data) { +bdrv_close_all(); exit(0); } diff --git a/vl.c b/vl.c index 77677e8..65160ed 100644 --- a/vl.c +++ b/vl.c @@ -4205,6 +4205,7 @@ static void main_loop(void) vm_stop(r); } } +bdrv_close_all(); pause_all_vcpus(); } -- 1.5.6.5
[Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU
Hi all,

This patch adds a block driver for the Sheepdog distributed storage system. Please consider it for inclusion.

Sheepdog is a distributed storage system for QEMU. It provides highly available block level storage volumes to VMs like Amazon EBS.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site [1] and my previous post about sheepdog [2].

We have implemented the essential part of the sheepdog features, and believe the API between Sheepdog and QEMU is finalized. Any comments or suggestions would be greatly appreciated.

Here are examples:

$ qemu-img create -f sheepdog vol1 256G                 # create images
$ qemu --drive format=sheepdog,file=vol1                # start up a VM
$ qemu-img snapshot -c name sheepdog:vol1               # create a snapshot
$ qemu-img snapshot -l sheepdog:vol1                    # list snapshots
ID        TAG                 VM SIZE                DATE       VM CLOCK
1                                   0  2010-05-06 02:29:29   00:00:00.000
2                                   0  2010-05-06 02:29:55   00:00:00.000
$ qemu --drive format=sheepdog,file=vol1:1              # start up from a snapshot
$ qemu-img create -b sheepdog:vol1:1 -f sheepdog vol2   # clone images

Thanks,

Kazutaka

[1] http://www.osrg.net/sheepdog/
[2] http://lists.nongnu.org/archive/html/qemu-devel/2009-10/msg01773.html

MORITA Kazutaka (2):
  close all the block drivers before the qemu process exits
  block: add sheepdog driver for distributed storage support

 Makefile         |    2 +-
 block.c          |   14 +-
 block.h          |    1 +
 block/sheepdog.c | 1828 ++
 monitor.c        |    1 +
 vl.c             |    1 +
 6 files changed, 1845 insertions(+), 2 deletions(-)
 create mode 100644 block/sheepdog.c
[Qemu-devel] [RFC PATCH 2/2] block: add sheepdog driver for distributed storage support
Sheepdog is a distributed storage system for QEMU. It provides highly available block level storage volumes to VMs like Amazon EBS. This patch adds a qemu block driver for Sheepdog. Sheepdog features are: - No node in the cluster is special (no metadata node, no control node, etc) - Linear scalability in performance and capacity - No single point of failure - Autonomous management (zero configuration) - Useful volume management support such as snapshot and cloning - Thin provisioning - Autonomous load balancing The more details are available at the project site: http://www.osrg.net/sheepdog/ Signed-off-by: MORITA Kazutaka --- Makefile |2 +- block.c |3 +- block/sheepdog.c | 1828 ++ 3 files changed, 1831 insertions(+), 2 deletions(-) create mode 100644 block/sheepdog.c diff --git a/Makefile b/Makefile index c1fa08c..d03cda1 100644 --- a/Makefile +++ b/Makefile @@ -97,7 +97,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o block-nested-y += cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o -block-nested-y += parallels.o nbd.o +block-nested-y += parallels.o nbd.o sheepdog.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o block-nested-$(CONFIG_CURL) += curl.o diff --git a/block.c b/block.c index a606820..ab00f3f 100644 --- a/block.c +++ b/block.c @@ -307,7 +307,8 @@ static BlockDriver *find_image_format(const char *filename) drv = find_protocol(filename); /* no need to test disk image formats for vvfat */ -if (drv && strcmp(drv->format_name, "vvfat") == 0) +if (drv && (!strcmp(drv->format_name, "vvfat") || +!strcmp(drv->format_name, "sheepdog"))) return drv; ret = bdrv_file_open(&bs, filename, BDRV_O_RDONLY); diff --git a/block/sheepdog.c b/block/sheepdog.c new file mode 100644 index 000..7c07a52 --- /dev/null +++ b/block/sheepdog.c @@ -0,0 +1,1828 @@ +/* + * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation. 
+ * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ +#include +#include + +#include "qemu-common.h" +#include "block_int.h" + +#define SD_PROTO_VER 0x01 + +#define SD_DEFAULT_ADDR "localhost:7000" + +#define SD_OP_CREATE_AND_WRITE_OBJ 0x01 +#define SD_OP_READ_OBJ 0x02 +#define SD_OP_WRITE_OBJ 0x03 + +#define SD_OP_NEW_VDI0x11 +#define SD_OP_LOCK_VDI 0x12 +#define SD_OP_RELEASE_VDI0x13 +#define SD_OP_GET_VDI_INFO 0x14 +#define SD_OP_READ_VDIS 0x15 + +#define SD_FLAG_CMD_WRITE0x01 +#define SD_FLAG_CMD_COW 0x02 + +#define SD_RES_SUCCESS 0x00 /* Success */ +#define SD_RES_UNKNOWN 0x01 /* Unknown error */ +#define SD_RES_NO_OBJ0x02 /* No object found */ +#define SD_RES_EIO 0x03 /* I/O error */ +#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */ +#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */ +#define SD_RES_SYSTEM_ERROR 0x06 /* System error */ +#define SD_RES_VDI_LOCKED0x07 /* Vdi is locked */ +#define SD_RES_NO_VDI0x08 /* No vdi found */ +#define SD_RES_NO_BASE_VDI 0x09 /* No base vdi found */ +#define SD_RES_VDI_READ 0x0A /* Cannot read requested vdi */ +#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */ +#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */ +#define SD_RES_BASE_VDI_WRITE 0x0D /* Cannot write base vdi */ +#define SD_RES_NO_TAG0x0E /* Requested tag is not found */ +#define SD_RES_STARTUP 0x0F /* Sheepdog is on starting up */ +#define SD_RES_VDI_NOT_LOCKED 0x10 /* Vdi is not locked */ +#define SD_RES_SHUTDOWN 0x11 /* Sheepdog is shutting down */ +#define SD_RES_NO_MEM0x12 /* Cannot allocate memory */ +#define SD_RES_FULL_VDI 0x13 /* we already have the maximum vdis */ +#define SD_RES_VER_MISMATCH 0x14 /* Protocol version mismatch 
*/ +#define SD_RES_NO_SPACE 0x15 /* Server has no room for new objects */ +#define SD_RES_WAIT_FOR_FORMAT 0x16 /* Sheepdog is waiting for a format operation */ +#define SD_RES_WAIT_FOR_JOIN0x17 /* Sheepdog is waiting for other nodes joining */ +#define SD_RES_JOIN_FAILED 0x18 /* Target node had failed to join sheepdog */ + +/* + * Object ID rules + * + * 0 - 19 (20 bits): data object space + * 20 - 31 (12 bits): reserved data object space + * 32 - 55 (24 bits): vdi object space + * 56 - 59 ( 4 bits): reserved vdi object space + * 60 - 63 ( 4 bits): object type indentifier space + */ + +#define VDI_SPACE_SH
[Qemu-devel] Re: [RFC PATCH 1/2] close all the block drivers before the qemu process exits
At Thu, 13 May 2010 05:16:35 +0900,
MORITA Kazutaka wrote:
>
> On 2010/05/12 23:28, Avi Kivity wrote:
> > On 05/12/2010 01:46 PM, MORITA Kazutaka wrote:
> >> This patch calls the close handler of the block driver before the qemu
> >> process exits.
> >>
> >> This is necessary because the sheepdog block driver releases the lock
> >> of VM images in the close handler.
> >>
> >
> > How do you handle abnormal termination?
> >
>
> In the case, we need to release the lock manually, unfortunately.
> Sheepdog admin tool has a command to do that.
>

More precisely, if qemu goes down together with its host machine, we detect the qemu failure and release the lock. This is because Sheepdog currently assumes that all the qemu processes are in the sheepdog cluster, and remembers where they are running. When machine failures happen, sheepdog releases the locks of the VMs on those machines (we use corosync to check which machines are alive).

If the qemu process exits abnormally while its host machine stays alive, the sheepdog daemon on the host needs to detect the qemu failure; however, that feature is not implemented yet. We are thinking of checking the socket connection between qemu and the sheepdog daemon to detect the failure. Currently, we need to release the lock manually from the admin tool in this case.

Thanks,

Kazutaka
[Qemu-devel] Re: [RFC PATCH 1/2] close all the block drivers before the qemu process exits
On 2010/05/12 23:28, Avi Kivity wrote:
> On 05/12/2010 01:46 PM, MORITA Kazutaka wrote:
>> This patch calls the close handler of the block driver before the qemu
>> process exits.
>>
>> This is necessary because the sheepdog block driver releases the lock
>> of VM images in the close handler.
>>
>
> How do you handle abnormal termination?
>

In that case, we need to release the lock manually, unfortunately. The Sheepdog admin tool has a command to do that.

Thanks,

Kazutaka
Re: [Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU
On 2010/05/12 20:38, Kevin Wolf wrote: > Am 12.05.2010 12:46, schrieb MORITA Kazutaka: >> Hi all, >> >> This patch adds a block driver for Sheepdog distributed storage >> system. Please consider for inclusion. >> >> Sheepdog is a distributed storage system for QEMU. It provides highly >> available block level storage volumes to VMs like Amazon EBS. >> >> Sheepdog features are: >> - No node in the cluster is special (no metadata node, no control >> node, etc) >> - Linear scalability in performance and capacity >> - No single point of failure >> - Autonomous management (zero configuration) >> - Useful volume management support such as snapshot and cloning >> - Thin provisioning >> - Autonomous load balancing >> >> The more details are available at the project site [1] and my previous >> post about sheepdog [2]. >> >> We have implemented the essential part of sheepdog features, and >> believe the API between Sheepdog and QEMU is finalized. >> >> Any comments or suggestions would be greatly appreciated. > > These patches don't apply, neither on git master nor on the block > branch. Please rebase them on git://repo.or.cz/qemu/kevin.git block for > the next submission. > Ok, I'll rebase them and resend later. Sorry for inconvenience. > I'll have a closer look at your code later, but one thing I noticed is > that the new block driver is something in between a protocol and a > format driver (just like vvfat, which should stop doing so, too). I > think it ought to be a real protocol with the raw format driver on top > (or any other format - I don't see a reason why this should be > restricted to raw). > > The one thing that is unusual about it as a protocol driver is that it > supports snapshots. However, while it is the first one, supporting > snapshots in protocols is a thing that could be generally useful to > support (for example thinking of a LVM protocol, which was discussed in > the past). > I agreed. 
I'll modify the sheepdog driver patch to make it a protocol driver, and remove the unnecessary format check from my patch.

> So in block.c we could check if the format driver supports snapshots,
> and if it doesn't we try again with the underlying protocol. Not sure
> yet what we would do when both format and protocol do support snapshots
> (qcow2 on sheepdog/LVM/...), but that's a detail.
>

Thanks,

Kazutaka
Re: [Qemu-devel] [RFC PATCH 1/2] close all the block drivers before the qemu process exits
On 2010/05/12 23:01, Christoph Hellwig wrote:
> On Wed, May 12, 2010 at 07:46:52PM +0900, MORITA Kazutaka wrote:
>> This patch calls the close handler of the block driver before the qemu
>> process exits.
>>
>> This is necessary because the sheepdog block driver releases the lock
>> of VM images in the close handler.
>>
>> Signed-off-by: MORITA Kazutaka
>
> Looks good in principle, except that bdrv_first is gone and has been
> replaced with a real list in the meantime, so this won't even apply.
>

Thank you for your comment. I'll rebase and resend the updated version in the next few days.

Thanks,

Kazutaka
Re: [Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU
At Thu, 13 May 2010 04:46:46 +0900,
MORITA Kazutaka wrote:
>
> On 2010/05/12 20:38, Kevin Wolf wrote:
> > I'll have a closer look at your code later, but one thing I noticed is
> > that the new block driver is something in between a protocol and a
> > format driver (just like vvfat, which should stop doing so, too). I
> > think it ought to be a real protocol with the raw format driver on top
> > (or any other format - I don't see a reason why this should be
> > restricted to raw).
> >
> > The one thing that is unusual about it as a protocol driver is that it
> > supports snapshots. However, while it is the first one, supporting
> > snapshots in protocols is a thing that could be generally useful to
> > support (for example thinking of a LVM protocol, which was discussed in
> > the past).
> >
>
> I agreed. I'll modify the sheepdog driver patch as a protocol driver one,
> and remove unnecessary format check from my patch.
>
> > So in block.c we could check if the format driver supports snapshots,
> > and if it doesn't we try again with the underlying protocol. Not sure
> > yet what we would do when both format and protocol do support snapshots
> > (qcow2 on sheepdog/LVM/...), but that's a detail.
> >

To support snapshots in a protocol, I'd like to call the handler of the protocol driver in the following functions in block.c:

bdrv_snapshot_create
bdrv_snapshot_goto
bdrv_snapshot_delete
bdrv_snapshot_list
bdrv_save_vmstate
bdrv_load_vmstate

Is it okay?

In the case where both the format and the protocol driver support snapshots, I think it is better to call the format driver's handler: qcow2 is well known as a format with snapshot support, so when users use qcow2, they expect to get qcow2 snapshots.

There is another problem with making the sheepdog driver a protocol: how to deal with protocol-specific create_options?
For example, sheepdog supports cloning images as a format driver:

$ qemu-img create -f sheepdog dst -b sheepdog:src

But if the sheepdog driver is a protocol, an error occurs:

$ qemu-img create sheepdog:dst -b sheepdog:src
Unknown option 'backing_file'
qemu-img: Backing file not supported for file format 'raw'

This is because the raw format doesn't support a backing_file option. To support protocol-specific create_options, the protocol driver needs to parse whatever arguments the format driver cannot.

If my suggestions are okay, I'd like to prepare the patches.

Regards,

Kazutaka
[Qemu-devel] [RFC PATCH v2 1/3] close all the block drivers before the qemu process exits
This patch calls the close handler of the block driver before the qemu process exits. This is necessary because the sheepdog block driver releases the lock of VM images in the close handler. Signed-off-by: MORITA Kazutaka --- block.c |9 + block.h |1 + vl.c|1 + 3 files changed, 11 insertions(+), 0 deletions(-) diff --git a/block.c b/block.c index c134c2b..988a94a 100644 --- a/block.c +++ b/block.c @@ -641,6 +641,15 @@ void bdrv_close(BlockDriverState *bs) } } +void bdrv_close_all(void) +{ +BlockDriverState *bs; + +QTAILQ_FOREACH(bs, &bdrv_states, list) { +bdrv_close(bs); +} +} + void bdrv_delete(BlockDriverState *bs) { /* remove from list, if necessary */ diff --git a/block.h b/block.h index 278259c..531e802 100644 --- a/block.h +++ b/block.h @@ -121,6 +121,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs, /* Ensure contents are flushed to disk. */ void bdrv_flush(BlockDriverState *bs); void bdrv_flush_all(void); +void bdrv_close_all(void); int bdrv_has_zero_init(BlockDriverState *bs); int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors, diff --git a/vl.c b/vl.c index 85bcc84..5ce7807 100644 --- a/vl.c +++ b/vl.c @@ -2007,6 +2007,7 @@ static void main_loop(void) exit(0); } } +bdrv_close_all(); pause_all_vcpus(); } -- 1.5.6.5
[Qemu-devel] [RFC PATCH v2 2/3] block: call the snapshot handlers of the protocol drivers
When the snapshot handlers of the format driver are not defined, it is better to call the ones of the protocol driver. This enables us to implement snapshot support in the protocol driver.

Signed-off-by: MORITA Kazutaka
---
 block.c |   48 ++--
 1 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index 988a94a..d1866be 100644
--- a/block.c
+++ b/block.c
@@ -1689,9 +1689,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_save_vmstate)
-        return -ENOTSUP;
-    return drv->bdrv_save_vmstate(bs, buf, pos, size);
+    if (drv->bdrv_save_vmstate)
+        return drv->bdrv_save_vmstate(bs, buf, pos, size);
+    if (bs->file)
+        return bdrv_save_vmstate(bs->file, buf, pos, size);
+    return -ENOTSUP;
 }
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
@@ -1700,9 +1702,11 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_load_vmstate)
-        return -ENOTSUP;
-    return drv->bdrv_load_vmstate(bs, buf, pos, size);
+    if (drv->bdrv_load_vmstate)
+        return drv->bdrv_load_vmstate(bs, buf, pos, size);
+    if (bs->file)
+        return bdrv_load_vmstate(bs->file, buf, pos, size);
+    return -ENOTSUP;
 }
 
 void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
@@ -1726,9 +1730,11 @@ int bdrv_snapshot_create(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_create)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_create(bs, sn_info);
+    if (drv->bdrv_snapshot_create)
+        return drv->bdrv_snapshot_create(bs, sn_info);
+    if (bs->file)
+        return bdrv_snapshot_create(bs->file, sn_info);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_goto(BlockDriverState *bs,
@@ -1737,9 +1743,11 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_goto)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_goto(bs, snapshot_id);
+    if (drv->bdrv_snapshot_goto)
+        return drv->bdrv_snapshot_goto(bs, snapshot_id);
+    if (bs->file)
+        return bdrv_snapshot_goto(bs->file, snapshot_id);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
@@ -1747,9 +1755,11 @@ int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_delete)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_delete(bs, snapshot_id);
+    if (drv->bdrv_snapshot_delete)
+        return drv->bdrv_snapshot_delete(bs, snapshot_id);
+    if (bs->file)
+        return bdrv_snapshot_delete(bs->file, snapshot_id);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_list(BlockDriverState *bs,
@@ -1758,9 +1768,11 @@ int bdrv_snapshot_list(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_list)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_list(bs, psn_info);
+    if (drv->bdrv_snapshot_list)
+        return drv->bdrv_snapshot_list(bs, psn_info);
+    if (bs->file)
+        return bdrv_snapshot_list(bs->file, psn_info);
+    return -ENOTSUP;
 }
 
 #define NB_SUFFIXES 4
-- 
1.5.6.5
[Qemu-devel] [RFC PATCH v2 0/3] Sheepdog: distributed storage system for QEMU
Hi all,

This patch adds a block driver for the Sheepdog distributed storage system. Changes from v1 to v2 are:

- rebase onto git://repo.or.cz/qemu/kevin.git block
- modify the sheepdog driver as a protocol driver
- add new patch to call the snapshot handler of the protocol

One issue still remains: qemu-img parses command line options with the `create_options' of the format handler, so we cannot use protocol specific options. In this version, sheepdog needs to be used as a format driver when we want to use sheepdog specific options, e.g. to create a clone image vol2 from vol1:

$ qemu-img create -b sheepdog:vol1:1 -f sheepdog vol2

Thanks,

Kazutaka

MORITA Kazutaka (3):
  close all the block drivers before the qemu process exits
  block: call the snapshot handlers of the protocol drivers
  block: add sheepdog driver for distributed storage support

 Makefile.objs    |    2 +-
 block.c          |   57 ++-
 block.h          |    1 +
 block/sheepdog.c | 1831 ++
 vl.c             |    1 +
 5 files changed, 1873 insertions(+), 19 deletions(-)
 create mode 100644 block/sheepdog.c
[Qemu-devel] [RFC PATCH v2 3/3] block: add sheepdog driver for distributed storage support
Sheepdog is a distributed storage system for QEMU. It provides highly available block level storage volumes to VMs like Amazon EBS. This patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site: http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka
---
 Makefile.objs    |    2 +-
 block/sheepdog.c | 1831 ++
 2 files changed, 1832 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index ecdd53e..6edbc57 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..adf3a71
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1831 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */ +#include +#include + +#include "qemu-common.h" +#include "block_int.h" + +#define SD_PROTO_VER 0x01 + +#define SD_DEFAULT_ADDR "localhost:7000" + +#define SD_OP_CREATE_AND_WRITE_OBJ 0x01 +#define SD_OP_READ_OBJ 0x02 +#define SD_OP_WRITE_OBJ 0x03 + +#define SD_OP_NEW_VDI0x11 +#define SD_OP_LOCK_VDI 0x12 +#define SD_OP_RELEASE_VDI0x13 +#define SD_OP_GET_VDI_INFO 0x14 +#define SD_OP_READ_VDIS 0x15 + +#define SD_FLAG_CMD_WRITE0x01 +#define SD_FLAG_CMD_COW 0x02 + +#define SD_RES_SUCCESS 0x00 /* Success */ +#define SD_RES_UNKNOWN 0x01 /* Unknown error */ +#define SD_RES_NO_OBJ0x02 /* No object found */ +#define SD_RES_EIO 0x03 /* I/O error */ +#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */ +#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */ +#define SD_RES_SYSTEM_ERROR 0x06 /* System error */ +#define SD_RES_VDI_LOCKED0x07 /* Vdi is locked */ +#define SD_RES_NO_VDI0x08 /* No vdi found */ +#define SD_RES_NO_BASE_VDI 0x09 /* No base vdi found */ +#define SD_RES_VDI_READ 0x0A /* Cannot read requested vdi */ +#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */ +#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */ +#define SD_RES_BASE_VDI_WRITE 0x0D /* Cannot write base vdi */ +#define SD_RES_NO_TAG0x0E /* Requested tag is not found */ +#define SD_RES_STARTUP 0x0F /* Sheepdog is on starting up */ +#define SD_RES_VDI_NOT_LOCKED 0x10 /* Vdi is not locked */ +#define SD_RES_SHUTDOWN 0x11 /* Sheepdog is shutting down */ +#define SD_RES_NO_MEM0x12 /* Cannot allocate memory */ +#define SD_RES_FULL_VDI 0x13 /* we already have the maximum vdis */ +#define SD_RES_VER_MISMATCH 0x14 /* Protocol version mismatch */ +#define SD_RES_NO_SPACE 0x15 /* Server has no room for new objects */ +#define SD_RES_WAIT_FOR_FORMAT 0x16 /* Sheepdog is waiting for a format operation */ +#define SD_RES_WAIT_FOR_JOIN0x17 /* Sheepdog is waiting for other nodes joining */ +#define SD_RES_JOIN_FAILED 0x18 /* Target node had failed to join sheepdog */ + +/* + * Object 
ID rules + * + * 0 - 19 (20 bits): data object space + * 20 - 31 (12 bits): reserved data object space + * 32 - 55 (24 bits): vdi object space + * 56 - 59 ( 4 bits): reserved vdi object space + * 60 - 63 ( 4 bits): object type indentifier space + */ + +#define VDI_SPACE_SHIFT 32 +#define VDI_BIT (UINT64_C(1) << 63) +#define VMSTATE_BIT (UINT64_C(1) << 62) +#define MAX_DATA_OBJS (1ULL << 20) +#define MAX_CHILDREN 1024 +#define SD_MAX_VDI_LEN 256 +#define SD_NR_VDIS (1U << 24) +#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22) + +#define SD_INODE_SIZE (sizeof(struct sd_inode)) +#define CURRENT_VDI_ID 0 + +struct sd_req { + uint8_t proto_ver; + uint8_t opcode; + uint16_tflags; + uint32_tepoch; + uint32_
Re: [Qemu-devel] [RFC PATCH 0/2] Sheepdog: distributed storage system for QEMU
At Fri, 14 May 2010 10:32:26 +0200,
Kevin Wolf wrote:
> 
> Am 13.05.2010 16:03, schrieb MORITA Kazutaka:
> > To support snapshot in a protocol, I'd like to call the handler of the
> > protocol driver in the following functions in block.c:
> >
> > bdrv_snapshot_create
> > bdrv_snapshot_goto
> > bdrv_snapshot_delete
> > bdrv_snapshot_list
> > bdrv_save_vmstate
> > bdrv_load_vmstate
> >
> > Is it okay?
> 
> Yes, I think this is the way to go.

Done.

> > In the case both format and protocol drivers support snapshots, I
> > think it is better to call the format driver handler. Because qcow2
> > is well known as a snapshot support format, so when users use qcow2,
> > they expect to get snapshot with qcow2.
> 
> I agree.

Done.

> > There is another problem to make the sheepdog driver be a protocol;
> > how to deal with protocol specific create_options?
> >
> > For example, sheepdog supports cloning images as a format driver:
> >
> > $ qemu-img create -f sheepdog dst -b sheepdog:src
> >
> > But if the sheepdog driver is a protocol, an error will occur.
> >
> > $ qemu-img create sheepdog:dst -b sheepdog:src
> > Unknown option 'backing_file'
> > qemu-img: Backing file not supported for file format 'raw'
> >
> > It is because the raw format doesn't support a backing_file option.
> > To support the protocol specific create_options, if the format driver
> > cannot parse some of the arguments, the protocol driver needs to parse
> > them.
> 
> That's actually a good point. Yes, I think it makes a lot of sense to
> allow parameters to be passed to the protocol driver.

Okay. But it seemed to require many changes to the qemu-img parser, so I didn't do it in the patchset I sent just now.

> Also, I've never tried to create an image over a protocol other than
> file. As far as I know, raw is the only format for which it should work
> right now (at least in theory). As we're going forward, I'm planning to
> convert the other drivers, too.

I see. Thank you for the explanations.
Regards, Kazutaka
[Qemu-devel] [RFC PATCH v3 2/3] block: call the snapshot handlers of the protocol drivers
When snapshot handlers are not defined in the format driver, it is better to call the ones of the protocol driver. This enables us to implement snapshot support in the protocol driver.

We need to call the bdrv_close() and bdrv_open() handlers of the format driver before and after the bdrv_snapshot_goto() call of the protocol. This is because the contents of the block driver state may need to be changed after loading vmstate.

Signed-off-by: MORITA Kazutaka
---
 block.c |   61 +++--
 1 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/block.c b/block.c
index f3bf3f2..c987e57 100644
--- a/block.c
+++ b/block.c
@@ -1683,9 +1683,11 @@ int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_save_vmstate)
-        return -ENOTSUP;
-    return drv->bdrv_save_vmstate(bs, buf, pos, size);
+    if (drv->bdrv_save_vmstate)
+        return drv->bdrv_save_vmstate(bs, buf, pos, size);
+    if (bs->file)
+        return bdrv_save_vmstate(bs->file, buf, pos, size);
+    return -ENOTSUP;
 }
 
 int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
@@ -1694,9 +1696,11 @@ int bdrv_load_vmstate(BlockDriverState *bs, uint8_t *buf,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_load_vmstate)
-        return -ENOTSUP;
-    return drv->bdrv_load_vmstate(bs, buf, pos, size);
+    if (drv->bdrv_load_vmstate)
+        return drv->bdrv_load_vmstate(bs, buf, pos, size);
+    if (bs->file)
+        return bdrv_load_vmstate(bs->file, buf, pos, size);
+    return -ENOTSUP;
 }
 
 void bdrv_debug_event(BlockDriverState *bs, BlkDebugEvent event)
@@ -1720,20 +1724,37 @@ int bdrv_snapshot_create(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_create)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_create(bs, sn_info);
+    if (drv->bdrv_snapshot_create)
+        return drv->bdrv_snapshot_create(bs, sn_info);
+    if (bs->file)
+        return bdrv_snapshot_create(bs->file, sn_info);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_goto(BlockDriverState *bs,
                        const char *snapshot_id)
 {
     BlockDriver *drv = bs->drv;
+    int ret, open_ret;
+
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_goto)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_goto(bs, snapshot_id);
+    if (drv->bdrv_snapshot_goto)
+        return drv->bdrv_snapshot_goto(bs, snapshot_id);
+
+    if (bs->file) {
+        drv->bdrv_close(bs);
+        ret = bdrv_snapshot_goto(bs->file, snapshot_id);
+        open_ret = drv->bdrv_open(bs, bs->open_flags);
+        if (open_ret < 0) {
+            bdrv_delete(bs);
+            bs->drv = NULL;
+            return open_ret;
+        }
+        return ret;
+    }
+
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
@@ -1741,9 +1762,11 @@ int bdrv_snapshot_delete(BlockDriverState *bs, const char *snapshot_id)
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_delete)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_delete(bs, snapshot_id);
+    if (drv->bdrv_snapshot_delete)
+        return drv->bdrv_snapshot_delete(bs, snapshot_id);
+    if (bs->file)
+        return bdrv_snapshot_delete(bs->file, snapshot_id);
+    return -ENOTSUP;
 }
 
 int bdrv_snapshot_list(BlockDriverState *bs,
@@ -1752,9 +1775,11 @@ int bdrv_snapshot_list(BlockDriverState *bs,
     BlockDriver *drv = bs->drv;
     if (!drv)
         return -ENOMEDIUM;
-    if (!drv->bdrv_snapshot_list)
-        return -ENOTSUP;
-    return drv->bdrv_snapshot_list(bs, psn_info);
+    if (drv->bdrv_snapshot_list)
+        return drv->bdrv_snapshot_list(bs, psn_info);
+    if (bs->file)
+        return bdrv_snapshot_list(bs->file, psn_info);
+    return -ENOTSUP;
 }
 
 #define NB_SUFFIXES 4
-- 
1.5.6.5
[Qemu-devel] [RFC PATCH v3 1/3] close all the block drivers before the qemu process exits
This patch calls the close handler of the block driver before the qemu process exits. This is necessary because the sheepdog block driver releases the lock of VM images in the close handler.

Signed-off-by: MORITA Kazutaka
---
 block.c |    9 +
 block.h |    1 +
 vl.c    |    1 +
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index bfe46e3..f3bf3f2 100644
--- a/block.c
+++ b/block.c
@@ -636,6 +636,15 @@ void bdrv_close(BlockDriverState *bs)
     }
 }
 
+void bdrv_close_all(void)
+{
+    BlockDriverState *bs;
+
+    QTAILQ_FOREACH(bs, &bdrv_states, list) {
+        bdrv_close(bs);
+    }
+}
+
 void bdrv_delete(BlockDriverState *bs)
 {
     /* remove from list, if necessary */
diff --git a/block.h b/block.h
index 278259c..531e802 100644
--- a/block.h
+++ b/block.h
@@ -121,6 +121,7 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 /* Ensure contents are flushed to disk. */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
+void bdrv_close_all(void);
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
diff --git a/vl.c b/vl.c
index 85bcc84..5ce7807 100644
--- a/vl.c
+++ b/vl.c
@@ -2007,6 +2007,7 @@ static void main_loop(void)
             exit(0);
         }
     }
+    bdrv_close_all();
     pause_all_vcpus();
 }
-- 
1.5.6.5
[Qemu-devel] [RFC PATCH v3 0/3] Sheepdog: distributed storage system for QEMU
Hi all,

This patch adds a block driver for the Sheepdog distributed storage system. Changes from v2 to v3 are:

- add drv->bdrv_close() and drv->bdrv_open() calls before and after the bdrv_snapshot_goto() call of the protocol
- address the review comments on the sheepdog driver code; I'll send the details in the reply to the review mail

Changes from v1 to v2 are:

- rebase onto git://repo.or.cz/qemu/kevin.git block
- modify the sheepdog driver as a protocol driver
- add new patch to call the snapshot handler of the protocol

If this patchset is okay, I'll work on the image creation problem of the protocol driver.

Thanks,

Kazutaka

MORITA Kazutaka (3):
  close all the block drivers before the qemu process exits
  block: call the snapshot handlers of the protocol drivers
  block: add sheepdog driver for distributed storage support

 Makefile.objs    |    2 +-
 block.c          |   70 ++-
 block.h          |    1 +
 block/sheepdog.c | 1845 ++
 vl.c             |    1 +
 5 files changed, 1900 insertions(+), 19 deletions(-)
 create mode 100644 block/sheepdog.c
[Qemu-devel] [RFC PATCH v3 3/3] block: add sheepdog driver for distributed storage support
Sheepdog is a distributed storage system for QEMU. It provides highly available block level storage volumes to VMs like Amazon EBS. This patch adds a qemu block driver for Sheepdog.

Sheepdog features are:
- No node in the cluster is special (no metadata node, no control node, etc)
- Linear scalability in performance and capacity
- No single point of failure
- Autonomous management (zero configuration)
- Useful volume management support such as snapshot and cloning
- Thin provisioning
- Autonomous load balancing

More details are available at the project site: http://www.osrg.net/sheepdog/

Signed-off-by: MORITA Kazutaka
---
 Makefile.objs    |    2 +-
 block/sheepdog.c | 1845 ++
 2 files changed, 1846 insertions(+), 1 deletions(-)
 create mode 100644 block/sheepdog.c

diff --git a/Makefile.objs b/Makefile.objs
index ecdd53e..6edbc57 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += parallels.o nbd.o blkdebug.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/sheepdog.c b/block/sheepdog.c
new file mode 100644
index 000..4672f00
--- /dev/null
+++ b/block/sheepdog.c
@@ -0,0 +1,1845 @@
+/*
+ * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */ +#include +#include + +#include "qemu-common.h" +#include "qemu-error.h" +#include "block_int.h" + +#define SD_PROTO_VER 0x01 + +#define SD_DEFAULT_ADDR "localhost:7000" + +#define SD_OP_CREATE_AND_WRITE_OBJ 0x01 +#define SD_OP_READ_OBJ 0x02 +#define SD_OP_WRITE_OBJ 0x03 + +#define SD_OP_NEW_VDI0x11 +#define SD_OP_LOCK_VDI 0x12 +#define SD_OP_RELEASE_VDI0x13 +#define SD_OP_GET_VDI_INFO 0x14 +#define SD_OP_READ_VDIS 0x15 + +#define SD_FLAG_CMD_WRITE0x01 +#define SD_FLAG_CMD_COW 0x02 + +#define SD_RES_SUCCESS 0x00 /* Success */ +#define SD_RES_UNKNOWN 0x01 /* Unknown error */ +#define SD_RES_NO_OBJ0x02 /* No object found */ +#define SD_RES_EIO 0x03 /* I/O error */ +#define SD_RES_VDI_EXIST 0x04 /* Vdi exists already */ +#define SD_RES_INVALID_PARMS 0x05 /* Invalid parameters */ +#define SD_RES_SYSTEM_ERROR 0x06 /* System error */ +#define SD_RES_VDI_LOCKED0x07 /* Vdi is locked */ +#define SD_RES_NO_VDI0x08 /* No vdi found */ +#define SD_RES_NO_BASE_VDI 0x09 /* No base vdi found */ +#define SD_RES_VDI_READ 0x0A /* Cannot read requested vdi */ +#define SD_RES_VDI_WRITE 0x0B /* Cannot write requested vdi */ +#define SD_RES_BASE_VDI_READ 0x0C /* Cannot read base vdi */ +#define SD_RES_BASE_VDI_WRITE 0x0D /* Cannot write base vdi */ +#define SD_RES_NO_TAG0x0E /* Requested tag is not found */ +#define SD_RES_STARTUP 0x0F /* Sheepdog is on starting up */ +#define SD_RES_VDI_NOT_LOCKED 0x10 /* Vdi is not locked */ +#define SD_RES_SHUTDOWN 0x11 /* Sheepdog is shutting down */ +#define SD_RES_NO_MEM0x12 /* Cannot allocate memory */ +#define SD_RES_FULL_VDI 0x13 /* we already have the maximum vdis */ +#define SD_RES_VER_MISMATCH 0x14 /* Protocol version mismatch */ +#define SD_RES_NO_SPACE 0x15 /* Server has no room for new objects */ +#define SD_RES_WAIT_FOR_FORMAT 0x16 /* Sheepdog is waiting for a format operation */ +#define SD_RES_WAIT_FOR_JOIN0x17 /* Sheepdog is waiting for other nodes joining */ +#define SD_RES_JOIN_FAILED 0x18 /* Target node had failed to join 
sheepdog */ + +/* + * Object ID rules + * + * 0 - 19 (20 bits): data object space + * 20 - 31 (12 bits): reserved data object space + * 32 - 55 (24 bits): vdi object space + * 56 - 59 ( 4 bits): reserved vdi object space + * 60 - 63 ( 4 bits): object type indentifier space + */ + +#define VDI_SPACE_SHIFT 32 +#define VDI_BIT (UINT64_C(1) << 63) +#define VMSTATE_BIT (UINT64_C(1) << 62) +#define MAX_DATA_OBJS (1ULL << 20) +#define MAX_CHILDREN 1024 +#define SD_MAX_VDI_LEN 256 +#define SD_NR_VDIS (1U << 24) +#define SD_DATA_OBJ_SIZE (UINT64_C(1) << 22) + +#define SD_INODE_SIZE (sizeof(SheepdogInode)) +#define CURRENT_VDI_ID 0 + +typedef struct SheepdogReq { + uint8_t proto_ver; + uint8_t opcode; + uint16_tflags;
[Qemu-devel] Re: [RFC PATCH v2 3/3] block: add sheepdog driver for distributed storage support
Hi,

Thank you very much for the review!

At Fri, 14 May 2010 13:08:06 +0200,
Kevin Wolf wrote:
> > +
> > +struct sd_req {
> > + uint8_t proto_ver;
> > + uint8_t opcode;
> > + uint16_t flags;
> > + uint32_t epoch;
> > + uint32_t id;
> > + uint32_t data_length;
> > + uint32_t opcode_specific[8];
> > +};
> 
> CODING_STYLE says that structs should be typedefed and their names
> should be in CamelCase. So something like this:
> 
> typedef struct SheepdogReq {
> ...
> } SheepdogReq;
> 
> (Or, if you prefer, SDReq; but with things like SDAIOCB I think it
> becomes hard to read)

I see. I use Sheepdog as a prefix, like SheepdogReq.

> > +/*
> > +
> > +#undef eprintf
> > +#define eprintf(fmt, args...) \
> > +do { \
> > + fprintf(stderr, "%s %d: " fmt, __func__, __LINE__, ##args); \
> > +} while (0)
> 
> What about using error_report() instead of fprintf? Though it should be
> the same currently.

Yes, using common helper functions is better. I replaced all the printf.

> > +
> > + for (i = 0; i < ARRAY_SIZE(errors); ++i)
> > + if (errors[i].err == err)
> > + return errors[i].desc;
> 
> CODING_STYLE requires braces here.

I fixed all the missing braces.

> > +
> > + return "Invalid error code";
> > +}
> > +
> > +static inline int before(uint32_t seq1, uint32_t seq2)
> > +{
> > + return (int32_t)(seq1 - seq2) < 0;
> > +}
> > +
> > +static inline int after(uint32_t seq1, uint32_t seq2)
> > +{
> > + return (int32_t)(seq2 - seq1) < 0;
> > +}
> 
> These functions look strange... Is the difference to seq1 < seq2 that
> the cast introduces intentional? (after(0x0, 0xabcdefff) == 1)
> 
> If yes, why is this useful? This needs a comment. If no, why even bother
> to have this function instead of directly using < or > ?

These functions are used to compare sequence numbers which can wrap around. For example, see linux/net/tcp.h in the Linux kernel. Anyway, sheepdog doesn't use these functions, so I removed them.
> > + if (snapid)
> > + dprintf("%" PRIx32 " non current inode was open.\n", vid);
> > + else
> > + s->is_current = 1;
> > +
> > + fd = connect_to_sdog(s->addr);
> 
> I wonder why you need to open another connection here instead of using
> s->fd. This pattern repeats at least in the snapshot functions, so I'm
> sure it's there for a reason. Maybe add a comment?
> 

We can use s->fd only for aio read/write operations. This is because the block driver may still be waiting for a response from the server, so we cannot send other requests on that descriptor without risking receiving wrong data. I added the comment to get_sheep_fd().

> > +
> > + iov.iov_base = &s->inode;
> > + iov.iov_len = sizeof(s->inode);
> > + aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
> > + data_len, offset, 0, 0, offset);
> > + if (!aio_req) {
> > + eprintf("too many requests\n");
> > + acb->ret = -EIO;
> > + goto out;
> > + }
> 
> Randomly failing requests is probably not a good idea. The guest might
> decide that the disk/file system is broken and stop using it. Can't you
> use a list like in AIOPool, so you can dynamically add new requests as
> needed?
> 

I agree. In the v3 patch, AIO requests are allocated dynamically, and all the requests are linked to the outstanding_aio_head in the BDRVSheepdogState.

> > +
> > +static int sd_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
> > +{
> > + struct bdrv_sd_state *s = bs->opaque;
> > + struct bdrv_sd_state *old_s;
> > + char vdi[SD_MAX_VDI_LEN];
> > + char *buf = NULL;
> > + uint32_t vid;
> > + uint32_t snapid = 0;
> > + int ret = -ENOENT, fd;
> > +
> > + old_s = qemu_malloc(sizeof(struct bdrv_sd_state));
> > + if (!old_s) {
> 
> qemu_malloc never returns NULL.
> 

I removed all the NULL checks.
> > +
> > +BlockDriver bdrv_sheepdog = {
> > + .format_name = "sheepdog",
> > + .protocol_name = "sheepdog",
> > + .instance_size = sizeof(struct bdrv_sd_state),
> > + .bdrv_file_open = sd_open,
> > + .bdrv_close = sd_close,
> > + .bdrv_create = sd_create,
> > +
> > + .bdrv_aio_readv = sd_aio_readv,
> > + .bdrv_aio_writev = sd_aio_writev,
> > +
> > + .bdrv_snapshot_create = sd_snapshot_create,
> > + .bdrv_snapshot_goto = sd_snapshot_goto,
> > + .bdrv_snapshot_delete = sd_snapshot_delete,
> > + .bdrv_snapshot_list = sd_snapshot_list,
> > +
> > + .bdrv_save_vmstate = sd_save_vmstate,
> > + .bdrv_load_vmstate = sd_load_vmstate,
> > +
> > + .create_options = sd_create_options,
> > +};
> 
> Please align the = to the same column, at least in each block.
> 

I have aligned them in the v3 patch.

Thanks,

Kazutaka
[Qemu-devel] Re: [RFC PATCH v3 2/3] block: call the snapshot handlers of the protocol drivers
At Mon, 17 May 2010 13:08:08 +0200,
Kevin Wolf wrote:
> 
> Am 17.05.2010 12:19, schrieb MORITA Kazutaka:
> >
> > int bdrv_snapshot_goto(BlockDriverState *bs,
> > const char *snapshot_id)
> > {
> > BlockDriver *drv = bs->drv;
> > +int ret, open_ret;
> > +
> > if (!drv)
> > return -ENOMEDIUM;
> > -if (!drv->bdrv_snapshot_goto)
> > -return -ENOTSUP;
> > -return drv->bdrv_snapshot_goto(bs, snapshot_id);
> > +if (drv->bdrv_snapshot_goto)
> > +return drv->bdrv_snapshot_goto(bs, snapshot_id);
> > +
> > +if (bs->file) {
> > +drv->bdrv_close(bs);
> > +ret = bdrv_snapshot_goto(bs->file, snapshot_id);
> > +open_ret = drv->bdrv_open(bs, bs->open_flags);
> > +if (open_ret < 0) {
> > +bdrv_delete(bs);
> 
> I think you mean bs->file here.
> 
> Kevin

This is an error of re-opening the format driver, so what we should delete here is not bs->file but bs, isn't it? If we fail to open bs here, the drive doesn't seem to work anymore.

Regards,
Kazutaka

> > +bs->drv = NULL;
> > +return open_ret;
> > +}
> > +return ret;
> > +}
> > +
> > +return -ENOTSUP;
> > }
Re: [Qemu-devel] Re: [RFC PATCH v3 2/3] block: call the snapshot handlers of the protocol drivers
On Mon, May 17, 2010 at 9:20 PM, Kevin Wolf wrote:
> Am 17.05.2010 14:19, schrieb MORITA Kazutaka:
>> At Mon, 17 May 2010 13:08:08 +0200,
>> Kevin Wolf wrote:
>>>
>>> Am 17.05.2010 12:19, schrieb MORITA Kazutaka:
>>>>
>>>> int bdrv_snapshot_goto(BlockDriverState *bs,
>>>> const char *snapshot_id)
>>>> {
>>>> BlockDriver *drv = bs->drv;
>>>> + int ret, open_ret;
>>>> +
>>>> if (!drv)
>>>> return -ENOMEDIUM;
>>>> - if (!drv->bdrv_snapshot_goto)
>>>> - return -ENOTSUP;
>>>> - return drv->bdrv_snapshot_goto(bs, snapshot_id);
>>>> + if (drv->bdrv_snapshot_goto)
>>>> + return drv->bdrv_snapshot_goto(bs, snapshot_id);
>>>> +
>>>> + if (bs->file) {
>>>> + drv->bdrv_close(bs);
>>>> + ret = bdrv_snapshot_goto(bs->file, snapshot_id);
>>>> + open_ret = drv->bdrv_open(bs, bs->open_flags);
>>>> + if (open_ret < 0) {
>>>> + bdrv_delete(bs);
>>>
>>> I think you mean bs->file here.
>>>
>>> Kevin
>>
>> This is an error of re-opening the format driver, so what we should
>> delete here is not bs->file but bs, isn't it? If we failed to open bs
>> here, the drive doesn't seem to work anymore.
> 
> But bdrv_delete means basically free it. This is almost guaranteed to
> lead to crashes because that BlockDriverState is still in use in other
> places.
> 
> One additional case of use after free is in the very next line:
> 
>>>> + bs->drv = NULL;
> 
> You can't do that when bs is freed, obviously. But I think just setting
> bs->drv to NULL without bdrv_deleting it before is the better way. It
> will fail any requests (with -ENOMEDIUM), but can't produce crashes.
> This is also what bdrv_commit does in such situations.
> 
> In this state, we don't access the underlying file any more, so we could
> delete bs->file - this is why I thought you actually meant to do that.
> I'm sorry for the confusion.

I understand what we should do here. I'll fix it for the next post.

Thanks,
Kazutaka
[Qemu-devel] [PATCH] add support for protocol driver create_options
This patch enables protocol drivers to use their create options which are not supported by the format. For example, protocol drivers can use a backing_file option with raw format.

Signed-off-by: MORITA Kazutaka
---
 block.c       |    7 +++
 block.h       |    1 +
 qemu-img.c    |   49 ++---
 qemu-option.c |   52 +---
 qemu-option.h |    2 ++
 5 files changed, 85 insertions(+), 26 deletions(-)

diff --git a/block.c b/block.c
index 48d8468..0ab9424 100644
--- a/block.c
+++ b/block.c
@@ -56,7 +56,6 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
                         uint8_t *buf, int nb_sectors);
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
                          const uint8_t *buf, int nb_sectors);
-static BlockDriver *find_protocol(const char *filename);
 
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
     QTAILQ_HEAD_INITIALIZER(bdrv_states);
@@ -210,7 +209,7 @@ int bdrv_create_file(const char* filename, QEMUOptionParameter *options)
 {
     BlockDriver *drv;
 
-    drv = find_protocol(filename);
+    drv = bdrv_find_protocol(filename);
     if (drv == NULL) {
         drv = bdrv_find_format("file");
     }
@@ -283,7 +282,7 @@ static BlockDriver *find_hdev_driver(const char *filename)
     return drv;
 }
 
-static BlockDriver *find_protocol(const char *filename)
+BlockDriver *bdrv_find_protocol(const char *filename)
 {
     BlockDriver *drv1;
     char protocol[128];
@@ -469,7 +468,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, int flags)
     BlockDriver *drv;
     int ret;
 
-    drv = find_protocol(filename);
+    drv = bdrv_find_protocol(filename);
     if (!drv) {
         return -ENOENT;
     }
diff --git a/block.h b/block.h
index 24efeb6..9034ebb 100644
--- a/block.h
+++ b/block.h
@@ -54,6 +54,7 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
+BlockDriver *bdrv_find_protocol(const char *filename);
 BlockDriver *bdrv_find_format(const char *format_name);
 BlockDriver *bdrv_find_whitelisted_format(const char *format_name);
 int bdrv_create(BlockDriver *drv, const char* filename,
diff --git a/qemu-img.c
b/qemu-img.c index d3c30a7..8ae7184 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -252,8 +252,8 @@ static int img_create(int argc, char **argv) const char *base_fmt = NULL; const char *filename; const char *base_filename = NULL; -BlockDriver *drv; -QEMUOptionParameter *param = NULL; +BlockDriver *drv, *proto_drv; +QEMUOptionParameter *param = NULL, *create_options = NULL; char *options = NULL; flags = 0; @@ -286,33 +286,42 @@ static int img_create(int argc, char **argv) } } +/* Get the filename */ +if (optind >= argc) +help(); +filename = argv[optind++]; + /* Find driver and parse its options */ drv = bdrv_find_format(fmt); if (!drv) error("Unknown file format '%s'", fmt); +proto_drv = bdrv_find_protocol(filename); +if (!proto_drv) +error("Unknown protocol '%s'", filename); + +create_options = append_option_parameters(create_options, + drv->create_options); +create_options = append_option_parameters(create_options, + proto_drv->create_options); + if (options && !strcmp(options, "?")) { -print_option_help(drv->create_options); +print_option_help(create_options); return 0; } /* Create parameter list with default values */ -param = parse_option_parameters("", drv->create_options, param); +param = parse_option_parameters("", create_options, param); set_option_parameter_int(param, BLOCK_OPT_SIZE, -1); /* Parse -o options */ if (options) { -param = parse_option_parameters(options, drv->create_options, param); +param = parse_option_parameters(options, create_options, param); if (param == NULL) { error("Invalid options for file format '%s'.", fmt); } } -/* Get the filename */ -if (optind >= argc) -help(); -filename = argv[optind++]; - /* Add size to parameters */ if (optind < argc) { set_option_parameter(param, BLOCK_OPT_SIZE, argv[optind++]); @@ -362,6 +371,7 @@ static int img_create(int argc, char **argv) puts(""); ret = bdrv_create(drv, filename, param); +free_option_parameters(create_options); free_option_parameters(param); if (ret < 0) { @@ -543,14 +553,14 @@ static int 
img_convert(int argc, char **argv) { int c, ret, n, n1, bs_n, bs_i, flags, cluster_size, cluster_sectors; const char *fmt, *
[Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
Hi everyone,

Sheepdog is a distributed storage system for KVM/QEMU. It provides highly available block level storage volumes to VMs like Amazon EBS. Sheepdog supports advanced volume management features such as snapshot, cloning, and thin provisioning. Sheepdog runs on several tens or hundreds of nodes, and the architecture is fully symmetric; there is no central node such as a meta-data server.

The following list describes the features of Sheepdog.

* Linear scalability in performance and capacity
* No single point of failure
* Redundant architecture (data is written to multiple nodes)
  - Tolerance against network failure
* Zero configuration (newly added machines will join the cluster automatically)
  - Autonomous load balancing
* Snapshot
  - Online snapshot from qemu-monitor
* Clone from a snapshot volume
* Thin provisioning
  - Amazon EBS API support (to use from a Eucalyptus instance)

(* = current features, - = on our todo list)

More details and download links are here:

http://www.osrg.net/sheepdog/

Note that the code is still in an early stage. There are some critical TODO items:

- VM image deletion support
- Support architectures other than X86_64
- Data recovery
- Free space management
- Guarantee reliability and availability under heavy load
- Performance improvement
- Reclaim unused blocks
- More documentation

We hope to find people interested in working together. Enjoy!
Here are examples:

- create images

$ kvm-img create -f sheepdog "Alice's Disk" 256G
$ kvm-img create -f sheepdog "Bob's Disk" 256G

- list images

$ shepherd info -t vdi
4 : Alice's Disk  256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:17:18, tag: 0, current
8 : Bob's Disk    256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15 16:29:20, tag: 0, current

- start up a virtual machine

$ kvm --drive format=sheepdog,file="Alice's Disk"

- create a snapshot

$ kvm-img snapshot -c name sheepdog:"Alice's Disk"

- clone from a snapshot

$ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"

Thanks.

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp
[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
Hello,

Does the following patch work for you?

diff --git a/sheep/work.c b/sheep/work.c
index 4df8dc0..45f362d 100644
--- a/sheep/work.c
+++ b/sheep/work.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#define _LINUX_FCNTL_H
 #include
 #include "list.h"

On Wed, Oct 21, 2009 at 5:45 PM, Nikolai K. Bochev wrote:
> Hello,
>
> I am getting the following error trying to compile sheepdog on Ubuntu 9.10 (2.6.31-14 x64):
>
> cd shepherd; make
> make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE shepherd.c -o shepherd.o
> shepherd.c: In function ‘main’:
> shepherd.c:300: warning: dereferencing pointer ‘hdr.55’ does break strict-aliasing rules
> shepherd.c:300: note: initialized from here
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE treeview.c -o treeview.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/event.c -o ../lib/event.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/net.c -o ../lib/net.o
> ../lib/net.c: In function ‘write_object’:
> ../lib/net.c:358: warning: ‘vosts’ may be used uninitialized in this function
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE ../lib/logger.c -o ../lib/logger.o
> cc shepherd.o treeview.o ../lib/event.o ../lib/net.o ../lib/logger.o -o shepherd -lncurses -lcrypto
> make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/shepherd'
> cd sheep; make
> make[1]: Entering directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE sheep.c -o sheep.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE store.c -o store.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE net.c -o net.o
> cc -c -g -O2 -Wall -Wstrict-prototypes -I../include -D_GNU_SOURCE work.c -o work.o
> In file included from /usr/include/asm/fcntl.h:1,
>                  from /usr/include/linux/fcntl.h:4,
>                  from /usr/include/linux/signalfd.h:13,
>                  from work.c:31:
> /usr/include/asm-generic/fcntl.h:117: error: redefinition of ‘struct flock’
> /usr/include/asm-generic/fcntl.h:140: error: redefinition of ‘struct flock64’
> make[1]: *** [work.o] Error 1
> make[1]: Leaving directory `/home/shiny/Packages/sheepdog-2009102101/sheep'
> make: *** [all] Error 2
>
> I have all the required libs installed. Patching and compiling qemu-kvm went flawless.
>
> ----- Original Message -----
> From: "MORITA Kazutaka"
> To: k...@vger.kernel.org, qemu-devel@nongnu.org, linux-fsde...@vger.kernel.org
> Sent: Wednesday, October 21, 2009 8:13:47 AM
> Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
>
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.
>
> The following list describes the features of Sheepdog.
>
> * Linear scalability in performance and capacity
> * No single point of failure
> * Redundant architecture (data is written to multiple nodes)
> - Tolerance against network failure
> * Zero configuration (newly added machines will join the cluster
> automatically)
> - Autonomous load balancing
> * Snapshot
> - Online snapshot from qemu-monitor
> * Clone from a snapshot volume
> * Thin provisioning
> - Amazon EBS API support (to use from a Eucalyptus instance)
>
> (* = current features, - = on our todo list)
>
> More details and download links are here:
>
> http://www.osrg.net/sheepdog/
>
> Note that the code is still in an early stage.
> There are some critical TODO items:
>
> - VM image deletion support
> - Support architectures other than X86_64
> - Data recoverys
> - Free space management
> - Guarantee reliability and availability under heavy load
> - Performance improvement
> - Reclaim unused blocks
> - More documentation
>
> We hope finding people interested in working together.
> Enjoy!
>
>
> Here are examples:
>
> - create images
>
> $ kvm-img create -f sheepdog "Alice's Disk" 256G
> $ kvm-img create -f sheepdog "Bob's Disk" 256G
>
> - list images
>
> $ shepherd info -t vdi
> 4
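For anyone curious why a bare `#define _LINUX_FCNTL_H` fixes the redefinition error: that macro is the include guard of <linux/fcntl.h>, so pre-defining it makes the kernel header (pulled in via <linux/signalfd.h>) expand to nothing, leaving only glibc's struct flock definition in the translation unit. A self-contained sketch of the same trick, with hypothetical header and guard names standing in for the real ones:

```c
/* Stand-in for the body of a guarded system header (hypothetical name). */
#define CONFLICTING_HEADER_BODY \
    struct demo_flock { int l_type; };

/* First definition, as glibc's <fcntl.h> would provide it. */
struct demo_flock { int l_type; };

/* Pre-define the guard, just as "#define _LINUX_FCNTL_H" does in the
 * patch above (DEMO_GUARD_H plays the role of _LINUX_FCNTL_H). */
#define DEMO_GUARD_H

/* Second "inclusion": the guard is already defined, so the body is
 * skipped and there is no redefinition of struct demo_flock. */
#ifndef DEMO_GUARD_H
CONFLICTING_HEADER_BODY
#endif

/* Returns 1 if the single surviving definition is usable. */
int demo(void)
{
    struct demo_flock f = { 1 };
    return f.l_type;
}
```

The downside of the trick is that it silently suppresses everything else the kernel header would declare, which is why it is only a workaround for the header clash, not a general fix.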
[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
We use JGroups (Java library) for reliable multicast communication in our cluster manager daemon. We don't worry about the performance much, since the cluster manager daemon is not involved in the I/O path. We might think about moving to corosync if it is more stable than JGroups.

On Wed, Oct 21, 2009 at 6:08 PM, Dietmar Maurer wrote:
> Quite interesting. But would it be possible to use corosync for the
> cluster communication? The point is that we need corosync anyways for
> pacemaker, it is written in C (high performance) and seem to implement
> the feature you need?
>
>> -----Original Message-----
>> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
>> Behalf Of MORITA Kazutaka
>> Sent: Mittwoch, 21. Oktober 2009 07:14
>> To: k...@vger.kernel.org; qemu-devel@nongnu.org; linux-fsde...@vger.kernel.org
>> Subject: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
>>
>> Hi everyone,
>>
>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>> highly available block level storage volumes to VMs like Amazon EBS.
>> Sheepdog supports advanced volume management features such as snapshot,
>> cloning, and thin provisioning. Sheepdog runs on several tens or
>> hundreds of nodes, and the architecture is fully symmetric; there is
>> no central node such as a meta-data server.
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp
[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity wrote:
> On 10/21/2009 07:13 AM, MORITA Kazutaka wrote:
>>
>> Hi everyone,
>>
>> Sheepdog is a distributed storage system for KVM/QEMU. It provides
>> highly available block level storage volumes to VMs like Amazon EBS.
>> Sheepdog supports advanced volume management features such as snapshot,
>> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
>> of nodes, and the architecture is fully symmetric; there is no central
>> node such as a meta-data server.
>
> Very interesting! From a very brief look at the code, it looks like the
> sheepdog block format driver is a network client that is able to access
> highly available images, yes?

Yes. Sheepdog is a simple key-value storage system that consists of multiple nodes (a bit similar to Amazon Dynamo, I guess). The qemu Sheepdog driver (client) divides a VM image into fixed-size objects and stores them on the key-value storage system.

> If so, is it reasonable to compare this to a cluster file system setup (like
> GFS) with images as files on this filesystem? The difference would be that
> clustering is implemented in userspace in sheepdog, but in the kernel for a
> clustering filesystem.

I think the major difference between Sheepdog and cluster file systems such as the Google File System, pNFS, etc. is the interface between clients and the storage system.

> How is load balancing implemented? Can you move an image transparently
> while a guest is running? Will an image be moved closer to its guest?

Sheepdog uses consistent hashing to decide where objects are stored; I/O load is balanced across the nodes. When a new node is added or an existing node is removed, the hash table changes and the data is moved across nodes automatically and transparently.

We plan to implement a mechanism to distribute the data not randomly but intelligently; we could use machine load, the locations of VMs, etc.

> Can you stripe an image across nodes?

Yes, a VM image is divided into multiple objects, and they are stored across the nodes.

> Do you support multiple guests accessing the same image?

A VM image can be attached to any VM, but only to one VM at a time; multiple running VMs cannot access the same VM image.

> What about fault tolerance - storing an image redundantly on multiple nodes?

Yes, all objects are replicated to multiple nodes.

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp
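To illustrate the placement scheme described above: each node gets a position on a hash ring, and an object is assigned to the first node at or after the object's own hash, wrapping around the ring. A toy sketch of that lookup -- the FNV-1a hash and flat ring positions are illustrative choices for this example, not Sheepdog's actual layout:

```c
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a hash (an arbitrary hash choice for this sketch). */
static uint64_t fnv1a(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 0xcbf29ce484222325ULL;

    while (len--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Consistent-hash placement: return the index of the node whose ring
 * position is the first one at or after the object's hash, wrapping
 * around to the lowest position if the hash is past every node. */
static int pick_node(uint64_t oid, const uint64_t *ring, int nr_nodes)
{
    uint64_t h = fnv1a(&oid, sizeof(oid));
    int best = -1, lowest = 0;

    for (int i = 0; i < nr_nodes; i++) {
        if (ring[i] < ring[lowest])
            lowest = i;
        if (ring[i] >= h && (best < 0 || ring[i] < ring[best]))
            best = i;
    }
    return best >= 0 ? best : lowest;
}
```

The property that makes rebalancing transparent: when a node joins, only the objects whose hashes fall between its predecessor's ring position and its own move to it; every other object keeps its placement.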
Re: [Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
On Fri, Oct 23, 2009 at 8:10 PM, Alexander Graf wrote:
> On 23.10.2009, at 12:41, MORITA Kazutaka wrote:
>> On Fri, Oct 23, 2009 at 12:30 AM, Avi Kivity wrote:
>>> How is load balancing implemented? Can you move an image transparently
>>> while a guest is running? Will an image be moved closer to its guest?
>>
>> Sheepdog uses consistent hashing to decide where objects are stored; I/O
>> load is balanced across the nodes. When a new node is added or an
>> existing node is removed, the hash table changes and the data is moved
>> across nodes automatically and transparently.
>>
>> We plan to implement a mechanism to distribute the data not randomly
>> but intelligently; we could use machine load, the locations of VMs, etc.
>
> What exactly does balanced mean? Can it cope with individual nodes having
> more disk space than others?

I mean that objects are uniformly distributed over the nodes by the hash function. Distribution using free disk space information is one of our TODOs.

>> Do you support multiple guests accessing the same image?
>>
>> A VM image can be attached to any VM, but only to one VM at a time;
>> multiple running VMs cannot access the same VM image.
>
> What about read-only access? Imagine you'd have 5 kvm instances each
> accessing it using -snapshot.

By creating new clone images from an existing snapshot, you can do something similar. Sheepdog can create a clone image instantly.

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp
Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
Hi,

Thanks for the many comments. Sheepdog git trees have been created.

Sheepdog server:
git://sheepdog.git.sourceforge.net/gitroot/sheepdog/sheepdog

Sheepdog client:
git://sheepdog.git.sourceforge.net/gitroot/sheepdog/qemu-kvm

Please try them!

On Wed, Oct 21, 2009 at 2:13 PM, MORITA Kazutaka wrote:
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.
>
> The following list describes the features of Sheepdog.
>
> * Linear scalability in performance and capacity
> * No single point of failure
> * Redundant architecture (data is written to multiple nodes)
> - Tolerance against network failure
> * Zero configuration (newly added machines will join the cluster
> automatically)
> - Autonomous load balancing
> * Snapshot
> - Online snapshot from qemu-monitor
> * Clone from a snapshot volume
> * Thin provisioning
> - Amazon EBS API support (to use from a Eucalyptus instance)
>
> (* = current features, - = on our todo list)
>
> More details and download links are here:
>
> http://www.osrg.net/sheepdog/
>
> Note that the code is still in an early stage.
> There are some critical TODO items:
>
> - VM image deletion support
> - Support architectures other than X86_64
> - Data recoverys
> - Free space management
> - Guarantee reliability and availability under heavy load
> - Performance improvement
> - Reclaim unused blocks
> - More documentation
>
> We hope finding people interested in working together.
> Enjoy!
>
>
> Here are examples:
>
> - create images
>
> $ kvm-img create -f sheepdog "Alice's Disk" 256G
> $ kvm-img create -f sheepdog "Bob's Disk" 256G
>
> - list images
>
> $ shepherd info -t vdi
> 4 : Alice's Disk 256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15
> 16:17:18, tag: 0, current
> 8 : Bob's Disk 256 GB (allocated: 0 MB, shared: 0 MB), 2009-10-15
> 16:29:20, tag: 0, current
>
> - start up a virtual machine
>
> $ kvm --drive format=sheepdog,file="Alice's Disk"
>
> - create a snapshot
>
> $ kvm-img snapshot -c name sheepdog:"Alice's Disk"
>
> - clone from a snapshot
>
> $ kvm-img create -b sheepdog:"Alice's Disk":0 -f sheepdog "Charlie's Disk"
>
> Thanks.
>
> --
> MORITA, Kazutaka
>
> NTT Cyber Space Labs
> OSS Computing Project
> Kernel Group
> E-mail: morita.kazut...@lab.ntt.co.jp

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp
Re: [Qemu-devel] [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
On Sat, Oct 24, 2009 at 4:45 AM, Javier Guerra wrote:
> On Fri, Oct 23, 2009 at 2:39 PM, MORITA Kazutaka wrote:
>> Thanks for many comments.
>>
>> Sheepdog git trees are created.
>
> great!
>
> is there any client (no matter how crude) besides the patched
> KVM/Qemu? it would make it far easier to hack around...

No, there isn't. Sorry. I think we should provide a test client as soon as possible.

--
MORITA, Kazutaka

NTT Cyber Space Labs
OSS Computing Project
Kernel Group
E-mail: morita.kazut...@lab.ntt.co.jp
[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
On 2009/10/25 17:51, Dietmar Maurer wrote:
>> Do you support multiple guests accessing the same image?
>>
>> A VM image can be attached to any VM, but only to one VM at a time;
>> multiple running VMs cannot access the same VM image.
>
> I guess this is a problem when you want to do live migrations?

Yes, because Sheepdog locks a VM image when it is opened. To avoid this problem, locking must be delayed until migration has completed. This is also a TODO item.

--
MORITA Kazutaka
[Qemu-devel] Re: [ANNOUNCE] Sheepdog: Distributed Storage System for KVM
On 2009/10/21 14:13, MORITA Kazutaka wrote:
> Hi everyone,
>
> Sheepdog is a distributed storage system for KVM/QEMU. It provides
> highly available block level storage volumes to VMs like Amazon EBS.
> Sheepdog supports advanced volume management features such as snapshot,
> cloning, and thin provisioning. Sheepdog runs on several tens or hundreds
> of nodes, and the architecture is fully symmetric; there is no central
> node such as a meta-data server.

We added some pages to the Sheepdog website:

Design: http://www.osrg.net/sheepdog/design.html
FAQ:    http://www.osrg.net/sheepdog/faq.html

The Sheepdog mailing list is also ready to use (thanks to Tomasz):

Subscribe/Unsubscribe/Preferences
http://lists.wpkg.org/mailman/listinfo/sheepdog

Archive
http://lists.wpkg.org/pipermail/sheepdog/

We are always looking for developers and users interested in participating in the Sheepdog project!

Thanks.

MORITA Kazutaka