Re: [sheepdog] [PATCH v2] sheep: make config file compatible with the previous one
On 08/21/2012 02:37 AM, MORITA Kazutaka wrote:
> Signed-off-by: MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp>
> ---
> Changes from v1:
>  - remove 'version' from sheepdog_config
>
> Even if we don't support a version check of the config in the next
> release, we should fix the compatibility issue at least.
>
>  sheep/store.c | 3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/sheep/store.c b/sheep/store.c
> index 542804a..fcbf32d 100644
> --- a/sheep/store.c
> +++ b/sheep/store.c
> @@ -30,10 +30,11 @@
>  struct sheepdog_config {
>  	uint64_t ctime;
> -	uint64_t space;
>  	uint16_t flags;
>  	uint8_t copies;
>  	uint8_t store[STORE_LEN];
> +	uint8_t __pad[5];
> +	uint64_t space;
>  };
>
>  char *obj_path;

Applied, thanks.

Yuan

-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 1:50 PM, Dietmar Maurer <diet...@proxmox.com> wrote:
>> Disabling automatic recovery by default doesn't work for you? You can
>> control the time to start recovery with 'collie cluster recover enable'.
>
> It just looks strange to me to design the system for immediate/automatic
> recovery, and make 'disabling automatic recovery' an option. I would
> include the node state in the epoch. But maybe that is only an
> implementation detail.
>
> - Dietmar

Hi folks:

I need a conclusion: does sheepdog need the delayed recovery supported by
this series (or by Kazutaka's new idea and implementation)?

-- 
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command
On Tue, Aug 21, 2012 at 1:48 PM, Yunkai Zhang <yunkai...@gmail.com> wrote:
> On Tue, Aug 21, 2012 at 1:42 PM, MORITA Kazutaka
> <morita.kazut...@lab.ntt.co.jp> wrote:
>> At Thu, 16 Aug 2012 22:38:21 +0800, Yunkai Zhang wrote:
>>> After adding the '-F' flag, the help looks like:
>>>
>>> $ collie vdi check
>>> Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] vdiname
>>> Options:
>>>   -F, --force_repair      force repair object's copies (dangerous)
>>
>> How about '-r, --repair'?
>
> Good for me.

When I submitted this patch, my new patch "collie: add self options to
collie's command" hadn't been developed yet, so a '-r, --repair' option
would conflict with the '-r, --raw' option. Since these two patches are
dependent, I'll put them in a series in V2.

>>>  		fprintf(stderr, "Failed to read, %s\n", sd_strerror(rsp->result));
>>>  		exit(EXIT_FAILURE);
>>>  	}
>>> -	return buf;
>>> +
>>> +	memcpy(sha1, (unsigned char *)&rsp->__pad[0], SHA1_LEN);
>>
>> Please define a member name instead of using __pad.
>
> OK.

>>>  }
>>>
>>> -static void write_object_to(struct sd_vnode *vnode, uint64_t oid, void *buf)
>>> +static int do_repair(uint64_t oid, struct node_id *src, struct node_id *dest)
>>>  {
>>>  	struct sd_req hdr;
>>>  	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
>>> +	unsigned rlen, wlen;
>>> +	char host[128];
>>>  	int fd, ret;
>>> -	unsigned wlen = SD_DATA_OBJ_SIZE, rlen = 0;
>>> -	char name[128];
>>>
>>> -	addr_to_str(name, sizeof(name), vnode->nid.addr, 0);
>>> -	fd = connect_to(name, vnode->nid.port);
>>> +	addr_to_str(host, sizeof(host), dest->addr, 0);
>>> +
>>> +	fd = connect_to(host, dest->port);
>>>  	if (fd < 0) {
>>> -		fprintf(stderr, "failed to connect to %s:%"PRIu32"\n",
>>> -			name, vnode->nid.port);
>>> -		exit(EXIT_FAILURE);
>>> +		fprintf(stderr, "Failed to connect\n");
>>> +		return SD_RES_EIO;
>>>  	}
>>>
>>> -	sd_init_req(&hdr, SD_OP_WRITE_PEER);
>>> -	hdr.epoch = sd_epoch;
>>> -	hdr.flags = SD_FLAG_CMD_WRITE;
>>> -	hdr.data_length = wlen;
>>> +	sd_init_req(&hdr, SD_OP_REPAIR_OBJ_PEER);
>>
>> I don't think sending peer requests directly from outside sheeps is a
>> good idea.  How about making the gateway node forward the requests?
>
> Ok, no problem.

>>> +	rlen = 0;
>>> +	wlen = sizeof(*src);
>>> +
>>> +	hdr.epoch = sd_epoch;
>>>  	hdr.obj.oid = oid;
>>> +	hdr.data_length = wlen;
>>> +	hdr.flags = SD_FLAG_CMD_WRITE;
>>>
>>> -	ret = exec_req(fd, &hdr, buf, &wlen, &rlen);
>>> +	ret = exec_req(fd, &hdr, src, &wlen, &rlen);
>>>  	close(fd);
>>>
>>>  	if (ret) {
>>> -		fprintf(stderr, "Failed to execute request\n");
>>> -		exit(EXIT_FAILURE);
>>> +		fprintf(stderr, "Failed to repair oid:%"PRIx64"\n", oid);
>>> +		return SD_RES_EIO;
>>>  	}
>>>
>>>  	if (rsp->result != SD_RES_SUCCESS) {
>>> -		fprintf(stderr, "Failed to read, %s\n",
>>> -			sd_strerror(rsp->result));
>>> -		exit(EXIT_FAILURE);
>>> +		fprintf(stderr, "Failed to repair oid:%"PRIx64", %s\n",
>>> +			oid, sd_strerror(rsp->result));
>>> +		return rsp->result;
>>>  	}
>>> +
>>> +	return SD_RES_SUCCESS;
>>>  }
>>>
>>> -/*
>>> - * Fix consistency of the replica of oid.
>>> - *
>>> - * XXX: The fix is rather dumb, just read the first copy and write it
>>> - * to other replica.
>>> - */
>>> -static void do_check_repair(uint64_t oid, int nr_copies)
>>> +static int do_check_repair(uint64_t oid, int nr_copies)
>>>  {
>>>  	struct sd_vnode *tgt_vnodes[nr_copies];
>>> -	void *buf, *buf_cmp;
>>> -	int i;
>>> +	unsigned char sha1[SD_MAX_COPIES][SHA1_LEN];
>>> +	char host[128];
>>> +	int i, j;
>>>
>>>  	oid_to_vnodes(sd_vnodes, sd_vnodes_nr, oid, nr_copies, tgt_vnodes);
>>> -	buf = read_object_from(tgt_vnodes[0], oid);
>>> -	for (i = 1; i < nr_copies; i++) {
>>> -		buf_cmp = read_object_from(tgt_vnodes[i], oid);
>>> -		if (memcmp(buf, buf_cmp, SD_DATA_OBJ_SIZE)) {
>>> -			free(buf_cmp);
>>> -			goto fix_consistency;
>>> +	for (i = 0; i < nr_copies; i++) {
>>> +		get_obj_checksum_from(tgt_vnodes[i], oid, sha1[i]);
>>> +	}
>>> +
>>> +	for (i = 0; i < nr_copies; i++) {
>>> +		for (j = (i + 1); j < nr_copies; j++) {
>>> +			if (memcmp(sha1[i], sha1[j], SHA1_LEN))
>>> +				goto diff;
>>>  		}
>>> -		free(buf_cmp);
>>>  	}
>>> -	free(buf);
>>> -	return;
>>> +	return 0;
>>>
>>> -fix_consistency:
>>> -	for (i = 1; i < nr_copies; i++)
>>> -		write_object_to(tgt_vnodes[i], oid, buf);
>>> -	fprintf(stdout, "fix %"PRIx64" success\n", oid);
>>> -	free(buf);
>>> +diff:
>>> +	fprintf(stderr, "Failed oid: %"PRIx64"\n", oid);
>>> +	for (i = 0; i < nr_copies; i++) {
>>> +		addr_to_str(host, sizeof(host), tgt_vnodes[i]->nid.addr, 0);
>>> +		fprintf(stderr, "copy[%d], sha1: %s, from: %s:%d\n",
>>> +			i, sha1_to_hex(sha1[i]), host, tgt_vnodes[i]->nid.port);
>>> +	}
>>> +
>>> +	if (!vdi_cmd_data.force_repair)
>>> +		return -1;
>>> +
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
Hi Dietmar, hi Yuan,

Am 2012-08-21 07:27, schrieb Dietmar Maurer:
> Membership change can happen for many reasons. It can happen if
> something is wrong on the switch (or if some admin configures the
> switch), a damaged network cable, a bug in the bonding driver, a
> damaged network card, or simply a power failure on a node, which
> reconnects after power is on.

At least David and I had this problem recently in our environment, and it
ended in a complete data loss. Not that it happens very often, or that it
must be handled in a way that keeps the cluster running when it happens,
but in my opinion this situation should be handled without data loss...

Cheers,
Bastian
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 14:14:23 +0800, Yunkai Zhang wrote:
> I need a conclusion: does sheepdog need the delayed recovery supported
> by this series (or by Kazutaka's new idea and implementation)?

There are two different discussions in this thread:

1. turning automatic recovery on/off with a collie command (supported by
   this series)
2. delaying the start of automatic recovery in any case

I think no one is against supporting 1.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:43 PM, MORITA Kazutaka
<morita.kazut...@lab.ntt.co.jp> wrote:
> There are two different discussions in this thread:
>
> 1. turning automatic recovery on/off with a collie command (supported by
>    this series)
> 2. delaying the start of automatic recovery in any case
>
> I think no one is against supporting 1.

OK, I'll continue to improve this series after I complete other things.

-- 
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
> Ok, I'll continue to improve this series after I complete other things.

Why not choose Kazutaka's idea to implement delayed recovery? It looks
simple yet efficient, at least to me.

Thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan <namei.u...@gmail.com> wrote:
> Why not choose Kazutaka's idea to implement delayed recovery? It looks
> simple yet efficient, at least to me.
>
> Thanks,
> Yuan

continue to import this series, of course including Kazutaka's idea if
it's the best way.

-- 
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 3:04 PM, Yunkai Zhang <yunkai...@gmail.com> wrote:
> continue to import this series, of course including Kazutaka's idea if
> it's the best way.

s/import/improve/

-- 
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command
On Tue, Aug 21, 2012 at 2:26 PM, Yunkai Zhang <yunkai...@gmail.com> wrote:
> On Tue, Aug 21, 2012 at 1:48 PM, Yunkai Zhang <yunkai...@gmail.com> wrote:
>> On Tue, Aug 21, 2012 at 1:42 PM, MORITA Kazutaka
>> <morita.kazut...@lab.ntt.co.jp> wrote:
>>> How about '-r, --repair'?
>>
>> Good for me.
>
> When I submitted this patch, my new patch "collie: add self options to
> collie's command" hadn't been developed yet, so a '-r, --repair' option
> would conflict with the '-r, --raw' option.

I found that the '-r, --raw' option is also needed by the 'collie vdi
list' command, so there is a name conflict. I would use '-R, --repair'
instead.

> Since these two patches are dependent, I'll put them in a series in V2.
[sheepdog] [PATCH V2 1/4] collie: add private options to collie's command
From: Yunkai Zhang <qiushu@taobao.com>

V2:
- update commit log, make it more descriptive
- use the 'options' name in the inner struct instead of 'self_options'

-- >8 --

Now all collie commands share the same global collie_options; this leads
to option-name conflicts among commands if they use the same option
letter but with different descriptions.

By moving the private options into an individual structure for each
command, and making collie_options contain only the common part, we can
solve this problem.

Signed-off-by: Yunkai Zhang <qiushu@taobao.com>
---
 collie/cluster.c | 24 ++--
 collie/collie.c  | 49 +
 collie/collie.h  |  1 +
 collie/vdi.c     | 56
 4 files changed, 84 insertions(+), 46 deletions(-)

diff --git a/collie/cluster.c b/collie/cluster.c
index 9302b78..2cca3ec 100644
--- a/collie/cluster.c
+++ b/collie/cluster.c
@@ -16,6 +16,17 @@
 
 #include "collie.h"
 
+static struct sd_option cluster_options[] = {
+	{'b', "store", 1, "specify backend store"},
+	{'c', "copies", 1, "specify the data redundancy (number of copies)"},
+	{'m', "mode", 1, "mode (safe, quorum, unsafe)"},
+	{'f', "force", 0, "do not prompt for confirmation"},
+	{'R', "restore", 1, "restore the cluster"},
+	{'l', "list", 0, "list the user epoch information"},
+
+	{ 0, NULL, 0, NULL },
+};
+
 struct cluster_cmd_data {
 	uint32_t epoch;
 	int list;
@@ -501,19 +512,20 @@ static int cluster_recover(int argc, char **argv)
 
 static struct subcommand cluster_cmd[] = {
 	{"info", NULL, "aprh", "show cluster information",
-	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_info},
+	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_info, cluster_options},
 	{"format", NULL, "bcmaph", "create a Sheepdog store",
-	 NULL, 0, cluster_format},
+	 NULL, 0, cluster_format, cluster_options},
 	{"shutdown", NULL, "aph", "stop Sheepdog",
-	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown},
+	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown, cluster_options},
 	{"snapshot", NULL, "aRlph", "snapshot/restore the cluster",
-	 NULL, 0, cluster_snapshot},
+	 NULL, 0, cluster_snapshot, cluster_options},
 	{"cleanup", NULL, "aph", "cleanup the useless snapshot data from recovery",
-	 NULL, 0, cluster_cleanup},
+	 NULL, 0, cluster_cleanup, cluster_options},
 	{"recover", NULL, "afph", "See 'collie cluster recover' for more information\n",
-	 cluster_recover_cmd, SUBCMD_FLAG_NEED_THIRD_ARG, cluster_recover},
+	 cluster_recover_cmd, SUBCMD_FLAG_NEED_THIRD_ARG,
+	 cluster_recover, cluster_options},
 	{NULL,},
 };
 
diff --git a/collie/collie.c b/collie/collie.c
index 32c044a..c1b1854 100644
--- a/collie/collie.c
+++ b/collie/collie.c
@@ -25,29 +25,13 @@
 int raw_output = 0;
 
 static const struct sd_option collie_options[] = {
-	/* common options */
+	/* common options for all collie commands */
 	{'a', "address", 1, "specify the daemon address (default: localhost)"},
 	{'p', "port", 1, "specify the daemon port"},
 	{'r', "raw", 0, "raw output mode: omit headers, separate fields with\n\
single spaces and print all sizes in decimal bytes"},
 	{'h', "help", 0, "display this help and exit"},
 
-	/* VDI options */
-	{'P', "prealloc", 0, "preallocate all the data objects"},
-	{'i', "index", 1, "specify the index of data objects"},
-	{'s', "snapshot", 1, "specify a snapshot id or tag name"},
-	{'x', "exclusive", 0, "write in an exclusive mode"},
-	{'d', "delete", 0, "delete a key"},
-	{'C', "cache", 0, "enable object cache"},
-
-	/* cluster options */
-	{'b', "store", 1, "specify backend store"},
-	{'c', "copies", 1, "specify the data redundancy (number of copies)"},
-	{'m', "mode", 1, "mode (safe, quorum, unsafe)"},
-	{'f', "force", 0, "do not prompt for confirmation"},
-	{'R', "restore", 1, "restore the cluster"},
-	{'l', "list", 0, "list the user epoch information"},
-
 	{ 0, NULL, 0, NULL },
 };
 
@@ -127,18 +111,34 @@ out:
 
 static int (*command_parser)(int, char *);
 static int (*command_fn)(int, char **);
-static const char *command_options;
+static const char *command_opts;
 static const char *command_arg;
 static const char *command_desc;
+static struct sd_option *command_options;
 
 static const struct sd_option *find_opt(int ch)
 {
 	int i;
+	struct sd_option *opt;
 
+	/* search for common options */
 	for (i = 0; i < ARRAY_SIZE(collie_options); i++) {
 		if (collie_options[i].val == ch)
 			return collie_options + i;
 	}
+
+	/* search for self options */
+	if (!command_options)
+		goto out;
+
+	opt = command_options;
+	while (opt->val) {
+		if (opt->val == ch)
[sheepdog] [PATCH V2 2/4] collie: optimize 'collie vdi check' command
From: Yunkai Zhang <qiushu@taobao.com>

V2:
- use '-R, --repair' instead of '-F, --force_repair'
- do not connect to the target sheep directly
- define a member name instead of using __pad
- update this commit log

-- >8 --

Reading all of a vdi's objects from the cluster when checking them wastes
a lot of network bandwidth; let's calculate the checksum of the objects
on the backend and only send the checksum result to the collie client.

And I think repairing an object automatically is dangerous, as we don't
know which replica is correct. In order to give the user a chance to
check the copies if necessary, I add a new option: '-R, --repair'. By
default, this command only checks and does not repair (as the command
name implies).

After adding the '-R' flag, the help looks like:

$ collie vdi check
Usage: collie vdi check [-R] [-s snapshot] [-a address] [-p port] [-h] vdiname
Options:
  -R, --repair            force repair object's copies (dangerous)
  -s, --snapshot          specify a snapshot id or tag name
  -a, --address           specify the daemon address (default: localhost)
  -p, --port              specify the daemon port
  -h, --help              display this help and exit

Some examples of executing this command:

* Success:

$ collie vdi check test.img
CHECKING VDI:test.img ...
PASSED

* Failure (by default, no repair):

$ collie vdi check test.img
CHECKING VDI:test.img ...
Failed oid: 9c5e680001
copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
FAILED

With the output shown above, the user can check all copies of this object
and decide which one is correct. (I plan to add a new option, '--oid', to
'collie vdi read' in another patch, so that the user can specify which
copy of an object to export:

$ collie vdi read test.img --oid 9c5e680001@127.0.0.1:7001 foo.img

By testing foo.img, we can know which copy is correct.)

The user can force a repair by specifying the -R or --repair flag:

* Force repair:

$ collie vdi check -R test.img
CHECKING VDI:test.img ...
Failed oid: 9c5e680001
copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
repairing ...
copy this object from 127.0.0.1:7000 => 127.0.0.1:7001
copy this object from 127.0.0.1:7000 => 127.0.0.1:7004
repair finished
REPAIRED

Signed-off-by: Yunkai Zhang <qiushu@taobao.com>
---
 collie/common.c          |   6 +-
 collie/vdi.c             | 185 ++---
 include/internal_proto.h |  22 +-
 include/sheep.h          |  16
 sheep/farm/farm.h        |   1 -
 sheep/farm/sha1_file.c   |  15
 sheep/gateway.c          |  76 +++
 sheep/ops.c              | 191 ---
 sheep/sheep_priv.h       |   4 +
 9 files changed, 390 insertions(+), 126 deletions(-)

diff --git a/collie/common.c b/collie/common.c
index f885c8c..83a2c3d 100644
--- a/collie/common.c
+++ b/collie/common.c
@@ -207,8 +207,7 @@ int send_light_req_get_response(struct sd_req *hdr, const char *host, int port)
 	ret = exec_req(fd, hdr, NULL, &wlen, &rlen);
 	close(fd);
 	if (ret) {
-		fprintf(stderr, "failed to connect to %s:%d\n",
-			host, port);
+		dprintf("failed to connect to %s:%d\n", host, port);
 		return -1;
 	}
 
@@ -229,8 +228,7 @@ int send_light_req(struct sd_req *hdr, const char *host, int port)
 		return -1;
 
 	if (ret != SD_RES_SUCCESS) {
-		fprintf(stderr, "Response's result: %s\n",
-			sd_strerror(ret));
+		dprintf("Response's result: %s\n", sd_strerror(ret));
 		return -1;
 	}
 
diff --git a/collie/vdi.c b/collie/vdi.c
index d27b5af..b479efd 100644
--- a/collie/vdi.c
+++ b/collie/vdi.c
@@ -23,6 +23,7 @@ static struct sd_option vdi_options[] = {
 	{'x', "exclusive", 0, "write in an exclusive mode"},
 	{'d', "delete", 0, "delete a key"},
 	{'C', "cache", 0, "enable object cache"},
+	{'R', "repair", 0, "force repair object's copies (dangerous)"},
 
 	{ 0, NULL, 0, NULL },
 };
@@ -35,6 +36,7 @@ struct vdi_cmd_data {
 	int delete;
 	int prealloc;
 	int cache;
+	int repair;
 } vdi_cmd_data = { ~0, };
 
 struct get_vdi_info {
@@ -1320,126 +1322,143 @@ out:
 	return ret;
 }
 
-static void *read_object_from(struct sd_vnode *vnode, uint64_t oid)
+static void get_obj_checksum_from(struct sd_vnode *vnode, uint64_t oid,
+				  unsigned char *sha1)
 {
 	struct sd_req hdr;
-	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
+	struct sd_checksum_rsp *rsp =
[sheepdog] [PATCH V2 3/4] man: update 'collie vdi check' doc
From: Yunkai Zhang <qiushu@taobao.com>

Signed-off-by: Yunkai Zhang <qiushu@taobao.com>
---
 man/collie.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/collie.8 b/man/collie.8
index 359c409..d42c3d5 100644
--- a/man/collie.8
+++ b/man/collie.8
@@ -67,7 +67,7 @@ This command creates an image.
 .BI vdi snapshot [-s snapshot] [-a address] [-p port] [-h] vdiname
 This command creates a snapshot.
 .TP
-.BI vdi check [-s snapshot] [-a address] [-p port] [-h] vdiname
+.BI vdi check [-R] [-s snapshot] [-a address] [-p port] [-h] vdiname
 This command checks and repairs an image's consistency.
 .TP
 .BI vdi clone [-s snapshot] [-P] [-a address] [-p port] [-h] src vdi dst vdi
-- 
1.7.11.2
[sheepdog] [PATCH] test: add test for recovery logic
From: Liu Yuan <tailai...@taobao.com>

Signed-off-by: Liu Yuan <tailai...@taobao.com>
---
 tests/027       | 33 +
 tests/027.out   |  5 +
 tests/common.rc | 10 ++
 tests/group     |  1 +
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100755 tests/027
 create mode 100644 tests/027.out

diff --git a/tests/027 b/tests/027
new file mode 100755
index 000..456c0f7
--- /dev/null
+++ b/tests/027
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+# Test sheep recovery logic
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1        # failure is the default!
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_cleanup
+
+for i in `seq 0 3`; do
+    _start_sheep $i
+done
+
+_wait_for_sheep 4
+
+$COLLIE cluster format -c 2
+
+$COLLIE vdi create test0 40M
+$COLLIE vdi create test1 40M
+
+_kill_sheep 3
+
+_wait_for_sheep_recovery 0
+
+find $STORE -name '80fd32fc'
diff --git a/tests/027.out b/tests/027.out
new file mode 100644
index 000..f9887b5
--- /dev/null
+++ b/tests/027.out
@@ -0,0 +1,5 @@
+QA output created by 027
+using backend farm store
+/tmp/sheepdog/0/obj/80fd32fc
+/tmp/sheepdog/3/obj/80fd32fc
+/tmp/sheepdog/1/obj/80fd32fc
diff --git a/tests/common.rc b/tests/common.rc
index 64182c6..7ede163 100644
--- a/tests/common.rc
+++ b/tests/common.rc
@@ -169,5 +169,15 @@ _kill_sheep()
 	pkill -f "$SHEEP $STORE/$1"
 }
 
+_wait_for_sheep_recovery()
+{
+    while true; do
+	sleep 2
+	if [ $($COLLIE node recovery -p $((7000+$1)) | wc -l) -eq 2 ]; then
+	    break
+	fi
+    done
+}
+
 # make sure this script returns success
 /bin/true
diff --git a/tests/group b/tests/group
index aaa1ab6..57edea5 100644
--- a/tests/group
+++ b/tests/group
@@ -38,3 +38,4 @@
 024 auto quick cluster
 025 auto quick cluster
 026 auto quick vdi
+027 auto quick store
-- 
1.7.1
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
<morita.kazut...@lab.ntt.co.jp> wrote:
> At Mon, 20 Aug 2012 23:34:10 +0800, Yunkai Zhang wrote:
>> In fact, I have thought about this method, but we would face nearly
>> the same problem: after a sheep joins back, it should know which
>> objects are dirty, and should do the cleanup work (because old-version
>> objects stay in its working directory). This method doesn't seem to
>> save steps, but will do extra recovery work.
>
> Can you give me a concrete example?
>
> I created a really naive patch to disable object recovery with my idea:

Hi Kazutaka:

I have read and done a simple test with this patch; it works most of the
time. But write operations can be blocked in wait_forward_request(); I
think there are some corner cases we should handle.

I think I have understood this good idea; it's simple and clever.

Could you give a mature patch? We really want to use it in our cluster as
soon as possible.

Thank you!

> ==
> diff --git a/sheep/recovery.c b/sheep/recovery.c
> index 5164aa7..8bf032f 100644
> --- a/sheep/recovery.c
> +++ b/sheep/recovery.c
> @@ -35,6 +35,7 @@ struct recovery_work {
>  	uint64_t *oids;
>  	uint64_t *prio_oids;
>  	int nr_prio_oids;
> +	int nr_scheduled_oids;
>
>  	struct vnode_info *old_vinfo;
>  	struct vnode_info *cur_vinfo;
> @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
>  			oid);
>  		return;
>  	}
> -	/* The oid is currently being recovered */
> -	if (rw->oids[rw->done] == oid)
> -		return;
>
>  	rw->nr_prio_oids++;
>  	rw->prio_oids = xrealloc(rw->prio_oids,
>  				 rw->nr_prio_oids * sizeof(uint64_t));
> @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
>  done:
>  	free(rw->prio_oids);
>  	rw->prio_oids = NULL;
> +	rw->nr_scheduled_oids += rw->nr_prio_oids;
>  	rw->nr_prio_oids = 0;
>  }
>
> +static struct timer recovery_timer;
> +
> +static void recover_next_object(void *arg)
> +{
> +	struct recovery_work *rw = arg;
> +
> +	if (rw->nr_prio_oids)
> +		finish_schedule_oids(rw);
> +
> +	if (rw->done < rw->nr_scheduled_oids) {
> +		/* Try recover next object */
> +		queue_work(sys->recovery_wqueue, &rw->work);
> +		return;
> +	}
> +
> +	/* There is no objects to be recovered.  Try again later */
> +	recovery_timer.callback = recover_next_object;
> +	recovery_timer.data = rw;
> +	add_timer(&recovery_timer, 1); /* FIXME */
> +}
> +
>  static void recover_object_main(struct work *work)
>  {
>  	struct recovery_work *rw = container_of(work, struct recovery_work,
> @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
>  	resume_wait_obj_requests(rw->oids[rw->done++]);
>
>  	if (rw->done < rw->count) {
> -		if (rw->nr_prio_oids)
> -			finish_schedule_oids(rw);
> -
> -		/* Try recover next object */
> -		queue_work(sys->recovery_wqueue, &rw->work);
> +		recover_next_object(rw);
>  		return;
>  	}
>
> @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
>  	resume_wait_recovery_requests();
>  	rw->work.fn = recover_object_work;
>  	rw->work.done = recover_object_main;
> -	queue_work(sys->recovery_wqueue, &rw->work);
> +	recover_next_object(rw);
>  	return;
>  }
> ==
>
> I ran the following test, and object recovery was disabled correctly
> for both the join and leave cases.
>
> ==
> #!/bin/bash
>
> for i in 0 1 2 3; do
>     ./sheep/sheep /store/$i -z $i -p 700$i -c local
> done
> sleep 1
> ./collie/collie cluster format
> ./collie/collie vdi create test 4G
>
> echo "* objects will be created on node[0-2] *"
> md5sum /store/[0,1,2,3]/obj/807c2b25
>
> pkill -f "./sheep/sheep /store/1"
> sleep 3
>
> echo "* recovery doesn't start until the object is touched *"
> md5sum /store/[0,2,3]/obj/807c2b25
>
> ./collie/collie vdi snapshot test  # invoke recovery of the vdi object
> echo "* the object is recovered *"
> md5sum /store/[0,2,3]/obj/807c2b25
>
> ./sheep/sheep /store/1 -z 1 -p 7001 -c local
> sleep 3
>
> echo "* recovery doesn't start until the object is touched *"
> md5sum /store/[0,1,2,3]/obj/807c2b25
>
> ./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
> echo "* the object is recovered *"
> md5sum /store/[0,1,2,3]/obj/807c2b25
> ==
>
> [Output]
>
> using backend farm store
> * objects will be created on node[0-2] *
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
> * recovery doesn't start until the object is touched *
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
> * the object is recovered *
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Wed, 22 Aug 2012 01:16:49 +0800, Yunkai Zhang wrote:
> I have read and done a simple test with this patch; it works most of
> the time. But write operations can be blocked in
> wait_forward_request(); I think there are some corner cases we should
> handle.

Can you create a testcase to reproduce it?

> Could you give a mature patch? We really want to use it in our cluster
> as soon as possible.

Okay, but I'm currently working on another problem - sheep blocks I/O
requests for a long time while stale objects are moved to the farm
backend store. I'll give it a try after that.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Wed, Aug 22, 2012 at 9:31 AM, MORITA Kazutaka
<morita.kazut...@lab.ntt.co.jp> wrote:
> Can you create a testcase to reproduce it?

OK, I'll give a testcase later.

> Okay, but I'm currently working on another problem - sheep blocks I/O
> requests for a long time while stale objects are moved to the farm
> backend store. I'll give it a try after that.

Thanks~

-- 
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/22/2012 09:44 AM, Yunkai Zhang wrote:
> Thanks~

Hi Yunkai, since you have been working on this series all these days, why
not pick up Kazutaka's draft patch and perfect it? I think that is
Kazutaka's original intention - the draft is just a concrete example to
show his idea could work out.

-- 
thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Wed, Aug 22, 2012 at 9:55 AM, Liu Yuan <namei.u...@gmail.com> wrote:
> Hi Yunkai, since you have been working on this series all these days,
> why not pick up Kazutaka's draft patch and perfect it? I think that is
> Kazutaka's original intention - the draft is just a concrete example to
> show his idea could work out.

My intention is to respect Kazutaka's idea; if he needs my help, I'm
pleased to do it :).

-- 
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Wed, Aug 22, 2012 at 10:21 AM, MORITA Kazutaka
<morita.kazut...@lab.ntt.co.jp> wrote:
> At Wed, 22 Aug 2012 10:14:07 +0800, Yunkai Zhang wrote:
>> My intention is to respect Kazutaka's idea; if he needs my help, I'm
>> pleased to do it :).
>
> If you complete the work, it will help me a lot. :)
>
> Thanks,
>
> Kazutaka

Well, I'll complete it :) But right now I'm busy with other things; maybe
I'll send the first version based on this idea on Friday or this weekend.

-- 
Yunkai Zhang
Work at Taobao
[sheepdog] Sheepdog 0.5.0 schedule and todos
Hi all,

I am thinking of releasing 0.5.0 in early September. TODO items for the
release are:

- different redundancy levels for each VDI
- 'collie cluster recover' to enable/disable automatic recovery
- 'collie vdi backup/restore' to get a differential backup between snapshots
- another writeback support in the backend store
- fix a blocking problem in farm
- fix mastership transfer

Let me know if there are any opinions.

Thanks,

Kazutaka
[sheepdog] [PATCH] tests: kill sheep with signal KILL
From: levin li <xingke@taobao.com>

In some cases, we use 'pkill -f' to kill a sheep node, but the log
process isn't killed immediately and becomes a defunct process, which
causes problems for the next start of the same node.

Signed-off-by: levin li <xingke@taobao.com>
---
 tests/common.rc | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tests/common.rc b/tests/common.rc
index 64182c6..8ca7dea 100644
--- a/tests/common.rc
+++ b/tests/common.rc
@@ -166,7 +166,7 @@ _start_sheep()
 
 _kill_sheep()
 {
-    pkill -f "$SHEEP $STORE/$1"
+    pkill -9 -f "$SHEEP $STORE/$1"
}
 
 # make sure this script returns success
 /bin/true
-- 
1.7.1

-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog