Re: [sheepdog] [PATCH v2] sheep: make config file compatible with the previous one

2012-08-21 Thread Liu Yuan
On 08/21/2012 02:37 AM, MORITA Kazutaka wrote:
 Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
 ---
 
 Changes from v1:
  - remove 'version' from sheepdog_config
 
 Even if we don't support a version check of the config in the next
 release, we should at least fix the compatibility issue.
 
 
  sheep/store.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)
 
 diff --git a/sheep/store.c b/sheep/store.c
 index 542804a..fcbf32d 100644
 --- a/sheep/store.c
 +++ b/sheep/store.c
 @@ -30,10 +30,11 @@
  
  struct sheepdog_config {
   uint64_t ctime;
 - uint64_t space;
   uint16_t flags;
   uint8_t copies;
   uint8_t store[STORE_LEN];
 + uint8_t __pad[5];
 + uint64_t space;
  };
  
  char *obj_path;
 

Applied, thanks.

Yuan
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 1:50 PM, Dietmar Maurer diet...@proxmox.com wrote:
 Disabling automatic recovery by default doesn't work for you?  You can
 control the time to start recovery with collie cluster recover enable.

 It just looks strange to me to design the system for immediate/automatic
 recovery and make 'disabling automatic recovery' an option. I would
 include the node state in the epoch. But maybe that is only an
 implementation detail.

Hi folks:

I need a conclusion:

Does sheepdog need delay recovery supported by this series (or by
Kazum's new idea and implementation) ?



 - Dietmar





-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 1:48 PM, Yunkai Zhang yunkai...@gmail.com wrote:
 On Tue, Aug 21, 2012 at 1:42 PM, MORITA Kazutaka
 morita.kazut...@lab.ntt.co.jp wrote:
 At Thu, 16 Aug 2012 22:38:21 +0800,
 Yunkai Zhang wrote:

 After adding the '-F' flag, the help looks like:
 $ collie vdi check
 Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] 
 vdiname
 Options:
   -F, --force_repair  force repair object's copies (dangerous)

 How about '-r, --repair'?


 Good for me.

When I submitted this patch, my new patch "collie: add private options to
collie's command" hadn't been developed yet, so the '-r, --repair' option
would conflict with the '-r, --raw' option.

Since these two patches are dependent, I'll put them in a series in V2.




 		fprintf(stderr, "Failed to read, %s\n",
 			sd_strerror(rsp->result));
 		exit(EXIT_FAILURE);
 	}
 -	return buf;
 +
 +	memcpy(sha1, (unsigned char *)&rsp->__pad[0], SHA1_LEN);

 Please define a member name instead of using __pad.

 OK.



  }

 -static void write_object_to(struct sd_vnode *vnode, uint64_t oid, void *buf)
 +static int do_repair(uint64_t oid, struct node_id *src, struct node_id *dest)
  {
 	struct sd_req hdr;
 	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
 +	unsigned rlen, wlen;
 +	char host[128];
 	int fd, ret;
 -	unsigned wlen = SD_DATA_OBJ_SIZE, rlen = 0;
 -	char name[128];

 -	addr_to_str(name, sizeof(name), vnode->nid.addr, 0);
 -	fd = connect_to(name, vnode->nid.port);
 +	addr_to_str(host, sizeof(host), dest->addr, 0);
 +
 +	fd = connect_to(host, dest->port);
 	if (fd < 0) {
 -		fprintf(stderr, "failed to connect to %s:%"PRIu32"\n",
 -			name, vnode->nid.port);
 -		exit(EXIT_FAILURE);
 +		fprintf(stderr, "Failed to connect\n");
 +		return SD_RES_EIO;
 	}

 -	sd_init_req(&hdr, SD_OP_WRITE_PEER);
 -	hdr.epoch = sd_epoch;
 -	hdr.flags = SD_FLAG_CMD_WRITE;
 -	hdr.data_length = wlen;
 +	sd_init_req(&hdr, SD_OP_REPAIR_OBJ_PEER);

 I don't think sending peer requests directly from outside sheeps is a
 good idea.  How about making the gateway node forward the requests?

 Ok, no problem.



 +	rlen = 0;
 +	wlen = sizeof(*src);
 +
 +	hdr.epoch = sd_epoch;
 	hdr.obj.oid = oid;
 +	hdr.data_length = wlen;
 +	hdr.flags = SD_FLAG_CMD_WRITE;

 -	ret = exec_req(fd, &hdr, buf, &wlen, &rlen);
 +	ret = exec_req(fd, &hdr, src, &wlen, &rlen);
 	close(fd);
 -
 	if (ret) {
 -		fprintf(stderr, "Failed to execute request\n");
 -		exit(EXIT_FAILURE);
 +		fprintf(stderr, "Failed to repair oid:%"PRIx64"\n", oid);
 +		return SD_RES_EIO;
 	}
 -
 	if (rsp->result != SD_RES_SUCCESS) {
 -		fprintf(stderr, "Failed to read, %s\n",
 -			sd_strerror(rsp->result));
 -		exit(EXIT_FAILURE);
 +		fprintf(stderr, "Failed to repair oid:%"PRIx64", %s\n",
 +			oid, sd_strerror(rsp->result));
 +		return rsp->result;
 	}
 +
 +	return SD_RES_SUCCESS;
  }

 -/*
 - * Fix consistency of the replica of oid.
 - *
 - * XXX: The fix is rather dumb, just read the first copy and write it
 - * to other replica.
 - */
 -static void do_check_repair(uint64_t oid, int nr_copies)
 +static int do_check_repair(uint64_t oid, int nr_copies)
  {
 	struct sd_vnode *tgt_vnodes[nr_copies];
 -	void *buf, *buf_cmp;
 -	int i;
 +	unsigned char sha1[SD_MAX_COPIES][SHA1_LEN];
 +	char host[128];
 +	int i, j;

 	oid_to_vnodes(sd_vnodes, sd_vnodes_nr, oid, nr_copies, tgt_vnodes);
 -	buf = read_object_from(tgt_vnodes[0], oid);
 -	for (i = 1; i < nr_copies; i++) {
 -		buf_cmp = read_object_from(tgt_vnodes[i], oid);
 -		if (memcmp(buf, buf_cmp, SD_DATA_OBJ_SIZE)) {
 -			free(buf_cmp);
 -			goto fix_consistency;
 +	for (i = 0; i < nr_copies; i++) {
 +		get_obj_checksum_from(tgt_vnodes[i], oid, sha1[i]);
 +	}
 +
 +	for (i = 0; i < nr_copies; i++) {
 +		for (j = (i + 1); j < nr_copies; j++) {
 +			if (memcmp(sha1[i], sha1[j], SHA1_LEN))
 +				goto diff;
 		}
 -		free(buf_cmp);
 	}
 -	free(buf);
 -	return;
 +	return 0;

 -fix_consistency:
 -	for (i = 1; i < nr_copies; i++)
 -		write_object_to(tgt_vnodes[i], oid, buf);
 -	fprintf(stdout, "fix %"PRIx64" success\n", oid);
 -	free(buf);
 +diff:
 +	fprintf(stderr, "Failed oid: %"PRIx64"\n", oid);
 +	for (i = 0; i < nr_copies; i++) {
 +		addr_to_str(host, sizeof(host), tgt_vnodes[i]->nid.addr, 0);
 +		fprintf(stderr, " copy[%d], sha1: %s, from: %s:%d\n",
 +			i, sha1_to_hex(sha1[i]), host, tgt_vnodes[i]->nid.port);
 +	}
 +
 +	if (!vdi_cmd_data.force_repair)
 +		return -1;
 +
 +   

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Bastian Scholz

Hi Dietmar, Hi Yuan,

On 2012-08-21 07:27, Dietmar Maurer wrote:
Membership change can happen for many reasons. It can happen if
something is wrong on the switch (or if some admin configures the
switch), a damaged network cable, a bug in the bonding driver, a
damaged network card, or simply a power failure on a node, which
reconnects after power is back on.


At least David and I hit this problem recently in our environment,
and it ended in complete data loss.
Not that it happens very often, or that it must be handled in a way
that keeps the cluster running when it does, but in my opinion this
situation should be handled without data loss...

Cheers

Bastian


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 14:14:23 +0800,
Yunkai Zhang wrote:
 I need a conclusion:
 
 Does sheepdog need delay recovery supported by this series (or by
 Kazum's new idea and implementation) ?

There are two different discussions in this thread:

  1. turn on/off automatic recovery with a collie command (supported
 by this series)
  2. delay starting automatic recovery in any case

I think no one is against supporting 1.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:43 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Tue, 21 Aug 2012 14:14:23 +0800,
 Yunkai Zhang wrote:
 I need a conclusion:

 Does sheepdog need delay recovery supported by this series (or by
 Kazum's new idea and implementation) ?

 There are two different discussions in this thread:

   1. turn on/off automatic recovery with a collie command (supported
  by this series)
   2. delay starting automatic recovery in any case

 I think no one is against supporting 1.

Ok, I'll continue to improve this series after I complete other things.


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Liu Yuan
On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
 Ok, I'll continue to improve this series after I complete other things.

Why not choose Kazutaka's idea to implement delay recovery? It looks
simple yet efficient at least to me.

Thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
 Ok, I'll continue to improve this series after I complete other things.

 Why not choose Kazutaka's idea to implement delay recovery? It looks
 simple yet efficient at least to me.

continue to import this series, of course including kazum's idea if
it's the best way.


 Thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 3:04 PM, Yunkai Zhang yunkai...@gmail.com wrote:
 On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
 Ok, I'll continue to improve this series after I complete other things.

 Why not choose Kazutaka's idea to implement delay recovery? It looks
 simple yet efficient at least to me.

 continue to import this series, of course including kazum's idea if
 it's the best way.

s/import/improve/



 Thanks,
 Yuan



 --
 Yunkai Zhang
 Work at Taobao



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:26 PM, Yunkai Zhang yunkai...@gmail.com wrote:
 On Tue, Aug 21, 2012 at 1:48 PM, Yunkai Zhang yunkai...@gmail.com wrote:
 On Tue, Aug 21, 2012 at 1:42 PM, MORITA Kazutaka
 morita.kazut...@lab.ntt.co.jp wrote:
 At Thu, 16 Aug 2012 22:38:21 +0800,
 Yunkai Zhang wrote:

  After adding the '-F' flag, the help looks like:
 $ collie vdi check
 Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] 
 vdiname
 Options:
   -F, --force_repair  force repair object's copies (dangerous)

 How about '-r, --repair'?


 Good for me.

 When I submitted this patch, my new patch "collie: add private options to
 collie's command" hadn't been developed yet, so the '-r, --repair' option
 would conflict with the '-r, --raw' option.

I found that the '-r, --raw' option is also needed by the 'collie vdi list'
command, so the name conflict remains.

I will use '-R, --repair' instead.



 Since these two patches are dependent, I'll put them in a series in V2.





[sheepdog] [PATCH V2 1/4] collie: add private options to collie's command

2012-08-21 Thread Yunkai Zhang
From: Yunkai Zhang qiushu@taobao.com

V2:
- update commit log, make it more descriptive
- use 'options' name in the inner struct instead of 'self_options'
--- >8 ---

Currently, all collie commands share the same global collie_options, which
leads to option name conflicts among commands that use the same option
letter but with a different description.

By moving the private options into an individual structure for each command,
and making collie_options contain only the common part, we can solve this
problem.

Signed-off-by: Yunkai Zhang qiushu@taobao.com
---
 collie/cluster.c | 24 ++--
 collie/collie.c  | 49 +
 collie/collie.h  |  1 +
 collie/vdi.c | 56 
 4 files changed, 84 insertions(+), 46 deletions(-)

diff --git a/collie/cluster.c b/collie/cluster.c
index 9302b78..2cca3ec 100644
--- a/collie/cluster.c
+++ b/collie/cluster.c
@@ -16,6 +16,17 @@
 
 #include "collie.h"
 
+static struct sd_option cluster_options[] = {
+	{'b', "store", 1, "specify backend store"},
+	{'c', "copies", 1, "specify the data redundancy (number of copies)"},
+	{'m', "mode", 1, "mode (safe, quorum, unsafe)"},
+	{'f', "force", 0, "do not prompt for confirmation"},
+	{'R', "restore", 1, "restore the cluster"},
+	{'l', "list", 0, "list the user epoch information"},
+
+	{ 0, NULL, 0, NULL },
+};
+
 struct cluster_cmd_data {
uint32_t epoch;
int list;
@@ -501,19 +512,20 @@ static int cluster_recover(int argc, char **argv)
 
 static struct subcommand cluster_cmd[] = {
	{"info", NULL, "aprh", "show cluster information",
-	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_info},
+	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_info, cluster_options},
	{"format", NULL, "bcmaph", "create a Sheepdog store",
-	 NULL, 0, cluster_format},
+	 NULL, 0, cluster_format, cluster_options},
	{"shutdown", NULL, "aph", "stop Sheepdog",
-	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown},
+	 NULL, SUBCMD_FLAG_NEED_NODELIST, cluster_shutdown, cluster_options},
	{"snapshot", NULL, "aRlph", "snapshot/restore the cluster",
-	 NULL, 0, cluster_snapshot},
+	 NULL, 0, cluster_snapshot, cluster_options},
	{"cleanup", NULL, "aph",
	 "cleanup the useless snapshot data from recovery",
-	 NULL, 0, cluster_cleanup},
+	 NULL, 0, cluster_cleanup, cluster_options},
	{"recover", NULL, "afph",
	 "See 'collie cluster recover' for more information\n",
-	 cluster_recover_cmd, SUBCMD_FLAG_NEED_THIRD_ARG, cluster_recover},
+	 cluster_recover_cmd, SUBCMD_FLAG_NEED_THIRD_ARG,
+	 cluster_recover, cluster_options},
{NULL,},
 };
 
diff --git a/collie/collie.c b/collie/collie.c
index 32c044a..c1b1854 100644
--- a/collie/collie.c
+++ b/collie/collie.c
@@ -25,29 +25,13 @@ int raw_output = 0;
 
 static const struct sd_option collie_options[] = {
 
-   /* common options */
+   /* common options for all collie commands */
	{'a', "address", 1, "specify the daemon address (default: localhost)"},
	{'p', "port", 1, "specify the daemon port"},
	{'r', "raw", 0, "raw output mode: omit headers, separate fields with\n\
	  single spaces and print all sizes in decimal bytes"},
	{'h', "help", 0, "display this help and exit"},

-	/* VDI options */
-	{'P', "prealloc", 0, "preallocate all the data objects"},
-	{'i', "index", 1, "specify the index of data objects"},
-	{'s', "snapshot", 1, "specify a snapshot id or tag name"},
-	{'x', "exclusive", 0, "write in an exclusive mode"},
-	{'d', "delete", 0, "delete a key"},
-	{'C', "cache", 0, "enable object cache"},
-
-	/* cluster options */
-	{'b', "store", 1, "specify backend store"},
-	{'c', "copies", 1, "specify the data redundancy (number of copies)"},
-	{'m', "mode", 1, "mode (safe, quorum, unsafe)"},
-	{'f', "force", 0, "do not prompt for confirmation"},
-	{'R', "restore", 1, "restore the cluster"},
-	{'l', "list", 0, "list the user epoch information"},
-
{ 0, NULL, 0, NULL },
 };
 
@@ -127,18 +111,34 @@ out:
 
 static int (*command_parser)(int, char *);
 static int (*command_fn)(int, char **);
-static const char *command_options;
+static const char *command_opts;
 static const char *command_arg;
 static const char *command_desc;
+static struct sd_option *command_options;
 
 static const struct sd_option *find_opt(int ch)
 {
int i;
+   struct sd_option *opt;
 
+   /* search for common options */
	for (i = 0; i < ARRAY_SIZE(collie_options); i++) {
if (collie_options[i].val == ch)
return collie_options + i;
}
+
+   /* search for self options */
+   if (!command_options)
+   goto out;
+
+   opt = command_options;
+	while (opt->val) {
+		if (opt->val == ch)

[sheepdog] [PATCH V2 2/4] collie: optimize 'collie vdi check' command

2012-08-21 Thread Yunkai Zhang
From: Yunkai Zhang qiushu@taobao.com

V2:
- use '-R, --repair' instead of '-F, --force_repair'
- do not connect to the target sheep directly
- define a member name instead of using __pad.
- update this commit log
--- >8 ---

Reading all of a vdi's objects from the cluster when checking them wastes a
lot of network bandwidth; instead, let's calculate the checksum of objects in
the backend and only send the checksum result to the collie client.

I also think repairing objects automatically is dangerous, as we don't know
which replica is correct. To give the user a chance to check them first if
necessary, I add a new option: '-R, --repair'. By default, this command
only checks, and does not repair (as the command name implies).

After adding the '-R' flag, the help looks like:
$ collie vdi check
Usage: collie vdi check [-R] [-s snapshot] [-a address] [-p port] [-h] vdiname
Options:
  -R, --repairforce repair object's copies (dangerous)
  -s, --snapshot  specify a snapshot id or tag name
  -a, --address   specify the daemon address (default: localhost)
  -p, --port  specify the daemon port
  -h, --help  display this help and exit

Some examples of executing this command:
* Success:
$ collie vdi check test.img
CHECKING VDI:test.img ...
PASSED

* Failure (by default not repair):
$ collie vdi check test.img
CHECKING VDI:test.img ...
Failed oid: 9c5e680001
 copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
 copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
 copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
FAILED

With the output shown above, the user can inspect all copies of this object
and decide which one is correct (I plan to add a new option, '--oid', to
'collie vdi read' in another patch, so that the user can specify which copy
of the object to export:
  $ collie vdi read test.img --oid 9c5e680001@127.0.0.1:7001 > foo.img
By testing foo.img, we can learn which copy is correct).

The user can force a repair by specifying the -R or --repair flag:
* Force repair:
$ collie vdi check -R test.img
CHECKING VDI:test.img ...
Failed oid: 9c5e680001
 copy[0], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7000
 copy[1], sha1: 46dbc769de60a508faf134c6d51926741c0e38fa, from: 127.0.0.1:7001
 copy[2], sha1: c78ca69c4be7401b6d1f11a37a4cec4226e736cd, from: 127.0.0.1:7004
 repairing ...
 copy this object from 127.0.0.1:7000 = 127.0.0.1:7001
 copy this object from 127.0.0.1:7000 = 127.0.0.1:7004
 repair finished
REPAIRED

Signed-off-by: Yunkai Zhang qiushu@taobao.com
---
 collie/common.c  |   6 +-
 collie/vdi.c | 185 ++---
 include/internal_proto.h |  22 +-
 include/sheep.h  |  16 
 sheep/farm/farm.h|   1 -
 sheep/farm/sha1_file.c   |  15 
 sheep/gateway.c  |  76 +++
 sheep/ops.c  | 191 ---
 sheep/sheep_priv.h   |   4 +
 9 files changed, 390 insertions(+), 126 deletions(-)

diff --git a/collie/common.c b/collie/common.c
index f885c8c..83a2c3d 100644
--- a/collie/common.c
+++ b/collie/common.c
@@ -207,8 +207,7 @@ int send_light_req_get_response(struct sd_req *hdr, const char *host, int port)
	ret = exec_req(fd, hdr, NULL, &wlen, &rlen);
	close(fd);
	if (ret) {
-		fprintf(stderr, "failed to connect to  %s:%d\n",
-			host, port);
+		dprintf("failed to connect to  %s:%d\n", host, port);
		return -1;
	}

@@ -229,8 +228,7 @@ int send_light_req(struct sd_req *hdr, const char *host, int port)
		return -1;

	if (ret != SD_RES_SUCCESS) {
-		fprintf(stderr, "Response's result: %s\n",
-			sd_strerror(ret));
+		dprintf("Response's result: %s\n", sd_strerror(ret));
		return -1;
	}
 
diff --git a/collie/vdi.c b/collie/vdi.c
index d27b5af..b479efd 100644
--- a/collie/vdi.c
+++ b/collie/vdi.c
@@ -23,6 +23,7 @@ static struct sd_option vdi_options[] = {
	{'x', "exclusive", 0, "write in an exclusive mode"},
	{'d', "delete", 0, "delete a key"},
	{'C', "cache", 0, "enable object cache"},
+	{'R', "repair", 0, "force repair object's copies (dangerous)"},

	{ 0, NULL, 0, NULL },
 };
@@ -35,6 +36,7 @@ struct vdi_cmd_data {
	int delete;
	int prealloc;
	int cache;
+	int repair;
 } vdi_cmd_data = { ~0, };
 
 struct get_vdi_info {
@@ -1320,126 +1322,143 @@ out:
return ret;
 }
 
-static void *read_object_from(struct sd_vnode *vnode, uint64_t oid)
+static void get_obj_checksum_from(struct sd_vnode *vnode, uint64_t oid,
+ unsigned char *sha1)
 {
struct sd_req hdr;
-	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
+   struct sd_checksum_rsp *rsp = 

[sheepdog] [PATCH V2 3/4] man: update 'collie vdi check' doc

2012-08-21 Thread Yunkai Zhang
From: Yunkai Zhang qiushu@taobao.com

Signed-off-by: Yunkai Zhang qiushu@taobao.com
---
 man/collie.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/collie.8 b/man/collie.8
index 359c409..d42c3d5 100644
--- a/man/collie.8
+++ b/man/collie.8
@@ -67,7 +67,7 @@ This command creates an image.
 .BI vdi snapshot [-s snapshot] [-a address] [-p port] [-h] vdiname
 This command creates a snapshot.
 .TP
-.BI vdi check [-s snapshot] [-a address] [-p port] [-h] vdiname
+.BI vdi check [-R] [-s snapshot] [-a address] [-p port] [-h] vdiname
 This command checks and repairs an image's consistency.
 .TP
 .BI vdi clone [-s snapshot] [-P] [-a address] [-p port] [-h] src vdi dst 
vdi
-- 
1.7.11.2



[sheepdog] [PATCH] test: add test for recovery logic

2012-08-21 Thread Liu Yuan
From: Liu Yuan tailai...@taobao.com


Signed-off-by: Liu Yuan tailai...@taobao.com
---
 tests/027   |   33 +
 tests/027.out   |5 +
 tests/common.rc |   10 ++
 tests/group |1 +
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100755 tests/027
 create mode 100644 tests/027.out

diff --git a/tests/027 b/tests/027
new file mode 100755
index 000..456c0f7
--- /dev/null
+++ b/tests/027
@@ -0,0 +1,33 @@
+#!/bin/bash
+
+# Test sheep recovery logic
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_cleanup
+
+for i in `seq 0 3`; do
+_start_sheep $i
+done
+
+_wait_for_sheep 4
+
+$COLLIE cluster format -c 2
+
+$COLLIE vdi create test0 40M
+$COLLIE vdi create test1 40M
+
+_kill_sheep 3
+
+_wait_for_sheep_recovery 0
+
+find $STORE -name '80fd32fc'
diff --git a/tests/027.out b/tests/027.out
new file mode 100644
index 000..f9887b5
--- /dev/null
+++ b/tests/027.out
@@ -0,0 +1,5 @@
+QA output created by 027
+using backend farm store
+/tmp/sheepdog/0/obj/80fd32fc
+/tmp/sheepdog/3/obj/80fd32fc
+/tmp/sheepdog/1/obj/80fd32fc
diff --git a/tests/common.rc b/tests/common.rc
index 64182c6..7ede163 100644
--- a/tests/common.rc
+++ b/tests/common.rc
@@ -169,5 +169,15 @@ _kill_sheep()
 pkill -f "$SHEEP $STORE/$1"
 }
 
+_wait_for_sheep_recovery()
+{
+while true; do
+   sleep 2
+   if [ $($COLLIE node recovery -p $((7000+$1)) | wc -l) -eq 2 ]; then
+   break
+   fi
+done
+}
+
 # make sure this script returns success
 /bin/true
diff --git a/tests/group b/tests/group
index aaa1ab6..57edea5 100644
--- a/tests/group
+++ b/tests/group
@@ -38,3 +38,4 @@
 024 auto quick cluster
 025 auto quick cluster
 026 auto quick vdi
+027 auto quick store
-- 
1.7.1



Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Mon, 20 Aug 2012 23:34:10 +0800,
 Yunkai Zhang wrote:

 In fact, I had thought of this method, but we face nearly the same
 problem:

 After a sheep joins back, it should know which objects are dirty, and
 should do the cleanup work (because old-version objects stay in its
 working directory). This method does not seem to save any steps, but
 will do extra recovery work.

 Can you give me a concrete example?

 I created a really naive patch to disable object recovery with my
 idea:


Hi Kazum:

I have read and done a simple test with this patch; it works most of the time.

But write operations can be blocked in wait_forward_request(); I think
there are some corner cases we should handle.

I think I have understood this idea; it's simple and clever.


Could you give a mature patch? We really want to use it in our cluster
as soon as possible.


Thank you!


 ==
 diff --git a/sheep/recovery.c b/sheep/recovery.c
 index 5164aa7..8bf032f 100644
 --- a/sheep/recovery.c
 +++ b/sheep/recovery.c
 @@ -35,6 +35,7 @@ struct recovery_work {
 	uint64_t *oids;
 	uint64_t *prio_oids;
 	int nr_prio_oids;
 +	int nr_scheduled_oids;

 	struct vnode_info *old_vinfo;
 	struct vnode_info *cur_vinfo;
 @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
 			oid);
 		return;
 	}
 -	/* The oid is currently being recovered */
 -	if (rw->oids[rw->done] == oid)
 -		return;
 	rw->nr_prio_oids++;
 	rw->prio_oids = xrealloc(rw->prio_oids,
 				 rw->nr_prio_oids * sizeof(uint64_t));
 @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
  done:
 	free(rw->prio_oids);
 	rw->prio_oids = NULL;
 +	rw->nr_scheduled_oids += rw->nr_prio_oids;
 	rw->nr_prio_oids = 0;
  }

 +static struct timer recovery_timer;
 +
 +static void recover_next_object(void *arg)
 +{
 +	struct recovery_work *rw = arg;
 +
 +	if (rw->nr_prio_oids)
 +		finish_schedule_oids(rw);
 +
 +	if (rw->done < rw->nr_scheduled_oids) {
 +		/* Try to recover the next object */
 +		queue_work(sys->recovery_wqueue, &rw->work);
 +		return;
 +	}
 +
 +	/* There are no objects to be recovered.  Try again later */
 +	recovery_timer.callback = recover_next_object;
 +	recovery_timer.data = rw;
 +	add_timer(&recovery_timer, 1); /* FIXME */
 +}
 +
  static void recover_object_main(struct work *work)
  {
 	struct recovery_work *rw = container_of(work, struct recovery_work,
 @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
 	resume_wait_obj_requests(rw->oids[rw->done++]);

 	if (rw->done < rw->count) {
 -		if (rw->nr_prio_oids)
 -			finish_schedule_oids(rw);
 -
 -		/* Try to recover the next object */
 -		queue_work(sys->recovery_wqueue, &rw->work);
 +		recover_next_object(rw);
 		return;
 	}

 @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
 	resume_wait_recovery_requests();
 	rw->work.fn = recover_object_work;
 	rw->work.done = recover_object_main;
 -	queue_work(sys->recovery_wqueue, &rw->work);
 +	recover_next_object(rw);
 	return;
  }

 ==

 I ran the following test, and object recovery was disabled correctly
 for both join and leave case.

 ==
 #!/bin/bash

 for i in 0 1 2 3; do
 	./sheep/sheep /store/$i -z $i -p 700$i -c local
 done

 sleep 1
 ./collie/collie cluster format

 ./collie/collie vdi create test 4G

 echo " * objects will be created on node[0-2] *"
 md5sum /store/[0,1,2,3]/obj/807c2b25

 pkill -f "./sheep/sheep /store/1"
 sleep 3

 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,2,3]/obj/807c2b25

 ./collie/collie vdi snapshot test  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,2,3]/obj/807c2b25

 ./sheep/sheep /store/1 -z 1 -p 7001 -c local
 sleep 3

 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,1,2,3]/obj/807c2b25

 ./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 ==

 [Output]

 using backend farm store
  * objects will be created on node[0-2] *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * recovery doesn't start until the object is touched *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * the object is recovered *
 

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread MORITA Kazutaka
At Wed, 22 Aug 2012 01:16:49 +0800,
Yunkai Zhang wrote:
 
 I have read and done a simple test with this patch; it works most of the time.
 
 But write operations can be blocked in wait_forward_request(); I think
 there are some corner cases we should handle.

Can you create a testcase to reproduce it?

 Could you give a mature patch? We really want to use it in our cluster
 as soon as possible.

Okay, but I'm currently working on another problem - sheep blocks I/O
requests for a long time while stale objects are moved to the farm backend
store.  I'll give it a try after that.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Wed, Aug 22, 2012 at 9:31 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Wed, 22 Aug 2012 01:16:49 +0800,
 Yunkai Zhang wrote:

 I have read and done a simple test with this patch; it works most of the time.
 
 But write operations can still be blocked in wait_forward_request(), so I
 think there are some corner cases we should handle.

 Can you create a testcase to reproduce it?

Ok, I'll give a testcase later.


 Could you give a mature patch? We really want to use it in our cluster
 as soon as possible.

 Okay, but I'm currently working on another problem: sheep blocks I/O
 requests for a long time while stale objects are moved to the farm
 backend store.  I'll give it a try after that.

Thanks~


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Liu Yuan
On 08/22/2012 09:44 AM, Yunkai Zhang wrote:
 Could you give a mature patch? We really want to use it in our cluster
  as soon as possible.
 
  Okay, but I'm currently working on another problem: sheep blocks I/O
  requests for a long time while stale objects are moved to the farm
  backend store.  I'll give it a try after that.
 Thanks~
 

Hi Yunkai, since you have been working on this series all these days, why not
pick up Kazutaka's draft patch and perfect it? I think this is Kazutaka's
original intention; his draft is just a concrete example to show that his
idea could work out.

-- 
thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Wed, Aug 22, 2012 at 9:55 AM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/22/2012 09:44 AM, Yunkai Zhang wrote:
 Could you give a mature patch? We really want to use it in our cluster
  as soon as possible.
 
  Okay, but I'm currently working on another problem: sheep blocks I/O
  requests for a long time while stale objects are moved to the farm
  backend store.  I'll give it a try after that.
 Thanks~


 Hi Yunkai, since you have been working on this series all these days, why
 not pick up Kazutaka's draft patch and perfect it? I think this is
 Kazutaka's original intention; his draft is just a concrete example to show
 that his idea could work out.


My intention is to respect Kazutaka's idea; if my help is needed, I'm
pleased to do it :).



 --
 thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Wed, Aug 22, 2012 at 10:21 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Wed, 22 Aug 2012 10:14:07 +0800,
 Yunkai Zhang wrote:

 My intention is to respect Kazutaka's idea; if my help is needed, I'm
 pleased to do it :).

 If you complete the work, it will help me a lot. :)

Well, I'll complete it :)

But I'm busy with other things right now, so I'll probably send the first
version based on this idea on Friday or this weekend.


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


[sheepdog] Sheepdog 0.5.0 schedule and todos

2012-08-21 Thread MORITA Kazutaka
Hi all,

I'm thinking of releasing 0.5.0 in early September.

TODO items for the release are:
 - different redundancy level for each VDI
 - 'collie cluster recover' to enable/disable automatic recovery
 - 'collie vdi backup/restore' to get a differential backup between snapshots
 - another writeback support in backend store
 - fix a blocking problem in farm
 - fix mastership transfer

Let me know if you have any opinions.

Thanks,

Kazutaka


[sheepdog] [PATCH] tests: kill sheep with signal KILL

2012-08-21 Thread levin li
From: levin li xingke@taobao.com

In some cases, we use pkill -f to kill a sheep node, but the log
process isn't killed immediately and becomes a defunct process, which
causes problems for the next start of the same node.

Signed-off-by: levin li xingke@taobao.com
---
 tests/common.rc |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tests/common.rc b/tests/common.rc
index 64182c6..8ca7dea 100644
--- a/tests/common.rc
+++ b/tests/common.rc
@@ -166,7 +166,7 @@ _start_sheep()
 
 _kill_sheep()
 {
-pkill -f $SHEEP $STORE/$1
+pkill -9 -f $SHEEP $STORE/$1
 }
 
 # make sure this script returns success
-- 
1.7.1
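
The race the commit message describes can be sketched outside the harness: even after SIGKILL, a matching process may linger as a zombie until it is reaped, so a kill helper can poll before the node is restarted. A sketch under assumptions; `_wait_dead` is hypothetical and not part of tests/common.rc, and the demo kills a throwaway `sleep` instead of a real sheep daemon:

```shell
#!/bin/sh
# _wait_dead: poll until no process matches the pattern (zombies included),
# or give up after ~5 seconds.  Hypothetical helper, not in tests/common.rc.
_wait_dead() {
	pattern=$1
	for _ in $(seq 1 50); do
		pgrep -f "$pattern" >/dev/null || return 0
		sleep 0.1   # fractional sleep; GNU coreutils
	done
	return 1
}

# Demo against a throwaway process instead of a real sheep daemon.
sleep 60 &
pid=$!
kill -9 "$pid"              # the analogue of pkill -9 -f
wait "$pid" 2>/dev/null || true   # reap, so no defunct entry remains
_wait_dead "^sleep 60$" && echo "process fully gone"
```

Calling such a helper after `_kill_sheep` would make `_start_sheep` on the same store directory safe even when the logger child exits slowly.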
