Re: [sheepdog] [PATCH v4 01/10] sheep: use struct vdi_iocb to simplify the vdi_create api

2012-08-20 Thread levin li
On 2012/08/20 12:53, MORITA Kazutaka wrote:
 At Thu,  9 Aug 2012 13:27:36 +0800,
 levin li wrote:
  
 +struct vdi_iocb {
 +char *data;
 
 Should be char *name?
 

Yes

 +uint32_t data_len;
 +uint64_t size;
 +uint32_t base_vid;
 +int is_snapshot;
 
 Should be bool is_snapshot?
 

Here is_snapshot actually holds a snapshot id, so I should rename it to 'snapid'.
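Putting the review comments together, the struct would end up roughly like this (just a sketch; the snapid width is an assumption):

struct vdi_iocb {
	char *name;		/* was 'data' */
	uint32_t data_len;
	uint64_t size;
	uint32_t base_vid;
	uint32_t snapid;	/* was 'is_snapshot' */
	int nr_copies;
};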

thanks,

levin

 +int nr_copies;
 +};
 +
 
 Can we use this structure to lookup_vdi() and del_vdi(), too?
 
 Thanks,
 
 Kazutaka
 



Re: [sheepdog] [PATCH v0, RFC] sheep: writeback cache semantics in backend store

2012-08-20 Thread MORITA Kazutaka
At Mon, 20 Aug 2012 13:51:08 +0800,
Liu Yuan wrote:
 
 I doubt that this approach is really useful:
 
  1. Suppose we can only call sync() to flush the page cache on each node.
 With a cluster that runs hundreds of images, sync() requests will be
 issued almost every second; this kind of request storm renders the idea
 useless compared to the O_SYNC open flag.
  2. Even if we work with syncfs(), the benefit will be offset by
 the complexity of finding all the locations of the specified VDI and
 sending requests one by one.
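For reference, a minimal sketch of the two flush strategies being compared (the store path handling is hypothetical; syncfs() is Linux-specific and needs _GNU_SOURCE):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Strategy A: open the object file with O_SYNC so every write hits disk. */
static int open_object_sync(const char *path)
{
	return open(path, O_RDWR | O_SYNC);
}

/* Strategy B: plain writeback, flushed on demand.  syncfs() flushes only
 * the filesystem containing the store directory; sync() would flush every
 * filesystem on the node. */
static int flush_store(const char *store_dir)
{
	int fd = open(store_dir, O_RDONLY | O_DIRECTORY);
	int ret;

	if (fd < 0)
		return -1;
	ret = syncfs(fd);
	close(fd);
	return ret;
}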

I guess this approach will only give a benefit when the number of nodes
and VMs is small, but it's okay if it's not turned on by default.
Anyway, I'd like to see more benchmark results (e.g. running dbench on
several VMs simultaneously) before accepting this patch.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH v4 06/10] sheep: fetch vdi copy list after sheep joins the cluster

2012-08-20 Thread MORITA Kazutaka
At Mon, 20 Aug 2012 15:41:03 +0800,
levin li wrote:
 
 On 2012/08/20 13:15, MORITA Kazutaka wrote:
  At Thu,  9 Aug 2012 13:27:41 +0800,
  levin li wrote:
 
  From: levin li xingke@taobao.com
 
  The newly joined node doesn't have the vdi copy list, or has an
  incomplete vdi copy list, so we need to fetch the copy list data
  from the other nodes.
  
  It makes code complex to store the copy list in local store because
  it's difficult to keep consistency of the data.
  
  I'd suggest gathering both vid and copy list with SD_OP_READ_VDI
  requests at the same time.  Then we can remove this patch and simplify
  5th patch a lot.
  
  Thanks,
  
  Kazutaka
  
 
 How about this:
 
 We don't store the VDI copy list locally, but read it from the local
 VDI inode object when a node starts up, and in update_cluster_info()
 we collect the entire VDI copy list from the other nodes, just as
 get_vdi_bitmap() does,

That's just what I meant.

 but, a little differently from get_vdi_bitmap(), we cannot make it
 run asynchronously if the node needs recovery, because we need the
 VDI copy list during recovery.  The solution is that in
 prepare_object_list() we make sheep sleep until it finds that
 get_vdi_copy_list() has finished.
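One possible shape for that wait (purely a sketch; the flag and the polling loop are assumptions, and a completion callback or condition variable would work just as well):

#include <stdbool.h>
#include <unistd.h>

/* Set when the asynchronous get_vdi_copy_list() work has completed. */
static volatile bool vdi_copy_list_ready;

/* Called at the start of recovery: block until the copy list is usable. */
static void wait_for_vdi_copy_list(void)
{
	while (!vdi_copy_list_ready)
		usleep(100 * 1000);	/* poll every 100 ms */
}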

Looks good.

However, I'm wondering if it's much easier to save the number of
copies as an xattr of each object rather than managing the VDI copy
list.  What do you think about it?

Thanks,

Kazutaka


Re: [sheepdog] [PATCH v4 06/10] sheep: fetch vdi copy list after sheep joins the cluster

2012-08-20 Thread levin li
On 2012/08/20 16:03, MORITA Kazutaka wrote:
 At Mon, 20 Aug 2012 15:41:03 +0800,
 levin li wrote:

 On 2012/08/20 13:15, MORITA Kazutaka wrote:
 At Thu,  9 Aug 2012 13:27:41 +0800,
 levin li wrote:

 From: levin li xingke@taobao.com

 The newly joined node doesn't have the vdi copy list, or has an
 incomplete vdi copy list, so we need to fetch the copy list data
 from the other nodes.

 It makes code complex to store the copy list in local store because
 it's difficult to keep consistency of the data.

 I'd suggest gathering both vid and copy list with SD_OP_READ_VDI
 requests at the same time.  Then we can remove this patch and simplify
 5th patch a lot.

 Thanks,

 Kazutaka


 How about this:

 We don't store the VDI copy list locally, but read it from the local
 VDI inode object when a node starts up, and in update_cluster_info()
 we collect the entire VDI copy list from the other nodes, just as
 get_vdi_bitmap() does,
 
 That's just what I meant.
 
 but, a little differently from get_vdi_bitmap(), we cannot make it
 run asynchronously if the node needs recovery, because we need the
 VDI copy list during recovery.  The solution is that in
 prepare_object_list() we make sheep sleep until it finds that
 get_vdi_copy_list() has finished.
 
 Looks good.
 
 However, I'm wondering if it's much easier to save the number of
 copies as an xattr of each object rather than managing the VDI copy
 list.  What do you think about it?
 
 Thanks,
 
 Kazutaka
 

Saving nr_copies for each object as an xattr seems more complicated: when an
object is migrated from one node to another, the xattr info of that object
is lost.  Moreover, if we try to read an object on a remote node, we cannot
specify the copy number in read_object().  I think a copy list may be the
simplest solution to this problem; what do you think?
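For context, the xattr alternative under discussion would look roughly like this (a sketch; the attribute name is made up):

#include <stdint.h>
#include <sys/types.h>
#include <sys/xattr.h>

#define COPIES_XATTR "user.sheepdog.nr_copies"	/* hypothetical name */

static int set_object_nr_copies(const char *path, uint8_t nr_copies)
{
	return setxattr(path, COPIES_XATTR, &nr_copies, sizeof(nr_copies), 0);
}

static ssize_t get_object_nr_copies(const char *path, uint8_t *nr_copies)
{
	/* The attribute is simply gone if the object file was copied to
	 * another node without preserving extended attributes. */
	return getxattr(path, COPIES_XATTR, nr_copies, sizeof(*nr_copies));
}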

thanks,

levin


Re: [sheepdog] [PATCH v4 06/10] sheep: fetch vdi copy list after sheep joins the cluster

2012-08-20 Thread MORITA Kazutaka
At Mon, 20 Aug 2012 16:26:06 +0800,
levin li wrote:
 
 Saving nr_copies for each object as an xattr seems more complicated: when an
 object is migrated from one node to another, the xattr info of that object
 is lost.  Moreover, if we try to read an object on a remote node, we cannot
 specify the copy number in read_object().  I think a copy list may be the
 simplest solution to this problem; what do you think?

Ah, yes, we cannot get the nr_copies of vdi objects with the xattr
approach.  I agree with you.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Yunkai Zhang
On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Thu,  9 Aug 2012 16:43:38 +0800,
 Yunkai Zhang wrote:

 From: Yunkai Zhang qiushu@taobao.com

 V2:
 - fix a typo
 - when an object is updated, delete its old version
 - reset cluster recovery state in finish_recovery()

 Yunkai Zhang (11):
   sheep: enable variable-length of join_message in response of join
 event
   sheep: share joining nodes with newly added sheep
   sheep: delay to process recovery caused by LEAVE event just like JOIN
 event
   sheep: don't cleanup working directory when sheep joined back
   sheep: read objects only from live nodes
   sheep: write objects only on live nodes
   sheep: mark dirty object that belongs to the leaving nodes
   sheep: send dirty object list to each sheep when cluster do recovery
   sheep: do recovery with dirty object list
   collie: update 'collie cluster recover info' commands
   collie: update doc about 'collie cluster recover disable'

  collie/cluster.c  |  46 ---
  include/internal_proto.h  |  32 ++--
  include/sheep.h   |  23 ++
  man/collie.8  |   2 +-
  sheep/cluster.h   |  29 +--
  sheep/cluster/accord.c|   2 +-
  sheep/cluster/corosync.c  |   9 ++-
  sheep/cluster/local.c |   2 +-
  sheep/cluster/zookeeper.c |   2 +-
  sheep/farm/trunk.c|   2 +-
  sheep/gateway.c   |  39 -
  sheep/group.c | 202 +-
  sheep/object_list_cache.c | 182 +++--
  sheep/ops.c   |  85 ---
  sheep/recovery.c  | 133 +++---
  sheep/sheep_priv.h|  57 -
  16 files changed, 743 insertions(+), 104 deletions(-)

 I've looked into this series, and IMHO the change is too complex.

 With this series, when recovery is disabled and there are left nodes,
 sheep can succeed in a write operation even if the data is not fully
 replicated.  But, if we allow it, it is difficult to prevent VMs from
 reading old data.  Actually this series put a lot of effort into it.

We want to upgrade sheepdog without impacting the online VMs, so we
need to allow all VMs to do write operations while recovery is disabled
(it is important for a big cluster; we can't assume users would stop
their work during this time). And we also assume that this window is
short; we should upgrade sheepdog as soon as possible (< 5 minutes).

This patch is implemented based on the assumptions above. Maybe it's
difficult, but its algorithm is clear, just three steps (from the
description in the 9th patch's commit log):

1) If a sheep joins back to the cluster, there may be objects in its working
   directory which were deleted after this sheep left. After recovery starts,
   this sheep will send its object list to the other sheep, so after fetching
   all the object lists from the cluster, each sheep should screen such
   deleted objects out of the lists (see the sketch below).

2) A sheep which has left and joined back should drop the old-version
   objects and recover the new ones from the other sheep.

3) Objects which have been updated should not be recovered from a
   joined-back sheep.
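A rough sketch of step 1 (oid_to_vid() and vdi_is_deleted() are assumed helper names, not necessarily the ones used in the series):

#include <stddef.h>
#include <stdint.h>

/* Drop oids whose VDI was deleted while the joining sheep was away;
 * returns the number of oids kept. */
static size_t screen_deleted_oids(uint64_t *oids, size_t nr)
{
	size_t i, kept = 0;

	for (i = 0; i < nr; i++) {
		uint32_t vid = oid_to_vid(oids[i]);	/* assumed helper */

		if (vdi_is_deleted(vid))		/* assumed helper */
			continue;
		oids[kept++] = oids[i];
	}
	return kept;
}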


 I'd suggest allowing epoch increment even when recovery is
 disabled.  If recovery work recovers only rw->prio_oids and delays the
 recovery of rw->oids, I think we can get a similar benefit in a much
 simpler way:
   http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html

In fact, I have thought about this method, but we would face nearly the same problem:

After a sheep joins back, it needs to know which objects are dirty, and
it needs to do the cleanup work (because old-version objects stay in
its working directory). This method doesn't seem to save the steps, but
will do extra recovery work.


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/20/2012 11:34 PM, Yunkai Zhang wrote:
 On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
 morita.kazut...@lab.ntt.co.jp wrote:
 At Thu,  9 Aug 2012 16:43:38 +0800,
 Yunkai Zhang wrote:

 From: Yunkai Zhang qiushu@taobao.com

 V2:
 - fix a typo
 - when an object is updated, delete its old version
 - reset cluster recovery state in finish_recovery()

 Yunkai Zhang (11):
   sheep: enable variable-length of join_message in response of join
 event
   sheep: share joining nodes with newly added sheep
   sheep: delay to process recovery caused by LEAVE event just like JOIN
 event
   sheep: don't cleanup working directory when sheep joined back
   sheep: read objects only from live nodes
   sheep: write objects only on live nodes
   sheep: mark dirty object that belongs to the leaving nodes
   sheep: send dirty object list to each sheep when cluster do recovery
   sheep: do recovery with dirty object list
   collie: update 'collie cluster recover info' commands
   collie: update doc about 'collie cluster recover disable'

  collie/cluster.c  |  46 ---
  include/internal_proto.h  |  32 ++--
  include/sheep.h   |  23 ++
  man/collie.8  |   2 +-
  sheep/cluster.h   |  29 +--
  sheep/cluster/accord.c|   2 +-
  sheep/cluster/corosync.c  |   9 ++-
  sheep/cluster/local.c |   2 +-
  sheep/cluster/zookeeper.c |   2 +-
  sheep/farm/trunk.c|   2 +-
  sheep/gateway.c   |  39 -
  sheep/group.c | 202 +-
  sheep/object_list_cache.c | 182 +++--
  sheep/ops.c   |  85 ---
  sheep/recovery.c  | 133 +++---
  sheep/sheep_priv.h|  57 -
  16 files changed, 743 insertions(+), 104 deletions(-)

 I've looked into this series, and IMHO the change is too complex.

 With this series, when recovery is disabled and there are left nodes,
 sheep can succeed in a write operation even if the data is not fully
 replicated.  But, if we allow it, it is difficult to prevent VMs from
 reading old data.  Actually this series put a lot of effort into it.
 
 We want to upgrade sheepdog without impacting the online VMs, so we
 need to allow all VMs to do write operations while recovery is disabled
 (it is important for a big cluster; we can't assume users would stop
 their work during this time). And we also assume that this window is
 short; we should upgrade sheepdog as soon as possible (< 5 minutes).
 

Upgrading the cluster without stopping service is a nice feature, but I'm
afraid that in the near future Sheepdog won't meet this expectation, due to
fast-growing development which is likely to break the inter-sheep
assumptions. Before claiming to be capable of online upgrades, we should at
least have:
 1) inter-sheep protocol compatibility check logic
 2) a relatively stable feature set and internal physical state (such as the
config file)

That is, it is too early to talk about online upgrading for now.

 This patch is implemented based on the assumptions above. Maybe it's
 difficult, but its algorithm is clear, just three steps (from the
 description in the 9th patch's commit log):
 
 1) If a sheep joins back to the cluster, there may be objects in its working
    directory which were deleted after this sheep left. After recovery starts,
    this sheep will send its object list to the other sheep, so after fetching
    all the object lists from the cluster, each sheep should screen such
    deleted objects out of the lists.
 
 2) A sheep which has left and joined back should drop the old-version
    objects and recover the new ones from the other sheep.
 
 3) Objects which have been updated should not be recovered from a
    joined-back sheep.
 

 I'd suggest allowing epoch increment even when recovery is
 disabled.  If recovery work recovers only rw->prio_oids and delays the
 recovery of rw->oids, I think we can get a similar benefit in a much
 simpler way:
   http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html
 
 In fact, I have thought about this method, but we would face nearly the
 same problem:
 
 After a sheep joins back, it needs to know which objects are dirty, and
 it needs to do the cleanup work (because old-version objects stay in
 its working directory). This method doesn't seem to save the steps, but
 will do extra recovery work.
 

IMHO the suggested method won't cause different-version objects, because we
actually increment the epoch, and for the objects in rw->prio_oids (the ones
being requested) we do the same as now, so for that kind of object we can
still use the current code to handle it. For the objects not being requested
at all (which might account for the majority of the objects in a short time
window), we can do the trick of delaying their recovery as much as possible,
so that subsequent join 

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Christoph Hellwig
On Mon, Aug 20, 2012 at 11:34:10PM +0800, Yunkai Zhang wrote:
  sheep can succeed in a write operation even if the data is not fully
  replicated.  But, if we allow it, it is difficult to prevent VMs from
  reading old data.  Actually this series put a lot of effort into it.
 
 We want to upgrade sheepdog without impacting the online VMs, so we
 need to allow all VMs to do write operations while recovery is disabled
 (it is important for a big cluster; we can't assume users would stop
 their work during this time). And we also assume that this window is
 short; we should upgrade sheepdog as soon as possible (< 5 minutes).

FYI, I've been looking into this issue (but not this series yet) a bit
lately and came to the conclusion that the only way to properly solve it
is indeed to reduce redundancy.  One way to make this formal is
to have a minimum and a normal redundancy level and let writes succeed
as long as we meet the minimum level and not the full one.
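A sketch of that completion rule (names are hypothetical, not existing sheep code):

/* Decide whether a gateway write may complete: require at least the
 * minimum redundancy now and leave the remaining copies to recovery. */
static int check_write_redundancy(int nr_acked, int min_copies, int nr_copies)
{
	if (nr_acked >= nr_copies)
		return 0;	/* fully replicated */
	if (nr_acked >= min_copies)
		return 0;	/* degraded but acceptable */
	return -1;		/* below minimum redundancy: fail the write */
}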

Another thing that sprang to mind is that instead of the formal
recovery enable/disable we should simply always delay recovery, that
is, only do recovery every N seconds if changes have happened.
Especially in the case of whole racks going up/down, or upgrades, that
dramatically reduces the number of epochs required, and thus reduces
the recovery overhead.

I didn't actually have time to look into the implementation implications
of this yet, it's just a high-level thought.



Re: [sheepdog] [PATCH 1/2] collie: add self options to collie's command

2012-08-20 Thread Liu Yuan
On 08/20/2012 10:28 PM, Yunkai Zhang wrote:
 Now, all collie commands share the same global collie_options; this
 leads to option-name conflicts among commands if they use the same
 options but with different descriptions.
 
 By introducing self options to each command (if necessary) and making
 collie_options contain only the common part of all options, we can solve
 this issue.

I like this improvement, but 'self options' doesn't explain the idea well.
This is a kind of namespace for each subcommand, so simply naming it

+	struct sd_option *options;

in the structure is enough.

And rework the comment and commit log to replace 'self option' with a more
meaningful phrase, e.g.: "By moving the global options into the individual
command structure as a private member, we can solve this problem."
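A sketch of the idea (the struct and field names here are illustrative, not necessarily the ones collie uses):

#include <stdbool.h>

struct sd_option {
	int ch;			/* short option character */
	const char *name;	/* long option name */
	bool has_arg;
	const char *desc;	/* description, now per command */
};

struct subcommand {
	const char *name;
	int (*fn)(int argc, char **argv);
	struct sd_option *options;	/* command-private options */
};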

Also, with this patch we can then change all those upper-cased options into
lower case, e.g. 'vdi create -P' -> 'vdi create -p', for easier typing.

-- 
thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
 Another thing that sprang to mind is that instead of the formal
 recovery enable/disable we should simply always delay recovery, that
 is, only do recovery every N seconds if changes have happened.
 Especially in the case of whole racks going up/down, or upgrades, that
 dramatically reduces the number of epochs required, and thus reduces
 the recovery overhead.
 
 I didn't actually have time to look into the implementation implications
 of this yet, it's just a high-level thought.

I'm negative about delaying recovery all the time. It is useful to delay
recovery in some time window for maintenance or operational purposes, so I
think the idea of delaying recovery manually in some controlled window is
useful, but if we extend this to all of the running time, it will keep the
cluster in a less safe (if not dangerous) state at any point. (We only
upgrade the cluster or maintain individual nodes at certain times, not all
the time, no?)

Trading away data reliability is always the last resort for a distributed
system, whose selling point is data reliability compared to a single data
instance on a local disk.

-- 
thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Mon, 20 Aug 2012 23:34:10 +0800,
Yunkai Zhang wrote:
 
 In fact, I have thought about this method, but we would face nearly the
 same problem:
 
 After a sheep joins back, it needs to know which objects are dirty, and
 it needs to do the cleanup work (because old-version objects stay in
 its working directory). This method doesn't seem to save the steps, but
 will do extra recovery work.

Can you give me a concrete example?

I created a really naive patch to disable object recovery with my
idea:

==
diff --git a/sheep/recovery.c b/sheep/recovery.c
index 5164aa7..8bf032f 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -35,6 +35,7 @@ struct recovery_work {
 	uint64_t *oids;
 	uint64_t *prio_oids;
 	int nr_prio_oids;
+	int nr_scheduled_oids;
 
 	struct vnode_info *old_vinfo;
 	struct vnode_info *cur_vinfo;
@@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
 			oid);
 		return;
 	}
-	/* The oid is currently being recovered */
-	if (rw->oids[rw->done] == oid)
-		return;
 	rw->nr_prio_oids++;
 	rw->prio_oids = xrealloc(rw->prio_oids,
 				 rw->nr_prio_oids * sizeof(uint64_t));
@@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
 done:
 	free(rw->prio_oids);
 	rw->prio_oids = NULL;
+	rw->nr_scheduled_oids += rw->nr_prio_oids;
 	rw->nr_prio_oids = 0;
 }
 
+static struct timer recovery_timer;
+
+static void recover_next_object(void *arg)
+{
+	struct recovery_work *rw = arg;
+
+	if (rw->nr_prio_oids)
+		finish_schedule_oids(rw);
+
+	if (rw->done < rw->nr_scheduled_oids) {
+		/* Try recover next object */
+		queue_work(sys->recovery_wqueue, &rw->work);
+		return;
+	}
+
+	/* There is no objects to be recovered.  Try again later */
+	recovery_timer.callback = recover_next_object;
+	recovery_timer.data = rw;
+	add_timer(&recovery_timer, 1); /* FIXME */
+}
+
 static void recover_object_main(struct work *work)
 {
 	struct recovery_work *rw = container_of(work, struct recovery_work,
@@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
 	resume_wait_obj_requests(rw->oids[rw->done++]);
 
 	if (rw->done < rw->count) {
-		if (rw->nr_prio_oids)
-			finish_schedule_oids(rw);
-
-		/* Try recover next object */
-		queue_work(sys->recovery_wqueue, &rw->work);
+		recover_next_object(rw);
 		return;
 	}
 
@@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
 	resume_wait_recovery_requests();
 	rw->work.fn = recover_object_work;
 	rw->work.done = recover_object_main;
-	queue_work(sys->recovery_wqueue, &rw->work);
+	recover_next_object(rw);
 	return;
 }
 
==

I ran the following test, and object recovery was disabled correctly
for both join and leave case.

==
#!/bin/bash

for i in 0 1 2 3; do
    ./sheep/sheep /store/$i -z $i -p 700$i -c local
done

sleep 1
./collie/collie cluster format

./collie/collie vdi create test 4G

echo " * objects will be created on node[0-2] *"
md5sum /store/[0,1,2,3]/obj/807c2b25

pkill -f "./sheep/sheep /store/1"
sleep 3

echo " * recovery doesn't start until the object is touched *"
md5sum /store/[0,2,3]/obj/807c2b25

./collie/collie vdi snapshot test  # invoke recovery of the vdi object
echo " * the object is recovered *"
md5sum /store/[0,2,3]/obj/807c2b25

./sheep/sheep /store/1 -z 1 -p 7001 -c local
sleep 3

echo " * recovery doesn't start until the object is touched *"
md5sum /store/[0,1,2,3]/obj/807c2b25

./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
echo " * the object is recovered *"
md5sum /store/[0,1,2,3]/obj/807c2b25
==

[Output]

using backend farm store
 * objects will be created on node[0-2] *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
 * recovery doesn't start until the object is touched *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
 * the object is recovered *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
 * recovery doesn't start until the object is touched *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
  Name        Id    Size    Used  Shared    Creation time   VDI id  Tag
s test 1  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49   7c2b25  
  test 2  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49 

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 00:29:50 +0800,
Liu Yuan wrote:
 
 On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
  Another thing that sprang to mind is that instead of the formal
  recovery enable/disable we should simply always delay recovery, that
  is, only do recovery every N seconds if changes have happened.
  Especially in the case of whole racks going up/down, or upgrades, that
  dramatically reduces the number of epochs required, and thus reduces
  the recovery overhead.
  
  I didn't actually have time to look into the implementation implications
  of this yet, it's just a high-level thought.
 
 I'm negative about delaying recovery all the time. It is useful to delay
 recovery in some time window for maintenance or operational purposes, so I
 think the idea of delaying recovery manually in some controlled window is
 useful, but if we extend this to all of the running time, it will keep the
 cluster in a less safe (if not dangerous) state at any point. (We only
 upgrade the cluster or maintain individual nodes at certain times, not all
 the time, no?)
 
 Trading away data reliability is always the last resort for a distributed
 system, whose selling point is data reliability compared to a single data
 instance on a local disk.

I think always delaying recovery for a few seconds is useful for many
users.  Under heavy network load, sheep can wrongly detect node
failure, and node membership can change frequently.  Delaying recovery
for a short time makes Sheepdog tolerant of such situations.

Thanks,

Kazutaka


[sheepdog] [PATCH v2] sheep: make config file compatible with the previous one

2012-08-20 Thread MORITA Kazutaka
Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
---

Changes from v1:
 - remove 'version' from sheepdog_config

Even if we don't support a version check of the config in the next
release, we should fix the compatibility issue at least.


 sheep/store.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/sheep/store.c b/sheep/store.c
index 542804a..fcbf32d 100644
--- a/sheep/store.c
+++ b/sheep/store.c
@@ -30,10 +30,11 @@
 
 struct sheepdog_config {
uint64_t ctime;
-   uint64_t space;
uint16_t flags;
uint8_t copies;
uint8_t store[STORE_LEN];
+   uint8_t __pad[5];
+   uint64_t space;
 };
 
 char *obj_path;
-- 
1.7.2.5



Re: [sheepdog] [PATCH v2] sheep: make config file compatible with the previous one

2012-08-20 Thread Liu Yuan
On 08/21/2012 02:37 AM, MORITA Kazutaka wrote:
 + uint8_t __pad[5];
 + uint64_t space;

What is __pad[5] for?

Thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 10:46 AM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/21/2012 02:29 AM, MORITA Kazutaka wrote:
 I think always delaying recovery for a few seconds is useful for many
 users.  Under heavy network load, sheep can wrongly detect node
 failure, and node membership can change frequently.  Delaying recovery
 for a short time makes Sheepdog tolerant of such situations.

 I think your example is very vague; what kind of driver do you use? Sheep
 itself won't sense membership; it relies on cluster drivers to maintain
 membership. Could you detail how it happens exactly in a real case?

 If you are talking about the network partition problem, I don't think
 delaying recovery will help solve it. We have met network partitions when we
 used the corosync driver; for the zookeeper driver, we haven't met it yet.
 (I guess we won't meet it with zookeeper as a central membership control.)

 Suppose we have 6 nodes in a cluster, A,B,C,D,E,F, with one copy and epoch =
 1. At time t1 we get a network partition, and three partitions show
 up: c1(A,B,C), c2(D,E), c3(F). The epochs for these three partitions become,
 respectively, epoch(c1=4, c2=5, c3=6), and all 3 partitions proceed to
 recover and apply updates to their local objects.

 In your example above, suppose these 3 partitions later merge back into one
 partition automatically. This means that after merging:
 1) epoch(c1=7, c2=9, c3=11)
 2) there is no code to handle different-version objects, where every node
 thinks its own local version is correct.

 So I think we have to handle epoch mismatch and object multi-version
 problems before evaluating delay recovery for network partition.

 If you are not talking about the network partition problem, I think we can
 only meet the stop/restart-node case for manual maintenance, where I think
 manual recovery could really be helpful.


Delayed recovery couldn't solve the network partition problem, and, as you
mentioned above, if sheep breaks its internal protocol, delayed recovery
cannot help with sheep's upgrade.

But if sheep doesn't break the internal protocol, for example if we just fix
a memory leak, add some useful logging, or fix a corner case, it's very
useful for us.


 Thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao


[sheepdog] [PATCH] test: consolidate 010 to check manual recovery

2012-08-20 Thread Liu Yuan
From: Liu Yuan tailai...@taobao.com

Signed-off-by: Liu Yuan tailai...@taobao.com
---
 tests/010 |   14 ++
 tests/010.out |   15 ++-
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/tests/010 b/tests/010
index 7496e2d..c3f53b4 100755
--- a/tests/010
+++ b/tests/010
@@ -1,5 +1,7 @@
 #!/bin/bash
 
+# Test manual recovery command
+
 seq=`basename $0`
 echo QA output created by $seq
 
@@ -13,15 +15,14 @@ status=1# failure is the default!
 
 _cleanup
 
-_start_sheep 0
-_start_sheep 1
+for i in `seq 0 1`; do _start_sheep $i; done
 
-sleep 2
+_wait_for_sheep 2
 
 $COLLIE cluster format -c 2
 $COLLIE cluster recover disable
 
-qemu-img create sheepdog:test 4G
+$COLLIE vdi create test 4G
 
 # create 20 objects
 for i in `seq 0 19`; do
@@ -34,3 +35,8 @@ _start_sheep 2
 for i in `seq 0 19`; do
     $COLLIE vdi write test $((i * 4 * 1024 * 1024)) 512 < /dev/zero
 done
+
+$COLLIE cluster info | _filter_cluster_info
+
+$COLLIE cluster recover enable
+$COLLIE cluster info | _filter_cluster_info
diff --git a/tests/010.out b/tests/010.out
index 01cc1bf..ea84c35 100644
--- a/tests/010.out
+++ b/tests/010.out
@@ -2,4 +2,17 @@ QA output created by 010
 using backend farm store
 *Note*: Only disable the recovery caused by JOIN envets
 Cluster recovery: disable
-Formatting 'sheepdog:test', fmt=raw size=4294967296 
+Cluster status: running
+
+Cluster created at DATE
+
+Epoch Time   Version
+DATE  1 [127.0.0.1:7000, 127.0.0.1:7001]
+Cluster recovery: enable
+Cluster status: running
+
+Cluster created at DATE
+
+Epoch Time   Version
+DATE  2 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]
+DATE  1 [127.0.0.1:7000, 127.0.0.1:7001]
-- 
1.7.10.2



Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Mon, 20 Aug 2012 23:34:10 +0800,
 Yunkai Zhang wrote:

 In fact, I have thought about this method, but we would face nearly the
 same problem:

 After a sheep joins back, it needs to know which objects are dirty, and
 it needs to do the cleanup work (because old-version objects stay in
 its working directory). This method doesn't seem to save the steps, but
 will do extra recovery work.

 Can you give me a concrete example?

 I created a really naive patch to disable object recovery with my
 idea:

 ==
 diff --git a/sheep/recovery.c b/sheep/recovery.c
 index 5164aa7..8bf032f 100644
 --- a/sheep/recovery.c
 +++ b/sheep/recovery.c
 @@ -35,6 +35,7 @@ struct recovery_work {
  	uint64_t *oids;
  	uint64_t *prio_oids;
  	int nr_prio_oids;
 +	int nr_scheduled_oids;
 
  	struct vnode_info *old_vinfo;
  	struct vnode_info *cur_vinfo;
 @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
  			oid);
  		return;
  	}
 -	/* The oid is currently being recovered */
 -	if (rw->oids[rw->done] == oid)
 -		return;
  	rw->nr_prio_oids++;
  	rw->prio_oids = xrealloc(rw->prio_oids,
  				 rw->nr_prio_oids * sizeof(uint64_t));
 @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
  done:
  	free(rw->prio_oids);
  	rw->prio_oids = NULL;
 +	rw->nr_scheduled_oids += rw->nr_prio_oids;
  	rw->nr_prio_oids = 0;
  }
 
 +static struct timer recovery_timer;
 +
 +static void recover_next_object(void *arg)
 +{
 +	struct recovery_work *rw = arg;
 +
 +	if (rw->nr_prio_oids)
 +		finish_schedule_oids(rw);
 +
 +	if (rw->done < rw->nr_scheduled_oids) {
 +		/* Try recover next object */
 +		queue_work(sys->recovery_wqueue, &rw->work);
 +		return;
 +	}
 +
 +	/* There is no objects to be recovered.  Try again later */
 +	recovery_timer.callback = recover_next_object;
 +	recovery_timer.data = rw;
 +	add_timer(&recovery_timer, 1); /* FIXME */
 +}
 +
  static void recover_object_main(struct work *work)
  {
  	struct recovery_work *rw = container_of(work, struct recovery_work,
 @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
  	resume_wait_obj_requests(rw->oids[rw->done++]);
 
  	if (rw->done < rw->count) {
 -		if (rw->nr_prio_oids)
 -			finish_schedule_oids(rw);
 -
 -		/* Try recover next object */
 -		queue_work(sys->recovery_wqueue, &rw->work);
 +		recover_next_object(rw);
  		return;
  	}
 
 @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
  	resume_wait_recovery_requests();
  	rw->work.fn = recover_object_work;
  	rw->work.done = recover_object_main;
 -	queue_work(sys->recovery_wqueue, &rw->work);
 +	recover_next_object(rw);
  	return;
  }

 ==

 I ran the following test, and object recovery was disabled correctly
 for both join and leave case.

 ==
 #!/bin/bash
 
 for i in 0 1 2 3; do
     ./sheep/sheep /store/$i -z $i -p 700$i -c local
 done
 
 sleep 1
 ./collie/collie cluster format
 
 ./collie/collie vdi create test 4G
 
 echo " * objects will be created on node[0-2] *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 
 pkill -f "./sheep/sheep /store/1"
 sleep 3
 
 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,2,3]/obj/807c2b25
 
 ./collie/collie vdi snapshot test  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,2,3]/obj/807c2b25
 
 ./sheep/sheep /store/1 -z 1 -p 7001 -c local
 sleep 3
 
 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 
 ./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 ==

 [Output]

 using backend farm store
  * objects will be created on node[0-2] *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * recovery doesn't start until the object is touched *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * the object is recovered *
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
  * recovery doesn't start until the object is touched *
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
 3c3bf0d865363fd0d1f1d5c7aa044dcd  

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 10:46:19 +0800,
Liu Yuan wrote:
 
 So I think we have to handle epoch mismatch and object multi-version
 problems before evaluating delay recovery for network partition.

Yes, delayed recovery doesn't solve my example at all unless sheepdog
handles network partitions.  I didn't intend to say that always delaying
recovery is necessary now, but it's worth considering in the future.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/21/2012 11:21 AM, MORITA Kazutaka wrote:
 At Tue, 21 Aug 2012 10:46:19 +0800,
 Liu Yuan wrote:

 So I think we have to handle epoch mismatch and object multi-version
 problems before evaluating delay recovery for network partition.
 
 Yes, delayed recovery doesn't solve my example at all unless sheepdog
 handles network partitions.  I didn't intend to say that always delaying
 recovery is necessary now, but it's worth considering in the future.
 

Well, with a centralized membership control driver such as zookeeper or
accord (I'd like to build the accord driver, possibly simplified and
tailored for sheep, into the sheepdog repo for better development), I
think the network partition problem can virtually disappear with
well-written software that collaborates with sheep to minimize the
chance of a partition happening, rather than trying to solve it after it happens.

Thanks,
Yuan




Re: [sheepdog] [PATCH v2] sheep: make config file compatible with the previous one

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 10:57:38 +0800,
Liu Yuan wrote:
 
 On 08/21/2012 02:37 AM, MORITA Kazutaka wrote:
  +   uint8_t __pad[5];
  +   uint64_t space;
 
 What is __pad[5] for?

If we don't add the padding, 32 and 64 bit machines read different
data from the same config file.  All network protocols and disk
formats should be aligned to 8 bytes, though we support only x86_64
now.
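To spell that out (assuming STORE_LEN is 16; the exact value doesn't change the argument):

#include <stdint.h>

#define STORE_LEN 16	/* assumed for this example */

struct sheepdog_config {		/* offsets with the padding */
	uint64_t ctime;			/*  0 */
	uint16_t flags;			/*  8 */
	uint8_t copies;			/* 10 */
	uint8_t store[STORE_LEN];	/* 11 .. 26 */
	uint8_t __pad[5];		/* 27 .. 31 */
	uint64_t space;			/* 32; sizeof == 40 everywhere */
};

/* Without __pad, 'space' would start at offset 32 on x86_64 (uint64_t
 * members are 8-byte aligned) but at offset 28 on i386 (4-byte member
 * alignment), so 32- and 64-bit sheep would read different data from
 * the same on-disk config file. */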

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Dietmar Maurer
 On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
  Another thing that sprang to mind is that instead of the formal
  recovery enable/disable we should simply always delay recovery, that
  is, only do recovery every N seconds if changes have happened.
  Especially in the case of whole racks going up/down, or upgrades, that
  dramatically reduces the number of epochs required, and thus reduces
  the recovery overhead.
 
  I didn't actually have time to look into the implementation
  implications of this yet, it's just a high-level thought.
 
 I'm negative about delaying recovery all the time. It is useful to delay
 recovery in some time window for maintenance or operational purposes, so I
 think the idea of delaying recovery manually in some controlled window is
 useful, but if we extend this to all of the running time, it will keep the
 cluster in a less safe (if not dangerous) state at any point. (We only
 upgrade the cluster or maintain individual nodes at certain times, not all
 the time, no?)

I still think that automatic recovery without delay is the wrong approach. At 
least for
small clusters you simply want to avoid unnecessary traffic. Such recovery can 
produce
massive traffic on the network (several TB of data), and can make the whole 
system unusable 
because of that. I want to control when recovery starts.

- Dietmar



Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 04:34:05 +,
Dietmar Maurer wrote:
 
  On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
   Another thing that sprang to mind is that instead of the formal
   recovery enable/disable we should simply always delay recovery, that
   is, only do recovery every N seconds if changes have happened.
   Especially in the case of whole racks going up/down, or upgrades, that
   dramatically reduces the number of epochs required, and thus reduces
   the recovery overhead.
  
   I didn't actually have time to look into the implementation
   implications of this yet, it's just a high-level thought.
  
  I'm negative about delaying recovery all the time. It is useful to delay
  recovery in some time window for maintenance or operational purposes, so I
  think the idea of delaying recovery manually in some controlled window is
  useful, but if we extend this to all of the running time, it will keep the
  cluster in a less safe (if not dangerous) state at any point. (We only
  upgrade the cluster or maintain individual nodes at certain times, not all
  the time, no?)
 
 I still think that automatic recovery without delay is the wrong approach. At 
 least for
 small clusters you simply want to avoid unnecessary traffic. Such recovery 
 can produce
 massive traffic on the network (several TB of data), and can make the whole 
 system unusable 
 because of that. I want to control when recovery starts.

Disabling automatic recovery by default doesn't work for you?  You can
control the time to start recovery with collie cluster recover enable.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/21/2012 12:34 PM, Dietmar Maurer wrote:
 I still think that automatic recovery without delay is the wrong approach. At 
 least for
 small clusters you simply want to avoid unnecessary traffic. Such recovery 
 can produce
 massive traffic on the network (several TB of data), and can make the whole 
 system unusable 
 because of that. I want to control when recovery starts.

Your goal of avoiding unnecessary object transfer can actually be built on
top of the manual recovery mechanism. If we implement manual recovery, we
can add a timeout option to it (very easy); then if someone wants to
always delay recovery, he can simply disable automatic recovery and
specify a timeout for it. In this way, we can have several policies to
accommodate different purposes.

Thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Dietmar Maurer
 I think your example is very vague; what kind of driver do you use? Sheep
 itself won't sense membership; it relies on cluster drivers to maintain
 membership. Could you detail how it happens exactly in a real case?

Membership changes can happen for many reasons. They can happen if something is
wrong on the switch (or if some admin reconfigures the switch), with a damaged
network cable, a bug in the bonding driver, a damaged network card, or simply a
power failure on a node, which reconnects after the power is back on.

In the literature, the problem is also known as the 'babbling idiot' problem
(real-time people use that term). A single node can make the whole system
unusable.

- Dietmar 



Re: [sheepdog] [PATCH] test: consolidate 010 to check manual recovery

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 11:03:52 +0800,
Liu Yuan wrote:
 
 From: Liu Yuan tailai...@taobao.com
 
 Signed-off-by: Liu Yuan tailai...@taobao.com
 ---
  tests/010 |   14 ++
  tests/010.out |   15 ++-
  2 files changed, 24 insertions(+), 5 deletions(-)

Applied, thanks!

Kazutaka


Re: [sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command

2012-08-20 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 1:42 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Thu, 16 Aug 2012 22:38:21 +0800,
 Yunkai Zhang wrote:

 After add '-F' flag, the help looks like:
 $ collie vdi check
 Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] 
 vdiname
 Options:
   -F, --force_repair  force repair object's copies (dangerous)

 How about '-r, --repair'?


Good for me.



  		fprintf(stderr, "Failed to read, %s\n",
  			sd_strerror(rsp->result));
  		exit(EXIT_FAILURE);
  	}
  -	return buf;
  +
  +	memcpy(sha1, (unsigned char *)&rsp->__pad[0], SHA1_LEN);

 Please define a member name instead of using __pad.

OK.



  }

  -static void write_object_to(struct sd_vnode *vnode, uint64_t oid, void *buf)
  +static int do_repair(uint64_t oid, struct node_id *src, struct node_id *dest)
   {
   	struct sd_req hdr;
   	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
  +	unsigned rlen, wlen;
  +	char host[128];
   	int fd, ret;
  -	unsigned wlen = SD_DATA_OBJ_SIZE, rlen = 0;
  -	char name[128];
 
  -	addr_to_str(name, sizeof(name), vnode->nid.addr, 0);
  -	fd = connect_to(name, vnode->nid.port);
  +	addr_to_str(host, sizeof(host), dest->addr, 0);
  +
  +	fd = connect_to(host, dest->port);
   	if (fd < 0) {
  -		fprintf(stderr, "failed to connect to %s:%"PRIu32"\n",
  -			name, vnode->nid.port);
  -		exit(EXIT_FAILURE);
  +		fprintf(stderr, "Failed to connect\n");
  +		return SD_RES_EIO;
   	}
 
  -	sd_init_req(&hdr, SD_OP_WRITE_PEER);
  -	hdr.epoch = sd_epoch;
  -	hdr.flags = SD_FLAG_CMD_WRITE;
  -	hdr.data_length = wlen;
  +	sd_init_req(&hdr, SD_OP_REPAIR_OBJ_PEER);

 I don't think sending peer requests directly from outside sheeps is a
 good idea.  How about making the gateway node forward the requests?

Ok, no problem.




  +	rlen = 0;
  +	wlen = sizeof(*src);
  +
  +	hdr.epoch = sd_epoch;
   	hdr.obj.oid = oid;
  +	hdr.data_length = wlen;
  +	hdr.flags = SD_FLAG_CMD_WRITE;
 
  -	ret = exec_req(fd, &hdr, buf, &wlen, &rlen);
  +	ret = exec_req(fd, &hdr, src, &wlen, &rlen);
   	close(fd);
  -
   	if (ret) {
  -		fprintf(stderr, "Failed to execute request\n");
  -		exit(EXIT_FAILURE);
  +		fprintf(stderr, "Failed to repair oid:%"PRIx64"\n", oid);
  +		return SD_RES_EIO;
   	}
  -
   	if (rsp->result != SD_RES_SUCCESS) {
  -		fprintf(stderr, "Failed to read, %s\n",
  -			sd_strerror(rsp->result));
  -		exit(EXIT_FAILURE);
  +		fprintf(stderr, "Failed to repair oid:%"PRIx64", %s\n",
  +			oid, sd_strerror(rsp->result));
  +		return rsp->result;
   	}
  +
  +	return SD_RES_SUCCESS;
   }

  -/*
  - * Fix consistency of the replica of oid.
  - *
  - * XXX: The fix is rather dumb, just read the first copy and write it
  - * to other replica.
  - */
  -static void do_check_repair(uint64_t oid, int nr_copies)
  +static int do_check_repair(uint64_t oid, int nr_copies)
   {
   	struct sd_vnode *tgt_vnodes[nr_copies];
  -	void *buf, *buf_cmp;
  -	int i;
  +	unsigned char sha1[SD_MAX_COPIES][SHA1_LEN];
  +	char host[128];
  +	int i, j;
 
   	oid_to_vnodes(sd_vnodes, sd_vnodes_nr, oid, nr_copies, tgt_vnodes);
  -	buf = read_object_from(tgt_vnodes[0], oid);
  -	for (i = 1; i < nr_copies; i++) {
  -		buf_cmp = read_object_from(tgt_vnodes[i], oid);
  -		if (memcmp(buf, buf_cmp, SD_DATA_OBJ_SIZE)) {
  -			free(buf_cmp);
  -			goto fix_consistency;
  +	for (i = 0; i < nr_copies; i++) {
  +		get_obj_checksum_from(tgt_vnodes[i], oid, sha1[i]);
  +	}
  +
  +	for (i = 0; i < nr_copies; i++) {
  +		for (j = (i + 1); j < nr_copies; j++) {
  +			if (memcmp(sha1[i], sha1[j], SHA1_LEN))
  +				goto diff;
   		}
  -		free(buf_cmp);
   	}
  -	free(buf);
  -	return;
  +	return 0;
 
  -fix_consistency:
  -	for (i = 1; i < nr_copies; i++)
  -		write_object_to(tgt_vnodes[i], oid, buf);
  -	fprintf(stdout, "fix %"PRIx64" success\n", oid);
  -	free(buf);
  +diff:
  +	fprintf(stderr, "Failed oid: %"PRIx64"\n", oid);
  +	for (i = 0; i < nr_copies; i++) {
  +		addr_to_str(host, sizeof(host), tgt_vnodes[i]->nid.addr, 0);
  +		fprintf(stderr, " copy[%d], sha1: %s, from: %s:%d\n",
  +			i, sha1_to_hex(sha1[i]), host,
  +			tgt_vnodes[i]->nid.port);
  +	}
  +
  +	if (!vdi_cmd_data.force_repair)
  +		return -1;
  +
  +	/*
  +	 * Force repair the consistency of oid's replica
  +	 *
  +	 * FIXME: this fix is rather dumb, it just read the
  +	 * first copy and write it to other replica,
  +	 */
  +	fprintf(stderr, " force repairing ...\n");
  +	addr_to_str(host, sizeof(host), tgt_vnodes[0]->nid.addr,
  +	

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Dietmar Maurer
 Disabling automatic recovery by default doesn't work for you?  You can
 control the time to start recovery with collie cluster recover enable.

It just looks strange to me to design the system for immediate/automatic
recovery and make 'disabling automatic recovery' an option. I would include
the node state in the epoch. But maybe that is only an implementation detail.

- Dietmar

