Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 1:50 PM, Dietmar Maurer diet...@proxmox.com wrote:
 Disabling automatic recovery by default doesn't work for you?  You can
 control the time to start recovery with collie cluster recover enable.

 It just looks strange to me to design the system for immediate/automatic
 recovery, and make 'disabling automatic recovery' an option. I would include
 the node state into the epoch. But maybe that is only an implementation detail.

Hi folks:

I'd like to reach a conclusion:

Does sheepdog need the delayed recovery supported by this series (or by
Kazum's new idea and implementation)?



 - Dietmar





-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Bastian Scholz

Hi Dietmar, Hi Yuan,

On 2012-08-21 07:27, Dietmar Maurer wrote:
 Membership change can happen for many reasons. It can happen if something is
 wrong on the switch (or if some admin configures the switch), a damaged
 network cable, a bug in the bonding driver, a damaged network card, or simply
 a power failure on a node, which reconnects after power is back on.


At least David and I hit this problem recently in our environment, and it
ended in complete data loss. I'm not saying it happens very often, or that it
must be handled in a way that lets the cluster keep running when it happens,
but in my opinion this situation should be handled without data loss...

Cheers

Bastian


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 14:14:23 +0800,
Yunkai Zhang wrote:
 I need a conclusion:
 
 Does sheepdog need delay recovery supported by this series (or by
 Kazum's new idea and implementation) ?

There are two different discussions in this thread:

  1. turn on/off automatic recovery with a collie command (supported
 by this series)
  2. delay starting automatic recovery in any case

I think no one is against supporting 1.
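
For reference, the intended workflow for 1 would look roughly like this (the
exact sub-command syntax may differ from what the final series implements):

  # disable automatic recovery before planned maintenance
  collie cluster recover disable

  # check the current recovery state
  collie cluster recover info

  # re-enable recovery once the cluster membership is stable again
  collie cluster recover enable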

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:43 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Tue, 21 Aug 2012 14:14:23 +0800,
 Yunkai Zhang wrote:
 I need a conclusion:

 Does sheepdog need delay recovery supported by this series (or by
 Kazum's new idea and implementation) ?

 There are two different discussion in this thread:

   1. turn on/off automatic recovery with a collie command (supported
  by this series)
   2. delay starting automatic recovery in any case

 I think no one is against supporting 1.

Ok, I'll continue to improve this series after I complete other things.


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Liu Yuan
On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
 Ok, I'll continue to improve this series after I complete other things.

Why not choose Kazutaka's idea to implement delayed recovery? It looks
simple yet efficient, at least to me.

Thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
 Ok, I'll continue to improve this series after I complete other things.

 Why not choose Kazutaka's idea to implement delay recovery? It looks
 simple yet efficient at least to me.

continue to import this series, of course including kazum's idea if
it's the best way.


 Thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 3:04 PM, Yunkai Zhang yunkai...@gmail.com wrote:
 On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
 Ok, I'll continue to improve this series after I complete other things.

 Why not choose Kazutaka's idea to implement delay recovery? It looks
 simple yet efficient at least to me.

 continue to import this series, of course including kazum's idea if
 it's the best way.

s/import/improve/



 Thanks,
 Yuan



 --
 Yunkai Zhang
 Work at Taobao



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Mon, 20 Aug 2012 23:34:10 +0800,
 Yunkai Zhang wrote:

 In fact, I have thought this method, but we should face nearly the same 
 problem:

 After sheep joined back, it should known which objects is dirty, and
 should do the clear work(because there are old version object stay in
 it's working directory). This method seems not save the steps, but
 will do extra recovery works.

 Can you give me a concrete example?

 I created a really naive patch to disable object recovery with my
 idea:


Hi Kazum:

I have read this patch and did some simple testing; it works most of the time.

But write operations can get blocked in wait_forward_request(), so I think
there are some corner cases we still need to handle.

I think I have understood this idea now; it's simple and clever.


Could you turn this into a mature patch? We really want to use it in our
cluster as soon as possible.


Thank you!


 ==
 diff --git a/sheep/recovery.c b/sheep/recovery.c
 index 5164aa7..8bf032f 100644
 --- a/sheep/recovery.c
 +++ b/sheep/recovery.c
 @@ -35,6 +35,7 @@ struct recovery_work {
          uint64_t *oids;
          uint64_t *prio_oids;
          int nr_prio_oids;
 +        int nr_scheduled_oids;
 
          struct vnode_info *old_vinfo;
          struct vnode_info *cur_vinfo;
 @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
                          oid);
                  return;
          }
 -        /* The oid is currently being recovered */
 -        if (rw->oids[rw->done] == oid)
 -                return;
          rw->nr_prio_oids++;
          rw->prio_oids = xrealloc(rw->prio_oids,
                                   rw->nr_prio_oids * sizeof(uint64_t));
 @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
  done:
          free(rw->prio_oids);
          rw->prio_oids = NULL;
 +        rw->nr_scheduled_oids += rw->nr_prio_oids;
          rw->nr_prio_oids = 0;
  }
 
 +static struct timer recovery_timer;
 +
 +static void recover_next_object(void *arg)
 +{
 +        struct recovery_work *rw = arg;
 +
 +        if (rw->nr_prio_oids)
 +                finish_schedule_oids(rw);
 +
 +        if (rw->done < rw->nr_scheduled_oids) {
 +                /* Try recover next object */
 +                queue_work(sys->recovery_wqueue, &rw->work);
 +                return;
 +        }
 +
 +        /* There is no objects to be recovered.  Try again later */
 +        recovery_timer.callback = recover_next_object;
 +        recovery_timer.data = rw;
 +        add_timer(&recovery_timer, 1); /* FIXME */
 +}
 +
  static void recover_object_main(struct work *work)
  {
          struct recovery_work *rw = container_of(work, struct recovery_work,
 @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
          resume_wait_obj_requests(rw->oids[rw->done++]);
 
          if (rw->done < rw->count) {
 -                if (rw->nr_prio_oids)
 -                        finish_schedule_oids(rw);
 -
 -                /* Try recover next object */
 -                queue_work(sys->recovery_wqueue, &rw->work);
 +                recover_next_object(rw);
                  return;
          }
 
 @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
          resume_wait_recovery_requests();
          rw->work.fn = recover_object_work;
          rw->work.done = recover_object_main;
 -        queue_work(sys->recovery_wqueue, &rw->work);
 +        recover_next_object(rw);
          return;
  }

 ==

 I ran the following test, and object recovery was disabled correctly
 for both join and leave case.

 ==
 #!/bin/bash
 
 for i in 0 1 2 3; do
     ./sheep/sheep /store/$i -z $i -p 700$i -c local
 done
 
 sleep 1
 ./collie/collie cluster format
 
 ./collie/collie vdi create test 4G
 
 echo " * objects will be created on node[0-2] *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 
 pkill -f "./sheep/sheep /store/1"
 sleep 3
 
 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,2,3]/obj/807c2b25
 
 ./collie/collie vdi snapshot test  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,2,3]/obj/807c2b25
 
 ./sheep/sheep /store/1 -z 1 -p 7001 -c local
 sleep 3
 
 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 
 ./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 ==

 [Output]

 using backend farm store
  * objects will be created on node[0-2] *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * recovery doesn't start until the object is touched *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * the object is recovered *
 

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread MORITA Kazutaka
At Wed, 22 Aug 2012 01:16:49 +0800,
Yunkai Zhang wrote:
 
 I have read and do simple test with this patch, it works at most time.
 
 But write operation will be blocked in wait_forward_request(), I think
 there are some corner case we should handle.

Can you create a testcase to reproduce it?

 Could you give a mature patch? We really want to use it in our cluster
 as soon as possible.

Okay, but I'm currently working on another problem - sheep blocks I/O
requests for a long time while stale objects are moved to the farm backend
store.  I'll give it a try after that.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Wed, Aug 22, 2012 at 9:31 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Wed, 22 Aug 2012 01:16:49 +0800,
 Yunkai Zhang wrote:

 I have read and do simple test with this patch, it works at most time.

 But write operation will be blocked in wait_forward_request(), I think
 there are some corner case we should handle.

 Can you create a testcase to reproduce it?

Ok, I'll give a testcase later.


 Could you give a mature patch? We really want to use it in our cluster
 as soon as possible.

 Okay, but I'm currently working on another problem - sheep blocks I/O
 requests long time while stale objects are moved to the farm backend
 store.  I'll give a try after that.

Thanks~


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Liu Yuan
On 08/22/2012 09:44 AM, Yunkai Zhang wrote:
 Could you give a mature patch? We really want to use it in our cluster
  as soon as possible.
 
  Okay, but I'm currently working on another problem - sheep blocks I/O
  requests long time while stale objects are moved to the farm backend
  store.  I'll give a try after that.
 Thanks~
 

Hi Yunkai, since you have been working on this series all these days, why not
pick up Kazutaka's draft patch and perfect it? I think that was Kazutaka's
original intention - the patch is just a concrete example to show that his
idea could work out.

-- 
thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Wed, Aug 22, 2012 at 9:55 AM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/22/2012 09:44 AM, Yunkai Zhang wrote:
 Could you give a mature patch? We really want to use it in our cluster
  as soon as possible.
 
  Okay, but I'm currently working on another problem - sheep blocks I/O
  requests long time while stale objects are moved to the farm backend
  store.  I'll give a try after that.
 Thanks~


 Hi Yunkai, since you are doing this series all these days, why not pick up
 Kazutaka's draft patch and perfect it? I think this is Kazutaka original
 intention, just a concrete example to say his idea could work out.


My intention is to respect Kazum's idea; if my help is needed, I'm pleased
to do it :).



 --
 thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-21 Thread Yunkai Zhang
On Wed, Aug 22, 2012 at 10:21 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Wed, 22 Aug 2012 10:14:07 +0800,
 Yunkai Zhang wrote:

 My intention is to respect Kazum's idea, if need my help, I'm pleasure
 to do it:).

 If you complete the work, it will help me a lot. :)

Well, I'll complete it :)

But I'm busy with other things right now; I'll probably send the first version
based on this idea on Friday or this weekend.


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Yunkai Zhang
On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Thu,  9 Aug 2012 16:43:38 +0800,
 Yunkai Zhang wrote:

 From: Yunkai Zhang qiushu@taobao.com

 V2:
 - fix a typo
 - when an object is updated, delete it old version
 - reset cluster recovery state in finish_recovery()

 Yunkai Zhang (11):
   sheep: enable variale-length of join_message in response of join
 event
   sheep: share joining nodes with newly added sheep
   sheep: delay to process recovery caused by LEAVE event just like JOIN
 event
   sheep: don't cleanup working directory when sheep joined back
   sheep: read objects only from live nodes
   sheep: write objects only on live nodes
   sheep: mark dirty object that belongs to the leaving nodes
   sheep: send dirty object list to each sheep when cluster do recovery
   sheep: do recovery with dirty object list
   collie: update 'collie cluster recover info' commands
   collie: update doc about 'collie cluster recover disable'

  collie/cluster.c  |  46 ---
  include/internal_proto.h  |  32 ++--
  include/sheep.h   |  23 ++
  man/collie.8  |   2 +-
  sheep/cluster.h   |  29 +--
  sheep/cluster/accord.c|   2 +-
  sheep/cluster/corosync.c  |   9 ++-
  sheep/cluster/local.c |   2 +-
  sheep/cluster/zookeeper.c |   2 +-
  sheep/farm/trunk.c|   2 +-
  sheep/gateway.c   |  39 -
  sheep/group.c | 202 
 +-
  sheep/object_list_cache.c | 182 +++--
  sheep/ops.c   |  85 ---
  sheep/recovery.c  | 133 +++---
  sheep/sheep_priv.h|  57 -
  16 files changed, 743 insertions(+), 104 deletions(-)

 I've looked into this series, and IMHO the change is too complex.

 With this series, when recovery is disabled and there are left nodes,
 sheep can succeed in a write operation even if the data is not fully
 replicated.  But, if we allow it, it is difficult to prevent VMs from
 reading old data.  Actually this series put a lot of effort into it.

We want to upgrade sheepdog without impacting the online VMs, so we need to
allow all VMs to keep doing write operations while recovery is disabled (this
is important for a big cluster - we can't assume users would stop their work
during this time). We also assume that this window is short: we should
upgrade sheepdog as quickly as possible (< 5 minutes).

This patch set is implemented based on the assumptions above. It may look
difficult, but its algorithm is clear - just three steps (from the description
in the 9th patch's commit log):

1) When a sheep joins back to the cluster, objects that were deleted after it
   left may still sit in its working directory, and after recovery starts this
   sheep will send its object list to the other sheep. So after fetching all
   object lists from the cluster, each sheep should screen out these deleted
   objects.

2) A sheep that left and joined back should drop the old versions of its
   objects and recover the new ones from the other sheep.

3) Objects that have been updated should not be recovered from a sheep that
   has joined back.


 I'd suggest allowing epoch increment even when recover is
 disabled.  If recovery work recovers only rw->prio_oids and delays the
 recovery of rw->oids, I think we can get the similar benefit with much
 simpler way:
   http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html

In fact, I considered this method, but it faces nearly the same problem:

After a sheep joins back, it has to know which objects are dirty and clean
them up (because old versions of objects remain in its working directory).
So this method doesn't seem to save any steps, and it will do extra recovery
work.


 Thanks,

 Kazutaka



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/20/2012 11:34 PM, Yunkai Zhang wrote:
 On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
 morita.kazut...@lab.ntt.co.jp wrote:
 At Thu,  9 Aug 2012 16:43:38 +0800,
 Yunkai Zhang wrote:

 From: Yunkai Zhang qiushu@taobao.com

 V2:
 - fix a typo
 - when an object is updated, delete it old version
 - reset cluster recovery state in finish_recovery()

 Yunkai Zhang (11):
   sheep: enable variale-length of join_message in response of join
 event
   sheep: share joining nodes with newly added sheep
   sheep: delay to process recovery caused by LEAVE event just like JOIN
 event
   sheep: don't cleanup working directory when sheep joined back
   sheep: read objects only from live nodes
   sheep: write objects only on live nodes
   sheep: mark dirty object that belongs to the leaving nodes
   sheep: send dirty object list to each sheep when cluster do recovery
   sheep: do recovery with dirty object list
   collie: update 'collie cluster recover info' commands
   collie: update doc about 'collie cluster recover disable'

  collie/cluster.c  |  46 ---
  include/internal_proto.h  |  32 ++--
  include/sheep.h   |  23 ++
  man/collie.8  |   2 +-
  sheep/cluster.h   |  29 +--
  sheep/cluster/accord.c|   2 +-
  sheep/cluster/corosync.c  |   9 ++-
  sheep/cluster/local.c |   2 +-
  sheep/cluster/zookeeper.c |   2 +-
  sheep/farm/trunk.c|   2 +-
  sheep/gateway.c   |  39 -
  sheep/group.c | 202 
 +-
  sheep/object_list_cache.c | 182 +++--
  sheep/ops.c   |  85 ---
  sheep/recovery.c  | 133 +++---
  sheep/sheep_priv.h|  57 -
  16 files changed, 743 insertions(+), 104 deletions(-)

 I've looked into this series, and IMHO the change is too complex.

 With this series, when recovery is disabled and there are left nodes,
 sheep can succeed in a write operation even if the data is not fully
 replicated.  But, if we allow it, it is difficult to prevent VMs from
 reading old data.  Actually this series put a lot of effort into it.
 
 We want to upgrade sheepdog while not impact all online VMs, so we
 need to allow all VMs to do write operation when recovery is disable
 (It is important for a big cluster, we can't assume users would stop
 their works during this time). And we also assume that this time is
 short, we should upgrade sheepdog as soon as possible (< 5 minutes).
 

Upgrading the cluster without stopping service is a nice feature, but I'm
afraid that in the near future Sheepdog won't meet this expectation, because
fast-moving development is likely to break the inter-sheep assumptions. Before
we have this feature, we should at least do the following before claiming to
be capable of online upgrades:
 1) inter-sheep protocol compatibility check logic
 2) a relatively stable feature set and internal physical state (such as the
    config file)

That is, it is too early to talk about online upgrading for now.

 This patch is implemented based on those assumption above. And maybe
 it's difficult, but it's algorithm is clear, just three steps(from the
 description from the 9th patch's commit log):
 
 1) If a sheep joined back to the cluster, but there are some objects which 
 have
been deleted after this sheep left, such objects stay in its working
directory, after recovery start, this sheep will send its object list to
other sheeps. So after fetched all object list from cluster, each sheep
should screen out these deleted objects list.
 
 2) A sheep which have been left and joined back should drop the old version
objects and recover the new ones from other sheeps.
 
 3) The objects which have been updated should not recovered from a joined
back sheep.
 

 I'd suggest allowing epoch increment even when recover is
 disabled.  If recovery work recovers only rw->prio_oids and delays the
 recovery of rw->oids, I think we can get the similar benefit with much
 simpler way:
   http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html
 
 In fact, I have thought this method, but we should face nearly the same 
 problem:
 
 After sheep joined back, it should known which objects is dirty, and
 should do the clear work(because there are old version object stay in
 it's working directory). This method seems not save the steps, but
 will do extra recovery works.
 

IMHO, the suggested method won't cause different versions of objects, because
we actually increment the epoch and we do the same as now for the objects in
rw->prio_oids, which are being requested. So for this kind of object we can
still use the current code to handle it. For those objects not being requested
at all (which might account for the majority of the objects in a short time
window), we can do the trick: delay recovering them as much as possible, so
that subsequent join 

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Christoph Hellwig
On Mon, Aug 20, 2012 at 11:34:10PM +0800, Yunkai Zhang wrote:
  sheep can succeed in a write operation even if the data is not fully
  replicated.  But, if we allow it, it is difficult to prevent VMs from
  reading old data.  Actually this series put a lot of effort into it.
 
 We want to upgrade sheepdog while not impact all online VMs, so we
 need to allow all VMs to do write operation when recovery is disable
 (It is important for a big cluster, we can't assume users would stop
 their works during this time). And we also assume that this time is
 short, we should upgrade sheepdog as soon as possible (< 5 minutes).

FYI, I've been looking into this issue (but not this series yet) a bit
lately and came to the conclusion that the only way to properly solve it
is indeed to reduce redundancy.  One way to make this formal is
to have a minimum and a normal redundancy level and let writes succeed
as long as we meet the minimum level and not the full one.

Another thing that sprang to mind is that instead of the formal
recovery enable/disable we should simply always delay recovery, that
is, only do recovery after every N seconds if changes happened.
Especially in the cases of whole racks going up/down or upgrades, that
dramatically reduces the number of epochs required, and thus reduces
the recovery overhead.

I didn't actually have time to look into the implementation implications
of this yet; these are just high-level thoughts.



Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
 Another thing that sprang into mind is that instead of the formal
 recovery enable/disable we should simply always delay recovery, that
 is only do recovery after every N seconds if changes happened.
 Especially in the cases of whole racks going up/down or upgrades that
 dramatically reduces the number of epochs required, and thus reduces
 the recovery overhead.
 
 I didn't actually have time to look into the implementation implications
 of this yet, it's just high level thoughs.

I'm negative about delaying recovery all the time. It is useful to delay
recovery in some time window for maintenance or operational purposes, so I
think the idea of delaying recovery manually in some controlled window is
useful; but if we extend this to the whole running time, it will keep the
cluster in a less safe (if not dangerous) state at any point. (We only upgrade
the cluster or maintain individual nodes at certain times, not all the time,
no?)

Trading away data reliability is always the last resort for a distributed
system, which emphasizes data reliability compared to a single data instance
on a local disk.

-- 
thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Mon, 20 Aug 2012 23:34:10 +0800,
Yunkai Zhang wrote:
 
 In fact, I have thought this method, but we should face nearly the same 
 problem:
 
 After sheep joined back, it should known which objects is dirty, and
 should do the clear work(because there are old version object stay in
 it's working directory). This method seems not save the steps, but
 will do extra recovery works.

Can you give me a concrete example?

I created a really naive patch to disable object recovery with my
idea:

==
diff --git a/sheep/recovery.c b/sheep/recovery.c
index 5164aa7..8bf032f 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -35,6 +35,7 @@ struct recovery_work {
         uint64_t *oids;
         uint64_t *prio_oids;
         int nr_prio_oids;
+        int nr_scheduled_oids;
 
         struct vnode_info *old_vinfo;
         struct vnode_info *cur_vinfo;
@@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
                         oid);
                 return;
         }
-        /* The oid is currently being recovered */
-        if (rw->oids[rw->done] == oid)
-                return;
         rw->nr_prio_oids++;
         rw->prio_oids = xrealloc(rw->prio_oids,
                                  rw->nr_prio_oids * sizeof(uint64_t));
@@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
 done:
         free(rw->prio_oids);
         rw->prio_oids = NULL;
+        rw->nr_scheduled_oids += rw->nr_prio_oids;
         rw->nr_prio_oids = 0;
 }
 
+static struct timer recovery_timer;
+
+static void recover_next_object(void *arg)
+{
+        struct recovery_work *rw = arg;
+
+        if (rw->nr_prio_oids)
+                finish_schedule_oids(rw);
+
+        if (rw->done < rw->nr_scheduled_oids) {
+                /* Try recover next object */
+                queue_work(sys->recovery_wqueue, &rw->work);
+                return;
+        }
+
+        /* There is no objects to be recovered.  Try again later */
+        recovery_timer.callback = recover_next_object;
+        recovery_timer.data = rw;
+        add_timer(&recovery_timer, 1); /* FIXME */
+}
+
 static void recover_object_main(struct work *work)
 {
         struct recovery_work *rw = container_of(work, struct recovery_work,
@@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
         resume_wait_obj_requests(rw->oids[rw->done++]);
 
         if (rw->done < rw->count) {
-                if (rw->nr_prio_oids)
-                        finish_schedule_oids(rw);
-
-                /* Try recover next object */
-                queue_work(sys->recovery_wqueue, &rw->work);
+                recover_next_object(rw);
                 return;
         }
 
@@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
         resume_wait_recovery_requests();
         rw->work.fn = recover_object_work;
         rw->work.done = recover_object_main;
-        queue_work(sys->recovery_wqueue, &rw->work);
+        recover_next_object(rw);
         return;
 }
 
==

I ran the following test, and object recovery was disabled correctly
for both the join and leave cases.

==
#!/bin/bash

for i in 0 1 2 3; do
    ./sheep/sheep /store/$i -z $i -p 700$i -c local
done

sleep 1
./collie/collie cluster format

./collie/collie vdi create test 4G

echo " * objects will be created on node[0-2] *"
md5sum /store/[0,1,2,3]/obj/807c2b25

pkill -f "./sheep/sheep /store/1"
sleep 3

echo " * recovery doesn't start until the object is touched *"
md5sum /store/[0,2,3]/obj/807c2b25

./collie/collie vdi snapshot test  # invoke recovery of the vdi object
echo " * the object is recovered *"
md5sum /store/[0,2,3]/obj/807c2b25

./sheep/sheep /store/1 -z 1 -p 7001 -c local
sleep 3

echo " * recovery doesn't start until the object is touched *"
md5sum /store/[0,1,2,3]/obj/807c2b25

./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
echo " * the object is recovered *"
md5sum /store/[0,1,2,3]/obj/807c2b25
==

[Output]

using backend farm store
 * objects will be created on node[0-2] *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
 * recovery doesn't start until the object is touched *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
 * the object is recovered *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
 * recovery doesn't start until the object is touched *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
  Name        Id    Size    Used  Shared    Creation time   VDI id  Tag
s test         1  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49   7c2b25
  test         2  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49 

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 00:29:50 +0800,
Liu Yuan wrote:
 
 On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
  Another thing that sprang into mind is that instead of the formal
  recovery enable/disable we should simply always delay recovery, that
  is only do recovery after every N seconds if changes happened.
  Especially in the cases of whole racks going up/down or upgrades that
  dramatically reduces the number of epochs required, and thus reduces
  the recovery overhead.
  
  I didn't actually have time to look into the implementation implications
  of this yet, it's just high level thoughs.
 
 I think negatively to delay recovery all the time. It is useful to delay 
 recovery
 in some time window for maintenance or operational purposes, so I think the 
 idea
 only to delay recovery manually at some controlled window is useful, but if 
 we extend
 this to all the running time, it will bring cluster to a less safe state (if 
 not
 dangerous) at any point. (we only upgrade cluster/maintain individual node 
 only at some time,
 not all the time, no?)
 
 Trading data reliability is always the last resort for a distributed system, 
 which highlights
 data reliability compared to single data instance in local disk.  

I think always delaying recovery for a few seconds is useful for many
users.  Under heavy network load, sheep can wrongly detect node
failure and node membership can change frequently.  Delaying recovery
for a short time makes Sheepdog tolerant of such situations.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 10:46 AM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/21/2012 02:29 AM, MORITA Kazutaka wrote:
 I think delaying recovery for a few seconds always is useful for many
 users.  Under heavy network load, sheep can wrongly detect node
 failure and node membership can change frequently.  Delaying recovery
 for a short time makes Sheepdog tolerant against such situation.

 I think your example is very vague, what kind of driver you use? Sheep
 itself won't sense membership and rely on cluster drivers to maintain
 membership. Could you detail how it happen exactly in real case?

 If you are talking about network partition problem, I don't think delay
 recovery will help solve it. We have met network partition when we used
 corosync driver, for zookeeper driver, we haven't met it yet. (I guess
 we won't meet it with zookeeper as a central membership control).

 Suppose we have 6 nodes in a cluster, A,B,C,D,E,F one copy with epoch =
 1. For time t1, we get network partitioned, and three partitions show
 up, c1(A,B,C), c2(D,E),c3(F). So epoch for this three partitions is
 respectively epoch(c1=4, c2=5, c3=6) and all 3 partitions progress to
 recover and get updates to its local object.

 In your above example, suppose we might have these 3 partition
 automatically merge into one partition, this means, after merging
 1) epoch(c1=7, c2=9, c3=11)
 2) no code to handle different version objects which all nodes think his
 own local version is correct.

 So I think we have to handle epoch mismatch and object multi-version
 problems before evaluating delay recovery for network partition.

 If you are not talking about network partition problem, I think we can
 only meet stop/restart node case for manual maintenance, where I think
 manual recovery could really be helpful.


Delayed recovery couldn't solve the network partition problem, and, as you
mentioned above, if a new sheep breaks the internal protocol, delayed recovery
could not help with upgrading sheep either.

But if the new sheep doesn't break the internal protocol - for example, we
just fix a memory leak, add some useful logging, or fix a corner case - it is
very useful for us.
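
As a rough sketch of that use case (using the recover enable/disable commands
from this series and the sheep/collie invocations from Kazutaka's test script;
the paths, ports and node count are only for illustration):

#!/bin/bash
# Hypothetical rolling-upgrade sketch: keep recovery off while the nodes are
# restarted one by one, then turn it back on.
./collie/collie cluster recover disable      # stop automatic recovery first

for i in 0 1 2 3; do
    pkill -f "./sheep/sheep /store/$i"       # stop the old daemon on node $i
    # ... install the upgraded sheep binary here ...
    ./sheep/sheep /store/$i -z $i -p 700$i -c local   # start the new daemon
    sleep 3                                  # give it time to join back
done

./collie/collie cluster recover enable       # resume recovery when all nodes are back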


 Thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Yunkai Zhang
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
 At Mon, 20 Aug 2012 23:34:10 +0800,
 Yunkai Zhang wrote:

 In fact, I have thought this method, but we should face nearly the same 
 problem:

 After sheep joined back, it should known which objects is dirty, and
 should do the clear work(because there are old version object stay in
 it's working directory). This method seems not save the steps, but
 will do extra recovery works.

 Can you give me a concrete example?

 I created a really naive patch to disable object recovery with my
 idea:

 ==
 diff --git a/sheep/recovery.c b/sheep/recovery.c
 index 5164aa7..8bf032f 100644
 --- a/sheep/recovery.c
 +++ b/sheep/recovery.c
 @@ -35,6 +35,7 @@ struct recovery_work {
          uint64_t *oids;
          uint64_t *prio_oids;
          int nr_prio_oids;
 +        int nr_scheduled_oids;
 
          struct vnode_info *old_vinfo;
          struct vnode_info *cur_vinfo;
 @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
                          oid);
                  return;
          }
 -        /* The oid is currently being recovered */
 -        if (rw->oids[rw->done] == oid)
 -                return;
          rw->nr_prio_oids++;
          rw->prio_oids = xrealloc(rw->prio_oids,
                                   rw->nr_prio_oids * sizeof(uint64_t));
 @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
  done:
          free(rw->prio_oids);
          rw->prio_oids = NULL;
 +        rw->nr_scheduled_oids += rw->nr_prio_oids;
          rw->nr_prio_oids = 0;
  }
 
 +static struct timer recovery_timer;
 +
 +static void recover_next_object(void *arg)
 +{
 +        struct recovery_work *rw = arg;
 +
 +        if (rw->nr_prio_oids)
 +                finish_schedule_oids(rw);
 +
 +        if (rw->done < rw->nr_scheduled_oids) {
 +                /* Try recover next object */
 +                queue_work(sys->recovery_wqueue, &rw->work);
 +                return;
 +        }
 +
 +        /* There is no objects to be recovered.  Try again later */
 +        recovery_timer.callback = recover_next_object;
 +        recovery_timer.data = rw;
 +        add_timer(&recovery_timer, 1); /* FIXME */
 +}
 +
  static void recover_object_main(struct work *work)
  {
          struct recovery_work *rw = container_of(work, struct recovery_work,
 @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
          resume_wait_obj_requests(rw->oids[rw->done++]);
 
          if (rw->done < rw->count) {
 -                if (rw->nr_prio_oids)
 -                        finish_schedule_oids(rw);
 -
 -                /* Try recover next object */
 -                queue_work(sys->recovery_wqueue, &rw->work);
 +                recover_next_object(rw);
                  return;
          }
 
 @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
          resume_wait_recovery_requests();
          rw->work.fn = recover_object_work;
          rw->work.done = recover_object_main;
 -        queue_work(sys->recovery_wqueue, &rw->work);
 +        recover_next_object(rw);
          return;
  }

 ==

 I ran the following test, and object recovery was disabled correctly
 for both join and leave case.

 ==
 #!/bin/bash
 
 for i in 0 1 2 3; do
     ./sheep/sheep /store/$i -z $i -p 700$i -c local
 done
 
 sleep 1
 ./collie/collie cluster format
 
 ./collie/collie vdi create test 4G
 
 echo " * objects will be created on node[0-2] *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 
 pkill -f "./sheep/sheep /store/1"
 sleep 3
 
 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,2,3]/obj/807c2b25
 
 ./collie/collie vdi snapshot test  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,2,3]/obj/807c2b25
 
 ./sheep/sheep /store/1 -z 1 -p 7001 -c local
 sleep 3
 
 echo " * recovery doesn't start until the object is touched *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 
 ./collie/collie vdi list -p 7001  # invoke recovery of the vdi object
 echo " * the object is recovered *"
 md5sum /store/[0,1,2,3]/obj/807c2b25
 ==

 [Output]

 using backend farm store
  * objects will be created on node[0-2] *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * recovery doesn't start until the object is touched *
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
  * the object is recovered *
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
  * recovery doesn't start until the object is touched *
 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
 3c3bf0d865363fd0d1f1d5c7aa044dcd  

Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 10:46:19 +0800,
Liu Yuan wrote:
 
 So I think we have to handle epoch mismatch and object multi-version
 problems before evaluating delay recovery for network partition.

Yes, delayed recovery doesn't solve my example at all unless sheepdog
handles network partitions.  I didn't intend to say that always delaying
recovery is necessary now, but it is worth considering in the future.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/21/2012 11:21 AM, MORITA Kazutaka wrote:
 At Tue, 21 Aug 2012 10:46:19 +0800,
 Liu Yuan wrote:

 So I think we have to handle epoch mismatch and object multi-version
 problems before evaluating delay recovery for network partition.
 
 Yes, delay recovery doesn't solve my example at all unless sheepdog
 handles network partition.  I didn't intend to say that delaying
 recovery always is necessary now but worth considering in future.
 

Well, with a centralized membership control driver such as zookeeper or
accord (I'd like to build the accord driver - possibly simplified and tailored
for sheep - into the sheepdog repo for better development), I think the
network partition problem can virtually go away with well-written software
that collaborates with sheep to minimize the chances of a network partition
happening, rather than trying to solve it after it happens.

Thanks,
Yuan




Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Dietmar Maurer
 On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
  Another thing that sprang into mind is that instead of the formal
  recovery enable/disable we should simply always delay recovery, that
  is only do recovery after every N seconds if changes happened.
  Especially in the cases of whole racks going up/down or upgrades that
  dramatically reduces the number of epochs required, and thus reduces
  the recovery overhead.
 
  I didn't actually have time to look into the implementation
  implications of this yet, it's just high level thoughs.
 
 I think negatively to delay recovery all the time. It is useful to delay 
 recovery
 in some time window for maintenance or operational purposes, so I think
 the idea only to delay recovery manually at some controlled window is
 useful, but if we extend this to all the running time, it will bring cluster 
 to a
 less safe state (if not
 dangerous) at any point. (we only upgrade cluster/maintain individual node
 only at some time, not all the time, no?)

I still think that automatic recovery without delay is the wrong approach. At
least for small clusters you simply want to avoid unnecessary traffic. Such
recovery can produce massive traffic on the network (several TB of data), and
can make the whole system unusable because of that. I want to control when
recovery starts.

- Dietmar



Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread MORITA Kazutaka
At Tue, 21 Aug 2012 04:34:05 +,
Dietmar Maurer wrote:
 
  On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
   Another thing that sprang into mind is that instead of the formal
   recovery enable/disable we should simply always delay recovery, that
   is only do recovery after every N seconds if changes happened.
   Especially in the cases of whole racks going up/down or upgrades that
   dramatically reduces the number of epochs required, and thus reduces
   the recovery overhead.
  
   I didn't actually have time to look into the implementation
   implications of this yet, it's just high level thoughs.
  
  I think negatively to delay recovery all the time. It is useful to delay 
  recovery
  in some time window for maintenance or operational purposes, so I think
  the idea only to delay recovery manually at some controlled window is
  useful, but if we extend this to all the running time, it will bring 
  cluster to a
  less safe state (if not
  dangerous) at any point. (we only upgrade cluster/maintain individual node
  only at some time, not all the time, no?)
 
 I still think that automatic recovery without delay is the wrong approach. At
 least for small clusters you simply want to avoid unnecessary traffic. Such
 recovery can produce massive traffic on the network (several TB of data), and
 can make the whole system unusable because of that. I want to control when
 recovery starts.

Disabling automatic recovery by default doesn't work for you?  You can
control the time to start recovery with collie cluster recover enable.

Thanks,

Kazutaka


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Liu Yuan
On 08/21/2012 12:34 PM, Dietmar Maurer wrote:
 I still think that automatic recovery without delay is the wrong approach. At
 least for small clusters you simply want to avoid unnecessary traffic. Such
 recovery can produce massive traffic on the network (several TB of data), and
 can make the whole system unusable because of that. I want to control when
 recovery starts.

Your goal of avoiding unnecessary object transfer can actually be built on
top of the manual recovery mechanism. If we implement manual recovery, we can
add a timeout option to it (very easy); then if someone wants to always delay
recovery, he can simply disable automatic recovery and specify a timeout for
it. In this way, we can have several policies to accommodate different
purposes.
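
Until such a timeout option exists, the same policy can be approximated from
outside with nothing but the enable/disable commands from this series, e.g.
(hypothetical sketch):

#!/bin/bash
# Keep automatic recovery disabled during a maintenance window and only
# re-enable it after a configurable delay.
DELAY=${1:-300}                    # seconds to wait before recovery starts

collie cluster recover disable     # membership changes no longer trigger recovery
echo "recovery disabled; do the maintenance now"
sleep "$DELAY"                     # the 'timeout' handled outside of collie
collie cluster recover enable      # recovery starts from here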

Thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Dietmar Maurer
 I think your example is very vague, what kind of driver you use? Sheep itself
 won't sense membership and rely on cluster drivers to maintain
 membership. Could you detail how it happen exactly in real case?

Membership change can happen for many reasons. It can happen if something is
wrong on the switch (or if some admin configures the switch), a damaged
network cable, a bug in the bonding driver, a damaged network card, or simply
a power failure on a node, which reconnects after power is back on.

In the literature, the problem is also known as the 'babbling idiot' (real-time
people use that term). A single node can make the whole system unusable.

- Dietmar 



Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-20 Thread Dietmar Maurer
 Disabling automatic recovery by default doesn't work for you?  You can
 control the time to start recovery with collie cluster recover enable.

It just looks strange to me to design the system for immediate/automatic
recovery, and make 'disabling automatic recovery' an option. I would include
the node state into the epoch. But maybe that is only an implementation detail.

- Dietmar




Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-12 Thread Liu Yuan
On 08/09/2012 04:43 PM, Yunkai Zhang wrote:
 - fix a typo
 - when an object is updated, delete it old version
 - reset cluster recovery state in finish_recovery()

You should briefly describe what your patch set does in the introduction cover
letter. I have no idea what your "INTRODUCE" means. Please complete your
title too.

What is the use case for delaying the LEAVE event (why do we need such
complexity)? How is it used? These are the most important arguments you should
include in the cover letter to defend your patch set and convince people it is
useful.

Thanks,
Yuan


Re: [sheepdog] [PATCH V2 00/11] INTRODUCE

2012-08-12 Thread Yunkai Zhang
On Mon, Aug 13, 2012 at 10:29 AM, Liu Yuan namei.u...@gmail.com wrote:
 On 08/09/2012 04:43 PM, Yunkai Zhang wrote:
 - fix a typo
 - when an object is updated, delete it old version
 - reset cluster recovery state in finish_recovery()

 You should brief what your patch set does in the introduction cover
 letter. I have no idea what your INTRODUCE means. Please complete your
 title too.

 What is use case to delay LEAVE event (why we need such complexity)? How

I suppose the reviewers already know my previous patch sets (delaying the JOIN
event), because we have already discussed the benefit of delaying the LEAVE
event in detail with Kazum and other users. I just didn't want to type it
again; you can learn the background from this mail thread:
http://lists.wpkg.org/pipermail/sheepdog/2012-July/005684.html

If you insist that I rewrite the cover letter, I can update it in the next
version.

 to use? This is the most important arguments you should include in the
 cover letter to defend your patch set and let people think it is useful.

 Thanks,
 Yuan



-- 
Yunkai Zhang
Work at Taobao