Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 1:50 PM, Dietmar Maurer diet...@proxmox.com wrote:
>> Disabling automatic recovery by default doesn't work for you? You can
>> control the time to start recovery with "collie cluster recover enable".
>
> It just looks strange to me to design the system for immediate/automatic
> recovery, and then make 'disabling automatic recovery' an option. I would
> include the node state in the epoch. But maybe that is only an
> implementation detail.

Hi folks:

I need a conclusion: does sheepdog need the delayed recovery supported by
this series (or by Kazum's new idea and implementation)?

> - Dietmar

-- 
Yunkai Zhang
Work at Taobao
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
Hi Dietmar, Hi Yuan,

On 2012-08-21 07:27, Dietmar Maurer wrote:
> Membership change can happen for many reasons. It can happen if something
> is wrong on the switch (or if some admin configures the switch), a damaged
> network cable, a bug in the bonding driver, a damaged network card, or
> simply a power failure on a node, which reconnects after power is back on.

David and I (at least) recently had this problem in our environment, and it
ended in complete data loss. Not that it happens very often, or that it
should be handled in a way that lets the cluster keep running when it does,
but in my opinion this situation should be handled without data loss...

Cheers
Bastian
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 14:14:23 +0800, Yunkai Zhang wrote:
> I need a conclusion: does sheepdog need the delayed recovery supported by
> this series (or by Kazum's new idea and implementation)?

There are two different discussions in this thread:

1. turn on/off automatic recovery with a collie command (supported by this
   series)
2. delay starting automatic recovery in any case

I think no one is against supporting 1.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:43 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
> At Tue, 21 Aug 2012 14:14:23 +0800, Yunkai Zhang wrote:
>> I need a conclusion: does sheepdog need the delayed recovery supported by
>> this series (or by Kazum's new idea and implementation)?
>
> There are two different discussions in this thread:
>
> 1. turn on/off automatic recovery with a collie command (supported by this
>    series)
> 2. delay starting automatic recovery in any case
>
> I think no one is against supporting 1.

Ok, I'll continue to improve this series after I complete other things.
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
> Ok, I'll continue to improve this series after I complete other things.

Why not choose Kazutaka's idea to implement delayed recovery? It looks
simple yet efficient, at least to me.

Thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan namei.u...@gmail.com wrote:
> On 08/21/2012 02:48 PM, Yunkai Zhang wrote:
>> Ok, I'll continue to improve this series after I complete other things.
>
> Why not choose Kazutaka's idea to implement delayed recovery? It looks
> simple yet efficient, at least to me.

I'll continue to import this series, of course including Kazum's idea if
it's the best way.
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 3:04 PM, Yunkai Zhang yunkai...@gmail.com wrote:
> On Tue, Aug 21, 2012 at 2:58 PM, Liu Yuan namei.u...@gmail.com wrote:
>> Why not choose Kazutaka's idea to implement delayed recovery? It looks
>> simple yet efficient, at least to me.
>
> I'll continue to import this series, of course including Kazum's idea if
> it's the best way.

s/import/improve/
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
> At Mon, 20 Aug 2012 23:34:10 +0800, Yunkai Zhang wrote:
>> In fact, I have thought about this method, but we would face nearly the
>> same problem: after a sheep joins back, it should know which objects are
>> dirty, and should do the cleanup work (because old versions of objects
>> stay in its working directory). This method does not seem to save steps,
>> but will do extra recovery work.
>
> Can you give me a concrete example? I created a really naive patch to
> disable object recovery with my idea:

Hi Kazum:

I have read and done a simple test with this patch; it works most of the
time. But write operations can get blocked in wait_forward_request(), so I
think there are some corner cases we should handle.

I think I have understood this good idea; it's simple and clever. Could you
give a mature patch? We really want to use it in our cluster as soon as
possible.

Thank you!

> [quoted patch, test script and output snipped]
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Wed, 22 Aug 2012 01:16:49 +0800, Yunkai Zhang wrote:
> I have read and done a simple test with this patch; it works most of the
> time. But write operations can get blocked in wait_forward_request(), so I
> think there are some corner cases we should handle.

Can you create a testcase to reproduce it?

> Could you give a mature patch? We really want to use it in our cluster as
> soon as possible.

Okay, but I'm currently working on another problem - sheep blocks I/O
requests for a long time while stale objects are moved to the farm backend
store. I'll give it a try after that.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Wed, Aug 22, 2012 at 9:31 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
> At Wed, 22 Aug 2012 01:16:49 +0800, Yunkai Zhang wrote:
>> I have read and done a simple test with this patch; it works most of the
>> time. But write operations can get blocked in wait_forward_request(), so
>> I think there are some corner cases we should handle.
>
> Can you create a testcase to reproduce it?

Ok, I'll give a testcase later.

> Okay, but I'm currently working on another problem - sheep blocks I/O
> requests for a long time while stale objects are moved to the farm
> backend store. I'll give it a try after that.

Thanks~
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/22/2012 09:44 AM, Yunkai Zhang wrote:
>>> Could you give a mature patch? We really want to use it in our cluster
>>> as soon as possible.
>>
>> Okay, but I'm currently working on another problem - sheep blocks I/O
>> requests for a long time while stale objects are moved to the farm
>> backend store. I'll give it a try after that.
>
> Thanks~

Hi Yunkai, since you have been working on this series all these days, why
not pick up Kazutaka's draft patch and perfect it? I think this was
Kazutaka's original intention - just a concrete example to show that his
idea could work out.

-- 
thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Wed, Aug 22, 2012 at 9:55 AM, Liu Yuan namei.u...@gmail.com wrote:
> Hi Yunkai, since you have been working on this series all these days, why
> not pick up Kazutaka's draft patch and perfect it? I think this was
> Kazutaka's original intention - just a concrete example to show that his
> idea could work out.

My intention is to respect Kazum's idea; if he needs my help, I'm pleased
to do it :).
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Wed, Aug 22, 2012 at 10:21 AM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
> At Wed, 22 Aug 2012 10:14:07 +0800, Yunkai Zhang wrote:
>> My intention is to respect Kazum's idea; if he needs my help, I'm pleased
>> to do it :).
>
> If you complete the work, it will help me a lot. :)

Well, I'll complete it :) But now I'm busy with other things; maybe I'll
send the first version based on this idea on Friday or this weekend.
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
> At Thu, 9 Aug 2012 16:43:38 +0800, Yunkai Zhang wrote:
>> From: Yunkai Zhang qiushu@taobao.com
>>
>> V2:
>> - fix a typo
>> - when an object is updated, delete its old version
>> - reset cluster recovery state in finish_recovery()
>>
>> Yunkai Zhang (11):
>>   sheep: enable variable-length join_message in response to a join event
>>   sheep: share joining nodes with newly added sheep
>>   sheep: delay processing recovery caused by a LEAVE event, just like a JOIN event
>>   sheep: don't clean up the working directory when a sheep joins back
>>   sheep: read objects only from live nodes
>>   sheep: write objects only on live nodes
>>   sheep: mark dirty objects that belong to leaving nodes
>>   sheep: send the dirty object list to each sheep when the cluster does recovery
>>   sheep: do recovery with the dirty object list
>>   collie: update 'collie cluster recover info' commands
>>   collie: update doc about 'collie cluster recover disable'
>>
>>  collie/cluster.c          |  46 ---
>>  include/internal_proto.h  |  32 ++--
>>  include/sheep.h           |  23 ++
>>  man/collie.8              |   2 +-
>>  sheep/cluster.h           |  29 +--
>>  sheep/cluster/accord.c    |   2 +-
>>  sheep/cluster/corosync.c  |   9 ++-
>>  sheep/cluster/local.c     |   2 +-
>>  sheep/cluster/zookeeper.c |   2 +-
>>  sheep/farm/trunk.c        |   2 +-
>>  sheep/gateway.c           |  39 -
>>  sheep/group.c             | 202 +-
>>  sheep/object_list_cache.c | 182 +++--
>>  sheep/ops.c               |  85 ---
>>  sheep/recovery.c          | 133 +++---
>>  sheep/sheep_priv.h        |  57 -
>>  16 files changed, 743 insertions(+), 104 deletions(-)
>
> I've looked into this series, and IMHO the change is too complex. With
> this series, when recovery is disabled and there are left nodes, sheep can
> succeed in a write operation even if the data is not fully replicated.
> But, if we allow it, it is difficult to prevent VMs from reading old data.

Actually this series put a lot of effort into that. We want to upgrade
sheepdog without impacting the online VMs, so we need to allow all VMs to
do write operations while recovery is disabled (this is important for a big
cluster; we can't assume users would stop their work during this time). We
also assume that this window is short - we should upgrade sheepdog as soon
as possible (< 5 minutes). This patch is implemented based on the
assumptions above.

Maybe it's difficult, but the algorithm is clear - just three steps (from
the description in the 9th patch's commit log):

1) If a sheep joins back to the cluster, some objects may have been deleted
   after this sheep left, and such objects stay in its working directory.
   After recovery starts, this sheep will send its object list to the other
   sheep, so after fetching all object lists from the cluster, each sheep
   should screen out these deleted objects.
2) A sheep which has left and joined back should drop the old versions of
   objects and recover the new ones from the other sheep.
3) Objects which have been updated should not be recovered from a
   joined-back sheep.

> I'd suggest allowing epoch increment even when recovery is disabled. If
> recovery work recovers only rw->prio_oids and delays the recovery of
> rw->oids, I think we can get a similar benefit in a much simpler way:
> http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html

In fact, I have thought about this method, but we would face nearly the
same problem: after a sheep joins back, it should know which objects are
dirty, and should do the cleanup work (because old versions of objects stay
in its working directory). This method does not seem to save steps, but
will do extra recovery work.

> Thanks,
> Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/20/2012 11:34 PM, Yunkai Zhang wrote:
> On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka
> morita.kazut...@lab.ntt.co.jp wrote:
>> [series description and diffstat snipped]
>>
>> I've looked into this series, and IMHO the change is too complex. With
>> this series, when recovery is disabled and there are left nodes, sheep
>> can succeed in a write operation even if the data is not fully
>> replicated. But, if we allow it, it is difficult to prevent VMs from
>> reading old data.
>
> Actually this series put a lot of effort into that. We want to upgrade
> sheepdog without impacting the online VMs, so we need to allow all VMs to
> do write operations while recovery is disabled (this is important for a
> big cluster; we can't assume users would stop their work during this
> time). We also assume that this window is short - we should upgrade
> sheepdog as soon as possible (< 5 minutes).

Upgrading the cluster without stopping service is a nice feature, but I'm
afraid that in the near future Sheepdog won't meet this expectation, due to
fast-moving development which is likely to break the inter-sheep
assumptions. Before we have this feature, we should at least do the
following things before claiming to be capable of online upgrading:

1) inter-sheep protocol compatibility check logic
2) a relatively stable feature set and internal physical state (such as the
   config file)

That is, it is too early to talk about online upgrading for now.

> This patch is implemented based on the assumptions above. Maybe it's
> difficult, but the algorithm is clear - just three steps (from the
> description in the 9th patch's commit log):
>
> [...]
>
>> I'd suggest allowing epoch increment even when recovery is disabled. If
>> recovery work recovers only rw->prio_oids and delays the recovery of
>> rw->oids, I think we can get a similar benefit in a much simpler way:
>> http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html
>
> In fact, I have thought about this method, but we would face nearly the
> same problem: after a sheep joins back, it should know which objects are
> dirty, and should do the cleanup work (because old versions of objects
> stay in its working directory). This method does not seem to save steps,
> but will do extra recovery work.

IMHO, the suggested method won't cause different-version objects, because
we actually increment the epoch, and we do the same as now for the objects
in rw->prio_oids, which are being requested. So for this kind of object we
can still use the current code to handle it. For those objects not being
requested at all (which might account for the majority of the objects in a
short time window), we can do the trick: delay recovering them as much as
possible, so that subsequent join
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Mon, Aug 20, 2012 at 11:34:10PM +0800, Yunkai Zhang wrote:
>> sheep can succeed in a write operation even if the data is not fully
>> replicated. But, if we allow it, it is difficult to prevent VMs from
>> reading old data.
>
> Actually this series put a lot of effort into that. We want to upgrade
> sheepdog without impacting the online VMs, so we need to allow all VMs to
> do write operations while recovery is disabled (this is important for a
> big cluster; we can't assume users would stop their work during this
> time). We also assume that this window is short - we should upgrade
> sheepdog as soon as possible (< 5 minutes).

FYI, I've been looking into this issue (but not this series yet) a bit
lately, and came to the conclusion that the only way to properly solve it
is indeed to reduce redundancy. One way to make this formal is to have a
minimum and a normal redundancy level, and to let writes succeed as long as
we meet the minimum level, not necessarily the full one.

Another thing that sprang to mind is that instead of the formal recovery
enable/disable, we should simply always delay recovery, that is, only do
recovery every N seconds if changes happened. Especially in the cases of
whole racks going up/down, or upgrades, that dramatically reduces the
number of epochs required, and thus reduces the recovery overhead.

I didn't actually have time to look into the implementation implications of
this yet; these are just high-level thoughts.
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
> Another thing that sprang to mind is that instead of the formal recovery
> enable/disable, we should simply always delay recovery, that is, only do
> recovery every N seconds if changes happened. Especially in the cases of
> whole racks going up/down, or upgrades, that dramatically reduces the
> number of epochs required, and thus reduces the recovery overhead.
>
> I didn't actually have time to look into the implementation implications
> of this yet; these are just high-level thoughts.

I think negatively of delaying recovery all the time. It is useful to delay
recovery in some time window for maintenance or operational purposes, so I
think the idea of delaying recovery only in some manually controlled window
is useful; but if we extend this to the whole running time, it will bring
the cluster into a less safe (if not dangerous) state at any point. (We
only upgrade the cluster or maintain individual nodes at certain times, not
all the time, no?) Trading away data reliability is always the last resort
for a distributed system, which emphasizes data reliability compared to a
single data instance on a local disk.

-- 
thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Mon, 20 Aug 2012 23:34:10 +0800, Yunkai Zhang wrote:
> In fact, I have thought about this method, but we would face nearly the
> same problem: after a sheep joins back, it should know which objects are
> dirty, and should do the cleanup work (because old versions of objects
> stay in its working directory). This method does not seem to save steps,
> but will do extra recovery work.

Can you give me a concrete example? I created a really naive patch to
disable object recovery with my idea:

==
diff --git a/sheep/recovery.c b/sheep/recovery.c
index 5164aa7..8bf032f 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -35,6 +35,7 @@ struct recovery_work {
 	uint64_t *oids;
 	uint64_t *prio_oids;
 	int nr_prio_oids;
+	int nr_scheduled_oids;
 
 	struct vnode_info *old_vinfo;
 	struct vnode_info *cur_vinfo;
@@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
 			oid);
 		return;
 	}
-	/* The oid is currently being recovered */
-	if (rw->oids[rw->done] == oid)
-		return;
 	rw->nr_prio_oids++;
 	rw->prio_oids = xrealloc(rw->prio_oids,
 				 rw->nr_prio_oids * sizeof(uint64_t));
@@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
 done:
 	free(rw->prio_oids);
 	rw->prio_oids = NULL;
+	rw->nr_scheduled_oids += rw->nr_prio_oids;
 	rw->nr_prio_oids = 0;
 }
 
+static struct timer recovery_timer;
+
+static void recover_next_object(void *arg)
+{
+	struct recovery_work *rw = arg;
+
+	if (rw->nr_prio_oids)
+		finish_schedule_oids(rw);
+
+	if (rw->done < rw->nr_scheduled_oids) {
+		/* Try recover next object */
+		queue_work(sys->recovery_wqueue, &rw->work);
+		return;
+	}
+
+	/* There is no object to be recovered.  Try again later */
+	recovery_timer.callback = recover_next_object;
+	recovery_timer.data = rw;
+	add_timer(&recovery_timer, 1); /* FIXME */
+}
+
 static void recover_object_main(struct work *work)
 {
 	struct recovery_work *rw = container_of(work, struct recovery_work,
@@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
 	resume_wait_obj_requests(rw->oids[rw->done++]);
 
 	if (rw->done < rw->count) {
-		if (rw->nr_prio_oids)
-			finish_schedule_oids(rw);
-
-		/* Try recover next object */
-		queue_work(sys->recovery_wqueue, &rw->work);
+		recover_next_object(rw);
 		return;
 	}
 
@@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
 	resume_wait_recovery_requests();
 	rw->work.fn = recover_object_work;
 	rw->work.done = recover_object_main;
 
-	queue_work(sys->recovery_wqueue, &rw->work);
+	recover_next_object(rw);
 	return;
 }
==

I ran the following test, and object recovery was disabled correctly for
both the join and the leave case.

==
#!/bin/bash

for i in 0 1 2 3; do
    ./sheep/sheep /store/$i -z $i -p 700$i -c local
done
sleep 1
./collie/collie cluster format
./collie/collie vdi create test 4G

echo "* objects will be created on node[0-2] *"
md5sum /store/[0,1,2,3]/obj/807c2b25

pkill -f "./sheep/sheep /store/1"
sleep 3
echo "* recovery doesn't start until the object is touched *"
md5sum /store/[0,2,3]/obj/807c2b25
./collie/collie vdi snapshot test    # invoke recovery of the vdi object
echo "* the object is recovered *"
md5sum /store/[0,2,3]/obj/807c2b25

./sheep/sheep /store/1 -z 1 -p 7001 -c local
sleep 3
echo "* recovery doesn't start until the object is touched *"
md5sum /store/[0,1,2,3]/obj/807c2b25
./collie/collie vdi list -p 7001    # invoke recovery of the vdi object
echo "* the object is recovered *"
md5sum /store/[0,1,2,3]/obj/807c2b25
==

[Output]

using backend farm store
* objects will be created on node[0-2] *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
* recovery doesn't start until the object is touched *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
* the object is recovered *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
* recovery doesn't start until the object is touched *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
  Name        Id    Size    Used  Shared    Creation time   VDI id  Tag
s test         1  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49   7c2b25
  test         2  4.0 GB  0.0 MB  0.0 MB 2012-08-21 02:49
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 00:29:50 +0800, Liu Yuan wrote:
> On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
>> Another thing that sprang to mind is that instead of the formal recovery
>> enable/disable, we should simply always delay recovery, that is, only do
>> recovery every N seconds if changes happened. Especially in the cases of
>> whole racks going up/down, or upgrades, that dramatically reduces the
>> number of epochs required, and thus reduces the recovery overhead.
>>
>> I didn't actually have time to look into the implementation implications
>> of this yet; these are just high-level thoughts.
>
> I think negatively of delaying recovery all the time. It is useful to
> delay recovery in some time window for maintenance or operational
> purposes, so I think the idea of delaying recovery only in some manually
> controlled window is useful; but if we extend this to the whole running
> time, it will bring the cluster into a less safe (if not dangerous) state
> at any point. (We only upgrade the cluster or maintain individual nodes
> at certain times, not all the time, no?) Trading away data reliability is
> always the last resort for a distributed system, which emphasizes data
> reliability compared to a single data instance on a local disk.

I think delaying recovery for a few seconds, always, is useful for many
users. Under heavy network load, sheep can wrongly detect node failure, and
node membership can change frequently. Delaying recovery for a short time
makes Sheepdog tolerant of such situations.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 10:46 AM, Liu Yuan namei.u...@gmail.com wrote:
> On 08/21/2012 02:29 AM, MORITA Kazutaka wrote:
>> I think delaying recovery for a few seconds, always, is useful for many
>> users. Under heavy network load, sheep can wrongly detect node failure,
>> and node membership can change frequently. Delaying recovery for a short
>> time makes Sheepdog tolerant of such situations.
>
> I think your example is very vague; what kind of driver do you use? Sheep
> itself won't sense membership; it relies on the cluster drivers to
> maintain membership. Could you detail how it happens exactly in a real
> case? If you are talking about the network partition problem, I don't
> think delaying recovery will help solve it.
>
> We have met network partitions when we used the corosync driver; with the
> zookeeper driver we haven't met one yet. (I guess we won't meet it with
> zookeeper, as it has a central membership control.)
>
> Suppose we have 6 nodes in a cluster, A,B,C,D,E,F, with one copy and
> epoch = 1. At time t1 the network gets partitioned, and three partitions
> show up: c1(A,B,C), c2(D,E), c3(F). So the epochs of these three
> partitions are respectively epoch(c1)=4, epoch(c2)=5, epoch(c3)=6, and
> all 3 partitions proceed to recover and apply updates to their local
> objects.
>
> In your example above, suppose these 3 partitions automatically merge
> back into one partition. This means, after merging:
>
> 1) epoch(c1)=7, epoch(c2)=9, epoch(c3)=11
> 2) there is no code to handle different versions of objects, while every
>    node thinks its own local version is correct.
>
> So I think we have to handle the epoch mismatch and object multi-version
> problems before evaluating delayed recovery for network partitions.
>
> If you are not talking about the network partition problem, I think we
> can only meet the stop/restart node case for manual maintenance, where I
> think manual recovery could really be helpful.

Delayed recovery couldn't solve the network partition problem, and as you
mentioned above, if sheep breaks the internal protocol, delayed recovery
cannot help sheep's upgrade either. But if sheep doesn't break the internal
protocol - for example, we just fix a memory leak, add some useful logging,
or fix a corner case - it is very useful for us.

> Thanks,
> Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka morita.kazut...@lab.ntt.co.jp wrote: At Mon, 20 Aug 2012 23:34:10 +0800, Yunkai Zhang wrote:

In fact, I have thought about this method, but we would face nearly the same problem: after a sheep joins back, it should know which objects are dirty, and should do the cleanup work (because old-version objects stay in its working directory). This method doesn't seem to save any steps, and will do extra recovery work.

Can you give me a concrete example? I created a really naive patch to disable object recovery with my idea:

==
diff --git a/sheep/recovery.c b/sheep/recovery.c
index 5164aa7..8bf032f 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -35,6 +35,7 @@ struct recovery_work {
 	uint64_t *oids;
 	uint64_t *prio_oids;
 	int nr_prio_oids;
+	int nr_scheduled_oids;
 
 	struct vnode_info *old_vinfo;
 	struct vnode_info *cur_vinfo;
@@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
 			oid);
 		return;
 	}
-	/* The oid is currently being recovered */
-	if (rw->oids[rw->done] == oid)
-		return;
 	rw->nr_prio_oids++;
 	rw->prio_oids = xrealloc(rw->prio_oids,
 				 rw->nr_prio_oids * sizeof(uint64_t));
@@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
 done:
 	free(rw->prio_oids);
 	rw->prio_oids = NULL;
+	rw->nr_scheduled_oids += rw->nr_prio_oids;
 	rw->nr_prio_oids = 0;
 }
 
+static struct timer recovery_timer;
+
+static void recover_next_object(void *arg)
+{
+	struct recovery_work *rw = arg;
+
+	if (rw->nr_prio_oids)
+		finish_schedule_oids(rw);
+
+	if (rw->done < rw->nr_scheduled_oids) {
+		/* Try recover next object */
+		queue_work(sys->recovery_wqueue, &rw->work);
+		return;
+	}
+
+	/* There is no object to be recovered.  Try again later */
+	recovery_timer.callback = recover_next_object;
+	recovery_timer.data = rw;
+	add_timer(&recovery_timer, 1); /* FIXME */
+}
+
 static void recover_object_main(struct work *work)
 {
 	struct recovery_work *rw = container_of(work, struct recovery_work,
@@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
 	resume_wait_obj_requests(rw->oids[rw->done++]);
 
 	if (rw->done < rw->count) {
-		if (rw->nr_prio_oids)
-			finish_schedule_oids(rw);
-
-		/* Try recover next object */
-		queue_work(sys->recovery_wqueue, &rw->work);
+		recover_next_object(rw);
 		return;
 	}
 
@@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
 	resume_wait_recovery_requests();
 	rw->work.fn = recover_object_work;
 	rw->work.done = recover_object_main;
-	queue_work(sys->recovery_wqueue, &rw->work);
+	recover_next_object(rw);
 	return;
 }
==

I ran the following test, and object recovery was disabled correctly for both the join and leave cases.

==
#!/bin/bash

for i in 0 1 2 3; do
    ./sheep/sheep /store/$i -z $i -p 700$i -c local
done
sleep 1
./collie/collie cluster format
./collie/collie vdi create test 4G
echo "* objects will be created on node[0-2] *"
md5sum /store/[0,1,2,3]/obj/807c2b25

pkill -f "./sheep/sheep /store/1"
sleep 3
echo "* recovery doesn't start until the object is touched *"
md5sum /store/[0,2,3]/obj/807c2b25
./collie/collie vdi snapshot test # invoke recovery of the vdi object
echo "* the object is recovered *"
md5sum /store/[0,2,3]/obj/807c2b25

./sheep/sheep /store/1 -z 1 -p 7001 -c local
sleep 3
echo "* recovery doesn't start until the object is touched *"
md5sum /store/[0,1,2,3]/obj/807c2b25
./collie/collie vdi list -p 7001 # invoke recovery of the vdi object
echo "* the object is recovered *"
md5sum /store/[0,1,2,3]/obj/807c2b25
==

[Output]
using backend farm store
* objects will be created on node[0-2] *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
* recovery doesn't start until the object is touched *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
* the object is recovered *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
* recovery doesn't start until the object is touched *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 10:46:19 +0800, Liu Yuan wrote:

So I think we have to handle the epoch mismatch and object multi-version problems before evaluating delayed recovery for network partition.

Yes, delayed recovery doesn't solve my example at all unless sheepdog handles network partition. I didn't intend to say that always delaying recovery is necessary now, but it is worth considering in the future.

Thanks,
Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 11:21 AM, MORITA Kazutaka wrote: At Tue, 21 Aug 2012 10:46:19 +0800, Liu Yuan wrote:

So I think we have to handle the epoch mismatch and object multi-version problems before evaluating delayed recovery for network partition.

Yes, delayed recovery doesn't solve my example at all unless sheepdog handles network partition. I didn't intend to say that always delaying recovery is necessary now, but it is worth considering in the future.

Well, with a centralized membership control driver such as zookeeper or accord (I'd like to have a built-in accord driver, possibly simplified and tailored for sheep, in the sheepdog repo for better development), I think the network partition problem can virtually be gone with well-written software that collaborates with sheep to minimize the chance of NP happening, rather than trying to solve it after it happens.

Thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 12:07 AM, Christoph Hellwig wrote:

Another thing that sprang to mind is that instead of the formal recovery enable/disable we should simply always delay recovery, that is, only do recovery after every N seconds if changes happened. Especially in the cases of whole racks going up/down, or upgrades, that dramatically reduces the number of epochs required, and thus reduces the recovery overhead. I didn't actually have time to look into the implementation implications of this yet; these are just high-level thoughts.

I think negatively of delaying recovery all the time. It is useful to delay recovery in some time window for maintenance or operational purposes, so I think the idea of delaying recovery only manually, in some controlled window, is useful. But if we extend this to all the running time, it will bring the cluster to a less safe (if not dangerous) state at any point. (We only upgrade the cluster or maintain individual nodes at certain times, not all the time, no?)

I still think that automatic recovery without delay is the wrong approach. At least for small clusters you simply want to avoid unnecessary traffic. Such recovery can produce massive traffic on the network (several TB of data), and can make the whole system unusable because of that. I want to control when recovery starts.

- Dietmar
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 04:34:05 +0000, Dietmar Maurer wrote: On 08/21/2012 12:07 AM, Christoph Hellwig wrote:

Another thing that sprang to mind is that instead of the formal recovery enable/disable we should simply always delay recovery, that is, only do recovery after every N seconds if changes happened. Especially in the cases of whole racks going up/down, or upgrades, that dramatically reduces the number of epochs required, and thus reduces the recovery overhead. I didn't actually have time to look into the implementation implications of this yet; these are just high-level thoughts.

I think negatively of delaying recovery all the time. It is useful to delay recovery in some time window for maintenance or operational purposes, so I think the idea of delaying recovery only manually, in some controlled window, is useful. But if we extend this to all the running time, it will bring the cluster to a less safe (if not dangerous) state at any point. (We only upgrade the cluster or maintain individual nodes at certain times, not all the time, no?)

I still think that automatic recovery without delay is the wrong approach. At least for small clusters you simply want to avoid unnecessary traffic. Such recovery can produce massive traffic on the network (several TB of data), and can make the whole system unusable because of that. I want to control when recovery starts.

Disabling automatic recovery by default doesn't work for you? You can control the time to start recovery with 'collie cluster recover enable'.

Thanks,
Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 12:34 PM, Dietmar Maurer wrote:

I still think that automatic recovery without delay is the wrong approach. At least for small clusters you simply want to avoid unnecessary traffic. Such recovery can produce massive traffic on the network (several TB of data), and can make the whole system unusable because of that. I want to control when recovery starts.

Your goal of avoiding unnecessary object transfer can actually be built on top of the manual recovery mechanism. If we implement manual recovery, we can add a timeout option to it (very easy); thus if someone wants to always delay recovery, he can simply disable automatic recovery and specify a timeout for it. In this way, we can have several policies to accommodate different purposes.

Thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
I think your example is very vague. What kind of driver do you use? Sheep itself doesn't sense membership; it relies on cluster drivers to maintain membership. Could you detail how it happens exactly in a real case?

Membership change can happen for many reasons. It can happen if something is wrong on the switch (or if some admin reconfigures the switch), with a damaged network cable, a bug in the bonding driver, a damaged network card, or simply a power failure on a node, which reconnects after power is restored.

In the literature, this problem is also known as the 'babbling idiot' problem (real-time people use that term). A single node can make the whole system unusable.

- Dietmar
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
Disabling automatic recovery by default doesn't work for you? You can control the time to start recovery with 'collie cluster recover enable'.

It just looks strange to me to design the system for immediate/automatic recovery and make 'disabling automatic recovery' an option. I would include the node state in the epoch. But maybe that is only an implementation detail.

- Dietmar
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/09/2012 04:43 PM, Yunkai Zhang wrote:

- fix a typo
- when an object is updated, delete its old version
- reset cluster recovery state in finish_recovery()

You should summarize what your patch set does in the introductory cover letter. I have no idea what your INTRODUCE means. Please complete your title too. What is the use case for delaying the LEAVE event (why do we need such complexity)? How is it used? These are the most important arguments you should include in the cover letter to defend your patch set and let people think it is useful.

Thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Mon, Aug 13, 2012 at 10:29 AM, Liu Yuan namei.u...@gmail.com wrote: On 08/09/2012 04:43 PM, Yunkai Zhang wrote:

- fix a typo
- when an object is updated, delete its old version
- reset cluster recovery state in finish_recovery()

You should summarize what your patch set does in the introductory cover letter. I have no idea what your INTRODUCE means. Please complete your title too. What is the use case for delaying the LEAVE event (why do we need such complexity)? How

I suppose the reviewers already know my previous patchset (delaying the JOIN event), because we have discussed the benefit of delaying the LEAVE event in detail with Kazum and other users. I just don't want to type it again; you can learn the background from this mail thread: http://lists.wpkg.org/pipermail/sheepdog/2012-July/005684.html

If you insist that I rewrite this cover letter, I can update it in the next version.

is it used? These are the most important arguments you should include in the cover letter to defend your patch set and let people think it is useful.

Thanks,
Yuan
-- 
Yunkai Zhang
Work at Taobao