Re: [sheepdog] [PATCH v4 01/10] sheep: use struct vdi_iocb to simplify the vdi_create api
On 2012年08月20日 12:53, MORITA Kazutaka wrote:
> At Thu, 9 Aug 2012 13:27:36 +0800, levin li wrote:
>> +struct vdi_iocb {
>> +	char *data;
>
> Should be char *name?

Yes

>> +	uint32_t data_len;
>> +	uint64_t size;
>> +	uint32_t base_vid;
>> +	int is_snapshot;
>
> Should be bool is_snapshot?

Here is_snapshot is 'snapid', I should rename it to 'snapid'.

thanks,
levin

>> +	int nr_copies;
>> +};
>> +
>
> Can we use this structure for lookup_vdi() and del_vdi(), too?
>
> Thanks,
>
> Kazutaka
--
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH v0, RFC] sheep: writeback cache semantics in backend store
At Mon, 20 Aug 2012 13:51:08 +0800, Liu Yuan wrote:
> I am suspicious that this approach is really useful:
>
> 1. Suppose we can only call sync() to flush the page cache on each node. With a cluster that runs hundreds of images, sync() requests will be issued almost every second; this kind of request storm renders the idea useless compared to the O_SYNC open flag.
>
> 2. Even if we work with syncfs(), the benefit will be offset by the complexity of finding all the locations of the specified VDI and sending requests one by one.

I guess this approach will give a benefit only when the numbers of nodes and VMs are small, but it's okay if it's not turned on by default. Anyway, I'd like to see more benchmark results (e.g. running dbench on several VMs simultaneously) before accepting this patch.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH v4 06/10] sheep: fetch vdi copy list after sheep joins the cluster
At Mon, 20 Aug 2012 15:41:03 +0800, levin li wrote:
> On 2012年08月20日 13:15, MORITA Kazutaka wrote:
>> At Thu, 9 Aug 2012 13:27:41 +0800, levin li wrote:
>>> From: levin li <xingke@taobao.com>
>>>
>>> The newly joined node doesn't have the VDI copy list, or has an incomplete VDI copy list, so we need to fetch the copy list data from other nodes.
>>
>> It makes the code complex to store the copy list in the local store, because it's difficult to keep the data consistent. I'd suggest gathering both the vid and the copy list with SD_OP_READ_VDI requests at the same time. Then we can remove this patch and simplify the 5th patch a lot.
>>
>> Thanks,
>>
>> Kazutaka
>
> How about this: we don't store the VDI copy list locally, but read it from the local VDI inode object when a node starts up, and in update_cluster_info() we collect the entire VDI copy list from the other nodes, just as get_vdi_bitmap() does,

That's just what I meant.

> but a little differently from get_vdi_bitmap(): we cannot make it run asynchronously if the node needs recovery, because we need the VDI copy list during recovery. The solution is that in prepare_object_list() we make sheep sleep until it finds that get_vdi_copy_list() has finished.

Looks good. However, I'm wondering if it's much easier to save the number of copies as an xattr of each object rather than managing the VDI copy list. What do you think about it?

Thanks,

Kazutaka
Re: [sheepdog] [PATCH v4 06/10] sheep: fetch vdi copy list after sheep joins the cluster
On 2012年08月20日 16:03, MORITA Kazutaka wrote:
> At Mon, 20 Aug 2012 15:41:03 +0800, levin li wrote:
>> On 2012年08月20日 13:15, MORITA Kazutaka wrote:
>>> At Thu, 9 Aug 2012 13:27:41 +0800, levin li wrote:
>>>> From: levin li <xingke@taobao.com>
>>>>
>>>> The newly joined node doesn't have the VDI copy list, or has an incomplete VDI copy list, so we need to fetch the copy list data from other nodes.
>>>
>>> It makes the code complex to store the copy list in the local store, because it's difficult to keep the data consistent. I'd suggest gathering both the vid and the copy list with SD_OP_READ_VDI requests at the same time. Then we can remove this patch and simplify the 5th patch a lot.
>>
>> How about this: we don't store the VDI copy list locally, but read it from the local VDI inode object when a node starts up, and in update_cluster_info() we collect the entire VDI copy list from the other nodes, just as get_vdi_bitmap() does,
>
> That's just what I meant.
>
>> but a little differently from get_vdi_bitmap(): we cannot make it run asynchronously if the node needs recovery, because we need the VDI copy list during recovery. The solution is that in prepare_object_list() we make sheep sleep until it finds that get_vdi_copy_list() has finished.
>
> Looks good. However, I'm wondering if it's much easier to save the number of copies as an xattr of each object rather than managing the VDI copy list. What do you think about it?
>
> Thanks,
>
> Kazutaka

Saving nr_copies for each object as an xattr seems more complicated: when an object is migrated from one node to another, the xattr info of that object is lost. Moreover, if we try to read an object on a remote node, we cannot specify the copy number in read_object(). I think a copy list may be the simplest way to solve this problem; what do you think about it?

thanks,
levin
Re: [sheepdog] [PATCH v4 06/10] sheep: fetch vdi copy list after sheep joins the cluster
At Mon, 20 Aug 2012 16:26:06 +0800, levin li wrote:
> Saving nr_copies for each object as an xattr seems more complicated: when an object is migrated from one node to another, the xattr info of that object is lost. Moreover, if we try to read an object on a remote node, we cannot specify the copy number in read_object(). I think a copy list may be the simplest way to solve this problem; what do you think about it?

Ah, yes, we cannot get the nr_copies of vdi objects with the xattr approach. I agree with you.

Thanks,

Kazutaka
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp> wrote:
> At Thu, 9 Aug 2012 16:43:38 +0800, Yunkai Zhang wrote:
>> From: Yunkai Zhang <qiushu@taobao.com>
>>
>> V2:
>> - fix a typo
>> - when an object is updated, delete its old version
>> - reset cluster recovery state in finish_recovery()
>>
>> Yunkai Zhang (11):
>>   sheep: enable variale-length of join_message in response of join event
>>   sheep: share joining nodes with newly added sheep
>>   sheep: delay to process recovery caused by LEAVE event just like JOIN event
>>   sheep: don't cleanup working directory when sheep joined back
>>   sheep: read objects only from live nodes
>>   sheep: write objects only on live nodes
>>   sheep: mark dirty object that belongs to the leaving nodes
>>   sheep: send dirty object list to each sheep when cluster do recovery
>>   sheep: do recovery with dirty object list
>>   collie: update 'collie cluster recover info' commands
>>   collie: update doc about 'collie cluster recover disable'
>>
>>  collie/cluster.c          |  46 ---
>>  include/internal_proto.h  |  32 ++--
>>  include/sheep.h           |  23 ++
>>  man/collie.8              |   2 +-
>>  sheep/cluster.h           |  29 +--
>>  sheep/cluster/accord.c    |   2 +-
>>  sheep/cluster/corosync.c  |   9 ++-
>>  sheep/cluster/local.c     |   2 +-
>>  sheep/cluster/zookeeper.c |   2 +-
>>  sheep/farm/trunk.c        |   2 +-
>>  sheep/gateway.c           |  39 -
>>  sheep/group.c             | 202 +-
>>  sheep/object_list_cache.c | 182 +++--
>>  sheep/ops.c               |  85 ---
>>  sheep/recovery.c          | 133 +++---
>>  sheep/sheep_priv.h        |  57 -
>>  16 files changed, 743 insertions(+), 104 deletions(-)
>
> I've looked into this series, and IMHO the change is too complex. With this series, when recovery is disabled and there are left nodes, sheep can succeed in a write operation even if the data is not fully replicated. But if we allow it, it is difficult to prevent VMs from reading old data.

Actually this series put a lot of effort into it.
We want to upgrade sheepdog without impacting online VMs, so we need to allow all VMs to do write operations while recovery is disabled (this is important for a big cluster; we can't assume users would stop their work during this time). We also assume that this window is short, as we should finish upgrading sheepdog as soon as possible (< 5 minutes). This patch is implemented based on the assumptions above.

Maybe it's difficult, but its algorithm is clear, just three steps (from the description in the 9th patch's commit log):

1) If a sheep joined back to the cluster, but some objects have been deleted after this sheep left, such objects stay in its working directory. After recovery starts, this sheep will send its object list to the other sheep, so after fetching all object lists from the cluster, each sheep should screen out these deleted objects.

2) A sheep which has left and joined back should drop the old-version objects and recover the new ones from other sheep.

3) Objects which have been updated should not be recovered from a joined-back sheep.

> I'd suggest allowing epoch increment even when recovery is disabled. If recovery work recovers only rw->prio_oids and delays the recovery of rw->oids, I think we can get a similar benefit in a much simpler way:
>
> http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html

In fact, I have thought about this method, but we would face nearly the same problem: after a sheep joins back, it should know which objects are dirty, and should do the cleanup work (because old-version objects stay in its working directory). This method doesn't seem to save the steps, but will do extra recovery work.

> Thanks,
>
> Kazutaka

--
Yunkai Zhang
Work at Taobao
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/20/2012 11:34 PM, Yunkai Zhang wrote:
> On Mon, Aug 20, 2012 at 9:00 PM, MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp> wrote:
>> At Thu, 9 Aug 2012 16:43:38 +0800, Yunkai Zhang wrote:
>>> From: Yunkai Zhang <qiushu@taobao.com>
>>>
>>> V2:
>>> - fix a typo
>>> - when an object is updated, delete its old version
>>> - reset cluster recovery state in finish_recovery()
>>>
>>> Yunkai Zhang (11):
>>>   sheep: enable variale-length of join_message in response of join event
>>>   sheep: share joining nodes with newly added sheep
>>>   sheep: delay to process recovery caused by LEAVE event just like JOIN event
>>>   sheep: don't cleanup working directory when sheep joined back
>>>   sheep: read objects only from live nodes
>>>   sheep: write objects only on live nodes
>>>   sheep: mark dirty object that belongs to the leaving nodes
>>>   sheep: send dirty object list to each sheep when cluster do recovery
>>>   sheep: do recovery with dirty object list
>>>   collie: update 'collie cluster recover info' commands
>>>   collie: update doc about 'collie cluster recover disable'
>>>
>>>  collie/cluster.c          |  46 ---
>>>  include/internal_proto.h  |  32 ++--
>>>  include/sheep.h           |  23 ++
>>>  man/collie.8              |   2 +-
>>>  sheep/cluster.h           |  29 +--
>>>  sheep/cluster/accord.c    |   2 +-
>>>  sheep/cluster/corosync.c  |   9 ++-
>>>  sheep/cluster/local.c     |   2 +-
>>>  sheep/cluster/zookeeper.c |   2 +-
>>>  sheep/farm/trunk.c        |   2 +-
>>>  sheep/gateway.c           |  39 -
>>>  sheep/group.c             | 202 +-
>>>  sheep/object_list_cache.c | 182 +++--
>>>  sheep/ops.c               |  85 ---
>>>  sheep/recovery.c          | 133 +++---
>>>  sheep/sheep_priv.h        |  57 -
>>>  16 files changed, 743 insertions(+), 104 deletions(-)
>>
>> I've looked into this series, and IMHO the change is too complex. With this series, when recovery is disabled and there are left nodes, sheep can succeed in a write operation even if the data is not fully replicated. But if we allow it, it is difficult to prevent VMs from reading old data.
>
> Actually this series put a lot of effort into it.
> We want to upgrade sheepdog while not impacting all online VMs, so we need to allow all VMs to do write operations when recovery is disabled (it is important for a big cluster; we can't assume users would stop their work during this time). And we also assume that this time is short; we should upgrade sheepdog as soon as possible (< 5 minutes).

Upgrading the cluster without stopping service is a nice feature, but I'm afraid that in the near future Sheepdog won't meet this expectation, due to fast-growing development which is likely to break the inter-sheep assumptions. Before we have this feature, we should at least do the following things before claiming to be capable of online upgrading:

1) inter-sheep protocol compatibility check logic
2) a relatively stable feature set and internal physical state (such as the config file)

That is, it is too early to talk about online upgrading for now.

> This patch is implemented based on those assumptions above. And maybe it's difficult, but its algorithm is clear, just three steps (from the description in the 9th patch's commit log):
>
> 1) If a sheep joined back to the cluster, but some objects have been deleted after this sheep left, such objects stay in its working directory. After recovery starts, this sheep will send its object list to the other sheep, so after fetching all object lists from the cluster, each sheep should screen out these deleted objects.
>
> 2) A sheep which has left and joined back should drop the old-version objects and recover the new ones from other sheep.
>
> 3) Objects which have been updated should not be recovered from a joined-back sheep.

>> I'd suggest allowing epoch increment even when recovery is disabled.
>> If recovery work recovers only rw->prio_oids and delays the recovery of rw->oids, I think we can get a similar benefit in a much simpler way:
>>
>> http://www.mail-archive.com/sheepdog@lists.wpkg.org/msg05439.html
>
> In fact, I have thought about this method, but we would face nearly the same problem: after a sheep joins back, it should know which objects are dirty, and should do the cleanup work (because old-version objects stay in its working directory). This method doesn't seem to save the steps, but will do extra recovery work.

IMHO, I think the suggested method won't cause different-version objects, because we actually increment the epoch, and we do the same as now for the objects in rw->prio_oids, which are being requested. So for this kind of object, we can still use the current code to handle it. For those objects not being requested at all (which might account for the majority of the objects in a short time window), we can do the trick: delay recovering them as much as possible, so that subsequent join
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Mon, Aug 20, 2012 at 11:34:10PM +0800, Yunkai Zhang wrote:
>> sheep can succeed in a write operation even if the data is not fully replicated. But if we allow it, it is difficult to prevent VMs from reading old data.
>
> Actually this series put a lot of effort into it. We want to upgrade sheepdog without impacting online VMs, so we need to allow all VMs to do write operations when recovery is disabled (it is important for a big cluster; we can't assume users would stop their work during this time). And we also assume that this time is short; we should upgrade sheepdog as soon as possible (< 5 minutes).

FYI, I've been looking into this issue (but not this series yet) a bit lately and came to the conclusion that the only way to properly solve it is indeed to reduce redundancy. One way to make this formal is to have a minimum and a normal redundancy level and let writes succeed as long as we meet the minimum level, not necessarily the full one.

Another thing that sprang to mind is that instead of the formal recovery enable/disable, we should simply always delay recovery; that is, only do recovery every N seconds if changes happened. Especially in the cases of whole racks going up/down or upgrades, that dramatically reduces the number of epochs required, and thus reduces the recovery overhead.

I didn't actually have time to look into the implementation implications of this yet; these are just high-level thoughts.
Re: [sheepdog] [PATCH 1/2] collie: add self options to collie's command
On 08/20/2012 10:28 PM, Yunkai Zhang wrote:
> Now all collie commands share the same global collie_options; this will lead to option-name conflicts among commands if they use the same options but with different descriptions. By introducing self options for each command (if necessary) and making collie_options contain only the common part of all options, we can solve this issue.

I like this improvement, but 'self options' doesn't explain the idea well. This is a kind of namespace for each subcommand, so simply naming it as

+	struct sd_option *options;

in a structure is enough. And rework the comment and commit log to replace 'self option' with a more meaningful phrase, for e.g.:

  By moving the global options into an individual structure as a private member, we can solve this problem.

Also, with this patch we can then change all those upper-cased options into lower case, such as

  vdi create -P  ->  vdi create -p

for easier typing.

--
thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
> Another thing that sprang to mind is that instead of the formal recovery enable/disable, we should simply always delay recovery; that is, only do recovery every N seconds if changes happened. Especially in the cases of whole racks going up/down or upgrades, that dramatically reduces the number of epochs required, and thus reduces the recovery overhead. I didn't actually have time to look into the implementation implications of this yet; these are just high-level thoughts.

I think negatively of delaying recovery all the time. It is useful to delay recovery in some time window for maintenance or operational purposes, so I think the idea of delaying recovery manually in some controlled window is useful; but if we extend this to the whole running time, it will bring the cluster into a less safe (if not dangerous) state at any point. (We only upgrade the cluster or maintain individual nodes at certain times, not all the time, no?) Trading away data reliability is always the last resort for a distributed system, which emphasizes data reliability compared to a single data instance on a local disk.

--
thanks,
Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Mon, 20 Aug 2012 23:34:10 +0800, Yunkai Zhang wrote:
> In fact, I have thought about this method, but we would face nearly the same problem: after a sheep joins back, it should know which objects are dirty, and should do the cleanup work (because old-version objects stay in its working directory). This method doesn't seem to save the steps, but will do extra recovery work.

Can you give me a concrete example? I created a really naive patch to disable object recovery with my idea:

==
diff --git a/sheep/recovery.c b/sheep/recovery.c
index 5164aa7..8bf032f 100644
--- a/sheep/recovery.c
+++ b/sheep/recovery.c
@@ -35,6 +35,7 @@ struct recovery_work {
 	uint64_t *oids;
 	uint64_t *prio_oids;
 	int nr_prio_oids;
+	int nr_scheduled_oids;
 
 	struct vnode_info *old_vinfo;
 	struct vnode_info *cur_vinfo;
@@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
 			oid);
 		return;
 	}
-	/* The oid is currently being recovered */
-	if (rw->oids[rw->done] == oid)
-		return;
 	rw->nr_prio_oids++;
 	rw->prio_oids = xrealloc(rw->prio_oids,
 				 rw->nr_prio_oids * sizeof(uint64_t));
@@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
 done:
 	free(rw->prio_oids);
 	rw->prio_oids = NULL;
+	rw->nr_scheduled_oids += rw->nr_prio_oids;
 	rw->nr_prio_oids = 0;
 }
 
+static struct timer recovery_timer;
+
+static void recover_next_object(void *arg)
+{
+	struct recovery_work *rw = arg;
+
+	if (rw->nr_prio_oids)
+		finish_schedule_oids(rw);
+
+	if (rw->done < rw->nr_scheduled_oids) {
+		/* Try recover next object */
+		queue_work(sys->recovery_wqueue, &rw->work);
+		return;
+	}
+
+	/* There is no objects to be recovered.  Try again later */
+	recovery_timer.callback = recover_next_object;
+	recovery_timer.data = rw;
+	add_timer(&recovery_timer, 1); /* FIXME */
+}
+
 static void recover_object_main(struct work *work)
 {
 	struct recovery_work *rw = container_of(work, struct recovery_work,
@@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
 	resume_wait_obj_requests(rw->oids[rw->done++]);
 
 	if (rw->done < rw->count) {
-		if (rw->nr_prio_oids)
-			finish_schedule_oids(rw);
-
-		/* Try recover next object */
-		queue_work(sys->recovery_wqueue, &rw->work);
+		recover_next_object(rw);
 		return;
 	}
@@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
 	resume_wait_recovery_requests();
 	rw->work.fn = recover_object_work;
 	rw->work.done = recover_object_main;
-	queue_work(sys->recovery_wqueue, &rw->work);
+	recover_next_object(rw);
 	return;
 }
==

I ran the following test, and object recovery was disabled correctly for both the join and leave cases.

==
#!/bin/bash

for i in 0 1 2 3; do
    ./sheep/sheep /store/$i -z $i -p 700$i -c local
done
sleep 1
./collie/collie cluster format
./collie/collie vdi create test 4G

echo "* objects will be created on node[0-2] *"
md5sum /store/[0,1,2,3]/obj/807c2b25

pkill -f "./sheep/sheep /store/1"
sleep 3
echo "* recovery doesn't start until the object is touched *"
md5sum /store/[0,2,3]/obj/807c2b25

./collie/collie vdi snapshot test # invoke recovery of the vdi object
echo "* the object is recovered *"
md5sum /store/[0,2,3]/obj/807c2b25

./sheep/sheep /store/1 -z 1 -p 7001 -c local
sleep 3
echo "* recovery doesn't start until the object is touched *"
md5sum /store/[0,1,2,3]/obj/807c2b25

./collie/collie vdi list -p 7001 # invoke recovery of the vdi object
echo "* the object is recovered *"
md5sum /store/[0,1,2,3]/obj/807c2b25
==

[Output]
using backend farm store
* objects will be created on node[0-2] *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
* recovery doesn't start until the object is touched *
701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
* the object is recovered *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
* recovery doesn't start until the object is touched *
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
  Name   Id    Size    Used  Shared     Creation time  VDI id  Tag
s test    1  4.0 GB  0.0 MB  0.0 MB  2012-08-21 02:49  7c2b25
  test    2  4.0 GB  0.0 MB  0.0 MB  2012-08-21 02:49
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 00:29:50 +0800, Liu Yuan wrote:
> On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
>> Another thing that sprang to mind is that instead of the formal recovery enable/disable, we should simply always delay recovery; that is, only do recovery every N seconds if changes happened. Especially in the cases of whole racks going up/down or upgrades, that dramatically reduces the number of epochs required, and thus reduces the recovery overhead. I didn't actually have time to look into the implementation implications of this yet; these are just high-level thoughts.
>
> I think negatively of delaying recovery all the time. It is useful to delay recovery in some time window for maintenance or operational purposes, so I think the idea of delaying recovery manually in some controlled window is useful; but if we extend this to the whole running time, it will bring the cluster into a less safe (if not dangerous) state at any point. (We only upgrade the cluster or maintain individual nodes at certain times, not all the time, no?) Trading away data reliability is always the last resort for a distributed system, which emphasizes data reliability compared to a single data instance on a local disk.

I think always delaying recovery for a few seconds is useful for many users. Under heavy network load, sheep can wrongly detect node failure, and node membership can change frequently. Delaying recovery for a short time makes Sheepdog tolerant of such situations.

Thanks,

Kazutaka
[sheepdog] [PATCH v2] sheep: make config file compatible with the previous one
Signed-off-by: MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp>
---
Changes from v1:
 - remove 'version' from sheepdog_config

Even if we don't support a version check of the config in the next release, we should fix the compatibility issue at least.

 sheep/store.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/sheep/store.c b/sheep/store.c
index 542804a..fcbf32d 100644
--- a/sheep/store.c
+++ b/sheep/store.c
@@ -30,10 +30,11 @@
 
 struct sheepdog_config {
 	uint64_t ctime;
-	uint64_t space;
 	uint16_t flags;
 	uint8_t copies;
 	uint8_t store[STORE_LEN];
+	uint8_t __pad[5];
+	uint64_t space;
 };
 
 char *obj_path;
--
1.7.2.5
Re: [sheepdog] [PATCH v2] sheep: make config file compatible with the previous one
On 08/21/2012 02:37 AM, MORITA Kazutaka wrote:
> +	uint8_t __pad[5];
> +	uint64_t space;

What is __pad[5] for?

Thanks,

Yuan
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 10:46 AM, Liu Yuan <namei.u...@gmail.com> wrote:
> On 08/21/2012 02:29 AM, MORITA Kazutaka wrote:
>> I think always delaying recovery for a few seconds is useful for many users. Under heavy network load, sheep can wrongly detect node failure, and node membership can change frequently. Delaying recovery for a short time makes Sheepdog tolerant of such situations.
>
> I think your example is very vague. What kind of driver do you use? Sheep itself won't sense membership; it relies on cluster drivers to maintain membership. Could you detail how it happens exactly in a real case? If you are talking about the network partition problem, I don't think delayed recovery will help solve it. We have met network partition when we used the corosync driver; with the zookeeper driver, we haven't met it yet (I guess we won't meet it with zookeeper, as a central membership control).
>
> Suppose we have 6 nodes in a cluster, A, B, C, D, E, F, with one copy and epoch = 1. At time t1 we get a network partition, and three partitions show up: c1(A,B,C), c2(D,E), c3(F). So the epochs for these three partitions are respectively epoch(c1)=4, epoch(c2)=5, epoch(c3)=6, and all 3 partitions progress to recover and get updates to their local objects. Now suppose these 3 partitions automatically merge into one partition. This means, after merging:
>
> 1) epoch(c1)=7, epoch(c2)=9, epoch(c3)=11
> 2) there is no code to handle different-version objects, where every node thinks its own local version is correct.
>
> So I think we have to handle the epoch mismatch and object multi-version problems before evaluating delayed recovery for network partition. If you are not talking about the network partition problem, I think we can only meet the stop/restart node case for manual maintenance, where I think manual recovery could really be helpful.

Delayed recovery can't solve the network partition problem, and as you mentioned above, if sheep breaks the internal protocol, delayed recovery can't help with sheep's upgrade.
But if sheep doesn't break the internal protocol (for example, we just fix a memory-leak bug, add some useful logs, or fix a corner case), it's very useful for us.

> Thanks,
>
> Yuan

--
Yunkai Zhang
Work at Taobao
[sheepdog] [PATCH] test: consolidate 010 to check manual recovery
From: Liu Yuan <tailai...@taobao.com>

Signed-off-by: Liu Yuan <tailai...@taobao.com>
---
 tests/010     | 14 ++
 tests/010.out | 15 ++-
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/tests/010 b/tests/010
index 7496e2d..c3f53b4 100755
--- a/tests/010
+++ b/tests/010
@@ -1,5 +1,7 @@
 #!/bin/bash
 
+# Test manual recovery command
+
 seq=`basename $0`
 echo QA output created by $seq
@@ -13,15 +15,14 @@
 status=1	# failure is the default!
 
 _cleanup
 
-_start_sheep 0
-_start_sheep 1
+for i in `seq 0 1`; do _start_sheep $i; done
 
-sleep 2
+_wait_for_sheep 2
 
 $COLLIE cluster format -c 2
 $COLLIE cluster recover disable
 
-qemu-img create sheepdog:test 4G
+$COLLIE vdi create test 4G
 
 # create 20 objects
 for i in `seq 0 19`; do
@@ -34,3 +35,8 @@ _start_sheep 2
 for i in `seq 0 19`; do
 	$COLLIE vdi write test $((i * 4 * 1024 * 1024)) 512 < /dev/zero
 done
+
+$COLLIE cluster info | _filter_cluster_info
+
+$COLLIE cluster recover enable
+$COLLIE cluster info | _filter_cluster_info
diff --git a/tests/010.out b/tests/010.out
index 01cc1bf..ea84c35 100644
--- a/tests/010.out
+++ b/tests/010.out
@@ -2,4 +2,17 @@ QA output created by 010
 using backend farm store
 *Note*: Only disable the recovery caused by JOIN envets
 Cluster recovery: disable
-Formatting 'sheepdog:test', fmt=raw size=4294967296
+Cluster status: running
+
+Cluster created at DATE
+
+Epoch Time           Version
+DATE      1 [127.0.0.1:7000, 127.0.0.1:7001]
+Cluster recovery: enable
+Cluster status: running
+
+Cluster created at DATE
+
+Epoch Time           Version
+DATE      2 [127.0.0.1:7000, 127.0.0.1:7001, 127.0.0.1:7002]
+DATE      1 [127.0.0.1:7000, 127.0.0.1:7001]
--
1.7.10.2
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On Tue, Aug 21, 2012 at 2:03 AM, MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp> wrote:
> At Mon, 20 Aug 2012 23:34:10 +0800, Yunkai Zhang wrote:
>> In fact, I have thought about this method, but we would face nearly the same problem: after a sheep joins back, it should know which objects are dirty, and should do the cleanup work (because old-version objects stay in its working directory). This method doesn't seem to save the steps, but will do extra recovery work.
>
> Can you give me a concrete example? I created a really naive patch to disable object recovery with my idea:
>
> ==
> diff --git a/sheep/recovery.c b/sheep/recovery.c
> index 5164aa7..8bf032f 100644
> --- a/sheep/recovery.c
> +++ b/sheep/recovery.c
> @@ -35,6 +35,7 @@ struct recovery_work {
>  	uint64_t *oids;
>  	uint64_t *prio_oids;
>  	int nr_prio_oids;
> +	int nr_scheduled_oids;
>  
>  	struct vnode_info *old_vinfo;
>  	struct vnode_info *cur_vinfo;
> @@ -269,9 +270,6 @@ static inline void prepare_schedule_oid(uint64_t oid)
>  			oid);
>  		return;
>  	}
> -	/* The oid is currently being recovered */
> -	if (rw->oids[rw->done] == oid)
> -		return;
>  	rw->nr_prio_oids++;
>  	rw->prio_oids = xrealloc(rw->prio_oids,
>  				 rw->nr_prio_oids * sizeof(uint64_t));
> @@ -399,9 +397,31 @@ static inline void finish_schedule_oids(struct recovery_work *rw)
>  done:
>  	free(rw->prio_oids);
>  	rw->prio_oids = NULL;
> +	rw->nr_scheduled_oids += rw->nr_prio_oids;
>  	rw->nr_prio_oids = 0;
>  }
>  
> +static struct timer recovery_timer;
> +
> +static void recover_next_object(void *arg)
> +{
> +	struct recovery_work *rw = arg;
> +
> +	if (rw->nr_prio_oids)
> +		finish_schedule_oids(rw);
> +
> +	if (rw->done < rw->nr_scheduled_oids) {
> +		/* Try recover next object */
> +		queue_work(sys->recovery_wqueue, &rw->work);
> +		return;
> +	}
> +
> +	/* There is no objects to be recovered.  Try again later */
> +	recovery_timer.callback = recover_next_object;
> +	recovery_timer.data = rw;
> +	add_timer(&recovery_timer, 1); /* FIXME */
> +}
> +
>  static void recover_object_main(struct work *work)
>  {
>  	struct recovery_work *rw = container_of(work, struct recovery_work,
> @@ -425,11 +445,7 @@ static void recover_object_main(struct work *work)
>  	resume_wait_obj_requests(rw->oids[rw->done++]);
>  
>  	if (rw->done < rw->count) {
> -		if (rw->nr_prio_oids)
> -			finish_schedule_oids(rw);
> -
> -		/* Try recover next object */
> -		queue_work(sys->recovery_wqueue, &rw->work);
> +		recover_next_object(rw);
>  		return;
>  	}
> @@ -458,7 +474,7 @@ static void finish_object_list(struct work *work)
>  	resume_wait_recovery_requests();
>  	rw->work.fn = recover_object_work;
>  	rw->work.done = recover_object_main;
> -	queue_work(sys->recovery_wqueue, &rw->work);
> +	recover_next_object(rw);
>  	return;
>  }
> ==
>
> I ran the following test, and object recovery was disabled correctly for both the join and leave cases.
>
> ==
> #!/bin/bash
>
> for i in 0 1 2 3; do
>     ./sheep/sheep /store/$i -z $i -p 700$i -c local
> done
> sleep 1
> ./collie/collie cluster format
> ./collie/collie vdi create test 4G
>
> echo "* objects will be created on node[0-2] *"
> md5sum /store/[0,1,2,3]/obj/807c2b25
>
> pkill -f "./sheep/sheep /store/1"
> sleep 3
> echo "* recovery doesn't start until the object is touched *"
> md5sum /store/[0,2,3]/obj/807c2b25
>
> ./collie/collie vdi snapshot test # invoke recovery of the vdi object
> echo "* the object is recovered *"
> md5sum /store/[0,2,3]/obj/807c2b25
>
> ./sheep/sheep /store/1 -z 1 -p 7001 -c local
> sleep 3
> echo "* recovery doesn't start until the object is touched *"
> md5sum /store/[0,1,2,3]/obj/807c2b25
>
> ./collie/collie vdi list -p 7001 # invoke recovery of the vdi object
> echo "* the object is recovered *"
> md5sum /store/[0,1,2,3]/obj/807c2b25
> ==
>
> [Output]
> using backend farm store
> * objects will be created on node[0-2] *
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/1/obj/807c2b25
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
> * recovery doesn't start until the object is touched *
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/0/obj/807c2b25
> 701e77eab6002c9a48f7ba72c8d9bfe9  /store/2/obj/807c2b25
> * the object is recovered *
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/2/obj/807c2b25
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/3/obj/807c2b25
> * recovery doesn't start until the object is touched *
> 3c3bf0d865363fd0d1f1d5c7aa044dcd  /store/0/obj/807c2b25
> 3c3bf0d865363fd0d1f1d5c7aa044dcd
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 10:46:19 +0800, Liu Yuan wrote:
> So I think we have to handle the epoch mismatch and object
> multi-version problems before evaluating delayed recovery for network
> partition.

Yes, delayed recovery doesn't solve my example at all unless sheepdog
handles network partition. I didn't intend to say that always delaying
recovery is necessary now, but it is worth considering in the future.

Thanks,

Kazutaka
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 11:21 AM, MORITA Kazutaka wrote:
> At Tue, 21 Aug 2012 10:46:19 +0800, Liu Yuan wrote:
>> So I think we have to handle the epoch mismatch and object
>> multi-version problems before evaluating delayed recovery for
>> network partition.
>
> Yes, delayed recovery doesn't solve my example at all unless sheepdog
> handles network partition. I didn't intend to say that always
> delaying recovery is necessary now, but it is worth considering in
> the future.

Well, with a centralized membership control driver such as zookeeper or
accord (I'd like to build an accord driver, possibly simplified and
tailored for sheep, into the sheepdog repo for better development), I
think the network partition problem can virtually disappear with
well-written software that collaborates with sheep to minimize the
chances of a partition happening, rather than trying to solve it after
it happens.

Thanks,
Yuan
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH v2] sheep: make config file compatible with the previous one
At Tue, 21 Aug 2012 10:57:38 +0800, Liu Yuan wrote:
> On 08/21/2012 02:37 AM, MORITA Kazutaka wrote:
>> +	uint8_t __pad[5];
>> +	uint64_t space;
>
> What is __pad[5] for?

If we don't add the padding, 32-bit and 64-bit machines read different
data from the same config file. All network protocols and disk formats
should be aligned to 8 bytes, though we support only x86_64 now.

Thanks,

Kazutaka
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
> On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
>> Another thing that sprang into mind is that instead of the formal
>> recovery enable/disable we should simply always delay recovery, that
>> is, only do recovery after every N seconds if changes happened.
>> Especially in the cases of whole racks going up/down or upgrades,
>> that dramatically reduces the number of epochs required, and thus
>> reduces the recovery overhead.
>>
>> I didn't actually have time to look into the implementation
>> implications of this yet, it's just high-level thoughts.
>
> I think negatively of delaying recovery all the time. It is useful to
> delay recovery in some time window for maintenance or operational
> purposes, so I think the idea of delaying recovery manually in some
> controlled window is useful, but if we extend this to all the running
> time, it will bring the cluster into a less safe (if not dangerous)
> state at any point. (We only upgrade the cluster / maintain individual
> nodes at certain times, not all the time, no?)

I still think that automatic recovery without delay is the wrong
approach. At least for small clusters you simply want to avoid
unnecessary traffic. Such recovery can produce massive traffic on the
network (several TB of data), and can make the whole system unusable
because of that. I want to control when recovery starts.

- Dietmar
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
At Tue, 21 Aug 2012 04:34:05 +, Dietmar Maurer wrote:
> On 08/21/2012 12:07 AM, Christoph Hellwig wrote:
>>> Another thing that sprang into mind is that instead of the formal
>>> recovery enable/disable we should simply always delay recovery, that
>>> is, only do recovery after every N seconds if changes happened.
>>> Especially in the cases of whole racks going up/down or upgrades,
>>> that dramatically reduces the number of epochs required, and thus
>>> reduces the recovery overhead.
>>>
>>> I didn't actually have time to look into the implementation
>>> implications of this yet, it's just high-level thoughts.
>>
>> I think negatively of delaying recovery all the time. It is useful to
>> delay recovery in some time window for maintenance or operational
>> purposes, so I think the idea of delaying recovery manually in some
>> controlled window is useful, but if we extend this to all the running
>> time, it will bring the cluster into a less safe (if not dangerous)
>> state at any point. (We only upgrade the cluster / maintain
>> individual nodes at certain times, not all the time, no?)
>
> I still think that automatic recovery without delay is the wrong
> approach. At least for small clusters you simply want to avoid
> unnecessary traffic. Such recovery can produce massive traffic on the
> network (several TB of data), and can make the whole system unusable
> because of that. I want to control when recovery starts.

Disabling automatic recovery by default doesn't work for you? You can
control the time to start recovery with "collie cluster recover
enable".

Thanks,

Kazutaka
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
On 08/21/2012 12:34 PM, Dietmar Maurer wrote:
> I still think that automatic recovery without delay is the wrong
> approach. At least for small clusters you simply want to avoid
> unnecessary traffic. Such recovery can produce massive traffic on the
> network (several TB of data), and can make the whole system unusable
> because of that. I want to control when recovery starts.

Your goal of avoiding unnecessary object transfer can actually be built
on top of the manual recovery mechanism. If we implement manual
recovery, we can add a timeout option to it (very easy); thus if
someone wants to always delay recovery, he can simply disable automatic
recovery and specify a timeout for it. In this way, we can have several
policies to accommodate different purposes.

Thanks,
Yuan
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
> I think your example is very vague. What kind of driver do you use?
> Sheep itself doesn't sense membership; it relies on cluster drivers to
> maintain membership. Could you detail how it happens exactly in a real
> case?

Membership changes can happen for many reasons. They can happen if
something is wrong on the switch (or if some admin configures the
switch), a damaged network cable, a bug in the bonding driver, a
damaged network card, or simply a power failure on a node, which
reconnects after power is restored.

In the literature, the problem is also known as the 'babbling idiot'
problem (realtime people use that term). A single node can make the
whole system unusable.

- Dietmar
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH] test: consolidate 010 to check manual recovery
At Tue, 21 Aug 2012 11:03:52 +0800, Liu Yuan wrote:
> From: Liu Yuan tailai...@taobao.com
>
> Signed-off-by: Liu Yuan tailai...@taobao.com
> ---
>  tests/010     | 14 ++
>  tests/010.out | 15 ++-
>  2 files changed, 24 insertions(+), 5 deletions(-)

Applied, thanks!

Kazutaka
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog
Re: [sheepdog] [PATCH 1/2] collie: optimize 'collie vdi check' command
On Tue, Aug 21, 2012 at 1:42 PM, MORITA Kazutaka
morita.kazut...@lab.ntt.co.jp wrote:
> At Thu, 16 Aug 2012 22:38:21 +0800, Yunkai Zhang wrote:
>> After adding the '-F' flag, the help looks like:
>>
>> $ collie vdi check
>> Usage: collie vdi check [-F] [-s snapshot] [-a address] [-p port] [-h] <vdiname>
>>
>> Options:
>>   -F, --force_repair      force repair object's copies (dangerous)
>
> How about '-r, --repair'?

Good for me.

>>  		fprintf(stderr, "Failed to read, %s\n",
>>  			sd_strerror(rsp->result));
>>  		exit(EXIT_FAILURE);
>>  	}
>> -	return buf;
>> +
>> +	memcpy(sha1, (unsigned char *)&rsp->__pad[0], SHA1_LEN);
>
> Please define a member name instead of using __pad.

OK.

>>  }
>>
>> -static void write_object_to(struct sd_vnode *vnode, uint64_t oid, void *buf)
>> +static int do_repair(uint64_t oid, struct node_id *src, struct node_id *dest)
>>  {
>>  	struct sd_req hdr;
>>  	struct sd_rsp *rsp = (struct sd_rsp *)&hdr;
>> +	unsigned rlen, wlen;
>> +	char host[128];
>>  	int fd, ret;
>> -	unsigned wlen = SD_DATA_OBJ_SIZE, rlen = 0;
>> -	char name[128];
>>
>> -	addr_to_str(name, sizeof(name), vnode->nid.addr, 0);
>> -	fd = connect_to(name, vnode->nid.port);
>> +	addr_to_str(host, sizeof(host), dest->addr, 0);
>> +
>> +	fd = connect_to(host, dest->port);
>>  	if (fd < 0) {
>> -		fprintf(stderr, "failed to connect to %s:%" PRIu32 "\n",
>> -			name, vnode->nid.port);
>> -		exit(EXIT_FAILURE);
>> +		fprintf(stderr, "Failed to connect\n");
>> +		return SD_RES_EIO;
>>  	}
>>
>> -	sd_init_req(&hdr, SD_OP_WRITE_PEER);
>> -	hdr.epoch = sd_epoch;
>> -	hdr.flags = SD_FLAG_CMD_WRITE;
>> -	hdr.data_length = wlen;
>> +	sd_init_req(&hdr, SD_OP_REPAIR_OBJ_PEER);
>
> I don't think sending peer requests directly from outside sheeps is a
> good idea. How about making the gateway node forward the requests?

Ok, no problem.

>> +	rlen = 0;
>> +	wlen = sizeof(*src);
>> +
>> +	hdr.epoch = sd_epoch;
>>  	hdr.obj.oid = oid;
>> +	hdr.data_length = wlen;
>> +	hdr.flags = SD_FLAG_CMD_WRITE;
>>
>> -	ret = exec_req(fd, &hdr, buf, &wlen, &rlen);
>> +	ret = exec_req(fd, &hdr, src, &wlen, &rlen);
>>  	close(fd);
>>  	if (ret) {
>> -		fprintf(stderr, "Failed to execute request\n");
>> -		exit(EXIT_FAILURE);
>> +		fprintf(stderr, "Failed to repair oid:%" PRIx64 "\n", oid);
>> +		return SD_RES_EIO;
>>  	}
>>
>>  	if (rsp->result != SD_RES_SUCCESS) {
>> -		fprintf(stderr, "Failed to read, %s\n",
>> -			sd_strerror(rsp->result));
>> -		exit(EXIT_FAILURE);
>> +		fprintf(stderr, "Failed to repair oid:%" PRIx64 ", %s\n",
>> +			oid, sd_strerror(rsp->result));
>> +		return rsp->result;
>>  	}
>> +
>> +	return SD_RES_SUCCESS;
>>  }
>>
>> -/*
>> - * Fix consistency of the replica of oid.
>> - *
>> - * XXX: The fix is rather dumb, just read the first copy and write it
>> - * to other replica.
>> - */
>> -static void do_check_repair(uint64_t oid, int nr_copies)
>> +static int do_check_repair(uint64_t oid, int nr_copies)
>>  {
>>  	struct sd_vnode *tgt_vnodes[nr_copies];
>> -	void *buf, *buf_cmp;
>> -	int i;
>> +	unsigned char sha1[SD_MAX_COPIES][SHA1_LEN];
>> +	char host[128];
>> +	int i, j;
>>
>>  	oid_to_vnodes(sd_vnodes, sd_vnodes_nr, oid, nr_copies, tgt_vnodes);
>> -	buf = read_object_from(tgt_vnodes[0], oid);
>> -	for (i = 1; i < nr_copies; i++) {
>> -		buf_cmp = read_object_from(tgt_vnodes[i], oid);
>> -		if (memcmp(buf, buf_cmp, SD_DATA_OBJ_SIZE)) {
>> -			free(buf_cmp);
>> -			goto fix_consistency;
>> +	for (i = 0; i < nr_copies; i++) {
>> +		get_obj_checksum_from(tgt_vnodes[i], oid, sha1[i]);
>> +	}
>> +
>> +	for (i = 0; i < nr_copies; i++) {
>> +		for (j = (i + 1); j < nr_copies; j++) {
>> +			if (memcmp(sha1[i], sha1[j], SHA1_LEN))
>> +				goto diff;
>>  		}
>> -		free(buf_cmp);
>>  	}
>> -	free(buf);
>> -	return;
>> +	return 0;
>>
>> -fix_consistency:
>> -	for (i = 1; i < nr_copies; i++)
>> -		write_object_to(tgt_vnodes[i], oid, buf);
>> -	fprintf(stdout, "fix %" PRIx64 " success\n", oid);
>> -	free(buf);
>> +diff:
>> +	fprintf(stderr, "Failed oid: %" PRIx64 "\n", oid);
>> +	for (i = 0; i < nr_copies; i++) {
>> +		addr_to_str(host, sizeof(host), tgt_vnodes[i]->nid.addr, 0);
>> +		fprintf(stderr, "copy[%d], sha1: %s, from: %s:%d\n",
>> +			i, sha1_to_hex(sha1[i]), host, tgt_vnodes[i]->nid.port);
>> +	}
>> +
>> +	if (!vdi_cmd_data.force_repair)
>> +		return -1;
>> +
>> +	/*
>> +	 * Force repair the consistency of oid's replica
>> +	 *
>> +	 * FIXME: this fix is rather dumb, it just read the
>> +	 * first copy and write it to other replica,
>> +	 */
>> +	fprintf(stderr, "force repairing ...\n");
>> +	addr_to_str(host, sizeof(host), tgt_vnodes[0]->nid.addr,
Re: [sheepdog] [PATCH V2 00/11] INTRODUCE
> Disabling automatic recovery by default doesn't work for you? You can
> control the time to start recovery with "collie cluster recover
> enable".

It just looks strange to me to design the system for immediate/automatic
recovery and make 'disabling automatic recovery' an option. I would
include the node state in the epoch. But maybe that is only an
implementation detail.

- Dietmar
-- 
sheepdog mailing list
sheepdog@lists.wpkg.org
http://lists.wpkg.org/mailman/listinfo/sheepdog