Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel
On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe s.pri...@profihost.ag wrote:

@@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     RBDAIOCB *acb = (RBDAIOCB *) blockacb;
     acb->cancelled = 1;
+
+    while (acb->status == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
 }

 static const AIOCBInfo rbd_aiocb_info = {

@@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    acb->status = 0;

     qemu_aio_release(acb);
 }

We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1 because qemu_rbd_aio_cancel() still accesses it. This was discussed in an early version of the patch.

Stefan
[PATCHv6] rbd block driver fix race between aio completion and aio cancel
This one fixes a race between cancellation and I/O completion which qemu also had in the iscsi block driver: qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. To achieve this it introduces a new status flag which uses -EINPROGRESS.

Changes since PATCHv5:
- qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O was cancelled

Changes since PATCHv4:
- removed unnecessary qemu_vfree of acb->bounce as BH will always run

Changes since PATCHv3:
- removed unnecessary if condition in rbd_start_aio as we haven't started I/O yet
- moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always waits until BH was executed

Changes since PATCHv2:
- fixed missing braces
- added vfree for bounce

Signed-off-by: Stefan Priebe s.pri...@profihost.ag
---
 block/rbd.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f3becc7..737bab1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -77,6 +77,7 @@ typedef struct RBDAIOCB {
     int error;
     struct BDRVRBDState *s;
     int cancelled;
+    int status;
 } RBDAIOCB;

 typedef struct RADOSCB {
@@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
     RBDAIOCB *acb = rcb->acb;
     int64_t r;

-    if (acb->cancelled) {
-        qemu_vfree(acb->bounce);
-        qemu_aio_release(acb);
-        goto done;
-    }
-
     r = rcb->ret;

     if (acb->cmd == RBD_AIO_WRITE ||
@@ -409,7 +404,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
     /* Note that acb->bh can be NULL in case where the aio was cancelled */
     acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
     qemu_bh_schedule(acb->bh);
-done:
     g_free(rcb);
 }

@@ -568,6 +562,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     RBDAIOCB *acb = (RBDAIOCB *) blockacb;
     acb->cancelled = 1;
+
+    while (acb->status == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
+
+    qemu_aio_release(acb);
 }

 static const AIOCBInfo rbd_aiocb_info = {
@@ -639,8 +639,11 @@ static void rbd_aio_bh_cb(void *opaque)
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    acb->status = 0;

-    qemu_aio_release(acb);
+    if (!acb->cancelled) {
+        qemu_aio_release(acb);
+    }
 }

 static int rbd_aio_discard_wrapper(rbd_image_t image,
@@ -685,6 +688,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
     acb->s = s;
     acb->cancelled = 0;
     acb->bh = NULL;
+    acb->status = -EINPROGRESS;

     if (cmd == RBD_AIO_WRITE) {
         qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
--
1.7.10.4
Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel
fixed in V6

On 30.11.2012 09:26, Stefan Hajnoczi wrote:
On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe s.pri...@profihost.ag wrote:

@@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     RBDAIOCB *acb = (RBDAIOCB *) blockacb;
     acb->cancelled = 1;
+
+    while (acb->status == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
 }

 static const AIOCBInfo rbd_aiocb_info = {

@@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    acb->status = 0;

     qemu_aio_release(acb);
 }

We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1 because qemu_rbd_aio_cancel() still accesses it. This was discussed in an early version of the patch.

Stefan
Re: Hangup during scrubbing - possible solutions
http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

Here, please. I have initiated a deep-scrub of osd.1, which led to forever-stuck I/O requests in a short time (a plain scrub will do the same). The second log may be useful for proper timestamps, as seeks in the original may take a long time. The osd processes on the specific node were restarted twice - at the beginning, to be sure all config options were applied, and at the end, to do the same plus get rid of the stuck requests.

On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote:

If you can reproduce it again, what we really need are the osd logs from the acting set of a pg stuck in scrub with debug osd = 20, debug ms = 1, debug filestore = 20.

Thanks,
-Sam

On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov and...@xdel.ru wrote:
On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote:
On Thu, 22 Nov 2012, Andrey Korolyov wrote:

Hi,

In recent versions Ceph introduces some unexpected behavior for permanent connections (VM or kernel clients) - after crash recovery, I/O will hang on the next planned scrub in the following scenario:

- launch a bunch of clients doing non-intensive writes,
- lose one or more osds, mark them down, wait for recovery completion,
- do a slow scrub, e.g. scrubbing one osd per 5m inside a bash script, or wait for ceph to do the same,
- observe a rising number of pgs stuck in the active+clean+scrubbing state (they took the master role from ones which were on the killed osd and almost surely were being written at the time of the crash),
- some time later, clients will hang hard and the ceph log reports stuck (old) I/O requests.

The only way to return clients back without losing their I/O state is a per-osd restart, which also helps to get rid of the active+clean+scrubbing pgs.

First of all, I`ll be happy to help solve this problem by providing logs.

If you can reproduce this behavior with 'debug osd = 20' and 'debug ms = 1' logging on the OSD, that would be wonderful!

I have tested a slightly different recovery flow, please see below. Since there is no real harm, like frozen I/O, placement groups were also stuck forever in the active+clean+scrubbing state, until I restarted all osds (end of the log):

http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

- start the healthy cluster
- start persistent clients
- add another host with a pair of OSDs, let them be in the data placement
- wait for data to rearrange
- [22:06 timestamp] mark the OSDs out or simply kill them and wait (since I have a 1/2 hour delay on readjust in such a case, I did ``ceph osd out'' manually)
- watch for data to rearrange again
- [22:51 timestamp] when it ends, start a manual rescrub, with non-zero active+clean+scrubbing-state placement groups at the end of the process which will stay in this state forever until something happens

After that, I can restart osds one by one if I want to get rid of the scrubbing states immediately and then do a deep-scrub (if I don`t, those states will return at the next ceph self-scrubbing), or do a per-osd deep-scrub if I have a lot of time.

The case I have described in the previous message took place when I removed an osd from the data placement which existed at the moment the client(s) started, and indeed it is more harmful than the current one (frozen I/O hangs the entire guest, for example). Since testing those flows took a lot of time, I`ll send logs related to this case tomorrow.

The second question is not directly related to this problem, but I have thought about it for a long time - are there planned features to control the scrub process more precisely, e.g. pg scrub rate or scheduled scrub, instead of the current set of timeouts which are of course not very predictable as to when they run?

Not yet. I would be interested in hearing what kind of control/config options/whatever you (and others) would like to see!

Of course it would be awesome to have any deterministic scheduler, or at least an option to disable automated scrubbing, since it is not very deterministic in time and deep-scrub eats a lot of I/O if the command is issued against an entire OSD. Rate limiting is not in the first place, at least it may be recreated in an external script, but for those who prefer to leave control to Ceph, it may be very useful.

Thanks!
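As a point of reference, the "slow scrub, e.g. scrubbing one osd per 5m inside a bash script" mentioned above can be driven by a loop along these lines; this is only a sketch, and the 300-second interval and the choice between scrub and deep-scrub are assumptions rather than the exact script used in the report:

  # Sketch: kick off a (deep-)scrub of one OSD every 5 minutes
  for osd in $(ceph osd ls); do
      ceph osd scrub "$osd"        # or: ceph osd deep-scrub "$osd"
      sleep 300
  done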
Re: [PATCHv6] rbd block driver fix race between aio completion and aio cancel
On Fri, Nov 30, 2012 at 9:55 AM, Stefan Priebe s.pri...@profihost.ag wrote:

This one fixes a race between cancellation and I/O completion which qemu also had in the iscsi block driver: qemu_rbd_aio_cancel was not synchronously waiting for the end of the command. To achieve this it introduces a new status flag which uses -EINPROGRESS.

Changes since PATCHv5:
- qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O was cancelled

Changes since PATCHv4:
- removed unnecessary qemu_vfree of acb->bounce as BH will always run

Changes since PATCHv3:
- removed unnecessary if condition in rbd_start_aio as we haven't started I/O yet
- moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always waits until BH was executed

Changes since PATCHv2:
- fixed missing braces
- added vfree for bounce

Signed-off-by: Stefan Priebe s.pri...@profihost.ag
---
 block/rbd.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@gmail.com
[PATCH 0/2] rbd: fix two memory leaks
This series fixes two memory leaks that occur whenever a special (non-I/O) osd request is issued in rbd.

					-Alex

[PATCH 1/2] rbd: don't leak rbd_req on synchronous requests
[PATCH 2/2] rbd: don't leak rbd_req for rbd_req_sync_notify_ack()
[PATCH 1/2] rbd: don't leak rbd_req on synchronous requests
When rbd_do_request() is called it allocates and populates an rbd_req structure to hold information about the osd request to be sent. This is done for the benefit of the callback function (in particular, rbd_req_cb()), which uses this in processing when the request completes.

Synchronous requests provide no callback function, in which case rbd_do_request() waits for the request to complete before returning. This case does not free the rbd_req structure as it should, so it is getting leaked.

Note however that the synchronous case has no need for the rbd_req structure at all. So rather than simply freeing this structure for synchronous requests, just don't allocate it to begin with.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 48 ++++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index acdb4a6..78493e7 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1160,20 +1160,11 @@ static int rbd_do_request(struct request *rq,
                            struct ceph_msg *),
                          u64 *ver)
 {
+    struct ceph_osd_client *osdc;
     struct ceph_osd_request *osd_req;
-    int ret;
+    struct rbd_request *rbd_req = NULL;
     struct timespec mtime = CURRENT_TIME;
-    struct rbd_request *rbd_req;
-    struct ceph_osd_client *osdc;
-
-    rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
-    if (!rbd_req)
-        return -ENOMEM;
-
-    if (coll) {
-        rbd_req->coll = coll;
-        rbd_req->coll_index = coll_index;
-    }
+    int ret;

     dout("rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n",
          object_name, (unsigned long long) ofs,
@@ -1181,10 +1172,8 @@ static int rbd_do_request(struct request *rq,
     osdc = rbd_dev->rbd_client->client->osdc;
     osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO);
-    if (!osd_req) {
-        ret = -ENOMEM;
-        goto done_pages;
-    }
+    if (!osd_req)
+        return -ENOMEM;

     osd_req->r_flags = flags;
     osd_req->r_pages = pages;
@@ -1192,13 +1181,22 @@ static int rbd_do_request(struct request *rq,
         osd_req->r_bio = bio;
         bio_get(osd_req->r_bio);
     }
-    osd_req->r_callback = rbd_cb;
-    rbd_req->rq = rq;
-    rbd_req->bio = bio;
-    rbd_req->pages = pages;
-    rbd_req->len = len;
+    if (rbd_cb) {
+        ret = -ENOMEM;
+        rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
+        if (!rbd_req)
+            goto done_osd_req;
+
+        rbd_req->rq = rq;
+        rbd_req->bio = bio;
+        rbd_req->pages = pages;
+        rbd_req->len = len;
+        rbd_req->coll = coll;
+        rbd_req->coll_index = coll ? coll_index : 0;
+    }
+
+    osd_req->r_callback = rbd_cb;
     osd_req->r_priv = rbd_req;

     strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
@@ -1233,10 +1231,12 @@ static int rbd_do_request(struct request *rq,
     return ret;

 done_err:
-    bio_chain_put(rbd_req->bio);
-    ceph_osdc_put_request(osd_req);
-done_pages:
+    if (bio)
+        bio_chain_put(osd_req->r_bio);
     kfree(rbd_req);
+done_osd_req:
+    ceph_osdc_put_request(osd_req);
+
     return ret;
 }
--
1.7.9.5
[PATCH 2/2] rbd: don't leak rbd_req for rbd_req_sync_notify_ack()
When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies rbd_simple_req_cb() as its callback function. Because the callback is supplied, an rbd_req structure gets allocated and populated so it can be used by the callback. However rbd_simple_req_cb() does not free (or even use) the rbd_req structure, so it's getting leaked.

Since rbd_simple_req_cb() has no need for the rbd_req structure, just avoid allocating one for this case.

Of the three calls to rbd_do_request(), only the one from rbd_do_op() needs the rbd_req structure, and that call can be distinguished from the other two because it supplies a non-null rbd_collection pointer. So fix this leak by only allocating the rbd_req structure if a non-null coll value is provided to rbd_do_request().

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 78493e7..fca0ebf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1182,7 +1182,7 @@ static int rbd_do_request(struct request *rq,
         bio_get(osd_req->r_bio);
     }

-    if (rbd_cb) {
+    if (coll) {
         ret = -ENOMEM;
         rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
         if (!rbd_req)
@@ -1193,7 +1193,7 @@ static int rbd_do_request(struct request *rq,
         rbd_req->pages = pages;
         rbd_req->len = len;
         rbd_req->coll = coll;
-        rbd_req->coll_index = coll ? coll_index : 0;
+        rbd_req->coll_index = coll_index;
     }

     osd_req->r_callback = rbd_cb;
--
1.7.9.5
[PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed
Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule
  step chooseleaf firstn 0 type node_type

This rule means that n replicas will be chosen in a manner such that each chosen leaf's branch will contain a unique instance of node_type.

When an object is re-replicated after a leaf failure, if the CRUSH map uses a chooseleaf rule the remapped replica ends up under the node_type bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves but one fail under a particular node_type bucket, that remaining leaf holds all the data from its failed peers.

This behavior also limits the number of peers that can participate in the re-replication of the data held by the failed leaf, which increases the time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the inner and outer descents.

If the tree descent down to node_type is the outer descent, and the descent from node_type down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, so only the inner descent is retried.

In order to disperse re-replicated data as widely as possible across a storage cluster after a failure, we want to retry the outer descent. So, fix up crush_choose() to allow the inner descent to return immediately on choosing a failed leaf. Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to be moved as well. This seems unavoidable but should be relatively rare.

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 include/linux/ceph/ceph_features.h |    4 +++-
 include/linux/crush/crush.h        |    2 ++
 net/ceph/crush/mapper.c            |   13 ++++++++++---
 net/ceph/osdmap.c                  |    6 ++++++
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index dad579b..61e5af4 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -14,13 +14,15 @@
 #define CEPH_FEATURE_DIRLAYOUTHASH  (1<<7)
 /* bits 8-17 defined by user-space; not supported yet here */
 #define CEPH_FEATURE_CRUSH_TUNABLES (1<<18)
+#define CEPH_FEATURE_CRUSH_TUNABLES2 (1<<25)

 /*
  * Features supported.
  */
 #define CEPH_FEATURES_SUPPORTED_DEFAULT  \
     (CEPH_FEATURE_NOSRCADDR |            \
-     CEPH_FEATURE_CRUSH_TUNABLES)
+     CEPH_FEATURE_CRUSH_TUNABLES |       \
+     CEPH_FEATURE_CRUSH_TUNABLES2)

 #define CEPH_FEATURES_REQUIRED_DEFAULT   \
     (CEPH_FEATURE_NOSRCADDR)
diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h
index 25baa28..6a1101f 100644
--- a/include/linux/crush/crush.h
+++ b/include/linux/crush/crush.h
@@ -162,6 +162,8 @@ struct crush_map {
     __u32 choose_local_fallback_tries;
     /* choose attempts before giving up */
     __u32 choose_total_tries;
+    /* attempt chooseleaf inner descent once; on failure retry outer descent */
+    __u32 chooseleaf_descend_once;
 };
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 35fce75..96c8a58 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -287,6 +287,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in
  * @outpos: our position in that vector
  * @firstn: true if choosing "first n" items, false if choosing "indep"
  * @recurse_to_leaf: true if we want one device under each item of given type
+ * @descend_once: true if we should only try one descent before giving up
  * @out2: second output vector for leaf items (if @recurse_to_leaf)
  */
 static int crush_choose(const struct crush_map *map,
@@ -295,7 +296,7 @@ static int crush_choose(const struct crush_map *map,
                         int x, int numrep, int type,
                         int *out, int outpos,
                         int firstn, int recurse_to_leaf,
-                        int *out2)
+                        int descend_once, int *out2)
 {
     int rep;
     unsigned int ftotal, flocal;
@@ -399,6 +400,7 @@ static int crush_choose(const struct crush_map *map,
                                          x, outpos+1, 0,
                                          out2, outpos,
                                          firstn, 0,
+                                         map->chooseleaf_descend_once,
                                          NULL) <= outpos)
                     /* didn't get leaf */
                     reject = 1;
@@ -422,7 +424,10 @@ reject:
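For operators following along, the matching server-side behavior is controlled through the CRUSH map tunables. Below is a rough sketch of inspecting and re-injecting a map; the file names are placeholders, and the exact spelling of the tunable line in a decompiled map is an assumption based on the field name added by this patch:

  ceph osd getcrushmap -o crushmap.bin        # export the compiled CRUSH map
  crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
  # a map with the new behavior enabled would be expected to carry a line like:
  #   tunable chooseleaf_descend_once 1
  crushtool -c crushmap.txt -o crushmap.new   # recompile
  ceph osd setcrushmap -i crushmap.new        # inject the updated map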
Re: OSD daemon changes port no
What kernel version and mds version are you running?

I did

 # ceph osd pool create foo 12
 # ceph osd pool create bar 12
 # ceph mds add_data_pool 3
 # ceph mds add_data_pool 4

and from a kernel mount

 # mkdir foo
 # mkdir bar
 # cephfs foo set_layout --pool 3
 # cephfs bar set_layout --pool 4
 # cephfs foo show_layout
 layout.data_pool:     3
 layout.object_size:   4194304
 layout.stripe_unit:   4194304
 layout.stripe_count:  1
 # cephfs bar show_layout
 layout.data_pool:     4
 layout.object_size:   4194304
 layout.stripe_unit:   4194304
 layout.stripe_count:  1

This much you can test without playing with the crush map, btw. Maybe there is some crazy bug when the set_layouts are pipelined? Try without using ?

sage

On Fri, 30 Nov 2012, hemant surale wrote:

Hi Sage, Community,

I am unable to use 2 directories to direct data to 2 different pools. I did the following experiment: created 2 pools, "host" and "ghost", to separate data placement.

--- // crushmap file ---
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool
type 7 ghost

# buckets
host hemantone-mirror-virtual-machine {
	id -6		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
}
host hemantone-virtual-machine {
	id -7		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
rack one {
	id -2		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item hemantone-mirror-virtual-machine weight 1.000
	item hemantone-virtual-machine weight 1.000
}
ghost hemant-virtual-machine {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
ghost hemant-mirror-virtual-machine {
	id -5		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 1.000
}
rack two {
	id -3		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item hemant-virtual-machine weight 1.000
	item hemant-mirror-virtual-machine weight 1.000
}
pool default {
	id -1		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item one weight 2.000
	item two weight 2.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule forhost {
	ruleset 3
	type replicated
	min_size 1
	max_size 10
	step take default
	step take one
	step chooseleaf firstn 0 type host
	step emit
}
rule forghost {
	ruleset 4
	type replicated
	min_size 1
	max_size 10
	step take default
	step take two
	step chooseleaf firstn 0 type ghost
	step emit
}

# end crush map

1) Set the replication factor to 2, and the crush rules accordingly (the "host" pool got crush_ruleset = 3, the "ghost" pool got crush_ruleset = 4).

2) Now I mounted the directories using

 mount.ceph 10.72.148.245:6789:/ /home/hemant/x
 mount.ceph 10.72.148.245:6789:/ /home/hemant/y

3) Then

 mds add_data_pool 5
 mds add_data_pool 6

(here the pool ids are host = 5, ghost = 6)

4) cephfs /home/hemant/x set_layout --pool 5 -c 1 -u 4194304 -s 4194304
   cephfs /home/hemant/y set_layout --pool 6 -c 1 -u 4194304 -s 4194304

PROBLEM:

 $ cephfs /home/hemant/x show_layout
 layout.data_pool:     6
 layout.object_size:   4194304
 layout.stripe_unit:   4194304
 layout.stripe_count:  1
 $ cephfs /home/hemant/y show_layout
 layout.data_pool:     6
 layout.object_size:   4194304
 layout.stripe_unit:   4194304
 layout.stripe_count:  1

Both directories are using the same pool to place data even after I specified separate pools using the cephfs command. Please help me figure this out.
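For completeness, step 1) above (tying each pool to its ruleset and a replication factor of 2) is normally expressed with commands like the following; the pool names and ruleset numbers are taken from the description above, but the exact invocations are a sketch, not copied from the report:

  ceph osd pool set host crush_ruleset 3
  ceph osd pool set ghost crush_ruleset 4
  ceph osd pool set host size 2
  ceph osd pool set ghost size 2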
Re: rbd map command hangs for 15 minutes during system start up
My initial tests using a 3.5.7 kernel with the 55 patches from wip-nick are going well. So far I've gone through 8 installs without an incident, I'll leave it run for a bit longer to see if it crops up again. Can I get a branch with these patches integrated into all of the backported patches to 3.5.x? I'd like to get this into our main testing branch, which is currently running 3.5.7 with the patches from wip-3.5 excluding the libceph_resubmit_linger_ops_when_pg_mapping_changes patch. Note that we had a case of a rbd map hang with our main testing branch, but I don't have a script that can reproduce that yet. It was after the cluster was all up and working, and we were doing a rolling reboot (cycling through each node). On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder el...@inktank.com wrote: On 11/22/2012 12:04 PM, Nick Bartos wrote: Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: Nick, I have put together a branch that includes two fixes that might be helpful. I don't expect these fixes will necessarily *fix* what you're seeing, but one of them pulls a big hunk of processing out of the picture and might help eliminate some potential causes. I had to pull in several other patches as prerequisites in order to get those fixes to apply cleanly. Would you be able to give it a try, and let us know what results you get? The branch contains: - Linux 3.5.5 - Plus the first 49 patches you listed - Plus four patches, which are prerequisites... libceph: define ceph_extract_encoded_string() rbd: define some new format constants rbd: define rbd_dev_image_id() rbd: kill create_snap sysfs entry - ...for these two bug fixes: libceph: remove 'osdtimeout' option ceph: don't reference req after put The branch is available in the ceph-client git repository under the name wip-nick and has commit id dd9323aa. https://github.com/ceph/ceph-client/tree/wip-nick https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt This full debug output is very helpful. Please supply that again as well. Thanks. -Alex On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote: It's very easy to reproduce now with my automated install script, the most I've seen it succeed with that patch is 2 in a row, and hanging on the 3rd, although it hangs on most builds. So it shouldn't take much to get it to do it again. I'll try and get to that tomorrow, when I'm a bit more rested and my brain is working better. Yes during this the OSDs are probably all syncing up. All the osd and mon daemons have started by the time the rdb commands are ran, though. On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, Nick Bartos wrote: FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. Okay, that one at least makes some sense. I've opened http://tracker.newdream.net/issues/3519 How easy is this to reproduce? If it is something you can trigger with debugging enabled ('echo module libceph +p /sys/kernel/debug/dynamic_debug/control') that would help tremendously. I'm guessing that during this startup time the OSDs are still in the process of starting? Alex, I bet that a test that does a lot of map/unmap stuff in a loop while thrashing OSDs could hit this. Thanks! 
sage On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote: With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fall out be for removing that specific patch? On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com wrote: It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make sure it's not a fluke, but since previously it hangs within the first couple of builds, it really looks like this is where the problem originated. 1-libceph_eliminate_connection_state_DEAD.patch 2-libceph_kill_bad_proto_ceph_connection_op.patch 3-libceph_rename_socket_callbacks.patch 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch 6-libceph_start_separating_connection_flags_from_state.patch 7-libceph_start_tracking_connection_socket_state.patch 8-libceph_provide_osd_number_when_creating_osd.patch 9-libceph_set_CLOSED_state_bit_in_con_init.patch
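The dynamic-debug toggle quoted above is written into debugfs with a redirect; a minimal sketch (assuming debugfs is mounted at /sys/kernel/debug, and also enabling the rbd module, which is an extra step not mentioned in the thread):

  echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
  dmesg -c > /dev/null     # clear the kernel ring buffer before reproducing
  # ...reproduce the rbd map hang, then capture the debug output:
  dmesg > libceph-debug.log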
Re: rbd map command hangs for 15 minutes during system start up
On 11/30/2012 12:49 PM, Nick Bartos wrote: My initial tests using a 3.5.7 kernel with the 55 patches from wip-nick are going well. So far I've gone through 8 installs without an incident, I'll leave it run for a bit longer to see if it crops up again. This is great news! Now I wonder which of the two fixes took care of the problem... Can I get a branch with these patches integrated into all of the backported patches to 3.5.x? I'd like to get this into our main testing branch, which is currently running 3.5.7 with the patches from wip-3.5 excluding the libceph_resubmit_linger_ops_when_pg_mapping_changes patch. I will put together a new branch that includes the remainder of those patches for you shortly. Note that we had a case of a rbd map hang with our main testing branch, but I don't have a script that can reproduce that yet. It was after the cluster was all up and working, and we were doing a rolling reboot (cycling through each node). If you are able to reproduce this please let us know. -Alex On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder el...@inktank.com wrote: On 11/22/2012 12:04 PM, Nick Bartos wrote: Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: Nick, I have put together a branch that includes two fixes that might be helpful. I don't expect these fixes will necessarily *fix* what you're seeing, but one of them pulls a big hunk of processing out of the picture and might help eliminate some potential causes. I had to pull in several other patches as prerequisites in order to get those fixes to apply cleanly. Would you be able to give it a try, and let us know what results you get? The branch contains: - Linux 3.5.5 - Plus the first 49 patches you listed - Plus four patches, which are prerequisites... libceph: define ceph_extract_encoded_string() rbd: define some new format constants rbd: define rbd_dev_image_id() rbd: kill create_snap sysfs entry - ...for these two bug fixes: libceph: remove 'osdtimeout' option ceph: don't reference req after put The branch is available in the ceph-client git repository under the name wip-nick and has commit id dd9323aa. https://github.com/ceph/ceph-client/tree/wip-nick https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt This full debug output is very helpful. Please supply that again as well. Thanks. -Alex On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote: It's very easy to reproduce now with my automated install script, the most I've seen it succeed with that patch is 2 in a row, and hanging on the 3rd, although it hangs on most builds. So it shouldn't take much to get it to do it again. I'll try and get to that tomorrow, when I'm a bit more rested and my brain is working better. Yes during this the OSDs are probably all syncing up. All the osd and mon daemons have started by the time the rdb commands are ran, though. On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, Nick Bartos wrote: FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. Okay, that one at least makes some sense. I've opened http://tracker.newdream.net/issues/3519 How easy is this to reproduce? If it is something you can trigger with debugging enabled ('echo module libceph +p /sys/kernel/debug/dynamic_debug/control') that would help tremendously. 
I'm guessing that during this startup time the OSDs are still in the process of starting? Alex, I bet that a test that does a lot of map/unmap stuff in a loop while thrashing OSDs could hit this. Thanks! sage On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote: With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fall out be for removing that specific patch? On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com wrote: It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make sure it's not a fluke, but since previously it hangs within the first couple of builds, it really looks like this is where the problem originated. 1-libceph_eliminate_connection_state_DEAD.patch 2-libceph_kill_bad_proto_ceph_connection_op.patch 3-libceph_rename_socket_callbacks.patch
Re: Hangup during scrubbing - possible solutions
Hah! Thanks for the log, it's our handling of active_pushes. I'll have a patch shortly. Thanks! -Sam On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov and...@xdel.ru wrote: http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz http://xdel.ru/downloads/ceph-log/cluster-w.log.gz Here, please. I have initiated a deep-scrub of osd.1 which was lead to forever-stuck I/O requests in a short time(scrub `ll do the same). Second log may be useful for proper timestamps, as seeks on the original may took a long time. Osd processes on the specific node was restarted twice - at the beginning to be sure all config options were applied and at the end to do same plus to get rid of stuck requests. On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote: If you can reproduce it again, what we really need are the osd logs from the acting set of a pg stuck in scrub with debug osd = 20 debug ms = 1 debug filestore = 20. Thanks, -Sam On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov and...@xdel.ru wrote: On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote: On Thu, 22 Nov 2012, Andrey Korolyov wrote: Hi, In the recent versions Ceph introduces some unexpected behavior for the permanent connections (VM or kernel clients) - after crash recovery, I/O will hang on the next planned scrub on the following scenario: - launch a bunch of clients doing non-intensive writes, - lose one or more osd, mark them down, wait for recovery completion, - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script, or wait for ceph to do the same, - observe a raising number of pgs stuck in the active+clean+scrubbing state (they took a master role from ones which was on killed osd and almost surely they are being written in time of crash), - some time later, clients will hang hardly and ceph log introduce stuck(old) I/O requests. The only one way to return clients back without losing their I/O state is per-osd restart, which also will help to get rid of active+clean+scrubbing pgs. First of all, I`ll be happy to help to solve this problem by providing logs. If you can reproduce this behavior with 'debug osd = 20' and 'debug ms = 1' logging on the OSD, that would be wonderful! I have tested slightly different recovery flow, please see below. Since there is no real harm, like frozen I/O, placement groups also was stuck forever on the active+clean+scrubbing state, until I restarted all osds (end of the log): http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz - start the healthy cluster - start persistent clients - add an another host with pair of OSDs, let them be in the data placement - wait for data to rearrange - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I have an 1/2 hour delay on readjust in such case, I did ``ceph osd out'' manually) - watch for data to rearrange again - [22:51 timestamp] when it ends, start a manual rescrub, with non-zero active+clean+scrubbing-state placement groups at the end of process which `ll stay in this state forever until something happens After that, I can restart osds one per one, if I want to get rid of scrubbing states immediately and then do deep-scrub(if I don`t, those states will return at next ceph self-scrubbing) or do per-osd deep-scrub, if I have a lot of time. 
The case I have described in the previous message took place when I remove osd from data placement which existed on the moment when client(s) have started and indeed it is more harmful than current one(frozen I/O leads to hanging entire guest, for example). Since testing those flow took a lot of time, I`ll send logs related to this case tomorrow. Second question is not directly related to this problem, but I have thought on for a long time - is there a planned features to control scrub process more precisely, e.g. pg scrub rate or scheduled scrub, instead of current set of timeouts which of course not very predictable on when to run? Not yet. I would be interested in hearing what kind of control/config options/whatever you (and others) would like to see! Of course it will be awesome to have any determined scheduler or at least an option to disable automated scrubbing, since it is not very determined in time and deep-scrub eating a lot of I/O if command issued against entire OSD. Rate limiting is not in the first place, at least it may be recreated in external script, but for those who prefer to leave control to Ceph, it may be very useful. Thanks! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Review request: wip-localized-read-tests
I've pushed up patches for the first phase of testing read from replica functionality, which looks only at objecter/client level ops: wip-localized-read-tests

The major points are:

1. Run libcephfs tests w/ and w/o localized reads enabled
2. Add the performance counter in Objecter to record ops sent to replica
3. Add performance counter accessor in unit tests

Locally I have verified that the performance counters are working with a 3 OSD setup, although there are not yet any unit tests that try to specifically assert a positive value on the counters.

Thanks,
Noah
Re: rbd map command hangs for 15 minutes during system start up
On 11/29/2012 02:37 PM, Alex Elder wrote: On 11/22/2012 12:04 PM, Nick Bartos wrote: Here are the ceph log messages (including the libceph kernel debug stuff you asked for) from a node boot with the rbd command hung for a couple of minutes: I'm sorry, but I did something stupid... Yes, the branch I gave you includes these fixes. However it does *not* include the commit that was giving you trouble to begin with. So... I have updated that same branch (wip-nick) to contain: - Linux 3.5.5 - Plus the first *50* (not 49) patches you listed - Plus the ones I added before. The new commit id for that branch begins with be3198d6. I'm really sorry for this mistake. Please try this new branch and report back what you find. -Alex Nick, I have put together a branch that includes two fixes that might be helpful. I don't expect these fixes will necessarily *fix* what you're seeing, but one of them pulls a big hunk of processing out of the picture and might help eliminate some potential causes. I had to pull in several other patches as prerequisites in order to get those fixes to apply cleanly. Would you be able to give it a try, and let us know what results you get? The branch contains: - Linux 3.5.5 - Plus the first 49 patches you listed - Plus four patches, which are prerequisites... libceph: define ceph_extract_encoded_string() rbd: define some new format constants rbd: define rbd_dev_image_id() rbd: kill create_snap sysfs entry - ...for these two bug fixes: libceph: remove 'osdtimeout' option ceph: don't reference req after put The branch is available in the ceph-client git repository under the name wip-nick and has commit id dd9323aa. https://github.com/ceph/ceph-client/tree/wip-nick https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt This full debug output is very helpful. Please supply that again as well. Thanks. -Alex On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote: It's very easy to reproduce now with my automated install script, the most I've seen it succeed with that patch is 2 in a row, and hanging on the 3rd, although it hangs on most builds. So it shouldn't take much to get it to do it again. I'll try and get to that tomorrow, when I'm a bit more rested and my brain is working better. Yes during this the OSDs are probably all syncing up. All the osd and mon daemons have started by the time the rdb commands are ran, though. On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote: On Wed, 21 Nov 2012, Nick Bartos wrote: FYI the build which included all 3.5 backports except patch #50 is still going strong after 21 builds. Okay, that one at least makes some sense. I've opened http://tracker.newdream.net/issues/3519 How easy is this to reproduce? If it is something you can trigger with debugging enabled ('echo module libceph +p /sys/kernel/debug/dynamic_debug/control') that would help tremendously. I'm guessing that during this startup time the OSDs are still in the process of starting? Alex, I bet that a test that does a lot of map/unmap stuff in a loop while thrashing OSDs could hit this. Thanks! sage On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote: With 8 successful installs already done, I'm reasonably confident that it's patch #50. I'm making another build which applies all patches from the 3.5 backport branch, excluding that specific one. I'll let you know if that turns up any unexpected failures. What will the potential fall out be for removing that specific patch? 
On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com wrote: It's really looking like it's the libceph_resubmit_linger_ops_when_pg_mapping_changes commit. When patches 1-50 (listed below) are applied to 3.5.7, the hang is present. So far I have gone through 4 successful installs with no hang with only 1-49 applied. I'm still leaving my test run to make sure it's not a fluke, but since previously it hangs within the first couple of builds, it really looks like this is where the problem originated. 1-libceph_eliminate_connection_state_DEAD.patch 2-libceph_kill_bad_proto_ceph_connection_op.patch 3-libceph_rename_socket_callbacks.patch 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch 6-libceph_start_separating_connection_flags_from_state.patch 7-libceph_start_tracking_connection_socket_state.patch 8-libceph_provide_osd_number_when_creating_osd.patch 9-libceph_set_CLOSED_state_bit_in_con_init.patch 10-libceph_embed_ceph_connection_structure_in_mon_client.patch 11-libceph_drop_connection_refcounting_for_mon_client.patch 12-libceph_init_monitor_connection_when_opening.patch
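For anyone wanting to reproduce the test, checking out the updated branch is a standard git operation; a sketch (repository URL taken from the link above, everything else generic):

  git clone https://github.com/ceph/ceph-client.git
  cd ceph-client
  git checkout wip-nick
  git log --oneline -1    # should show the commit id quoted in the thread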
librbd: error finding header: (2) No such file or directory
Hi,

we are starting to see this error on some images:

 # rbd info kvm1207
 error opening image kvm1207: (2) No such file or directory
 2012-12-01 02:58:27.556677 7ffd50c60760 -1 librbd: error finding header: (2) No such file or directory

Is there any way to fix these images?

Best regards,
Simon
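A first diagnostic step is to check whether the image's header object still exists in RADOS. The sketch below assumes a format 1 image in the default 'rbd' pool, where the header object would be named kvm1207.rbd; adjust the pool and object name if that assumption does not hold:

  rados -p rbd stat kvm1207.rbd     # header object of a format 1 image (assumed name)
  rados -p rbd ls | grep kvm1207    # check what objects with that name prefix remain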
Re: endless flying slow requests
I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7. Let me know if the problem persists with this patch. -Sam On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote: On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote: Also, these clusters aren't mixed argonaut and next, are they? (Not that that shouldn't work, but it would be a useful data point.) -Sam On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote: Did you observe hung io along with that error? Both sub_op_commit and sub_op_applied have happened, so the sub_op_reply should have been sent back to the primary. This looks more like a leak. If you also observed hung io, then it's possible that the problem is occurring between the sub_op_applied event and the response. -Sam It is relatively easy to check if one of client VMs has locked one or more cores to iowait or just hangs, so yes, these ops are related to real commit operations and they are hanged. I`m using all-new 0.54 cluster, without mixing of course. Does everyone who hit that bug readjusted cluster before bug shows itself(say, in a day-long distance)? On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote: On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote: Hi Stefan, On Thu, 15 Nov 2012, Sage Weil wrote: On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote: Am 14.11.2012 15:59, schrieb Sage Weil: Hi Stefan, I would be nice to confirm that no clients are waiting on replies for these requests; currently we suspect that the OSD request tracking is the buggy part. If you query the OSD admin socket you should be able to dump requests and see the client IP, and then query the client. Is it librbd? In that case you likely need to change the config so that it is listening on an admin socket ('admin socket = path'). Yes it is. So i have to specify admin socket at the KVM host? Right. IIRC the disk line is a ; (or \;) separated list of key/value pairs. How do i query the admin socket for requests? ceph --admin-daemon /path/to/socket help ceph --admin-daemon /path/to/socket objecter_dump (i think) Were you able to reproduce this? Thanks! sage Meanwhile, I did. :) Such requests will always be created if you have restarted or marked an osd out and then back in and scrub didn`t happen in the meantime (after such operation and before request arrival). What is more interesting, the hangup happens not exactly at the time of operation, but tens of minutes later. { description: osd_sub_op(client.1292013.0:45422 4.731 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v 16444'21693 snapset=0=[]:[] snapc=0=[]), received_at: 2012-11-28 03:54:43.094151, age: 27812.942680, duration: 2.676641, flag_point: started, events: [ { time: 2012-11-28 03:54:43.094222, event: waiting_for_osdmap}, { time: 2012-11-28 03:54:43.386890, event: reached_pg}, { time: 2012-11-28 03:54:43.386894, event: started}, { time: 2012-11-28 03:54:43.386973, event: commit_queued_for_journal_write}, { time: 2012-11-28 03:54:45.360049, event: write_thread_in_journal_buffer}, { time: 2012-11-28 03:54:45.586183, event: journaled_completion_queued}, { time: 2012-11-28 03:54:45.586262, event: sub_op_commit}, { time: 2012-11-28 03:54:45.770792, event: sub_op_applied}]}]} sage Stefan On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote: Hello list, i see this several times. Endless flying slow requests. And they never stop until i restart the mentioned osd. 
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31789.858457 secs 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed 2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31790.858646 secs 2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed 2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for 31791.858827 secs 2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827
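For the librbd case discussed above, the client needs an admin socket configured on the KVM host before it can be queried; the snippet below is only a sketch (the path, section name, and the exact dump command are assumptions that may differ by version - 'help' lists what the socket actually supports):

  # in /etc/ceph/ceph.conf on the KVM host (assumed location):
  #   [client]
  #       admin socket = /var/run/ceph/$name.$pid.asok

  ceph --admin-daemon /var/run/ceph/client.admin.12345.asok help
  ceph --admin-daemon /var/run/ceph/client.admin.12345.asok objecter_requests   # may be named objecter_dump on older builds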
Re: Hangup during scrubbing - possible solutions
Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7. Let me know if it persists. Thanks for the logs! -Sam On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just sam.j...@inktank.com wrote: Hah! Thanks for the log, it's our handling of active_pushes. I'll have a patch shortly. Thanks! -Sam On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov and...@xdel.ru wrote: http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz http://xdel.ru/downloads/ceph-log/cluster-w.log.gz Here, please. I have initiated a deep-scrub of osd.1 which was lead to forever-stuck I/O requests in a short time(scrub `ll do the same). Second log may be useful for proper timestamps, as seeks on the original may took a long time. Osd processes on the specific node was restarted twice - at the beginning to be sure all config options were applied and at the end to do same plus to get rid of stuck requests. On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote: If you can reproduce it again, what we really need are the osd logs from the acting set of a pg stuck in scrub with debug osd = 20 debug ms = 1 debug filestore = 20. Thanks, -Sam On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov and...@xdel.ru wrote: On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote: On Thu, 22 Nov 2012, Andrey Korolyov wrote: Hi, In the recent versions Ceph introduces some unexpected behavior for the permanent connections (VM or kernel clients) - after crash recovery, I/O will hang on the next planned scrub on the following scenario: - launch a bunch of clients doing non-intensive writes, - lose one or more osd, mark them down, wait for recovery completion, - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script, or wait for ceph to do the same, - observe a raising number of pgs stuck in the active+clean+scrubbing state (they took a master role from ones which was on killed osd and almost surely they are being written in time of crash), - some time later, clients will hang hardly and ceph log introduce stuck(old) I/O requests. The only one way to return clients back without losing their I/O state is per-osd restart, which also will help to get rid of active+clean+scrubbing pgs. First of all, I`ll be happy to help to solve this problem by providing logs. If you can reproduce this behavior with 'debug osd = 20' and 'debug ms = 1' logging on the OSD, that would be wonderful! I have tested slightly different recovery flow, please see below. Since there is no real harm, like frozen I/O, placement groups also was stuck forever on the active+clean+scrubbing state, until I restarted all osds (end of the log): http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz - start the healthy cluster - start persistent clients - add an another host with pair of OSDs, let them be in the data placement - wait for data to rearrange - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I have an 1/2 hour delay on readjust in such case, I did ``ceph osd out'' manually) - watch for data to rearrange again - [22:51 timestamp] when it ends, start a manual rescrub, with non-zero active+clean+scrubbing-state placement groups at the end of process which `ll stay in this state forever until something happens After that, I can restart osds one per one, if I want to get rid of scrubbing states immediately and then do deep-scrub(if I don`t, those states will return at next ceph self-scrubbing) or do per-osd deep-scrub, if I have a lot of time. 
The case I have described in the previous message took place when I remove osd from data placement which existed on the moment when client(s) have started and indeed it is more harmful than current one(frozen I/O leads to hanging entire guest, for example). Since testing those flow took a lot of time, I`ll send logs related to this case tomorrow. Second question is not directly related to this problem, but I have thought on for a long time - is there a planned features to control scrub process more precisely, e.g. pg scrub rate or scheduled scrub, instead of current set of timeouts which of course not very predictable on when to run? Not yet. I would be interested in hearing what kind of control/config options/whatever you (and others) would like to see! Of course it will be awesome to have any determined scheduler or at least an option to disable automated scrubbing, since it is not very determined in time and deep-scrub eating a lot of I/O if command issued against entire OSD. Rate limiting is not in the first place, at least it may be recreated in external script, but for those who prefer to leave control to Ceph, it may be very useful. Thanks! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this