Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Hajnoczi
On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 @@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
  {
  RBDAIOCB *acb = (RBDAIOCB *) blockacb;
  acb->cancelled = 1;
 +
 +while (acb->status == -EINPROGRESS) {
 +qemu_aio_wait();
 +}
  }

  static const AIOCBInfo rbd_aiocb_info = {
 @@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
  acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
  qemu_bh_delete(acb->bh);
  acb->bh = NULL;
 +acb->status = 0;

  qemu_aio_release(acb);
  }

We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1
because qemu_rbd_aio_cancel() still accesses it.  This was discussed
in an earlier version of the patch.
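For illustration, a minimal, self-contained sketch of the ownership rule under
discussion, using simplified hypothetical types rather than the real QEMU
structures: whichever of completion and cancel runs last performs the single
release.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdlib.h>

    struct acb {                   /* stand-in for RBDAIOCB */
        int  status;               /* -EINPROGRESS while the request is in flight */
        bool cancelled;            /* set by the cancel path */
    };

    /* completion side (models rbd_aio_bh_cb): must not free while a
     * canceller may still be polling the status field */
    static void complete(struct acb *a)
    {
        a->status = 0;             /* lets the cancel loop stop waiting */
        if (!a->cancelled) {
            free(a);               /* normal path: completion does the release */
        }
    }

    /* cancel side (models qemu_rbd_aio_cancel): wait for completion,
     * then take over the final release */
    static void cancel(struct acb *a)
    {
        a->cancelled = true;
        while (a->status == -EINPROGRESS) {
            /* QEMU calls qemu_aio_wait() here; omitted in this sketch */
        }
        free(a);
    }

    int main(void)
    {
        struct acb *a = calloc(1, sizeof(*a));
        a->status = -EINPROGRESS;
        complete(a);               /* uncancelled request: freed here */
        return 0;
    }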

Stefan


[PATCHv6] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Priebe
This one fixes a race, which qemu also had in the iscsi block driver,
between cancellation and I/O completion.

qemu_rbd_aio_cancel was not synchronously waiting for the end of
the command.

To achieve this it introduces a new status flag which uses
-EINPROGRESS.

Changes since PATCHv5:
- qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O
  was cancelled

Changes since PATCHv4:
- removed unnecessary qemu_vfree of acb->bounce as BH will always
  run

Changes since PATCHv3:
- removed unnecessary if condition in rbd_start_aio as we
  haven't started I/O yet
- moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always
  waits until the BH was executed

Changes since PATCHv2:
- fixed missing braces
- added vfree for bounce

Signed-off-by: Stefan Priebe s.pri...@profihost.ag

---
 block/rbd.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f3becc7..737bab1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -77,6 +77,7 @@ typedef struct RBDAIOCB {
 int error;
 struct BDRVRBDState *s;
 int cancelled;
+int status;
 } RBDAIOCB;
 
 typedef struct RADOSCB {
@@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 RBDAIOCB *acb = rcb->acb;
 int64_t r;
 
-if (acb->cancelled) {
-qemu_vfree(acb->bounce);
-qemu_aio_release(acb);
-goto done;
-}
-
 r = rcb->ret;
 
 if (acb->cmd == RBD_AIO_WRITE ||
@@ -409,7 +404,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 /* Note that acb->bh can be NULL in case where the aio was cancelled */
 acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
 qemu_bh_schedule(acb->bh);
-done:
 g_free(rcb);
 }
 
@@ -568,6 +562,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
 RBDAIOCB *acb = (RBDAIOCB *) blockacb;
 acb->cancelled = 1;
+
+while (acb->status == -EINPROGRESS) {
+qemu_aio_wait();
+}
+
+qemu_aio_release(acb);
 }
 
 static const AIOCBInfo rbd_aiocb_info = {
@@ -639,8 +639,11 @@ static void rbd_aio_bh_cb(void *opaque)
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 qemu_bh_delete(acb->bh);
 acb->bh = NULL;
+acb->status = 0;
 
-qemu_aio_release(acb);
+if (!acb->cancelled) {
+qemu_aio_release(acb);
+}
 }
 
 static int rbd_aio_discard_wrapper(rbd_image_t image,
@@ -685,6 +688,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
 acb->s = s;
 acb->cancelled = 0;
 acb->bh = NULL;
+acb->status = -EINPROGRESS;
 
 if (cmd == RBD_AIO_WRITE) {
 qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-- 
1.7.10.4



Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Priebe - Profihost AG

fixed in V6

On 30.11.2012 09:26, Stefan Hajnoczi wrote:

On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe s.pri...@profihost.ag wrote:

@@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
  {
  RBDAIOCB *acb = (RBDAIOCB *) blockacb;
  acb->cancelled = 1;
+
+while (acb->status == -EINPROGRESS) {
+qemu_aio_wait();
+}
  }

  static const AIOCBInfo rbd_aiocb_info = {
@@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
  acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
  qemu_bh_delete(acb->bh);
  acb->bh = NULL;
+acb->status = 0;

  qemu_aio_release(acb);
  }


We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1
because qemu_rbd_aio_cancel() still accesses it.  This was discussed
in an earlier version of the patch.

Stefan




Re: Hangup during scrubbing - possible solutions

2012-11-30 Thread Andrey Korolyov
http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

Here, please.

I initiated a deep-scrub of osd.1, which led to forever-stuck I/O
requests in a short time (a plain scrub will do the same). The second
log may be useful for proper timestamps, as seeks on the original may
take a long time. The osd processes on the specific node were restarted
twice: at the beginning, to be sure all config options were applied,
and at the end, to do the same plus get rid of the stuck requests.


On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote:
 If you can reproduce it again, what we really need are the osd logs
 from the acting set of a pg stuck in scrub with
 debug osd = 20
 debug ms = 1
 debug filestore = 20.

 Thanks,
 -Sam

 On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 22 Nov 2012, Andrey Korolyov wrote:
 Hi,

 In the recent versions Ceph introduces some unexpected behavior for
 the permanent connections (VM or kernel clients) - after crash
 recovery, I/O will hang on the next planned scrub on the following
 scenario:

 - launch a bunch of clients doing non-intensive writes,
 - lose one or more osd, mark them down, wait for recovery completion,
 - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
 or wait for ceph to do the same,
 - observe a raising number of pgs stuck in the active+clean+scrubbing
 state (they took a master role from ones which was on killed osd and
 almost surely they are being written in time of crash),
 - some time later, clients will hang hardly and ceph log introduce
 stuck(old) I/O requests.

 The only one way to return clients back without losing their I/O state
 is per-osd restart, which also will help to get rid of
 active+clean+scrubbing pgs.

 First of all, I`ll be happy to help to solve this problem by providing
 logs.

 If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
 1' logging on the OSD, that would be wonderful!


 I have tested slightly different recovery flow, please see below.
 Since there is no real harm, like frozen I/O, placement groups also
 was stuck forever on the active+clean+scrubbing state, until I
 restarted all osds (end of the log):

 http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

 - start the healthy cluster
 - start persistent clients
 - add an another host with pair of OSDs, let them be in the data placement
 - wait for data to rearrange
 - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I
 have an 1/2 hour delay on readjust in such case, I did ``ceph osd
 out'' manually)
 - watch for data to rearrange again
 - [22:51 timestamp] when it ends, start a manual rescrub, with
 non-zero active+clean+scrubbing-state placement groups at the end of
 process which `ll stay in this state forever until something happens

 After that, I can restart osds one per one, if I want to get rid of
 scrubbing states immediately and then do deep-scrub(if I don`t, those
 states will return at next ceph self-scrubbing) or do per-osd
 deep-scrub, if I have a lot of time. The case I have described in the
 previous message took place when I remove osd from data placement
 which existed on the moment when client(s) have started and indeed it
 is more harmful than current one(frozen I/O leads to hanging entire
 guest, for example). Since testing those flow took a lot of time, I`ll
 send logs related to this case tomorrow.

 Second question is not directly related to this problem, but I
 have thought on for a long time - is there a planned features to
 control scrub process more precisely, e.g. pg scrub rate or scheduled
 scrub, instead of current set of timeouts which of course not very
 predictable on when to run?

 Not yet.  I would be interested in hearing what kind of control/config
 options/whatever you (and others) would like to see!

 Of course it will be awesome to have any determined scheduler or at
 least an option to disable automated scrubbing, since it is not very
 determined in time and deep-scrub eating a lot of I/O if command
 issued against entire OSD. Rate limiting is not in the first place, at
 least it may be recreated in external script, but for those who prefer
 to leave control to Ceph, it may be very useful.

 Thanks!


Re: [PATCHv6] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Hajnoczi
On Fri, Nov 30, 2012 at 9:55 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 This one fixes a race, which qemu also had in the iscsi block driver,
 between cancellation and I/O completion.

 qemu_rbd_aio_cancel was not synchronously waiting for the end of
 the command.

 To achieve this it introduces a new status flag which uses
 -EINPROGRESS.

 Changes since PATCHv5:
 - qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O
   was cancelled

 Changes since PATCHv4:
 - removed unnecessary qemu_vfree of acb->bounce as BH will always
   run

 Changes since PATCHv3:
 - removed unnecessary if condition in rbd_start_aio as we
   haven't started I/O yet
 - moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always
   waits until the BH was executed

 Changes since PATCHv2:
 - fixed missing braces
 - added vfree for bounce

 Signed-off-by: Stefan Priebe s.pri...@profihost.ag

 ---
  block/rbd.c |   20 ++++++++++++--------
  1 file changed, 12 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@gmail.com


[PATCH 0/2] rbd: fix two memory leaks

2012-11-30 Thread Alex Elder
This series fixes two memory leaks that occur whenever a special
(non-I/O) osd request is issued in rbd.

-Alex

[PATCH 1/2] rbd: don't leak rbd_req on synchronous requests
[PATCH 2/2] rbd: don't leak rbd_req for rbd_req_sync_notify_ack()


[PATCH 1/2] rbd: don't leak rbd_req on synchronous requests

2012-11-30 Thread Alex Elder
When rbd_do_request() is called it allocates and populates an
rbd_req structure to hold information about the osd request to be
sent.  This is done for the benefit of the callback function (in
particular, rbd_req_cb()), which uses this in processing when
the request completes.

Synchronous requests provide no callback function, in which case
rbd_do_request() waits for the request to complete before returning.
This case does not free the rbd_req structure as it should, so the
structure gets leaked.

Note however that the synchronous case has no need for the rbd_req
structure at all.  So rather than simply freeing this structure for
synchronous requests, just don't allocate it to begin with.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   48 ++++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index acdb4a6..78493e7 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1160,20 +1160,11 @@ static int rbd_do_request(struct request *rq,
 struct ceph_msg *),
  u64 *ver)
 {
+   struct ceph_osd_client *osdc;
struct ceph_osd_request *osd_req;
-   int ret;
+   struct rbd_request *rbd_req = NULL;
struct timespec mtime = CURRENT_TIME;
-   struct rbd_request *rbd_req;
-   struct ceph_osd_client *osdc;
-
-   rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
-   if (!rbd_req)
-   return -ENOMEM;
-
-   if (coll) {
-   rbd_req->coll = coll;
-   rbd_req->coll_index = coll_index;
-   }
+   int ret;

 dout("rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n",
object_name, (unsigned long long) ofs,
@@ -1181,10 +1172,8 @@ static int rbd_do_request(struct request *rq,

 osdc = rbd_dev->rbd_client->client->osdc;
osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO);
-   if (!osd_req) {
-   ret = -ENOMEM;
-   goto done_pages;
-   }
+   if (!osd_req)
+   return -ENOMEM;

 osd_req->r_flags = flags;
 osd_req->r_pages = pages;
@@ -1192,13 +1181,22 @@ static int rbd_do_request(struct request *rq,
 osd_req->r_bio = bio;
 bio_get(osd_req->r_bio);
 }
-   osd_req->r_callback = rbd_cb;

-   rbd_req->rq = rq;
-   rbd_req->bio = bio;
-   rbd_req->pages = pages;
-   rbd_req->len = len;
+   if (rbd_cb) {
+   ret = -ENOMEM;
+   rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
+   if (!rbd_req)
+   goto done_osd_req;
+
+   rbd_req->rq = rq;
+   rbd_req->bio = bio;
+   rbd_req->pages = pages;
+   rbd_req->len = len;
+   rbd_req->coll = coll;
+   rbd_req->coll_index = coll ? coll_index : 0;
+   }

+   osd_req->r_callback = rbd_cb;
 osd_req->r_priv = rbd_req;

 strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
@@ -1233,10 +1231,12 @@ static int rbd_do_request(struct request *rq,
return ret;

 done_err:
-   bio_chain_put(rbd_req->bio);
-   ceph_osdc_put_request(osd_req);
-done_pages:
+   if (bio)
+   bio_chain_put(osd_req->r_bio);
kfree(rbd_req);
+done_osd_req:
+   ceph_osdc_put_request(osd_req);
+
return ret;
 }

-- 
1.7.9.5



[PATCH 2/2] rbd: don't leak rbd_req for rbd_req_sync_notify_ack()

2012-11-30 Thread Alex Elder
When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies
rbd_simple_req_cb() as its callback function.  Because the callback
is supplied, an rbd_req structure gets allocated and populated so it
can be used by the callback.  However rbd_simple_req_cb() is not
freeing (or even using) the rbd_req structure, so it's getting
leaked.

Since rbd_simple_req_cb() has no need for the rbd_req structure,
just avoid allocating one for this case.  Of the three calls to
rbd_do_request(), only the one from rbd_do_op() needs the rbd_req
structure, and that call can be distinguished from the other two
because it supplies a non-null rbd_collection pointer.

So fix this leak by only allocating the rbd_req structure if a
non-null coll value is provided to rbd_do_request().

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 78493e7..fca0ebf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1182,7 +1182,7 @@ static int rbd_do_request(struct request *rq,
 bio_get(osd_req->r_bio);
}

-   if (rbd_cb) {
+   if (coll) {
ret = -ENOMEM;
rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
if (!rbd_req)
@@ -1193,7 +1193,7 @@ static int rbd_do_request(struct request *rq,
 rbd_req->pages = pages;
 rbd_req->len = len;
 rbd_req->coll = coll;
-   rbd_req->coll_index = coll ? coll_index : 0;
+   rbd_req->coll_index = coll_index;
}

 osd_req->r_callback = rbd_cb;
-- 
1.7.9.5



[PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed

2012-11-30 Thread Jim Schutt
Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule
  step chooseleaf firstn 0 type <node_type>

This rule means that n replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of node_type.

When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the node_type bucket
that held the failed leaf.  This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular node_type bucket, that remaining leaf holds all the data from
its failed peers.

This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.

If the tree descent down to node_type is the outer descent, and the descent
from node_type down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.

In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf.  Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary.  This
requires that OSD's data for that PG to be moved as well.  This
seems unavoidable but should be relatively rare.
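For illustration, a sketch of how the new tunable would typically be enabled in
a cluster's CRUSH map from userspace; the crushtool/ceph invocations are
standard, but the tunable line and the temporary paths are assumptions, not
part of this kernel patch:

    # decompile the current CRUSH map
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

    # add the tunable near the top of the decompiled map, e.g.:
    #   tunable chooseleaf_descend_once 1

    # recompile and inject it back
    crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
    ceph osd setcrushmap -i /tmp/crushmap.new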

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 include/linux/ceph/ceph_features.h |4 +++-
 include/linux/crush/crush.h|2 ++
 net/ceph/crush/mapper.c|   13 ++++++++++---
 net/ceph/osdmap.c  |6 ++++++
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index dad579b..61e5af4 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -14,13 +14,15 @@
 #define CEPH_FEATURE_DIRLAYOUTHASH  (1<<7)
 /* bits 8-17 defined by user-space; not supported yet here */
 #define CEPH_FEATURE_CRUSH_TUNABLES (1<<18)
+#define CEPH_FEATURE_CRUSH_TUNABLES2 (1<<25)
 
 /*
  * Features supported.
  */
 #define CEPH_FEATURES_SUPPORTED_DEFAULT  \
(CEPH_FEATURE_NOSRCADDR |\
-CEPH_FEATURE_CRUSH_TUNABLES)
+CEPH_FEATURE_CRUSH_TUNABLES |   \
+CEPH_FEATURE_CRUSH_TUNABLES2)
 
 #define CEPH_FEATURES_REQUIRED_DEFAULT   \
(CEPH_FEATURE_NOSRCADDR)
diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h
index 25baa28..6a1101f 100644
--- a/include/linux/crush/crush.h
+++ b/include/linux/crush/crush.h
@@ -162,6 +162,8 @@ struct crush_map {
__u32 choose_local_fallback_tries;
/* choose attempts before giving up */ 
__u32 choose_total_tries;
+   /* attempt chooseleaf inner descent once; on failure retry outer descent */
+   __u32 chooseleaf_descend_once;
 };
 
 
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 35fce75..96c8a58 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -287,6 +287,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, int x)
  * @outpos: our position in that vector
  * @firstn: true if choosing first n items, false if choosing indep
  * @recurse_to_leaf: true if we want one device under each item of given type
+ * @descend_once: true if we should only try one descent before giving up
  * @out2: second output vector for leaf items (if @recurse_to_leaf)
  */
 static int crush_choose(const struct crush_map *map,
@@ -295,7 +296,7 @@ static int crush_choose(const struct crush_map *map,
int x, int numrep, int type,
int *out, int outpos,
int firstn, int recurse_to_leaf,
-   int *out2)
+   int descend_once, int *out2)
 {
int rep;
unsigned int ftotal, flocal;
@@ -399,6 +400,7 @@ static int crush_choose(const struct crush_map *map,
 x, outpos+1, 0,
 out2, outpos,
 firstn, 0,
+map->chooseleaf_descend_once,
 NULL) <= outpos)
/* didn't get leaf */
reject = 1;
@@ -422,7 +424,10 @@ reject:

Re: OSD daemon changes port no

2012-11-30 Thread Sage Weil
What kernel version and mds version are you running?  I did

# ceph osd pool create foo 12
# ceph osd pool create bar 12
# ceph mds add_data_pool 3
# ceph mds add_data_pool 4

and from a kernel mount

# mkdir foo
# mkdir bar
# cephfs foo set_layout --pool 3
# cephfs bar set_layout --pool 4
# cephfs foo show_layout
layout.data_pool: 3
layout.object_size:   4194304
layout.stripe_unit:   4194304
layout.stripe_count:  1
# cephfs bar show_layout 
layout.data_pool: 4
layout.object_size:   4194304
layout.stripe_unit:   4194304
layout.stripe_count:  1

This much you can test without playing with the crush map, btw.

Maybe there is some crazy bug when the set_layouts are pipelined?  Try
without using '&&'?
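For example, a sketch of running them one at a time (pool ids taken from the
report quoted below), verifying each layout before setting the next:

    cephfs /home/hemant/x set_layout --pool 5 -c 1 -u 4194304 -s 4194304
    cephfs /home/hemant/x show_layout
    cephfs /home/hemant/y set_layout --pool 6 -c 1 -u 4194304 -s 4194304
    cephfs /home/hemant/y show_layout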

sage


On Fri, 30 Nov 2012, hemant surale wrote:

 Hi Sage, Community,
   I am unable to use 2 directories to direct data to 2 different
 pools. I did the following experiment.
 
 Created 2 pools, host & ghost, to separate data placement.
 --//crushmap file
 ---
 # begin crush map
 
 # devices
 device 0 osd.0
 device 1 osd.1
 device 2 osd.2
 device 3 osd.3
 
 # types
 type 0 osd
 type 1 host
 type 2 rack
 type 3 row
 type 4 room
 type 5 datacenter
 type 6 pool
 type 7 ghost
 
 # buckets
 host hemantone-mirror-virtual-machine {
 id -6   # do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0  # rjenkins1
 item osd.2 weight 1.000
 }
 host hemantone-virtual-machine {
 id -7   # do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0  # rjenkins1
 item osd.1 weight 1.000
 }
 rack one {
 id -2   # do not change unnecessarily
 # weight 2.000
 alg straw
 hash 0  # rjenkins1
 item hemantone-mirror-virtual-machine weight 1.000
 item hemantone-virtual-machine weight 1.000
 }
 ghost hemant-virtual-machine {
 id -4   # do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0  # rjenkins1
 item osd.0 weight 1.000
 }
 ghost hemant-mirror-virtual-machine {
 id -5   # do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0  # rjenkins1
 item osd.3 weight 1.000
 }
 rack two {
 id -3   # do not change unnecessarily
 # weight 2.000
 alg straw
 hash 0  # rjenkins1
 item hemant-virtual-machine weight 1.000
 item hemant-mirror-virtual-machine weight 1.000
 }
 pool default {
 id -1   # do not change unnecessarily
 # weight 4.000
 alg straw
 hash 0  # rjenkins1
 item one weight 2.000
 item two weight 2.000
 }
 
 # rules
 rule data {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step take one
 step chooseleaf firstn 0 type host
 step emit
 }
 rule metadata {
 ruleset 1
 type replicated
 min_size 1
 max_size 10
 step take default
 step take one
 step chooseleaf firstn 0 type host
 step emit
 }
 rule rbd {
 ruleset 2
 type replicated
 min_size 1
 max_size 10
 step take default
 step take one
 step chooseleaf firstn 0 type host
 step emit
 }
 rule forhost {
 ruleset 3
 type replicated
 min_size 1
 max_size 10
 step take default
 step take one
 step chooseleaf firstn 0 type host
 step emit
 }
 rule forghost {
 ruleset 4
 type replicated
 min_size 1
 max_size 10
 step take default
 step take two
 step chooseleaf firstn 0 type ghost
 step emit
 }
 
 # end crush map
 
 1) Set the replication factor to 2 and the crush rules accordingly (the host
 pool got crush_ruleset = 3 and the ghost pool got crush_ruleset = 4).
 2) Now I mounted the data dirs using mount.ceph 10.72.148.245:6789:/
 /home/hemant/x && mount.ceph 10.72.148.245:6789:/ /home/hemant/y
 3) then mds add_data_pool 5 && mds add_data_pool 6 (here the pool ids
 are host = 5, ghost = 6)
 4) cephfs /home/hemant/x set_layout --pool 5 -c 1 -u 4194304 -s
 4194304 && cephfs /home/hemant/y set_layout --pool 6 -c 1 -u 4194304
 -s 4194304
 
 PROBLEM:
  $ cephfs /home/hemant/x show_layout
 layout.data_pool: 6
 layout.object_size:   4194304
 layout.stripe_unit:   4194304
 layout.stripe_count:  1
 cephfs /home/hemant/y show_layout
 layout.data_pool: 6
 layout.object_size:   4194304
 layout.stripe_unit:   4194304
 layout.stripe_count:  1
 
 Both dirs are using the same pool to place data even after I specified
 separate pools using the cephfs cmd.
 Please help me figure this out.
 
 -
 

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Nick Bartos
My initial tests using a 3.5.7 kernel with the 55 patches from
wip-nick are going well.  So far I've gone through 8 installs without
an incident; I'll let it run for a bit longer to see if it crops up
again.

Can I get a branch with these patches integrated into all of the
backported patches to 3.5.x?  I'd like to get this into our main
testing branch, which is currently running 3.5.7 with the patches from
wip-3.5 excluding the
libceph_resubmit_linger_ops_when_pg_mapping_changes patch.

Note that we had a case of an rbd map hang with our main testing
branch, but I don't have a script that can reproduce that yet.  It was
after the cluster was all up and working, and we were doing a rolling
reboot (cycling through each node).


On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder el...@inktank.com wrote:
 On 11/22/2012 12:04 PM, Nick Bartos wrote:
 Here are the ceph log messages (including the libceph kernel debug
 stuff you asked for) from a node boot with the rbd command hung for a
 couple of minutes:

 Nick, I have put together a branch that includes two fixes
 that might be helpful.  I don't expect these fixes will
 necessarily *fix* what you're seeing, but one of them
 pulls a big hunk of processing out of the picture and
 might help eliminate some potential causes.  I had to
 pull in several other patches as prerequisites in order
 to get those fixes to apply cleanly.

 Would you be able to give it a try, and let us know what
 results you get?  The branch contains:
 - Linux 3.5.5
 - Plus the first 49 patches you listed
 - Plus four patches, which are prerequisites...
 libceph: define ceph_extract_encoded_string()
 rbd: define some new format constants
 rbd: define rbd_dev_image_id()
 rbd: kill create_snap sysfs entry
 - ...for these two bug fixes:
 libceph: remove 'osdtimeout' option
 ceph: don't reference req after put

 The branch is available in the ceph-client git repository
 under the name wip-nick and has commit id dd9323aa.
 https://github.com/ceph/ceph-client/tree/wip-nick

 https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt

 This full debug output is very helpful.  Please supply
 that again as well.

 Thanks.

 -Alex

 On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote:
 It's very easy to reproduce now with my automated install script, the
 most I've seen it succeed with that patch is 2 in a row, and hanging
 on the 3rd, although it hangs on most builds.  So it shouldn't take
 much to get it to do it again.  I'll try and get to that tomorrow,
 when I'm a bit more rested and my brain is working better.

 Yes during this the OSDs are probably all syncing up.  All the osd and
 mon daemons have started by the time the rbd commands are run, though.

 On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 21 Nov 2012, Nick Bartos wrote:
 FYI the build which included all 3.5 backports except patch #50 is
 still going strong after 21 builds.

 Okay, that one at least makes some sense.  I've opened

 http://tracker.newdream.net/issues/3519

 How easy is this to reproduce?  If it is something you can trigger with
 debugging enabled ('echo module libceph +p >
 /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
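As a sketch, enabling that debug output usually amounts to the following
(assuming debugfs is mounted at the usual location):

    mount -t debugfs none /sys/kernel/debug      # only if not already mounted
    echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
    # reproduce the hang, then capture the kernel log (dmesg/syslog) for the report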

 I'm guessing that during this startup time the OSDs are still in the
 process of starting?

 Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
 thrashing OSDs could hit this.

 Thanks!
 sage



 On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote:
 With 8 successful installs already done, I'm reasonably confident that
 it's patch #50.  I'm making another build which applies all patches
 from the 3.5 backport branch, excluding that specific one.  I'll let
 you know if that turns up any unexpected failures.

 What will the potential fall out be for removing that specific patch?


 On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com 
 wrote:
 It's really looking like it's the
 libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
 patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
  So far I have gone through 4 successful installs with no hang with
 only 1-49 applied.  I'm still leaving my test run to make sure it's
 not a fluke, but since previously it hangs within the first couple of
 builds, it really looks like this is where the problem originated.

 1-libceph_eliminate_connection_state_DEAD.patch
 2-libceph_kill_bad_proto_ceph_connection_op.patch
 3-libceph_rename_socket_callbacks.patch
 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
 6-libceph_start_separating_connection_flags_from_state.patch
 7-libceph_start_tracking_connection_socket_state.patch
 8-libceph_provide_osd_number_when_creating_osd.patch
 9-libceph_set_CLOSED_state_bit_in_con_init.patch
 

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/30/2012 12:49 PM, Nick Bartos wrote:
 My initial tests using a 3.5.7 kernel with the 55 patches from
 wip-nick are going well.  So far I've gone through 8 installs without
 an incident, I'll leave it run for a bit longer to see if it crops up
 again.

This is great news!  Now I wonder which of the two fixes took
care of the problem...

 Can I get a branch with these patches integrated into all of the
 backported patches to 3.5.x?  I'd like to get this into our main
 testing branch, which is currently running 3.5.7 with the patches from
 wip-3.5 excluding the
 libceph_resubmit_linger_ops_when_pg_mapping_changes patch.

I will put together a new branch that includes the remainder
of those patches for you shortly.

 Note that we had a case of a rbd map hang with our main testing
 branch, but I don't have a script that can reproduce that yet.  It was
 after the cluster was all up and working, and we were  doing a rolling
 reboot (cycling through each node).

If you are able to reproduce this please let us know.

-Alex

 
 
 On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder el...@inktank.com wrote:
 On 11/22/2012 12:04 PM, Nick Bartos wrote:
 Here are the ceph log messages (including the libceph kernel debug
 stuff you asked for) from a node boot with the rbd command hung for a
 couple of minutes:

 Nick, I have put together a branch that includes two fixes
 that might be helpful.  I don't expect these fixes will
 necessarily *fix* what you're seeing, but one of them
 pulls a big hunk of processing out of the picture and
 might help eliminate some potential causes.  I had to
 pull in several other patches as prerequisites in order
 to get those fixes to apply cleanly.

 Would you be able to give it a try, and let us know what
 results you get?  The branch contains:
 - Linux 3.5.5
 - Plus the first 49 patches you listed
 - Plus four patches, which are prerequisites...
 libceph: define ceph_extract_encoded_string()
 rbd: define some new format constants
 rbd: define rbd_dev_image_id()
 rbd: kill create_snap sysfs entry
 - ...for these two bug fixes:
 libceph: remove 'osdtimeout' option
 ceph: don't reference req after put

 The branch is available in the ceph-client git repository
 under the name wip-nick and has commit id dd9323aa.
 https://github.com/ceph/ceph-client/tree/wip-nick

 https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt

 This full debug output is very helpful.  Please supply
 that again as well.

 Thanks.

 -Alex

 On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote:
 It's very easy to reproduce now with my automated install script, the
 most I've seen it succeed with that patch is 2 in a row, and hanging
 on the 3rd, although it hangs on most builds.  So it shouldn't take
 much to get it to do it again.  I'll try and get to that tomorrow,
 when I'm a bit more rested and my brain is working better.

 Yes during this the OSDs are probably all syncing up.  All the osd and
 mon daemons have started by the time the rdb commands are ran, though.

 On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 21 Nov 2012, Nick Bartos wrote:
 FYI the build which included all 3.5 backports except patch #50 is
 still going strong after 21 builds.

 Okay, that one at least makes some sense.  I've opened

 http://tracker.newdream.net/issues/3519

 How easy is this to reproduce?  If it is something you can trigger with
  debugging enabled ('echo module libceph +p >
  /sys/kernel/debug/dynamic_debug/control') that would help tremendously.

 I'm guessing that during this startup time the OSDs are still in the
 process of starting?

 Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
 thrashing OSDs could hit this.

 Thanks!
 sage



 On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com 
 wrote:
 With 8 successful installs already done, I'm reasonably confident that
 it's patch #50.  I'm making another build which applies all patches
 from the 3.5 backport branch, excluding that specific one.  I'll let
 you know if that turns up any unexpected failures.

 What will the potential fall out be for removing that specific patch?


 On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com 
 wrote:
 It's really looking like it's the
 libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
 patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
  So far I have gone through 4 successful installs with no hang with
 only 1-49 applied.  I'm still leaving my test run to make sure it's
 not a fluke, but since previously it hangs within the first couple of
 builds, it really looks like this is where the problem originated.

 1-libceph_eliminate_connection_state_DEAD.patch
 2-libceph_kill_bad_proto_ceph_connection_op.patch
 3-libceph_rename_socket_callbacks.patch
 

Re: Hangup during scrubbing - possible solutions

2012-11-30 Thread Samuel Just
Hah!  Thanks for the log, it's our handling of active_pushes.  I'll
have a patch shortly.

Thanks!
-Sam

On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
 http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

 Here, please.

 I have initiated a deep-scrub of osd.1 which was lead to forever-stuck
 I/O requests in a short time(scrub `ll do the same). Second log may be
 useful for proper timestamps, as seeks on the original may took a long
 time. Osd processes on the specific node was restarted twice - at the
 beginning to be sure all config options were applied and at the end to
 do same plus to get rid of stuck requests.


 On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote:
 If you can reproduce it again, what we really need are the osd logs
 from the acting set of a pg stuck in scrub with
 debug osd = 20
 debug ms = 1
 debug filestore = 20.

 Thanks,
 -Sam

 On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 22 Nov 2012, Andrey Korolyov wrote:
 Hi,

 In the recent versions Ceph introduces some unexpected behavior for
 the permanent connections (VM or kernel clients) - after crash
 recovery, I/O will hang on the next planned scrub on the following
 scenario:

 - launch a bunch of clients doing non-intensive writes,
 - lose one or more osd, mark them down, wait for recovery completion,
 - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
 or wait for ceph to do the same,
 - observe a raising number of pgs stuck in the active+clean+scrubbing
 state (they took a master role from ones which was on killed osd and
 almost surely they are being written in time of crash),
 - some time later, clients will hang hardly and ceph log introduce
 stuck(old) I/O requests.

 The only one way to return clients back without losing their I/O state
 is per-osd restart, which also will help to get rid of
 active+clean+scrubbing pgs.

 First of all, I`ll be happy to help to solve this problem by providing
 logs.

 If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
 1' logging on the OSD, that would be wonderful!


 I have tested slightly different recovery flow, please see below.
 Since there is no real harm, like frozen I/O, placement groups also
 was stuck forever on the active+clean+scrubbing state, until I
 restarted all osds (end of the log):

 http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

 - start the healthy cluster
 - start persistent clients
 - add an another host with pair of OSDs, let them be in the data placement
 - wait for data to rearrange
 - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I
 have an 1/2 hour delay on readjust in such case, I did ``ceph osd
 out'' manually)
 - watch for data to rearrange again
 - [22:51 timestamp] when it ends, start a manual rescrub, with
 non-zero active+clean+scrubbing-state placement groups at the end of
 process which `ll stay in this state forever until something happens

 After that, I can restart osds one per one, if I want to get rid of
 scrubbing states immediately and then do deep-scrub(if I don`t, those
 states will return at next ceph self-scrubbing) or do per-osd
 deep-scrub, if I have a lot of time. The case I have described in the
 previous message took place when I remove osd from data placement
 which existed on the moment when client(s) have started and indeed it
 is more harmful than current one(frozen I/O leads to hanging entire
 guest, for example). Since testing those flow took a lot of time, I`ll
 send logs related to this case tomorrow.

 Second question is not directly related to this problem, but I
 have thought on for a long time - is there a planned features to
 control scrub process more precisely, e.g. pg scrub rate or scheduled
 scrub, instead of current set of timeouts which of course not very
 predictable on when to run?

 Not yet.  I would be interested in hearing what kind of control/config
 options/whatever you (and others) would like to see!

 Of course it will be awesome to have any determined scheduler or at
 least an option to disable automated scrubbing, since it is not very
 determined in time and deep-scrub eating a lot of I/O if command
 issued against entire OSD. Rate limiting is not in the first place, at
 least it may be recreated in external script, but for those who prefer
 to leave control to Ceph, it may be very useful.

 Thanks!


Review request: wip-localized-read-tests

2012-11-30 Thread Noah Watkins
I've pushed up patches for the first phase of testing read from
replica functionality, which looks only at objecter/client level ops:

   wip-localized-read-tests

The major points are:

  1. Run libcephfs tests w/ and w/o localized reads enabled
  2. Add the performance counter in Objecter to record ops sent to replica
  3. Add performance counter accessor in unit tests

Locally I have verified that the performance counters are working with
a 3 OSD setup, although there are not yet any unit tests that try to
specifically assert a positive value on the counters.

Thanks,
Noah


Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/29/2012 02:37 PM, Alex Elder wrote:
 On 11/22/2012 12:04 PM, Nick Bartos wrote:
 Here are the ceph log messages (including the libceph kernel debug
 stuff you asked for) from a node boot with the rbd command hung for a
 couple of minutes:

I'm sorry, but I did something stupid...

Yes, the branch I gave you includes these fixes.  However
it does *not* include the commit that was giving you trouble
to begin with.

So...

I have updated that same branch (wip-nick) to contain:
- Linux 3.5.5
- Plus the first *50* (not 49) patches you listed
- Plus the ones I added before.

The new commit id for that branch begins with be3198d6.

I'm really sorry for this mistake.  Please try this new
branch and report back what you find.

-Alex


 Nick, I have put together a branch that includes two fixes
 that might be helpful.  I don't expect these fixes will
 necessarily *fix* what you're seeing, but one of them
 pulls a big hunk of processing out of the picture and
 might help eliminate some potential causes.  I had to
 pull in several other patches as prerequisites in order
 to get those fixes to apply cleanly.
 
 Would you be able to give it a try, and let us know what
 results you get?  The branch contains:
 - Linux 3.5.5
 - Plus the first 49 patches you listed
 - Plus four patches, which are prerequisites...
 libceph: define ceph_extract_encoded_string()
 rbd: define some new format constants
 rbd: define rbd_dev_image_id()
 rbd: kill create_snap sysfs entry
 - ...for these two bug fixes:
 libceph: remove 'osdtimeout' option
 ceph: don't reference req after put
 
 The branch is available in the ceph-client git repository
 under the name wip-nick and has commit id dd9323aa.
 https://github.com/ceph/ceph-client/tree/wip-nick
 
 https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
 
 This full debug output is very helpful.  Please supply
 that again as well.
 
 Thanks.
 
   -Alex
 
 On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos n...@pistoncloud.com wrote:
 It's very easy to reproduce now with my automated install script, the
 most I've seen it succeed with that patch is 2 in a row, and hanging
 on the 3rd, although it hangs on most builds.  So it shouldn't take
 much to get it to do it again.  I'll try and get to that tomorrow,
 when I'm a bit more rested and my brain is working better.

 Yes during this the OSDs are probably all syncing up.  All the osd and
 mon daemons have started by the time the rdb commands are ran, though.

 On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 21 Nov 2012, Nick Bartos wrote:
 FYI the build which included all 3.5 backports except patch #50 is
 still going strong after 21 builds.

 Okay, that one at least makes some sense.  I've opened

 http://tracker.newdream.net/issues/3519

 How easy is this to reproduce?  If it is something you can trigger with
  debugging enabled ('echo module libceph +p >
  /sys/kernel/debug/dynamic_debug/control') that would help tremendously.

 I'm guessing that during this startup time the OSDs are still in the
 process of starting?

 Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
 thrashing OSDs could hit this.

 Thanks!
 sage



 On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos n...@pistoncloud.com wrote:
 With 8 successful installs already done, I'm reasonably confident that
 it's patch #50.  I'm making another build which applies all patches
 from the 3.5 backport branch, excluding that specific one.  I'll let
 you know if that turns up any unexpected failures.

 What will the potential fall out be for removing that specific patch?


 On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos n...@pistoncloud.com 
 wrote:
 It's really looking like it's the
 libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
 patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
  So far I have gone through 4 successful installs with no hang with
 only 1-49 applied.  I'm still leaving my test run to make sure it's
 not a fluke, but since previously it hangs within the first couple of
 builds, it really looks like this is where the problem originated.

 1-libceph_eliminate_connection_state_DEAD.patch
 2-libceph_kill_bad_proto_ceph_connection_op.patch
 3-libceph_rename_socket_callbacks.patch
 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
 6-libceph_start_separating_connection_flags_from_state.patch
 7-libceph_start_tracking_connection_socket_state.patch
 8-libceph_provide_osd_number_when_creating_osd.patch
 9-libceph_set_CLOSED_state_bit_in_con_init.patch
 10-libceph_embed_ceph_connection_structure_in_mon_client.patch
 11-libceph_drop_connection_refcounting_for_mon_client.patch
 12-libceph_init_monitor_connection_when_opening.patch
 

librbd: error finding header: (2) No such file or directory

2012-11-30 Thread Simon Frerichs | Fremaks GmbH

Hi,

we are starting to see this error on some images:

- rbd info kvm1207
error opening image kvm1207: (2) No such file or directory
2012-12-01 02:58:27.556677 7ffd50c60760 -1 librbd: error finding header: 
(2) No such file or directory


Any way to fix these images?

Best regards,
Simon


Re: endless flying slow requests

2012-11-30 Thread Samuel Just
I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
Let me know if the problem persists with this patch.
-Sam

On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just sam.j...@inktank.com wrote:
 Also, these clusters aren't mixed argonaut and next, are they?  (Not
 that that shouldn't work, but it would be a useful data point.)
 -Sam

 On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just sam.j...@inktank.com wrote:
 Did you observe hung io along with that error?  Both sub_op_commit and
 sub_op_applied have happened, so the sub_op_reply should have been
 sent back to the primary.  This looks more like a leak.  If you also
 observed hung io, then it's possible that the problem is occurring
 between the sub_op_applied event and the response.
 -Sam


 It is relatively easy to check if one of the client VMs has locked one or
 more cores in iowait or simply hangs, so yes, these ops are related to
 real commit operations and they are hung.
 I'm using an all-new 0.54 cluster, without mixing of course. Did
 everyone who hit this bug readjust the cluster before the bug showed
 itself (say, within a day)?

 On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil s...@inktank.com wrote:
 Hi Stefan,

 On Thu, 15 Nov 2012, Sage Weil wrote:
 On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
   On 14.11.2012 15:59, Sage Weil wrote:
   Hi Stefan,
  
    It would be nice to confirm that no clients are waiting on replies for
   these requests; currently we suspect that the OSD request tracking 
   is the
   buggy part.  If you query the OSD admin socket you should be able to 
   dump
   requests and see the client IP, and then query the client.
  
   Is it librbd?  In that case you likely need to change the config so 
   that
   it is listening on an admin socket ('admin socket = path').
 
  Yes it is. So i have to specify admin socket at the KVM host?

 Right.  IIRC the disk line is a ; (or \;) separated list of key/value
 pairs.

  How do i query the admin socket for requests?

 ceph --admin-daemon /path/to/socket help
 ceph --admin-daemon /path/to/socket objecter_dump (i think)
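As a sketch, the client-side piece looks roughly like this; the socket path
and the use of the $pid metavariable are just an example, any writable,
per-process-unique location works:

    [client]
        admin socket = /var/run/ceph/kvm-client.$pid.asok

    # then, on the KVM host:
    ceph --admin-daemon /var/run/ceph/kvm-client.<pid>.asok help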

 Were you able to reproduce this?

 Thanks!
 sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

 { "description": "osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[])",
   "received_at": "2012-11-28 03:54:43.094151",
   "age": "27812.942680",
   "duration": "2.676641",
   "flag_point": "started",
   "events": [
 { "time": "2012-11-28 03:54:43.094222",
   "event": "waiting_for_osdmap"},
 { "time": "2012-11-28 03:54:43.386890",
   "event": "reached_pg"},
 { "time": "2012-11-28 03:54:43.386894",
   "event": "started"},
 { "time": "2012-11-28 03:54:43.386973",
   "event": "commit_queued_for_journal_write"},
 { "time": "2012-11-28 03:54:45.360049",
   "event": "write_thread_in_journal_buffer"},
 { "time": "2012-11-28 03:54:45.586183",
   "event": "journaled_completion_queued"},
 { "time": "2012-11-28 03:54:45.586262",
   "event": "sub_op_commit"},
 { "time": "2012-11-28 03:54:45.770792",
   "event": "sub_op_applied"}]}]}





 sage

 
  Stefan
 
 
   On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
  
Hello list,
   
i see this several times. Endless flying slow requests. And they 
never
stop
until i restart the mentioned osd.
   
2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for > 31789.858457 secs
2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
2012-11-14 10:11:58.513584 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for > 31790.858646 secs
2012-11-14 10:11:58.513586 osd.24 [WRN] slow request 31790.858646 seconds old, received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719 rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 3.3f6d2373) v4 currently delayed
2012-11-14 10:11:59.513766 osd.24 [WRN] 1 slow requests, 1 included below; oldest blocked for > 31791.858827 secs
2012-11-14 10:11:59.513768 osd.24 [WRN] slow request 31791.858827 

Re: Hangup during scrubbing - possible solutions

2012-11-30 Thread Samuel Just
Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
Let me know if it persists.  Thanks for the logs!
-Sam

On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just sam.j...@inktank.com wrote:
 Hah!  Thanks for the log, it's our handling of active_pushes.  I'll
 have a patch shortly.

 Thanks!
 -Sam

 On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov and...@xdel.ru wrote:
 http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
 http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

 Here, please.

 I have initiated a deep-scrub of osd.1 which was lead to forever-stuck
 I/O requests in a short time(scrub `ll do the same). Second log may be
 useful for proper timestamps, as seeks on the original may took a long
 time. Osd processes on the specific node was restarted twice - at the
 beginning to be sure all config options were applied and at the end to
 do same plus to get rid of stuck requests.


 On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just sam.j...@inktank.com wrote:
 If you can reproduce it again, what we really need are the osd logs
 from the acting set of a pg stuck in scrub with
 debug osd = 20
 debug ms = 1
 debug filestore = 20.

 Thanks,
 -Sam

 On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 22 Nov 2012, Andrey Korolyov wrote:
 Hi,

 In the recent versions Ceph introduces some unexpected behavior for
 the permanent connections (VM or kernel clients) - after crash
 recovery, I/O will hang on the next planned scrub on the following
 scenario:

 - launch a bunch of clients doing non-intensive writes,
 - lose one or more osd, mark them down, wait for recovery completion,
 - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
 or wait for ceph to do the same,
 - observe a raising number of pgs stuck in the active+clean+scrubbing
 state (they took a master role from ones which was on killed osd and
 almost surely they are being written in time of crash),
 - some time later, clients will hang hardly and ceph log introduce
 stuck(old) I/O requests.

 The only one way to return clients back without losing their I/O state
 is per-osd restart, which also will help to get rid of
 active+clean+scrubbing pgs.

 First of all, I`ll be happy to help to solve this problem by providing
 logs.

 If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
 1' logging on the OSD, that would be wonderful!


 I have tested slightly different recovery flow, please see below.
 Since there is no real harm, like frozen I/O, placement groups also
 was stuck forever on the active+clean+scrubbing state, until I
 restarted all osds (end of the log):

 http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

 - start the healthy cluster
 - start persistent clients
 - add an another host with pair of OSDs, let them be in the data placement
 - wait for data to rearrange
 - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I
 have an 1/2 hour delay on readjust in such case, I did ``ceph osd
 out'' manually)
 - watch for data to rearrange again
 - [22:51 timestamp] when it ends, start a manual rescrub, with
 non-zero active+clean+scrubbing-state placement groups at the end of
 process which `ll stay in this state forever until something happens

 After that, I can restart osds one per one, if I want to get rid of
 scrubbing states immediately and then do deep-scrub(if I don`t, those
 states will return at next ceph self-scrubbing) or do per-osd
 deep-scrub, if I have a lot of time. The case I have described in the
 previous message took place when I remove osd from data placement
 which existed on the moment when client(s) have started and indeed it
 is more harmful than current one(frozen I/O leads to hanging entire
 guest, for example). Since testing those flow took a lot of time, I`ll
 send logs related to this case tomorrow.

 Second question is not directly related to this problem, but I
 have thought on for a long time - is there a planned features to
 control scrub process more precisely, e.g. pg scrub rate or scheduled
 scrub, instead of current set of timeouts which of course not very
 predictable on when to run?

 Not yet.  I would be interested in hearing what kind of control/config
 options/whatever you (and others) would like to see!

 Of course it will be awesome to have any determined scheduler or at
 least an option to disable automated scrubbing, since it is not very
 determined in time and deep-scrub eating a lot of I/O if command
 issued against entire OSD. Rate limiting is not in the first place, at
 least it may be recreated in external script, but for those who prefer
 to leave control to Ceph, it may be very useful.

 Thanks!