Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Hajnoczi
On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe  wrote:
> @@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB 
> *blockacb)
>  {
>  RBDAIOCB *acb = (RBDAIOCB *) blockacb;
>  acb->cancelled = 1;
> +
> +while (acb->status == -EINPROGRESS) {
> +qemu_aio_wait();
> +}
>  }
>
>  static const AIOCBInfo rbd_aiocb_info = {
> @@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
>  acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
>  qemu_bh_delete(acb->bh);
>  acb->bh = NULL;
> +acb->status = 0;
>
>  qemu_aio_release(acb);
>  }

We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1
because qemu_rbd_aio_cancel() still accesses it.  This was discussed
in an early version of the patch.

Stefan


[PATCHv6] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Priebe
This one fixes a race between cancellation and I/O completion which
QEMU also had in the iSCSI block driver.

qemu_rbd_aio_cancel was not waiting synchronously for the command to
complete; now it does.

To achieve this, the patch introduces a new status field which holds
-EINPROGRESS while the request is in flight.
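
As an illustration, here is a minimal, self-contained sketch of the
ordering this establishes (hypothetical stand-ins, not QEMU code: the
acb struct, qemu_aio_wait_stub() and aio_bh_cb() below stand in for
RBDAIOCB, qemu_aio_wait() and rbd_aio_bh_cb()):

/* status starts at -EINPROGRESS, only the bottom half clears it,
 * cancel spins until that happens, and the ACB is released exactly
 * once: in the BH on the normal path, in cancel when cancelled. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct acb {
    int status;                     /* -EINPROGRESS until the BH has run */
    int cancelled;
};

static struct acb *pending;         /* stand-in for the queued bottom half */

static void aio_bh_cb(struct acb *a)
{
    a->status = 0;                  /* completion now visible to cancel */
    if (!a->cancelled)
        free(a);                    /* normal path releases here */
}

static void qemu_aio_wait_stub(void)
{
    if (pending) {                  /* "process one pending event" */
        aio_bh_cb(pending);
        pending = NULL;
    }
}

static void aio_cancel(struct acb *a)
{
    a->cancelled = 1;
    while (a->status == -EINPROGRESS)   /* wait for the BH, synchronously */
        qemu_aio_wait_stub();
    free(a);                            /* cancelled path releases here */
}

int main(void)
{
    struct acb *a = calloc(1, sizeof(*a));

    a->status = -EINPROGRESS;           /* set at submission time */
    pending = a;                        /* completion already queued */
    aio_cancel(a);
    printf("cancel returned only after the bottom half ran\n");
    return 0;
}

The point of the single status field is that cancel never needs to
know whether rados has called back yet; it only waits for the BH,
which always runs.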

Changes since PATCHv5:
- qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O
  was cancelled

Changes since PATCHv4:
- removed unnecessary qemu_vfree of acb->bounce as the BH will always
  run

Changes since PATCHv3:
- removed unnecessary if condition in rbd_start_aio as we haven't
  started I/O yet
- moved acb->status = 0 into rbd_aio_bh_cb so qemu_aio_wait always
  waits until the BH has executed

Changes since PATCHv2:
- fixed missing braces
- added vfree for bounce

Signed-off-by: Stefan Priebe 

---
 block/rbd.c |   20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f3becc7..737bab1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -77,6 +77,7 @@ typedef struct RBDAIOCB {
 int error;
 struct BDRVRBDState *s;
 int cancelled;
+int status;
 } RBDAIOCB;
 
 typedef struct RADOSCB {
@@ -376,12 +377,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 RBDAIOCB *acb = rcb->acb;
 int64_t r;
 
-if (acb->cancelled) {
-qemu_vfree(acb->bounce);
-qemu_aio_release(acb);
-goto done;
-}
-
 r = rcb->ret;
 
 if (acb->cmd == RBD_AIO_WRITE ||
@@ -409,7 +404,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 /* Note that acb->bh can be NULL in case where the aio was cancelled */
 acb->bh = qemu_bh_new(rbd_aio_bh_cb, acb);
 qemu_bh_schedule(acb->bh);
-done:
 g_free(rcb);
 }
 
@@ -568,6 +562,12 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
 {
 RBDAIOCB *acb = (RBDAIOCB *) blockacb;
 acb->cancelled = 1;
+
+while (acb->status == -EINPROGRESS) {
+qemu_aio_wait();
+}
+
+qemu_aio_release(acb);
 }
 
 static const AIOCBInfo rbd_aiocb_info = {
@@ -639,8 +639,11 @@ static void rbd_aio_bh_cb(void *opaque)
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 qemu_bh_delete(acb->bh);
 acb->bh = NULL;
+acb->status = 0;
 
-qemu_aio_release(acb);
+if (!acb->cancelled) {
+qemu_aio_release(acb);
+}
 }
 
 static int rbd_aio_discard_wrapper(rbd_image_t image,
@@ -685,6 +688,7 @@ static BlockDriverAIOCB *rbd_start_aio(BlockDriverState *bs,
 acb->s = s;
 acb->cancelled = 0;
 acb->bh = NULL;
+acb->status = -EINPROGRESS;
 
 if (cmd == RBD_AIO_WRITE) {
 qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-- 
1.7.10.4



Re: [PATCHv5] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Priebe - Profihost AG

fixed in V6

Am 30.11.2012 09:26, schrieb Stefan Hajnoczi:

On Thu, Nov 29, 2012 at 10:37 PM, Stefan Priebe  wrote:

@@ -568,6 +562,10 @@ static void qemu_rbd_aio_cancel(BlockDriverAIOCB *blockacb)
  {
  RBDAIOCB *acb = (RBDAIOCB *) blockacb;
  acb->cancelled = 1;
+
+while (acb->status == -EINPROGRESS) {
+qemu_aio_wait();
+}
  }

  static const AIOCBInfo rbd_aiocb_info = {
@@ -639,6 +637,7 @@ static void rbd_aio_bh_cb(void *opaque)
  acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
  qemu_bh_delete(acb->bh);
  acb->bh = NULL;
+acb->status = 0;

  qemu_aio_release(acb);
  }


We cannot release acb in rbd_aio_bh_cb() when acb->cancelled == 1
because qemu_rbd_aio_cancel() still accesses it.  This was discussed
in an early version of the patch.

Stefan




Re: Hangup during scrubbing - possible solutions

2012-11-30 Thread Andrey Korolyov
http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
http://xdel.ru/downloads/ceph-log/cluster-w.log.gz

Here, please.

I have initiated a deep-scrub of osd.1, which led to forever-stuck
I/O requests within a short time (a plain scrub will do the same).
The second log may be useful for proper timestamps, as seeking
through the original may take a long time. The osd processes on that
node were restarted twice: at the beginning, to be sure all config
options were applied, and at the end, to do the same plus to get rid
of the stuck requests.


On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just  wrote:
> If you can reproduce it again, what we really need are the osd logs
> from the acting set of a pg stuck in scrub with
> debug osd = 20
> debug ms = 1
> debug filestore = 20.
>
> Thanks,
> -Sam
>
> On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov  wrote:
>> On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil  wrote:
>>> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
 Hi,

 In the recent versions Ceph introduces some unexpected behavior for
 the permanent connections (VM or kernel clients) - after crash
 recovery, I/O will hang on the next planned scrub on the following
 scenario:

 - launch a bunch of clients doing non-intensive writes,
 - lose one or more osd, mark them down, wait for recovery completion,
 - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
 or wait for ceph to do the same,
 - observe a raising number of pgs stuck in the active+clean+scrubbing
 state (they took a master role from ones which was on killed osd and
 almost surely they are being written in time of crash),
 - some time later, clients will hang hardly and ceph log introduce
 stuck(old) I/O requests.

 The only one way to return clients back without losing their I/O state
 is per-osd restart, which also will help to get rid of
 active+clean+scrubbing pgs.

 First of all, I`ll be happy to help to solve this problem by providing
 logs.
>>>
>>> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
>>> 1' logging on the OSD, that would be wonderful!
>>>
>>
>> I have tested slightly different recovery flow, please see below.
>> Since there is no real harm, like frozen I/O, placement groups also
>> was stuck forever on the active+clean+scrubbing state, until I
>> restarted all osds (end of the log):
>>
>> http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz
>>
>> - start the healthy cluster
>> - start persistent clients
>> - add an another host with pair of OSDs, let them be in the data placement
>> - wait for data to rearrange
>> - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I
>> have an 1/2 hour delay on readjust in such case, I did ``ceph osd
>> out'' manually)
>> - watch for data to rearrange again
>> - [22:51 timestamp] when it ends, start a manual rescrub, with
>> non-zero active+clean+scrubbing-state placement groups at the end of
>> process which `ll stay in this state forever until something happens
>>
>> After that, I can restart osds one per one, if I want to get rid of
>> scrubbing states immediately and then do deep-scrub(if I don`t, those
>> states will return at next ceph self-scrubbing) or do per-osd
>> deep-scrub, if I have a lot of time. The case I have described in the
>> previous message took place when I remove osd from data placement
>> which existed on the moment when client(s) have started and indeed it
>> is more harmful than current one(frozen I/O leads to hanging entire
>> guest, for example). Since testing those flow took a lot of time, I`ll
>> send logs related to this case tomorrow.
>>
 Second question is not directly related to this problem, but I
 have thought on for a long time - is there a planned features to
 control scrub process more precisely, e.g. pg scrub rate or scheduled
 scrub, instead of current set of timeouts which of course not very
 predictable on when to run?
>>>
>>> Not yet.  I would be interested in hearing what kind of control/config
>>> options/whatever you (and others) would like to see!
>>
>> Of course it will be awesome to have any determined scheduler or at
>> least an option to disable automated scrubbing, since it is not very
>> determined in time and deep-scrub eating a lot of I/O if command
>> issued against entire OSD. Rate limiting is not in the first place, at
>> least it may be recreated in external script, but for those who prefer
>> to leave control to Ceph, it may be very useful.
>>
>> Thanks!


Re: [PATCHv6] rbd block driver fix race between aio completion and aio cancel

2012-11-30 Thread Stefan Hajnoczi
On Fri, Nov 30, 2012 at 9:55 AM, Stefan Priebe  wrote:
> This one fixes a race which qemu had also in iscsi block driver
> between cancellation and io completition.
>
> qemu_rbd_aio_cancel was not synchronously waiting for the end of
> the command.
>
> To archieve this it introduces a new status flag which uses
> -EINPROGRESS.
>
> Changes since PATCHv5:
> - qemu_aio_release has to be done in qemu_rbd_aio_cancel if I/O
>   was cancelled
>
> Changes since PATCHv4:
> - removed unnecessary qemu_vfree of acb->bounce as BH will always
>   run
>
> Changes since PATCHv3:
> - removed unnecessary if condition in rbd_start_aio as we
>   haven't start io yet
> - moved acb->status = 0 to rbd_aio_bh_cb so qemu_aio_wait always
>   waits until BH was executed
>
> Changes since PATCHv2:
> - fixed missing braces
> - added vfree for bounce
>
> Signed-off-by: Stefan Priebe 
>
> ---
>  block/rbd.c |   20 
>  1 file changed, 12 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi 


[PATCH 0/2] rbd: fix two memory leaks

2012-11-30 Thread Alex Elder
This series fixes two memory leaks that occur whenever rbd issues a
"special" (non-I/O) osd request.

-Alex

[PATCH 1/2] rbd: don't leak rbd_req on synchronous requests
[PATCH 2/2] rbd: don't leak rbd_req for rbd_req_sync_notify_ack()


[PATCH 1/2] rbd: don't leak rbd_req on synchronous requests

2012-11-30 Thread Alex Elder
When rbd_do_request() is called it allocates and populates an
rbd_req structure to hold information about the osd request to be
sent.  This is done for the benefit of the callback function (in
particular, rbd_req_cb()), which uses this in processing when
the request completes.

Synchronous requests provide no callback function, in which case
rbd_do_request() waits for the request to complete before returning.
This case does not free the rbd_req structure as it should, so it
gets leaked.

Note however that the synchronous case has no need for the rbd_req
structure at all.  So rather than simply freeing this structure for
synchronous requests, just don't allocate it to begin with.

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |   48 ++++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index acdb4a6..78493e7 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1160,20 +1160,11 @@ static int rbd_do_request(struct request *rq,
 struct ceph_msg *),
  u64 *ver)
 {
+   struct ceph_osd_client *osdc;
struct ceph_osd_request *osd_req;
-   int ret;
+   struct rbd_request *rbd_req = NULL;
struct timespec mtime = CURRENT_TIME;
-   struct rbd_request *rbd_req;
-   struct ceph_osd_client *osdc;
-
-   rbd_req = kzalloc(sizeof(*rbd_req), GFP_NOIO);
-   if (!rbd_req)
-   return -ENOMEM;
-
-   if (coll) {
-   rbd_req->coll = coll;
-   rbd_req->coll_index = coll_index;
-   }
+   int ret;

dout("rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n",
object_name, (unsigned long long) ofs,
@@ -1181,10 +1172,8 @@ static int rbd_do_request(struct request *rq,

osdc = &rbd_dev->rbd_client->client->osdc;
osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO);
-   if (!osd_req) {
-   ret = -ENOMEM;
-   goto done_pages;
-   }
+   if (!osd_req)
+   return -ENOMEM;

osd_req->r_flags = flags;
osd_req->r_pages = pages;
@@ -1192,13 +1181,22 @@ static int rbd_do_request(struct request *rq,
osd_req->r_bio = bio;
bio_get(osd_req->r_bio);
}
-   osd_req->r_callback = rbd_cb;

-   rbd_req->rq = rq;
-   rbd_req->bio = bio;
-   rbd_req->pages = pages;
-   rbd_req->len = len;
+   if (rbd_cb) {
+   ret = -ENOMEM;
+   rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
+   if (!rbd_req)
+   goto done_osd_req;
+
+   rbd_req->rq = rq;
+   rbd_req->bio = bio;
+   rbd_req->pages = pages;
+   rbd_req->len = len;
+   rbd_req->coll = coll;
+   rbd_req->coll_index = coll ? coll_index : 0;
+   }

+   osd_req->r_callback = rbd_cb;
osd_req->r_priv = rbd_req;

strncpy(osd_req->r_oid, object_name, sizeof(osd_req->r_oid));
@@ -1233,10 +1231,12 @@ static int rbd_do_request(struct request *rq,
return ret;

 done_err:
-   bio_chain_put(rbd_req->bio);
-   ceph_osdc_put_request(osd_req);
-done_pages:
+   if (bio)
+   bio_chain_put(osd_req->r_bio);
kfree(rbd_req);
+done_osd_req:
+   ceph_osdc_put_request(osd_req);
+
return ret;
 }

-- 
1.7.9.5



[PATCH 2/2] rbd: don't leak rbd_req for rbd_req_sync_notify_ack()

2012-11-30 Thread Alex Elder
When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies
rbd_simple_req_cb() as its callback function.  Because the callback
is supplied, an rbd_req structure gets allocated and populated so it
can be used by the callback.  However rbd_simple_req_cb() is not
freeing (or even using) the rbd_req structure, so it's getting
leaked.

Since rbd_simple_req_cb() has no need for the rbd_req structure,
just avoid allocating one for this case.  Of the three calls to
rbd_do_request(), only the one from rbd_do_op() needs the rbd_req
structure, and that call can be distinguished from the other two
because it supplies a non-null rbd_collection pointer.

So fix this leak by only allocating the rbd_req structure if a
non-null "coll" value is provided to rbd_do_request().

Signed-off-by: Alex Elder 
---
 drivers/block/rbd.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 78493e7..fca0ebf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1182,7 +1182,7 @@ static int rbd_do_request(struct request *rq,
bio_get(osd_req->r_bio);
}

-   if (rbd_cb) {
+   if (coll) {
ret = -ENOMEM;
rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO);
if (!rbd_req)
@@ -1193,7 +1193,7 @@ static int rbd_do_request(struct request *rq,
rbd_req->pages = pages;
rbd_req->len = len;
rbd_req->coll = coll;
-   rbd_req->coll_index = coll ? coll_index : 0;
+   rbd_req->coll_index = coll_index;
}

osd_req->r_callback = rbd_cb;
-- 
1.7.9.5



[PATCH] libceph: for chooseleaf rules, retry CRUSH map descent from root if leaf is failed

2012-11-30 Thread Jim Schutt
Add libceph support for a new CRUSH tunable recently added to Ceph servers.

Consider the CRUSH rule
  step chooseleaf firstn 0 type <node_type>

This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.

When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf.  This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.

This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.

For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.

If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.

In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf.  Wire this up as a new CRUSH tunable.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary.  This
means that OSD's data for that PG has to be moved as well.  This
seems unavoidable but should be relatively rare.
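
To make the intended control flow concrete, here is a simplified,
self-contained sketch (toy code, not the mapper.c implementation;
choose_leaf() and choose_with_leaf() are hypothetical stand-ins):
with the tunable set, a failed inner (leaf) descent is reported back
immediately so the outer descent moves on to another <node_type>
bucket instead of retrying under the same one.

#include <stdio.h>

struct crush_map {
    int chooseleaf_descend_once;    /* the new tunable */
};

/* stand-in for the inner descent: pretend every leaf under
 * bucket 0 is down on the first attempt */
static int choose_leaf(int bucket, int attempt)
{
    return bucket != 0 || attempt > 0;
}

static int choose_with_leaf(const struct crush_map *map, int bucket)
{
    int attempt;

    for (attempt = 0; ; attempt++) {
        if (choose_leaf(bucket, attempt))
            return 0;       /* inner descent found a usable leaf */
        if (map->chooseleaf_descend_once)
            return -1;      /* give up; let the outer descent retry */
        /* legacy behaviour: retry the inner descent under the
         * same bucket, which is what keeps the data there */
    }
}

int main(void)
{
    struct crush_map map = { .chooseleaf_descend_once = 1 };
    int bucket;

    /* outer descent: walk candidate <node_type> buckets from the root */
    for (bucket = 0; bucket < 2; bucket++) {
        if (choose_with_leaf(&map, bucket) == 0) {
            printf("replica placed under bucket %d\n", bucket);
            return 0;
        }
        printf("bucket %d rejected, retrying outer descent\n", bucket);
    }
    return 1;
}

With chooseleaf_descend_once = 0 the loop in choose_with_leaf() keeps
retrying under bucket 0 and eventually places the replica there; with
the tunable set, bucket 0 is rejected and the replica lands under
bucket 1, which is the wider dispersal described above.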

Signed-off-by: Jim Schutt 
---
 include/linux/ceph/ceph_features.h |4 +++-
 include/linux/crush/crush.h|2 ++
 net/ceph/crush/mapper.c|   13 ++---
 net/ceph/osdmap.c  |6 ++
 4 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index dad579b..61e5af4 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -14,13 +14,15 @@
 #define CEPH_FEATURE_DIRLAYOUTHASH  (1<<7)
 /* bits 8-17 defined by user-space; not supported yet here */
 #define CEPH_FEATURE_CRUSH_TUNABLES (1<<18)
+#define CEPH_FEATURE_CRUSH_TUNABLES2 (1<<25)
 
 /*
  * Features supported.
  */
 #define CEPH_FEATURES_SUPPORTED_DEFAULT  \
(CEPH_FEATURE_NOSRCADDR |\
-CEPH_FEATURE_CRUSH_TUNABLES)
+CEPH_FEATURE_CRUSH_TUNABLES |   \
+CEPH_FEATURE_CRUSH_TUNABLES2)
 
 #define CEPH_FEATURES_REQUIRED_DEFAULT   \
(CEPH_FEATURE_NOSRCADDR)
diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h
index 25baa28..6a1101f 100644
--- a/include/linux/crush/crush.h
+++ b/include/linux/crush/crush.h
@@ -162,6 +162,8 @@ struct crush_map {
__u32 choose_local_fallback_tries;
/* choose attempts before giving up */ 
__u32 choose_total_tries;
+   /* attempt chooseleaf inner descent once; on failure retry outer descent */
+   __u32 chooseleaf_descend_once;
 };
 
 
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 35fce75..96c8a58 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -287,6 +287,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in
  * @outpos: our position in that vector
  * @firstn: true if choosing "first n" items, false if choosing "indep"
  * @recurse_to_leaf: true if we want one device under each item of given type
+ * @descend_once: true if we should only try one descent before giving up
  * @out2: second output vector for leaf items (if @recurse_to_leaf)
  */
 static int crush_choose(const struct crush_map *map,
@@ -295,7 +296,7 @@ static int crush_choose(const struct crush_map *map,
int x, int numrep, int type,
int *out, int outpos,
int firstn, int recurse_to_leaf,
-   int *out2)
+   int descend_once, int *out2)
 {
int rep;
unsigned int ftotal, flocal;
@@ -399,6 +400,7 @@ static int crush_choose(const struct crush_map *map,
 x, outpos+1, 0,
 out2, outpos,
 firstn, 0,
+ map->chooseleaf_descend_once,
 NULL) <= outpos)
/* didn't get leaf */
reject = 1;
@@ -422,7 +424,10 @@ reject:
ftotal++;
flocal++;
 
-   

Re: OSD daemon changes port no

2012-11-30 Thread Sage Weil
What kernel version and mds version are you running?  I did

# ceph osd pool create foo 12
# ceph osd pool create bar 12
# ceph mds add_data_pool 3
# ceph mds add_data_pool 4

and from a kernel mount

# mkdir foo
# mkdir bar
# cephfs foo set_layout --pool 3
# cephfs bar set_layout --pool 4
# cephfs foo show_layout
layout.data_pool: 3
layout.object_size:   4194304
layout.stripe_unit:   4194304
layout.stripe_count:  1
# cephfs bar show_layout 
layout.data_pool: 4
layout.object_size:   4194304
layout.stripe_unit:   4194304
layout.stripe_count:  1

This much you can test without playing with the crush map, btw.

Maybe there is some crazy bug when the set_layouts are pipelined?  Try
without using & ?

sage


On Fri, 30 Nov 2012, hemant surale wrote:

> Hi Sage,Community ,
>I am unable to use 2 directories to direct data to 2 different
> pools. I did following expt.
> 
> Created 2 pool "host" & "ghost" to seperate data placement .
> --//crushmap file
> ---
> # begin crush map
> 
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> 
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 pool
> type 7 ghost
> 
> # buckets
> host hemantone-mirror-virtual-machine {
> id -6   # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0  # rjenkins1
> item osd.2 weight 1.000
> }
> host hemantone-virtual-machine {
> id -7   # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0  # rjenkins1
> item osd.1 weight 1.000
> }
> rack one {
> id -2   # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0  # rjenkins1
> item hemantone-mirror-virtual-machine weight 1.000
> item hemantone-virtual-machine weight 1.000
> }
> ghost hemant-virtual-machine {
> id -4   # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0  # rjenkins1
> item osd.0 weight 1.000
> }
> ghost hemant-mirror-virtual-machine {
> id -5   # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0  # rjenkins1
> item osd.3 weight 1.000
> }
> rack two {
> id -3   # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0  # rjenkins1
> item hemant-virtual-machine weight 1.000
> item hemant-mirror-virtual-machine weight 1.000
> }
> pool default {
> id -1   # do not change unnecessarily
> # weight 4.000
> alg straw
> hash 0  # rjenkins1
> item one weight 2.000
> item two weight 2.000
> }
> 
> # rules
> rule data {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step take one
> step chooseleaf firstn 0 type host
> step emit
> }
> rule metadata {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take default
> step take one
> step chooseleaf firstn 0 type host
> step emit
> }
> rule rbd {
> ruleset 2
> type replicated
> min_size 1
> max_size 10
> step take default
> step take one
> step chooseleaf firstn 0 type host
> step emit
> }
> rule forhost {
> ruleset 3
> type replicated
> min_size 1
> max_size 10
> step take default
> step take one
> step chooseleaf firstn 0 type host
> step emit
> }
> rule forghost {
> ruleset 4
> type replicated
> min_size 1
> max_size 10
> step take default
> step take two
> step chooseleaf firstn 0 type ghost
> step emit
> }
> 
> # end crush map
> 
> 1) set replication factor to 2. and crushrule accordingly . ( "host"
> got crush_ruleset = 3 & "ghost" pool got  crush_ruleset = 4).
> 2) Now I mounted data to dir.  using "mount.ceph 10.72.148.245:6789:/
> /home/hemant/x"   & "mount.ceph 10.72.148.245:6789:/ /home/hemant/y"
> 3) then "mds add_data_pool 5"  & "mds add_data_pool 6"  ( here pool id
> are host = 5, ghost = 6)
> 4) "cephfs /home/hemant/x set_layout --pool 5 -c 1 -u 4194304 -s
> 4194304"  & "cephfs /home/hemant/y set_layout --pool 6 -c 1 -u 4194304
> -s 4194304"
> 
> PROBLEM:
>  $ cephfs /home/hemant/x show_layout
> layout.data_pool: 6
> layout.object_size:   4194304
> layout.stripe_unit:   4194304
> layout.stripe_count:  1
> cephfs /home/hemant/y show_layout
> layout.data_pool: 6
> layout.object_size:   4194304
> layout.stripe_unit:   419

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Nick Bartos
My initial tests using a 3.5.7 kernel with the 55 patches from
wip-nick are going well.  So far I've gone through 8 installs without
an incident; I'll let it run for a bit longer to see if it crops up
again.

Can I get a branch with these patches integrated into all of the
backported patches to 3.5.x?  I'd like to get this into our main
testing branch, which is currently running 3.5.7 with the patches from
wip-3.5 excluding the
libceph_resubmit_linger_ops_when_pg_mapping_changes patch.

Note that we had a case of an rbd map hang with our main testing
branch, but I don't have a script that can reproduce it yet.  It was
after the cluster was all up and working, and we were doing a rolling
reboot (cycling through each node).


On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder  wrote:
> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>> Here are the ceph log messages (including the libceph kernel debug
>> stuff you asked for) from a node boot with the rbd command hung for a
>> couple of minutes:
>
> Nick, I have put together a branch that includes two fixes
> that might be helpful.  I don't expect these fixes will
> necessarily *fix* what you're seeing, but one of them
> pulls a big hunk of processing out of the picture and
> might help eliminate some potential causes.  I had to
> pull in several other patches as prerequisites in order
> to get those fixes to apply cleanly.
>
> Would you be able to give it a try, and let us know what
> results you get?  The branch contains:
> - Linux 3.5.5
> - Plus the first 49 patches you listed
> - Plus four patches, which are prerequisites...
> libceph: define ceph_extract_encoded_string()
> rbd: define some new format constants
> rbd: define rbd_dev_image_id()
> rbd: kill create_snap sysfs entry
> - ...for these two bug fixes:
> libceph: remove 'osdtimeout' option
> ceph: don't reference req after put
>
> The branch is available in the ceph-client git repository
> under the name "wip-nick" and has commit id dd9323aa.
> https://github.com/ceph/ceph-client/tree/wip-nick
>
>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>
> This full debug output is very helpful.  Please supply
> that again as well.
>
> Thanks.
>
> -Alex
>
>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos  wrote:
>>> It's very easy to reproduce now with my automated install script, the
>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>> when I'm a bit more rested and my brain is working better.
>>>
>>> Yes during this the OSDs are probably all syncing up.  All the osd and
>>> mon daemons have started by the time the rdb commands are ran, though.
>>>
>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
 On Wed, 21 Nov 2012, Nick Bartos wrote:
> FYI the build which included all 3.5 backports except patch #50 is
> still going strong after 21 builds.

 Okay, that one at least makes some sense.  I've opened

 http://tracker.newdream.net/issues/3519

 How easy is this to reproduce?  If it is something you can trigger with
 debugging enabled ('echo module libceph +p >
 /sys/kernel/debug/dynamic_debug/control') that would help tremendously.

 I'm guessing that during this startup time the OSDs are still in the
 process of starting?

 Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
 thrashing OSDs could hit this.

 Thanks!
 sage


>
> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  wrote:
>> With 8 successful installs already done, I'm reasonably confident that
>> it's patch #50.  I'm making another build which applies all patches
>> from the 3.5 backport branch, excluding that specific one.  I'll let
>> you know if that turns up any unexpected failures.
>>
>> What will the potential fall out be for removing that specific patch?
>>
>>
>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  
>> wrote:
>>> It's really looking like it's the
>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>  So far I have gone through 4 successful installs with no hang with
>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>> not a fluke, but since previously it hangs within the first couple of
>>> builds, it really looks like this is where the problem originated.
>>>
>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>> 3-libceph_rename_socket_callbacks.patch
>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patc

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/30/2012 12:49 PM, Nick Bartos wrote:
> My initial tests using a 3.5.7 kernel with the 55 patches from
> wip-nick are going well.  So far I've gone through 8 installs without
> an incident, I'll leave it run for a bit longer to see if it crops up
> again.

This is great news!  Now I wonder which of the two fixes took
care of the problem...

> Can I get a branch with these patches integrated into all of the
> backported patches to 3.5.x?  I'd like to get this into our main
> testing branch, which is currently running 3.5.7 with the patches from
> wip-3.5 excluding the
> libceph_resubmit_linger_ops_when_pg_mapping_changes patch.

I will put together a new branch that includes the remainder
of those patches for you shortly.

> Note that we had a case of a rbd map hang with our main testing
> branch, but I don't have a script that can reproduce that yet.  It was
> after the cluster was all up and working, and we were  doing a rolling
> reboot (cycling through each node).

If you are able to reproduce this please let us know.

-Alex

> 
> 
> On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder  wrote:
>> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>>> Here are the ceph log messages (including the libceph kernel debug
>>> stuff you asked for) from a node boot with the rbd command hung for a
>>> couple of minutes:
>>
>> Nick, I have put together a branch that includes two fixes
>> that might be helpful.  I don't expect these fixes will
>> necessarily *fix* what you're seeing, but one of them
>> pulls a big hunk of processing out of the picture and
>> might help eliminate some potential causes.  I had to
>> pull in several other patches as prerequisites in order
>> to get those fixes to apply cleanly.
>>
>> Would you be able to give it a try, and let us know what
>> results you get?  The branch contains:
>> - Linux 3.5.5
>> - Plus the first 49 patches you listed
>> - Plus four patches, which are prerequisites...
>> libceph: define ceph_extract_encoded_string()
>> rbd: define some new format constants
>> rbd: define rbd_dev_image_id()
>> rbd: kill create_snap sysfs entry
>> - ...for these two bug fixes:
>> libceph: remove 'osdtimeout' option
>> ceph: don't reference req after put
>>
>> The branch is available in the ceph-client git repository
>> under the name "wip-nick" and has commit id dd9323aa.
>> https://github.com/ceph/ceph-client/tree/wip-nick
>>
>>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
>>
>> This full debug output is very helpful.  Please supply
>> that again as well.
>>
>> Thanks.
>>
>> -Alex
>>
>>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos  wrote:
 It's very easy to reproduce now with my automated install script, the
 most I've seen it succeed with that patch is 2 in a row, and hanging
 on the 3rd, although it hangs on most builds.  So it shouldn't take
 much to get it to do it again.  I'll try and get to that tomorrow,
 when I'm a bit more rested and my brain is working better.

 Yes during this the OSDs are probably all syncing up.  All the osd and
 mon daemons have started by the time the rdb commands are ran, though.

 On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
> On Wed, 21 Nov 2012, Nick Bartos wrote:
>> FYI the build which included all 3.5 backports except patch #50 is
>> still going strong after 21 builds.
>
> Okay, that one at least makes some sense.  I've opened
>
> http://tracker.newdream.net/issues/3519
>
> How easy is this to reproduce?  If it is something you can trigger with
> debugging enabled ('echo module libceph +p >
> /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
>
> I'm guessing that during this startup time the OSDs are still in the
> process of starting?
>
> Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
> thrashing OSDs could hit this.
>
> Thanks!
> sage
>
>
>>
>> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  
>> wrote:
>>> With 8 successful installs already done, I'm reasonably confident that
>>> it's patch #50.  I'm making another build which applies all patches
>>> from the 3.5 backport branch, excluding that specific one.  I'll let
>>> you know if that turns up any unexpected failures.
>>>
>>> What will the potential fall out be for removing that specific patch?
>>>
>>>
>>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  
>>> wrote:
 It's really looking like it's the
 libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
 patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
  So far I have gone through 4 successful installs with no hang with
 only 1-49 applied.  I'm still leaving my test run to make sure it's

Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Sage Weil
On Fri, 30 Nov 2012, Alex Elder wrote:
> On 11/30/2012 12:49 PM, Nick Bartos wrote:
> > My initial tests using a 3.5.7 kernel with the 55 patches from
> > wip-nick are going well.  So far I've gone through 8 installs without
> > an incident, I'll leave it run for a bit longer to see if it crops up
> > again.
> 
> This is great news!  Now I wonder which of the two fixes took
> care of the problem...
> 
> > Can I get a branch with these patches integrated into all of the
> > backported patches to 3.5.x?  I'd like to get this into our main
> > testing branch, which is currently running 3.5.7 with the patches from
> > wip-3.5 excluding the
> > libceph_resubmit_linger_ops_when_pg_mapping_changes patch.
> 
> I will put together a new branch that includes the remainder
> of those patches for you shortly.
> 
> > Note that we had a case of a rbd map hang with our main testing
> > branch, but I don't have a script that can reproduce that yet.  It was
> > after the cluster was all up and working, and we were  doing a rolling
> > reboot (cycling through each node).
> 
> If you are able to reproduce this please let us know.

It sounds to me like it might be the same problem.  If we're lucky, those 
2 patches will resolve this as well.

(says the optimist!)
sage


> 
>   -Alex
> 
> > 
> > 
> > On Thu, Nov 29, 2012 at 12:37 PM, Alex Elder  wrote:
> >> On 11/22/2012 12:04 PM, Nick Bartos wrote:
> >>> Here are the ceph log messages (including the libceph kernel debug
> >>> stuff you asked for) from a node boot with the rbd command hung for a
> >>> couple of minutes:
> >>
> >> Nick, I have put together a branch that includes two fixes
> >> that might be helpful.  I don't expect these fixes will
> >> necessarily *fix* what you're seeing, but one of them
> >> pulls a big hunk of processing out of the picture and
> >> might help eliminate some potential causes.  I had to
> >> pull in several other patches as prerequisites in order
> >> to get those fixes to apply cleanly.
> >>
> >> Would you be able to give it a try, and let us know what
> >> results you get?  The branch contains:
> >> - Linux 3.5.5
> >> - Plus the first 49 patches you listed
> >> - Plus four patches, which are prerequisites...
> >> libceph: define ceph_extract_encoded_string()
> >> rbd: define some new format constants
> >> rbd: define rbd_dev_image_id()
> >> rbd: kill create_snap sysfs entry
> >> - ...for these two bug fixes:
> >> libceph: remove 'osdtimeout' option
> >> ceph: don't reference req after put
> >>
> >> The branch is available in the ceph-client git repository
> >> under the name "wip-nick" and has commit id dd9323aa.
> >> https://github.com/ceph/ceph-client/tree/wip-nick
> >>
> >>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
> >>
> >> This full debug output is very helpful.  Please supply
> >> that again as well.
> >>
> >> Thanks.
> >>
> >> -Alex
> >>
> >>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos  wrote:
>  It's very easy to reproduce now with my automated install script, the
>  most I've seen it succeed with that patch is 2 in a row, and hanging
>  on the 3rd, although it hangs on most builds.  So it shouldn't take
>  much to get it to do it again.  I'll try and get to that tomorrow,
>  when I'm a bit more rested and my brain is working better.
> 
>  Yes during this the OSDs are probably all syncing up.  All the osd and
>  mon daemons have started by the time the rdb commands are ran, though.
> 
>  On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
> > On Wed, 21 Nov 2012, Nick Bartos wrote:
> >> FYI the build which included all 3.5 backports except patch #50 is
> >> still going strong after 21 builds.
> >
> > Okay, that one at least makes some sense.  I've opened
> >
> > http://tracker.newdream.net/issues/3519
> >
> > How easy is this to reproduce?  If it is something you can trigger with
> > debugging enabled ('echo module libceph +p >
> > /sys/kernel/debug/dynamic_debug/control') that would help tremendously.
> >
> > I'm guessing that during this startup time the OSDs are still in the
> > process of starting?
> >
> > Alex, I bet that a test that does a lot of map/unmap stuff in a loop 
> > while
> > thrashing OSDs could hit this.
> >
> > Thanks!
> > sage
> >
> >
> >>
> >> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  
> >> wrote:
> >>> With 8 successful installs already done, I'm reasonably confident that
> >>> it's patch #50.  I'm making another build which applies all patches
> >>> from the 3.5 backport branch, excluding that specific one.  I'll let
> >>> you know if that turns up any unexpected failures.
> >>>
> >>> What will the potential fall out be for removing that specific patch?
> >>>
> >>>
> >

Re: Hangup during scrubbing - possible solutions

2012-11-30 Thread Samuel Just
Hah!  Thanks for the log, it's our handling of active_pushes.  I'll
have a patch shortly.

Thanks!
-Sam

On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov  wrote:
> http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
> http://xdel.ru/downloads/ceph-log/cluster-w.log.gz
>
> Here, please.
>
> I have initiated a deep-scrub of osd.1 which was lead to forever-stuck
> I/O requests in a short time(scrub `ll do the same). Second log may be
> useful for proper timestamps, as seeks on the original may took a long
> time. Osd processes on the specific node was restarted twice - at the
> beginning to be sure all config options were applied and at the end to
> do same plus to get rid of stuck requests.
>
>
> On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just  wrote:
>> If you can reproduce it again, what we really need are the osd logs
>> from the acting set of a pg stuck in scrub with
>> debug osd = 20
>> debug ms = 1
>> debug filestore = 20.
>>
>> Thanks,
>> -Sam
>>
>> On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov  wrote:
>>> On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil  wrote:
 On Thu, 22 Nov 2012, Andrey Korolyov wrote:
> Hi,
>
> In the recent versions Ceph introduces some unexpected behavior for
> the permanent connections (VM or kernel clients) - after crash
> recovery, I/O will hang on the next planned scrub on the following
> scenario:
>
> - launch a bunch of clients doing non-intensive writes,
> - lose one or more osd, mark them down, wait for recovery completion,
> - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
> or wait for ceph to do the same,
> - observe a raising number of pgs stuck in the active+clean+scrubbing
> state (they took a master role from ones which was on killed osd and
> almost surely they are being written in time of crash),
> - some time later, clients will hang hardly and ceph log introduce
> stuck(old) I/O requests.
>
> The only one way to return clients back without losing their I/O state
> is per-osd restart, which also will help to get rid of
> active+clean+scrubbing pgs.
>
> First of all, I`ll be happy to help to solve this problem by providing
> logs.

 If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
 1' logging on the OSD, that would be wonderful!

>>>
>>> I have tested slightly different recovery flow, please see below.
>>> Since there is no real harm, like frozen I/O, placement groups also
>>> was stuck forever on the active+clean+scrubbing state, until I
>>> restarted all osds (end of the log):
>>>
>>> http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz
>>>
>>> - start the healthy cluster
>>> - start persistent clients
>>> - add an another host with pair of OSDs, let them be in the data placement
>>> - wait for data to rearrange
>>> - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I
>>> have an 1/2 hour delay on readjust in such case, I did ``ceph osd
>>> out'' manually)
>>> - watch for data to rearrange again
>>> - [22:51 timestamp] when it ends, start a manual rescrub, with
>>> non-zero active+clean+scrubbing-state placement groups at the end of
>>> process which `ll stay in this state forever until something happens
>>>
>>> After that, I can restart osds one per one, if I want to get rid of
>>> scrubbing states immediately and then do deep-scrub(if I don`t, those
>>> states will return at next ceph self-scrubbing) or do per-osd
>>> deep-scrub, if I have a lot of time. The case I have described in the
>>> previous message took place when I remove osd from data placement
>>> which existed on the moment when client(s) have started and indeed it
>>> is more harmful than current one(frozen I/O leads to hanging entire
>>> guest, for example). Since testing those flow took a lot of time, I`ll
>>> send logs related to this case tomorrow.
>>>
> Second question is not directly related to this problem, but I
> have thought on for a long time - is there a planned features to
> control scrub process more precisely, e.g. pg scrub rate or scheduled
> scrub, instead of current set of timeouts which of course not very
> predictable on when to run?

 Not yet.  I would be interested in hearing what kind of control/config
 options/whatever you (and others) would like to see!
>>>
>>> Of course it will be awesome to have any determined scheduler or at
>>> least an option to disable automated scrubbing, since it is not very
>>> determined in time and deep-scrub eating a lot of I/O if command
>>> issued against entire OSD. Rate limiting is not in the first place, at
>>> least it may be recreated in external script, but for those who prefer
>>> to leave control to Ceph, it may be very useful.
>>>
>>> Thanks!

Review request: wip-localized-read-tests

2012-11-30 Thread Noah Watkins
I've pushed up patches for the first phase of testing the
read-from-replica functionality, which looks only at objecter/client
level ops:

   wip-localized-read-tests

The major points are:

  1. Run libcephfs tests w/ and w/o localized reads enabled
  2. Add the performance counter in Objecter to record ops sent to replica
  3. Add performance counter accessor in unit tests

Locally I have verified that the performance counters are working with
a 3 OSD setup, although there are not yet any unit tests that try to
specifically assert a positive value on the counters.

Thanks,
Noah


Re: rbd map command hangs for 15 minutes during system start up

2012-11-30 Thread Alex Elder
On 11/29/2012 02:37 PM, Alex Elder wrote:
> On 11/22/2012 12:04 PM, Nick Bartos wrote:
>> Here are the ceph log messages (including the libceph kernel debug
>> stuff you asked for) from a node boot with the rbd command hung for a
>> couple of minutes:

I'm sorry, but I did something stupid...

Yes, the branch I gave you includes these fixes.  However
it does *not* include the commit that was giving you trouble
to begin with.

So...

I have updated that same branch (wip-nick) to contain:
- Linux 3.5.5
- Plus the first *50* (not 49) patches you listed
- Plus the ones I added before.

The new commit id for that branch begins with be3198d6.

I'm really sorry for this mistake.  Please try this new
branch and report back what you find.

-Alex


> Nick, I have put together a branch that includes two fixes
> that might be helpful.  I don't expect these fixes will
> necessarily *fix* what you're seeing, but one of them
> pulls a big hunk of processing out of the picture and
> might help eliminate some potential causes.  I had to
> pull in several other patches as prerequisites in order
> to get those fixes to apply cleanly.
> 
> Would you be able to give it a try, and let us know what
> results you get?  The branch contains:
> - Linux 3.5.5
> - Plus the first 49 patches you listed
> - Plus four patches, which are prerequisites...
> libceph: define ceph_extract_encoded_string()
> rbd: define some new format constants
> rbd: define rbd_dev_image_id()
> rbd: kill create_snap sysfs entry
> - ...for these two bug fixes:
> libceph: remove 'osdtimeout' option
> ceph: don't reference req after put
> 
> The branch is available in the ceph-client git repository
> under the name "wip-nick" and has commit id dd9323aa.
> https://github.com/ceph/ceph-client/tree/wip-nick
> 
>> https://raw.github.com/gist/4132395/7cb5f0150179b012429c6e57749120dd88616cce/gistfile1.txt
> 
> This full debug output is very helpful.  Please supply
> that again as well.
> 
> Thanks.
> 
>   -Alex
> 
>> On Wed, Nov 21, 2012 at 9:49 PM, Nick Bartos  wrote:
>>> It's very easy to reproduce now with my automated install script, the
>>> most I've seen it succeed with that patch is 2 in a row, and hanging
>>> on the 3rd, although it hangs on most builds.  So it shouldn't take
>>> much to get it to do it again.  I'll try and get to that tomorrow,
>>> when I'm a bit more rested and my brain is working better.
>>>
>>> Yes during this the OSDs are probably all syncing up.  All the osd and
>>> mon daemons have started by the time the rdb commands are ran, though.
>>>
>>> On Wed, Nov 21, 2012 at 8:47 PM, Sage Weil  wrote:
 On Wed, 21 Nov 2012, Nick Bartos wrote:
> FYI the build which included all 3.5 backports except patch #50 is
> still going strong after 21 builds.

 Okay, that one at least makes some sense.  I've opened

 http://tracker.newdream.net/issues/3519

 How easy is this to reproduce?  If it is something you can trigger with
 debugging enabled ('echo module libceph +p >
 /sys/kernel/debug/dynamic_debug/control') that would help tremendously.

 I'm guessing that during this startup time the OSDs are still in the
 process of starting?

 Alex, I bet that a test that does a lot of map/unmap stuff in a loop while
 thrashing OSDs could hit this.

 Thanks!
 sage


>
> On Wed, Nov 21, 2012 at 9:34 AM, Nick Bartos  wrote:
>> With 8 successful installs already done, I'm reasonably confident that
>> it's patch #50.  I'm making another build which applies all patches
>> from the 3.5 backport branch, excluding that specific one.  I'll let
>> you know if that turns up any unexpected failures.
>>
>> What will the potential fall out be for removing that specific patch?
>>
>>
>> On Wed, Nov 21, 2012 at 9:02 AM, Nick Bartos  
>> wrote:
>>> It's really looking like it's the
>>> libceph_resubmit_linger_ops_when_pg_mapping_changes commit.  When
>>> patches 1-50 (listed below) are applied to 3.5.7, the hang is present.
>>>  So far I have gone through 4 successful installs with no hang with
>>> only 1-49 applied.  I'm still leaving my test run to make sure it's
>>> not a fluke, but since previously it hangs within the first couple of
>>> builds, it really looks like this is where the problem originated.
>>>
>>> 1-libceph_eliminate_connection_state_DEAD.patch
>>> 2-libceph_kill_bad_proto_ceph_connection_op.patch
>>> 3-libceph_rename_socket_callbacks.patch
>>> 4-libceph_rename_kvec_reset_and_kvec_add_functions.patch
>>> 5-libceph_embed_ceph_messenger_structure_in_ceph_client.patch
>>> 6-libceph_start_separating_connection_flags_from_state.patch
>>> 7-libceph_start_tracking_connection_socket_state.patch
>>> 8-libceph_provide_osd_number_when_creating_osd.patch
>>> 9-libc

librbd: error finding header: (2) No such file or directory

2012-11-30 Thread Simon Frerichs | Fremaks GmbH

Hi,

we are starting to see this error on some images:

-> rbd info kvm1207
error opening image kvm1207: (2) No such file or directory
2012-12-01 02:58:27.556677 7ffd50c60760 -1 librbd: error finding header: 
(2) No such file or directory


Is there any way to fix these images?

Best regards,
Simon


Re: endless flying slow requests

2012-11-30 Thread Samuel Just
I've pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
Let me know if the problem persists with this patch.
-Sam

On Wed, Nov 28, 2012 at 2:04 PM, Andrey Korolyov  wrote:
> On Thu, Nov 29, 2012 at 1:12 AM, Samuel Just  wrote:
>> Also, these clusters aren't mixed argonaut and next, are they?  (Not
>> that that shouldn't work, but it would be a useful data point.)
>> -Sam
>>
>> On Wed, Nov 28, 2012 at 1:11 PM, Samuel Just  wrote:
>>> Did you observe hung io along with that error?  Both sub_op_commit and
>>> sub_op_applied have happened, so the sub_op_reply should have been
>>> sent back to the primary.  This looks more like a leak.  If you also
>>> observed hung io, then it's possible that the problem is occurring
>>> between the sub_op_applied event and the response.
>>> -Sam
>>>
>
> It is relatively easy to check if one of client VMs has locked one or
> more cores to iowait or just hangs, so yes, these ops are related to
> real commit operations and they are hanged.
> I`m using all-new 0.54 cluster, without mixing of course. Does
> everyone who hit that bug readjusted cluster before bug shows
> itself(say, in a day-long distance)?
>
>>> On Tue, Nov 27, 2012 at 11:47 PM, Andrey Korolyov  wrote:
 On Wed, Nov 28, 2012 at 5:51 AM, Sage Weil  wrote:
> Hi Stefan,
>
> On Thu, 15 Nov 2012, Sage Weil wrote:
>> On Thu, 15 Nov 2012, Stefan Priebe - Profihost AG wrote:
>> > Am 14.11.2012 15:59, schrieb Sage Weil:
>> > > Hi Stefan,
>> > >
>> > > I would be nice to confirm that no clients are waiting on replies for
>> > > these requests; currently we suspect that the OSD request tracking 
>> > > is the
>> > > buggy part.  If you query the OSD admin socket you should be able to 
>> > > dump
>> > > requests and see the client IP, and then query the client.
>> > >
>> > > Is it librbd?  In that case you likely need to change the config so 
>> > > that
>> > > it is listening on an admin socket ('admin socket = path').
>> >
>> > Yes it is. So i have to specify admin socket at the KVM host?
>>
>> Right.  IIRC the disk line is a ; (or \;) separated list of key/value
>> pairs.
>>
>> > How do i query the admin socket for requests?
>>
>> ceph --admin-daemon /path/to/socket help
>> ceph --admin-daemon /path/to/socket objecter_dump (i think)
>
> Were you able to reproduce this?
>
> Thanks!
> sage

 Meanwhile, I did. :)
 Such requests will always be created if you have restarted or marked
 an osd out and then back in and scrub didn`t happen in the meantime
 (after such operation and before request arrival).
 What is more interesting, the hangup happens not exactly at the time
 of operation, but tens of minutes later.

 { "description": "osd_sub_op(client.1292013.0:45422 4.731
 a384cf31\/rbd_data.1415fb1075f187.00a7\/head\/\/4 [] v
 16444'21693 snapset=0=[]:[] snapc=0=[])",
   "received_at": "2012-11-28 03:54:43.094151",
   "age": "27812.942680",
   "duration": "2.676641",
   "flag_point": "started",
   "events": [
 { "time": "2012-11-28 03:54:43.094222",
   "event": "waiting_for_osdmap"},
 { "time": "2012-11-28 03:54:43.386890",
   "event": "reached_pg"},
 { "time": "2012-11-28 03:54:43.386894",
   "event": "started"},
 { "time": "2012-11-28 03:54:43.386973",
   "event": "commit_queued_for_journal_write"},
 { "time": "2012-11-28 03:54:45.360049",
   "event": "write_thread_in_journal_buffer"},
 { "time": "2012-11-28 03:54:45.586183",
   "event": "journaled_completion_queued"},
 { "time": "2012-11-28 03:54:45.586262",
   "event": "sub_op_commit"},
 { "time": "2012-11-28 03:54:45.770792",
   "event": "sub_op_applied"}]}]}


>
>
>>
>> sage
>>
>> >
>> > Stefan
>> >
>> >
>> > > On Wed, 14 Nov 2012, Stefan Priebe - Profihost AG wrote:
>> > >
>> > > > Hello list,
>> > > >
>> > > > i see this several times. Endless flying slow requests. And they 
>> > > > never
>> > > > stop
>> > > > until i restart the mentioned osd.
>> > > >
>> > > > 2012-11-14 10:11:57.513395 osd.24 [WRN] 1 slow requests, 1 
>> > > > included below;
>> > > > oldest blocked for > 31789.858457 secs
>> > > > 2012-11-14 10:11:57.513399 osd.24 [WRN] slow request 31789.858457 
>> > > > seconds
>> > > > old,
>> > > > received at 2012-11-14 01:22:07.654922: osd_op(client.30286.0:6719
>> > > > rbd_data.75c55bf2fdd7.1399 [write 282624~4096] 
>> > > > 3.3f6d2373) v4
>> > 

Re: Hangup during scrubbing - possible solutions

2012-11-30 Thread Samuel Just
Just pushed a fix to next, 49f32cee647c5bd09f36ba7c9fd4f481a697b9d7.
Let me know if it persists.  Thanks for the logs!
-Sam

On Fri, Nov 30, 2012 at 2:04 PM, Samuel Just  wrote:
> Hah!  Thanks for the log, it's our handling of active_pushes.  I'll
> have a patch shortly.
>
> Thanks!
> -Sam
>
> On Fri, Nov 30, 2012 at 4:14 AM, Andrey Korolyov  wrote:
>> http://xdel.ru/downloads/ceph-log/ceph-scrub-stuck.log.gz
>> http://xdel.ru/downloads/ceph-log/cluster-w.log.gz
>>
>> Here, please.
>>
>> I have initiated a deep-scrub of osd.1 which was lead to forever-stuck
>> I/O requests in a short time(scrub `ll do the same). Second log may be
>> useful for proper timestamps, as seeks on the original may took a long
>> time. Osd processes on the specific node was restarted twice - at the
>> beginning to be sure all config options were applied and at the end to
>> do same plus to get rid of stuck requests.
>>
>>
>> On Wed, Nov 28, 2012 at 5:35 AM, Samuel Just  wrote:
>>> If you can reproduce it again, what we really need are the osd logs
>>> from the acting set of a pg stuck in scrub with
>>> debug osd = 20
>>> debug ms = 1
>>> debug filestore = 20.
>>>
>>> Thanks,
>>> -Sam
>>>
>>> On Sun, Nov 25, 2012 at 2:08 PM, Andrey Korolyov  wrote:
 On Fri, Nov 23, 2012 at 12:35 AM, Sage Weil  wrote:
> On Thu, 22 Nov 2012, Andrey Korolyov wrote:
>> Hi,
>>
>> In the recent versions Ceph introduces some unexpected behavior for
>> the permanent connections (VM or kernel clients) - after crash
>> recovery, I/O will hang on the next planned scrub on the following
>> scenario:
>>
>> - launch a bunch of clients doing non-intensive writes,
>> - lose one or more osd, mark them down, wait for recovery completion,
>> - do a slow scrub, e.g. scrubbing one osd per 5m, inside bash script,
>> or wait for ceph to do the same,
>> - observe a raising number of pgs stuck in the active+clean+scrubbing
>> state (they took a master role from ones which was on killed osd and
>> almost surely they are being written in time of crash),
>> - some time later, clients will hang hardly and ceph log introduce
>> stuck(old) I/O requests.
>>
>> The only one way to return clients back without losing their I/O state
>> is per-osd restart, which also will help to get rid of
>> active+clean+scrubbing pgs.
>>
>> First of all, I`ll be happy to help to solve this problem by providing
>> logs.
>
> If you can reproduce this behavior with 'debug osd = 20' and 'debug ms =
> 1' logging on the OSD, that would be wonderful!
>

 I have tested slightly different recovery flow, please see below.
 Since there is no real harm, like frozen I/O, placement groups also
 was stuck forever on the active+clean+scrubbing state, until I
 restarted all osds (end of the log):

 http://xdel.ru/downloads/ceph-log/recover-clients-later-than-osd.txt.gz

 - start the healthy cluster
 - start persistent clients
 - add an another host with pair of OSDs, let them be in the data placement
 - wait for data to rearrange
 - [22:06 timestamp] mark OSDs out or simply kill them and wait(since I
 have an 1/2 hour delay on readjust in such case, I did ``ceph osd
 out'' manually)
 - watch for data to rearrange again
 - [22:51 timestamp] when it ends, start a manual rescrub, with
 non-zero active+clean+scrubbing-state placement groups at the end of
 process which `ll stay in this state forever until something happens

 After that, I can restart osds one per one, if I want to get rid of
 scrubbing states immediately and then do deep-scrub(if I don`t, those
 states will return at next ceph self-scrubbing) or do per-osd
 deep-scrub, if I have a lot of time. The case I have described in the
 previous message took place when I remove osd from data placement
 which existed on the moment when client(s) have started and indeed it
 is more harmful than current one(frozen I/O leads to hanging entire
 guest, for example). Since testing those flow took a lot of time, I`ll
 send logs related to this case tomorrow.

>> Second question is not directly related to this problem, but I
>> have thought on for a long time - is there a planned features to
>> control scrub process more precisely, e.g. pg scrub rate or scheduled
>> scrub, instead of current set of timeouts which of course not very
>> predictable on when to run?
>
> Not yet.  I would be interested in hearing what kind of control/config
> options/whatever you (and others) would like to see!

 Of course it will be awesome to have any determined scheduler or at
 least an option to disable automated scrubbing, since it is not very
 determined in time and deep-scrub eating a lot of I/O if command
 issued against entire OSD. Rate limiting is not in the first place, at
 least it may be recr