from:"Alex Elder"

Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()

2015-10-19 Thread Alex Elder

On 10/18/2015 09:49 PM, Shraddha Barke wrote:
> Use local variable cursor in place of >cursor in
>  read_partial_msg_data()
> 
> Signed-off-by: Shraddha Barke 

This is a pretty minor comment, but the "Net" in your
subject line is probably better *not* capitalized.

-Alex

> ---
> Changes in v2-
>  Drop incorrect use of cursor
> 
>  net/ceph/messenger.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index b9b0e3b..b087edd 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>   if (do_datacrc)
>   crc = con->in_data_crc;
>   while (cursor->resid) {
> - page = ceph_msg_data_next(>cursor, _offset, ,
> + page = ceph_msg_data_next(cursor, _offset, ,
>   NULL);
>   ret = ceph_tcp_recvpage(con->sock, page, page_offset, length);
>   if (ret <= 0) {
> @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection 
> *con)
>  
>   if (do_datacrc)
>   crc = ceph_crc32c_page(crc, page, page_offset, ret);
> - (void) ceph_msg_data_advance(>cursor, (size_t)ret);
> + (void) ceph_msg_data_advance(cursor, (size_t)ret);
>   }
>   if (do_datacrc)
>   con->in_data_crc = crc;
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Fwd: Re: [PATCH] rbd: set max_sectors explicitly

2015-10-16 Thread Alex Elder

Forgot to "Reply-all".  -Alex


 Forwarded Message 
Subject: Re: [PATCH] rbd: set max_sectors explicitly
Date: Fri, 16 Oct 2015 07:22:55 -0500
From: Alex Elder <el...@ieee.org>
To: Ilya Dryomov <idryo...@gmail.com>

On 10/07/2015 12:00 PM, Ilya Dryomov wrote:
> Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors
> cap"") restored a clamp on max_sectors.  It's now 2560 sectors instead
> of 1024, but it's not good enough: we set max_hw_sectors to rbd object
> size because we don't want object sized I/Os to be split, and the
> default object size is 4M.
> 
> So, set max_sectors to max_hw_sectors in rbd at queue init time.
> 
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  drivers/block/rbd.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 05072464d25e..04e69b4df664 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3760,6 +3760,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>   /* set io sizes to object size */
>   segment_size = rbd_obj_bytes(_dev->header);
>   blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
> + q->limits.max_sectors = queue_max_hw_sectors(q);

This currently is done by default by blk_queue_max_hw_sectors().
Do you see any different behavior with this patch in place?

This change seems at least harmless so if it's not too late:

Reviewed-by: Alex Elder <el...@linaro.org>

>   blk_queue_max_segments(q, segment_size / SECTOR_SIZE);
>   blk_queue_max_segment_size(q, segment_size);
>   blk_queue_io_min(q, segment_size);
> 



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: use writefull op for object size writes

2015-10-16 Thread Alex Elder

On 10/07/2015 12:02 PM, Ilya Dryomov wrote:
> This covers only the simplest case - an object size sized write, but
> it's still useful in tiering setups when EC is used for the base tier
> as writefull op can be proxied, saving an object promotion.
> 
> Even though updating ceph_osdc_new_request() to allow writefull should
> just be a matter of fixing an assert, I didn't do it because its only
> user is cephfs.  All other sites were updated.
> 
> Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b.
> 
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>

Looks good to me.

Reviewed-by: Alex Elder <el...@linaro.org>

> ---
>  drivers/block/rbd.c   |  9 +++--
>  net/ceph/osd_client.c | 13 +
>  2 files changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 04e69b4df664..cd00e4653e49 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1863,9 +1863,11 @@ static void rbd_osd_req_callback(struct 
> ceph_osd_request *osd_req,
>   rbd_osd_read_callback(obj_request);
>   break;
>   case CEPH_OSD_OP_SETALLOCHINT:
> - rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE);
> + rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE ||
> +osd_req->r_ops[1].op == CEPH_OSD_OP_WRITEFULL);
>   /* fall through */
>   case CEPH_OSD_OP_WRITE:
> + case CEPH_OSD_OP_WRITEFULL:
>   rbd_osd_write_callback(obj_request);
>   break;
>   case CEPH_OSD_OP_STAT:
> @@ -2401,7 +2403,10 @@ static void rbd_img_obj_request_fill(struct 
> rbd_obj_request *obj_request,
>   opcode = CEPH_OSD_OP_ZERO;
>   }
>   } else if (op_type == OBJ_OP_WRITE) {
> - opcode = CEPH_OSD_OP_WRITE;
> + if (!offset && length == object_size)
> + opcode = CEPH_OSD_OP_WRITEFULL;
> + else
> + opcode = CEPH_OSD_OP_WRITE;
>   osd_req_op_alloc_hint_init(osd_request, num_ops,
>   object_size, object_size);
>   num_ops++;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 80b94e37c94a..f79ccac6699f 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -285,6 +285,7 @@ static void osd_req_op_data_release(struct 
> ceph_osd_request *osd_req,
>   switch (op->op) {
>   case CEPH_OSD_OP_READ:
>   case CEPH_OSD_OP_WRITE:
> + case CEPH_OSD_OP_WRITEFULL:
>   ceph_osd_data_release(>extent.osd_data);
>   break;
>   case CEPH_OSD_OP_CALL:
> @@ -485,13 +486,14 @@ void osd_req_op_extent_init(struct ceph_osd_request 
> *osd_req,
>   size_t payload_len = 0;
>  
>   BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE &&
> -opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE);
> +opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO &&
> +opcode != CEPH_OSD_OP_TRUNCATE);
>  
>   op->extent.offset = offset;
>   op->extent.length = length;
>   op->extent.truncate_size = truncate_size;
>   op->extent.truncate_seq = truncate_seq;
> - if (opcode == CEPH_OSD_OP_WRITE)
> + if (opcode == CEPH_OSD_OP_WRITE || opcode == CEPH_OSD_OP_WRITEFULL)
>   payload_len += length;
>  
>   op->payload_len = payload_len;
> @@ -670,9 +672,11 @@ static u64 osd_req_encode_op(struct ceph_osd_request 
> *req,
>   break;
>   case CEPH_OSD_OP_READ:
>   case CEPH_OSD_OP_WRITE:
> + case CEPH_OSD_OP_WRITEFULL:
>   case CEPH_OSD_OP_ZERO:
>   case CEPH_OSD_OP_TRUNCATE:
> - if (src->op == CEPH_OSD_OP_WRITE)
> + if (src->op == CEPH_OSD_OP_WRITE ||
> + src->op == CEPH_OSD_OP_WRITEFULL)
>   request_data_len = src->extent.length;
>   dst->extent.offset = cpu_to_le64(src->extent.offset);
>   dst->extent.length = cpu_to_le64(src->extent.length);
> @@ -681,7 +685,8 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
>   dst->extent.truncate_seq =
>   cpu_to_le32(src->extent.truncate_seq);
>   osd_data = >extent.osd_data;
> - if (src->op == CEPH_OSD_OP_WRITE)
> + if (src->op == CEPH_OSD_OP_WRITE ||
> + src->op == CEPH_OSD_OP_WRITEFULL)
>   ceph_osdc_msg_data_add(req->r_request, osd_data);
>   else
>   ceph_osdc_msg_data_add(req->r_reply, osd_data);
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-16 Thread Alex Elder

On 10/16/2015 04:50 AM, Ilya Dryomov wrote:
> On Thu, Oct 15, 2015 at 11:10 PM, Alex Elder <el...@ieee.org> wrote:
>> On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
>>> Currently we leak parent_spec and trigger a "parent reference
>>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
>>> The problem is we take the !parent out_err branch and that only drops
>>> refcounts; parent_spec that would've been freed had we called
>>> rbd_dev_unparent() remains and triggers rbd_warn() in
>>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
>>> parent_ref == 0, so counter ends up being -1 after the decrement.
>>
>> OK, now that I understand the context...
>>
>> You now get extra references for the spec and client
>> for the parent only after creating the parent device.
>> I think the reason they logically belonged before
>> the call to rbd_device_create() for the parent is
>> because the client and spec pointers passed to that
>> function carry with them references that are passed
>> to the resulting rbd_device if successful.
> 
> Let me stress that those two get()s that I moved is not the point of
> the patch at all.  Whether we do it before or after rbd_dev_create() is
> pretty much irrelevant, the only reason I moved those calls is to make
> the error path simpler.

Yes I understand that.

>> If creating the parent device fails, you unparent the
>> original device, which will still have a null parent
>> pointer.  The effect of unparenting in this case is
>> dropping a reference to the parent's spec, and clearing
>> the device's pointer to it.  This is confusing, but
>> let's run with it.
>>
>> If creating the parent device succeeds, references to
>> the client and parent spec are taken (basically, these
>> belong to the just-created parent device).  The parent
>> image is now probed.  If this fails, you again
>> unparent the device.  We still have not set the
>> device's parent pointer, so the effect is as before,
>> dropping the parent spec reference and clearing
>> the device's reference to it.  The error handling
>> now destroys the parent, which drops references to
>> its client and the spec.  Again, this seems
>> confusing, as if we've dropped one more reference
>> to the parent spec than desired.
>>
>> This logic now seems to work.  But it's working
>> around a flaw in the caller I think.  Upon entry
>> to rbd_dev_probe_parent(), a layered device will
>> have have a non-null parent_spec pointer (and a
>> reference to it), which will have been filled in
>> by rbd_dev_v2_parent_info().
>>
>> Really, it should not be rbd_dev_probe_parent()
>> that drops that parent spec reference on error.
>> Instead, rbd_dev_image_probe() (which got the
>> reference to the parent spec) should handle
>> cleaning up the device's parent spec if an
>> error occurs after it has been assigned.
>>
>> I'll wait for your response, I'd like to know
>> if what I'm saying makes sense.
> 
> All of the above makes sense.  I agree that it is asymmetrical and that
> it would have been better to have rbd_dev_image_probe() drop the last
> ref on ->parent_spec.  But, that's not what all of the existing code is
> doing.

Agreed.

> Consider rbd_dev_probe_parent() without my patch.  There are two
> out_err jumps.  As it is, if rbd_dev_create() fails, we only revert
> those two get()s and return with alive ->parent_spec.  However, if
> rbd_dev_image_probe() fails, we call rbd_dev_unparent(), which will put
> ->parent_spec.  Really, the issue is rbd_dev_unparent(), which not only
> removes the parent rbd_dev, but also the parent spec.  All I did with
> this patch was align both out_err cases to do the same, namely undo the
> effects of rbd_dev_v2_parent_info() - I didn't want to create yet
> another error path.

OK.  Walking through the code I did concluded that what
you did made things "line up" properly.  But I just felt
like it wasn't necessarily the *right* fix, but it was
a correct fix.

The only point in suggesting what I think would be a better
fix is that it would make things easier to reason about
and get correct, and in so doing, make it easier to
maintain and adapt in the future.

But I have no objection to your fix, so please accept
this for your original patch:

Reviewed-by: Alex Elder <el...@linaro.org>




> Thanks,
> 
> Ilya
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-15 Thread Alex Elder

On 10/15/2015 01:31 PM, Ilya Dryomov wrote:
> On Thu, Oct 15, 2015 at 7:10 PM, Alex Elder <el...@ieee.org> wrote:
>> On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
>>> Currently we leak parent_spec and trigger a "parent reference
>>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
>>> The problem is we take the !parent out_err branch and that only drops
>>> refcounts; parent_spec that would've been freed had we called
>>> rbd_dev_unparent() remains and triggers rbd_warn() in
>>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
>>> parent_ref == 0, so counter ends up being -1 after the decrement.
>>>
>>> Redo rbd_dev_probe_parent() to fix this.
>>
>> I'm just starting to look at this.
>>
>> My up-front problem is that I want to understand why
>> it's OK to no longer grab references to the parent spec
>> and the client before creating the new parent device.
>> Is it simply because the rbd_dev that's the subject
>> of this call is already holding a reference to both,
>> so no need to get another?  I think that's right, but
>> you should say that in the explanation.  It's a
>> change that's independent of the other change (which
>> you describe).
> 
> I still grab references, I just moved that code after rbd_dev_create().
> Fundamentally rbd_dev_create() is just a memory allocation, so whether
> we acquire references before or after doesn't matter, except that
> acquiring them before complicates the error path.
> 
>>
>> OK, onto the change you describe.
>>
>> You say that we get the underflow if rbd_dev_create() fails.
>>
>> At that point in the function, we have:
>> parent == NULL
>> rbd_dev->parent_spec != NULL
>> parent_spec holds a new reference to that
>> rbdc holds a new reference to the client
>>
>> rbd_dev_create() only returns NULL if the kzalloc() fails.
>> And if so, we jump to out_err, take the !parent branch,
>> and we drop the references we took in rbdc and parent_spec
>> before returning.
>>
>> Where is the leak?  (Actually, underflow means we're
>> dropping more references than we took.)

OK, I'm still reviewing your patch, but I have another
comment.

> We enter rbd_dev_probe_parent().  If ->parent_spec != NULL (and
> therefore has a refcount of 1), we bump refcounts and proceed to
> allocating parent rbd_dev.  If rbd_dev_create() fails, we drop
> refcounts and exit rbd_dev_probe_parent() with ->parent_spec != NULL
> and a refcount of 1 on that spec.  We take err_out_probe in

This is exactly as it should be.  On error, we exit rbd_dev_create()
with things in exactly the state they were in when it was called.

> rbd_dev_image_probe() and through rbd_dev_unprobe() end up in
> rbd_dev_parent_put().  Now, ->parent_spec != NULL but ->parent_ref == 0
> because we never got to atomic_set(_dev->parent_ref, 1) in

That's where the faulty logic lies.  We set the parent_spec
before we set up the parent, but we update parent_ref only
after the parent is recorded.  And we later assume a non-null
parent_spec means we can decrement parent_ref, which in this
case is not a valid assumption.

I'll send my review comments shortly.  I think you have fixed
a bug but I think it's the wrong fix.

-Alex

> rbd_dev_probe_parent().  Local counter ends up -1 after the decrement
> and so instead of calling rbd_dev_unparent() which would have freed
> ->parent_spec we issue an underflow warning and bail.  ->parent_spec is
> leaked.
> 
> Thanks,
> 
> Ilya
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-15 Thread Alex Elder

On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
> Currently we leak parent_spec and trigger a "parent reference
> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
> The problem is we take the !parent out_err branch and that only drops
> refcounts; parent_spec that would've been freed had we called
> rbd_dev_unparent() remains and triggers rbd_warn() in
> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
> parent_ref == 0, so counter ends up being -1 after the decrement.

OK, now that I understand the context...

You now get extra references for the spec and client
for the parent only after creating the parent device.
I think the reason they logically belonged before
the call to rbd_device_create() for the parent is
because the client and spec pointers passed to that
function carry with them references that are passed
to the resulting rbd_device if successful.

If creating the parent device fails, you unparent the
original device, which will still have a null parent
pointer.  The effect of unparenting in this case is
dropping a reference to the parent's spec, and clearing
the device's pointer to it.  This is confusing, but
let's run with it.

If creating the parent device succeeds, references to
the client and parent spec are taken (basically, these
belong to the just-created parent device).  The parent
image is now probed.  If this fails, you again
unparent the device.  We still have not set the
device's parent pointer, so the effect is as before,
dropping the parent spec reference and clearing
the device's reference to it.  The error handling
now destroys the parent, which drops references to
its client and the spec.  Again, this seems
confusing, as if we've dropped one more reference
to the parent spec than desired.

This logic now seems to work.  But it's working
around a flaw in the caller I think.  Upon entry
to rbd_dev_probe_parent(), a layered device will
have have a non-null parent_spec pointer (and a
reference to it), which will have been filled in
by rbd_dev_v2_parent_info().

Really, it should not be rbd_dev_probe_parent()
that drops that parent spec reference on error.
Instead, rbd_dev_image_probe() (which got the
reference to the parent spec) should handle
cleaning up the device's parent spec if an
error occurs after it has been assigned.

I'll wait for your response, I'd like to know
if what I'm saying makes sense.

-Alex

> Redo rbd_dev_probe_parent() to fix this.
> 
> Signed-off-by: Ilya Dryomov 
> ---
>  drivers/block/rbd.c | 36 
>  1 file changed, 16 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index cd00e4653e49..ccbc3cbbf24e 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -5134,41 +5134,37 @@ out_err:
>  static int rbd_dev_probe_parent(struct rbd_device *rbd_dev)
>  {
>   struct rbd_device *parent = NULL;
> - struct rbd_spec *parent_spec;
> - struct rbd_client *rbdc;
>   int ret;
>  
>   if (!rbd_dev->parent_spec)
>   return 0;
> - /*
> -  * We need to pass a reference to the client and the parent
> -  * spec when creating the parent rbd_dev.  Images related by
> -  * parent/child relationships always share both.
> -  */
> - parent_spec = rbd_spec_get(rbd_dev->parent_spec);
> - rbdc = __rbd_get_client(rbd_dev->rbd_client);
>  
> - ret = -ENOMEM;
> - parent = rbd_dev_create(rbdc, parent_spec, NULL);
> - if (!parent)
> + parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec,
> + NULL);
> + if (!parent) {
> + ret = -ENOMEM;
>   goto out_err;
> + }
> +
> + /*
> +  * Images related by parent/child relationships always share
> +  * rbd_client and spec/parent_spec, so bump their refcounts.
> +  */
> + __rbd_get_client(rbd_dev->rbd_client);
> + rbd_spec_get(rbd_dev->parent_spec);
>  
>   ret = rbd_dev_image_probe(parent, false);
>   if (ret < 0)
>   goto out_err;
> +
>   rbd_dev->parent = parent;
>   atomic_set(_dev->parent_ref, 1);
> -
>   return 0;
> +
>  out_err:
> - if (parent) {
> - rbd_dev_unparent(rbd_dev);
> + rbd_dev_unparent(rbd_dev);
> + if (parent)
>   rbd_dev_destroy(parent);
> - } else {
> - rbd_put_client(rbdc);
> - rbd_spec_put(parent_spec);
> - }
> -
>   return ret;
>  }
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-10-15 Thread Alex Elder

On 10/15/2015 01:31 PM, Ilya Dryomov wrote:
> On Thu, Oct 15, 2015 at 7:10 PM, Alex Elder <el...@ieee.org> wrote:
>> On 10/11/2015 01:03 PM, Ilya Dryomov wrote:
>>> Currently we leak parent_spec and trigger a "parent reference
>>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
>>> The problem is we take the !parent out_err branch and that only drops
>>> refcounts; parent_spec that would've been freed had we called
>>> rbd_dev_unparent() remains and triggers rbd_warn() in
>>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and
>>> parent_ref == 0, so counter ends up being -1 after the decrement.
>>>
>>> Redo rbd_dev_probe_parent() to fix this.
>>
>> I'm just starting to look at this.
>>
>> My up-front problem is that I want to understand why
>> it's OK to no longer grab references to the parent spec
>> and the client before creating the new parent device.
>> Is it simply because the rbd_dev that's the subject
>> of this call is already holding a reference to both,
>> so no need to get another?  I think that's right, but
>> you should say that in the explanation.  It's a
>> change that's independent of the other change (which
>> you describe).
> 
> I still grab references, I just moved that code after rbd_dev_create().
> Fundamentally rbd_dev_create() is just a memory allocation, so whether
> we acquire references before or after doesn't matter, except that
> acquiring them before complicates the error path.

OK, I accept that.  But looking at the patch alone
it made me wonder whether grabbing the references
was necessary before creating the parent device.

I guess the main point is you should have mentioned
it in your description.

I'm going to take another look at the original patch,
now that I have your explanation (below).  Thanks.

-Alex

>> OK, onto the change you describe.
>>
>> You say that we get the underflow if rbd_dev_create() fails.
>>
>> At that point in the function, we have:
>> parent == NULL
>> rbd_dev->parent_spec != NULL
>> parent_spec holds a new reference to that
>> rbdc holds a new reference to the client
>>
>> rbd_dev_create() only returns NULL if the kzalloc() fails.
>> And if so, we jump to out_err, take the !parent branch,
>> and we drop the references we took in rbdc and parent_spec
>> before returning.
>>
>> Where is the leak?  (Actually, underflow means we're
>> dropping more references than we took.)
> 
> We enter rbd_dev_probe_parent().  If ->parent_spec != NULL (and
> therefore has a refcount of 1), we bump refcounts and proceed to
> allocating parent rbd_dev.  If rbd_dev_create() fails, we drop
> refcounts and exit rbd_dev_probe_parent() with ->parent_spec != NULL
> and a refcount of 1 on that spec.  We take err_out_probe in
> rbd_dev_image_probe() and through rbd_dev_unprobe() end up in
> rbd_dev_parent_put().  Now, ->parent_spec != NULL but ->parent_ref == 0
> because we never got to atomic_set(_dev->parent_ref, 1) in
> rbd_dev_probe_parent().  Local counter ends up -1 after the decrement
> and so instead of calling rbd_dev_unparent() which would have freed
> ->parent_spec we issue an underflow warning and bail.  ->parent_spec is
> leaked.
> 
> Thanks,
> 
> Ilya
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: check data_len in ->alloc_msg()

2015-09-08 Thread Alex Elder

On 09/02/2015 12:33 PM, Ilya Dryomov wrote:
> Only ->alloc_msg() should check data_len of the incoming message
> against the preallocated ceph_msg, doing it in the messenger is not
> right.  The contract is that either ->alloc_msg() returns a ceph_msg
> which will fit all of the portions of the incoming message, or it
> returns NULL and possibly sets skip, signaling whether NULL is due to
> an -ENOMEM.  ->alloc_msg() should be the only place where we make the
> skip/no-skip decision.
> 
> I stumbled upon this while looking at con/osd ref counting.  Right now,
> if we get a non-extent message with a larger data portion than we are
> prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip
> it in the messenger, we don't put the con/osd ref acquired in
> ceph_con_in_msg_alloc() (which is normally put in process_message()),
> so this also fixes a memory leak.

Sorry for the delay reviewing this.

This looks good.  In a follow-on patch you could update the
comments at the top of ceph_con_in_msg_alloc() to suggest
that *skip may be set or not when this function returns,
without saying "if we set" (which says to me that this
function sets it directly).  Minor, but I think it could
be stated a little more clearly.  That comment block also
talks about a connection's "alloc_msg op if available"
but we assert that pointer is non-null, so that doesn't
really sound right either.

> An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't
> corrupt random memory should a buggy ->alloc_msg() return an unfit
> ceph_msg.

So this will catch the problem, just a little later.

Reviewed-by: Alex Elder <el...@linaro.org>

> While at it, I changed the "unknown tid" dout() to a pr_warn() to make
> sure all skips are seen and unified format strings.
> 
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  net/ceph/messenger.c  |  7 ---
>  net/ceph/osd_client.c | 51 
> ++-
>  2 files changed, 18 insertions(+), 40 deletions(-)
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 36757d46ac40..525f454f7531 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2337,13 +2337,6 @@ static int read_partial_message(struct ceph_connection 
> *con)
>   return ret;
>  
>   BUG_ON(!con->in_msg ^ skip);
> - if (con->in_msg && data_len > con->in_msg->data_length) {
> - pr_warn("%s skipping long message (%u > %zd)\n",
> - __func__, data_len, con->in_msg->data_length);
> - ceph_msg_put(con->in_msg);
> - con->in_msg = NULL;
> - skip = 1;
> - }
>   if (skip) {
>   /* skip this message */
>   dout("alloc_msg said skip message\n");
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 50033677c0fa..80b94e37c94a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -2817,8 +2817,9 @@ out:
>  }
>  
>  /*
> - * lookup and return message for incoming reply.  set up reply message
> - * pages.
> + * Lookup and return message for incoming reply.  Don't try to do
> + * anything about a larger than preallocated data portion of the
> + * message at the moment - for now, just skip the message.
>   */
>  static struct ceph_msg *get_reply(struct ceph_connection *con,
> struct ceph_msg_header *hdr,
> @@ -2836,10 +2837,10 @@ static struct ceph_msg *get_reply(struct 
> ceph_connection *con,
>   mutex_lock(>request_mutex);
>   req = __lookup_request(osdc, tid);
>   if (!req) {
> - *skip = 1;
> + pr_warn("%s osd%d tid %llu unknown, skipping\n",
> + __func__, osd->o_osd, tid);
>   m = NULL;
> - dout("get_reply unknown tid %llu from osd%d\n", tid,
> -  osd->o_osd);
> + *skip = 1;
>   goto out;
>   }
>  
> @@ -2849,10 +2850,9 @@ static struct ceph_msg *get_reply(struct 
> ceph_connection *con,
>   ceph_msg_revoke_incoming(req->r_reply);
>  
>   if (front_len > req->r_reply->front_alloc_len) {
> - pr_warn("get_reply front %d > preallocated %d (%u#%llu)\n",
> - front_len, req->r_reply->front_alloc_len,
> - (unsigned int)con->peer_name.type,
> - le64_to_cpu(con->peer_name.num));
> + pr_warn("%s osd%d tid %llu front %d > prea

Re: [PATCH] rbd: plug rbd_dev->header.object_prefix memory leak

2015-09-01 Thread Alex Elder

On 08/31/2015 12:02 PM, Ilya Dryomov wrote:
> Need to free object_prefix when rbd_dev_v2_snap_context() fails, but
> only if this is the first time we are reading in the header.
> 
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>

Yes this makes sense.  It might be nice to encapsulate
some of this, something like rbd_dev_header_info_release().
As things stand, freeing of the object prefix appears
sort of random.  It wasn't easy to follow the life
cycle of that field doing a quick scan of the code for
when it's set and cleared.

Anyway, looks good.

Reviewed-by: Alex Elder <el...@linaro.org>

> ---
>  drivers/block/rbd.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 324bf35ec4dd..69d03aa46d0d 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4720,7 +4720,10 @@ static int rbd_dev_v2_header_info(struct rbd_device 
> *rbd_dev)
>   }
>  
>   ret = rbd_dev_v2_snap_context(rbd_dev);
> - dout("rbd_dev_v2_snap_context returned %d\n", ret);
> + if (ret && first_time) {
> + kfree(rbd_dev->header.object_prefix);
> + rbd_dev->header.object_prefix = NULL;
> + }
>  
>   return ret;
>  }
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: fix double free on rbd_dev->header_name

2015-09-01 Thread Alex Elder

On 08/31/2015 07:47 AM, Ilya Dryomov wrote:
> If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name
> is freed twice: once in rbd_dev_probe_parent() and then in its caller
> rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to
> handle parent images).
> 
> rbd_dev_probe_parent() is responsible for probing the parent, so it
> shoudn't mock with clone's fields.

Agreed.  (But I think you mean "muck with.")  The other
argument is that the caller is who allocated it (via
rbd_dev_header_name()), so it should be responsible for
freeing it.

Reviewed-by: Alex Elder <el...@linaro.org>

> 
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  drivers/block/rbd.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index bc67a93aa4f4..324bf35ec4dd 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -5201,7 +5201,6 @@ static int rbd_dev_probe_parent(struct rbd_device 
> *rbd_dev)
>  out_err:
>   if (parent) {
>   rbd_dev_unparent(rbd_dev);
> - kfree(rbd_dev->header_name);
>   rbd_dev_destroy(parent);
>   } else {
>   rbd_put_client(rbdc);
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 01/18] libceph: add scatterlist messenger data type

2015-07-29 Thread Alex Elder

On 07/29/2015 04:23 AM, mchri...@redhat.com wrote:
 From: Mike Christie micha...@cs.wisc.edu
 
 LIO uses scatterlist for its page/data management. This patch
 adds a scatterlist messenger data type, so LIO can pass its sg
 down directly to rbd.
 
 Signed-off-by: Mike Christie micha...@cs.wisc.edu

I'm not going to be able to review all of these, and this
isnt' even a complete review.  But it's something...

You're clearly on the right track, but I want to provide
a meaningful review for correctness and design so I'm
looking for a bit more information.

 ---
  include/linux/ceph/messenger.h  | 13 ++
  include/linux/ceph/osd_client.h | 12 +-
  net/ceph/messenger.c| 96 
 +
  net/ceph/osd_client.c   | 26 +++
  4 files changed, 146 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index 3775327..bc1bde8 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -79,6 +79,7 @@ enum ceph_msg_data_type {
  #ifdef CONFIG_BLOCK
   CEPH_MSG_DATA_BIO,  /* data source/destination is a bio list */
  #endif /* CONFIG_BLOCK */
 + CEPH_MSG_DATA_SG,   /* data source/destination is a scatterlist */
  };
  
  static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type)
 @@ -90,6 +91,7 @@ static __inline__ bool ceph_msg_data_type_valid(enum 
 ceph_msg_data_type type)
  #ifdef CONFIG_BLOCK
   case CEPH_MSG_DATA_BIO:
  #endif /* CONFIG_BLOCK */
 + case CEPH_MSG_DATA_SG:
   return true;
   default:
   return false;
 @@ -112,6 +114,11 @@ struct ceph_msg_data {
   unsigned intalignment;  /* first page */
   };
   struct ceph_pagelist*pagelist;
 + struct {
 + struct scatterlist *sgl;
 + unsigned intsgl_init_offset;
 + u64 sgl_length;
 + };

Can you supply a short explanation of what these fields
represent?  It seems sgl_init_offset is the offset of the
starting byte in the sgl, but is the purpose of that for
page offset calculation, or does it represent an offset
into the total length of the sgl?  Or put another way,
does sgl_init_offset represent some portion of the
sgl_length that has been consumed already (and so the
initial residual length is sgl_length - sgl_init_offset)?

   };
  };
  
 @@ -139,6 +146,10 @@ struct ceph_msg_data_cursor {
   struct page *page;  /* page from list */
   size_t  offset; /* bytes from list */
   };
 + struct {
 + struct scatterlist  *sg;/* curr sg */

/* current sg */

 + unsigned intsg_consumed;

Here too, what does sg_consumed represent with respect to the
initial offset and the length?

I guess I'm going to stop with that.  It'll be a lot easier
for me to review this if I'm sure I'm sure I understand what
these represent.  Thanks.

-Alex

 + };
   };
  };
  

. . .

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: fix copyup completion race

2015-07-28 Thread Alex Elder

() when it completed.  This
meant the copyup would complete, and the copyup callback
function would essentially be inserted as an extra object
request callback function, called *before* the normal
rbd_img_obj_callback() completion function.

With this change, when the parent read full image request
completes, the original object request callback is left
as-is (I think/assume it's always rbd_img_obj_callback()).
However, unlike before, rbd_img_obj_copyup_callback()
is now called as part of the *osd request* completion
handling for the copyup operation.  This finishes before
the object request completion handling.  Rather than
chaining object request handlers, we clean up the copyup
OSD request as part of OSD completion (and then mark the
containing original object request done), and *then*
complete the object request normally.

I have two comments about this.
- I *think* this is a better way to do this anyway.  By
  which I mean that since we're setting up that copyup
  operation sort of behind the scenes, separate from the
  original object request, it's just as well that we
  clean up the pages allocated, etc. behind the scenes
  (at the OSD request layer, not the object request layer)
  also.
- Since rbd_img_obj_copyup_callback() is now being called
  in the OSD request callback context (only), it should
  be renamed rbd_osd_copyup_callback().

I wish there were a nice way to diagram the various
paths these requests take.  They're not overly
complicated, but every time I go through it I need
to sort of think through it again each time.

I'm going to say this looks good...

Reviewed-by: Alex Elder el...@linaro.org


 Cc: Alex Elder el...@linaro.org
 Cc: sta...@vger.kernel.org # 3.10+, needs backporting for  3.18
 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  drivers/block/rbd.c | 20 
  1 file changed, 16 insertions(+), 4 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index d94529d5c8e9..007e5d0cadaf 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -523,6 +523,7 @@ void rbd_warn(struct rbd_device *rbd_dev, const char 
 *fmt, ...)
  #  define rbd_assert(expr)   ((void) 0)
  #endif /* !RBD_DEBUG */
  
 +static void rbd_img_obj_copyup_callback(struct rbd_obj_request *obj_request);
  static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request);
  static void rbd_img_parent_read(struct rbd_obj_request *obj_request);
  static void rbd_dev_remove_parent(struct rbd_device *rbd_dev);
 @@ -1818,6 +1819,16 @@ static void rbd_osd_stat_callback(struct 
 rbd_obj_request *obj_request)
   obj_request_done_set(obj_request);
  }
  
 +static void rbd_osd_call_callback(struct rbd_obj_request *obj_request)
 +{
 + dout(%s: obj %p\n, __func__, obj_request);
 +
 + if (obj_request_img_data_test(obj_request))
 + rbd_img_obj_copyup_callback(obj_request);
 + else
 + obj_request_done_set(obj_request);
 +}
 +
  static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
   struct ceph_msg *msg)
  {
 @@ -1866,6 +1877,8 @@ static void rbd_osd_req_callback(struct 
 ceph_osd_request *osd_req,
   rbd_osd_discard_callback(obj_request);
   break;
   case CEPH_OSD_OP_CALL:
 + rbd_osd_call_callback(obj_request);
 + break;
   case CEPH_OSD_OP_NOTIFY_ACK:
   case CEPH_OSD_OP_WATCH:
   rbd_osd_trivial_callback(obj_request);
 @@ -2537,6 +2550,8 @@ rbd_img_obj_copyup_callback(struct rbd_obj_request 
 *obj_request)
   struct page **pages;
   u32 page_count;
  
 + dout(%s: obj %p\n, __func__, obj_request);
 +
   rbd_assert(obj_request-type == OBJ_REQUEST_BIO ||
   obj_request-type == OBJ_REQUEST_NODATA);
   rbd_assert(obj_request_img_data_test(obj_request));
 @@ -2563,9 +2578,7 @@ rbd_img_obj_copyup_callback(struct rbd_obj_request 
 *obj_request)
   if (!obj_request-result)
   obj_request-xferred = obj_request-length;
  
 - /* Finish up with the normal image object callback */
 -
 - rbd_img_obj_callback(obj_request);
 + obj_request_done_set(obj_request);
  }
  
  static void
 @@ -2650,7 +2663,6 @@ rbd_img_obj_parent_read_full_callback(struct 
 rbd_img_request *img_request)
  
   /* All set, send it off. */
  
 - orig_request-callback = rbd_img_obj_copyup_callback;

What is orig_request-callback here, is it always
rbd_img_obj_callback()?

   osdc = rbd_dev-rbd_client-client-osdc;
   img_result = rbd_obj_request_submit(osdc, orig_request);
   if (!img_result)
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: use GFP_NOIO in rbd_obj_request_create()

2015-06-30 Thread Alex Elder

On 06/30/2015 02:19 PM, Ilya Dryomov wrote:
 rbd_obj_request_create() is called on the main I/O path, so we need to
 use GFP_NOIO to make sure allocation doesn't blow back on us.  Not all
 callers need this, but I'm still hardcoding the flag inside rather than
 making it a parameter because a) this is going to stable, and b) those
 callers shouldn't really use rbd_obj_request_create() and will be fixed
 in the future.
 
 More memory allocation fixes will follow.

This looks OK to me.  It's conservative; you can add a GFP parameter
in the future.

Reviewed-by: Alex Elder el...@linaro.org

 
 Cc: sta...@vger.kernel.org # 3.10+
 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  drivers/block/rbd.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index b316ee48a30b..d94529d5c8e9 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2022,11 +2022,11 @@ static struct rbd_obj_request 
 *rbd_obj_request_create(const char *object_name,
   rbd_assert(obj_request_type_valid(type));
  
   size = strlen(object_name) + 1;
 - name = kmalloc(size, GFP_KERNEL);
 + name = kmalloc(size, GFP_NOIO);
   if (!name)
   return NULL;
  
 - obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_KERNEL);
 + obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO);
   if (!obj_request) {
   kfree(name);
   return NULL;
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: Fix ceph_tcp_sendpage()'s more boolean usage

2015-06-25 Thread Alex Elder


On 06/25/2015 11:56 AM, Benoît Canet wrote:

Spotted while hunting http://tracker.ceph.com/issues/10905.

 From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h:

boollast_piece; /* current is last piece */

In ceph_msg_data_next():

*last_piece = cursor-last_piece;

A call to ceph_msg_data_next() is followed by:

ret = ceph_tcp_sendpage(con-sock, page, page_offset,
length, last_piece);

while ceph_tcp_sendpage() is:

static int ceph_tcp_sendpage(struct socket *sock, struct page *page,i
 int offset, size_t size, bool more)

The logic is inverted: correct it.


Whoops.  I'm surprised we haven't had more trouble from this.
This should go to stable too.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org



Signed-off-by: Benoît Canet benoit.ca...@nodalink.com
---
  net/ceph/messenger.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ec68cd3..eda06fd 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1544,7 +1544,7 @@ static int write_partial_message_data(struct 
ceph_connection *con)
page = ceph_msg_data_next(msg-cursor, page_offset, length,
last_piece);
ret = ceph_tcp_sendpage(con-sock, page, page_offset,
- length, last_piece);
+ length, !last_piece);
if (ret = 0) {
if (do_datacrc)
msg-footer.data_crc = cpu_to_le32(crc);



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: bump queue_max_segments

2015-06-25 Thread Alex Elder


On 06/25/2015 04:01 AM, Ilya Dryomov wrote:

The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128)
unnecessarily limits bio sizes to 512k (assuming 4k pages).  rbd, being
a virtual block device, doesn't have any restrictions on the number of
physical segments, so bump max_segments to max_hw_sectors, in theory
allowing a sector per segment (although the only case this matters that
I can think of is some readv/writev style thing).  In practice this is
going to give us 1M bios - the number of segments in a bio is limited
in bio_get_nr_vecs() by BIO_MAX_PAGES = 256.

Note that this doesn't result in any improvement on a typical direct
sequential test.  This is because on a box with a not too badly
fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice
rbd object size sized requests.  The only difference is the size of
bios being merged - 512k vs 1M for something like

 $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE
 $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE

Signed-off-by: Ilya Dryomov idryo...@gmail.com


This looks good.

Reviewed-by: Alex Elder el...@linaro.org


---
  drivers/block/rbd.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 89fe8a4bc02e..bc88fbcb9715 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3791,6 +3791,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
/* set io sizes to object size */
segment_size = rbd_obj_bytes(rbd_dev-header);
blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE);
+   blk_queue_max_segments(q, segment_size / SECTOR_SIZE);
blk_queue_max_segment_size(q, segment_size);
blk_queue_io_min(q, segment_size);
blk_queue_io_opt(q, segment_size);



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] rbd: queue_depth map option

2015-06-25 Thread Alex Elder


On 06/25/2015 04:11 AM, Ilya Dryomov wrote:

nr_requests (/sys/block/rbdid/queue/nr_requests) is pretty much
irrelevant in blk-mq case because each driver sets its own max depth
that it can handle and that's the number of tags that gets preallocated
on setup.  Users can't increase queue depth beyond that value via
writing to nr_requests.

For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most
cases but we want to give users the opportunity to increase it.
Introduce a new per-device queue_depth option to do just that:

 $ sudo rbd map -o queue_depth=1024 ...

Signed-off-by: Ilya Dryomov idryo...@gmail.com


I haven't gone to follow through what happens with this
but I assume a value that's too large will be caught when
it's attempted to be used or something.

In any case this looks good to me.

Reviewed-by: Alex Elder el...@linaro.org


---
  drivers/block/rbd.c | 17 ++---
  1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index e502bce02d2c..b316ee48a30b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -728,6 +728,7 @@ static struct rbd_client *rbd_client_find(struct 
ceph_options *ceph_opts)
   * (Per device) rbd map options
   */
  enum {
+   Opt_queue_depth,
Opt_last_int,
/* int args above */
Opt_last_string,
@@ -738,6 +739,7 @@ enum {
  };

  static match_table_t rbd_opts_tokens = {
+   {Opt_queue_depth, queue_depth=%d},
/* int args above */
/* string args above */
{Opt_read_only, read_only},
@@ -748,9 +750,11 @@ static match_table_t rbd_opts_tokens = {
  };

  struct rbd_options {
+   int queue_depth;
boolread_only;
  };

+#define RBD_QUEUE_DEPTH_DEFAULTBLKDEV_MAX_RQ
  #define RBD_READ_ONLY_DEFAULT false

  static int parse_rbd_opts_token(char *c, void *private)
@@ -774,6 +778,13 @@ static int parse_rbd_opts_token(char *c, void *private)
}

switch (token) {
+   case Opt_queue_depth:
+   if (intval  1) {
+   pr_err(queue_depth out of range\n);
+   return -EINVAL;
+   }
+   rbd_opts-queue_depth = intval;
+   break;
case Opt_read_only:
rbd_opts-read_only = true;
break;
@@ -3761,10 +3772,9 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)

memset(rbd_dev-tag_set, 0, sizeof(rbd_dev-tag_set));
rbd_dev-tag_set.ops = rbd_mq_ops;
-   rbd_dev-tag_set.queue_depth = BLKDEV_MAX_RQ;
+   rbd_dev-tag_set.queue_depth = rbd_dev-opts-queue_depth;
rbd_dev-tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev-tag_set.flags =
-   BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev-tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
rbd_dev-tag_set.nr_hw_queues = 1;
rbd_dev-tag_set.cmd_size = sizeof(struct work_struct);

@@ -4948,6 +4958,7 @@ static int rbd_add_parse_args(const char *buf,
goto out_mem;

rbd_opts-read_only = RBD_READ_ONLY_DEFAULT;
+   rbd_opts-queue_depth = RBD_QUEUE_DEPTH_DEFAULT;

copts = ceph_parse_options(options, mon_addrs,
mon_addrs + mon_addrs_size - 1,



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Avoid holding the zero page on slab init failure

2015-06-25 Thread Alex Elder


On 06/24/2015 08:27 PM, Benoît Canet wrote:

Spotted by visual inspection.

Applies on libceph: Remove spurious kunmap() of the zero page.

Benoît Canet (1):
   libceph: Avoid holding the zero page on ceph_msgr_slab_init errors

  net/ceph/messenger.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)



Benoit, in cases where you have just a single patch, there is
no need to include an additional mail message like this one.

If you have a series, that's when you'll supply a PATCH 0/N
message.

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] rbd: terminate rbd_opts_tokens with Opt_err

2015-06-25 Thread Alex Elder


On 06/25/2015 04:11 AM, Ilya Dryomov wrote:

Also nuke useless Opt_last_bool and don't break lines unnecessarily.

Signed-off-by: Ilya Dryomov idryo...@gmail.com


Good idea.

Reviewed-by: Alex Elder el...@linaro.org


---
  drivers/block/rbd.c | 24 
  1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bc88fbcb9715..4de8c9167c4b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -724,7 +724,7 @@ static struct rbd_client *rbd_client_find(struct 
ceph_options *ceph_opts)
  }

  /*
- * mount options
+ * (Per device) rbd map options
   */
  enum {
Opt_last_int,
@@ -733,8 +733,7 @@ enum {
/* string args above */
Opt_read_only,
Opt_read_write,
-   /* Boolean args above */
-   Opt_last_bool,
+   Opt_err
  };

  static match_table_t rbd_opts_tokens = {
@@ -744,8 +743,7 @@ static match_table_t rbd_opts_tokens = {
{Opt_read_only, ro},/* Alternate spelling */
{Opt_read_write, read_write},
{Opt_read_write, rw},   /* Alternate spelling */
-   /* Boolean args above */
-   {-1, NULL}
+   {Opt_err, NULL}
  };

  struct rbd_options {
@@ -761,22 +759,15 @@ static int parse_rbd_opts_token(char *c, void *private)
int token, intval, ret;

token = match_token(c, rbd_opts_tokens, argstr);
-   if (token  0)
-   return -EINVAL;
-
if (token  Opt_last_int) {
ret = match_int(argstr[0], intval);
if (ret  0) {
-   pr_err(bad mount option arg (not int) 
-  at '%s'\n, c);
+   pr_err(bad mount option arg (not int) at '%s'\n, c);
return ret;
}
dout(got int token %d val %d\n, token, intval);
} else if (token  Opt_last_int  token  Opt_last_string) {
-   dout(got string token %d val %s\n, token,
-argstr[0].from);
-   } else if (token  Opt_last_string  token  Opt_last_bool) {
-   dout(got Boolean token %d\n, token);
+   dout(got string token %d val %s\n, token, argstr[0].from);
} else {
dout(got token %d\n, token);
}
@@ -789,9 +780,10 @@ static int parse_rbd_opts_token(char *c, void *private)
rbd_opts-read_only = false;
break;
default:
-   rbd_assert(false);
-   break;
+   /* libceph prints bad option msg */
+   return -EINVAL;
}
+
return 0;
  }




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] rbd: store rbd_options in rbd_device

2015-06-25 Thread Alex Elder


On 06/25/2015 04:11 AM, Ilya Dryomov wrote:

Signed-off-by: Ilya Dryomov idryo...@gmail.com


Now that you need it when initializing the disk, this
makes sense.

Reviewed-by: Alex Elder el...@linaro.org


---
  drivers/block/rbd.c | 18 +++---
  1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4de8c9167c4b..e502bce02d2c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -346,6 +346,7 @@ struct rbd_device {
struct rbd_image_header header;
unsigned long   flags;  /* possibly lock protected */
struct rbd_spec *spec;
+   struct rbd_options  *opts;

char*header_name;

@@ -4055,7 +4056,8 @@ static void rbd_spec_free(struct kref *kref)
  }

  static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,
-   struct rbd_spec *spec)
+struct rbd_spec *spec,
+struct rbd_options *opts)
  {
struct rbd_device *rbd_dev;

@@ -4069,8 +4071,9 @@ static struct rbd_device *rbd_dev_create(struct 
rbd_client *rbdc,
INIT_LIST_HEAD(rbd_dev-node);
init_rwsem(rbd_dev-header_rwsem);

-   rbd_dev-spec = spec;
rbd_dev-rbd_client = rbdc;
+   rbd_dev-spec = spec;
+   rbd_dev-opts = opts;

/* Initialize the layout used for all rbd requests */

@@ -4086,6 +4089,7 @@ static void rbd_dev_destroy(struct rbd_device *rbd_dev)
  {
rbd_put_client(rbd_dev-rbd_client);
rbd_spec_put(rbd_dev-spec);
+   kfree(rbd_dev-opts);
kfree(rbd_dev);
  }

@@ -5160,7 +5164,7 @@ static int rbd_dev_probe_parent(struct rbd_device 
*rbd_dev)
rbdc = __rbd_get_client(rbd_dev-rbd_client);

ret = -ENOMEM;
-   parent = rbd_dev_create(rbdc, parent_spec);
+   parent = rbd_dev_create(rbdc, parent_spec, NULL);
if (!parent)
goto out_err;

@@ -5406,9 +5410,6 @@ static ssize_t do_rbd_add(struct bus_type *bus,
rc = rbd_add_parse_args(buf, ceph_opts, rbd_opts, spec);
if (rc  0)
goto err_out_module;
-   read_only = rbd_opts-read_only;
-   kfree(rbd_opts);
-   rbd_opts = NULL;/* done with this */

rbdc = rbd_get_client(ceph_opts);
if (IS_ERR(rbdc)) {
@@ -5434,11 +5435,12 @@ static ssize_t do_rbd_add(struct bus_type *bus,
goto err_out_client;
}

-   rbd_dev = rbd_dev_create(rbdc, spec);
+   rbd_dev = rbd_dev_create(rbdc, spec, rbd_opts);
if (!rbd_dev)
goto err_out_client;
rbdc = NULL;/* rbd_dev now owns this */
spec = NULL;/* rbd_dev now owns this */
+   rbd_opts = NULL;/* rbd_dev now owns this */

rc = rbd_dev_image_probe(rbd_dev, true);
if (rc  0)
@@ -5446,6 +5448,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,

/* If we are mapping a snapshot it must be marked read-only */

+   read_only = rbd_dev-opts-read_only;
if (rbd_dev-spec-snap_id != CEPH_NOSNAP)
read_only = true;
rbd_dev-mapping.read_only = read_only;
@@ -5470,6 +5473,7 @@ err_out_client:
rbd_put_client(rbdc);
  err_out_args:
rbd_spec_put(spec);
+   kfree(rbd_opts);
  err_out_module:
module_put(THIS_MODULE);




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: Avoid holding the zero page on ceph_msgr_slab_init errors

2015-06-25 Thread Alex Elder


On 06/24/2015 08:27 PM, Benoît Canet wrote:

ceph_msgr_slab_init may fail due to a temporary ENOMEM.


Looks good.


Delay a bit the initialization of zero_page in ceph_msgr_init and
reorder it's cleanup in _ceph_msgr_exit for readability sake.


I'd say it's not readability, but a proper ordering (tear down
in reverse order of set up).

Reviewed-by: Alex Elder el...@linaro.org


BUG_ON() will not suffer to be postponed in case it is triggered.


This is unrelated, but in cases like this (where there's already
something in place to return an error) we should replace these
BUG_ON() calls with logged errors and a return.  They should
never ever happen, of course (they represent design problems)
but if it did it's better to allow the system to keep running.


Signed-off-by: Benoît Canet benoit.ca...@nodalink.com
---
  net/ceph/messenger.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 38f06a4..ec68cd3 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -275,22 +275,22 @@ static void _ceph_msgr_exit(void)
ceph_msgr_wq = NULL;
}

-   ceph_msgr_slab_exit();
-
BUG_ON(zero_page == NULL);
page_cache_release(zero_page);
zero_page = NULL;
+
+   ceph_msgr_slab_exit();
  }

  int ceph_msgr_init(void)
  {
+   if (ceph_msgr_slab_init())
+   return -ENOMEM;
+
BUG_ON(zero_page != NULL);
zero_page = ZERO_PAGE(0);
page_cache_get(zero_page);

-   if (ceph_msgr_slab_init())
-   return -ENOMEM;
-
/*
 * The number of active work items is limited by the number of
 * connections, so leave @max_active at default.



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] remove unbalanced kunmap

2015-06-24 Thread Alex Elder


On 06/24/2015 07:35 PM, Alex Elder wrote:

On 06/24/2015 04:18 PM, Benoît Canet wrote:

Spotted via visual inspection of the code.

Benoît Canet (1):
   libceph: Remove spurious kunmap() of the zero page

  net/ceph/messenger.c | 1 -
  1 file changed, 1 deletion(-)



I got no patch with this.  Is it just me?-Alex


Nevermind.  I get it.  This was associated with the
patch I just signed off on...

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] remove unbalanced kunmap

2015-06-24 Thread Alex Elder


On 06/24/2015 04:18 PM, Benoît Canet wrote:

Spotted via visual inspection of the code.

Benoît Canet (1):
   libceph: Remove spurious kunmap() of the zero page

  net/ceph/messenger.c | 1 -
  1 file changed, 1 deletion(-)



I got no patch with this.  Is it just me?   -Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: Remove spurious kunmap() of the zero page

2015-06-24 Thread Alex Elder


On 06/24/2015 04:18 PM, Benoît Canet wrote:

ceph_tcp_sendpage already does the work of mapping/unmapping
the zero page if needed.

Signed-off-by: Benoît Canet benoit.ca...@nodalink.com


This looks good.

Reviewed-by: Alex Elder el...@linaro.org


---
  net/ceph/messenger.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 967080a..38f06a4 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -278,7 +278,6 @@ static void _ceph_msgr_exit(void)
ceph_msgr_slab_exit();

BUG_ON(zero_page == NULL);
-   kunmap(zero_page);
page_cache_release(zero_page);
zero_page = NULL;
  }



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] libceph: a couple tweaks for wait loops

2015-05-25 Thread Alex Elder

On 05/25/2015 05:38 AM, Ilya Dryomov wrote:
 On Thu, May 21, 2015 at 4:29 PM, Alex Elder el...@ieee.org wrote:
 On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 - return -ETIMEDOUT instead of -EIO in case of timeout
 - wait_event_interruptible_timeout() returns time left until timeout
   and since it can be almost LONG_MAX we had better assign it to long

 Any error returned by wait_event_interruptible_timeout()
 can now be returned by __ceph_open_session().  It looks
 like that may, in fact, be only -EINTR and -ERESTARTSYS.
 But it's a change you could note in the log message.
 
 I think it's just -ERESTARTSYS so I didn't bother.

My point was almost a little more philosophical.  It's conceivable
(though not likely) that wait_event_interruptible_timeout() could
be changed to return a value that your caller here does not expect.

 It turns out the only caller ignores the return value of
 ceph_monc_wait_osdmap() anyway.  That should maybe be fixed.
 
 That's on purpose - rbd map tries to wait for a new enough osdmap only
 if the pool that the image is supposed to be in doesn't exist and we
 know we have a stale osdmap.  We ignore wait retval because if we
 timeout we should return this pool doesn't exist, not -ETIMEDOUT.

Yes I realize that.  This second part of my response was
following on to my previous thought.  That is, the caller
might get a different return value that it didn't expect;
but since the only caller ignores what gets returned, it's
a moot point.

-Alex

 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely

2015-05-25 Thread Alex Elder

On 05/25/2015 10:02 AM, Ilya Dryomov wrote:
 On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig h...@lst.de wrote:
 On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote:
 From: Kent Overstreet kent.overstr...@gmail.com

 As generic_make_request() is now able to handle arbitrarily sized bios,
 it's no longer necessary for each individual block driver to define its
 own -merge_bvec_fn() callback. Remove every invocation completely.

 It might be good to replace patch 1 and this one by a patch per driver
 to remove the merge_bvec_fn instance and add the blk_queue_split call
 for all those drivers that actually had a -merge_bvec_fn.  As some
 of them were non-trivial attention from the maintainers would be helpful,
 and a patch per driver might help with that.

 -/* This is called by bio_add_page().
 - *
 - * q-max_hw_sectors and other global limits are already enforced there.
 - *
 - * We need to call down to our lower level device,
 - * in case it has special restrictions.
 - *
 - * We also may need to enforce configured max-bio-bvecs limits.
 - *
 - * As long as the BIO is empty we have to allow at least one bvec,
 - * regardless of size and offset, so no need to ask lower levels.
 - */
 -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, 
 struct bio_vec *bvec)


 This just checks the lower device, so it looks obviously fine.

 -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)
 -{
 - struct pktcdvd_device *pd = q-queuedata;
 - sector_t zone = get_zone(bmd-bi_sector, pd);
 - int used = ((bmd-bi_sector - zone)  9) + bmd-bi_size;
 - int remaining = (pd-settings.size  9) - used;
 - int remaining2;
 -
 - /*
 -  * A bio = PAGE_SIZE must be allowed. If it crosses a packet
 -  * boundary, pkt_make_request() will split the bio.
 -  */
 - remaining2 = PAGE_SIZE - bmd-bi_size;
 - remaining = max(remaining, remaining2);
 -
 - BUG_ON(remaining  0);
 - return remaining;
 -}

 As mentioned in the comment pkt_make_request will split the bio so pkt
 looks fine.

 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index ec6c5c6..f50edb3 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
   return BLK_MQ_RQ_QUEUE_OK;
  }

 -/*
 - * a queue callback. Makes sure that we don't create a bio that spans 
 across
 - * multiple osd objects. One exception would be with a single page bios,
 - * which we handle later at bio_chain_clone_range()
 - */
 -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data 
 *bmd,
 -   struct bio_vec *bvec)

 It seems rbd handles requests spanning objects just fine, so I don't
 really understand why rbd_merge_bvec even exists.  Getting some form
 of ACK from the ceph folks would be useful.
 
 I'm not Alex, but yeah, we have all the clone/split machinery and so we
 can handle a spanning case just fine.  I think rbd_merge_bvec() exists
 to make sure we don't have to do that unless it's really necessary -
 like when a single page gets submitted at an inconvenient offset.

I am Alex.  This is something I never removed.  I haven't
looked at it closely now, but it seems to me that after I
created a function that split stuff properly up *before*
the BIO layer got to it (which has since been replaced by
code related to Kent's immutable BIO work), there has been
no need for this function.  Removing this was on a long-ago
to-do list--but I didn't want to do it without spending some
time ensuring it wouldn't break anything.

If you want me to work through it in more detail so I can
give a more certain response, let me know and I will do so.

-Alex

 I have a patch that adds a blk_queue_chunk_sectors(object_size) call to
 rbd_init_disk() but I haven't had a chance to play with it yet.  In any
 case, we should be fine with getting rid of rbd_merge_bvec().  If this
 ends up a per-driver patchset, I can make rbd_merge_bvec() -
 blk_queue_chunk_sectors() a single patch and push it through
 ceph-client.git.
 
 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] libceph: nuke time_sub()

2015-05-21 Thread Alex Elder

On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 Unused since ceph got merged into mainline I guess.
 
 Signed-off-by: Ilya Dryomov idryo...@gmail.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  include/linux/ceph/libceph.h | 9 -
  1 file changed, 9 deletions(-)
 
 diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
 index 30f92cefaa72..85ae9a889a3f 100644
 --- a/include/linux/ceph/libceph.h
 +++ b/include/linux/ceph/libceph.h
 @@ -93,15 +93,6 @@ enum {
   CEPH_MOUNT_SHUTDOWN,
  };
  
 -/*
 - * subtract jiffies
 - */
 -static inline unsigned long time_sub(unsigned long a, unsigned long b)
 -{
 - BUG_ON(time_after(b, a));
 - return (long)a - (long)b;
 -}
 -
  struct ceph_mds_client;
  
  /*
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/5] libceph: store timeouts in jiffies, verify user input

2015-05-21 Thread Alex Elder

On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 There are currently three libceph-level timeouts that the user can
 specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive.  All of
 these are in seconds and no checking is done on user input: negative
 values are accepted, we multiply them all by HZ which may or may not
 overflow, arbitrarily large jiffies then get added together, etc.
 
 There is also a bug in the way mount_timeout=0 is handled.  It's
 supposed to mean infinite timeout, but that's not how wait.h APIs
 treat it and so __ceph_open_session() for example will busy loop
 without much chance of being interrupted if none of ceph-mons are
 there.
 
 Fix all this by verifying user input, storing timeouts capped by
 msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies()
 helper for all user-specified waits to handle infinite timeouts
 correctly.

This looks good.  I like your use of local variables for
the options pointer; it makes it easy to see in the code
where the timeout values come from.

You could have handled timeout option checking and
error reporting generically, but I'm not sure that
would be better.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  drivers/block/rbd.c  |  5 +++--
  fs/ceph/dir.c|  4 ++--
  fs/ceph/mds_client.c | 12 ++--
  fs/ceph/mds_client.h |  2 +-
  fs/ceph/super.c  |  2 +-
  include/linux/ceph/libceph.h | 17 +++--
  net/ceph/ceph_common.c   | 41 +++--
  net/ceph/mon_client.c| 11 +--
  net/ceph/osd_client.c| 15 +++
  9 files changed, 71 insertions(+), 38 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 349115ae3bc2..992683b6b299 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4963,8 +4963,8 @@ out_err:
   */
  static int rbd_add_get_pool_id(struct rbd_client *rbdc, const char 
 *pool_name)
  {
 + struct ceph_options *opts = rbdc-client-options;
   u64 newest_epoch;
 - unsigned long timeout = rbdc-client-options-mount_timeout * HZ;
   int tries = 0;
   int ret;
  
 @@ -4979,7 +4979,8 @@ again:
   if (rbdc-client-osdc.osdmap-epoch  newest_epoch) {
   ceph_monc_request_next_osdmap(rbdc-client-monc);
   (void) ceph_monc_wait_osdmap(rbdc-client-monc,
 -  newest_epoch, timeout);
 +  newest_epoch,
 +  opts-mount_timeout);
   goto again;
   } else {
   /* the osdmap we have is new enough */
 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 4248307fea90..173dd4b58c71 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -1259,8 +1259,8 @@ static int ceph_dir_fsync(struct file *file, loff_t 
 start, loff_t end,
inode, req-r_tid, last_tid);
   if (req-r_timeout) {
   unsigned long time_left = wait_for_completion_timeout(
 - req-r_safe_completion,
 - req-r_timeout);
 + req-r_safe_completion,
 + ceph_timeout_jiffies(req-r_timeout));
   if (time_left  0)
   ret = 0;
   else
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index 69a36f40517f..0b0e0a9a81c0 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -2268,7 +2268,8 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
   dout(do_request waiting\n);
   if (req-r_timeout) {
   err = (long)wait_for_completion_killable_timeout(
 - req-r_completion, req-r_timeout);
 + req-r_completion,
 + ceph_timeout_jiffies(req-r_timeout));
   if (err == 0)
   err = -EIO;
   } else if (req-r_wait_for_completion) {
 @@ -3424,8 +3425,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
   */
  static void wait_requests(struct ceph_mds_client *mdsc)
  {
 + struct ceph_options *opts = mdsc-fsc-client-options;
   struct ceph_mds_request *req;
 - struct ceph_fs_client *fsc = mdsc-fsc;
  
   mutex_lock(mdsc-mutex);
   if (__get_oldest_req(mdsc)) {
 @@ -3433,7 +3434,7 @@ static void wait_requests(struct ceph_mds_client *mdsc)
  
   dout(wait_requests waiting for requests\n);
   wait_for_completion_timeout(mdsc-safe_umount_waiters,
 - fsc-client-options-mount_timeout * HZ);
 + ceph_timeout_jiffies(opts-mount_timeout));
  
   /* tear down remaining

Re: [PATCH 3/5] libceph: a couple tweaks for wait loops

2015-05-21 Thread Alex Elder

On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 - return -ETIMEDOUT instead of -EIO in case of timeout
 - wait_event_interruptible_timeout() returns time left until timeout
   and since it can be almost LONG_MAX we had better assign it to long

Any error returned by wait_event_interruptible_timeout()
can now be returned by __ceph_open_session().  It looks
like that may, in fact, be only -EINTR and -ERESTARTSYS.
But it's a change you could note in the log message.

It turns out the only caller ignores the return value of
ceph_monc_wait_osdmap() anyway.  That should maybe be fixed.

In any case, this looks good.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  net/ceph/ceph_common.c | 7 +++
  net/ceph/mon_client.c  | 2 +-
  2 files changed, 4 insertions(+), 5 deletions(-)
 
 diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
 index a80e91c2c9a3..925d0c890b80 100644
 --- a/net/ceph/ceph_common.c
 +++ b/net/ceph/ceph_common.c
 @@ -647,8 +647,8 @@ static int have_mon_and_osd_map(struct ceph_client 
 *client)
   */
  int __ceph_open_session(struct ceph_client *client, unsigned long started)
  {
 - int err;
   unsigned long timeout = client-options-mount_timeout;
 + long err;
  
   /* open session, and wait for mon and osd maps */
   err = ceph_monc_open_session(client-monc);
 @@ -656,16 +656,15 @@ int __ceph_open_session(struct ceph_client *client, 
 unsigned long started)
   return err;
  
   while (!have_mon_and_osd_map(client)) {
 - err = -EIO;
   if (timeout  time_after_eq(jiffies, started + timeout))
 - return err;
 + return -ETIMEDOUT;
  
   /* wait */
   dout(mount waiting for mon_map\n);
   err = wait_event_interruptible_timeout(client-auth_wq,
   have_mon_and_osd_map(client) || (client-auth_err  0),
   ceph_timeout_jiffies(timeout));
 - if (err == -EINTR || err == -ERESTARTSYS)
 + if (err  0)
   return err;
   if (client-auth_err  0)
   return client-auth_err;
 diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
 index 0da3bdc116f7..9d6ff1215928 100644
 --- a/net/ceph/mon_client.c
 +++ b/net/ceph/mon_client.c
 @@ -308,7 +308,7 @@ int ceph_monc_wait_osdmap(struct ceph_mon_client *monc, 
 u32 epoch,
 unsigned long timeout)
  {
   unsigned long started = jiffies;
 - int ret;
 + long ret;
  
   mutex_lock(monc-mutex);
   while (monc-have_osdmap  epoch) {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 5/5] rbd: timeout watch teardown on unmap with mount_timeout

2015-05-21 Thread Alex Elder

On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 As part of unmap sequence, kernel client has to talk to the OSDs to
 teardown watch on the header object.  If none of the OSDs are available
 it would hang forever, until interrupted by a signal - when that
 happens we follow through with the rest of unmap procedure (i.e.
 unregister the device and put all the data structures) and the unmap is
 still considired successful (rbd cli tool exits with 0).  The watch on
 the userspace side should eventually timeout so that's fine.
 
 This isn't very nice, because various userspace tools (pacemaker rbd
 resource agent, for example) then have to worry about setting up their
 own timeouts.  Timeout it with mount_timeout (60 seconds by default).
 
 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 Reviewed-by: Sage Weil s...@redhat.com

You can now add:

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c | 38 --
  1 file changed, 28 insertions(+), 10 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 992683b6b299..89fe8a4bc02e 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -1563,22 +1563,39 @@ static void rbd_obj_request_end(struct 
 rbd_obj_request *obj_request)
  /*
   * Wait for an object request to complete.  If interrupted, cancel the
   * underlying osd request.
 + *
 + * @timeout: in jiffies, 0 means wait forever
   */
 -static int rbd_obj_request_wait(struct rbd_obj_request *obj_request)
 +static int __rbd_obj_request_wait(struct rbd_obj_request *obj_request,
 +   unsigned long timeout)
  {
 - int ret;
 + long ret;
  
   dout(%s %p\n, __func__, obj_request);
 -
 - ret = wait_for_completion_interruptible(obj_request-completion);
 - if (ret  0) {
 - dout(%s %p interrupted\n, __func__, obj_request);
 + ret = wait_for_completion_interruptible_timeout(
 + obj_request-completion,
 + ceph_timeout_jiffies(timeout));
 + if (ret = 0) {
 + if (ret == 0)
 + ret = -ETIMEDOUT;
   rbd_obj_request_end(obj_request);
 - return ret;
 + } else {
 + ret = 0;
   }
  
 - dout(%s %p done\n, __func__, obj_request);
 - return 0;
 + dout(%s %p ret %d\n, __func__, obj_request, (int)ret);
 + return ret;
 +}
 +
 +static int rbd_obj_request_wait(struct rbd_obj_request *obj_request)
 +{
 + return __rbd_obj_request_wait(obj_request, 0);
 +}
 +
 +static int rbd_obj_request_wait_timeout(struct rbd_obj_request *obj_request,
 + unsigned long timeout)
 +{
 + return __rbd_obj_request_wait(obj_request, timeout);
  }
  
  static void rbd_img_request_complete(struct rbd_img_request *img_request)
 @@ -3122,6 +3139,7 @@ static struct rbd_obj_request 
 *rbd_obj_watch_request_helper(
   bool watch)
  {
   struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
 + struct ceph_options *opts = osdc-client-options;
   struct rbd_obj_request *obj_request;
   int ret;
  
 @@ -3148,7 +3166,7 @@ static struct rbd_obj_request 
 *rbd_obj_watch_request_helper(
   if (ret)
   goto out;
  
 - ret = rbd_obj_request_wait(obj_request);
 + ret = rbd_obj_request_wait_timeout(obj_request, opts-mount_timeout);
   if (ret)
   goto out;
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ceph: simplify two mount_timeout sites

2015-05-21 Thread Alex Elder

On 05/21/2015 07:35 AM, Ilya Dryomov wrote:
 No need to bifurcate wait now that we've got ceph_timeout_jiffies().

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  fs/ceph/dir.c| 14 --
  fs/ceph/mds_client.c | 18 ++
  2 files changed, 14 insertions(+), 18 deletions(-)
 
 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 173dd4b58c71..3dec27e36417 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -1257,17 +1257,11 @@ static int ceph_dir_fsync(struct file *file, loff_t 
 start, loff_t end,
  
   dout(dir_fsync %p wait on tid %llu (until %llu)\n,
inode, req-r_tid, last_tid);
 - if (req-r_timeout) {
 - unsigned long time_left = wait_for_completion_timeout(
 - req-r_safe_completion,
 + ret = !wait_for_completion_timeout(req-r_safe_completion,
   ceph_timeout_jiffies(req-r_timeout));
 - if (time_left  0)
 - ret = 0;
 - else
 - ret = -EIO;  /* timed out */
 - } else {
 - wait_for_completion(req-r_safe_completion);
 - }
 + if (ret)
 + ret = -EIO;  /* timed out */
 +
   ceph_mdsc_put_request(req);
  
   spin_lock(ci-i_unsafe_lock);
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index 0b0e0a9a81c0..5be2d287a26c 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -2266,16 +2266,18 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
   /* wait */
   mutex_unlock(mdsc-mutex);
   dout(do_request waiting\n);
 - if (req-r_timeout) {
 - err = (long)wait_for_completion_killable_timeout(
 - req-r_completion,
 - ceph_timeout_jiffies(req-r_timeout));
 - if (err == 0)
 - err = -EIO;
 - } else if (req-r_wait_for_completion) {
 + if (!req-r_timeout  req-r_wait_for_completion) {
   err = req-r_wait_for_completion(mdsc, req);
   } else {
 - err = wait_for_completion_killable(req-r_completion);
 + long timeleft = wait_for_completion_killable_timeout(
 + req-r_completion,
 + ceph_timeout_jiffies(req-r_timeout));
 + if (timeleft  0)
 + err = 0;
 + else if (!timeleft)
 + err = -EIO;  /* timed out */
 + else
 + err = timeleft;  /* killed */
   }
   dout(do_request waited, got %d\n, err);
   mutex_lock(mdsc-mutex);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/10] ceph: add start/finish encoding helpers

2015-05-01 Thread Alex Elder

On 05/01/2015 02:47 PM, Mike Christie wrote:
 On 05/01/2015 02:39 PM, Mike Christie wrote:
 On 04/30/2015 07:22 AM, Alex Elder wrote:
 +/**
 + * ceph_start_decoding_compat - decode block with legacy support for
 older schemes
 + * @p: buffer to decode
 + * @end: end of decode buffer
 + * @curr_ver: current version of the encoding that the code
 supports/encode
 + * @compat_ver: oldest version that includes a __u8 compat version field
 + * @len_ver: oldest version that includes a __u32 length wrapper
 + * @len: buffer to return len of data in buffer
 + */
 +static inline int ceph_start_decoding_compat(void **p, void *end, u8
 curr_ver,
 + u8 compat_ver, u8 len_ver, u32 *len)
 +{
 +u8 struct_ver, struct_compat;
 +
 +ceph_decode_8_safe(p, end, struct_ver, fail);
 +if (struct_ver = curr_ver) {

 It seems to me it doesn't matter whether the current code
 supports the struct compat version or not.  What matters
 is whether the encoded header contains the compat field
 or not.  I'm not sure what determines that.

 I am not sure if I understood this comment.

 I thought different structs got the compat field in different versions.
 So, I was concerned about a case where we might get a old struct sent to
 us. If the compat field was added to some struct_abc in version 2 and
 that is what we support in the kernel, but some old osd sent us a struct
 that was version 1, then we do not want to do the compat check below.

 
 Doh! I wrote the above mail, then realized what you meant.

OK good...  And I should have known what determines
whether the header contains the compat field, but
I was already a little confused about what I was
looking at...

-Alex

 I think I should have checked the compat_ver passed into the version above.
 
 if (struct_ver = compat_ver) {
   ceph_decode_8_safe(p, end, struct_compat, fail);
   if (curr_ver  struct_compat)
 


 +ceph_decode_8_safe(p, end, struct_compat, fail);
 +if (curr_ver  struct_compat)
 +return -EINVAL;
 +}
 +
 +if (struct_ver = len_ver) {
 +ceph_decode_32_safe(p, end, *len, fail);
 +} else {
 +*len = 0;
 +}
 +
 +return 0;
 +fail:
 +return -ERANGE;
 +}
 +
 +
   #define ceph_encode_need(p, end, n, bad)\
   do {\
   if (!likely(ceph_has_room(p, end, n)))\



 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] libceph: properly release STAT request's raw_data_in

2015-04-27 Thread Alex Elder


On 04/27/2015 03:24 AM, Yan, Zheng wrote:

Signed-off-by: Yan, Zheng z...@redhat.com


Looks good.

Reviewed-by: Alex Elder el...@linaro.org


---
  net/ceph/osd_client.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 41a4abc..b93531f 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -296,6 +296,9 @@ static void osd_req_op_data_release(struct ceph_osd_request 
*osd_req,
case CEPH_OSD_OP_CMPXATTR:
ceph_osd_data_release(op-xattr.osd_data);
break;
+   case CEPH_OSD_OP_STAT:
+   ceph_osd_data_release(op-raw_data_in);
+   break;
default:
break;
}



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/7] ceph-client: some refactoring patches

2015-04-27 Thread Alex Elder

Most of this series is just comment changes and a few minor naming
tweaks.  The last two patches actually change code, but the
resulting behavior should be identical to before.

These have been compile tested only.  They (especially the last two)
need to run through teuthology.  The series is based on the current
testing branch:
5eeb14f ceph: check OSD caps before read/write

-Alex

Alex Elder (7):
  messenger: update state diagram
  messenger: add some clarifying comments
  messenger: rename out_kvec_left field
  osd_client: don't supply connection to handle_reply()
  osd_client: improve safety of osd_req_op_data() macro
  messenger: encapsulate grabbing incoming message
  messenger: con_work() is not really a loop

 include/linux/ceph/messenger.h |   2 +-
 net/ceph/messenger.c   | 254 +++--
 net/ceph/osd_client.c  |  16 +--
 3 files changed, 156 insertions(+), 116 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/7] messenger: update state diagram

2015-04-27 Thread Alex Elder

Update the fabulous state diagram in messenger.c to reflect the
way the code operates.  Two new transitions are added:

CLOSED - CLOSED
CLOSING - CLOSING

Signed-off-by: Alex Elder el...@linaro.org
---
 net/ceph/messenger.c | 51 +--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 967080a..9dfdb20 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -46,32 +46,31 @@
  *  | con_sock_state_init()
  *  v
  *  --
- *  | CLOSED |  initialized, but no socket (and no
- *  --  TCP connection)
- *   ^  \
- *   |   \ con_sock_state_connecting()
- *   |--
- *   |  \
- *   + con_sock_state_closed()   \
- *   |+---\
- *   | \  \\
- *   |  ---\\
- *   |  | CLOSING |  socket event;  \\
- *   |  ---  await close \\
- *   |   ^\   |
- *   |   | \  |
- *   |   + con_sock_state_closing() \ |
- *   |  / \ | |
- *   | /   ---  | |
- *   |/   \ v v
- *   |   /--
- *   |  /-| CONNECTING |  socket created, TCP
- *   |  |   / --  connect initiated
- *   |  |   | con_sock_state_connected()
- *   |  |   v
- *  -
- *  | CONNECTED |  TCP connection established
- *  -
+ *  | CLOSED |  initialized, but no socket (and no TCP connection)
+ *  --
+ *   ^   \  \  con_sock_state_connecting()
+ *   |   /   
+ *   +--\
+ *   |\  cen_sock_state_closed()  \
+ *   | +--\
+ *   |  \  \\
+ *   | ---  \\
+ *   | | CLOSING |  socket event;\\
+ *   | ---  await close   \\
+ *   |/   ^\\
+ *   |\   | \   |
+ *   | --+ con_sock_state_closing() \  |
+ *   |   / \ |  |
+ *   |  /   --   |  |
+ *   | /  \  |  v
+ *   |/--
+ *   |   /-| CONNECTING |  socket created, TCP
+ *   |   |   / --  connect initiated
+ *   |   |   | con_sock_state_connected()
+ *   |   |   v
+ * -
+ * | CONNECTED |  TCP connection established
+ * -
  *
  * State values for ceph_connection-sock_state; NEW is assumed to be 0.
  */
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 7/7] messenger: con_work() is not really a loop

2015-04-27 Thread Alex Elder

In con_work(), a while(true) loop is used to contain a block of code.
The normal path through that code does *not* actually loop.  The only
time it loops is if a socket read or write operation returns -EAGAIN.
So the use of the loop control structure is a little misleading.

Restructure that block so it's *not* a loop, and use goto calls
rather than a loop to implement the control flow.

Signed-off-by: Alex Elder el...@linaro.org
---
 net/ceph/messenger.c | 84 +---
 1 file changed, 41 insertions(+), 43 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ec60c23..053c5f3 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2845,54 +2845,52 @@ static void con_work(struct work_struct *work)
struct ceph_connection *con = container_of(work, struct ceph_connection,
   work.work);
bool fault;
+   int ret;
 
mutex_lock(con-mutex);
-   while (true) {
-   int ret;
-
-   if ((fault = con_sock_closed(con))) {
-   dout(%s: con %p SOCK_CLOSED\n, __func__, con);
-   break;
-   }
-   if (con_backoff(con)) {
-   dout(%s: con %p BACKOFF\n, __func__, con);
-   break;
-   }
-   if (con-state == CON_STATE_STANDBY) {
-   dout(%s: con %p STANDBY\n, __func__, con);
-   break;
-   }
-   if (con-state == CON_STATE_CLOSED) {
-   dout(%s: con %p CLOSED\n, __func__, con);
-   BUG_ON(con-sock);
-   break;
-   }
-   if (con-state == CON_STATE_PREOPEN) {
-   dout(%s: con %p PREOPEN\n, __func__, con);
-   BUG_ON(con-sock);
-   }
-
-   ret = try_read(con);
-   if (ret  0) {
-   if (ret == -EAGAIN)
-   continue;
-   if (!con-error_msg)
-   con-error_msg = socket error on read;
-   fault = true;
-   break;
-   }
+again:
+   fault = con_sock_closed(con);
+   if (fault) {
+   dout(%s: con %p SOCK_CLOSED\n, __func__, con);
+   goto done;
+   }
+   if (con_backoff(con)) {
+   dout(%s: con %p BACKOFF\n, __func__, con);
+   goto done;
+   }
+   if (con-state == CON_STATE_STANDBY) {
+   dout(%s: con %p STANDBY\n, __func__, con);
+   goto done;
+   }
+   if (con-state == CON_STATE_CLOSED) {
+   dout(%s: con %p CLOSED\n, __func__, con);
+   BUG_ON(con-sock);
+   goto done;
+   }
+   if (con-state == CON_STATE_PREOPEN) {
+   dout(%s: con %p PREOPEN\n, __func__, con);
+   BUG_ON(con-sock);
+   }
 
-   ret = try_write(con);
-   if (ret  0) {
-   if (ret == -EAGAIN)
-   continue;
-   if (!con-error_msg)
-   con-error_msg = socket error on write;
-   fault = true;
-   }
+   ret = try_read(con);
+   if (ret  0) {
+   if (ret == -EAGAIN)
+   goto again;
+   if (!con-error_msg)
+   con-error_msg = socket error on read;
+   fault = true;
+   goto done;
+   }
 
-   break;  /* If we make it to here, we're done */
+   ret = try_write(con);
+   if (ret  0) {
+   if (ret == -EAGAIN)
+   goto again;
+   if (!con-error_msg)
+   con-error_msg = socket error on write;
+   fault = true;
}
+done:
if (fault)
con_fault(con);
mutex_unlock(con-mutex);
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 6/7] messenger: encapsulate grabbing incoming message

2015-04-27 Thread Alex Elder

There are currently three places in the messenger code that grab a
connection's incoming message (i.e., get the message--if any--and
and replace the connection's incoming message with a null pointer).

Create a new function ceph_con_in_msg_grab() to encapsulate this
operation, returning what had been the connection's incoming
message.  The connection's reference to the message is transferred
to the caller.  In the new function, test for a few null pointers
before using them, in case of pathological errors.

Do not call BUG_ON() if the message's connection pointer is bad.
Instead, report an error, then drop and ignore the message.  Take
care to use the right connection's put operation.

Establish the naming convention that in_msg is a pointer to a
message that was a connection's incoming message.  (This affects
some additional code in process_message().)

Signed-off-by: Alex Elder el...@linaro.org
---
 net/ceph/messenger.c | 88 +++-
 1 file changed, 60 insertions(+), 28 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index a0d2673..ec60c23 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -650,21 +650,55 @@ static void ceph_msg_remove_list(struct list_head *head)
}
 }
 
+/*
+ * Grab the given connection's incoming message.
+ */
+static struct ceph_msg *ceph_con_in_msg_grab(struct ceph_connection *con)
+{
+   struct ceph_msg *in_msg;
+   struct ceph_connection *in_msg_con;
+
+   if (!con-in_msg)
+   return NULL;
+
+   /*
+* Grab the incoming message and its connection, then update
+* their pointers to each other.  The connection's reference
+* to the message is transferred to the caller unless there
+* is an error.
+*/
+   in_msg = con-in_msg;
+   con-in_msg = NULL;
+   in_msg_con = in_msg-con;
+   in_msg-con = NULL;
+
+   if (in_msg_con  in_msg-con-ops)
+   in_msg_con-ops-put(in_msg_con);
+
+   if (in_msg_con == con)
+   return in_msg;
+
+   pr_err(incoming message %p with bad connection (%p != %p)\n,
+   in_msg, in_msg_con, con);
+   ceph_msg_put(in_msg);
+
+   return NULL;
+}
+
 static void reset_connection(struct ceph_connection *con)
 {
+   struct ceph_msg *in_msg;
+
/* reset connection, out_queue, msg_ and connect_seq */
/* discard existing out_queue and msg_seq */
dout(reset_connection %p\n, con);
ceph_msg_remove_list(con-out_queue);
ceph_msg_remove_list(con-out_sent);
 
-   if (con-in_msg) {
-   BUG_ON(con-in_msg-con != con);
-   con-in_msg-con = NULL;
-   ceph_msg_put(con-in_msg);
-   con-in_msg = NULL;
-   con-ops-put(con);
-   }
+   /* If there is an incoming message, grab it and release it */
+   in_msg = ceph_con_in_msg_grab(con);
+   if (in_msg)
+   ceph_msg_put(in_msg);
 
con-connect_seq = 0;
con-out_seq = 0;
@@ -2428,30 +2462,29 @@ static int read_partial_message(struct ceph_connection 
*con)
  */
 static void process_message(struct ceph_connection *con)
 {
-   struct ceph_msg *msg;
+   struct ceph_msg *in_msg = ceph_con_in_msg_grab(con);
 
-   BUG_ON(con-in_msg-con != con);
-   con-in_msg-con = NULL;
-   msg = con-in_msg;
-   con-in_msg = NULL;
-   con-ops-put(con);
+   if (!in_msg) {
+   pr_err(no message to process);
+   return;
+   }
 
/* if first message, set peer_name */
if (con-peer_name.type == 0)
-   con-peer_name = msg-hdr.src;
+   con-peer_name = in_msg-hdr.src;
 
con-in_seq++;
mutex_unlock(con-mutex);
 
dout(= %p %llu from %s%lld %d=%s len %d+%d (%u %u %u) =\n,
-msg, le64_to_cpu(msg-hdr.seq),
-ENTITY_NAME(msg-hdr.src),
-le16_to_cpu(msg-hdr.type),
-ceph_msg_type_name(le16_to_cpu(msg-hdr.type)),
-le32_to_cpu(msg-hdr.front_len),
-le32_to_cpu(msg-hdr.data_len),
+in_msg, le64_to_cpu(in_msg-hdr.seq),
+ENTITY_NAME(in_msg-hdr.src),
+le16_to_cpu(in_msg-hdr.type),
+ceph_msg_type_name(le16_to_cpu(in_msg-hdr.type)),
+le32_to_cpu(in_msg-hdr.front_len),
+le32_to_cpu(in_msg-hdr.data_len),
 con-in_front_crc, con-in_middle_crc, con-in_data_crc);
-   con-ops-dispatch(con, msg);
+   con-ops-dispatch(con, in_msg);
 
mutex_lock(con-mutex);
 }
@@ -2876,6 +2909,8 @@ static void con_work(struct work_struct *work)
  */
 static void con_fault(struct ceph_connection *con)
 {
+   struct ceph_msg *in_msg;
+
dout(fault %p state %lu to peer %s\n,
 con, con-state, ceph_pr_addr(con-peer_addr.in_addr));
 
@@ -2895,13 +2930,10 @@ static void con_fault(struct ceph_connection *con)
return

[PATCH 2/7] messenger: add some clarifying comments

2015-04-27 Thread Alex Elder

In con_sock_closed() we test *and* clear the socket closed connection
flag.  It's not immediately obvious why the flag needs to be cleared.
Add a comment so that's a little clearer.

Similarly, we only want to do backoff processing once each time the
backoff flag is set.  Add a comment about that as well.

Signed-off-by: Alex Elder el...@linaro.org
---
 net/ceph/messenger.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 9dfdb20..37b0fa7 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2733,6 +2733,10 @@ static void cancel_con(struct ceph_connection *con)
 
 static bool con_sock_closed(struct ceph_connection *con)
 {
+   /*
+* We want to handle the event of the socket being
+* closed once, so clear the flag in case it is set.
+*/
if (!con_flag_test_and_clear(con, CON_FLAG_SOCK_CLOSED))
return false;
 
@@ -2764,6 +2768,11 @@ static bool con_backoff(struct ceph_connection *con)
 {
int ret;
 
+   /*
+* See if we need to back off.  We only want to do this
+* processing once each time the flag is set, so clear the
+* flag after it's tested.
+*/
if (!con_flag_test_and_clear(con, CON_FLAG_BACKOFF))
return false;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/7] messenger: rename out_kvec_left field

2015-04-27 Thread Alex Elder

The name out_kvec_left in the ceph_connection structure is sort of
ambiguous.  It represents the number of the entries of the out_kvec[]
array that are filled with data to send.  The name can easily be
misinterpreted to mean the number of such entries that are available.

Change the name to out_kvec_used to avoid misunderstanding.

Signed-off-by: Alex Elder el...@linaro.org
---
 include/linux/ceph/messenger.h |  2 +-
 net/ceph/messenger.c   | 22 +++---
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index e154994..c457f13 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -230,7 +230,7 @@ struct ceph_connection {
 
struct kvec out_kvec[8], /* sending header/footer data */
*out_kvec_cur;
-   int out_kvec_left;   /* kvec's left in out_kvec */
+   int out_kvec_used;   /* kvec's used in out_kvec */
int out_skip;/* skip this many bytes */
int out_kvec_bytes;  /* total bytes left */
bool out_kvec_is_msg; /* kvec refers to out_msg */
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 37b0fa7..a0d2673 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -773,7 +773,7 @@ static u32 get_global_seq(struct ceph_messenger *msgr, u32 
gt)
 
 static void con_out_kvec_reset(struct ceph_connection *con)
 {
-   con-out_kvec_left = 0;
+   con-out_kvec_used = 0;
con-out_kvec_bytes = 0;
con-out_kvec_cur = con-out_kvec[0];
 }
@@ -783,12 +783,12 @@ static void con_out_kvec_add(struct ceph_connection *con,
 {
int index;
 
-   index = con-out_kvec_left;
+   index = con-out_kvec_used;
BUG_ON(index = ARRAY_SIZE(con-out_kvec));
 
con-out_kvec[index].iov_len = size;
con-out_kvec[index].iov_base = data;
-   con-out_kvec_left++;
+   con-out_kvec_used++;
con-out_kvec_bytes += size;
 }
 
@@ -1194,7 +1194,7 @@ static void prepare_message_data(struct ceph_msg *msg, 
u32 data_len)
 static void prepare_write_message_footer(struct ceph_connection *con)
 {
struct ceph_msg *m = con-out_msg;
-   int v = con-out_kvec_left;
+   int v = con-out_kvec_used;
 
m-footer.flags |= CEPH_MSG_FOOTER_COMPLETE;
 
@@ -1213,7 +1213,7 @@ static void prepare_write_message_footer(struct 
ceph_connection *con)
con-out_kvec[v].iov_len = sizeof(m-old_footer);
con-out_kvec_bytes += sizeof(m-old_footer);
}
-   con-out_kvec_left++;
+   con-out_kvec_used++;
con-out_more = m-more_to_follow;
con-out_msg_done = true;
 }
@@ -1462,7 +1462,7 @@ static int write_partial_kvec(struct ceph_connection *con)
dout(write_partial_kvec %p %d left\n, con, con-out_kvec_bytes);
while (con-out_kvec_bytes  0) {
ret = ceph_tcp_sendmsg(con-sock, con-out_kvec_cur,
-  con-out_kvec_left, con-out_kvec_bytes,
+  con-out_kvec_used, con-out_kvec_bytes,
   con-out_more);
if (ret = 0)
goto out;
@@ -1472,10 +1472,10 @@ static int write_partial_kvec(struct ceph_connection 
*con)
 
/* account for full iov entries consumed */
while (ret = con-out_kvec_cur-iov_len) {
-   BUG_ON(!con-out_kvec_left);
+   BUG_ON(!con-out_kvec_used);
ret -= con-out_kvec_cur-iov_len;
con-out_kvec_cur++;
-   con-out_kvec_left--;
+   con-out_kvec_used--;
}
/* and for a partially-consumed entry */
if (ret) {
@@ -1483,12 +1483,12 @@ static int write_partial_kvec(struct ceph_connection 
*con)
con-out_kvec_cur-iov_base += ret;
}
}
-   con-out_kvec_left = 0;
+   con-out_kvec_used = 0;
con-out_kvec_is_msg = false;
ret = 1;
 out:
dout(write_partial_kvec %p %d left in %d kvecs ret = %d\n, con,
-con-out_kvec_bytes, con-out_kvec_left, ret);
+con-out_kvec_bytes, con-out_kvec_used, ret);
return ret;  /* done! */
 }
 
@@ -2497,7 +2497,7 @@ more_kvec:
if (ret = 0)
goto out;
}
-   if (con-out_kvec_left) {
+   if (con-out_kvec_used) {
ret = write_partial_kvec(con);
if (ret = 0)
goto out;
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/7] osd_client: don't supply connection to handle_reply()

2015-04-27 Thread Alex Elder

The connection parameter in handle_reply() is never used, so get rid
of it.

Signed-off-by: Alex Elder el...@linaro.org
---
 net/ceph/osd_client.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index e3c2e8b..6539dfc 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1748,8 +1748,7 @@ static void complete_request(struct ceph_osd_request *req)
  * handle osd op reply.  either call the callback if it is specified,
  * or do the completion to wake up the waiting thread.
  */
-static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
-struct ceph_connection *con)
+static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg)
 {
void *p, *end;
struct ceph_osd_request *req;
@@ -2796,7 +2795,7 @@ static void dispatch(struct ceph_connection *con, struct 
ceph_msg *msg)
ceph_osdc_handle_map(osdc, msg);
break;
case CEPH_MSG_OSD_OPREPLY:
-   handle_reply(osdc, msg, con);
+   handle_reply(osdc, msg);
break;
case CEPH_MSG_WATCH_NOTIFY:
handle_watch_notify(osdc, msg);
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Alex Elder


On 04/24/2015 08:22 AM, Douglas Fuller wrote:

Support multiple class op calls in one ceph_msg and consolidate rbd header
read and refresh processes to use this feature to reduce the number of
ceph_msgs sent for that process. Refresh features on header refresh and
begin returning EIO if features have changed since mapping.


I think I understand the motivation for what you're doing here.
You ultimately want to pack more ops into a message, which
would be really great.  Ideally you could pack some arbitrary
number (much larger than 3), but it seems to me there'll be
some restructuring required to get there.

I am having a hard time understanding your intended
implementation though.  I've looked mainly at your first
patch, and scanned the other two.  Let's see if I have
it right.

You are adding another hunk of data to a class request,
and using it as an intermediate buffer for receiving
data.

Currently a class request has three hunks:
- request_info, a pagelist data item, which contains the
  class name and method name, and perhaps someday, an
  argument list (?).
- request_data, which can contain any number of data
  items of any type, and constitutes outgoing data for
  a write request.
- response_data, also any number of any type of data
  items, which constitutes space into which incoming
  data is placed.

You are adding a fourth:
- chain_data, which is a page vector big enough to hold
  all incoming response data.

When building up an OSD request, we call osd_req_encode_op()
for every op in the request.  For a class op, that records
the length of the class and method names in the op structure,
and then appends those names to the outbound request_data.
Following that, the real request data (if any) provided
with the class op is appended to request_data.

Next (before your patches) whatever buffer space was provided
(as a page array, pagelist, or bio) to receive response data
is appended to response_data.  Prior to this, the class
op will have been set up with adequate buffers to directly
receive the expected data.

What your first patch seems to do is circumvent this last
step, and instead of using the prearranged buffers, you
allocate another page vector, big enough to hold the
combined size of all response_data data items, and store
that in the new chain_data field of an OSD request class op.
(Note that this could be in *any* of the ops in an OSD
request, though I suspect it's always the first.)

When a response arrives, if the *first* op in the request
array is CEPH_OSD_OP_CALL, you split the data, which will
have been deposited in the chain_data page vector.  That
process involves allocating another buffer sufficient to
hold the entirety of the received data.  You copy the
received data into that buffer and release the chain_data
page vector.  And then you copy the data from this new
buffer back into the original response array, and then
free the buffer.

This sounds pretty expensive.  For every class operation
you are copying the received data two extra times.

It's possible I'm not understanding what you're doing here
and that's why I'm writing this now.  Before I provide
any more specific commentary on your patches, I want to
be sure I understand what you're trying to do.

If my understanding *is* correct, I would say you're
going about this the wrong way, and I may have some
suggestions for a better approach.

Will you please correct me where I'm wrong above, and
maybe give a little better high-level explanation of
what you're trying to do?  I see in patch 1 you mention
a problem with short reads, and there may be a simpler
fix than that (to begin with).  But beyond that, if
you're trying to incorporate more ops in a message,
there may be better ways to go about that as well.

Thanks.

-Alex


v2: Edit history and address comments from Mike Christie.

Douglas Fuller (3):
   ceph: support multiple class method calls in one ceph_msg
   rbd: combine object method calls in header refresh using fewer
 ceph_msgs
   rbd: re-read features during header refresh and detect changes.

  drivers/block/rbd.c | 512 +---
  include/linux/ceph/osd_client.h |   3 +-
  net/ceph/messenger.c|   4 +
  net/ceph/osd_client.c   |  90 ++-
  4 files changed, 462 insertions(+), 147 deletions(-)



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2 0/3] rbd: header read/refresh improvements

2015-04-26 Thread Alex Elder


On 04/26/2015 11:35 PM, Douglas Fuller wrote:

This solution just feels hacky and inefficient, so I think there is a
desire to feel like we at least tried to come up something simpler
and more efficient before proceeding.


Right.  And somehow more fitting with the existing code, if
that's possible.  There are a few things in this proposal
that are just special case code, and it would be nice to
avoid that.

I'm sorry I've been too busy today to get back to this.  I spent
a lot of time on Saturday getting ramped back up on this code
so I could provide a decent review...

I'm going to at least look at the code (error on a read op)
before going to bed, so I at least know what the problem is.
I'll try to have something to say (even if it's I've got
nothing) in the next day or two.

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] rbd: header read/refresh improvements

2015-04-24 Thread Alex Elder


On 04/23/2015 02:06 PM, Douglas Fuller wrote:

Support multiple class op calls in one ceph_msg and consolidate rbd header
read and refresh processes to use this feature to reduce the number of
ceph_msgs sent for that process. Refresh features on header refresh and
begin returning EIO if features have changed since mapping.

Douglas Fuller (3):
   ceph: support multiple class method calls in one ceph_msg
   rbd: combine object method calls in header refresh using fewer
 ceph_msgs
   rbd: re-read features during header refresh and detect changes.

  drivers/block/rbd.c | 518 +---
  include/linux/ceph/osd_client.h |   3 +-
  net/ceph/messenger.c|   4 +
  net/ceph/osd_client.c   |  92 ++-
  4 files changed, 470 insertions(+), 147 deletions(-)



In case Ilya or others don't get to it soon, I plan to review this
series tomorrow.

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] rbd: header read/refresh improvements

2015-04-24 Thread Alex Elder


On 04/24/2015 09:40 AM, Douglas Fuller wrote:



On Apr 24, 2015, at 10:17 AM, Ilya Dryomov idryo...@gmail.com
wrote:

On Fri, Apr 24, 2015 at 4:11 PM, Alex Elder el...@ieee.org
wrote:

On 04/23/2015 02:06 PM, Douglas Fuller wrote:


Support multiple class op calls in one ceph_msg and consolidate
rbd header read and refresh processes to use this feature to
reduce the number of ceph_msgs sent for that process. Refresh
features on header refresh and begin returning EIO if features
have changed since mapping.

Douglas Fuller (3): ceph: support multiple class method calls
in one ceph_msg rbd: combine object method calls in header
refresh using fewer ceph_msgs rbd: re-read features during
header refresh and detect changes.

drivers/block/rbd.c | 518
+---
include/linux/ceph/osd_client.h |   3 +- net/ceph/messenger.c
|   4 + net/ceph/osd_client.c   |  92 ++- 4 files
changed, 470 insertions(+), 147 deletions(-)



In case Ilya or others don't get to it soon, I plan to review
this series tomorrow.


I was planning take a look while I'm the road during the weekend.

Doug, from a quick look this revision still has a bunch of style
issues, most notably the alignment of function parameters and
braces around if / else.  See Documentation/CodingStyle in the
kernel tree for examples.


I needed to put out v2 in part because I squashed a couple fixup
commits in the wrong place, leaving some things behind in #2 that
were corrected in #3.

I changed the braces in that version, but the function parameter
indents are inconsistent throughout the code. I’ll try to come up
with a compromise.


When in doubt, lean toward the style used in the rest of
the kernel.  I used a few conventions that are not consistent
with that in a lot of places, and those can be gradually
phased toward what's recommended for the kernel.  Some
examples are:
sizeof x or sizeof (x)  --  sizeof(x)
(cast) foo  -- (cast)foo
White space under comment blocks
static int\nfunction(...)   - static int function(...)

-Alex



You might also want to run your patches through
scripts/checkpatch.pl, but take it with a grain of salt - it can be
a bit too extreme at times.  No need to post v3 with just style
fixes, wait for more feedback.


Thanks again for all feedback.



Thanks,

Ilya




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: kfree() in put_osd() shouldn't depend on authorizer

2015-02-18 Thread Alex Elder

On 02/18/2015 07:27 AM, Ilya Dryomov wrote:
 a255651d4cad (ceph: ensure auth ops are defined before use) made
 kfree() in put_osd() conditional on the authorizer.  A mechanical
 mistake most likely - fix it.

You are generous in suggesting it's a mechanical mistake.
But it is a mistake nevertheless.  The fix looks good.

Reviewed-by: Alex Elder el...@linaro.org

 Cc: Alex Elder el...@linaro.org
 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  net/ceph/osd_client.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index f693a2f8ac86..41a4abc7e98e 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1035,10 +1035,11 @@ static void put_osd(struct ceph_osd *osd)
  {
   dout(put_osd %p %d - %d\n, osd, atomic_read(osd-o_ref),
atomic_read(osd-o_ref) - 1);
 - if (atomic_dec_and_test(osd-o_ref)  osd-o_auth.authorizer) {
 + if (atomic_dec_and_test(osd-o_ref)) {
   struct ceph_auth_client *ac = osd-o_osdc-client-monc.auth;
  
 - ceph_auth_destroy_authorizer(ac, osd-o_auth.authorizer);
 + if (osd-o_auth.authorizer)
 + ceph_auth_destroy_authorizer(ac, 
 osd-o_auth.authorizer);
   kfree(osd);
   }
  }
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: fix double __remove_osd() problem

2015-02-18 Thread Alex Elder

On 02/18/2015 07:25 AM, Ilya Dryomov wrote:
 It turns out it's possible to get __remove_osd() called twice on the
 same OSD.  That doesn't sit well with rb_erase() - depending on the
 shape of the tree we can get a NULL dereference, a soft lockup or
 a random crash at some point in the future as we end up touching freed
 memory.  One scenario that I was able to reproduce is as follows:
 
 osd3 is idle, on the osd lru list
 con reset - osd3
 con_fault_finish()
   osd_reset()
   osdmap - osd3 down
   ceph_osdc_handle_map()
 takes map_sem
 kick_requests()
   takes request_mutex
   reset_changed_osds()
 __reset_osd()
   __remove_osd()
   releases request_mutex
 releases map_sem
 takes map_sem
 takes request_mutex
 __kick_osd_requests()
   __reset_osd()
 __remove_osd() -- !!!
 
 A case can be made that osd refcounting is imperfect and reworking it
 would be a proper resolution, but for now Sage and I decided to fix
 this by adding a safe guard around __remove_osd().
 
 Fixes: http://tracker.ceph.com/issues/8087

So now you rely on the RB node in the osd getting cleared
as a signal that it has been removed already.  Yes, that
sounds like refcounting isn't working as desired...

The mutex around all calls to (now) remove_osd() avoids
races.  I like the RB_CLEAR_NODE() call anyway.
OK, enough chit chat.  This looks OK to me.

Reviewed-by: Alex Elder el...@linaro.org

 Cc: Sage Weil sw...@redhat.com
 Cc: sta...@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular 
 and lingering lists in __remove_osd()
 Cc: sta...@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to 
 WARN for __remove_osd() asserts
 Cc: sta...@vger.kernel.org # 3.9+
 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  net/ceph/osd_client.c | 26 ++
  1 file changed, 18 insertions(+), 8 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 53299c7b0ca4..f693a2f8ac86 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1048,14 +1048,24 @@ static void put_osd(struct ceph_osd *osd)
   */
  static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
  {
 - dout(__remove_osd %p\n, osd);
 + dout(%s %p osd%d\n, __func__, osd, osd-o_osd);
   WARN_ON(!list_empty(osd-o_requests));
   WARN_ON(!list_empty(osd-o_linger_requests));
  
 - rb_erase(osd-o_node, osdc-osds);
   list_del_init(osd-o_osd_lru);
 - ceph_con_close(osd-o_con);
 - put_osd(osd);
 + rb_erase(osd-o_node, osdc-osds);
 + RB_CLEAR_NODE(osd-o_node);
 +}
 +
 +static void remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd)
 +{
 + dout(%s %p osd%d\n, __func__, osd, osd-o_osd);
 +
 + if (!RB_EMPTY_NODE(osd-o_node)) {
 + ceph_con_close(osd-o_con);
 + __remove_osd(osdc, osd);
 + put_osd(osd);
 + }
  }
  
  static void remove_all_osds(struct ceph_osd_client *osdc)
 @@ -1065,7 +1075,7 @@ static void remove_all_osds(struct ceph_osd_client 
 *osdc)
   while (!RB_EMPTY_ROOT(osdc-osds)) {
   struct ceph_osd *osd = rb_entry(rb_first(osdc-osds),
   struct ceph_osd, o_node);
 - __remove_osd(osdc, osd);
 + remove_osd(osdc, osd);
   }
   mutex_unlock(osdc-request_mutex);
  }
 @@ -1106,7 +1116,7 @@ static void remove_old_osds(struct ceph_osd_client 
 *osdc)
   list_for_each_entry_safe(osd, nosd, osdc-osd_lru, o_osd_lru) {
   if (time_before(jiffies, osd-lru_ttl))
   break;
 - __remove_osd(osdc, osd);
 + remove_osd(osdc, osd);
   }
   mutex_unlock(osdc-request_mutex);
  }
 @@ -1121,8 +1131,7 @@ static int __reset_osd(struct ceph_osd_client *osdc, 
 struct ceph_osd *osd)
   dout(__reset_osd %p osd%d\n, osd, osd-o_osd);
   if (list_empty(osd-o_requests) 
   list_empty(osd-o_linger_requests)) {
 - __remove_osd(osdc, osd);
 -
 + remove_osd(osdc, osd);
   return -ENODEV;
   }
  
 @@ -1926,6 +1935,7 @@ static void reset_changed_osds(struct ceph_osd_client 
 *osdc)
  {
   struct rb_node *p, *n;
  
 + dout(%s %p\n, __func__, osdc);
   for (p = rb_first(osdc-osds); p; p = n) {
   struct ceph_osd *osd = rb_entry(p, struct ceph_osd, o_node);
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] rbd: fix rbd_dev_parent_get() when parent_overlap == 0

2015-01-26 Thread Alex Elder

On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
 The comment for rbd_dev_parent_get() said
 
 * We must get the reference before checking for the overlap to
 * coordinate properly with zeroing the parent overlap in
 * rbd_dev_v2_parent_info() when an image gets flattened.  We
 * drop it again if there is no overlap.
 
 but the drop it again if there is no overlap part was missing from
 the implementation.  This lead to absurd parent_ref values for images
 with parent_overlap == 0, as parent_ref was incremented for each
 img_request and virtually never decremented.

You're right about this.  If the image had a parent with no
overlap this would leak a reference to the parent image.  The
code should have said:

counter = atomic_inc_return_safe(rbd_dev-parent_ref);
if (counter  0) {
if (rbd_dev-parent_overlap)
return true;
atomic_dec(rbd_dev-parent_ref);
} else if (counter  0) {
rbd_warn(rbd_dev, parent reference overflow);
}

 Fix this by leveraging the fact that refresh path calls
 rbd_dev_v2_parent_info() under header_rwsem and use it for read in
 rbd_dev_parent_get(), instead of messing around with atomics.  Get rid
 of barriers in rbd_dev_v2_parent_info() while at it - I don't see what
 they'd pair with now and I suspect we are in a pretty miserable
 situation as far as proper locking goes regardless.

The point of the memory barrier was to ensure that when parent_overlap
gets zeroed, this code sees the zero rather than the old non-zero
value.  The atomic_inc_return_safe() call has an implicit memory
barrier to match the smp_mb() call.  It allowed the synchronization
to occur without the use of a lock.

We're trying to atomically determine whether an image request needs
to be marked as layered, to know how to handle ENOENT on parent reads.
If it is a write to an image with a parent having a non-zero overlap,
it's layered, otherwise we can treat it as a simple request.

I think in this particular case, this is just an optimization,
trying very hard to avoid having to do layered image handling
if the parent has become flattened.  I think that even if it
got old information (suggesting non-zero overlap) things would
behave correctly, just less efficiently.

Using the semaphore adds a lock to this path and therefore
implements whatever barriers are being removed.  I'm not
sure how often this is hit--maybe the optimization isn't
buying much after all.

I am getting a little rusty on some of details of what
precisely happens when a layered image gets flattened.
But I think this looks OK.  Maybe just watch for small
(perhaps insignificant) performance regressions with
this change in place...

Reviewed-by: Alex Elder el...@linaro.org

 Cc: sta...@vger.kernel.org # 3.11+
 Signed-off-by: Ilya Dryomov idryo...@redhat.com
 ---
  drivers/block/rbd.c | 20 ++--
  1 file changed, 6 insertions(+), 14 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 31fa00f0d707..2990a1c75159 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2098,32 +2098,26 @@ static void rbd_dev_parent_put(struct rbd_device 
 *rbd_dev)
   * If an image has a non-zero parent overlap, get a reference to its
   * parent.
   *
 - * We must get the reference before checking for the overlap to
 - * coordinate properly with zeroing the parent overlap in
 - * rbd_dev_v2_parent_info() when an image gets flattened.  We
 - * drop it again if there is no overlap.
 - *
   * Returns true if the rbd device has a parent with a non-zero
   * overlap and a reference for it was successfully taken, or
   * false otherwise.
   */
  static bool rbd_dev_parent_get(struct rbd_device *rbd_dev)
  {
 - int counter;
 + int counter = 0;
  
   if (!rbd_dev-parent_spec)
   return false;
  
 - counter = atomic_inc_return_safe(rbd_dev-parent_ref);
 - if (counter  0  rbd_dev-parent_overlap)
 - return true;
 -
 - /* Image was flattened, but parent is not yet torn down */
 + down_read(rbd_dev-header_rwsem);
 + if (rbd_dev-parent_overlap)
 + counter = atomic_inc_return_safe(rbd_dev-parent_ref);
 + up_read(rbd_dev-header_rwsem);
  
   if (counter  0)
   rbd_warn(rbd_dev, parent reference overflow);
  
 - return false;
 + return counter  0;
  }
  
  /*
 @@ -4238,7 +4232,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
*/
   if (rbd_dev-parent_overlap) {
   rbd_dev-parent_overlap = 0;
 - smp_mb();
   rbd_dev_parent_put(rbd_dev);
   pr_info(%s: clone image has been flattened\n,
   rbd_dev-disk-disk_name);
 @@ -4284,7 +4277,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
* treat it specially.
*/
   rbd_dev-parent_overlap = overlap

Re: [PATCH 3/3] rbd: do not treat standalone as flatten

2015-01-26 Thread Alex Elder

On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
 If the clone is resized down to 0, it becomes standalone.  If such
 resize is carried over while an image is mapped we would detect this
 and call rbd_dev_parent_put() which means let go of all parent state,
 including the spec(s) of parent images(s).  This leads to a mismatch
 between rbd info and sysfs parent fields, so a fix is in order.
 
 # rbd create --image-format 2 --size 1 foo
 # rbd snap create foo@snap
 # rbd snap protect foo@snap
 # rbd clone foo@snap bar
 # DEV=$(rbd map bar)
 # rbd resize --allow-shrink --size 0 bar
 # rbd resize --size 1 bar
 # rbd info bar | grep parent
 parent: rbd/foo@snap
 
 Before:
 
 # cat /sys/bus/rbd/devices/0/parent
 (no parent image)
 
 After:
 
 # cat /sys/bus/rbd/devices/0/parent
 pool_id 0
 pool_name rbd
 image_id 10056b8b4567
 image_name foo
 snap_id 2
 snap_name snap
 overlap 0
 
 Signed-off-by: Ilya Dryomov idryo...@redhat.com

Hmm.  Interesting.

I think that a parent with an overlap of 0 is of no
real use.  So in the last patch I was suggesting it
should just go away.

But now, looking at it from this perspective, the fact
that an image *came from* a particular parent, but which
has no more overlap, could be useful information.  The
parent shouldn't simply go away without the user requesting
that.

I haven't completely followed through the logic of keeping
the reference around but I understand what you're doing and
it looks OK to me.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c | 30 ++
  1 file changed, 10 insertions(+), 20 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index b85d52005a21..e818c2a6ffb1 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4273,32 +4273,22 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
   }
  
   /*
 -  * We always update the parent overlap.  If it's zero we
 -  * treat it specially.
 +  * We always update the parent overlap.  If it's zero we issue
 +  * a warning, as we will proceed as if there was no parent.
*/
 - rbd_dev-parent_overlap = overlap;
   if (!overlap) {
 -
 - /* A null parent_spec indicates it's the initial probe */
 -
   if (parent_spec) {
 - /*
 -  * The overlap has become zero, so the clone
 -  * must have been resized down to 0 at some
 -  * point.  Treat this the same as a flatten.
 -  */
 - rbd_dev_parent_put(rbd_dev);
 - pr_info(%s: clone image now standalone\n,
 - rbd_dev-disk-disk_name);
 + /* refresh, careful to warn just once */
 + if (rbd_dev-parent_overlap)
 + rbd_warn(rbd_dev,
 + clone now standalone (overlap became 0));
   } else {
 - /*
 -  * For the initial probe, if we find the
 -  * overlap is zero we just pretend there was
 -  * no parent image.
 -  */
 - rbd_warn(rbd_dev, ignoring parent with overlap 0);
 + /* initial probe */
 + rbd_warn(rbd_dev, clone is standalone (overlap 0));
   }
   }
 + rbd_dev-parent_overlap = overlap;
 +
  out:
   ret = 0;
  out_err:
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] rbd: drop parent_ref in rbd_dev_unprobe() unconditionally

2015-01-26 Thread Alex Elder

On 01/20/2015 06:41 AM, Ilya Dryomov wrote:
 This effectively reverts the last hunk of 392a9dad7e77 (rbd: detect
 when clone image is flattened).
 
 The problem with parent_overlap != 0 condition is that it's possible
 and completely valid to have an image with parent_overlap == 0 whose
 parent state needs to be cleaned up on unmap.  The next commit, which
 drops the clone image now standalone logic, opens up another window
 of opportunity to hit this, but even without it
 
 # cat parent-ref.sh
 #!/bin/bash
 rbd create --image-format 2 --size 1 foo
 rbd snap create foo@snap
 rbd snap protect foo@snap
 rbd clone foo@snap bar
 rbd resize --allow-shrink --size 0 bar
 rbd resize --size 1 bar
 DEV=$(rbd map bar)
 rbd unmap $DEV
 
 leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client
 hanging around.

I'm not sure why the last reference to the parent
doesn't get dropped (and state cleaned up) as soon
as the overlap becomes 0.  I suspect it's the original
reference taken when there's a parent, we don't get
rid of it until it's torn down.  (I think we should.)

It seems to me the test here should be for a non-null
parent_spec pointer rather than non-zero parent_overlap.
And that's done inside rbd_dev_parent_put(), so your
change looks reasonable to me.

Reviewed-by: Alex Elder el...@linaro.org

 
 My thinking behind calling rbd_dev_parent_put() unconditionally is that
 there shouldn't be any requests in flight at that point in time as we
 are deep into unmap sequence.  Hence, even if rbd_dev_unparent() caused
 by flatten is delayed by in-flight requests, it will have finished by
 the time we reach rbd_dev_unprobe() caused by unmap, thus turning
 unconditional rbd_dev_parent_put() into a no-op.
 
 Fixes: http://tracker.ceph.com/issues/10352
 
 Cc: sta...@vger.kernel.org # 3.11+
 Signed-off-by: Ilya Dryomov idryo...@redhat.com
 ---
  drivers/block/rbd.c | 5 +
  1 file changed, 1 insertion(+), 4 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 2990a1c75159..b85d52005a21 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -5075,10 +5075,7 @@ static void rbd_dev_unprobe(struct rbd_device *rbd_dev)
  {
   struct rbd_image_header *header;
  
 - /* Drop parent reference unless it's already been done (or none) */
 -
 - if (rbd_dev-parent_overlap)
 - rbd_dev_parent_put(rbd_dev);
 + rbd_dev_parent_put(rbd_dev);
  
   /* Free dynamic fields from the header, then zero it out */
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: convert to blk-mq

2015-01-10 Thread Alex Elder

On 01/10/2015 12:31 PM, Christoph Hellwig wrote:
 This converts the rbd driver to use the blk-mq infrastructure.  Except
 for switching to a per-request work item this is almost mechanical.
 
 This was tested by Alexandre DERUMIER in November, and found to give
 him 12 iops, although the only comparism available was an old
 3.10 kernel which gave 8iops.

I'm coming up to speed with the blk-mq stuff only now.  It looks
like requests are sent to the driver via -queue_rq() rather than
the driver taking them via blk_fetch_request(q).

Previously we would pull as many requests as were available, put
them on the device's request queue, and then activate the rbd
workqueue to handle them one-by-one using rbd_handle_request().

Now, the rbd queue_rq method rbd_request_workfn() adds the request
to the rbd workqueue directly.  The work_struct implicitly follows
the request structure (which is set up by the blk-mq code).  We
have to do the REQ_TYPE_FS check at the time it's queued now,
rather than when it's fetched from the queue.  And finally we now
have to tell the blk-mq subsystem when we've started and ended a
request.

I didn't follow up on all the tag_set initialization values
so I assume you got that right (it looks reasonable to me).

Given the above, it looks like everything else should work
about the same as before, we're just handed requests rather
than asking for them.

With this patch applied, rbd_device-rq_queue is no longer
needed so you should delete it.  I got two warnings about
endo-of-line whitespace in your patch.  And I have one other
very small suggestion below.

Other than those things, this looks great to me.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Christoph Hellwig h...@lst.de
 ---
  drivers/block/rbd.c | 118 
 +---
  1 file changed, 67 insertions(+), 51 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 3ec85df..52cd677 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c

. . .

(The following is in the new rbd_queue_rq().)

 + queue_work(rbd_wq, work);
 + return 0;

return BLK_MQ_RQ_QUEUE_OK;

(Because the symbolic values are explicitly checked
by the caller.)

  }
  
  /*

. . .

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: don't treat CEPH_OSD_OP_DELETE as extent op

2014-11-24 Thread Alex Elder

On 11/24/2014 03:59 AM, Ilya Dryomov wrote:
 CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such.  This
 sneaked in with discard patches - it's one of the three osd ops (the
 other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
 is implemented with.
 
 Signed-off-by: Ilya Dryomov idryo...@redhat.com

Is the CEPH_OSD_OP_DELETE used in ceph_zero_partial_object()
an extent op?  If it is not, you should get rid of the BUG_ON()
in osd_req_op_extent_init() that allows CEPH_OSD_OP_DELETE.

And if that's the case it looks like that function or
ceph_osdc_new_request() handle CEPH_OSD_OP_DELETE
properly--so it's not treated as an extent op.

And:  osd_req_encode_op() encodes a CEPH_OSD_OP_DELETE
as an extent op as well.

If it *can* be an extent op (but just not as used by RBD)
then it warrants a comment here that explains why it is
not being initialized as an extent op.

-Alex

 ---
  drivers/block/rbd.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 27b71a0b72d0..1df0802bf6cb 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2370,8 +2370,12 @@ static void rbd_img_obj_request_fill(struct 
 rbd_obj_request *obj_request,
   opcode = CEPH_OSD_OP_READ;
   }
  
 - osd_req_op_extent_init(osd_request, num_ops, opcode, offset, length,
 - 0, 0);
 + if (opcode == CEPH_OSD_OP_DELETE)
 + osd_req_op_init(osd_request, num_ops, opcode);
 + else
 + osd_req_op_extent_init(osd_request, num_ops, opcode,
 +offset, length, 0, 0);
 +
   if (obj_request-type == OBJ_REQUEST_BIO)
   osd_req_op_extent_osd_data_bio(osd_request, num_ops,
   obj_request-bio_list, length);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: krbd blk-mq support ?

2014-10-28 Thread Alex Elder


On 10/28/2014 01:07 PM, Christoph Hellwig wrote:

On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:

Can you do a perf report -ag and then a perf report to see where these
cycles are spent?


Yes, sure.

I have attached the perf report to this mail.
(This is with kernel 3.14, don't have access to my 3.18  host for now)


Oh, that's without the blk-mq patch?

Either way the profile doesn't really sum up to a fully used up
cpu.  Sage, Alex - are there any ordring constraints in the rbd client?


I don't remember off hand.

In libceph I recall going to great lengths to retain the original
order of requests when they got re-sent after a connection reset.

I'll go look at the code a bit and see if I can refresh my memory
(though Sage may answer before I do).

-Alex


If not we could probably aim for per-cpu queues using blk-mq and a
socket per cpu or similar.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: krbd blk-mq support ?

2014-10-28 Thread Alex Elder


On 10/28/2014 01:07 PM, Christoph Hellwig wrote:

On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote:

Can you do a perf report -ag and then a perf report to see where these
cycles are spent?


Yes, sure.

I have attached the perf report to this mail.
(This is with kernel 3.14, don't have access to my 3.18  host for now)


Oh, that's without the blk-mq patch?

Either way the profile doesn't really sum up to a fully used up
cpu.  Sage, Alex - are there any ordring constraints in the rbd client?
If not we could probably aim for per-cpu queues using blk-mq and a
socket per cpu or similar.


First, a disclaimer--I haven't really been following this discussion
very closely.

For an rbd image request (which is what gets created from requests
from the block queue), the order of completion doesn't matter, and
although the object requests are submitted in order that shouldn't
be required either.

The image request is broken into one or more object requests (usually
just one) and they are treated as a unit.  When the last object request
of a set for an image request has completed, the image request is
treated as completed.

I hope that helps.  If not, ask again a different way...

-Alex


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] libceph: resend lingering requests with a new tid

2014-09-22 Thread Alex Elder

On 09/05/2014 03:42 AM, Ilya Dryomov wrote:
 Both not yet registered (r_linger  list_empty(r_linger_item)) and
 registered linger requests should use the new tid on resend to avoid
 the dup op detection logic on the OSDs, yet we were doing this only for
 registered case.  Factor out and simplify the registered logic and
 use the new helper for not registered case as well.
 
 Fixes: http://tracker.ceph.com/issues/8806

The issue description describes the failure scenario nicely.
These linger requests are a nice service to provide but they
sure have proved tricky to get right...

After reviewing almost everything I came up with one big
question and I don't have time right now to investigate the
answer so I hope you will.  See below for the question in
context.

With that exception, the logic looks correct to me.  I have
some suggestions below for you to consider.  Let me know
what you intend to do, and if you are sure my big question
is not an issue you can go ahead and add this:

Reviewed-by: Alex Elder el...@linaro.org

If not, please update and give me a chance to look at it
again.  Thanks.

-Alex

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  net/ceph/osd_client.c |   75 
 -
  1 file changed, 55 insertions(+), 20 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 8db10b4a6553..81ee046b24d0 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -30,8 +30,11 @@ static void __send_queued(struct ceph_osd_client *osdc);
  static int __reset_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd);
  static void __register_request(struct ceph_osd_client *osdc,
  struct ceph_osd_request *req);
 +static void __unregister_request(struct ceph_osd_client *osdc,
 +  struct ceph_osd_request *req);
  static void __unregister_linger_request(struct ceph_osd_client *osdc,
   struct ceph_osd_request *req);
 +static void __enqueue_request(struct ceph_osd_request *req);
  static void __send_request(struct ceph_osd_client *osdc,
  struct ceph_osd_request *req);
  
 @@ -892,6 +895,39 @@ __lookup_request_ge(struct ceph_osd_client *osdc,
   return NULL;
  }
  
 +static void __kick_linger_request(struct ceph_osd_request *req,
 +   bool lingering)

I think committed or maybe acknowledged might be a better
name than lingering.  All of them are marked to linger, so
something other than lingering might be a little clearer.

 +{
 + struct ceph_osd_client *osdc = req-r_osdc;
 + struct ceph_osd *osd = req-r_osd;
 +
 + /*
 +  * Linger requests need to be resent with a new tid to avoid
 +  * the dup op detection logic on the OSDs.  Achieve this with
 +  * a re-register dance instead of open-coding.

This comment is more oriented toward this particular commit
rather than what's really going on in the code.  I.e., you
should instead sayRe-register each request--whether or not
we know it's been committed to disk on the OSD.  If we ever
sent a linger request we must assume it's been committed, so
even un-acknowledged linger requests need a new TID. or
something like that (or maybe something simpler).

 +  */
 + ceph_osdc_get_request(req);

As a rule, when there's an if-else, I prefer to avoid
negating the condition.  It's a subtle style thing, but
to me it makes it slightly easier to mentally parse
(occasionally avoiding a double negative).  There's
another instance of this a little further down.

 + if (!lingering)
 + __unregister_request(osdc, req);
 + else
 + __unregister_linger_request(osdc, req);
 + __register_request(osdc, req);
 + ceph_osdc_put_request(req);
 +
 + /*
 +  * Unless request has been registered as both normal and
 +  * lingering, __unregister{,_linger}_request clears r_osd.
 +  * However, here we need to preserve r_osd to make sure we
 +  * requeue on the same OSD.
 +  */
 + WARN_ON(req-r_osd || !osd);
 + req-r_osd = osd;
 +
 + dout(requeueing %p tid %llu osd%d lingering=%d\n, req, req-r_tid,
 +  osd-o_osd, lingering);
 + __enqueue_request(req);
 +}
 +
  /*
   * Resubmit requests pending on the given osd.
   */
 @@ -900,12 +936,14 @@ static void __kick_osd_requests(struct ceph_osd_client 
 *osdc,
  {
   struct ceph_osd_request *req, *nreq;
   LIST_HEAD(resend);
 + LIST_HEAD(resend_linger);
   int err;
  
   dout(__kick_osd_requests osd%d\n, osd-o_osd);
   err = __reset_osd(osdc, osd);
   if (err)
   return;
 +
   /*
* Build up a list of requests to resend by traversing the
* osd's list of requests.  Requests for a given object are

This block of comments (starting with the above but not shown
here) should be updated to reflect the modified logic

Re: [PATCH] libceph: fix a memory leak in handle_watch_notify

2014-09-11 Thread Alex Elder


On 09/11/2014 03:31 AM, Ilya Dryomov wrote:

On Thu, Sep 11, 2014 at 5:41 AM, Alex Elder el...@ieee.org wrote:

On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote:


From: Li RongQing roy.qing...@gmail.com

event_work should be freed when adding it to queue failed

Signed-off-by: Li RongQing roy.qing...@gmail.com



Looks good.

Reviewed-by: Alex Elder el...@linaro.org


Hmm, queue_work() returns %false if @work was already on a queue, %true
otherwise, so this seems bogus to me.  I'd go with something like this
(mangled).


The original change was fine.  Whether it matters is another question.
Your suggestion looks good as well, and on the assumption that if you
choose to use it instead your real fix is done correctly you can
use Reviewed-by: me if you like.

-Alex



 From c0711eee447b199b1c2193460fce8c9d958f23f4 Mon Sep 17 00:00:00 2001
From: Ilya Dryomov ilya.dryo...@inktank.com
Date: Thu, 11 Sep 2014 12:18:53 +0400
Subject: [PATCH] libceph: don't try checking queue_work() return value

queue_work() doesn't fail to queue, it returns false if work was
already on a queue, which can't happen here since we allocate
event_work right before we queue it.  So don't bother at all.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
  net/ceph/osd_client.c |   15 +--
  1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 0f569d322405..952e9c254cc7 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2355,26 +2355,21 @@ static void handle_watch_notify(struct
ceph_osd_client *osdc,
 if (event) {
 event_work = kmalloc(sizeof(*event_work), GFP_NOIO);
 if (!event_work) {
-   dout(ERROR: could not allocate event_work\n);
-   goto done_err;
+   pr_err(couldn't allocate event_work\n);
+   ceph_osdc_put_event(event);
+   return;
 }
 INIT_WORK(event_work-work, do_event_work);
 event_work-event = event;
 event_work-ver = ver;
 event_work-notify_id = notify_id;
 event_work-opcode = opcode;
-   if (!queue_work(osdc-notify_wq, event_work-work)) {
-   dout(WARNING: failed to queue notify event work\n);
-   goto done_err;
-   }
+
+   queue_work(osdc-notify_wq, event_work-work);
 }

 return;

-done_err:
-   ceph_osdc_put_event(event);
-   return;
-
  bad:
 pr_err(osdc handle_watch_notify corrupt msg\n);
  }



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder

On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
 Trying to map an image out of a pool for which we don't have an 'x'
 permission bit fails with -ERANGE from ceph_extract_encoded_string()
 due to an unsigned vs signed bug.  Fix it and get rid of the -ENIVAL
 sink, thus exposing rbd::get_id cls method return value.  (I've seen
 a bunch of unexplained -ERANGE reports, I bet this is it).
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

I often think people are annoyed by my explicit type casts
all over the place.  This (missed) one matters a lot.

I think the -EINVAL was to ensure an error code that was
expected by a write() call would be returned.

In any case, this looks good to me.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c |4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 4b97baf8afa3..fe3726c62a37 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
   ret = image_id ? 0 : -ENOMEM;
   if (!ret)
   rbd_dev-image_format = 1;
 - } else if (ret  sizeof (__le32)) {
 + } else if (ret  (int)sizeof(u32)) {
   void *p = response;
  
   image_id = ceph_extract_encoded_string(p, p + ret,
 @@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
   ret = PTR_ERR_OR_ZERO(image_id);
   if (!ret)
   rbd_dev-image_format = 2;
 - } else {
 - ret = -EINVAL;
   }
  
   if (!ret) {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder

On 09/11/2014 11:17 AM, Ilya Dryomov wrote:
 On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder el...@ieee.org wrote:
 On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
 Trying to map an image out of a pool for which we don't have an 'x'
 permission bit fails with -ERANGE from ceph_extract_encoded_string()
 due to an unsigned vs signed bug.  Fix it and get rid of the -ENIVAL
 sink, thus exposing rbd::get_id cls method return value.  (I've seen
 a bunch of unexplained -ERANGE reports, I bet this is it).

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

 I often think people are annoyed by my explicit type casts
 all over the place.  This (missed) one matters a lot.

 I think the -EINVAL was to ensure an error code that was
 expected by a write() call would be returned.
 
 Yeah, the way it's written it's possible in theory to get a positive
 return value from rbd_dev_image_id().  Looking deeper, this sizeof() is
 not needed at all - ceph_extract_encoded_string() deals with short
 buffers as it should.  As for the ret == sizeof(u32) (i.e. an empty
 string), neither userspace nor us check against empty strings in
 similar cases (object prefix, snapshot name, etc).

I should have asked this before.  Why is a permission error
leading to ceph_extract_encoded_string() finding a short
buffer?  I didn't take the time to trace the error path
you're talking about here all the way back.

(I'm looking at your new patch in the mean time.)

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder

On 09/11/2014 11:27 AM, Ilya Dryomov wrote:
 I should have asked this before.  Why is a permission error
  leading to ceph_extract_encoded_string() finding a short
  buffer?  I didn't take the time to trace the error path
  you're talking about here all the way back.
 rbd_obj_method_sync() returns -EPERM, which when compared with size_t
 from sizeof() ends up a big positive.  ceph_extract_encoded_string() is
 then called and the safe decode macro kicks in..

I somehow got lost along the way and thought rbd_obj_method_sync()
was returning -ERANGE.  The problem starts because rbd_obj_method_sync()
returns an error other than -ENOENT, and that led me astray I think.

Sorry for the confusion.

-Alex

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: do not return -ERANGE on auth failure

2014-09-11 Thread Alex Elder

On 09/11/2014 11:17 AM, Ilya Dryomov wrote:
 On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder el...@ieee.org wrote:
 On 09/11/2014 10:10 AM, Ilya Dryomov wrote:
 Trying to map an image out of a pool for which we don't have an 'x'
 permission bit fails with -ERANGE from ceph_extract_encoded_string()
 due to an unsigned vs signed bug.  Fix it and get rid of the -ENIVAL
 sink, thus exposing rbd::get_id cls method return value.  (I've seen
 a bunch of unexplained -ERANGE reports, I bet this is it).

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

 I often think people are annoyed by my explicit type casts
 all over the place.  This (missed) one matters a lot.

 I think the -EINVAL was to ensure an error code that was
 expected by a write() call would be returned.
 
 Yeah, the way it's written it's possible in theory to get a positive
 return value from rbd_dev_image_id().  Looking deeper, this sizeof() is
 not needed at all - ceph_extract_encoded_string() deals with short
 buffers as it should.  As for the ret == sizeof(u32) (i.e. an empty
 string), neither userspace nor us check against empty strings in
 similar cases (object prefix, snapshot name, etc).
 
 With the above in mind, how about this?
 
 From 3ded0a7fee82f2204c58b4fc00fc74f05331514d Mon Sep 17 00:00:00 2001
 From: Ilya Dryomov ilya.dryo...@inktank.com
 Date: Thu, 11 Sep 2014 18:49:18 +0400
 Subject: [PATCH] rbd: do not return -ERANGE on auth failures
 
 Trying to map an image out of a pool for which we don't have an 'x'
 permission bit fails with -ERANGE from ceph_extract_encoded_string()
 due to an unsigned vs signed bug.  Fix it and get rid of the -EINVAL
 sink, thus propagating rbd::get_id cls method errors.  (I've seen
 a bunch of unexplained -ERANGE reports, I bet this is it).
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

So now we know that the value returned by rbd_dev_image_id()
will be either 0 or a negative errno.  It could still
return something that write(2) isn't defined to return,
but at least it's an error.  That's OK with me...

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c |4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 4b97baf8afa3..ce457db5d847 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
 ret = image_id ? 0 : -ENOMEM;
 if (!ret)
 rbd_dev-image_format = 1;
 -   } else if (ret  sizeof (__le32)) {
 +   } else if (ret = 0) {
 void *p = response;
 
 image_id = ceph_extract_encoded_string(p, p + ret,
 @@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
 ret = PTR_ERR_OR_ZERO(image_id);
 if (!ret)
 rbd_dev-image_format = 2;
 -   } else {
 -   ret = -EINVAL;
 }
 
 if (!ret) {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: fix a memory leak in handle_watch_notify

2014-09-10 Thread Alex Elder


On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote:

From: Li RongQing roy.qing...@gmail.com

event_work should be freed when adding it to queue failed

Signed-off-by: Li RongQing roy.qing...@gmail.com


Looks good.

Reviewed-by: Alex Elder el...@linaro.org


---
  net/ceph/osd_client.c |1 +
  1 file changed, 1 insertion(+)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 30f6faf..1e1b4f1 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2323,6 +2323,7 @@ static void handle_watch_notify(struct ceph_osd_client 
*osdc,
event_work-opcode = opcode;
if (!queue_work(osdc-notify_wq, event_work-work)) {
dout(WARNING: failed to queue notify event work\n);
+   kfree(event_work);
goto done_err;
}
}



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH -next] rbd: fix error return code in rbd_dev_device_setup()

2014-08-13 Thread Alex Elder


On 08/13/2014 07:56 PM, weiyj...@163.com wrote:

From: Wei Yongjun yongjun_...@trendmicro.com.cn

Fix to return -ENOMEM from the workqueue alloc error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn


This looks good to me.  Sorry I missed this when I first
reviewed the workqueue code.

Reviewed-by: Alex Elder el...@linaro.org



---
  drivers/block/rbd.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 623c841..f3be022 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5088,8 +5087,10 @@ static int rbd_dev_device_setup(struct rbd_device 
*rbd_dev)
set_disk_ro(rbd_dev-disk, rbd_dev-mapping.read_only);

rbd_dev-rq_wq = alloc_workqueue(rbd_dev-disk-disk_name, 0, 0);
-   if (!rbd_dev-rq_wq)
+   if (!rbd_dev-rq_wq) {
+   ret = -ENOMEM;
goto err_out_mapping;
+   }

ret = rbd_bus_add_dev(rbd_dev);
if (ret)

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly

2014-08-08 Thread Alex Elder

On 08/08/2014 06:18 AM, Ilya Dryomov wrote:
 Determining -last_piece based on the value of -page_offset + length
 is incorrect because length here is the length of the entire message.
 -last_piece set to false even if page array data item length is =
 PAGE_SIZE, which results in invalid length passed to
 ceph_tcp_{send,recv}page() and causes various asserts to fire.

I glanced at this this morning and thought it might have
a problem.  Now that I've had the time to look at it a
little more closely I conclude it's fine.

I'm going to explain my reasoning, which just reaffirms what
you've explained. (Doing this helps me brush up on things,
I'm starting to get rusty...)

A message has a list of data items, and a data item is
broken into pieces, each of which (for a pages data item)
holds data from a single page.  A cursor keeps the data item
now being processed in cursor-data, and cursor-resid is the
number of un-consumed bytes in the current data item.  We want
to set cursor-last_piece when the next piece to process is
the last in the *data item*, not the last in the *message*,
so we need to compare against cursor-resid, not the message
length.

Before your change, any message with more than one data
item including a page array data item that was not last
in the list would have problems.

So in summary:  Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 # cat pages-cursor-init.sh
 #!/bin/bash
 rbd create --size 10 --image-format 2 foo
 FOO_DEV=$(rbd map foo)
 dd if=/dev/urandom of=$FOO_DEV bs=1M /dev/null
 rbd snap create foo@snap
 rbd snap protect foo@snap
 rbd clone foo@snap bar
 # rbd_resize calls librbd rbd_resize(), size is in bytes
 ./rbd_resize bar $(((4  20) + 512))
 rbd resize --size 10 bar
 BAR_DEV=$(rbd map bar)
 # trigger a 512-byte copyup -- 512-byte page array data item
 dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5
 
 The problem exists only in ceph_msg_data_pages_cursor_init(),
 ceph_msg_data_pages_advance() does the right thing.  The size_t cast is
 unnecessary.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  net/ceph/messenger.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index e51cad0db580..b2f571dd933d 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -901,7 +901,7 @@ static void ceph_msg_data_pages_cursor_init(struct 
 ceph_msg_data_cursor *cursor,
   BUG_ON(page_count  (int)USHRT_MAX);
   cursor-page_count = (unsigned short)page_count;
   BUG_ON(length  SIZE_MAX - cursor-page_offset);
 - cursor-last_piece = (size_t)cursor-page_offset + length = PAGE_SIZE;
 + cursor-last_piece = cursor-page_offset + cursor-resid = PAGE_SIZE;
  }
  
  static struct page *
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC

2014-08-07 Thread Alex Elder

On 08/07/2014 04:40 AM, Ilya Dryomov wrote:
 Now that rbd_img_request_create() is called from work functions, no
 need to use GFP_ATOMIC.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 4515b128d0b4..a5ebcf28e041 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2061,7 +2061,7 @@ static struct rbd_img_request *rbd_img_request_create(
  {
   struct rbd_img_request *img_request;
  
 - img_request = kmem_cache_alloc(rbd_img_request_cache, GFP_ATOMIC);
 + img_request = kmem_cache_alloc(rbd_img_request_cache, GFP_NOIO);
   if (!img_request)
   return NULL;
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: rework rbd_request_fn()

2014-08-06 Thread Alex Elder

On 08/05/2014 02:38 AM, Ilya Dryomov wrote:
 While it was never a good idea to sleep in request_fn(), commit
 34c6bc2c919a (locking/mutexes: Add extra reschedule point) made it
 a *bad* idea.  mutex_lock() since 3.15 may reschedule *before* putting
 task on the mutex wait queue, which for tasks in !TASK_RUNNING state
 means block forever.  request_fn() may be called with !TASK_RUNNING on
 the way to schedule() in io_schedule().
 
 Offload request handling to a workqueue, one per rbd device, to avoid
 calling blocking primitives from rbd_request_fn().

Off topic...  If you supply --patience to your git diff command
you'll get an easier-to-read result in some cases (like this one).
If you like that you can just do:
git config --global --add diff.algorithm patience

I have several comments below, none of which are bugs, just
suggestions.

I was thinking you could use a single work queue for all rbd
devices, but I'm not sure that's very important.  You'd have
to stash the rbd_dev pointer in every request somewhere.

Reviewed-by: Alex Elder el...@linaro.org


 Cc: sta...@vger.kernel.org # 3.15+
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |  195 
 +++
  1 file changed, 118 insertions(+), 77 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index e3412317d3f9..e07d9d71b187 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -42,6 +42,7 @@
  #include linux/blkdev.h
  #include linux/slab.h
  #include linux/idr.h
 +#include linux/workqueue.h
  
  #include rbd_types.h
  
 @@ -332,7 +333,10 @@ struct rbd_device {
  
   charname[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */
  
 + struct list_headrq_queue;   /* incoming rq queue */
   spinlock_t  lock;   /* queue, flags, open_count */
 + struct workqueue_struct *rq_wq;
 + struct work_struct  rq_work;
  
   struct rbd_image_header header;
   unsigned long   flags;  /* possibly lock protected */
 @@ -3176,102 +3180,129 @@ out:
   return ret;
  }
  
 -static void rbd_request_fn(struct request_queue *q)
 - __releases(q-queue_lock) __acquires(q-queue_lock)
 +static void rbd_handle_request(struct rbd_device *rbd_dev, struct request 
 *rq)
  {

. . .

 + if (offset + length  rbd_dev-mapping.size) {
 + rbd_warn(rbd_dev, beyond EOD (%llu~%llu  %llu), offset,
 +  length, rbd_dev-mapping.size);
 + result = -EIO;
 + goto err_rq;
 + }
  
 - spin_unlock_irq(q-queue_lock);
 + img_request = rbd_img_request_create(rbd_dev, offset, length, wr);

This isn't new with your patch, but perhaps we should consider
having rbd_img_request_create() use GFP_KERNEL or at least GFP_NOIO
rather than GFP_ATOMIC.  In general, I think a pass ought to be
made through all the allocation calls to make sure they have the
right flags.  Some may be too restrictive, some maybe not enough.

Anyway, a task for another day...

 + if (!img_request) {
 + result = -ENOMEM;
 + goto err_rq;
 + }
 + img_request-rq = rq;

 . . .

 +/*
 + * Called with q-queue_lock held and interrupts disabled, possibly on
 + * the way to schedule().  Do not sleep here!
 + */
 +static void rbd_request_fn(struct request_queue *q)
 +{
 + struct rbd_device *rbd_dev = q-queuedata;
 + struct request *rq;
 + int queued = 0;
 +
 + rbd_assert(rbd_dev);
 +
 + while ((rq = blk_fetch_request(q))) {
 + /* Ignore any non-FS requests that filter through. */
 + if (rq-cmd_type != REQ_TYPE_FS) {
 + dout(%s: non-fs request type %d\n, __func__,
 + (int) rq-cmd_type);
 + __blk_end_request_all(rq, 0);
 + continue;
   }
 +
 + list_add_tail(rq-queuelist, rbd_dev-rq_queue);
 + queued++;
if (!queued)
queued++;
Or alternately, make queued be a Boolean.  My point is
to guarantee there is no wrap (which is nearly but not quite
impossible).

   }
 +
 + if (queued)
 + queue_work(rbd_dev-rq_wq, rbd_dev-rq_work);
  }
  
  /*

. . .

 @@ -5067,6 +5105,8 @@ static int rbd_dev_device_setup(struct rbd_device 
 *rbd_dev)
  
   return ret;
  
 +err_out_workqueue:
 + destroy_workqueue(rbd_dev-rq_wq);

rbd_dev-rq_wq = NULL;

  err_out_mapping:
   rbd_dev_mapping_clear(rbd_dev);
  err_out_disk:
 @@ -5313,6 +5353,7 @@ static void rbd_dev_device_release(struct device *dev)
  {
   struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
  
 + destroy_workqueue(rbd_dev-rq_wq);
   rbd_free_disk(rbd_dev);
   clear_bit(RBD_DEV_FLAG_EXISTS, rbd_dev-flags);
   rbd_dev_mapping_clear(rbd_dev);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body

Re: [PATCH] rbd: rework rbd_request_fn()

2014-08-06 Thread Alex Elder

On 08/06/2014 02:09 PM, Ilya Dryomov wrote:
 Off topic...  If you supply --patience to your git diff command
  you'll get an easier-to-read result in some cases (like this one).
  If you like that you can just do:
  git config --global --add diff.algorithm patience
 
  I'm generally using (and used to review this patch in particular)
  --histogram.  In my experience it gives slightly better results than
  --patience.  Didn't think of putting it into .gitconfig though.
 Looks like --histogram beats --patience in this case too.
 alloc_workqueue() hunk is much better, otherwise identical.

I didn't know about histogram...  I guess I should update
my git config.

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 7/8] rbd: do not read in parent info before snap context

2014-07-25 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 Currently rbd_dev_v2_header_info() reads in parent info before the snap
 context is read in.  This is wrong, because we may need to look at the
 the parent_overlap value of the snapshot instead of that of the base

I had some trouble understanding this.

The parent image, snapshot id and overlap, are together a property
of a particular image (and associated with its header object).
Nothing about that is dependent on the child image snapshots
or snapshot id.

However...

 image, for example when mapping a snapshot - see next commit.  (When
 mapping a snapshot, all we got is its name and we need the snap context
 to translate that name into an id to know which parent info to look
 for).

I finally figured out what path through the code you
were talking about.  Here's what I see.

On the initial probe (previously), we have:
  rbd_add()
do_rbd_add()
  rbd_dev_image_probe()
   |rbd_dev_header_info()
   | |rbd_dev_v2_header_info()
   | | |rbd_dev_v2_parent_info()
   | | | -- expects rbd_dev-spec-snap_id to be valid
   | | |rbd_dev_v2_snap_context()
   | | | -- this fills in the snapshot context
   |rbd_spec_fill_snap_id()
   | | -- fills rbd_dev-spec-snap_id
   |rbd_dev_probe_parent()

So clearly, at least when mapping a snapshot, we would not
get the desired result.  We'll be unable to look up the id
for the named snapshot, so would get ENOENT.

Now you've pulled getting the parent info back out
into rbd_dev_image_probe():
  rbd_add()
do_rbd_add()
  rbd_dev_image_probe()
   |rbd_dev_header_info()
   | |rbd_dev_v2_header_info()
   | | |rbd_dev_v2_snap_context()
   | | | -- this fills in the snapshot context
   |rbd_spec_fill_snap_id()
   | | -- fills rbd_dev-spec-snap_id
   |rbd_dev_v2_parent_info()
   | | -- rbd_dev-spec-snap_id will be valid
   |rbd_dev_probe_parent()

In the refresh path it's similar.  You move the
rbd_dev_v2_parent_info() call into rbd_dev_refresh()
instead of it happening in rbd_dev_v2_header_info().
Missing the ordering problem here might have caused
even more subtle problems (due to using an apparently
valid but possibly out-of-date snapshot context).

Given this understanding I'd say your change looks
good.

 The approach taken here is to make sure rbd_dev_v2_parent_info() is
 called after the snap context has been read in.  The other approach
 would be to add a parent_overlap field to struct rbd_mapping and
 maintain it the same way rbd_mapping::size is maintained.  The reason
 I chose the first approach is that the value of keeping around both
 base image values and the actual mapping values is unclear to me.

I'll think about this and respond to your followup e-mail.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   64 
 ---
  1 file changed, 30 insertions(+), 34 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 16f388f2960b..c4606987e9d1 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -515,6 +515,7 @@ static void rbd_dev_remove_parent(struct rbd_device 
 *rbd_dev);
  static int rbd_dev_refresh(struct rbd_device *rbd_dev);
  static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
  static int rbd_dev_header_info(struct rbd_device *rbd_dev);
 +static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev);
  static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
   u64 snap_id);
  static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
 @@ -3513,6 +3514,18 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
   mapping_size = rbd_dev-mapping.size;
  
   ret = rbd_dev_header_info(rbd_dev);

This looked odd.  But I guess I didn't notice you
didn't check the return value when you first added
this line a few patches ago...

 + if (ret)
 + return ret;
 +
 + /*
 +  * If there is a parent, see if it has disappeared due to the
 +  * mapped image getting flattened.
 +  */
 + if (rbd_dev-parent) {
 + ret = rbd_dev_v2_parent_info(rbd_dev);
 + if (ret)
 + return ret;
 + }
  
   if (rbd_dev-spec-snap_id == CEPH_NOSNAP) {
   if (rbd_dev-mapping.size != rbd_dev-header.image_size)
 @@ -3524,11 +3537,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
  
   up_write(rbd_dev-header_rwsem);
  
 - if (mapping_size != rbd_dev-mapping.size) {
 + if (mapping_size != rbd_dev-mapping.size)
   rbd_dev_update_size(rbd_dev);
 - }
  
 - return ret;
 + return 0;
  }
  
  static int rbd_init_disk(struct rbd_device *rbd_dev)
 @@ -4477,33 +4489,6 @@ static int rbd_dev_v2_header_info(struct rbd_device 
 *rbd_dev)
   return ret

Re: [PATCH 7/8] rbd: do not read in parent info before snap context

2014-07-25 Thread Alex Elder

On 07/25/2014 03:36 AM, Ilya Dryomov wrote:
 
  + if (ret)
  + return ret;
 I'll move this bit to harden rbd_dev_refresh() commit.

Sounds good.-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo

2014-07-25 Thread Alex Elder

On 07/25/2014 04:26 AM, Ilya Dryomov wrote:
 struct ceph_xattr - struct ceph_inode_xattr

Looks good.  I can't find the definition of ceph_xattr.

Reviewed-by: Alex Elder el...@linaro.org

 Reported-by: Toralf Förster toralf.foers...@gmx.de
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  fs/ceph/xattr.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
 index f89698cdbc41..12f58d22e017 100644
 --- a/fs/ceph/xattr.c
 +++ b/fs/ceph/xattr.c
 @@ -592,7 +592,7 @@ start:
   xattr_version = ci-i_xattrs.version;
   spin_unlock(ci-i_ceph_lock);
  
 - xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *),
 + xattrs = kcalloc(numattr, sizeof(struct ceph_inode_xattr *),
GFP_NOFS);
   err = -ENOMEM;
   if (!xattrs)
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-25 Thread Alex Elder

On 07/24/2014 10:10 AM, Ilya Dryomov wrote:
 On Thu, Jul 24, 2014 at 5:46 PM, Ilya Dryomov ilya.dryo...@inktank.com 
 wrote:
 On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote:
 On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

 It took me a bit to follow this.  But basically there is
 no mapping unless it's mapped.  So previously this was
 updating the mapping information even for unmapped
 parent (or other) images.  There's no need to update
 the mapping size for a snapshot--it'll never change.

 Is that right?  If not, please advise; otherwise:

 No.  Previously it was updating the mapping size both on the inital map
 and on refresh (of the base image).  Doing that on the inital map
 doesn't make sense: not only struct rbd_mapping fields aren't properly
 initialized at that point - rbd_dev_mapping_set() is called much later
 in the map sequence, snap_id isn't initialized either (at least in the
 format 2 case, I haven't looked too closely at the format 1 case).  And
 just in general, trying to update something before it had a chance to
 change is bogus..
 
 BTW, while we are on the subject of struct rbd_mapping, is there any
 particular reason to keep around both the base image values and actual
 mapping values?  I am tempted to change the mapping sequence so that 1)
 snap context is read in immediately after watch setup, before anything
 else is done, 2) supplied snap name is resolved into an id and 3) the
 actual (i.e. based on snap_id) mapping size, features, parent_overlap,
 etc are read in directly into struct rbd_image_header.  That would
 allow to rip struct rbd_mapping entirely and make the code more clear.

The rbd_mapping structure started out well-intentioned but
over time it seems clear it hasn't added the value it was
intended to add.  Here's where it got created:
f84344f rbd: separate mapping info in rbd_dev
The only original field that survives is read_only.  There's
no harm at all in just moving the fields in that structure
out into the rbd_device structure.

Preserving the base image size in the header structure is
an artifact of the version 1 code, where it held the last
version of the on-disk header data.  The version 2 code
maintains it for consistency.

You could eliminate that I suppose.  I think the result
would require rbd_header_from_disk() to know about the
mapping rather than doing a simple conversion of data
from one form to another.

I say try it, and if you like the result I probably will too...

-Alex


 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo

2014-07-25 Thread Alex Elder

On 07/25/2014 08:27 AM, Ilya Dryomov wrote:
 On Fri, Jul 25, 2014 at 4:52 PM, Alex Elder el...@ieee.org wrote:
 On 07/25/2014 04:26 AM, Ilya Dryomov wrote:
 struct ceph_xattr - struct ceph_inode_xattr

 Looks good.  I can't find the definition of ceph_xattr.
 
 That's the point of the patch ;)

Well, I thought the point was that the type was
wrong; I was surprised that the type was non-existent.

-Alex

 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/8] rbd: show the entire chain of parent images

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 Make /sys/bus/rbd/devices/id/parent show the entire chain of parent
 images.  While at it, kernel sprintf() doesn't return negative values,
 casting to unsigned long long is no longer necessary and there is no
 good reason to split into multiple sprintf() calls.

I like the use of a single snprintf() call, it is much less
cumbersome.

The reason I always cast u64 values to (unsigned long long)
in printf() calls is because on some 64-bit architectures,
u64 is defined as simply (unsigned long), so without the
cast they spit out warnings.  I hate this, but it is reality
(see include/asm-generic/{int-l64.h,int-ll64.h}).  You can
alternatively use %PRIu64 rather than %llu in your format
strings, but I think I hate that more.  Anyway, if you want
to avoid warnings on all architectures you should fix that.

As another aside, I've been too timid to use the ?: conditional
expression without its middle operand.  I have no objection
to it at all, I like it.  I bring it up because I recently
got burned for using a gcc feature that wasn't supported
on older compilers (unnamed struct/union fields)--specifically
a version newer than gcc 3.2, which is the minimum supported
version for the kernel (see Documentation/Changes).

But fear not!  That extension is supported in gcc 3.2:

https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Conditionals.html#Conditionals
Just FYI...

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  Documentation/ABI/testing/sysfs-bus-rbd |4 +--
  drivers/block/rbd.c |   56 
 +--
  2 files changed, 25 insertions(+), 35 deletions(-)
 
 diff --git a/Documentation/ABI/testing/sysfs-bus-rbd 
 b/Documentation/ABI/testing/sysfs-bus-rbd
 index 501adc2a9ec7..2ddd680929d8 100644
 --- a/Documentation/ABI/testing/sysfs-bus-rbd
 +++ b/Documentation/ABI/testing/sysfs-bus-rbd
 @@ -94,5 +94,5 @@ current_snap
  
  parent
  
 - Information identifying the pool, image, and snapshot id for
 - the parent image in a layered rbd image (format 2 only).
 + Information identifying the chain of parent images in a layered rbd
 + image.  Entries are separated by empty lines.
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 703b728e05fa..7847fbb949ff 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3685,46 +3685,36 @@ static ssize_t rbd_snap_show(struct device *dev,
  }
  
  /*
 - * For an rbd v2 image, shows the pool id, image id, and snapshot id
 - * for the parent image.  If there is no parent, simply shows
 - * (no parent image).
 + * For a v2 image, shows the chain of parent images, separated by empty
 + * lines.  For v1 images or if there is no parent, shows (no parent
 + * image).
   */
  static ssize_t rbd_parent_show(struct device *dev,
 -  struct device_attribute *attr,
 -  char *buf)
 +struct device_attribute *attr,
 +char *buf)
  {
   struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
 - struct rbd_spec *spec = rbd_dev-parent_spec;
 - int count;
 - char *bufp = buf;
 + ssize_t count = 0;
  
 - if (!spec)
 + if (!rbd_dev-parent)
   return sprintf(buf, (no parent image)\n);
  
 - count = sprintf(bufp, pool_id %llu\npool_name %s\n,
 - (unsigned long long) spec-pool_id, spec-pool_name);
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - count = sprintf(bufp, image_id %s\nimage_name %s\n, spec-image_id,
 - spec-image_name ? spec-image_name : (unknown));
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - count = sprintf(bufp, snap_id %llu\nsnap_name %s\n,
 - (unsigned long long) spec-snap_id, spec-snap_name);
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - count = sprintf(bufp, overlap %llu\n, rbd_dev-parent_overlap);
 - if (count  0)
 - return count;
 - bufp += count;
 -
 - return (ssize_t) (bufp - buf);
 + for ( ; rbd_dev-parent; rbd_dev = rbd_dev-parent) {
 + struct rbd_spec *spec = rbd_dev-parent_spec;
 +
 + count += sprintf(buf[count], %s
 + pool_id %llu\npool_name %s\n
 + image_id %s\nimage_name %s\n
 + snap_id %llu\nsnap_name %s\n
 + overlap %llu\n,
 + !count ?  : \n, /* first? */
 + spec-pool_id, spec-pool_name,
 + spec-image_id, spec-image_name ?: (unknown),
 + spec-snap_id, spec-snap_name,
 + rbd_dev-parent_overlap);
 + }
 +
 + return count;
  }
  
  static ssize_t rbd_image_refresh(struct device *dev,
 

--
To unsubscribe from this list: send the line

Re: [PATCH 2/8] rbd: introduce rbd_dev_header_info()

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   24 ++--
  1 file changed, 14 insertions(+), 10 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 7847fbb949ff..0d3be608f16f 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -514,7 +514,7 @@ static void rbd_dev_remove_parent(struct rbd_device 
 *rbd_dev);
  
  static int rbd_dev_refresh(struct rbd_device *rbd_dev);
  static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev);
 -static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev);
 +static int rbd_dev_header_info(struct rbd_device *rbd_dev);
  static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev,
   u64 snap_id);
  static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id,
 @@ -3506,13 +3506,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
   u64 mapping_size;
   int ret;
  
 - rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
   down_write(rbd_dev-header_rwsem);
   mapping_size = rbd_dev-mapping.size;
 - if (rbd_dev-image_format == 1)
 - ret = rbd_dev_v1_header_info(rbd_dev);
 - else
 - ret = rbd_dev_v2_header_info(rbd_dev);
 +
 + ret = rbd_dev_header_info(rbd_dev);
  
   /* If it's a mapped snapshot, validate its EXISTS flag */
  
 @@ -4501,6 +4498,16 @@ static int rbd_dev_v2_header_info(struct rbd_device 
 *rbd_dev)
   return ret;
  }
  
 +static int rbd_dev_header_info(struct rbd_device *rbd_dev)
 +{
 + rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
 +
 + if (rbd_dev-image_format == 1)
 + return rbd_dev_v1_header_info(rbd_dev);
 +
 + return rbd_dev_v2_header_info(rbd_dev);
 +}
 +
  static int rbd_bus_add_dev(struct rbd_device *rbd_dev)
  {
   struct device *dev;
 @@ -5149,10 +5156,7 @@ static int rbd_dev_image_probe(struct rbd_device 
 *rbd_dev, bool mapping)
   goto out_header_name;
   }
  
 - if (rbd_dev-image_format == 1)
 - ret = rbd_dev_v1_header_info(rbd_dev);
 - else
 - ret = rbd_dev_v2_header_info(rbd_dev);
 + ret = rbd_dev_header_info(rbd_dev);
   if (ret)
   goto err_out_watch;
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/8] rbd: remove unnecessary asserts in rbd_dev_image_probe()

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 spec-image_id assert doesn't buy us much and image_format is asserted
 in rbd_dev_header_name() and rbd_dev_header_info() anyway.

Looks good.  Over time I'm sure there are more assertions
that can go away; they were most useful during rapid development
when confidence in certain things was lower.

Reviewed-by: Alex Elder el...@linaro.org


 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |2 --
  1 file changed, 2 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 0d3be608f16f..4541f6027e4a 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -5143,8 +5143,6 @@ static int rbd_dev_image_probe(struct rbd_device 
 *rbd_dev, bool mapping)
   ret = rbd_dev_image_id(rbd_dev);
   if (ret)
   return ret;
 - rbd_assert(rbd_dev-spec-image_id);
 - rbd_assert(rbd_image_format_valid(rbd_dev-image_format));
  
   ret = rbd_dev_header_name(rbd_dev);
   if (ret)
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/8] rbd: split rbd_dev_spec_update() into two functions

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 rbd_dev_spec_update() has two modes of operation, with nothing in
 common between them.  Split it into two functions, one for each mode
 and make our expectations more clear.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  drivers/block/rbd.c |   79 
 +++
  1 file changed, 48 insertions(+), 31 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 4541f6027e4a..23df1773ef77 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3798,6 +3798,9 @@ static struct rbd_spec *rbd_spec_alloc(void)
   spec = kzalloc(sizeof (*spec), GFP_KERNEL);
   if (!spec)
   return NULL;
 +
 + spec-pool_id = CEPH_NOPOOL;
 + spec-snap_id = CEPH_NOSNAP;
   kref_init(spec-kref);
  
   return spec;
 @@ -4257,18 +4260,38 @@ static u64 rbd_snap_id_by_name(struct rbd_device 
 *rbd_dev, const char *name)
  }
  
  /*
 - * When an rbd image has a parent image, it is identified by the
 - * pool, image, and snapshot ids (not names).  This function fills
 - * in the names for those ids.  (It's OK if we can't figure out the
 - * name for an image id, but the pool and snapshot ids should always
 - * exist and have names.)  All names in an rbd spec are dynamically
 - * allocated.
 + * An image being mapped will have everything but the snap id.
 + */
 +static int rbd_spec_fill_snap_id(struct rbd_device *rbd_dev)
 +{
 + struct rbd_spec *spec = rbd_dev-spec;
 +
 + rbd_assert(spec-pool_id != CEPH_NOPOOL  spec-pool_name);
 + rbd_assert(spec-image_id  spec-image_name);
 + rbd_assert(spec-snap_name);
 +
 + if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) {
 + u64 snap_id;
 +
 + snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name);
 + if (snap_id == CEPH_NOSNAP)
 + return -ENOENT;
 +
 + spec-snap_id = snap_id;
 + } else {
 + spec-snap_id = CEPH_NOSNAP;
 + }
 +
 + return 0;
 +}
 +
 +/*
 + * A parent image will have all ids but none of the names.
   *
 - * When an image being mapped (not a parent) is probed, we have the
 - * pool name and pool id, image name and image id, and the snapshot
 - * name.  The only thing we're missing is the snapshot id.
 + * All names in an rbd spec are dynamically allocated.  It's OK if we
 + * can't figure out the name for an image id.
   */
 -static int rbd_dev_spec_update(struct rbd_device *rbd_dev)
 +static int rbd_spec_fill_names(struct rbd_device *rbd_dev)
  {
   struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
   struct rbd_spec *spec = rbd_dev-spec;
 @@ -4277,24 +4300,9 @@ static int rbd_dev_spec_update(struct rbd_device 
 *rbd_dev)
   const char *snap_name;
   int ret;
  
 - /*
 -  * An image being mapped will have the pool name (etc.), but
 -  * we need to look up the snapshot id.
 -  */
 - if (spec-pool_name) {
 - if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) {
 - u64 snap_id;
 -
 - snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name);
 - if (snap_id == CEPH_NOSNAP)
 - return -ENOENT;
 - spec-snap_id = snap_id;
 - } else {
 - spec-snap_id = CEPH_NOSNAP;
 - }
 -
 - return 0;
 - }
 + rbd_assert(spec-pool_id != CEPH_NOPOOL);
 + rbd_assert(spec-image_id);
 + rbd_assert(spec-snap_id != CEPH_NOSNAP);
  
   /* Get the pool name; we have to make our own copy of this */
  
 @@ -4313,7 +4321,7 @@ static int rbd_dev_spec_update(struct rbd_device 
 *rbd_dev)
   if (!image_name)
   rbd_warn(rbd_dev, unable to get image name);
  
 - /* Look up the snapshot name, and make a copy */
 + /* Fetch the snapshot name */
  
   snap_name = rbd_snap_name(rbd_dev, spec-snap_id);
   if (IS_ERR(snap_name)) {
 @@ -4326,10 +4334,10 @@ static int rbd_dev_spec_update(struct rbd_device 
 *rbd_dev)
   spec-snap_name = snap_name;
  
   return 0;
 +
  out_err:
   kfree(image_name);
   kfree(pool_name);
 -
   return ret;
  }
  
 @@ -5158,7 +5166,16 @@ static int rbd_dev_image_probe(struct rbd_device 
 *rbd_dev, bool mapping)
   if (ret)
   goto err_out_watch;
  
 - ret = rbd_dev_spec_update(rbd_dev);
 + /*
 +  * If this image is the one being mapped, we have pool name and
 +  * id, image name and id, and snap name - need to fill snap id.
 +  * Otherwise this is a parent image, identified by pool, image
 +  * and snap ids - need to fill in names for those ids.
 +  */
 + if (mapping)
 + ret = rbd_spec_fill_snap_id(rbd_dev);
 + else
 + ret = rbd_spec_fill_names(rbd_dev);
   if (ret)
   goto err_out_probe

Re: [PATCH 5/8] rbd: harden rbd_dev_refresh() caller

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 Recently discovered watch/notify problems showed that we really can't
 ignore errors in anything refresh related.  Alas, currently there is
 not much we can do in response to those errors, except print warnings.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org


 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   17 +
  1 file changed, 13 insertions(+), 4 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 23df1773ef77..5dd6f5057ef4 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2963,11 +2963,20 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 
 opcode, void *data)
   dout(%s: \%s\ notify_id %llu opcode %u\n, __func__,
   rbd_dev-header_name, (unsigned long long)notify_id,
   (unsigned int)opcode);
 +
 + /*
 +  * Until adequate refresh error handling is in place, there is
 +  * not much we can do here, except warn.
 +  *
 +  * See http://tracker.ceph.com/issues/5040
 +  */
   ret = rbd_dev_refresh(rbd_dev);
   if (ret)
 - rbd_warn(rbd_dev, header refresh error (%d)\n, ret);
 + rbd_warn(rbd_dev, refresh failed: %d\n, ret);
  
 - rbd_obj_notify_ack_sync(rbd_dev, notify_id);
 + ret = rbd_obj_notify_ack_sync(rbd_dev, notify_id);
 + if (ret)
 + rbd_warn(rbd_dev, notify_ack ret %d\n, ret);
  }
  
  /*
 @@ -3724,9 +3733,9 @@ static ssize_t rbd_image_refresh(struct device *dev,
  
   ret = rbd_dev_refresh(rbd_dev);
   if (ret)
 - rbd_warn(rbd_dev, : manual header refresh error (%d)\n, ret);
 + return ret;
  
 - return ret  0 ? ret : size;
 + return size;
  }
  
  static DEVICE_ATTR(size, S_IRUGO, rbd_size_show, NULL);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

It took me a bit to follow this.  But basically there is
no mapping unless it's mapped.  So previously this was
updating the mapping information even for unmapped
parent (or other) images.  There's no need to update
the mapping size for a snapshot--it'll never change.

Is that right?  If not, please advise; otherwise:

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   19 +++
  1 file changed, 7 insertions(+), 12 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 5dd6f5057ef4..16f388f2960b 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -971,12 +971,6 @@ static int rbd_header_from_disk(struct rbd_device 
 *rbd_dev,
   header-snap_names = snap_names;
   header-snap_sizes = snap_sizes;
  
 - /* Make sure mapping size is consistent with header info */
 -
 - if (rbd_dev-spec-snap_id == CEPH_NOSNAP || first_time)
 - if (rbd_dev-mapping.size != header-image_size)
 - rbd_dev-mapping.size = header-image_size;
 -
   return 0;
  out_2big:
   ret = -EIO;
 @@ -3520,9 +3514,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
  
   ret = rbd_dev_header_info(rbd_dev);
  
 - /* If it's a mapped snapshot, validate its EXISTS flag */
 + if (rbd_dev-spec-snap_id == CEPH_NOSNAP) {
 + if (rbd_dev-mapping.size != rbd_dev-header.image_size)
 + rbd_dev-mapping.size = rbd_dev-header.image_size;
 + } else {
 + /* validate mapped snapshot's EXISTS flag */
 + rbd_exists_validate(rbd_dev);
 + }
  
 - rbd_exists_validate(rbd_dev);
   up_write(rbd_dev-header_rwsem);
  
   if (mapping_size != rbd_dev-mapping.size) {
 @@ -4505,10 +4504,6 @@ static int rbd_dev_v2_header_info(struct rbd_device 
 *rbd_dev)
   is EXPERIMENTAL!);
   }
  
 - if (rbd_dev-spec-snap_id == CEPH_NOSNAP)
 - if (rbd_dev-mapping.size != rbd_dev-header.image_size)
 - rbd_dev-mapping.size = rbd_dev-header.image_size;
 -
   ret = rbd_dev_v2_snap_context(rbd_dev);
   dout(rbd_dev_v2_snap_context returned %d\n, ret);
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 6/8] rbd: update mapping size only on refresh

2014-07-24 Thread Alex Elder

On 07/24/2014 08:46 AM, Ilya Dryomov wrote:
 On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote:
 On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 There is no sense in trying to update the mapping size before it's even
 been set.

 It took me a bit to follow this.  But basically there is
 no mapping unless it's mapped.  So previously this was
 updating the mapping information even for unmapped
 parent (or other) images.  There's no need to update
 the mapping size for a snapshot--it'll never change.

 Is that right?  If not, please advise; otherwise:
 
 No.  Previously it was updating the mapping size both on the inital map
 and on refresh (of the base image).  Doing that on the inital map
 doesn't make sense: not only struct rbd_mapping fields aren't properly
 initialized at that point - rbd_dev_mapping_set() is called much later
 in the map sequence, snap_id isn't initialized either (at least in the
 format 2 case, I haven't looked too closely at the format 1 case).  And
 just in general, trying to update something before it had a chance to
 change is bogus..

OK, now I see it.  Hopefully this makes sense.  Here is how it
was previously structured:

  rbd_add()
do_rbd_add()
 |rbd_dev_image_probe()
 |  rbd_dev_header_info()
 |rbd_dev_v1_header_info()orrbd_dev_v2_header_info()
 |  rbd_header_from_disk()update mapping
 |update mapping
 |  . . .
 |rbd_dev_device_setup()
 |  rbd_dev_mapping_set()

So neither rbd_header_from_disk() nor rbd_dev_v2_header_info()
should be using the values in the rbd_dev-mapping field during
initial image probe, because they have not yet been initialized
at that point.

Meanwhile, the only reason we need to *update* a mapping size
is if we learn it may have changed, which will be covered by a
refresh, so doing so in rbd_dev_refresh() makes sense.  And
as long as you know whether it's mapping the base image you
might as well do the rbd_exists_validate() call conditionally.

OK, this all looks good and makes good sense to me.
Thank you for explaining it.

Reviewed-by: Alex Elder el...@linaro.org


 
 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/8] rbd: take snap_id into account when reading in parent info

2014-07-24 Thread Alex Elder

On 07/24/2014 03:42 AM, Ilya Dryomov wrote:
 If we are mapping a snapshot, we must read in the parent_overlap value
 of that snapshot instead of that of the base image.  Not doing so may
 in particular result in us returning zeros instead of user data:
 
 # cat overlap-snap.sh
 #!/bin/bash
 rbd create --size 10 --image-format 2 foo
 FOO_DEV=$(rbd map foo)
 dd if=/dev/urandom of=/dev/rbd0 bs=1M /dev/null
 echo Base image
 dd if=$FOO_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
 rbd snap create foo@snap
 rbd snap protect foo@snap
 rbd clone foo@snap bar
 rbd snap create bar@snap
 BAR_DEV=$(rbd map bar@snap)
 echo Snapshot
 dd if=$BAR_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
 rbd resize --allow-shrink --size 4 bar
 echo Snapshot after base image resize
 dd if=$BAR_DEV bs=1 count=16 skip=$(((4  20) - 8)) 2/dev/null | xxd
 
 # ./overlap-snap.sh
 Base image
 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K%`4(E..6.
 Snapshot
 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed  ...;.K%`4(E..6.
 Resizing image: 100% complete...done.
 Snapshot after base image resize
 000: e781 e33b d34b 2225      ...;.K%
 
 Even though bar@snap was taken with the old bar parent_overlap (8M),
 reads from bar@snap beyond the new bar parent_overlap (4M) return
 zeroes.  Fix it.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index c4606987e9d1..cbc89fa9a677 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4020,7 +4020,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
   goto out_err;
   }
  
 - snapid = cpu_to_le64(CEPH_NOSNAP);
 + snapid = cpu_to_le64(rbd_dev-spec-snap_id);

Well that's just an outright bug.  It's been there since the
original commit that added parent support:
86b00e0 rbd: get parent spec for version 2 images
Parent images *must* be snapshots, so this was never
right.

I bet that was hard to figure out...

Looks good.

Reviewed-by: Alex Elder el...@linaro.org


   ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name,
   rbd, get_parent,
   snapid, sizeof (snapid),
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted

2014-07-08 Thread Alex Elder

On 07/08/2014 06:18 AM, Ilya Dryomov wrote:
  The only question that leaves me with is, does
  ceph_osdc_cancel_request() need to include the
  call to complete_request() that's present in
  ceph_osdc_wait_request()?
 I don't think so - I mentioned it in the ceph_osdc_cancel_request()
 function comment.  ceph_osdc_cancel_request() is supposed to be used by
 higher layers - rbd, cephfs - and exactly because their completion
 logic is decoupled from libceph completions (as you have brilliantly
 explained above) it's the higher layers who should be taking care of
 it.  IOW higher layers are in charge and are supposed to know what and
 when they are cancelling.

I noticed that comment only after sending my message.

RBD doesn't use the safe completion, only the FS client
does, and I was pretty focused on RBD behavior while
looking at this.  I was trying to conceptualize how
(from the perspective of the upper layer) the safe
completion differs from the normal completion.

It's possible that an I have your request (normal
completion) *also* carries with it the your request
has completed (safe completion) indication, but
the higher layer caller has no way of knowing that.

Maybe I should flip my question around, and ask, why
should the ceph_osdc_cancel_request() include the call
to complete_request()?

The answer lies in details of the file system client,
and I'm not in a position right now to dive into that.
Whether it's called in ceph_osdc_cancel_request() or
not has no effect on RBD.

Anyway, your response is fine with me, thank you.

-Alex
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()

2014-07-08 Thread Alex Elder

On 07/08/2014 06:15 AM, Ilya Dryomov wrote:
 Are you OK with your Reviewed-by for this patch?

Reviewed-by: Alex Elder el...@linaro.org

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()

2014-07-07 Thread Alex Elder

On 06/30/2014 09:34 AM, Ilya Dryomov wrote:
 On Mon, Jun 30, 2014 at 5:39 PM, Alex Elder el...@ieee.org wrote:
 On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Introduce ceph_osdc_cancel_request() intended for canceling requests
 from the higher layers (rbd and cephfs).  Because higher layers are in
 charge and are supposed to know what and when they are canceling, the
 request is not completed, only unref'ed and removed from the libceph
 data structures.

 This seems reasonable.

 But you make two changes here that I'd like to see a little
 more explanation on.  They seem significant enough to warrant
 a little more attention in the commit description.

 - __cancel_request() is no longer called when
   ceph_osdc_wait_request() fails due to an
   an interrupt.  This is my main concern, and I
   plan to work through it but I'm in a small hurry
   right now.
 
 Perhaps it should have been a separate commit.  __unregister_request()
 revokes r_request, so I opted for not trying to do it twice.  As for

OK, that makes sense--revoking the request is basically all
__cancel_request() does anyway.  You ought to have mentioned
that in the description anyway, even if it wasn't a separate
commit.

 the r_sent condition and assignment, it doesn't seem to make much of
 a difference, given that the request is about to be unregistered
 anyway.

If the request is getting canceled (from a caller outside libceph)
it can't take into account whether it was sent or not.  As you said,
it is getting revoked unconditionally by __unregister_request().
To be honest though, *that* revoke call should have been zeroing
the r_sent value also.  It apparently won't matter, but I think it's
wrong.  The revoke drops a reference, it doesn't necessarily free
the structure (though I expect it may always do so anyway).

 - __unregister_linger_request() is now called when
   a request is canceled, but it wasn't before.  (Since
   we're consistent about setting the r_linger flag
   this should be fine, but it *is* a behavior change.)
 
 The general goal here is to make ceph_osdc_cancel_request() cancel
 *any* request correctly, so if r_linger is set, which means that the
 request in question *could* be lingering, __unregister_linger_request()
 is called.

The goal is good.  Note that __unregister_linger_request() drops
a reference to the request though.  There are three other callers
of this function.  Two of them drop a reference that had just been
added by a call to __register_request().  The other one is in
ceph_osdc_unregister_linger_request(), initiated by a higher layer.
In that last case, r_linger will be zeroed, so calling it again
should be harmless.

I guess I'm going to move on...  I expect all of this discussion
will be moot and you will have made everything work right and
better by the time I get to the end of the series.

-Alex

 
 Thanks,
 
 Ilya
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted

2014-07-07 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 rbd_obj_request_wait() should cancel the underlying OSD request if
 interrupted.  Otherwise libceph will hold onto it indefinitely, causing
 assert failures or leaking the original object request.

At first I didn't understand this.  Let me see if I've got
it now, though.

Each OSD request has a completion associated with it.  An
OSD request is started via ceph_osdc_start_request(), which
registers the request and takes a reference to it.  One can
call ceph_osdc_wait_request() after the request has been
successfully started.  Whether the wait succeeds or not,
by the time ceph_osdc_wait_request() returns the request
should have been cleaned up, back to the state it was
in before the start_request call.  That means the request
needs to be unregistered and its reference dropped, etc.

Similarly, each RBD object request has a completion associated
with it.  It is distinct from the OSD request associated
with the RBD object request because there may be more to do
for RBD request to complete than just complete one object
request.  An RBD object request is started by a call to
rbd_obj_request_submit(), and once that's successfully
returned, one can wait for it to complete using a call to
rbd_obj_request_wait().  And as above, that call should
return state to (roughly) where it was before the submit
call, whether the wait request succeeded or not.

Now, RBD doesn't actually wait for its object requests
to complete--all its OSD requests complete asynchronously.
Instead, it relies on the OSD client to call the callback
function (always rbd_osd_req_callback()) when it has
completed.  That function will lead to the RBD request's
completion being signaled when appropriate.

So...  What happens when an interrupt occurs after
rbd_obj_request_submit() has returned successfully?  That
function is a simple wrapper for ceph_osdc_start_request(),
so a successful return means the request was mapped and
put on a target's unsent list (or the OSD client's no
target list).  Nobody waits for the OSD request, so an
interrupt won't get the benefit of the cleanup done in
ceph_osdc_wait_request().  Therefore the RBD layer needs
to be responsible for doing that.

In other words, when rbd_obj_request_wait() gets an
interrupt while waiting for the completion, it needs
to clean up (end) the interrupted request, and
rbd_obj_request_end() sounds right.  And what that
cleanup function should do is basically the same
as what ceph_osdc_wait_request() should do in that
situation, which is call ceph_osdc_cancel_request().

The only question that leaves me with is, does
ceph_osdc_cancel_request() need to include the
call to complete_request() that's present in
ceph_osdc_wait_request()?

I'll leave it at that.  I think I've convinced myself
this is a good change.  So I await your confirmation
that I understand it right, and your answer to my
question above.  But in any case you can consider this:

Reviewed-by: Alex Elder el...@linaro.org

Sorry it's taking me so long to get through these.



 
 This also adds an rbd wrapper around ceph_osdc_cancel_request() to
 match rbd_obj_request_submit() and rbd_obj_request_wait().
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   39 ---
  1 file changed, 28 insertions(+), 11 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index b2c98c1bc037..20147aec86f3 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -1527,11 +1527,37 @@ static bool obj_request_type_valid(enum 
 obj_request_type type)
  static int rbd_obj_request_submit(struct ceph_osd_client *osdc,
   struct rbd_obj_request *obj_request)
  {
 - dout(%s: osdc %p obj %p\n, __func__, osdc, obj_request);
 -
 + dout(%s %p\n, __func__, obj_request);
   return ceph_osdc_start_request(osdc, obj_request-osd_req, false);
  }
  
 +static void rbd_obj_request_end(struct rbd_obj_request *obj_request)
 +{
 + dout(%s %p\n, __func__, obj_request);
 + ceph_osdc_cancel_request(obj_request-osd_req);
 +}
 +
 +/*
 + * Wait for an object request to complete.  If interrupted, cancel the
 + * underlying osd request.
 + */
 +static int rbd_obj_request_wait(struct rbd_obj_request *obj_request)
 +{
 + int ret;
 +
 + dout(%s %p\n, __func__, obj_request);
 +
 + ret = wait_for_completion_interruptible(obj_request-completion);
 + if (ret  0) {
 + dout(%s %p interrupted\n, __func__, obj_request);
 + rbd_obj_request_end(obj_request);
 + return ret;
 + }
 +
 + dout(%s %p done\n, __func__, obj_request);
 + return 0;
 +}
 +
  static void rbd_img_request_complete(struct rbd_img_request *img_request)
  {
  
 @@ -1558,15 +1584,6 @@ static void rbd_img_request_complete(struct 
 rbd_img_request *img_request)
   rbd_img_request_put(img_request);
  }
  
 -/* Caller is responsible for rbd_obj_request_destroy(obj_request

Re: [PATCH 13/14] libceph: nuke ceph_osdc_unregister_linger_request()

2014-07-07 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Remove now unused ceph_osdc_unregister_linger_request().
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  include/linux/ceph/osd_client.h |2 --
  net/ceph/osd_client.c   |   12 
  2 files changed, 14 deletions(-)
 
 diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
 index de09cad7b7c7..03aeb27fcc69 100644
 --- a/include/linux/ceph/osd_client.h
 +++ b/include/linux/ceph/osd_client.h
 @@ -325,8 +325,6 @@ extern struct ceph_osd_request 
 *ceph_osdc_new_request(struct ceph_osd_client *,
  
  extern void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
struct ceph_osd_request *req);
 -extern void ceph_osdc_unregister_linger_request(struct ceph_osd_client *osdc,
 - struct ceph_osd_request *req);
  
  extern void ceph_osdc_get_request(struct ceph_osd_request *req);
  extern void ceph_osdc_put_request(struct ceph_osd_request *req);
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index ef44cc0bbc6a..30f6faf3584f 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1281,18 +1281,6 @@ static void __unregister_linger_request(struct 
 ceph_osd_client *osdc,
   ceph_osdc_put_request(req);
  }
  
 -void ceph_osdc_unregister_linger_request(struct ceph_osd_client *osdc,
 -  struct ceph_osd_request *req)
 -{
 - mutex_lock(osdc-request_mutex);
 - if (req-r_linger) {
 - req-r_linger = 0;
 - __unregister_linger_request(osdc, req);
 - }
 - mutex_unlock(osdc-request_mutex);
 -}
 -EXPORT_SYMBOL(ceph_osdc_unregister_linger_request);
 -
  void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc,
 struct ceph_osd_request *req)
  {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 12/14] rbd: use rbd_obj_watch_request_helper() helper

2014-07-07 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix
 rbd_dev_header_unwatch_sync() to destroy watch_request structures
 before queuing watch-remove message while at it.  This mistake slipped
 into commit b30a01f2a307 (rbd: fix osd_request memory leak in
 __rbd_dev_header_watch_sync()) and could lead to image still in use
 errors on image removal.

You can see why--even if they were entangled--both setting up
and taking down the watch request were done in the same function.
So much of what they do is common, just slightly different.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |  115 
 ---
  1 file changed, 17 insertions(+), 98 deletions(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 02cf7aba7679..d99aa81774f8 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -3040,130 +3040,49 @@ static int rbd_dev_header_watch_sync(struct 
 rbd_device *rbd_dev)
   if (ret  0)
   return ret;
  
 - rbd_assert(rbd_dev-watch_event);
 -
 - obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0,
 -  OBJ_REQUEST_NODATA);
 - if (!obj_request) {
 - ret = -ENOMEM;
 - goto out_cancel;
 + obj_request = rbd_obj_watch_request_helper(rbd_dev, true);
 + if (IS_ERR(obj_request)) {
 + ceph_osdc_cancel_event(rbd_dev-watch_event);
 + rbd_dev-watch_event = NULL;
 + return PTR_ERR(obj_request);
   }
  
 - obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, 1,
 -   obj_request);
 - if (!obj_request-osd_req) {
 - ret = -ENOMEM;
 - goto out_put;
 - }
 -
 - ceph_osdc_set_request_linger(osdc, obj_request-osd_req);
 -
 - osd_req_op_watch_init(obj_request-osd_req, 0, CEPH_OSD_OP_WATCH,
 -   rbd_dev-watch_event-cookie, 0, 1);
 - rbd_osd_req_format_write(obj_request);
 -
 - ret = rbd_obj_request_submit(osdc, obj_request);
 - if (ret)
 - goto out_linger;
 -
 - ret = rbd_obj_request_wait(obj_request);
 - if (ret)
 - goto out_linger;
 -
 - ret = obj_request-result;
 - if (ret)
 - goto out_linger;
 -
   /*
* A watch request is set to linger, so the underlying osd
* request won't go away until we unregister it.  We retain
* a pointer to the object request during that time (in
 -  * rbd_dev-watch_request), so we'll keep a reference to
 -  * it.  We'll drop that reference (below) after we've
 -  * unregistered it.
 +  * rbd_dev-watch_request), so we'll keep a reference to it.
 +  * We'll drop that reference after we've unregistered it in
 +  * rbd_dev_header_unwatch_sync().
*/
   rbd_dev-watch_request = obj_request;
  
   return 0;
 -
 -out_linger:
 - ceph_osdc_unregister_linger_request(osdc, obj_request-osd_req);
 -out_put:
 - rbd_obj_request_put(obj_request);
 -out_cancel:
 - ceph_osdc_cancel_event(rbd_dev-watch_event);
 - rbd_dev-watch_event = NULL;
 -
 - return ret;
  }
  
  /*
   * Tear down a watch request, synchronously.
   */
 -static int __rbd_dev_header_unwatch_sync(struct rbd_device *rbd_dev)
 +static void rbd_dev_header_unwatch_sync(struct rbd_device *rbd_dev)
  {
 - struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
   struct rbd_obj_request *obj_request;
 - int ret;
  
   rbd_assert(rbd_dev-watch_event);
   rbd_assert(rbd_dev-watch_request);
  
 - obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0,
 -  OBJ_REQUEST_NODATA);
 - if (!obj_request) {
 - ret = -ENOMEM;
 - goto out_cancel;
 - }
 -
 - obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, 1,
 -   obj_request);
 - if (!obj_request-osd_req) {
 - ret = -ENOMEM;
 - goto out_put;
 - }
 -
 - osd_req_op_watch_init(obj_request-osd_req, 0, CEPH_OSD_OP_WATCH,
 -   rbd_dev-watch_event-cookie, 0, 0);
 - rbd_osd_req_format_write(obj_request);
 -
 - ret = rbd_obj_request_submit(osdc, obj_request);
 - if (ret)
 - goto out_put;
 -
 - ret = rbd_obj_request_wait(obj_request);
 - if (ret)
 - goto out_put;
 -
 - ret = obj_request-result;
 - if (ret)
 - goto out_put;
 -
 - /* We have successfully torn down the watch request */
 -
 - ceph_osdc_unregister_linger_request(osdc,
 - rbd_dev-watch_request-osd_req);
 + rbd_obj_request_end(rbd_dev-watch_request);
   rbd_obj_request_put(rbd_dev-watch_request);
   rbd_dev-watch_request

Re: [PATCH 11/14] rbd: add rbd_obj_watch_request_helper() helper

2014-07-07 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 In the past, rbd_dev_header_watch_sync() used to handle both watch and
 unwatch requests and was entangled and leaky.  Commit b30a01f2a307
 (rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync())
 split it into two separate functions.  This commit cleanly abstracts
 the common bits, relying on the fixed rbd_obj_request_wait().

Adding this without calling it leads to an unused function
warning in the build, I'm sure.

You could probably squash this into the next patch.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   53 
 +++
  1 file changed, 53 insertions(+)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 20147aec86f3..02cf7aba7679 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -2971,6 +2971,59 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 
 opcode, void *data)
  }
  
  /*
 + * Send a (un)watch request and wait for the ack.  Return a request
 + * with a ref held on success or error.
 + */
 +static struct rbd_obj_request *rbd_obj_watch_request_helper(
 + struct rbd_device *rbd_dev,
 + bool watch)
 +{
 + struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc;
 + struct rbd_obj_request *obj_request;
 + int ret;
 +
 + obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0,
 +  OBJ_REQUEST_NODATA);
 + if (!obj_request)
 + return ERR_PTR(-ENOMEM);
 +
 + obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, 1,
 +   obj_request);
 + if (!obj_request-osd_req) {
 + ret = -ENOMEM;
 + goto out;
 + }
 +
 + osd_req_op_watch_init(obj_request-osd_req, 0, CEPH_OSD_OP_WATCH,
 +   rbd_dev-watch_event-cookie, 0, watch);
 + rbd_osd_req_format_write(obj_request);
 +
 + if (watch)
 + ceph_osdc_set_request_linger(osdc, obj_request-osd_req);
 +
 + ret = rbd_obj_request_submit(osdc, obj_request);
 + if (ret)
 + goto out;
 +
 + ret = rbd_obj_request_wait(obj_request);
 + if (ret)
 + goto out;
 +
 + ret = obj_request-result;
 + if (ret) {
 + if (watch)
 + rbd_obj_request_end(obj_request);
 + goto out;
 + }
 +
 + return obj_request;
 +
 +out:
 + rbd_obj_request_put(obj_request);
 + return ERR_PTR(ret);
 +}
 +
 +/*
   * Initiate a watch request, synchronously.
   */
  static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev)
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 14/14] libceph: drop osd ref when canceling con work

2014-07-07 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 queue_con() bumps osd ref count.  We should do the reverse when
 canceling con work.

Kind of unrelated to the rest of the series, but it looks
good.  Good to have a same-level-of-abstraction function
for it as well.

Reviewed-by: Alex Elder el...@linaro.org
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  net/ceph/messenger.c |   14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 8bffa5b90fef..e51cad0db580 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -174,6 +174,7 @@ static struct lock_class_key socket_class;
  #define SKIP_BUF_SIZE1024
  
  static void queue_con(struct ceph_connection *con);
 +static void cancel_con(struct ceph_connection *con);
  static void con_work(struct work_struct *);
  static void con_fault(struct ceph_connection *con);
  
 @@ -680,7 +681,7 @@ void ceph_con_close(struct ceph_connection *con)
  
   reset_connection(con);
   con-peer_global_seq = 0;
 - cancel_delayed_work(con-work);
 + cancel_con(con);
   con_close_socket(con);
   mutex_unlock(con-mutex);
  }
 @@ -2667,19 +2668,16 @@ static int queue_con_delay(struct ceph_connection 
 *con, unsigned long delay)
  {
   if (!con-ops-get(con)) {
   dout(%s %p ref count 0\n, __func__, con);
 -
   return -ENOENT;
   }
  
   if (!queue_delayed_work(ceph_msgr_wq, con-work, delay)) {
   dout(%s %p - already queued\n, __func__, con);
   con-ops-put(con);
 -
   return -EBUSY;
   }
  
   dout(%s %p %lu\n, __func__, con, delay);
 -
   return 0;
  }
  
 @@ -2688,6 +2686,14 @@ static void queue_con(struct ceph_connection *con)
   (void) queue_con_delay(con, 0);
  }
  
 +static void cancel_con(struct ceph_connection *con)
 +{
 + if (cancel_delayed_work(con-work)) {
 + dout(%s %p\n, __func__, con);
 + con-ops-put(con);
 + }
 +}
 +
  static bool con_sock_closed(struct ceph_connection *con)
  {
   if (!con_flag_test_and_clear(con, CON_FLAG_SOCK_CLOSED))
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: do not leak image_id in rbd_dev_v2_parent_info()

2014-07-07 Thread Alex Elder

On 06/30/2014 04:45 AM, Ilya Dryomov wrote:
 image_id is leaked if the parent happens to have been recorded already.
 Fix it.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index d99aa81774f8..adedb393b374 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4072,6 +4072,8 @@ static int rbd_dev_v2_parent_info(struct rbd_device 
 *rbd_dev)
   parent_spec-snap_id = snap_id;
   rbd_dev-parent_spec = parent_spec;
   parent_spec = NULL; /* rbd_dev now owns this */
 + } else {
 + kfree(image_id);
   }
  
   /*
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: handle parent_overlap on writes correctly

2014-06-30 Thread Alex Elder

On 06/11/2014 11:40 AM, Ilya Dryomov wrote:
 The following check in rbd_img_obj_request_submit()
 
 rbd_dev-parent_overlap = obj_request-img_offset
 
 allows the fall through to the non-layered write case even if both
 parent_overlap and obj_request-img_offset belong to the same RADOS
 object.  This leads to data corruption, because the area to the left of
 parent_overlap ends up unconditionally zero-filled instead of being
 populated with parent data.  Suppose we want to write 1M to offset 6M
 of image bar, which is a clone of foo@snap; object_size is 4M,
 parent_overlap is 5M:
 
 rbd_data.id.0001
  -|--|
 | should be copyup'ed | should be zeroed out | write ...
  -|--|
4M5M 6M
 parent_overlapobj_request-img_offset
 
 4..5M should be copyup'ed from foo, yet it is zero-filled, just like
 5..6M is.
 
 Given that the only striping mode kernel client currently supports is
 chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
 parent_overlap up to the next object boundary for the purposes of the
 overlap check.

I know Josh already reviewed this, but I wanted to make sure
I understood it.  Now I do.

Previously the first layered write request starting at the
should be zeroed out region would be treated as a normal
write.  Consequently the child object would get created, but
only the affected portion would get written, meaning the
should be copyup'ed region in the child object would be
zero filled.  Later reads to the should be copyup'ed would
see that child object existed, so the parent data will never
get filled into the should be copyup'ed area as it must be.

Rounding up like you did for this test makes sense--it makes
the granularity of this overlap test match the granularity
of the exists test.

Reviewed-by: Alex Elder el...@linaro.org

 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |   10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 8295b3afa8e0..813e673d49df 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -1366,6 +1366,14 @@ static bool obj_request_exists_test(struct 
 rbd_obj_request *obj_request)
   return test_bit(OBJ_REQ_EXISTS, obj_request-flags) != 0;
  }
  
 +static bool obj_request_overlaps_parent(struct rbd_obj_request *obj_request)
 +{
 + struct rbd_device *rbd_dev = obj_request-img_request-rbd_dev;
 +
 + return obj_request-img_offset 
 + round_up(rbd_dev-parent_overlap, rbd_obj_bytes(rbd_dev-header));
 +}
 +
  static void rbd_obj_request_get(struct rbd_obj_request *obj_request)
  {
   dout(%s: obj %p (was %d)\n, __func__, obj_request,
 @@ -2683,7 +2691,7 @@ static int rbd_img_obj_request_submit(struct 
 rbd_obj_request *obj_request)
*/
   if (!img_request_write_test(img_request) ||
   !img_request_layered_test(img_request) ||
 - rbd_dev-parent_overlap = obj_request-img_offset ||
 + !obj_request_overlaps_parent(obj_request) ||
   ((known = obj_request_known_test(obj_request)) 
   obj_request_exists_test(obj_request))) {
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 01/14] libceph: rename ceph_osd_request::r_linger_osd to r_linger_osd_item

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 So that:
 
 req-r_osd_item -- osd-o_requests list
 req-r_linger_osd_item -- osd-o_linger_requests list
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

This looks good and I prefer it too.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  include/linux/ceph/osd_client.h |2 +-
  net/ceph/osd_client.c   |8 
  2 files changed, 5 insertions(+), 5 deletions(-)
 
 diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
 index 94ec69672164..7490a03ac163 100644
 --- a/include/linux/ceph/osd_client.h
 +++ b/include/linux/ceph/osd_client.h
 @@ -117,7 +117,7 @@ struct ceph_osd_request {
   struct list_head r_req_lru_item;
   struct list_head r_osd_item;
   struct list_head r_linger_item;
 - struct list_head r_linger_osd;
 + struct list_head r_linger_osd_item;
   struct ceph_osd *r_osd;
   struct ceph_pg   r_pgid;
   int  r_pg_osds[CEPH_PG_MAX_SIZE];
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 05be0c181695..d5d2be3bd113 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -364,7 +364,7 @@ struct ceph_osd_request *ceph_osdc_alloc_request(struct 
 ceph_osd_client *osdc,
   RB_CLEAR_NODE(req-r_node);
   INIT_LIST_HEAD(req-r_unsafe_item);
   INIT_LIST_HEAD(req-r_linger_item);
 - INIT_LIST_HEAD(req-r_linger_osd);
 + INIT_LIST_HEAD(req-r_linger_osd_item);
   INIT_LIST_HEAD(req-r_req_lru_item);
   INIT_LIST_HEAD(req-r_osd_item);
  
 @@ -916,7 +916,7 @@ static void __kick_osd_requests(struct ceph_osd_client 
 *osdc,
* list at the end to keep things in tid order.
*/
   list_for_each_entry_safe(req, nreq, osd-o_linger_requests,
 -  r_linger_osd) {
 +  r_linger_osd_item) {
   /*
* reregister request prior to unregistering linger so
* that r_osd is preserved.
 @@ -1218,7 +1218,7 @@ static void __register_linger_request(struct 
 ceph_osd_client *osdc,
   ceph_osdc_get_request(req);
   list_add_tail(req-r_linger_item, osdc-req_linger);
   if (req-r_osd)
 - list_add_tail(req-r_linger_osd,
 + list_add_tail(req-r_linger_osd_item,
 req-r_osd-o_linger_requests);
  }
  
 @@ -1228,7 +1228,7 @@ static void __unregister_linger_request(struct 
 ceph_osd_client *osdc,
   dout(__unregister_linger_request %p\n, req);
   list_del_init(req-r_linger_item);
   if (req-r_osd) {
 - list_del_init(req-r_linger_osd);
 + list_del_init(req-r_linger_osd_item);
  
   if (list_empty(req-r_osd-o_requests) 
   list_empty(req-r_osd-o_linger_requests)) {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/14] libceph: add maybe_move_osd_to_lru() and switch to it

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Abstract out __move_osd_to_lru() logic from __unregister_request() and
 __unregister_linger_request().
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  net/ceph/osd_client.c |   26 ++
  1 file changed, 14 insertions(+), 12 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index d5d2be3bd113..6202923b41ff 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1029,12 +1029,23 @@ static void remove_all_osds(struct ceph_osd_client 
 *osdc)
  static void __move_osd_to_lru(struct ceph_osd_client *osdc,
 struct ceph_osd *osd)
  {
 - dout(__move_osd_to_lru %p\n, osd);
 + dout(%s %p\n, __func__, osd);
   BUG_ON(!list_empty(osd-o_osd_lru));
 +
   list_add_tail(osd-o_osd_lru, osdc-osd_lru);
   osd-lru_ttl = jiffies + osdc-client-options-osd_idle_ttl * HZ;
  }
  
 +static void maybe_move_osd_to_lru(struct ceph_osd_client *osdc,
 +   struct ceph_osd *osd)
 +{
 + dout(%s %p\n, __func__, osd);
 +
 + if (list_empty(osd-o_requests) 
 + list_empty(osd-o_linger_requests))
 + __move_osd_to_lru(osdc, osd);
 +}
 +
  static void __remove_osd_from_lru(struct ceph_osd *osd)
  {
   dout(__remove_osd_from_lru %p\n, osd);
 @@ -1182,11 +1193,7 @@ static void __unregister_request(struct 
 ceph_osd_client *osdc,
   ceph_msg_revoke(req-r_request);
  
   list_del_init(req-r_osd_item);
 - if (list_empty(req-r_osd-o_requests) 
 - list_empty(req-r_osd-o_linger_requests)) {
 - dout(moving osd to %p lru\n, req-r_osd);
 - __move_osd_to_lru(osdc, req-r_osd);
 - }
 + maybe_move_osd_to_lru(osdc, req-r_osd);
   if (list_empty(req-r_linger_item))
   req-r_osd = NULL;
   }
 @@ -1229,12 +1236,7 @@ static void __unregister_linger_request(struct 
 ceph_osd_client *osdc,
   list_del_init(req-r_linger_item);
   if (req-r_osd) {
   list_del_init(req-r_linger_osd_item);
 -
 - if (list_empty(req-r_osd-o_requests) 
 - list_empty(req-r_osd-o_linger_requests)) {
 - dout(moving osd to %p lru\n, req-r_osd);
 - __move_osd_to_lru(osdc, req-r_osd);
 - }
 + maybe_move_osd_to_lru(osdc, req-r_osd);
   if (list_empty(req-r_osd_item))
   req-r_osd = NULL;
   }
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 03/14] libceph: move and add dout()s to ceph_msg_{get,put}()

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Add dout()s to ceph_msg_{get,put}().  Also move them to .c and turn
 kref release callback into a static function.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

This is all very good.

I have one suggestion though, below, but regardless:

Reviewed-by: Alex Elder el...@linaro.org


 ---
  include/linux/ceph/messenger.h |   14 ++
  net/ceph/messenger.c   |   31 ++-
  2 files changed, 24 insertions(+), 21 deletions(-)
 
 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index d21f2dba0731..40ae58e3e9db 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -285,19 +285,9 @@ extern void ceph_msg_data_add_bio(struct ceph_msg *msg, 
 struct bio *bio,
  
  extern struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags,
bool can_fail);
 -extern void ceph_msg_kfree(struct ceph_msg *m);
  
 -
 -static inline struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
 -{
 - kref_get(msg-kref);
 - return msg;
 -}
 -extern void ceph_msg_last_put(struct kref *kref);
 -static inline void ceph_msg_put(struct ceph_msg *msg)
 -{
 - kref_put(msg-kref, ceph_msg_last_put);
 -}
 +extern struct ceph_msg *ceph_msg_get(struct ceph_msg *msg);
 +extern void ceph_msg_put(struct ceph_msg *msg);
  
  extern void ceph_msg_dump(struct ceph_msg *msg);
  
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 1948d592aa54..8bffa5b90fef 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -3269,24 +3269,21 @@ static int ceph_con_in_msg_alloc(struct 
 ceph_connection *con, int *skip)
  /*
   * Free a generically kmalloc'd message.
   */
 -void ceph_msg_kfree(struct ceph_msg *m)
 +static void ceph_msg_free(struct ceph_msg *m)
  {
 - dout(msg_kfree %p\n, m);
 + dout(%s %p\n, __func__, m);
   ceph_kvfree(m-front.iov_base);
   kmem_cache_free(ceph_msg_cache, m);
  }
  
 -/*
 - * Drop a msg ref.  Destroy as needed.
 - */
 -void ceph_msg_last_put(struct kref *kref)
 +static void ceph_msg_release(struct kref *kref)
  {
   struct ceph_msg *m = container_of(kref, struct ceph_msg, kref);
   LIST_HEAD(data);
   struct list_head *links;
   struct list_head *next;
  
 - dout(ceph_msg_put last one on %p\n, m);
 + dout(%s %p\n, __func__, m);
   WARN_ON(!list_empty(m-list_head));
  
   /* drop middle, data, if any */
 @@ -3308,9 +3305,25 @@ void ceph_msg_last_put(struct kref *kref)
   if (m-pool)
   ceph_msgpool_put(m-pool, m);
   else
 - ceph_msg_kfree(m);
 + ceph_msg_free(m);
 +}
 +
 +struct ceph_msg *ceph_msg_get(struct ceph_msg *msg)
 +{
 + dout(%s %p (was %d)\n, __func__, msg,
 +  atomic_read(msg-kref.refcount));
 + kref_get(msg-kref);

I suggest you do the dout() *after* you call kref_get().
I'm sure it doesn't matter in practice here, but my
reasoning is that you get the reference immediately, and
you'll have the reference when reading the refcount value.
It also  makes the dout() calls here and in ceph_msg_put()
end up getting symmetrically wrapped by the get kref
get and put.  (You have a race reading the updated
refcount value either way, but it's debug code.)

 + return msg;
 +}
 +EXPORT_SYMBOL(ceph_msg_get);
 +
 +void ceph_msg_put(struct ceph_msg *msg)
 +{
 + dout(%s %p (was %d)\n, __func__, msg,
 +  atomic_read(msg-kref.refcount));
 + kref_put(msg-kref, ceph_msg_release);
  }
 -EXPORT_SYMBOL(ceph_msg_last_put);
 +EXPORT_SYMBOL(ceph_msg_put);
  
  void ceph_msg_dump(struct ceph_msg *msg)
  {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 04/14] libceph: move and add dout()s to ceph_osdc_request_{get,put}()

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Add dout()s to ceph_osdc_request_{get,put}().  Also move them to .c and
 turn kref release callback into a static function.

You can pretty much take the identical comments from
what I said on [PATCH 03/14].

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  include/linux/ceph/osd_client.h |   11 ++-
  net/ceph/osd_client.c   |   26 ++
  2 files changed, 24 insertions(+), 13 deletions(-)
 
 diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
 index 7490a03ac163..a8d5652f589d 100644
 --- a/include/linux/ceph/osd_client.h
 +++ b/include/linux/ceph/osd_client.h
 @@ -328,15 +328,8 @@ extern void ceph_osdc_set_request_linger(struct 
 ceph_osd_client *osdc,
  extern void ceph_osdc_unregister_linger_request(struct ceph_osd_client *osdc,
   struct ceph_osd_request *req);
  
 -static inline void ceph_osdc_get_request(struct ceph_osd_request *req)
 -{
 - kref_get(req-r_kref);
 -}
 -extern void ceph_osdc_release_request(struct kref *kref);
 -static inline void ceph_osdc_put_request(struct ceph_osd_request *req)
 -{
 - kref_put(req-r_kref, ceph_osdc_release_request);
 -}
 +extern void ceph_osdc_get_request(struct ceph_osd_request *req);
 +extern void ceph_osdc_put_request(struct ceph_osd_request *req);
  
  extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
  struct ceph_osd_request *req,
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 6202923b41ff..7406046212dc 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -297,12 +297,15 @@ static void osd_req_op_data_release(struct 
 ceph_osd_request *osd_req,
  /*
   * requests
   */
 -void ceph_osdc_release_request(struct kref *kref)
 +static void ceph_osdc_release_request(struct kref *kref)
  {
 - struct ceph_osd_request *req;
 + struct ceph_osd_request *req = container_of(kref,
 + struct ceph_osd_request, r_kref);
   unsigned int which;
  
 - req = container_of(kref, struct ceph_osd_request, r_kref);
 + dout(%s %p (r_request %p r_reply %p)\n, __func__, req,
 +  req-r_request, req-r_reply);
 +
   if (req-r_request)
   ceph_msg_put(req-r_request);
   if (req-r_reply) {
 @@ -320,7 +323,22 @@ void ceph_osdc_release_request(struct kref *kref)
   kmem_cache_free(ceph_osd_request_cache, req);
  
  }
 -EXPORT_SYMBOL(ceph_osdc_release_request);
 +
 +void ceph_osdc_get_request(struct ceph_osd_request *req)
 +{
 + dout(%s %p (was %d)\n, __func__, req,
 +  atomic_read(req-r_kref.refcount));
 + kref_get(req-r_kref);
 +}
 +EXPORT_SYMBOL(ceph_osdc_get_request);
 +
 +void ceph_osdc_put_request(struct ceph_osd_request *req)
 +{
 + dout(%s %p (was %d)\n, __func__, req,
 +  atomic_read(req-r_kref.refcount));
 + kref_put(req-r_kref, ceph_osdc_release_request);
 +}
 +EXPORT_SYMBOL(ceph_osdc_put_request);
  
  struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client 
 *osdc,
  struct ceph_snap_context *snapc,
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 06/14] libceph: assert both regular and lingering lists in __remove_osd()

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 It is important that both regular and lingering requests lists are
 empty when the OSD is removed.

Looks good.

Reviewed-by: Alex Elder el...@linaro.org

 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  net/ceph/osd_client.c |2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 052eb8bfcc74..a9b7ea7bfdc6 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1032,6 +1032,8 @@ static void __remove_osd(struct ceph_osd_client *osdc, 
 struct ceph_osd *osd)
  {
   dout(__remove_osd %p\n, osd);
   BUG_ON(!list_empty(osd-o_requests));
 + BUG_ON(!list_empty(osd-o_linger_requests));
 +
   rb_erase(osd-o_node, osdc-osds);
   list_del_init(osd-o_osd_lru);
   ceph_con_close(osd-o_con);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 05/14] libceph: harden ceph_osdc_request_release() a bit

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Add some WARN_ONs to alert us when we try to destroy requests that are
 still registered.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

Good idea.  Especially the RB_CLEAR_NODE() call.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  net/ceph/osd_client.c |7 +++
  1 file changed, 7 insertions(+)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 7406046212dc..052eb8bfcc74 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -305,6 +305,12 @@ static void ceph_osdc_release_request(struct kref *kref)
  
   dout(%s %p (r_request %p r_reply %p)\n, __func__, req,
req-r_request, req-r_reply);
 + WARN_ON(!RB_EMPTY_NODE(req-r_node));
 + WARN_ON(!list_empty(req-r_req_lru_item));
 + WARN_ON(!list_empty(req-r_osd_item));
 + WARN_ON(!list_empty(req-r_linger_item));
 + WARN_ON(!list_empty(req-r_linger_osd_item));
 + WARN_ON(req-r_osd);
  
   if (req-r_request)
   ceph_msg_put(req-r_request);
 @@ -1204,6 +1210,7 @@ static void __unregister_request(struct ceph_osd_client 
 *osdc,
  
   dout(__unregister_request %p tid %lld\n, req, req-r_tid);
   rb_erase(req-r_node, osdc-requests);
 + RB_CLEAR_NODE(req-r_node);
   osdc-num_requests--;
  
   if (req-r_osd) {
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 07/14] libceph: unregister only registered linger requests

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Linger requests that have not yet been registered should not be
 unregistered by __unregister_linger_request().  This messes up ref
 count and leads to use-after-free.

This makes sense.  The problem can occur when updating the OSD
map.  An OSD *client* has its list of linger requests, but they
are not all necessarily registered as associated with the *OSD*.
So the __unregister_linger_request() call in kick_requests()
might pass a not-yet-registered linger request.

It could also occur if a client (like RBD) gets an error after
setting a request to linger but the request has completed
successfully.

Anyway, looks good.  This explains why the rename of the
r_linger_osd_item field was helpful.

Reviewed-by: Alex Elder el...@linaro.org

 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  net/ceph/osd_client.c |   15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index a9b7ea7bfdc6..12ec553a7e76 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1248,7 +1248,9 @@ static void __cancel_request(struct ceph_osd_request 
 *req)
  static void __register_linger_request(struct ceph_osd_client *osdc,
   struct ceph_osd_request *req)
  {
 - dout(__register_linger_request %p\n, req);
 + dout(%s %p tid %llu\n, __func__, req, req-r_tid);
 + WARN_ON(!req-r_linger);
 +
   ceph_osdc_get_request(req);
   list_add_tail(req-r_linger_item, osdc-req_linger);
   if (req-r_osd)
 @@ -1259,8 +1261,17 @@ static void __register_linger_request(struct 
 ceph_osd_client *osdc,
  static void __unregister_linger_request(struct ceph_osd_client *osdc,
   struct ceph_osd_request *req)
  {
 - dout(__unregister_linger_request %p\n, req);
 + WARN_ON(!req-r_linger);
 +
 + if (list_empty(req-r_linger_item)) {
 + dout(%s %p tid %llu not registered\n, __func__, req,
 +  req-r_tid);
 + return;
 + }
 +
 + dout(%s %p tid %llu\n, __func__, req, req-r_tid);
   list_del_init(req-r_linger_item);
 +
   if (req-r_osd) {
   list_del_init(req-r_linger_osd_item);
   maybe_move_osd_to_lru(osdc, req-r_osd);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 08/14] libceph: fix linger request check in __unregister_request()

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 We should check if request is on the linger request list of any of the
 OSDs, not whether request is registered or not.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com

That was a difficult to spot bug.  Very good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  net/ceph/osd_client.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 12ec553a7e76..8104db24bc09 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1221,7 +1221,7 @@ static void __unregister_request(struct ceph_osd_client 
 *osdc,
  
   list_del_init(req-r_osd_item);
   maybe_move_osd_to_lru(osdc, req-r_osd);
 - if (list_empty(req-r_linger_item))
 + if (list_empty(req-r_linger_osd_item))
   req-r_osd = NULL;
   }
  
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Introduce ceph_osdc_cancel_request() intended for canceling requests
 from the higher layers (rbd and cephfs).  Because higher layers are in
 charge and are supposed to know what and when they are canceling, the
 request is not completed, only unref'ed and removed from the libceph
 data structures.

This seems reasonable.

But you make two changes here that I'd like to see a little
more explanation on.  They seem significant enough to warrant
a little more attention in the commit description.

- __cancel_request() is no longer called when
  ceph_osdc_wait_request() fails due to an
  an interrupt.  This is my main concern, and I
  plan to work through it but I'm in a small hurry
  right now.
- __unregister_linger_request() is now called when
  a request is canceled, but it wasn't before.  (Since
  we're consistent about setting the r_linger flag
  this should be fine, but it *is* a behavior change.)

I'm going to keep looking at this; I have about
20 minutes before I have to leave for a while.

-Alex

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  include/linux/ceph/osd_client.h |1 +
  net/ceph/osd_client.c   |   31 +--
  2 files changed, 26 insertions(+), 6 deletions(-)
 
 diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
 index a8d5652f589d..de09cad7b7c7 100644
 --- a/include/linux/ceph/osd_client.h
 +++ b/include/linux/ceph/osd_client.h
 @@ -334,6 +334,7 @@ extern void ceph_osdc_put_request(struct ceph_osd_request 
 *req);
  extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
  struct ceph_osd_request *req,
  bool nofail);
 +extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
  extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 struct ceph_osd_request *req);
  extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index 8104db24bc09..ef44cc0bbc6a 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -2470,6 +2470,25 @@ int ceph_osdc_start_request(struct ceph_osd_client 
 *osdc,
  EXPORT_SYMBOL(ceph_osdc_start_request);
  
  /*
 + * Unregister a registered request.  The request is not completed (i.e.
 + * no callbacks or wakeups) - higher layers are supposed to know what
 + * they are canceling.
 + */
 +void ceph_osdc_cancel_request(struct ceph_osd_request *req)
 +{
 + struct ceph_osd_client *osdc = req-r_osdc;
 +
 + mutex_lock(osdc-request_mutex);
 + if (req-r_linger)
 + __unregister_linger_request(osdc, req);
 + __unregister_request(osdc, req);
 + mutex_unlock(osdc-request_mutex);
 +
 + dout(%s %p tid %llu canceled\n, __func__, req, req-r_tid);
 +}
 +EXPORT_SYMBOL(ceph_osdc_cancel_request);
 +
 +/*
   * wait for a request to complete
   */
  int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 @@ -2477,18 +2496,18 @@ int ceph_osdc_wait_request(struct ceph_osd_client 
 *osdc,
  {
   int rc;
  
 + dout(%s %p tid %llu\n, __func__, req, req-r_tid);
 +
   rc = wait_for_completion_interruptible(req-r_completion);
   if (rc  0) {
 - mutex_lock(osdc-request_mutex);
 - __cancel_request(req);
 - __unregister_request(osdc, req);
 - mutex_unlock(osdc-request_mutex);
 + dout(%s %p tid %llu interrupted\n, __func__, req, req-r_tid);
 + ceph_osdc_cancel_request(req);
   complete_request(req);
 - dout(wait_request tid %llu canceled/timed out\n, req-r_tid);
   return rc;
   }
  
 - dout(wait_request tid %llu result %d\n, req-r_tid, req-r_result);
 + dout(%s %p tid %llu result %d\n, __func__, req, req-r_tid,
 +  req-r_result);
   return req-r_result;
  }
  EXPORT_SYMBOL(ceph_osdc_wait_request);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 07/14] libceph: unregister only registered linger requests

2014-06-30 Thread Alex Elder

On 06/25/2014 12:16 PM, Ilya Dryomov wrote:
 Linger requests that have not yet been registered should not be
 unregistered by __unregister_linger_request().  This messes up ref
 count and leads to use-after-free.
 
 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  net/ceph/osd_client.c |   15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)
 
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index a9b7ea7bfdc6..12ec553a7e76 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -1248,7 +1248,9 @@ static void __cancel_request(struct ceph_osd_request 
 *req)
  static void __register_linger_request(struct ceph_osd_client *osdc,
   struct ceph_osd_request *req)
  {
 - dout(__register_linger_request %p\n, req);
 + dout(%s %p tid %llu\n, __func__, req, req-r_tid);
 + WARN_ON(!req-r_linger);
 +
   ceph_osdc_get_request(req);
   list_add_tail(req-r_linger_item, osdc-req_linger);
   if (req-r_osd)
 @@ -1259,8 +1261,17 @@ static void __register_linger_request(struct 
 ceph_osd_client *osdc,
  static void __unregister_linger_request(struct ceph_osd_client *osdc,
   struct ceph_osd_request *req)
  {
 - dout(__unregister_linger_request %p\n, req);
 + WARN_ON(!req-r_linger);


I just noticed something.  ceph_osdc_unregister_linger_request()
clears req-r_linger before calling __unregister_linger_request(),
which means this warning must be tripping a lot...

Just delete that assignment in ceph_osdc_unregister_linger_request()
as part of this commit.

-Alex


 +
 + if (list_empty(req-r_linger_item)) {
 + dout(%s %p tid %llu not registered\n, __func__, req,
 +  req-r_tid);
 + return;
 + }
 +
 + dout(%s %p tid %llu\n, __func__, req, req-r_tid);
   list_del_init(req-r_linger_item);
 +
   if (req-r_osd) {
   list_del_init(req-r_linger_osd_item);
   maybe_move_osd_to_lru(osdc, req-r_osd);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] rbd: fix ida/idr memory leak

2014-06-04 Thread Alex Elder

On 06/04/2014 11:38 AM, Ilya Dryomov wrote:
 ida_destroy() needs to be called on module exit to release ida caches.

Looks good to me.

Reviewed-by: Alex Elder el...@linaro.org

 Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
 ---
  drivers/block/rbd.c |1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 94747afc2e78..ecdd4aee173c 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -5464,6 +5464,7 @@ err_out_slab:
  
  static void __exit rbd_exit(void)
  {
 + ida_destroy(rbd_dev_id_ida);
   rbd_sysfs_cleanup();
   if (single_major)
   unregister_blkdev(rbd_major, RBD_DRV_NAME);
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ceph: refactor readpage_nounlock() to make the logic clearer

2014-05-28 Thread Alex Elder

On 05/28/2014 02:09 AM, Zhang Zhen wrote:
 If the return value of ceph_osdc_readpages() is not negative,
 it is certainly greater than or equal to zero.
 
 Remove the useless condition judgment and redundant braces.
 
 Signed-off-by: Zhang Zhen zhenzhang.zh...@huawei.com

This looks good.

Reviewed-by: Alex Elder el...@linaro.org

 ---
  fs/ceph/addr.c | 17 +++--
  1 file changed, 7 insertions(+), 10 deletions(-)
 
 diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
 index b53278c..6aa2e3f 100644
 --- a/fs/ceph/addr.c
 +++ b/fs/ceph/addr.c
 @@ -211,18 +211,15 @@ static int readpage_nounlock(struct file *filp, struct 
 page *page)
   SetPageError(page);
   ceph_fscache_readpage_cancel(inode, page);
   goto out;
 - } else {
 - if (err  PAGE_CACHE_SIZE) {
 - /* zero fill remainder of page */
 - zero_user_segment(page, err, PAGE_CACHE_SIZE);
 - } else {
 - flush_dcache_page(page);
 - }
   }
 - SetPageUptodate(page);
 + if (err  PAGE_CACHE_SIZE)
 + /* zero fill remainder of page */
 + zero_user_segment(page, err, PAGE_CACHE_SIZE);
 + else
 + flush_dcache_page(page);
 
 - if (err = 0)
 - ceph_readpage_to_fscache(inode, page);
 + SetPageUptodate(page);
 + ceph_readpage_to_fscache(inode, page);
 
  out:
   return err  0 ? err : 0;
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: crash report: paging request errors in various krbd contexts

2014-05-16 Thread Alex Elder

On 05/16/2014 01:00 AM, Hannes Landeholm wrote:
 A production server just locked up and had to be hard rebooted. It had
 these various rbd related crash signatures in it's system log within
 the same 10 second interval:

I'll try to provide a quick summary of what's likely
happened in each of these.

The bottom line is that it appears that some
memory used by rbd and/or libceph has become
corrupted, or there is something (or more than
one thing) that is being used after it's been
freed.  Either way this sort of thing will be
difficult to try to understand; it would be
great if it could be reproduced independently.

 kernel: BUG: unable to handle kernel paging request at 
 kernel: IP: [812832dd] strnlen+0xd/0x40
 kernel: PGD 180f067 PUD 1811067 PMD 0
 kernel: Oops:  [#1] PREEMPT SMP
 kernel: Modules linked in: xt_recent xt_limit veth ipt_MASQUERADE cbc
 ipt_REJECT xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4
 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp
 iptable_mangle ip_tables x_tables coretemp hwmon x86_pkg_temp_thermal
 crct10dif_pclmul crct10di
 f_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode evdev
 pcspkr xen_netback xen_blkback xen_gntalloc xenfs xen_gntdev
 xen_evtchn rbd libceph crc32c libcrc32c ext4 crc16 mbcache jbd2
 ixgbevf xen_privcmd xen_pcifront xen_netfront xen_kbdfro
 nt xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops
 xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio
 ipmi_poweroff ipmi_msghandler
 kernel: CPU: 3 PID: 2887 Comm: mysqld Not tainted 3.14.1-1-js #1
 kernel: task: 880342ce1d70 ti: 88033e17e000 task.ti: 88033e17e000
 kernel: RIP: e030:[812832dd]  [812832dd] strnlen+0xd/0x40

We're calling strnlen() (ultimately) from snprintf().  The
format provided will be %s.%012llx (or similar).  The
string provided for the %s is rbd_dev-header.object_prefix,
which is a dynamically allocated string initialized once
for the rbd device, which will be NUL-terminated and
unchanging until the device gets mapped.

Either the rbd device got unmapped while still
in use, or the memory holding this rbd_dev structure
got corrupted somehow.

 kernel: RSP: e02b:88033e17fa78  EFLAGS: 00010286
 kernel: RAX: 816feed6 RBX: 8800ffb60068 RCX: fffe
 kernel: RDX:  RSI:  RDI: 
 kernel: RBP: 88033e17fa78 R08:  R09: 
 kernel: R10: 756d696e612d736a R11: 2f30622d637a762f R12: 
 kernel: R13: 8800ffb600cd R14:  R15: 
 kernel: FS:  7f98dec7f700() GS:8803b5d8()
 knlGS:
 kernel: CS:  e033 DS:  ES:  CR0: 8005003b
 kernel: CR2:  CR3: 000335dcc000 CR4: 2660
 kernel: Stack:
 kernel:  88033e17fab0 8128562b 8800ffb60068 8800ffb600cd
 kernel:  88033e17fb28 a0196869 a0196869 88033e17fb18
 kernel:  81286ac1 880389525960 814de243 
 kernel: Call Trace:
 kernel:  [8128562b] string.isra.6+0x3b/0xf0
 kernel:  [81286ac1] vsnprintf+0x1c1/0x610
 kernel:  [814de243] ? _raw_spin_unlock_irq+0x13/0x30
 kernel:  [81286fd9] snprintf+0x39/0x40
 kernel:  [a01911e0] ? rbd_img_request_fill+0x100/0x6d0 [rbd]
 kernel:  [a019122a] rbd_img_request_fill+0x14a/0x6d0 [rbd]
 kernel:  [a018f4d5] ? rbd_img_request_create+0x155/0x220 [rbd]
 kernel:  [8125cab9] ? blk_add_timer+0x19/0x20
 kernel:  [a0194a1d] rbd_request_fn+0x1ed/0x330 [rbd]
 kernel:  [81252f13] __blk_run_queue+0x33/0x40
 kernel:  [8125411e] queue_unplugged+0x2e/0xd0
 kernel:  [81256cf0] blk_flush_plug_list+0x1f0/0x230
 kernel:  [812570a4] blk_finish_plug+0x14/0x40
 kernel:  [a00b9d6e] ext4_writepages+0x48e/0xd50 [ext4]
 kernel:  [81136aae] ? generic_file_aio_write+0x5e/0xe0
 kernel:  [811417ae] do_writepages+0x1e/0x40
 kernel:  [811363d9] __filemap_fdatawrite_range+0x59/0x60
 kernel:  [811364da] filemap_write_and_wait_range+0x2a/0x70
 kernel:  [a00b149a] ext4_sync_file+0xba/0x360 [ext4]
 kernel:  [811d50ce] do_fsync+0x4e/0x80
 kernel:  [811d5373] SyS_fdatasync+0x13/0x20
 kernel:  [814e66e9] system_call_fastpath+0x16/0x1b
 kernel: Code: c0 01 80 38 00 75 f7 48 29 f8 5d c3 31 c0 5d c3 66 66 66
 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74
 2a 80 3f 00 74 25 48 89 f8 31 d2 eb 10 0f 1f 80 00 00 00 00 48 83
 kernel: RIP  [812832dd] strnlen+0xd/0x40
 kernel:  RSP 88033e17fa78
 kernel: CR2: 
 kernel: ---[ end trace 83a2fd2a9969b20d ]---
 
 kernel: BUG: unable to handle kernel paging request at 177b
 kernel: IP: [814dd33c] down_read+0xc/0x20

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1396 matches

Mail list logo