Re: [PATCH v2] Net: ceph: messenger: Use local variable cursor in read_partial_msg_data()
On 10/18/2015 09:49 PM, Shraddha Barke wrote: > Use local variable cursor in place of >cursor in > read_partial_msg_data() > > Signed-off-by: Shraddha BarkeThis is a pretty minor comment, but the "Net" in your subject line is probably better *not* capitalized. -Alex > --- > Changes in v2- > Drop incorrect use of cursor > > net/ceph/messenger.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c > index b9b0e3b..b087edd 100644 > --- a/net/ceph/messenger.c > +++ b/net/ceph/messenger.c > @@ -2246,7 +2246,7 @@ static int read_partial_msg_data(struct ceph_connection > *con) > if (do_datacrc) > crc = con->in_data_crc; > while (cursor->resid) { > - page = ceph_msg_data_next(>cursor, _offset, , > + page = ceph_msg_data_next(cursor, _offset, , > NULL); > ret = ceph_tcp_recvpage(con->sock, page, page_offset, length); > if (ret <= 0) { > @@ -2258,7 +2258,7 @@ static int read_partial_msg_data(struct ceph_connection > *con) > > if (do_datacrc) > crc = ceph_crc32c_page(crc, page, page_offset, ret); > - (void) ceph_msg_data_advance(>cursor, (size_t)ret); > + (void) ceph_msg_data_advance(cursor, (size_t)ret); > } > if (do_datacrc) > con->in_data_crc = crc; > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fwd: Re: [PATCH] rbd: set max_sectors explicitly
Forgot to "Reply-all". -Alex Forwarded Message Subject: Re: [PATCH] rbd: set max_sectors explicitly Date: Fri, 16 Oct 2015 07:22:55 -0500 From: Alex Elder <el...@ieee.org> To: Ilya Dryomov <idryo...@gmail.com> On 10/07/2015 12:00 PM, Ilya Dryomov wrote: > Commit 30e2bc08b2bb ("Revert "block: remove artifical max_hw_sectors > cap"") restored a clamp on max_sectors. It's now 2560 sectors instead > of 1024, but it's not good enough: we set max_hw_sectors to rbd object > size because we don't want object sized I/Os to be split, and the > default object size is 4M. > > So, set max_sectors to max_hw_sectors in rbd at queue init time. > > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > drivers/block/rbd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 05072464d25e..04e69b4df664 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -3760,6 +3760,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) > /* set io sizes to object size */ > segment_size = rbd_obj_bytes(_dev->header); > blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE); > + q->limits.max_sectors = queue_max_hw_sectors(q); This currently is done by default by blk_queue_max_hw_sectors(). Do you see any different behavior with this patch in place? This change seems at least harmless so if it's not too late: Reviewed-by: Alex Elder <el...@linaro.org> > blk_queue_max_segments(q, segment_size / SECTOR_SIZE); > blk_queue_max_segment_size(q, segment_size); > blk_queue_io_min(q, segment_size); > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: use writefull op for object size writes
On 10/07/2015 12:02 PM, Ilya Dryomov wrote: > This covers only the simplest case - an object size sized write, but > it's still useful in tiering setups when EC is used for the base tier > as writefull op can be proxied, saving an object promotion. > > Even though updating ceph_osdc_new_request() to allow writefull should > just be a matter of fixing an assert, I didn't do it because its only > user is cephfs. All other sites were updated. > > Reflects ceph.git commit 7bfb7f9025a8ee0d2305f49bf0336d2424da5b5b. > > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> Looks good to me. Reviewed-by: Alex Elder <el...@linaro.org> > --- > drivers/block/rbd.c | 9 +++-- > net/ceph/osd_client.c | 13 + > 2 files changed, 16 insertions(+), 6 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 04e69b4df664..cd00e4653e49 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -1863,9 +1863,11 @@ static void rbd_osd_req_callback(struct > ceph_osd_request *osd_req, > rbd_osd_read_callback(obj_request); > break; > case CEPH_OSD_OP_SETALLOCHINT: > - rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE); > + rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE || > +osd_req->r_ops[1].op == CEPH_OSD_OP_WRITEFULL); > /* fall through */ > case CEPH_OSD_OP_WRITE: > + case CEPH_OSD_OP_WRITEFULL: > rbd_osd_write_callback(obj_request); > break; > case CEPH_OSD_OP_STAT: > @@ -2401,7 +2403,10 @@ static void rbd_img_obj_request_fill(struct > rbd_obj_request *obj_request, > opcode = CEPH_OSD_OP_ZERO; > } > } else if (op_type == OBJ_OP_WRITE) { > - opcode = CEPH_OSD_OP_WRITE; > + if (!offset && length == object_size) > + opcode = CEPH_OSD_OP_WRITEFULL; > + else > + opcode = CEPH_OSD_OP_WRITE; > osd_req_op_alloc_hint_init(osd_request, num_ops, > object_size, object_size); > num_ops++; > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > index 80b94e37c94a..f79ccac6699f 100644 > --- a/net/ceph/osd_client.c > +++ b/net/ceph/osd_client.c > @@ -285,6 +285,7 @@ static void osd_req_op_data_release(struct > ceph_osd_request *osd_req, > switch (op->op) { > case CEPH_OSD_OP_READ: > case CEPH_OSD_OP_WRITE: > + case CEPH_OSD_OP_WRITEFULL: > ceph_osd_data_release(>extent.osd_data); > break; > case CEPH_OSD_OP_CALL: > @@ -485,13 +486,14 @@ void osd_req_op_extent_init(struct ceph_osd_request > *osd_req, > size_t payload_len = 0; > > BUG_ON(opcode != CEPH_OSD_OP_READ && opcode != CEPH_OSD_OP_WRITE && > -opcode != CEPH_OSD_OP_ZERO && opcode != CEPH_OSD_OP_TRUNCATE); > +opcode != CEPH_OSD_OP_WRITEFULL && opcode != CEPH_OSD_OP_ZERO && > +opcode != CEPH_OSD_OP_TRUNCATE); > > op->extent.offset = offset; > op->extent.length = length; > op->extent.truncate_size = truncate_size; > op->extent.truncate_seq = truncate_seq; > - if (opcode == CEPH_OSD_OP_WRITE) > + if (opcode == CEPH_OSD_OP_WRITE || opcode == CEPH_OSD_OP_WRITEFULL) > payload_len += length; > > op->payload_len = payload_len; > @@ -670,9 +672,11 @@ static u64 osd_req_encode_op(struct ceph_osd_request > *req, > break; > case CEPH_OSD_OP_READ: > case CEPH_OSD_OP_WRITE: > + case CEPH_OSD_OP_WRITEFULL: > case CEPH_OSD_OP_ZERO: > case CEPH_OSD_OP_TRUNCATE: > - if (src->op == CEPH_OSD_OP_WRITE) > + if (src->op == CEPH_OSD_OP_WRITE || > + src->op == CEPH_OSD_OP_WRITEFULL) > request_data_len = src->extent.length; > dst->extent.offset = cpu_to_le64(src->extent.offset); > dst->extent.length = cpu_to_le64(src->extent.length); > @@ -681,7 +685,8 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req, > dst->extent.truncate_seq = > cpu_to_le32(src->extent.truncate_seq); > osd_data = >extent.osd_data; > - if (src->op == CEPH_OSD_OP_WRITE) > + if (src->op == CEPH_OSD_OP_WRITE || > + src->op == CEPH_OSD_OP_WRITEFULL) > ceph_osdc_msg_data_add(req->r_request, osd_data); > else > ceph_osdc_msg_data_add(req->r_reply, osd_data); > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On 10/16/2015 04:50 AM, Ilya Dryomov wrote: > On Thu, Oct 15, 2015 at 11:10 PM, Alex Elder <el...@ieee.org> wrote: >> On 10/11/2015 01:03 PM, Ilya Dryomov wrote: >>> Currently we leak parent_spec and trigger a "parent reference >>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. >>> The problem is we take the !parent out_err branch and that only drops >>> refcounts; parent_spec that would've been freed had we called >>> rbd_dev_unparent() remains and triggers rbd_warn() in >>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and >>> parent_ref == 0, so counter ends up being -1 after the decrement. >> >> OK, now that I understand the context... >> >> You now get extra references for the spec and client >> for the parent only after creating the parent device. >> I think the reason they logically belonged before >> the call to rbd_device_create() for the parent is >> because the client and spec pointers passed to that >> function carry with them references that are passed >> to the resulting rbd_device if successful. > > Let me stress that those two get()s that I moved is not the point of > the patch at all. Whether we do it before or after rbd_dev_create() is > pretty much irrelevant, the only reason I moved those calls is to make > the error path simpler. Yes I understand that. >> If creating the parent device fails, you unparent the >> original device, which will still have a null parent >> pointer. The effect of unparenting in this case is >> dropping a reference to the parent's spec, and clearing >> the device's pointer to it. This is confusing, but >> let's run with it. >> >> If creating the parent device succeeds, references to >> the client and parent spec are taken (basically, these >> belong to the just-created parent device). The parent >> image is now probed. If this fails, you again >> unparent the device. We still have not set the >> device's parent pointer, so the effect is as before, >> dropping the parent spec reference and clearing >> the device's reference to it. The error handling >> now destroys the parent, which drops references to >> its client and the spec. Again, this seems >> confusing, as if we've dropped one more reference >> to the parent spec than desired. >> >> This logic now seems to work. But it's working >> around a flaw in the caller I think. Upon entry >> to rbd_dev_probe_parent(), a layered device will >> have have a non-null parent_spec pointer (and a >> reference to it), which will have been filled in >> by rbd_dev_v2_parent_info(). >> >> Really, it should not be rbd_dev_probe_parent() >> that drops that parent spec reference on error. >> Instead, rbd_dev_image_probe() (which got the >> reference to the parent spec) should handle >> cleaning up the device's parent spec if an >> error occurs after it has been assigned. >> >> I'll wait for your response, I'd like to know >> if what I'm saying makes sense. > > All of the above makes sense. I agree that it is asymmetrical and that > it would have been better to have rbd_dev_image_probe() drop the last > ref on ->parent_spec. But, that's not what all of the existing code is > doing. Agreed. > Consider rbd_dev_probe_parent() without my patch. There are two > out_err jumps. As it is, if rbd_dev_create() fails, we only revert > those two get()s and return with alive ->parent_spec. However, if > rbd_dev_image_probe() fails, we call rbd_dev_unparent(), which will put > ->parent_spec. Really, the issue is rbd_dev_unparent(), which not only > removes the parent rbd_dev, but also the parent spec. All I did with > this patch was align both out_err cases to do the same, namely undo the > effects of rbd_dev_v2_parent_info() - I didn't want to create yet > another error path. OK. Walking through the code I did concluded that what you did made things "line up" properly. But I just felt like it wasn't necessarily the *right* fix, but it was a correct fix. The only point in suggesting what I think would be a better fix is that it would make things easier to reason about and get correct, and in so doing, make it easier to maintain and adapt in the future. But I have no objection to your fix, so please accept this for your original patch: Reviewed-by: Alex Elder <el...@linaro.org> > Thanks, > > Ilya > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On 10/15/2015 01:31 PM, Ilya Dryomov wrote: > On Thu, Oct 15, 2015 at 7:10 PM, Alex Elder <el...@ieee.org> wrote: >> On 10/11/2015 01:03 PM, Ilya Dryomov wrote: >>> Currently we leak parent_spec and trigger a "parent reference >>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. >>> The problem is we take the !parent out_err branch and that only drops >>> refcounts; parent_spec that would've been freed had we called >>> rbd_dev_unparent() remains and triggers rbd_warn() in >>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and >>> parent_ref == 0, so counter ends up being -1 after the decrement. >>> >>> Redo rbd_dev_probe_parent() to fix this. >> >> I'm just starting to look at this. >> >> My up-front problem is that I want to understand why >> it's OK to no longer grab references to the parent spec >> and the client before creating the new parent device. >> Is it simply because the rbd_dev that's the subject >> of this call is already holding a reference to both, >> so no need to get another? I think that's right, but >> you should say that in the explanation. It's a >> change that's independent of the other change (which >> you describe). > > I still grab references, I just moved that code after rbd_dev_create(). > Fundamentally rbd_dev_create() is just a memory allocation, so whether > we acquire references before or after doesn't matter, except that > acquiring them before complicates the error path. > >> >> OK, onto the change you describe. >> >> You say that we get the underflow if rbd_dev_create() fails. >> >> At that point in the function, we have: >> parent == NULL >> rbd_dev->parent_spec != NULL >> parent_spec holds a new reference to that >> rbdc holds a new reference to the client >> >> rbd_dev_create() only returns NULL if the kzalloc() fails. >> And if so, we jump to out_err, take the !parent branch, >> and we drop the references we took in rbdc and parent_spec >> before returning. >> >> Where is the leak? (Actually, underflow means we're >> dropping more references than we took.) OK, I'm still reviewing your patch, but I have another comment. > We enter rbd_dev_probe_parent(). If ->parent_spec != NULL (and > therefore has a refcount of 1), we bump refcounts and proceed to > allocating parent rbd_dev. If rbd_dev_create() fails, we drop > refcounts and exit rbd_dev_probe_parent() with ->parent_spec != NULL > and a refcount of 1 on that spec. We take err_out_probe in This is exactly as it should be. On error, we exit rbd_dev_create() with things in exactly the state they were in when it was called. > rbd_dev_image_probe() and through rbd_dev_unprobe() end up in > rbd_dev_parent_put(). Now, ->parent_spec != NULL but ->parent_ref == 0 > because we never got to atomic_set(_dev->parent_ref, 1) in That's where the faulty logic lies. We set the parent_spec before we set up the parent, but we update parent_ref only after the parent is recorded. And we later assume a non-null parent_spec means we can decrement parent_ref, which in this case is not a valid assumption. I'll send my review comments shortly. I think you have fixed a bug but I think it's the wrong fix. -Alex > rbd_dev_probe_parent(). Local counter ends up -1 after the decrement > and so instead of calling rbd_dev_unparent() which would have freed > ->parent_spec we issue an underflow warning and bail. ->parent_spec is > leaked. > > Thanks, > > Ilya > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On 10/11/2015 01:03 PM, Ilya Dryomov wrote: > Currently we leak parent_spec and trigger a "parent reference > underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. > The problem is we take the !parent out_err branch and that only drops > refcounts; parent_spec that would've been freed had we called > rbd_dev_unparent() remains and triggers rbd_warn() in > rbd_dev_parent_put() - at that point we have parent_spec != NULL and > parent_ref == 0, so counter ends up being -1 after the decrement. OK, now that I understand the context... You now get extra references for the spec and client for the parent only after creating the parent device. I think the reason they logically belonged before the call to rbd_device_create() for the parent is because the client and spec pointers passed to that function carry with them references that are passed to the resulting rbd_device if successful. If creating the parent device fails, you unparent the original device, which will still have a null parent pointer. The effect of unparenting in this case is dropping a reference to the parent's spec, and clearing the device's pointer to it. This is confusing, but let's run with it. If creating the parent device succeeds, references to the client and parent spec are taken (basically, these belong to the just-created parent device). The parent image is now probed. If this fails, you again unparent the device. We still have not set the device's parent pointer, so the effect is as before, dropping the parent spec reference and clearing the device's reference to it. The error handling now destroys the parent, which drops references to its client and the spec. Again, this seems confusing, as if we've dropped one more reference to the parent spec than desired. This logic now seems to work. But it's working around a flaw in the caller I think. Upon entry to rbd_dev_probe_parent(), a layered device will have have a non-null parent_spec pointer (and a reference to it), which will have been filled in by rbd_dev_v2_parent_info(). Really, it should not be rbd_dev_probe_parent() that drops that parent spec reference on error. Instead, rbd_dev_image_probe() (which got the reference to the parent spec) should handle cleaning up the device's parent spec if an error occurs after it has been assigned. I'll wait for your response, I'd like to know if what I'm saying makes sense. -Alex > Redo rbd_dev_probe_parent() to fix this. > > Signed-off-by: Ilya Dryomov> --- > drivers/block/rbd.c | 36 > 1 file changed, 16 insertions(+), 20 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index cd00e4653e49..ccbc3cbbf24e 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -5134,41 +5134,37 @@ out_err: > static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) > { > struct rbd_device *parent = NULL; > - struct rbd_spec *parent_spec; > - struct rbd_client *rbdc; > int ret; > > if (!rbd_dev->parent_spec) > return 0; > - /* > - * We need to pass a reference to the client and the parent > - * spec when creating the parent rbd_dev. Images related by > - * parent/child relationships always share both. > - */ > - parent_spec = rbd_spec_get(rbd_dev->parent_spec); > - rbdc = __rbd_get_client(rbd_dev->rbd_client); > > - ret = -ENOMEM; > - parent = rbd_dev_create(rbdc, parent_spec, NULL); > - if (!parent) > + parent = rbd_dev_create(rbd_dev->rbd_client, rbd_dev->parent_spec, > + NULL); > + if (!parent) { > + ret = -ENOMEM; > goto out_err; > + } > + > + /* > + * Images related by parent/child relationships always share > + * rbd_client and spec/parent_spec, so bump their refcounts. > + */ > + __rbd_get_client(rbd_dev->rbd_client); > + rbd_spec_get(rbd_dev->parent_spec); > > ret = rbd_dev_image_probe(parent, false); > if (ret < 0) > goto out_err; > + > rbd_dev->parent = parent; > atomic_set(_dev->parent_ref, 1); > - > return 0; > + > out_err: > - if (parent) { > - rbd_dev_unparent(rbd_dev); > + rbd_dev_unparent(rbd_dev); > + if (parent) > rbd_dev_destroy(parent); > - } else { > - rbd_put_client(rbdc); > - rbd_spec_put(parent_spec); > - } > - > return ret; > } > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: don't leak parent_spec in rbd_dev_probe_parent()
On 10/15/2015 01:31 PM, Ilya Dryomov wrote: > On Thu, Oct 15, 2015 at 7:10 PM, Alex Elder <el...@ieee.org> wrote: >> On 10/11/2015 01:03 PM, Ilya Dryomov wrote: >>> Currently we leak parent_spec and trigger a "parent reference >>> underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails. >>> The problem is we take the !parent out_err branch and that only drops >>> refcounts; parent_spec that would've been freed had we called >>> rbd_dev_unparent() remains and triggers rbd_warn() in >>> rbd_dev_parent_put() - at that point we have parent_spec != NULL and >>> parent_ref == 0, so counter ends up being -1 after the decrement. >>> >>> Redo rbd_dev_probe_parent() to fix this. >> >> I'm just starting to look at this. >> >> My up-front problem is that I want to understand why >> it's OK to no longer grab references to the parent spec >> and the client before creating the new parent device. >> Is it simply because the rbd_dev that's the subject >> of this call is already holding a reference to both, >> so no need to get another? I think that's right, but >> you should say that in the explanation. It's a >> change that's independent of the other change (which >> you describe). > > I still grab references, I just moved that code after rbd_dev_create(). > Fundamentally rbd_dev_create() is just a memory allocation, so whether > we acquire references before or after doesn't matter, except that > acquiring them before complicates the error path. OK, I accept that. But looking at the patch alone it made me wonder whether grabbing the references was necessary before creating the parent device. I guess the main point is you should have mentioned it in your description. I'm going to take another look at the original patch, now that I have your explanation (below). Thanks. -Alex >> OK, onto the change you describe. >> >> You say that we get the underflow if rbd_dev_create() fails. >> >> At that point in the function, we have: >> parent == NULL >> rbd_dev->parent_spec != NULL >> parent_spec holds a new reference to that >> rbdc holds a new reference to the client >> >> rbd_dev_create() only returns NULL if the kzalloc() fails. >> And if so, we jump to out_err, take the !parent branch, >> and we drop the references we took in rbdc and parent_spec >> before returning. >> >> Where is the leak? (Actually, underflow means we're >> dropping more references than we took.) > > We enter rbd_dev_probe_parent(). If ->parent_spec != NULL (and > therefore has a refcount of 1), we bump refcounts and proceed to > allocating parent rbd_dev. If rbd_dev_create() fails, we drop > refcounts and exit rbd_dev_probe_parent() with ->parent_spec != NULL > and a refcount of 1 on that spec. We take err_out_probe in > rbd_dev_image_probe() and through rbd_dev_unprobe() end up in > rbd_dev_parent_put(). Now, ->parent_spec != NULL but ->parent_ref == 0 > because we never got to atomic_set(_dev->parent_ref, 1) in > rbd_dev_probe_parent(). Local counter ends up -1 after the decrement > and so instead of calling rbd_dev_unparent() which would have freed > ->parent_spec we issue an underflow warning and bail. ->parent_spec is > leaked. > > Thanks, > > Ilya > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: check data_len in ->alloc_msg()
On 09/02/2015 12:33 PM, Ilya Dryomov wrote: > Only ->alloc_msg() should check data_len of the incoming message > against the preallocated ceph_msg, doing it in the messenger is not > right. The contract is that either ->alloc_msg() returns a ceph_msg > which will fit all of the portions of the incoming message, or it > returns NULL and possibly sets skip, signaling whether NULL is due to > an -ENOMEM. ->alloc_msg() should be the only place where we make the > skip/no-skip decision. > > I stumbled upon this while looking at con/osd ref counting. Right now, > if we get a non-extent message with a larger data portion than we are > prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip > it in the messenger, we don't put the con/osd ref acquired in > ceph_con_in_msg_alloc() (which is normally put in process_message()), > so this also fixes a memory leak. Sorry for the delay reviewing this. This looks good. In a follow-on patch you could update the comments at the top of ceph_con_in_msg_alloc() to suggest that *skip may be set or not when this function returns, without saying "if we set" (which says to me that this function sets it directly). Minor, but I think it could be stated a little more clearly. That comment block also talks about a connection's "alloc_msg op if available" but we assert that pointer is non-null, so that doesn't really sound right either. > An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't > corrupt random memory should a buggy ->alloc_msg() return an unfit > ceph_msg. So this will catch the problem, just a little later. Reviewed-by: Alex Elder <el...@linaro.org> > While at it, I changed the "unknown tid" dout() to a pr_warn() to make > sure all skips are seen and unified format strings. > > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > net/ceph/messenger.c | 7 --- > net/ceph/osd_client.c | 51 > ++- > 2 files changed, 18 insertions(+), 40 deletions(-) > > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c > index 36757d46ac40..525f454f7531 100644 > --- a/net/ceph/messenger.c > +++ b/net/ceph/messenger.c > @@ -2337,13 +2337,6 @@ static int read_partial_message(struct ceph_connection > *con) > return ret; > > BUG_ON(!con->in_msg ^ skip); > - if (con->in_msg && data_len > con->in_msg->data_length) { > - pr_warn("%s skipping long message (%u > %zd)\n", > - __func__, data_len, con->in_msg->data_length); > - ceph_msg_put(con->in_msg); > - con->in_msg = NULL; > - skip = 1; > - } > if (skip) { > /* skip this message */ > dout("alloc_msg said skip message\n"); > diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c > index 50033677c0fa..80b94e37c94a 100644 > --- a/net/ceph/osd_client.c > +++ b/net/ceph/osd_client.c > @@ -2817,8 +2817,9 @@ out: > } > > /* > - * lookup and return message for incoming reply. set up reply message > - * pages. > + * Lookup and return message for incoming reply. Don't try to do > + * anything about a larger than preallocated data portion of the > + * message at the moment - for now, just skip the message. > */ > static struct ceph_msg *get_reply(struct ceph_connection *con, > struct ceph_msg_header *hdr, > @@ -2836,10 +2837,10 @@ static struct ceph_msg *get_reply(struct > ceph_connection *con, > mutex_lock(>request_mutex); > req = __lookup_request(osdc, tid); > if (!req) { > - *skip = 1; > + pr_warn("%s osd%d tid %llu unknown, skipping\n", > + __func__, osd->o_osd, tid); > m = NULL; > - dout("get_reply unknown tid %llu from osd%d\n", tid, > - osd->o_osd); > + *skip = 1; > goto out; > } > > @@ -2849,10 +2850,9 @@ static struct ceph_msg *get_reply(struct > ceph_connection *con, > ceph_msg_revoke_incoming(req->r_reply); > > if (front_len > req->r_reply->front_alloc_len) { > - pr_warn("get_reply front %d > preallocated %d (%u#%llu)\n", > - front_len, req->r_reply->front_alloc_len, > - (unsigned int)con->peer_name.type, > - le64_to_cpu(con->peer_name.num)); > + pr_warn("%s osd%d tid %llu front %d > prea
Re: [PATCH] rbd: plug rbd_dev->header.object_prefix memory leak
On 08/31/2015 12:02 PM, Ilya Dryomov wrote: > Need to free object_prefix when rbd_dev_v2_snap_context() fails, but > only if this is the first time we are reading in the header. > > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> Yes this makes sense. It might be nice to encapsulate some of this, something like rbd_dev_header_info_release(). As things stand, freeing of the object prefix appears sort of random. It wasn't easy to follow the life cycle of that field doing a quick scan of the code for when it's set and cleared. Anyway, looks good. Reviewed-by: Alex Elder <el...@linaro.org> > --- > drivers/block/rbd.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 324bf35ec4dd..69d03aa46d0d 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -4720,7 +4720,10 @@ static int rbd_dev_v2_header_info(struct rbd_device > *rbd_dev) > } > > ret = rbd_dev_v2_snap_context(rbd_dev); > - dout("rbd_dev_v2_snap_context returned %d\n", ret); > + if (ret && first_time) { > + kfree(rbd_dev->header.object_prefix); > + rbd_dev->header.object_prefix = NULL; > + } > > return ret; > } > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: fix double free on rbd_dev->header_name
On 08/31/2015 07:47 AM, Ilya Dryomov wrote: > If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name > is freed twice: once in rbd_dev_probe_parent() and then in its caller > rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to > handle parent images). > > rbd_dev_probe_parent() is responsible for probing the parent, so it > shoudn't mock with clone's fields. Agreed. (But I think you mean "muck with.") The other argument is that the caller is who allocated it (via rbd_dev_header_name()), so it should be responsible for freeing it. Reviewed-by: Alex Elder <el...@linaro.org> > > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > drivers/block/rbd.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index bc67a93aa4f4..324bf35ec4dd 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -5201,7 +5201,6 @@ static int rbd_dev_probe_parent(struct rbd_device > *rbd_dev) > out_err: > if (parent) { > rbd_dev_unparent(rbd_dev); > - kfree(rbd_dev->header_name); > rbd_dev_destroy(parent); > } else { > rbd_put_client(rbdc); > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/18] libceph: add scatterlist messenger data type
On 07/29/2015 04:23 AM, mchri...@redhat.com wrote: From: Mike Christie micha...@cs.wisc.edu LIO uses scatterlist for its page/data management. This patch adds a scatterlist messenger data type, so LIO can pass its sg down directly to rbd. Signed-off-by: Mike Christie micha...@cs.wisc.edu I'm not going to be able to review all of these, and this isnt' even a complete review. But it's something... You're clearly on the right track, but I want to provide a meaningful review for correctness and design so I'm looking for a bit more information. --- include/linux/ceph/messenger.h | 13 ++ include/linux/ceph/osd_client.h | 12 +- net/ceph/messenger.c| 96 + net/ceph/osd_client.c | 26 +++ 4 files changed, 146 insertions(+), 1 deletion(-) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 3775327..bc1bde8 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -79,6 +79,7 @@ enum ceph_msg_data_type { #ifdef CONFIG_BLOCK CEPH_MSG_DATA_BIO, /* data source/destination is a bio list */ #endif /* CONFIG_BLOCK */ + CEPH_MSG_DATA_SG, /* data source/destination is a scatterlist */ }; static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type) @@ -90,6 +91,7 @@ static __inline__ bool ceph_msg_data_type_valid(enum ceph_msg_data_type type) #ifdef CONFIG_BLOCK case CEPH_MSG_DATA_BIO: #endif /* CONFIG_BLOCK */ + case CEPH_MSG_DATA_SG: return true; default: return false; @@ -112,6 +114,11 @@ struct ceph_msg_data { unsigned intalignment; /* first page */ }; struct ceph_pagelist*pagelist; + struct { + struct scatterlist *sgl; + unsigned intsgl_init_offset; + u64 sgl_length; + }; Can you supply a short explanation of what these fields represent? It seems sgl_init_offset is the offset of the starting byte in the sgl, but is the purpose of that for page offset calculation, or does it represent an offset into the total length of the sgl? Or put another way, does sgl_init_offset represent some portion of the sgl_length that has been consumed already (and so the initial residual length is sgl_length - sgl_init_offset)? }; }; @@ -139,6 +146,10 @@ struct ceph_msg_data_cursor { struct page *page; /* page from list */ size_t offset; /* bytes from list */ }; + struct { + struct scatterlist *sg;/* curr sg */ /* current sg */ + unsigned intsg_consumed; Here too, what does sg_consumed represent with respect to the initial offset and the length? I guess I'm going to stop with that. It'll be a lot easier for me to review this if I'm sure I'm sure I understand what these represent. Thanks. -Alex + }; }; }; . . . -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: fix copyup completion race
() when it completed. This meant the copyup would complete, and the copyup callback function would essentially be inserted as an extra object request callback function, called *before* the normal rbd_img_obj_callback() completion function. With this change, when the parent read full image request completes, the original object request callback is left as-is (I think/assume it's always rbd_img_obj_callback()). However, unlike before, rbd_img_obj_copyup_callback() is now called as part of the *osd request* completion handling for the copyup operation. This finishes before the object request completion handling. Rather than chaining object request handlers, we clean up the copyup OSD request as part of OSD completion (and then mark the containing original object request done), and *then* complete the object request normally. I have two comments about this. - I *think* this is a better way to do this anyway. By which I mean that since we're setting up that copyup operation sort of behind the scenes, separate from the original object request, it's just as well that we clean up the pages allocated, etc. behind the scenes (at the OSD request layer, not the object request layer) also. - Since rbd_img_obj_copyup_callback() is now being called in the OSD request callback context (only), it should be renamed rbd_osd_copyup_callback(). I wish there were a nice way to diagram the various paths these requests take. They're not overly complicated, but every time I go through it I need to sort of think through it again each time. I'm going to say this looks good... Reviewed-by: Alex Elder el...@linaro.org Cc: Alex Elder el...@linaro.org Cc: sta...@vger.kernel.org # 3.10+, needs backporting for 3.18 Signed-off-by: Ilya Dryomov idryo...@gmail.com --- drivers/block/rbd.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index d94529d5c8e9..007e5d0cadaf 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -523,6 +523,7 @@ void rbd_warn(struct rbd_device *rbd_dev, const char *fmt, ...) # define rbd_assert(expr) ((void) 0) #endif /* !RBD_DEBUG */ +static void rbd_img_obj_copyup_callback(struct rbd_obj_request *obj_request); static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request); static void rbd_img_parent_read(struct rbd_obj_request *obj_request); static void rbd_dev_remove_parent(struct rbd_device *rbd_dev); @@ -1818,6 +1819,16 @@ static void rbd_osd_stat_callback(struct rbd_obj_request *obj_request) obj_request_done_set(obj_request); } +static void rbd_osd_call_callback(struct rbd_obj_request *obj_request) +{ + dout(%s: obj %p\n, __func__, obj_request); + + if (obj_request_img_data_test(obj_request)) + rbd_img_obj_copyup_callback(obj_request); + else + obj_request_done_set(obj_request); +} + static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, struct ceph_msg *msg) { @@ -1866,6 +1877,8 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, rbd_osd_discard_callback(obj_request); break; case CEPH_OSD_OP_CALL: + rbd_osd_call_callback(obj_request); + break; case CEPH_OSD_OP_NOTIFY_ACK: case CEPH_OSD_OP_WATCH: rbd_osd_trivial_callback(obj_request); @@ -2537,6 +2550,8 @@ rbd_img_obj_copyup_callback(struct rbd_obj_request *obj_request) struct page **pages; u32 page_count; + dout(%s: obj %p\n, __func__, obj_request); + rbd_assert(obj_request-type == OBJ_REQUEST_BIO || obj_request-type == OBJ_REQUEST_NODATA); rbd_assert(obj_request_img_data_test(obj_request)); @@ -2563,9 +2578,7 @@ rbd_img_obj_copyup_callback(struct rbd_obj_request *obj_request) if (!obj_request-result) obj_request-xferred = obj_request-length; - /* Finish up with the normal image object callback */ - - rbd_img_obj_callback(obj_request); + obj_request_done_set(obj_request); } static void @@ -2650,7 +2663,6 @@ rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request) /* All set, send it off. */ - orig_request-callback = rbd_img_obj_copyup_callback; What is orig_request-callback here, is it always rbd_img_obj_callback()? osdc = rbd_dev-rbd_client-client-osdc; img_result = rbd_obj_request_submit(osdc, orig_request); if (!img_result) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: use GFP_NOIO in rbd_obj_request_create()
On 06/30/2015 02:19 PM, Ilya Dryomov wrote: rbd_obj_request_create() is called on the main I/O path, so we need to use GFP_NOIO to make sure allocation doesn't blow back on us. Not all callers need this, but I'm still hardcoding the flag inside rather than making it a parameter because a) this is going to stable, and b) those callers shouldn't really use rbd_obj_request_create() and will be fixed in the future. More memory allocation fixes will follow. This looks OK to me. It's conservative; you can add a GFP parameter in the future. Reviewed-by: Alex Elder el...@linaro.org Cc: sta...@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov idryo...@gmail.com --- drivers/block/rbd.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b316ee48a30b..d94529d5c8e9 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2022,11 +2022,11 @@ static struct rbd_obj_request *rbd_obj_request_create(const char *object_name, rbd_assert(obj_request_type_valid(type)); size = strlen(object_name) + 1; - name = kmalloc(size, GFP_KERNEL); + name = kmalloc(size, GFP_NOIO); if (!name) return NULL; - obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_KERNEL); + obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO); if (!obj_request) { kfree(name); return NULL; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: Fix ceph_tcp_sendpage()'s more boolean usage
On 06/25/2015 11:56 AM, Benoît Canet wrote: Spotted while hunting http://tracker.ceph.com/issues/10905. From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h: boollast_piece; /* current is last piece */ In ceph_msg_data_next(): *last_piece = cursor-last_piece; A call to ceph_msg_data_next() is followed by: ret = ceph_tcp_sendpage(con-sock, page, page_offset, length, last_piece); while ceph_tcp_sendpage() is: static int ceph_tcp_sendpage(struct socket *sock, struct page *page,i int offset, size_t size, bool more) The logic is inverted: correct it. Whoops. I'm surprised we haven't had more trouble from this. This should go to stable too. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Benoît Canet benoit.ca...@nodalink.com --- net/ceph/messenger.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index ec68cd3..eda06fd 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1544,7 +1544,7 @@ static int write_partial_message_data(struct ceph_connection *con) page = ceph_msg_data_next(msg-cursor, page_offset, length, last_piece); ret = ceph_tcp_sendpage(con-sock, page, page_offset, - length, last_piece); + length, !last_piece); if (ret = 0) { if (do_datacrc) msg-footer.data_crc = cpu_to_le32(crc); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: bump queue_max_segments
On 06/25/2015 04:01 AM, Ilya Dryomov wrote: The default queue_limits::max_segments value (BLK_MAX_SEGMENTS = 128) unnecessarily limits bio sizes to 512k (assuming 4k pages). rbd, being a virtual block device, doesn't have any restrictions on the number of physical segments, so bump max_segments to max_hw_sectors, in theory allowing a sector per segment (although the only case this matters that I can think of is some readv/writev style thing). In practice this is going to give us 1M bios - the number of segments in a bio is limited in bio_get_nr_vecs() by BIO_MAX_PAGES = 256. Note that this doesn't result in any improvement on a typical direct sequential test. This is because on a box with a not too badly fragmented memory the default BLK_MAX_SEGMENTS is enough to see nice rbd object size sized requests. The only difference is the size of bios being merged - 512k vs 1M for something like $ dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=$RBD_OBJ_SIZE $ dd if=/dev/rbd0 iflag=direct of=/dev/null bs=$RBD_OBJ_SIZE Signed-off-by: Ilya Dryomov idryo...@gmail.com This looks good. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 89fe8a4bc02e..bc88fbcb9715 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3791,6 +3791,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) /* set io sizes to object size */ segment_size = rbd_obj_bytes(rbd_dev-header); blk_queue_max_hw_sectors(q, segment_size / SECTOR_SIZE); + blk_queue_max_segments(q, segment_size / SECTOR_SIZE); blk_queue_max_segment_size(q, segment_size); blk_queue_io_min(q, segment_size); blk_queue_io_opt(q, segment_size); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] rbd: queue_depth map option
On 06/25/2015 04:11 AM, Ilya Dryomov wrote: nr_requests (/sys/block/rbdid/queue/nr_requests) is pretty much irrelevant in blk-mq case because each driver sets its own max depth that it can handle and that's the number of tags that gets preallocated on setup. Users can't increase queue depth beyond that value via writing to nr_requests. For rbd we are happy with the default BLKDEV_MAX_RQ (128) for most cases but we want to give users the opportunity to increase it. Introduce a new per-device queue_depth option to do just that: $ sudo rbd map -o queue_depth=1024 ... Signed-off-by: Ilya Dryomov idryo...@gmail.com I haven't gone to follow through what happens with this but I assume a value that's too large will be caught when it's attempted to be used or something. In any case this looks good to me. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index e502bce02d2c..b316ee48a30b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -728,6 +728,7 @@ static struct rbd_client *rbd_client_find(struct ceph_options *ceph_opts) * (Per device) rbd map options */ enum { + Opt_queue_depth, Opt_last_int, /* int args above */ Opt_last_string, @@ -738,6 +739,7 @@ enum { }; static match_table_t rbd_opts_tokens = { + {Opt_queue_depth, queue_depth=%d}, /* int args above */ /* string args above */ {Opt_read_only, read_only}, @@ -748,9 +750,11 @@ static match_table_t rbd_opts_tokens = { }; struct rbd_options { + int queue_depth; boolread_only; }; +#define RBD_QUEUE_DEPTH_DEFAULTBLKDEV_MAX_RQ #define RBD_READ_ONLY_DEFAULT false static int parse_rbd_opts_token(char *c, void *private) @@ -774,6 +778,13 @@ static int parse_rbd_opts_token(char *c, void *private) } switch (token) { + case Opt_queue_depth: + if (intval 1) { + pr_err(queue_depth out of range\n); + return -EINVAL; + } + rbd_opts-queue_depth = intval; + break; case Opt_read_only: rbd_opts-read_only = true; break; @@ -3761,10 +3772,9 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) memset(rbd_dev-tag_set, 0, sizeof(rbd_dev-tag_set)); rbd_dev-tag_set.ops = rbd_mq_ops; - rbd_dev-tag_set.queue_depth = BLKDEV_MAX_RQ; + rbd_dev-tag_set.queue_depth = rbd_dev-opts-queue_depth; rbd_dev-tag_set.numa_node = NUMA_NO_NODE; - rbd_dev-tag_set.flags = - BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE; + rbd_dev-tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE; rbd_dev-tag_set.nr_hw_queues = 1; rbd_dev-tag_set.cmd_size = sizeof(struct work_struct); @@ -4948,6 +4958,7 @@ static int rbd_add_parse_args(const char *buf, goto out_mem; rbd_opts-read_only = RBD_READ_ONLY_DEFAULT; + rbd_opts-queue_depth = RBD_QUEUE_DEPTH_DEFAULT; copts = ceph_parse_options(options, mon_addrs, mon_addrs + mon_addrs_size - 1, -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Avoid holding the zero page on slab init failure
On 06/24/2015 08:27 PM, Benoît Canet wrote: Spotted by visual inspection. Applies on libceph: Remove spurious kunmap() of the zero page. Benoît Canet (1): libceph: Avoid holding the zero page on ceph_msgr_slab_init errors net/ceph/messenger.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) Benoit, in cases where you have just a single patch, there is no need to include an additional mail message like this one. If you have a series, that's when you'll supply a PATCH 0/N message. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] rbd: terminate rbd_opts_tokens with Opt_err
On 06/25/2015 04:11 AM, Ilya Dryomov wrote: Also nuke useless Opt_last_bool and don't break lines unnecessarily. Signed-off-by: Ilya Dryomov idryo...@gmail.com Good idea. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 24 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index bc88fbcb9715..4de8c9167c4b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -724,7 +724,7 @@ static struct rbd_client *rbd_client_find(struct ceph_options *ceph_opts) } /* - * mount options + * (Per device) rbd map options */ enum { Opt_last_int, @@ -733,8 +733,7 @@ enum { /* string args above */ Opt_read_only, Opt_read_write, - /* Boolean args above */ - Opt_last_bool, + Opt_err }; static match_table_t rbd_opts_tokens = { @@ -744,8 +743,7 @@ static match_table_t rbd_opts_tokens = { {Opt_read_only, ro},/* Alternate spelling */ {Opt_read_write, read_write}, {Opt_read_write, rw}, /* Alternate spelling */ - /* Boolean args above */ - {-1, NULL} + {Opt_err, NULL} }; struct rbd_options { @@ -761,22 +759,15 @@ static int parse_rbd_opts_token(char *c, void *private) int token, intval, ret; token = match_token(c, rbd_opts_tokens, argstr); - if (token 0) - return -EINVAL; - if (token Opt_last_int) { ret = match_int(argstr[0], intval); if (ret 0) { - pr_err(bad mount option arg (not int) - at '%s'\n, c); + pr_err(bad mount option arg (not int) at '%s'\n, c); return ret; } dout(got int token %d val %d\n, token, intval); } else if (token Opt_last_int token Opt_last_string) { - dout(got string token %d val %s\n, token, -argstr[0].from); - } else if (token Opt_last_string token Opt_last_bool) { - dout(got Boolean token %d\n, token); + dout(got string token %d val %s\n, token, argstr[0].from); } else { dout(got token %d\n, token); } @@ -789,9 +780,10 @@ static int parse_rbd_opts_token(char *c, void *private) rbd_opts-read_only = false; break; default: - rbd_assert(false); - break; + /* libceph prints bad option msg */ + return -EINVAL; } + return 0; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] rbd: store rbd_options in rbd_device
On 06/25/2015 04:11 AM, Ilya Dryomov wrote: Signed-off-by: Ilya Dryomov idryo...@gmail.com Now that you need it when initializing the disk, this makes sense. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 18 +++--- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4de8c9167c4b..e502bce02d2c 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -346,6 +346,7 @@ struct rbd_device { struct rbd_image_header header; unsigned long flags; /* possibly lock protected */ struct rbd_spec *spec; + struct rbd_options *opts; char*header_name; @@ -4055,7 +4056,8 @@ static void rbd_spec_free(struct kref *kref) } static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, - struct rbd_spec *spec) +struct rbd_spec *spec, +struct rbd_options *opts) { struct rbd_device *rbd_dev; @@ -4069,8 +4071,9 @@ static struct rbd_device *rbd_dev_create(struct rbd_client *rbdc, INIT_LIST_HEAD(rbd_dev-node); init_rwsem(rbd_dev-header_rwsem); - rbd_dev-spec = spec; rbd_dev-rbd_client = rbdc; + rbd_dev-spec = spec; + rbd_dev-opts = opts; /* Initialize the layout used for all rbd requests */ @@ -4086,6 +4089,7 @@ static void rbd_dev_destroy(struct rbd_device *rbd_dev) { rbd_put_client(rbd_dev-rbd_client); rbd_spec_put(rbd_dev-spec); + kfree(rbd_dev-opts); kfree(rbd_dev); } @@ -5160,7 +5164,7 @@ static int rbd_dev_probe_parent(struct rbd_device *rbd_dev) rbdc = __rbd_get_client(rbd_dev-rbd_client); ret = -ENOMEM; - parent = rbd_dev_create(rbdc, parent_spec); + parent = rbd_dev_create(rbdc, parent_spec, NULL); if (!parent) goto out_err; @@ -5406,9 +5410,6 @@ static ssize_t do_rbd_add(struct bus_type *bus, rc = rbd_add_parse_args(buf, ceph_opts, rbd_opts, spec); if (rc 0) goto err_out_module; - read_only = rbd_opts-read_only; - kfree(rbd_opts); - rbd_opts = NULL;/* done with this */ rbdc = rbd_get_client(ceph_opts); if (IS_ERR(rbdc)) { @@ -5434,11 +5435,12 @@ static ssize_t do_rbd_add(struct bus_type *bus, goto err_out_client; } - rbd_dev = rbd_dev_create(rbdc, spec); + rbd_dev = rbd_dev_create(rbdc, spec, rbd_opts); if (!rbd_dev) goto err_out_client; rbdc = NULL;/* rbd_dev now owns this */ spec = NULL;/* rbd_dev now owns this */ + rbd_opts = NULL;/* rbd_dev now owns this */ rc = rbd_dev_image_probe(rbd_dev, true); if (rc 0) @@ -5446,6 +5448,7 @@ static ssize_t do_rbd_add(struct bus_type *bus, /* If we are mapping a snapshot it must be marked read-only */ + read_only = rbd_dev-opts-read_only; if (rbd_dev-spec-snap_id != CEPH_NOSNAP) read_only = true; rbd_dev-mapping.read_only = read_only; @@ -5470,6 +5473,7 @@ err_out_client: rbd_put_client(rbdc); err_out_args: rbd_spec_put(spec); + kfree(rbd_opts); err_out_module: module_put(THIS_MODULE); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
On 06/24/2015 08:27 PM, Benoît Canet wrote: ceph_msgr_slab_init may fail due to a temporary ENOMEM. Looks good. Delay a bit the initialization of zero_page in ceph_msgr_init and reorder it's cleanup in _ceph_msgr_exit for readability sake. I'd say it's not readability, but a proper ordering (tear down in reverse order of set up). Reviewed-by: Alex Elder el...@linaro.org BUG_ON() will not suffer to be postponed in case it is triggered. This is unrelated, but in cases like this (where there's already something in place to return an error) we should replace these BUG_ON() calls with logged errors and a return. They should never ever happen, of course (they represent design problems) but if it did it's better to allow the system to keep running. Signed-off-by: Benoît Canet benoit.ca...@nodalink.com --- net/ceph/messenger.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 38f06a4..ec68cd3 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -275,22 +275,22 @@ static void _ceph_msgr_exit(void) ceph_msgr_wq = NULL; } - ceph_msgr_slab_exit(); - BUG_ON(zero_page == NULL); page_cache_release(zero_page); zero_page = NULL; + + ceph_msgr_slab_exit(); } int ceph_msgr_init(void) { + if (ceph_msgr_slab_init()) + return -ENOMEM; + BUG_ON(zero_page != NULL); zero_page = ZERO_PAGE(0); page_cache_get(zero_page); - if (ceph_msgr_slab_init()) - return -ENOMEM; - /* * The number of active work items is limited by the number of * connections, so leave @max_active at default. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] remove unbalanced kunmap
On 06/24/2015 07:35 PM, Alex Elder wrote: On 06/24/2015 04:18 PM, Benoît Canet wrote: Spotted via visual inspection of the code. Benoît Canet (1): libceph: Remove spurious kunmap() of the zero page net/ceph/messenger.c | 1 - 1 file changed, 1 deletion(-) I got no patch with this. Is it just me?-Alex Nevermind. I get it. This was associated with the patch I just signed off on... -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] remove unbalanced kunmap
On 06/24/2015 04:18 PM, Benoît Canet wrote: Spotted via visual inspection of the code. Benoît Canet (1): libceph: Remove spurious kunmap() of the zero page net/ceph/messenger.c | 1 - 1 file changed, 1 deletion(-) I got no patch with this. Is it just me? -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: Remove spurious kunmap() of the zero page
On 06/24/2015 04:18 PM, Benoît Canet wrote: ceph_tcp_sendpage already does the work of mapping/unmapping the zero page if needed. Signed-off-by: Benoît Canet benoit.ca...@nodalink.com This looks good. Reviewed-by: Alex Elder el...@linaro.org --- net/ceph/messenger.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 967080a..38f06a4 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -278,7 +278,6 @@ static void _ceph_msgr_exit(void) ceph_msgr_slab_exit(); BUG_ON(zero_page == NULL); - kunmap(zero_page); page_cache_release(zero_page); zero_page = NULL; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] libceph: a couple tweaks for wait loops
On 05/25/2015 05:38 AM, Ilya Dryomov wrote: On Thu, May 21, 2015 at 4:29 PM, Alex Elder el...@ieee.org wrote: On 05/21/2015 07:35 AM, Ilya Dryomov wrote: - return -ETIMEDOUT instead of -EIO in case of timeout - wait_event_interruptible_timeout() returns time left until timeout and since it can be almost LONG_MAX we had better assign it to long Any error returned by wait_event_interruptible_timeout() can now be returned by __ceph_open_session(). It looks like that may, in fact, be only -EINTR and -ERESTARTSYS. But it's a change you could note in the log message. I think it's just -ERESTARTSYS so I didn't bother. My point was almost a little more philosophical. It's conceivable (though not likely) that wait_event_interruptible_timeout() could be changed to return a value that your caller here does not expect. It turns out the only caller ignores the return value of ceph_monc_wait_osdmap() anyway. That should maybe be fixed. That's on purpose - rbd map tries to wait for a new enough osdmap only if the pool that the image is supposed to be in doesn't exist and we know we have a stale osdmap. We ignore wait retval because if we timeout we should return this pool doesn't exist, not -ETIMEDOUT. Yes I realize that. This second part of my response was following on to my previous thought. That is, the caller might get a different return value that it didn't expect; but since the only caller ignores what gets returned, it's a moot point. -Alex Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 08/11] block: kill merge_bvec_fn() completely
On 05/25/2015 10:02 AM, Ilya Dryomov wrote: On Mon, May 25, 2015 at 5:04 PM, Christoph Hellwig h...@lst.de wrote: On Fri, May 22, 2015 at 11:18:40AM -0700, Ming Lin wrote: From: Kent Overstreet kent.overstr...@gmail.com As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own -merge_bvec_fn() callback. Remove every invocation completely. It might be good to replace patch 1 and this one by a patch per driver to remove the merge_bvec_fn instance and add the blk_queue_split call for all those drivers that actually had a -merge_bvec_fn. As some of them were non-trivial attention from the maintainers would be helpful, and a patch per driver might help with that. -/* This is called by bio_add_page(). - * - * q-max_hw_sectors and other global limits are already enforced there. - * - * We need to call down to our lower level device, - * in case it has special restrictions. - * - * We also may need to enforce configured max-bio-bvecs limits. - * - * As long as the BIO is empty we have to allow at least one bvec, - * regardless of size and offset, so no need to ask lower levels. - */ -int drbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bvm, struct bio_vec *bvec) This just checks the lower device, so it looks obviously fine. -static int pkt_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd, - struct bio_vec *bvec) -{ - struct pktcdvd_device *pd = q-queuedata; - sector_t zone = get_zone(bmd-bi_sector, pd); - int used = ((bmd-bi_sector - zone) 9) + bmd-bi_size; - int remaining = (pd-settings.size 9) - used; - int remaining2; - - /* - * A bio = PAGE_SIZE must be allowed. If it crosses a packet - * boundary, pkt_make_request() will split the bio. - */ - remaining2 = PAGE_SIZE - bmd-bi_size; - remaining = max(remaining, remaining2); - - BUG_ON(remaining 0); - return remaining; -} As mentioned in the comment pkt_make_request will split the bio so pkt looks fine. diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index ec6c5c6..f50edb3 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3440,52 +3440,6 @@ static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx, return BLK_MQ_RQ_QUEUE_OK; } -/* - * a queue callback. Makes sure that we don't create a bio that spans across - * multiple osd objects. One exception would be with a single page bios, - * which we handle later at bio_chain_clone_range() - */ -static int rbd_merge_bvec(struct request_queue *q, struct bvec_merge_data *bmd, - struct bio_vec *bvec) It seems rbd handles requests spanning objects just fine, so I don't really understand why rbd_merge_bvec even exists. Getting some form of ACK from the ceph folks would be useful. I'm not Alex, but yeah, we have all the clone/split machinery and so we can handle a spanning case just fine. I think rbd_merge_bvec() exists to make sure we don't have to do that unless it's really necessary - like when a single page gets submitted at an inconvenient offset. I am Alex. This is something I never removed. I haven't looked at it closely now, but it seems to me that after I created a function that split stuff properly up *before* the BIO layer got to it (which has since been replaced by code related to Kent's immutable BIO work), there has been no need for this function. Removing this was on a long-ago to-do list--but I didn't want to do it without spending some time ensuring it wouldn't break anything. If you want me to work through it in more detail so I can give a more certain response, let me know and I will do so. -Alex I have a patch that adds a blk_queue_chunk_sectors(object_size) call to rbd_init_disk() but I haven't had a chance to play with it yet. In any case, we should be fine with getting rid of rbd_merge_bvec(). If this ends up a per-driver patchset, I can make rbd_merge_bvec() - blk_queue_chunk_sectors() a single patch and push it through ceph-client.git. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] libceph: nuke time_sub()
On 05/21/2015 07:35 AM, Ilya Dryomov wrote: Unused since ceph got merged into mainline I guess. Signed-off-by: Ilya Dryomov idryo...@gmail.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- include/linux/ceph/libceph.h | 9 - 1 file changed, 9 deletions(-) diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index 30f92cefaa72..85ae9a889a3f 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -93,15 +93,6 @@ enum { CEPH_MOUNT_SHUTDOWN, }; -/* - * subtract jiffies - */ -static inline unsigned long time_sub(unsigned long a, unsigned long b) -{ - BUG_ON(time_after(b, a)); - return (long)a - (long)b; -} - struct ceph_mds_client; /* -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] libceph: store timeouts in jiffies, verify user input
On 05/21/2015 07:35 AM, Ilya Dryomov wrote: There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean infinite timeout, but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. This looks good. I like your use of local variables for the options pointer; it makes it easy to see in the code where the timeout values come from. You could have handled timeout option checking and error reporting generically, but I'm not sure that would be better. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov idryo...@gmail.com --- drivers/block/rbd.c | 5 +++-- fs/ceph/dir.c| 4 ++-- fs/ceph/mds_client.c | 12 ++-- fs/ceph/mds_client.h | 2 +- fs/ceph/super.c | 2 +- include/linux/ceph/libceph.h | 17 +++-- net/ceph/ceph_common.c | 41 +++-- net/ceph/mon_client.c| 11 +-- net/ceph/osd_client.c| 15 +++ 9 files changed, 71 insertions(+), 38 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 349115ae3bc2..992683b6b299 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4963,8 +4963,8 @@ out_err: */ static int rbd_add_get_pool_id(struct rbd_client *rbdc, const char *pool_name) { + struct ceph_options *opts = rbdc-client-options; u64 newest_epoch; - unsigned long timeout = rbdc-client-options-mount_timeout * HZ; int tries = 0; int ret; @@ -4979,7 +4979,8 @@ again: if (rbdc-client-osdc.osdmap-epoch newest_epoch) { ceph_monc_request_next_osdmap(rbdc-client-monc); (void) ceph_monc_wait_osdmap(rbdc-client-monc, - newest_epoch, timeout); + newest_epoch, + opts-mount_timeout); goto again; } else { /* the osdmap we have is new enough */ diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 4248307fea90..173dd4b58c71 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -1259,8 +1259,8 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end, inode, req-r_tid, last_tid); if (req-r_timeout) { unsigned long time_left = wait_for_completion_timeout( - req-r_safe_completion, - req-r_timeout); + req-r_safe_completion, + ceph_timeout_jiffies(req-r_timeout)); if (time_left 0) ret = 0; else diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 69a36f40517f..0b0e0a9a81c0 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2268,7 +2268,8 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc, dout(do_request waiting\n); if (req-r_timeout) { err = (long)wait_for_completion_killable_timeout( - req-r_completion, req-r_timeout); + req-r_completion, + ceph_timeout_jiffies(req-r_timeout)); if (err == 0) err = -EIO; } else if (req-r_wait_for_completion) { @@ -3424,8 +3425,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc) */ static void wait_requests(struct ceph_mds_client *mdsc) { + struct ceph_options *opts = mdsc-fsc-client-options; struct ceph_mds_request *req; - struct ceph_fs_client *fsc = mdsc-fsc; mutex_lock(mdsc-mutex); if (__get_oldest_req(mdsc)) { @@ -3433,7 +3434,7 @@ static void wait_requests(struct ceph_mds_client *mdsc) dout(wait_requests waiting for requests\n); wait_for_completion_timeout(mdsc-safe_umount_waiters, - fsc-client-options-mount_timeout * HZ); + ceph_timeout_jiffies(opts-mount_timeout)); /* tear down remaining
Re: [PATCH 3/5] libceph: a couple tweaks for wait loops
On 05/21/2015 07:35 AM, Ilya Dryomov wrote: - return -ETIMEDOUT instead of -EIO in case of timeout - wait_event_interruptible_timeout() returns time left until timeout and since it can be almost LONG_MAX we had better assign it to long Any error returned by wait_event_interruptible_timeout() can now be returned by __ceph_open_session(). It looks like that may, in fact, be only -EINTR and -ERESTARTSYS. But it's a change you could note in the log message. It turns out the only caller ignores the return value of ceph_monc_wait_osdmap() anyway. That should maybe be fixed. In any case, this looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov idryo...@gmail.com --- net/ceph/ceph_common.c | 7 +++ net/ceph/mon_client.c | 2 +- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index a80e91c2c9a3..925d0c890b80 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -647,8 +647,8 @@ static int have_mon_and_osd_map(struct ceph_client *client) */ int __ceph_open_session(struct ceph_client *client, unsigned long started) { - int err; unsigned long timeout = client-options-mount_timeout; + long err; /* open session, and wait for mon and osd maps */ err = ceph_monc_open_session(client-monc); @@ -656,16 +656,15 @@ int __ceph_open_session(struct ceph_client *client, unsigned long started) return err; while (!have_mon_and_osd_map(client)) { - err = -EIO; if (timeout time_after_eq(jiffies, started + timeout)) - return err; + return -ETIMEDOUT; /* wait */ dout(mount waiting for mon_map\n); err = wait_event_interruptible_timeout(client-auth_wq, have_mon_and_osd_map(client) || (client-auth_err 0), ceph_timeout_jiffies(timeout)); - if (err == -EINTR || err == -ERESTARTSYS) + if (err 0) return err; if (client-auth_err 0) return client-auth_err; diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index 0da3bdc116f7..9d6ff1215928 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -308,7 +308,7 @@ int ceph_monc_wait_osdmap(struct ceph_mon_client *monc, u32 epoch, unsigned long timeout) { unsigned long started = jiffies; - int ret; + long ret; mutex_lock(monc-mutex); while (monc-have_osdmap epoch) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] rbd: timeout watch teardown on unmap with mount_timeout
On 05/21/2015 07:35 AM, Ilya Dryomov wrote: As part of unmap sequence, kernel client has to talk to the OSDs to teardown watch on the header object. If none of the OSDs are available it would hang forever, until interrupted by a signal - when that happens we follow through with the rest of unmap procedure (i.e. unregister the device and put all the data structures) and the unmap is still considired successful (rbd cli tool exits with 0). The watch on the userspace side should eventually timeout so that's fine. This isn't very nice, because various userspace tools (pacemaker rbd resource agent, for example) then have to worry about setting up their own timeouts. Timeout it with mount_timeout (60 seconds by default). Signed-off-by: Ilya Dryomov idryo...@gmail.com Reviewed-by: Sage Weil s...@redhat.com You can now add: Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 38 -- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 992683b6b299..89fe8a4bc02e 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1563,22 +1563,39 @@ static void rbd_obj_request_end(struct rbd_obj_request *obj_request) /* * Wait for an object request to complete. If interrupted, cancel the * underlying osd request. + * + * @timeout: in jiffies, 0 means wait forever */ -static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) +static int __rbd_obj_request_wait(struct rbd_obj_request *obj_request, + unsigned long timeout) { - int ret; + long ret; dout(%s %p\n, __func__, obj_request); - - ret = wait_for_completion_interruptible(obj_request-completion); - if (ret 0) { - dout(%s %p interrupted\n, __func__, obj_request); + ret = wait_for_completion_interruptible_timeout( + obj_request-completion, + ceph_timeout_jiffies(timeout)); + if (ret = 0) { + if (ret == 0) + ret = -ETIMEDOUT; rbd_obj_request_end(obj_request); - return ret; + } else { + ret = 0; } - dout(%s %p done\n, __func__, obj_request); - return 0; + dout(%s %p ret %d\n, __func__, obj_request, (int)ret); + return ret; +} + +static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) +{ + return __rbd_obj_request_wait(obj_request, 0); +} + +static int rbd_obj_request_wait_timeout(struct rbd_obj_request *obj_request, + unsigned long timeout) +{ + return __rbd_obj_request_wait(obj_request, timeout); } static void rbd_img_request_complete(struct rbd_img_request *img_request) @@ -3122,6 +3139,7 @@ static struct rbd_obj_request *rbd_obj_watch_request_helper( bool watch) { struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; + struct ceph_options *opts = osdc-client-options; struct rbd_obj_request *obj_request; int ret; @@ -3148,7 +3166,7 @@ static struct rbd_obj_request *rbd_obj_watch_request_helper( if (ret) goto out; - ret = rbd_obj_request_wait(obj_request); + ret = rbd_obj_request_wait_timeout(obj_request, opts-mount_timeout); if (ret) goto out; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/5] ceph: simplify two mount_timeout sites
On 05/21/2015 07:35 AM, Ilya Dryomov wrote: No need to bifurcate wait now that we've got ceph_timeout_jiffies(). Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov idryo...@gmail.com --- fs/ceph/dir.c| 14 -- fs/ceph/mds_client.c | 18 ++ 2 files changed, 14 insertions(+), 18 deletions(-) diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 173dd4b58c71..3dec27e36417 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -1257,17 +1257,11 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end, dout(dir_fsync %p wait on tid %llu (until %llu)\n, inode, req-r_tid, last_tid); - if (req-r_timeout) { - unsigned long time_left = wait_for_completion_timeout( - req-r_safe_completion, + ret = !wait_for_completion_timeout(req-r_safe_completion, ceph_timeout_jiffies(req-r_timeout)); - if (time_left 0) - ret = 0; - else - ret = -EIO; /* timed out */ - } else { - wait_for_completion(req-r_safe_completion); - } + if (ret) + ret = -EIO; /* timed out */ + ceph_mdsc_put_request(req); spin_lock(ci-i_unsafe_lock); diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 0b0e0a9a81c0..5be2d287a26c 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2266,16 +2266,18 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc, /* wait */ mutex_unlock(mdsc-mutex); dout(do_request waiting\n); - if (req-r_timeout) { - err = (long)wait_for_completion_killable_timeout( - req-r_completion, - ceph_timeout_jiffies(req-r_timeout)); - if (err == 0) - err = -EIO; - } else if (req-r_wait_for_completion) { + if (!req-r_timeout req-r_wait_for_completion) { err = req-r_wait_for_completion(mdsc, req); } else { - err = wait_for_completion_killable(req-r_completion); + long timeleft = wait_for_completion_killable_timeout( + req-r_completion, + ceph_timeout_jiffies(req-r_timeout)); + if (timeleft 0) + err = 0; + else if (!timeleft) + err = -EIO; /* timed out */ + else + err = timeleft; /* killed */ } dout(do_request waited, got %d\n, err); mutex_lock(mdsc-mutex); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/10] ceph: add start/finish encoding helpers
On 05/01/2015 02:47 PM, Mike Christie wrote: On 05/01/2015 02:39 PM, Mike Christie wrote: On 04/30/2015 07:22 AM, Alex Elder wrote: +/** + * ceph_start_decoding_compat - decode block with legacy support for older schemes + * @p: buffer to decode + * @end: end of decode buffer + * @curr_ver: current version of the encoding that the code supports/encode + * @compat_ver: oldest version that includes a __u8 compat version field + * @len_ver: oldest version that includes a __u32 length wrapper + * @len: buffer to return len of data in buffer + */ +static inline int ceph_start_decoding_compat(void **p, void *end, u8 curr_ver, + u8 compat_ver, u8 len_ver, u32 *len) +{ +u8 struct_ver, struct_compat; + +ceph_decode_8_safe(p, end, struct_ver, fail); +if (struct_ver = curr_ver) { It seems to me it doesn't matter whether the current code supports the struct compat version or not. What matters is whether the encoded header contains the compat field or not. I'm not sure what determines that. I am not sure if I understood this comment. I thought different structs got the compat field in different versions. So, I was concerned about a case where we might get a old struct sent to us. If the compat field was added to some struct_abc in version 2 and that is what we support in the kernel, but some old osd sent us a struct that was version 1, then we do not want to do the compat check below. Doh! I wrote the above mail, then realized what you meant. OK good... And I should have known what determines whether the header contains the compat field, but I was already a little confused about what I was looking at... -Alex I think I should have checked the compat_ver passed into the version above. if (struct_ver = compat_ver) { ceph_decode_8_safe(p, end, struct_compat, fail); if (curr_ver struct_compat) +ceph_decode_8_safe(p, end, struct_compat, fail); +if (curr_ver struct_compat) +return -EINVAL; +} + +if (struct_ver = len_ver) { +ceph_decode_32_safe(p, end, *len, fail); +} else { +*len = 0; +} + +return 0; +fail: +return -ERANGE; +} + + #define ceph_encode_need(p, end, n, bad)\ do {\ if (!likely(ceph_has_room(p, end, n)))\ -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] libceph: properly release STAT request's raw_data_in
On 04/27/2015 03:24 AM, Yan, Zheng wrote: Signed-off-by: Yan, Zheng z...@redhat.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- net/ceph/osd_client.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 41a4abc..b93531f 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -296,6 +296,9 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req, case CEPH_OSD_OP_CMPXATTR: ceph_osd_data_release(op-xattr.osd_data); break; + case CEPH_OSD_OP_STAT: + ceph_osd_data_release(op-raw_data_in); + break; default: break; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/7] ceph-client: some refactoring patches
Most of this series is just comment changes and a few minor naming tweaks. The last two patches actually change code, but the resulting behavior should be identical to before. These have been compile tested only. They (especially the last two) need to run through teuthology. The series is based on the current testing branch: 5eeb14f ceph: check OSD caps before read/write -Alex Alex Elder (7): messenger: update state diagram messenger: add some clarifying comments messenger: rename out_kvec_left field osd_client: don't supply connection to handle_reply() osd_client: improve safety of osd_req_op_data() macro messenger: encapsulate grabbing incoming message messenger: con_work() is not really a loop include/linux/ceph/messenger.h | 2 +- net/ceph/messenger.c | 254 +++-- net/ceph/osd_client.c | 16 +-- 3 files changed, 156 insertions(+), 116 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] messenger: update state diagram
Update the fabulous state diagram in messenger.c to reflect the way the code operates. Two new transitions are added: CLOSED - CLOSED CLOSING - CLOSING Signed-off-by: Alex Elder el...@linaro.org --- net/ceph/messenger.c | 51 +-- 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 967080a..9dfdb20 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -46,32 +46,31 @@ * | con_sock_state_init() * v * -- - * | CLOSED | initialized, but no socket (and no - * -- TCP connection) - * ^ \ - * | \ con_sock_state_connecting() - * |-- - * | \ - * + con_sock_state_closed() \ - * |+---\ - * | \ \\ - * | ---\\ - * | | CLOSING | socket event; \\ - * | --- await close \\ - * | ^\ | - * | | \ | - * | + con_sock_state_closing() \ | - * | / \ | | - * | / --- | | - * |/ \ v v - * | /-- - * | /-| CONNECTING | socket created, TCP - * | | / -- connect initiated - * | | | con_sock_state_connected() - * | | v - * - - * | CONNECTED | TCP connection established - * - + * | CLOSED | initialized, but no socket (and no TCP connection) + * -- + * ^ \ \ con_sock_state_connecting() + * | / + * +--\ + * |\ cen_sock_state_closed() \ + * | +--\ + * | \ \\ + * | --- \\ + * | | CLOSING | socket event;\\ + * | --- await close \\ + * |/ ^\\ + * |\ | \ | + * | --+ con_sock_state_closing() \ | + * | / \ | | + * | / -- | | + * | / \ | v + * |/-- + * | /-| CONNECTING | socket created, TCP + * | | / -- connect initiated + * | | | con_sock_state_connected() + * | | v + * - + * | CONNECTED | TCP connection established + * - * * State values for ceph_connection-sock_state; NEW is assumed to be 0. */ -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] messenger: con_work() is not really a loop
In con_work(), a while(true) loop is used to contain a block of code. The normal path through that code does *not* actually loop. The only time it loops is if a socket read or write operation returns -EAGAIN. So the use of the loop control structure is a little misleading. Restructure that block so it's *not* a loop, and use goto calls rather than a loop to implement the control flow. Signed-off-by: Alex Elder el...@linaro.org --- net/ceph/messenger.c | 84 +--- 1 file changed, 41 insertions(+), 43 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index ec60c23..053c5f3 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2845,54 +2845,52 @@ static void con_work(struct work_struct *work) struct ceph_connection *con = container_of(work, struct ceph_connection, work.work); bool fault; + int ret; mutex_lock(con-mutex); - while (true) { - int ret; - - if ((fault = con_sock_closed(con))) { - dout(%s: con %p SOCK_CLOSED\n, __func__, con); - break; - } - if (con_backoff(con)) { - dout(%s: con %p BACKOFF\n, __func__, con); - break; - } - if (con-state == CON_STATE_STANDBY) { - dout(%s: con %p STANDBY\n, __func__, con); - break; - } - if (con-state == CON_STATE_CLOSED) { - dout(%s: con %p CLOSED\n, __func__, con); - BUG_ON(con-sock); - break; - } - if (con-state == CON_STATE_PREOPEN) { - dout(%s: con %p PREOPEN\n, __func__, con); - BUG_ON(con-sock); - } - - ret = try_read(con); - if (ret 0) { - if (ret == -EAGAIN) - continue; - if (!con-error_msg) - con-error_msg = socket error on read; - fault = true; - break; - } +again: + fault = con_sock_closed(con); + if (fault) { + dout(%s: con %p SOCK_CLOSED\n, __func__, con); + goto done; + } + if (con_backoff(con)) { + dout(%s: con %p BACKOFF\n, __func__, con); + goto done; + } + if (con-state == CON_STATE_STANDBY) { + dout(%s: con %p STANDBY\n, __func__, con); + goto done; + } + if (con-state == CON_STATE_CLOSED) { + dout(%s: con %p CLOSED\n, __func__, con); + BUG_ON(con-sock); + goto done; + } + if (con-state == CON_STATE_PREOPEN) { + dout(%s: con %p PREOPEN\n, __func__, con); + BUG_ON(con-sock); + } - ret = try_write(con); - if (ret 0) { - if (ret == -EAGAIN) - continue; - if (!con-error_msg) - con-error_msg = socket error on write; - fault = true; - } + ret = try_read(con); + if (ret 0) { + if (ret == -EAGAIN) + goto again; + if (!con-error_msg) + con-error_msg = socket error on read; + fault = true; + goto done; + } - break; /* If we make it to here, we're done */ + ret = try_write(con); + if (ret 0) { + if (ret == -EAGAIN) + goto again; + if (!con-error_msg) + con-error_msg = socket error on write; + fault = true; } +done: if (fault) con_fault(con); mutex_unlock(con-mutex); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/7] messenger: encapsulate grabbing incoming message
There are currently three places in the messenger code that grab a connection's incoming message (i.e., get the message--if any--and and replace the connection's incoming message with a null pointer). Create a new function ceph_con_in_msg_grab() to encapsulate this operation, returning what had been the connection's incoming message. The connection's reference to the message is transferred to the caller. In the new function, test for a few null pointers before using them, in case of pathological errors. Do not call BUG_ON() if the message's connection pointer is bad. Instead, report an error, then drop and ignore the message. Take care to use the right connection's put operation. Establish the naming convention that in_msg is a pointer to a message that was a connection's incoming message. (This affects some additional code in process_message().) Signed-off-by: Alex Elder el...@linaro.org --- net/ceph/messenger.c | 88 +++- 1 file changed, 60 insertions(+), 28 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index a0d2673..ec60c23 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -650,21 +650,55 @@ static void ceph_msg_remove_list(struct list_head *head) } } +/* + * Grab the given connection's incoming message. + */ +static struct ceph_msg *ceph_con_in_msg_grab(struct ceph_connection *con) +{ + struct ceph_msg *in_msg; + struct ceph_connection *in_msg_con; + + if (!con-in_msg) + return NULL; + + /* +* Grab the incoming message and its connection, then update +* their pointers to each other. The connection's reference +* to the message is transferred to the caller unless there +* is an error. +*/ + in_msg = con-in_msg; + con-in_msg = NULL; + in_msg_con = in_msg-con; + in_msg-con = NULL; + + if (in_msg_con in_msg-con-ops) + in_msg_con-ops-put(in_msg_con); + + if (in_msg_con == con) + return in_msg; + + pr_err(incoming message %p with bad connection (%p != %p)\n, + in_msg, in_msg_con, con); + ceph_msg_put(in_msg); + + return NULL; +} + static void reset_connection(struct ceph_connection *con) { + struct ceph_msg *in_msg; + /* reset connection, out_queue, msg_ and connect_seq */ /* discard existing out_queue and msg_seq */ dout(reset_connection %p\n, con); ceph_msg_remove_list(con-out_queue); ceph_msg_remove_list(con-out_sent); - if (con-in_msg) { - BUG_ON(con-in_msg-con != con); - con-in_msg-con = NULL; - ceph_msg_put(con-in_msg); - con-in_msg = NULL; - con-ops-put(con); - } + /* If there is an incoming message, grab it and release it */ + in_msg = ceph_con_in_msg_grab(con); + if (in_msg) + ceph_msg_put(in_msg); con-connect_seq = 0; con-out_seq = 0; @@ -2428,30 +2462,29 @@ static int read_partial_message(struct ceph_connection *con) */ static void process_message(struct ceph_connection *con) { - struct ceph_msg *msg; + struct ceph_msg *in_msg = ceph_con_in_msg_grab(con); - BUG_ON(con-in_msg-con != con); - con-in_msg-con = NULL; - msg = con-in_msg; - con-in_msg = NULL; - con-ops-put(con); + if (!in_msg) { + pr_err(no message to process); + return; + } /* if first message, set peer_name */ if (con-peer_name.type == 0) - con-peer_name = msg-hdr.src; + con-peer_name = in_msg-hdr.src; con-in_seq++; mutex_unlock(con-mutex); dout(= %p %llu from %s%lld %d=%s len %d+%d (%u %u %u) =\n, -msg, le64_to_cpu(msg-hdr.seq), -ENTITY_NAME(msg-hdr.src), -le16_to_cpu(msg-hdr.type), -ceph_msg_type_name(le16_to_cpu(msg-hdr.type)), -le32_to_cpu(msg-hdr.front_len), -le32_to_cpu(msg-hdr.data_len), +in_msg, le64_to_cpu(in_msg-hdr.seq), +ENTITY_NAME(in_msg-hdr.src), +le16_to_cpu(in_msg-hdr.type), +ceph_msg_type_name(le16_to_cpu(in_msg-hdr.type)), +le32_to_cpu(in_msg-hdr.front_len), +le32_to_cpu(in_msg-hdr.data_len), con-in_front_crc, con-in_middle_crc, con-in_data_crc); - con-ops-dispatch(con, msg); + con-ops-dispatch(con, in_msg); mutex_lock(con-mutex); } @@ -2876,6 +2909,8 @@ static void con_work(struct work_struct *work) */ static void con_fault(struct ceph_connection *con) { + struct ceph_msg *in_msg; + dout(fault %p state %lu to peer %s\n, con, con-state, ceph_pr_addr(con-peer_addr.in_addr)); @@ -2895,13 +2930,10 @@ static void con_fault(struct ceph_connection *con) return
[PATCH 2/7] messenger: add some clarifying comments
In con_sock_closed() we test *and* clear the socket closed connection flag. It's not immediately obvious why the flag needs to be cleared. Add a comment so that's a little clearer. Similarly, we only want to do backoff processing once each time the backoff flag is set. Add a comment about that as well. Signed-off-by: Alex Elder el...@linaro.org --- net/ceph/messenger.c | 9 + 1 file changed, 9 insertions(+) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 9dfdb20..37b0fa7 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2733,6 +2733,10 @@ static void cancel_con(struct ceph_connection *con) static bool con_sock_closed(struct ceph_connection *con) { + /* +* We want to handle the event of the socket being +* closed once, so clear the flag in case it is set. +*/ if (!con_flag_test_and_clear(con, CON_FLAG_SOCK_CLOSED)) return false; @@ -2764,6 +2768,11 @@ static bool con_backoff(struct ceph_connection *con) { int ret; + /* +* See if we need to back off. We only want to do this +* processing once each time the flag is set, so clear the +* flag after it's tested. +*/ if (!con_flag_test_and_clear(con, CON_FLAG_BACKOFF)) return false; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/7] messenger: rename out_kvec_left field
The name out_kvec_left in the ceph_connection structure is sort of ambiguous. It represents the number of the entries of the out_kvec[] array that are filled with data to send. The name can easily be misinterpreted to mean the number of such entries that are available. Change the name to out_kvec_used to avoid misunderstanding. Signed-off-by: Alex Elder el...@linaro.org --- include/linux/ceph/messenger.h | 2 +- net/ceph/messenger.c | 22 +++--- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index e154994..c457f13 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -230,7 +230,7 @@ struct ceph_connection { struct kvec out_kvec[8], /* sending header/footer data */ *out_kvec_cur; - int out_kvec_left; /* kvec's left in out_kvec */ + int out_kvec_used; /* kvec's used in out_kvec */ int out_skip;/* skip this many bytes */ int out_kvec_bytes; /* total bytes left */ bool out_kvec_is_msg; /* kvec refers to out_msg */ diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 37b0fa7..a0d2673 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -773,7 +773,7 @@ static u32 get_global_seq(struct ceph_messenger *msgr, u32 gt) static void con_out_kvec_reset(struct ceph_connection *con) { - con-out_kvec_left = 0; + con-out_kvec_used = 0; con-out_kvec_bytes = 0; con-out_kvec_cur = con-out_kvec[0]; } @@ -783,12 +783,12 @@ static void con_out_kvec_add(struct ceph_connection *con, { int index; - index = con-out_kvec_left; + index = con-out_kvec_used; BUG_ON(index = ARRAY_SIZE(con-out_kvec)); con-out_kvec[index].iov_len = size; con-out_kvec[index].iov_base = data; - con-out_kvec_left++; + con-out_kvec_used++; con-out_kvec_bytes += size; } @@ -1194,7 +1194,7 @@ static void prepare_message_data(struct ceph_msg *msg, u32 data_len) static void prepare_write_message_footer(struct ceph_connection *con) { struct ceph_msg *m = con-out_msg; - int v = con-out_kvec_left; + int v = con-out_kvec_used; m-footer.flags |= CEPH_MSG_FOOTER_COMPLETE; @@ -1213,7 +1213,7 @@ static void prepare_write_message_footer(struct ceph_connection *con) con-out_kvec[v].iov_len = sizeof(m-old_footer); con-out_kvec_bytes += sizeof(m-old_footer); } - con-out_kvec_left++; + con-out_kvec_used++; con-out_more = m-more_to_follow; con-out_msg_done = true; } @@ -1462,7 +1462,7 @@ static int write_partial_kvec(struct ceph_connection *con) dout(write_partial_kvec %p %d left\n, con, con-out_kvec_bytes); while (con-out_kvec_bytes 0) { ret = ceph_tcp_sendmsg(con-sock, con-out_kvec_cur, - con-out_kvec_left, con-out_kvec_bytes, + con-out_kvec_used, con-out_kvec_bytes, con-out_more); if (ret = 0) goto out; @@ -1472,10 +1472,10 @@ static int write_partial_kvec(struct ceph_connection *con) /* account for full iov entries consumed */ while (ret = con-out_kvec_cur-iov_len) { - BUG_ON(!con-out_kvec_left); + BUG_ON(!con-out_kvec_used); ret -= con-out_kvec_cur-iov_len; con-out_kvec_cur++; - con-out_kvec_left--; + con-out_kvec_used--; } /* and for a partially-consumed entry */ if (ret) { @@ -1483,12 +1483,12 @@ static int write_partial_kvec(struct ceph_connection *con) con-out_kvec_cur-iov_base += ret; } } - con-out_kvec_left = 0; + con-out_kvec_used = 0; con-out_kvec_is_msg = false; ret = 1; out: dout(write_partial_kvec %p %d left in %d kvecs ret = %d\n, con, -con-out_kvec_bytes, con-out_kvec_left, ret); +con-out_kvec_bytes, con-out_kvec_used, ret); return ret; /* done! */ } @@ -2497,7 +2497,7 @@ more_kvec: if (ret = 0) goto out; } - if (con-out_kvec_left) { + if (con-out_kvec_used) { ret = write_partial_kvec(con); if (ret = 0) goto out; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7] osd_client: don't supply connection to handle_reply()
The connection parameter in handle_reply() is never used, so get rid of it. Signed-off-by: Alex Elder el...@linaro.org --- net/ceph/osd_client.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index e3c2e8b..6539dfc 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1748,8 +1748,7 @@ static void complete_request(struct ceph_osd_request *req) * handle osd op reply. either call the callback if it is specified, * or do the completion to wake up the waiting thread. */ -static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg, -struct ceph_connection *con) +static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg) { void *p, *end; struct ceph_osd_request *req; @@ -2796,7 +2795,7 @@ static void dispatch(struct ceph_connection *con, struct ceph_msg *msg) ceph_osdc_handle_map(osdc, msg); break; case CEPH_MSG_OSD_OPREPLY: - handle_reply(osdc, msg, con); + handle_reply(osdc, msg); break; case CEPH_MSG_WATCH_NOTIFY: handle_watch_notify(osdc, msg); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
On 04/24/2015 08:22 AM, Douglas Fuller wrote: Support multiple class op calls in one ceph_msg and consolidate rbd header read and refresh processes to use this feature to reduce the number of ceph_msgs sent for that process. Refresh features on header refresh and begin returning EIO if features have changed since mapping. I think I understand the motivation for what you're doing here. You ultimately want to pack more ops into a message, which would be really great. Ideally you could pack some arbitrary number (much larger than 3), but it seems to me there'll be some restructuring required to get there. I am having a hard time understanding your intended implementation though. I've looked mainly at your first patch, and scanned the other two. Let's see if I have it right. You are adding another hunk of data to a class request, and using it as an intermediate buffer for receiving data. Currently a class request has three hunks: - request_info, a pagelist data item, which contains the class name and method name, and perhaps someday, an argument list (?). - request_data, which can contain any number of data items of any type, and constitutes outgoing data for a write request. - response_data, also any number of any type of data items, which constitutes space into which incoming data is placed. You are adding a fourth: - chain_data, which is a page vector big enough to hold all incoming response data. When building up an OSD request, we call osd_req_encode_op() for every op in the request. For a class op, that records the length of the class and method names in the op structure, and then appends those names to the outbound request_data. Following that, the real request data (if any) provided with the class op is appended to request_data. Next (before your patches) whatever buffer space was provided (as a page array, pagelist, or bio) to receive response data is appended to response_data. Prior to this, the class op will have been set up with adequate buffers to directly receive the expected data. What your first patch seems to do is circumvent this last step, and instead of using the prearranged buffers, you allocate another page vector, big enough to hold the combined size of all response_data data items, and store that in the new chain_data field of an OSD request class op. (Note that this could be in *any* of the ops in an OSD request, though I suspect it's always the first.) When a response arrives, if the *first* op in the request array is CEPH_OSD_OP_CALL, you split the data, which will have been deposited in the chain_data page vector. That process involves allocating another buffer sufficient to hold the entirety of the received data. You copy the received data into that buffer and release the chain_data page vector. And then you copy the data from this new buffer back into the original response array, and then free the buffer. This sounds pretty expensive. For every class operation you are copying the received data two extra times. It's possible I'm not understanding what you're doing here and that's why I'm writing this now. Before I provide any more specific commentary on your patches, I want to be sure I understand what you're trying to do. If my understanding *is* correct, I would say you're going about this the wrong way, and I may have some suggestions for a better approach. Will you please correct me where I'm wrong above, and maybe give a little better high-level explanation of what you're trying to do? I see in patch 1 you mention a problem with short reads, and there may be a simpler fix than that (to begin with). But beyond that, if you're trying to incorporate more ops in a message, there may be better ways to go about that as well. Thanks. -Alex v2: Edit history and address comments from Mike Christie. Douglas Fuller (3): ceph: support multiple class method calls in one ceph_msg rbd: combine object method calls in header refresh using fewer ceph_msgs rbd: re-read features during header refresh and detect changes. drivers/block/rbd.c | 512 +--- include/linux/ceph/osd_client.h | 3 +- net/ceph/messenger.c| 4 + net/ceph/osd_client.c | 90 ++- 4 files changed, 462 insertions(+), 147 deletions(-) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2 0/3] rbd: header read/refresh improvements
On 04/26/2015 11:35 PM, Douglas Fuller wrote: This solution just feels hacky and inefficient, so I think there is a desire to feel like we at least tried to come up something simpler and more efficient before proceeding. Right. And somehow more fitting with the existing code, if that's possible. There are a few things in this proposal that are just special case code, and it would be nice to avoid that. I'm sorry I've been too busy today to get back to this. I spent a lot of time on Saturday getting ramped back up on this code so I could provide a decent review... I'm going to at least look at the code (error on a read op) before going to bed, so I at least know what the problem is. I'll try to have something to say (even if it's I've got nothing) in the next day or two. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] rbd: header read/refresh improvements
On 04/23/2015 02:06 PM, Douglas Fuller wrote: Support multiple class op calls in one ceph_msg and consolidate rbd header read and refresh processes to use this feature to reduce the number of ceph_msgs sent for that process. Refresh features on header refresh and begin returning EIO if features have changed since mapping. Douglas Fuller (3): ceph: support multiple class method calls in one ceph_msg rbd: combine object method calls in header refresh using fewer ceph_msgs rbd: re-read features during header refresh and detect changes. drivers/block/rbd.c | 518 +--- include/linux/ceph/osd_client.h | 3 +- net/ceph/messenger.c| 4 + net/ceph/osd_client.c | 92 ++- 4 files changed, 470 insertions(+), 147 deletions(-) In case Ilya or others don't get to it soon, I plan to review this series tomorrow. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] rbd: header read/refresh improvements
On 04/24/2015 09:40 AM, Douglas Fuller wrote: On Apr 24, 2015, at 10:17 AM, Ilya Dryomov idryo...@gmail.com wrote: On Fri, Apr 24, 2015 at 4:11 PM, Alex Elder el...@ieee.org wrote: On 04/23/2015 02:06 PM, Douglas Fuller wrote: Support multiple class op calls in one ceph_msg and consolidate rbd header read and refresh processes to use this feature to reduce the number of ceph_msgs sent for that process. Refresh features on header refresh and begin returning EIO if features have changed since mapping. Douglas Fuller (3): ceph: support multiple class method calls in one ceph_msg rbd: combine object method calls in header refresh using fewer ceph_msgs rbd: re-read features during header refresh and detect changes. drivers/block/rbd.c | 518 +--- include/linux/ceph/osd_client.h | 3 +- net/ceph/messenger.c | 4 + net/ceph/osd_client.c | 92 ++- 4 files changed, 470 insertions(+), 147 deletions(-) In case Ilya or others don't get to it soon, I plan to review this series tomorrow. I was planning take a look while I'm the road during the weekend. Doug, from a quick look this revision still has a bunch of style issues, most notably the alignment of function parameters and braces around if / else. See Documentation/CodingStyle in the kernel tree for examples. I needed to put out v2 in part because I squashed a couple fixup commits in the wrong place, leaving some things behind in #2 that were corrected in #3. I changed the braces in that version, but the function parameter indents are inconsistent throughout the code. I’ll try to come up with a compromise. When in doubt, lean toward the style used in the rest of the kernel. I used a few conventions that are not consistent with that in a lot of places, and those can be gradually phased toward what's recommended for the kernel. Some examples are: sizeof x or sizeof (x) -- sizeof(x) (cast) foo -- (cast)foo White space under comment blocks static int\nfunction(...) - static int function(...) -Alex You might also want to run your patches through scripts/checkpatch.pl, but take it with a grain of salt - it can be a bit too extreme at times. No need to post v3 with just style fixes, wait for more feedback. Thanks again for all feedback. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: kfree() in put_osd() shouldn't depend on authorizer
On 02/18/2015 07:27 AM, Ilya Dryomov wrote: a255651d4cad (ceph: ensure auth ops are defined before use) made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it. You are generous in suggesting it's a mechanical mistake. But it is a mistake nevertheless. The fix looks good. Reviewed-by: Alex Elder el...@linaro.org Cc: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov idryo...@gmail.com --- net/ceph/osd_client.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index f693a2f8ac86..41a4abc7e98e 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1035,10 +1035,11 @@ static void put_osd(struct ceph_osd *osd) { dout(put_osd %p %d - %d\n, osd, atomic_read(osd-o_ref), atomic_read(osd-o_ref) - 1); - if (atomic_dec_and_test(osd-o_ref) osd-o_auth.authorizer) { + if (atomic_dec_and_test(osd-o_ref)) { struct ceph_auth_client *ac = osd-o_osdc-client-monc.auth; - ceph_auth_destroy_authorizer(ac, osd-o_auth.authorizer); + if (osd-o_auth.authorizer) + ceph_auth_destroy_authorizer(ac, osd-o_auth.authorizer); kfree(osd); } } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: fix double __remove_osd() problem
On 02/18/2015 07:25 AM, Ilya Dryomov wrote: It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows: osd3 is idle, on the osd lru list con reset - osd3 con_fault_finish() osd_reset() osdmap - osd3 down ceph_osdc_handle_map() takes map_sem kick_requests() takes request_mutex reset_changed_osds() __reset_osd() __remove_osd() releases request_mutex releases map_sem takes map_sem takes request_mutex __kick_osd_requests() __reset_osd() __remove_osd() -- !!! A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 So now you rely on the RB node in the osd getting cleared as a signal that it has been removed already. Yes, that sounds like refcounting isn't working as desired... The mutex around all calls to (now) remove_osd() avoids races. I like the RB_CLEAR_NODE() call anyway. OK, enough chit chat. This looks OK to me. Reviewed-by: Alex Elder el...@linaro.org Cc: Sage Weil sw...@redhat.com Cc: sta...@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd() Cc: sta...@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts Cc: sta...@vger.kernel.org # 3.9+ Signed-off-by: Ilya Dryomov idryo...@gmail.com --- net/ceph/osd_client.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 53299c7b0ca4..f693a2f8ac86 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1048,14 +1048,24 @@ static void put_osd(struct ceph_osd *osd) */ static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) { - dout(__remove_osd %p\n, osd); + dout(%s %p osd%d\n, __func__, osd, osd-o_osd); WARN_ON(!list_empty(osd-o_requests)); WARN_ON(!list_empty(osd-o_linger_requests)); - rb_erase(osd-o_node, osdc-osds); list_del_init(osd-o_osd_lru); - ceph_con_close(osd-o_con); - put_osd(osd); + rb_erase(osd-o_node, osdc-osds); + RB_CLEAR_NODE(osd-o_node); +} + +static void remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) +{ + dout(%s %p osd%d\n, __func__, osd, osd-o_osd); + + if (!RB_EMPTY_NODE(osd-o_node)) { + ceph_con_close(osd-o_con); + __remove_osd(osdc, osd); + put_osd(osd); + } } static void remove_all_osds(struct ceph_osd_client *osdc) @@ -1065,7 +1075,7 @@ static void remove_all_osds(struct ceph_osd_client *osdc) while (!RB_EMPTY_ROOT(osdc-osds)) { struct ceph_osd *osd = rb_entry(rb_first(osdc-osds), struct ceph_osd, o_node); - __remove_osd(osdc, osd); + remove_osd(osdc, osd); } mutex_unlock(osdc-request_mutex); } @@ -1106,7 +1116,7 @@ static void remove_old_osds(struct ceph_osd_client *osdc) list_for_each_entry_safe(osd, nosd, osdc-osd_lru, o_osd_lru) { if (time_before(jiffies, osd-lru_ttl)) break; - __remove_osd(osdc, osd); + remove_osd(osdc, osd); } mutex_unlock(osdc-request_mutex); } @@ -1121,8 +1131,7 @@ static int __reset_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) dout(__reset_osd %p osd%d\n, osd, osd-o_osd); if (list_empty(osd-o_requests) list_empty(osd-o_linger_requests)) { - __remove_osd(osdc, osd); - + remove_osd(osdc, osd); return -ENODEV; } @@ -1926,6 +1935,7 @@ static void reset_changed_osds(struct ceph_osd_client *osdc) { struct rb_node *p, *n; + dout(%s %p\n, __func__, osdc); for (p = rb_first(osdc-osds); p; p = n) { struct ceph_osd *osd = rb_entry(p, struct ceph_osd, o_node); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] rbd: fix rbd_dev_parent_get() when parent_overlap == 0
On 01/20/2015 06:41 AM, Ilya Dryomov wrote: The comment for rbd_dev_parent_get() said * We must get the reference before checking for the overlap to * coordinate properly with zeroing the parent overlap in * rbd_dev_v2_parent_info() when an image gets flattened. We * drop it again if there is no overlap. but the drop it again if there is no overlap part was missing from the implementation. This lead to absurd parent_ref values for images with parent_overlap == 0, as parent_ref was incremented for each img_request and virtually never decremented. You're right about this. If the image had a parent with no overlap this would leak a reference to the parent image. The code should have said: counter = atomic_inc_return_safe(rbd_dev-parent_ref); if (counter 0) { if (rbd_dev-parent_overlap) return true; atomic_dec(rbd_dev-parent_ref); } else if (counter 0) { rbd_warn(rbd_dev, parent reference overflow); } Fix this by leveraging the fact that refresh path calls rbd_dev_v2_parent_info() under header_rwsem and use it for read in rbd_dev_parent_get(), instead of messing around with atomics. Get rid of barriers in rbd_dev_v2_parent_info() while at it - I don't see what they'd pair with now and I suspect we are in a pretty miserable situation as far as proper locking goes regardless. The point of the memory barrier was to ensure that when parent_overlap gets zeroed, this code sees the zero rather than the old non-zero value. The atomic_inc_return_safe() call has an implicit memory barrier to match the smp_mb() call. It allowed the synchronization to occur without the use of a lock. We're trying to atomically determine whether an image request needs to be marked as layered, to know how to handle ENOENT on parent reads. If it is a write to an image with a parent having a non-zero overlap, it's layered, otherwise we can treat it as a simple request. I think in this particular case, this is just an optimization, trying very hard to avoid having to do layered image handling if the parent has become flattened. I think that even if it got old information (suggesting non-zero overlap) things would behave correctly, just less efficiently. Using the semaphore adds a lock to this path and therefore implements whatever barriers are being removed. I'm not sure how often this is hit--maybe the optimization isn't buying much after all. I am getting a little rusty on some of details of what precisely happens when a layered image gets flattened. But I think this looks OK. Maybe just watch for small (perhaps insignificant) performance regressions with this change in place... Reviewed-by: Alex Elder el...@linaro.org Cc: sta...@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov idryo...@redhat.com --- drivers/block/rbd.c | 20 ++-- 1 file changed, 6 insertions(+), 14 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 31fa00f0d707..2990a1c75159 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2098,32 +2098,26 @@ static void rbd_dev_parent_put(struct rbd_device *rbd_dev) * If an image has a non-zero parent overlap, get a reference to its * parent. * - * We must get the reference before checking for the overlap to - * coordinate properly with zeroing the parent overlap in - * rbd_dev_v2_parent_info() when an image gets flattened. We - * drop it again if there is no overlap. - * * Returns true if the rbd device has a parent with a non-zero * overlap and a reference for it was successfully taken, or * false otherwise. */ static bool rbd_dev_parent_get(struct rbd_device *rbd_dev) { - int counter; + int counter = 0; if (!rbd_dev-parent_spec) return false; - counter = atomic_inc_return_safe(rbd_dev-parent_ref); - if (counter 0 rbd_dev-parent_overlap) - return true; - - /* Image was flattened, but parent is not yet torn down */ + down_read(rbd_dev-header_rwsem); + if (rbd_dev-parent_overlap) + counter = atomic_inc_return_safe(rbd_dev-parent_ref); + up_read(rbd_dev-header_rwsem); if (counter 0) rbd_warn(rbd_dev, parent reference overflow); - return false; + return counter 0; } /* @@ -4238,7 +4232,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) */ if (rbd_dev-parent_overlap) { rbd_dev-parent_overlap = 0; - smp_mb(); rbd_dev_parent_put(rbd_dev); pr_info(%s: clone image has been flattened\n, rbd_dev-disk-disk_name); @@ -4284,7 +4277,6 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) * treat it specially. */ rbd_dev-parent_overlap = overlap
Re: [PATCH 3/3] rbd: do not treat standalone as flatten
On 01/20/2015 06:41 AM, Ilya Dryomov wrote: If the clone is resized down to 0, it becomes standalone. If such resize is carried over while an image is mapped we would detect this and call rbd_dev_parent_put() which means let go of all parent state, including the spec(s) of parent images(s). This leads to a mismatch between rbd info and sysfs parent fields, so a fix is in order. # rbd create --image-format 2 --size 1 foo # rbd snap create foo@snap # rbd snap protect foo@snap # rbd clone foo@snap bar # DEV=$(rbd map bar) # rbd resize --allow-shrink --size 0 bar # rbd resize --size 1 bar # rbd info bar | grep parent parent: rbd/foo@snap Before: # cat /sys/bus/rbd/devices/0/parent (no parent image) After: # cat /sys/bus/rbd/devices/0/parent pool_id 0 pool_name rbd image_id 10056b8b4567 image_name foo snap_id 2 snap_name snap overlap 0 Signed-off-by: Ilya Dryomov idryo...@redhat.com Hmm. Interesting. I think that a parent with an overlap of 0 is of no real use. So in the last patch I was suggesting it should just go away. But now, looking at it from this perspective, the fact that an image *came from* a particular parent, but which has no more overlap, could be useful information. The parent shouldn't simply go away without the user requesting that. I haven't completely followed through the logic of keeping the reference around but I understand what you're doing and it looks OK to me. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 30 ++ 1 file changed, 10 insertions(+), 20 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b85d52005a21..e818c2a6ffb1 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4273,32 +4273,22 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) } /* - * We always update the parent overlap. If it's zero we - * treat it specially. + * We always update the parent overlap. If it's zero we issue + * a warning, as we will proceed as if there was no parent. */ - rbd_dev-parent_overlap = overlap; if (!overlap) { - - /* A null parent_spec indicates it's the initial probe */ - if (parent_spec) { - /* - * The overlap has become zero, so the clone - * must have been resized down to 0 at some - * point. Treat this the same as a flatten. - */ - rbd_dev_parent_put(rbd_dev); - pr_info(%s: clone image now standalone\n, - rbd_dev-disk-disk_name); + /* refresh, careful to warn just once */ + if (rbd_dev-parent_overlap) + rbd_warn(rbd_dev, + clone now standalone (overlap became 0)); } else { - /* - * For the initial probe, if we find the - * overlap is zero we just pretend there was - * no parent image. - */ - rbd_warn(rbd_dev, ignoring parent with overlap 0); + /* initial probe */ + rbd_warn(rbd_dev, clone is standalone (overlap 0)); } } + rbd_dev-parent_overlap = overlap; + out: ret = 0; out_err: -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] rbd: drop parent_ref in rbd_dev_unprobe() unconditionally
On 01/20/2015 06:41 AM, Ilya Dryomov wrote: This effectively reverts the last hunk of 392a9dad7e77 (rbd: detect when clone image is flattened). The problem with parent_overlap != 0 condition is that it's possible and completely valid to have an image with parent_overlap == 0 whose parent state needs to be cleaned up on unmap. The next commit, which drops the clone image now standalone logic, opens up another window of opportunity to hit this, but even without it # cat parent-ref.sh #!/bin/bash rbd create --image-format 2 --size 1 foo rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd resize --allow-shrink --size 0 bar rbd resize --size 1 bar DEV=$(rbd map bar) rbd unmap $DEV leaves rbd_device/rbd_spec/etc and rbd_client along with ceph_client hanging around. I'm not sure why the last reference to the parent doesn't get dropped (and state cleaned up) as soon as the overlap becomes 0. I suspect it's the original reference taken when there's a parent, we don't get rid of it until it's torn down. (I think we should.) It seems to me the test here should be for a non-null parent_spec pointer rather than non-zero parent_overlap. And that's done inside rbd_dev_parent_put(), so your change looks reasonable to me. Reviewed-by: Alex Elder el...@linaro.org My thinking behind calling rbd_dev_parent_put() unconditionally is that there shouldn't be any requests in flight at that point in time as we are deep into unmap sequence. Hence, even if rbd_dev_unparent() caused by flatten is delayed by in-flight requests, it will have finished by the time we reach rbd_dev_unprobe() caused by unmap, thus turning unconditional rbd_dev_parent_put() into a no-op. Fixes: http://tracker.ceph.com/issues/10352 Cc: sta...@vger.kernel.org # 3.11+ Signed-off-by: Ilya Dryomov idryo...@redhat.com --- drivers/block/rbd.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 2990a1c75159..b85d52005a21 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5075,10 +5075,7 @@ static void rbd_dev_unprobe(struct rbd_device *rbd_dev) { struct rbd_image_header *header; - /* Drop parent reference unless it's already been done (or none) */ - - if (rbd_dev-parent_overlap) - rbd_dev_parent_put(rbd_dev); + rbd_dev_parent_put(rbd_dev); /* Free dynamic fields from the header, then zero it out */ -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: convert to blk-mq
On 01/10/2015 12:31 PM, Christoph Hellwig wrote: This converts the rbd driver to use the blk-mq infrastructure. Except for switching to a per-request work item this is almost mechanical. This was tested by Alexandre DERUMIER in November, and found to give him 12 iops, although the only comparism available was an old 3.10 kernel which gave 8iops. I'm coming up to speed with the blk-mq stuff only now. It looks like requests are sent to the driver via -queue_rq() rather than the driver taking them via blk_fetch_request(q). Previously we would pull as many requests as were available, put them on the device's request queue, and then activate the rbd workqueue to handle them one-by-one using rbd_handle_request(). Now, the rbd queue_rq method rbd_request_workfn() adds the request to the rbd workqueue directly. The work_struct implicitly follows the request structure (which is set up by the blk-mq code). We have to do the REQ_TYPE_FS check at the time it's queued now, rather than when it's fetched from the queue. And finally we now have to tell the blk-mq subsystem when we've started and ended a request. I didn't follow up on all the tag_set initialization values so I assume you got that right (it looks reasonable to me). Given the above, it looks like everything else should work about the same as before, we're just handed requests rather than asking for them. With this patch applied, rbd_device-rq_queue is no longer needed so you should delete it. I got two warnings about endo-of-line whitespace in your patch. And I have one other very small suggestion below. Other than those things, this looks great to me. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Christoph Hellwig h...@lst.de --- drivers/block/rbd.c | 118 +--- 1 file changed, 67 insertions(+), 51 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3ec85df..52cd677 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c . . . (The following is in the new rbd_queue_rq().) + queue_work(rbd_wq, work); + return 0; return BLK_MQ_RQ_QUEUE_OK; (Because the symbolic values are explicitly checked by the caller.) } /* . . . -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: don't treat CEPH_OSD_OP_DELETE as extent op
On 11/24/2014 03:59 AM, Ilya Dryomov wrote: CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such. This sneaked in with discard patches - it's one of the three osd ops (the other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard is implemented with. Signed-off-by: Ilya Dryomov idryo...@redhat.com Is the CEPH_OSD_OP_DELETE used in ceph_zero_partial_object() an extent op? If it is not, you should get rid of the BUG_ON() in osd_req_op_extent_init() that allows CEPH_OSD_OP_DELETE. And if that's the case it looks like that function or ceph_osdc_new_request() handle CEPH_OSD_OP_DELETE properly--so it's not treated as an extent op. And: osd_req_encode_op() encodes a CEPH_OSD_OP_DELETE as an extent op as well. If it *can* be an extent op (but just not as used by RBD) then it warrants a comment here that explains why it is not being initialized as an extent op. -Alex --- drivers/block/rbd.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 27b71a0b72d0..1df0802bf6cb 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2370,8 +2370,12 @@ static void rbd_img_obj_request_fill(struct rbd_obj_request *obj_request, opcode = CEPH_OSD_OP_READ; } - osd_req_op_extent_init(osd_request, num_ops, opcode, offset, length, - 0, 0); + if (opcode == CEPH_OSD_OP_DELETE) + osd_req_op_init(osd_request, num_ops, opcode); + else + osd_req_op_extent_init(osd_request, num_ops, opcode, +offset, length, 0, 0); + if (obj_request-type == OBJ_REQUEST_BIO) osd_req_op_extent_osd_data_bio(osd_request, num_ops, obj_request-bio_list, length); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: krbd blk-mq support ?
On 10/28/2014 01:07 PM, Christoph Hellwig wrote: On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: Can you do a perf report -ag and then a perf report to see where these cycles are spent? Yes, sure. I have attached the perf report to this mail. (This is with kernel 3.14, don't have access to my 3.18 host for now) Oh, that's without the blk-mq patch? Either way the profile doesn't really sum up to a fully used up cpu. Sage, Alex - are there any ordring constraints in the rbd client? I don't remember off hand. In libceph I recall going to great lengths to retain the original order of requests when they got re-sent after a connection reset. I'll go look at the code a bit and see if I can refresh my memory (though Sage may answer before I do). -Alex If not we could probably aim for per-cpu queues using blk-mq and a socket per cpu or similar. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: krbd blk-mq support ?
On 10/28/2014 01:07 PM, Christoph Hellwig wrote: On Mon, Oct 27, 2014 at 11:00:46AM +0100, Alexandre DERUMIER wrote: Can you do a perf report -ag and then a perf report to see where these cycles are spent? Yes, sure. I have attached the perf report to this mail. (This is with kernel 3.14, don't have access to my 3.18 host for now) Oh, that's without the blk-mq patch? Either way the profile doesn't really sum up to a fully used up cpu. Sage, Alex - are there any ordring constraints in the rbd client? If not we could probably aim for per-cpu queues using blk-mq and a socket per cpu or similar. First, a disclaimer--I haven't really been following this discussion very closely. For an rbd image request (which is what gets created from requests from the block queue), the order of completion doesn't matter, and although the object requests are submitted in order that shouldn't be required either. The image request is broken into one or more object requests (usually just one) and they are treated as a unit. When the last object request of a set for an image request has completed, the image request is treated as completed. I hope that helps. If not, ask again a different way... -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] libceph: resend lingering requests with a new tid
On 09/05/2014 03:42 AM, Ilya Dryomov wrote: Both not yet registered (r_linger list_empty(r_linger_item)) and registered linger requests should use the new tid on resend to avoid the dup op detection logic on the OSDs, yet we were doing this only for registered case. Factor out and simplify the registered logic and use the new helper for not registered case as well. Fixes: http://tracker.ceph.com/issues/8806 The issue description describes the failure scenario nicely. These linger requests are a nice service to provide but they sure have proved tricky to get right... After reviewing almost everything I came up with one big question and I don't have time right now to investigate the answer so I hope you will. See below for the question in context. With that exception, the logic looks correct to me. I have some suggestions below for you to consider. Let me know what you intend to do, and if you are sure my big question is not an issue you can go ahead and add this: Reviewed-by: Alex Elder el...@linaro.org If not, please update and give me a chance to look at it again. Thanks. -Alex Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/osd_client.c | 75 - 1 file changed, 55 insertions(+), 20 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 8db10b4a6553..81ee046b24d0 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -30,8 +30,11 @@ static void __send_queued(struct ceph_osd_client *osdc); static int __reset_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd); static void __register_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req); +static void __unregister_request(struct ceph_osd_client *osdc, + struct ceph_osd_request *req); static void __unregister_linger_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req); +static void __enqueue_request(struct ceph_osd_request *req); static void __send_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req); @@ -892,6 +895,39 @@ __lookup_request_ge(struct ceph_osd_client *osdc, return NULL; } +static void __kick_linger_request(struct ceph_osd_request *req, + bool lingering) I think committed or maybe acknowledged might be a better name than lingering. All of them are marked to linger, so something other than lingering might be a little clearer. +{ + struct ceph_osd_client *osdc = req-r_osdc; + struct ceph_osd *osd = req-r_osd; + + /* + * Linger requests need to be resent with a new tid to avoid + * the dup op detection logic on the OSDs. Achieve this with + * a re-register dance instead of open-coding. This comment is more oriented toward this particular commit rather than what's really going on in the code. I.e., you should instead sayRe-register each request--whether or not we know it's been committed to disk on the OSD. If we ever sent a linger request we must assume it's been committed, so even un-acknowledged linger requests need a new TID. or something like that (or maybe something simpler). + */ + ceph_osdc_get_request(req); As a rule, when there's an if-else, I prefer to avoid negating the condition. It's a subtle style thing, but to me it makes it slightly easier to mentally parse (occasionally avoiding a double negative). There's another instance of this a little further down. + if (!lingering) + __unregister_request(osdc, req); + else + __unregister_linger_request(osdc, req); + __register_request(osdc, req); + ceph_osdc_put_request(req); + + /* + * Unless request has been registered as both normal and + * lingering, __unregister{,_linger}_request clears r_osd. + * However, here we need to preserve r_osd to make sure we + * requeue on the same OSD. + */ + WARN_ON(req-r_osd || !osd); + req-r_osd = osd; + + dout(requeueing %p tid %llu osd%d lingering=%d\n, req, req-r_tid, + osd-o_osd, lingering); + __enqueue_request(req); +} + /* * Resubmit requests pending on the given osd. */ @@ -900,12 +936,14 @@ static void __kick_osd_requests(struct ceph_osd_client *osdc, { struct ceph_osd_request *req, *nreq; LIST_HEAD(resend); + LIST_HEAD(resend_linger); int err; dout(__kick_osd_requests osd%d\n, osd-o_osd); err = __reset_osd(osdc, osd); if (err) return; + /* * Build up a list of requests to resend by traversing the * osd's list of requests. Requests for a given object are This block of comments (starting with the above but not shown here) should be updated to reflect the modified logic
Re: [PATCH] libceph: fix a memory leak in handle_watch_notify
On 09/11/2014 03:31 AM, Ilya Dryomov wrote: On Thu, Sep 11, 2014 at 5:41 AM, Alex Elder el...@ieee.org wrote: On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote: From: Li RongQing roy.qing...@gmail.com event_work should be freed when adding it to queue failed Signed-off-by: Li RongQing roy.qing...@gmail.com Looks good. Reviewed-by: Alex Elder el...@linaro.org Hmm, queue_work() returns %false if @work was already on a queue, %true otherwise, so this seems bogus to me. I'd go with something like this (mangled). The original change was fine. Whether it matters is another question. Your suggestion looks good as well, and on the assumption that if you choose to use it instead your real fix is done correctly you can use Reviewed-by: me if you like. -Alex From c0711eee447b199b1c2193460fce8c9d958f23f4 Mon Sep 17 00:00:00 2001 From: Ilya Dryomov ilya.dryo...@inktank.com Date: Thu, 11 Sep 2014 12:18:53 +0400 Subject: [PATCH] libceph: don't try checking queue_work() return value queue_work() doesn't fail to queue, it returns false if work was already on a queue, which can't happen here since we allocate event_work right before we queue it. So don't bother at all. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/osd_client.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 0f569d322405..952e9c254cc7 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2355,26 +2355,21 @@ static void handle_watch_notify(struct ceph_osd_client *osdc, if (event) { event_work = kmalloc(sizeof(*event_work), GFP_NOIO); if (!event_work) { - dout(ERROR: could not allocate event_work\n); - goto done_err; + pr_err(couldn't allocate event_work\n); + ceph_osdc_put_event(event); + return; } INIT_WORK(event_work-work, do_event_work); event_work-event = event; event_work-ver = ver; event_work-notify_id = notify_id; event_work-opcode = opcode; - if (!queue_work(osdc-notify_wq, event_work-work)) { - dout(WARNING: failed to queue notify event work\n); - goto done_err; - } + + queue_work(osdc-notify_wq, event_work-work); } return; -done_err: - ceph_osdc_put_event(event); - return; - bad: pr_err(osdc handle_watch_notify corrupt msg\n); } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: do not return -ERANGE on auth failure
On 09/11/2014 10:10 AM, Ilya Dryomov wrote: Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -ENIVAL sink, thus exposing rbd::get_id cls method return value. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com I often think people are annoyed by my explicit type casts all over the place. This (missed) one matters a lot. I think the -EINVAL was to ensure an error code that was expected by a write() call would be returned. In any case, this looks good to me. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4b97baf8afa3..fe3726c62a37 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev) ret = image_id ? 0 : -ENOMEM; if (!ret) rbd_dev-image_format = 1; - } else if (ret sizeof (__le32)) { + } else if (ret (int)sizeof(u32)) { void *p = response; image_id = ceph_extract_encoded_string(p, p + ret, @@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev) ret = PTR_ERR_OR_ZERO(image_id); if (!ret) rbd_dev-image_format = 2; - } else { - ret = -EINVAL; } if (!ret) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: do not return -ERANGE on auth failure
On 09/11/2014 11:17 AM, Ilya Dryomov wrote: On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder el...@ieee.org wrote: On 09/11/2014 10:10 AM, Ilya Dryomov wrote: Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -ENIVAL sink, thus exposing rbd::get_id cls method return value. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com I often think people are annoyed by my explicit type casts all over the place. This (missed) one matters a lot. I think the -EINVAL was to ensure an error code that was expected by a write() call would be returned. Yeah, the way it's written it's possible in theory to get a positive return value from rbd_dev_image_id(). Looking deeper, this sizeof() is not needed at all - ceph_extract_encoded_string() deals with short buffers as it should. As for the ret == sizeof(u32) (i.e. an empty string), neither userspace nor us check against empty strings in similar cases (object prefix, snapshot name, etc). I should have asked this before. Why is a permission error leading to ceph_extract_encoded_string() finding a short buffer? I didn't take the time to trace the error path you're talking about here all the way back. (I'm looking at your new patch in the mean time.) -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: do not return -ERANGE on auth failure
On 09/11/2014 11:27 AM, Ilya Dryomov wrote: I should have asked this before. Why is a permission error leading to ceph_extract_encoded_string() finding a short buffer? I didn't take the time to trace the error path you're talking about here all the way back. rbd_obj_method_sync() returns -EPERM, which when compared with size_t from sizeof() ends up a big positive. ceph_extract_encoded_string() is then called and the safe decode macro kicks in.. I somehow got lost along the way and thought rbd_obj_method_sync() was returning -ERANGE. The problem starts because rbd_obj_method_sync() returns an error other than -ENOENT, and that led me astray I think. Sorry for the confusion. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: do not return -ERANGE on auth failure
On 09/11/2014 11:17 AM, Ilya Dryomov wrote: On Thu, Sep 11, 2014 at 7:23 PM, Alex Elder el...@ieee.org wrote: On 09/11/2014 10:10 AM, Ilya Dryomov wrote: Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -ENIVAL sink, thus exposing rbd::get_id cls method return value. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com I often think people are annoyed by my explicit type casts all over the place. This (missed) one matters a lot. I think the -EINVAL was to ensure an error code that was expected by a write() call would be returned. Yeah, the way it's written it's possible in theory to get a positive return value from rbd_dev_image_id(). Looking deeper, this sizeof() is not needed at all - ceph_extract_encoded_string() deals with short buffers as it should. As for the ret == sizeof(u32) (i.e. an empty string), neither userspace nor us check against empty strings in similar cases (object prefix, snapshot name, etc). With the above in mind, how about this? From 3ded0a7fee82f2204c58b4fc00fc74f05331514d Mon Sep 17 00:00:00 2001 From: Ilya Dryomov ilya.dryo...@inktank.com Date: Thu, 11 Sep 2014 18:49:18 +0400 Subject: [PATCH] rbd: do not return -ERANGE on auth failures Trying to map an image out of a pool for which we don't have an 'x' permission bit fails with -ERANGE from ceph_extract_encoded_string() due to an unsigned vs signed bug. Fix it and get rid of the -EINVAL sink, thus propagating rbd::get_id cls method errors. (I've seen a bunch of unexplained -ERANGE reports, I bet this is it). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com So now we know that the value returned by rbd_dev_image_id() will be either 0 or a negative errno. It could still return something that write(2) isn't defined to return, but at least it's an error. That's OK with me... Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4b97baf8afa3..ce457db5d847 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4924,7 +4924,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev) ret = image_id ? 0 : -ENOMEM; if (!ret) rbd_dev-image_format = 1; - } else if (ret sizeof (__le32)) { + } else if (ret = 0) { void *p = response; image_id = ceph_extract_encoded_string(p, p + ret, @@ -4932,8 +4932,6 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev) ret = PTR_ERR_OR_ZERO(image_id); if (!ret) rbd_dev-image_format = 2; - } else { - ret = -EINVAL; } if (!ret) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: fix a memory leak in handle_watch_notify
On 09/10/2014 07:20 PM, roy.qing...@gmail.com wrote: From: Li RongQing roy.qing...@gmail.com event_work should be freed when adding it to queue failed Signed-off-by: Li RongQing roy.qing...@gmail.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- net/ceph/osd_client.c |1 + 1 file changed, 1 insertion(+) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 30f6faf..1e1b4f1 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2323,6 +2323,7 @@ static void handle_watch_notify(struct ceph_osd_client *osdc, event_work-opcode = opcode; if (!queue_work(osdc-notify_wq, event_work-work)) { dout(WARNING: failed to queue notify event work\n); + kfree(event_work); goto done_err; } } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -next] rbd: fix error return code in rbd_dev_device_setup()
On 08/13/2014 07:56 PM, weiyj...@163.com wrote: From: Wei Yongjun yongjun_...@trendmicro.com.cn Fix to return -ENOMEM from the workqueue alloc error handling case instead of 0, as done elsewhere in this function. Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn This looks good to me. Sorry I missed this when I first reviewed the workqueue code. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 623c841..f3be022 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5088,8 +5087,10 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev) set_disk_ro(rbd_dev-disk, rbd_dev-mapping.read_only); rbd_dev-rq_wq = alloc_workqueue(rbd_dev-disk-disk_name, 0, 0); - if (!rbd_dev-rq_wq) + if (!rbd_dev-rq_wq) { + ret = -ENOMEM; goto err_out_mapping; + } ret = rbd_bus_add_dev(rbd_dev); if (ret) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
On 08/08/2014 06:18 AM, Ilya Dryomov wrote: Determining -last_piece based on the value of -page_offset + length is incorrect because length here is the length of the entire message. -last_piece set to false even if page array data item length is = PAGE_SIZE, which results in invalid length passed to ceph_tcp_{send,recv}page() and causes various asserts to fire. I glanced at this this morning and thought it might have a problem. Now that I've had the time to look at it a little more closely I conclude it's fine. I'm going to explain my reasoning, which just reaffirms what you've explained. (Doing this helps me brush up on things, I'm starting to get rusty...) A message has a list of data items, and a data item is broken into pieces, each of which (for a pages data item) holds data from a single page. A cursor keeps the data item now being processed in cursor-data, and cursor-resid is the number of un-consumed bytes in the current data item. We want to set cursor-last_piece when the next piece to process is the last in the *data item*, not the last in the *message*, so we need to compare against cursor-resid, not the message length. Before your change, any message with more than one data item including a page array data item that was not last in the list would have problems. So in summary: Looks good. Reviewed-by: Alex Elder el...@linaro.org # cat pages-cursor-init.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=$FOO_DEV bs=1M /dev/null rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar # rbd_resize calls librbd rbd_resize(), size is in bytes ./rbd_resize bar $(((4 20) + 512)) rbd resize --size 10 bar BAR_DEV=$(rbd map bar) # trigger a 512-byte copyup -- 512-byte page array data item dd if=/dev/urandom of=$BAR_DEV bs=1M count=1 seek=5 The problem exists only in ceph_msg_data_pages_cursor_init(), ceph_msg_data_pages_advance() does the right thing. The size_t cast is unnecessary. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/messenger.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index e51cad0db580..b2f571dd933d 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -901,7 +901,7 @@ static void ceph_msg_data_pages_cursor_init(struct ceph_msg_data_cursor *cursor, BUG_ON(page_count (int)USHRT_MAX); cursor-page_count = (unsigned short)page_count; BUG_ON(length SIZE_MAX - cursor-page_offset); - cursor-last_piece = (size_t)cursor-page_offset + length = PAGE_SIZE; + cursor-last_piece = cursor-page_offset + cursor-resid = PAGE_SIZE; } static struct page * -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
On 08/07/2014 04:40 AM, Ilya Dryomov wrote: Now that rbd_img_request_create() is called from work functions, no need to use GFP_ATOMIC. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4515b128d0b4..a5ebcf28e041 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2061,7 +2061,7 @@ static struct rbd_img_request *rbd_img_request_create( { struct rbd_img_request *img_request; - img_request = kmem_cache_alloc(rbd_img_request_cache, GFP_ATOMIC); + img_request = kmem_cache_alloc(rbd_img_request_cache, GFP_NOIO); if (!img_request) return NULL; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: rework rbd_request_fn()
On 08/05/2014 02:38 AM, Ilya Dryomov wrote: While it was never a good idea to sleep in request_fn(), commit 34c6bc2c919a (locking/mutexes: Add extra reschedule point) made it a *bad* idea. mutex_lock() since 3.15 may reschedule *before* putting task on the mutex wait queue, which for tasks in !TASK_RUNNING state means block forever. request_fn() may be called with !TASK_RUNNING on the way to schedule() in io_schedule(). Offload request handling to a workqueue, one per rbd device, to avoid calling blocking primitives from rbd_request_fn(). Off topic... If you supply --patience to your git diff command you'll get an easier-to-read result in some cases (like this one). If you like that you can just do: git config --global --add diff.algorithm patience I have several comments below, none of which are bugs, just suggestions. I was thinking you could use a single work queue for all rbd devices, but I'm not sure that's very important. You'd have to stash the rbd_dev pointer in every request somewhere. Reviewed-by: Alex Elder el...@linaro.org Cc: sta...@vger.kernel.org # 3.15+ Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 195 +++ 1 file changed, 118 insertions(+), 77 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index e3412317d3f9..e07d9d71b187 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -42,6 +42,7 @@ #include linux/blkdev.h #include linux/slab.h #include linux/idr.h +#include linux/workqueue.h #include rbd_types.h @@ -332,7 +333,10 @@ struct rbd_device { charname[DEV_NAME_LEN]; /* blkdev name, e.g. rbd3 */ + struct list_headrq_queue; /* incoming rq queue */ spinlock_t lock; /* queue, flags, open_count */ + struct workqueue_struct *rq_wq; + struct work_struct rq_work; struct rbd_image_header header; unsigned long flags; /* possibly lock protected */ @@ -3176,102 +3180,129 @@ out: return ret; } -static void rbd_request_fn(struct request_queue *q) - __releases(q-queue_lock) __acquires(q-queue_lock) +static void rbd_handle_request(struct rbd_device *rbd_dev, struct request *rq) { . . . + if (offset + length rbd_dev-mapping.size) { + rbd_warn(rbd_dev, beyond EOD (%llu~%llu %llu), offset, + length, rbd_dev-mapping.size); + result = -EIO; + goto err_rq; + } - spin_unlock_irq(q-queue_lock); + img_request = rbd_img_request_create(rbd_dev, offset, length, wr); This isn't new with your patch, but perhaps we should consider having rbd_img_request_create() use GFP_KERNEL or at least GFP_NOIO rather than GFP_ATOMIC. In general, I think a pass ought to be made through all the allocation calls to make sure they have the right flags. Some may be too restrictive, some maybe not enough. Anyway, a task for another day... + if (!img_request) { + result = -ENOMEM; + goto err_rq; + } + img_request-rq = rq; . . . +/* + * Called with q-queue_lock held and interrupts disabled, possibly on + * the way to schedule(). Do not sleep here! + */ +static void rbd_request_fn(struct request_queue *q) +{ + struct rbd_device *rbd_dev = q-queuedata; + struct request *rq; + int queued = 0; + + rbd_assert(rbd_dev); + + while ((rq = blk_fetch_request(q))) { + /* Ignore any non-FS requests that filter through. */ + if (rq-cmd_type != REQ_TYPE_FS) { + dout(%s: non-fs request type %d\n, __func__, + (int) rq-cmd_type); + __blk_end_request_all(rq, 0); + continue; } + + list_add_tail(rq-queuelist, rbd_dev-rq_queue); + queued++; if (!queued) queued++; Or alternately, make queued be a Boolean. My point is to guarantee there is no wrap (which is nearly but not quite impossible). } + + if (queued) + queue_work(rbd_dev-rq_wq, rbd_dev-rq_work); } /* . . . @@ -5067,6 +5105,8 @@ static int rbd_dev_device_setup(struct rbd_device *rbd_dev) return ret; +err_out_workqueue: + destroy_workqueue(rbd_dev-rq_wq); rbd_dev-rq_wq = NULL; err_out_mapping: rbd_dev_mapping_clear(rbd_dev); err_out_disk: @@ -5313,6 +5353,7 @@ static void rbd_dev_device_release(struct device *dev) { struct rbd_device *rbd_dev = dev_to_rbd_dev(dev); + destroy_workqueue(rbd_dev-rq_wq); rbd_free_disk(rbd_dev); clear_bit(RBD_DEV_FLAG_EXISTS, rbd_dev-flags); rbd_dev_mapping_clear(rbd_dev); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body
Re: [PATCH] rbd: rework rbd_request_fn()
On 08/06/2014 02:09 PM, Ilya Dryomov wrote: Off topic... If you supply --patience to your git diff command you'll get an easier-to-read result in some cases (like this one). If you like that you can just do: git config --global --add diff.algorithm patience I'm generally using (and used to review this patch in particular) --histogram. In my experience it gives slightly better results than --patience. Didn't think of putting it into .gitconfig though. Looks like --histogram beats --patience in this case too. alloc_workqueue() hunk is much better, otherwise identical. I didn't know about histogram... I guess I should update my git config. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/8] rbd: do not read in parent info before snap context
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: Currently rbd_dev_v2_header_info() reads in parent info before the snap context is read in. This is wrong, because we may need to look at the the parent_overlap value of the snapshot instead of that of the base I had some trouble understanding this. The parent image, snapshot id and overlap, are together a property of a particular image (and associated with its header object). Nothing about that is dependent on the child image snapshots or snapshot id. However... image, for example when mapping a snapshot - see next commit. (When mapping a snapshot, all we got is its name and we need the snap context to translate that name into an id to know which parent info to look for). I finally figured out what path through the code you were talking about. Here's what I see. On the initial probe (previously), we have: rbd_add() do_rbd_add() rbd_dev_image_probe() |rbd_dev_header_info() | |rbd_dev_v2_header_info() | | |rbd_dev_v2_parent_info() | | | -- expects rbd_dev-spec-snap_id to be valid | | |rbd_dev_v2_snap_context() | | | -- this fills in the snapshot context |rbd_spec_fill_snap_id() | | -- fills rbd_dev-spec-snap_id |rbd_dev_probe_parent() So clearly, at least when mapping a snapshot, we would not get the desired result. We'll be unable to look up the id for the named snapshot, so would get ENOENT. Now you've pulled getting the parent info back out into rbd_dev_image_probe(): rbd_add() do_rbd_add() rbd_dev_image_probe() |rbd_dev_header_info() | |rbd_dev_v2_header_info() | | |rbd_dev_v2_snap_context() | | | -- this fills in the snapshot context |rbd_spec_fill_snap_id() | | -- fills rbd_dev-spec-snap_id |rbd_dev_v2_parent_info() | | -- rbd_dev-spec-snap_id will be valid |rbd_dev_probe_parent() In the refresh path it's similar. You move the rbd_dev_v2_parent_info() call into rbd_dev_refresh() instead of it happening in rbd_dev_v2_header_info(). Missing the ordering problem here might have caused even more subtle problems (due to using an apparently valid but possibly out-of-date snapshot context). Given this understanding I'd say your change looks good. The approach taken here is to make sure rbd_dev_v2_parent_info() is called after the snap context has been read in. The other approach would be to add a parent_overlap field to struct rbd_mapping and maintain it the same way rbd_mapping::size is maintained. The reason I chose the first approach is that the value of keeping around both base image values and the actual mapping values is unclear to me. I'll think about this and respond to your followup e-mail. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 64 --- 1 file changed, 30 insertions(+), 34 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 16f388f2960b..c4606987e9d1 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -515,6 +515,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev); static int rbd_dev_refresh(struct rbd_device *rbd_dev); static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev); static int rbd_dev_header_info(struct rbd_device *rbd_dev); +static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev); static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev, u64 snap_id); static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, @@ -3513,6 +3514,18 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) mapping_size = rbd_dev-mapping.size; ret = rbd_dev_header_info(rbd_dev); This looked odd. But I guess I didn't notice you didn't check the return value when you first added this line a few patches ago... + if (ret) + return ret; + + /* + * If there is a parent, see if it has disappeared due to the + * mapped image getting flattened. + */ + if (rbd_dev-parent) { + ret = rbd_dev_v2_parent_info(rbd_dev); + if (ret) + return ret; + } if (rbd_dev-spec-snap_id == CEPH_NOSNAP) { if (rbd_dev-mapping.size != rbd_dev-header.image_size) @@ -3524,11 +3537,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) up_write(rbd_dev-header_rwsem); - if (mapping_size != rbd_dev-mapping.size) { + if (mapping_size != rbd_dev-mapping.size) rbd_dev_update_size(rbd_dev); - } - return ret; + return 0; } static int rbd_init_disk(struct rbd_device *rbd_dev) @@ -4477,33 +4489,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) return ret
Re: [PATCH 7/8] rbd: do not read in parent info before snap context
On 07/25/2014 03:36 AM, Ilya Dryomov wrote: + if (ret) + return ret; I'll move this bit to harden rbd_dev_refresh() commit. Sounds good.-Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo
On 07/25/2014 04:26 AM, Ilya Dryomov wrote: struct ceph_xattr - struct ceph_inode_xattr Looks good. I can't find the definition of ceph_xattr. Reviewed-by: Alex Elder el...@linaro.org Reported-by: Toralf Förster toralf.foers...@gmx.de Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- fs/ceph/xattr.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c index f89698cdbc41..12f58d22e017 100644 --- a/fs/ceph/xattr.c +++ b/fs/ceph/xattr.c @@ -592,7 +592,7 @@ start: xattr_version = ci-i_xattrs.version; spin_unlock(ci-i_ceph_lock); - xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *), + xattrs = kcalloc(numattr, sizeof(struct ceph_inode_xattr *), GFP_NOFS); err = -ENOMEM; if (!xattrs) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On 07/24/2014 10:10 AM, Ilya Dryomov wrote: On Thu, Jul 24, 2014 at 5:46 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote: On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote: On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: No. Previously it was updating the mapping size both on the inital map and on refresh (of the base image). Doing that on the inital map doesn't make sense: not only struct rbd_mapping fields aren't properly initialized at that point - rbd_dev_mapping_set() is called much later in the map sequence, snap_id isn't initialized either (at least in the format 2 case, I haven't looked too closely at the format 1 case). And just in general, trying to update something before it had a chance to change is bogus.. BTW, while we are on the subject of struct rbd_mapping, is there any particular reason to keep around both the base image values and actual mapping values? I am tempted to change the mapping sequence so that 1) snap context is read in immediately after watch setup, before anything else is done, 2) supplied snap name is resolved into an id and 3) the actual (i.e. based on snap_id) mapping size, features, parent_overlap, etc are read in directly into struct rbd_image_header. That would allow to rip struct rbd_mapping entirely and make the code more clear. The rbd_mapping structure started out well-intentioned but over time it seems clear it hasn't added the value it was intended to add. Here's where it got created: f84344f rbd: separate mapping info in rbd_dev The only original field that survives is read_only. There's no harm at all in just moving the fields in that structure out into the rbd_device structure. Preserving the base image size in the header structure is an artifact of the version 1 code, where it held the last version of the on-disk header data. The version 2 code maintains it for consistency. You could eliminate that I suppose. I think the result would require rbd_header_from_disk() to know about the mapping rather than doing a simple conversion of data from one form to another. I say try it, and if you like the result I probably will too... -Alex Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph: fix sizeof(struct tYpO *) typo
On 07/25/2014 08:27 AM, Ilya Dryomov wrote: On Fri, Jul 25, 2014 at 4:52 PM, Alex Elder el...@ieee.org wrote: On 07/25/2014 04:26 AM, Ilya Dryomov wrote: struct ceph_xattr - struct ceph_inode_xattr Looks good. I can't find the definition of ceph_xattr. That's the point of the patch ;) Well, I thought the point was that the type was wrong; I was surprised that the type was non-existent. -Alex Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] rbd: show the entire chain of parent images
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: Make /sys/bus/rbd/devices/id/parent show the entire chain of parent images. While at it, kernel sprintf() doesn't return negative values, casting to unsigned long long is no longer necessary and there is no good reason to split into multiple sprintf() calls. I like the use of a single snprintf() call, it is much less cumbersome. The reason I always cast u64 values to (unsigned long long) in printf() calls is because on some 64-bit architectures, u64 is defined as simply (unsigned long), so without the cast they spit out warnings. I hate this, but it is reality (see include/asm-generic/{int-l64.h,int-ll64.h}). You can alternatively use %PRIu64 rather than %llu in your format strings, but I think I hate that more. Anyway, if you want to avoid warnings on all architectures you should fix that. As another aside, I've been too timid to use the ?: conditional expression without its middle operand. I have no objection to it at all, I like it. I bring it up because I recently got burned for using a gcc feature that wasn't supported on older compilers (unnamed struct/union fields)--specifically a version newer than gcc 3.2, which is the minimum supported version for the kernel (see Documentation/Changes). But fear not! That extension is supported in gcc 3.2: https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Conditionals.html#Conditionals Just FYI... Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- Documentation/ABI/testing/sysfs-bus-rbd |4 +-- drivers/block/rbd.c | 56 +-- 2 files changed, 25 insertions(+), 35 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-rbd b/Documentation/ABI/testing/sysfs-bus-rbd index 501adc2a9ec7..2ddd680929d8 100644 --- a/Documentation/ABI/testing/sysfs-bus-rbd +++ b/Documentation/ABI/testing/sysfs-bus-rbd @@ -94,5 +94,5 @@ current_snap parent - Information identifying the pool, image, and snapshot id for - the parent image in a layered rbd image (format 2 only). + Information identifying the chain of parent images in a layered rbd + image. Entries are separated by empty lines. diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 703b728e05fa..7847fbb949ff 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3685,46 +3685,36 @@ static ssize_t rbd_snap_show(struct device *dev, } /* - * For an rbd v2 image, shows the pool id, image id, and snapshot id - * for the parent image. If there is no parent, simply shows - * (no parent image). + * For a v2 image, shows the chain of parent images, separated by empty + * lines. For v1 images or if there is no parent, shows (no parent + * image). */ static ssize_t rbd_parent_show(struct device *dev, - struct device_attribute *attr, - char *buf) +struct device_attribute *attr, +char *buf) { struct rbd_device *rbd_dev = dev_to_rbd_dev(dev); - struct rbd_spec *spec = rbd_dev-parent_spec; - int count; - char *bufp = buf; + ssize_t count = 0; - if (!spec) + if (!rbd_dev-parent) return sprintf(buf, (no parent image)\n); - count = sprintf(bufp, pool_id %llu\npool_name %s\n, - (unsigned long long) spec-pool_id, spec-pool_name); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, image_id %s\nimage_name %s\n, spec-image_id, - spec-image_name ? spec-image_name : (unknown)); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, snap_id %llu\nsnap_name %s\n, - (unsigned long long) spec-snap_id, spec-snap_name); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, overlap %llu\n, rbd_dev-parent_overlap); - if (count 0) - return count; - bufp += count; - - return (ssize_t) (bufp - buf); + for ( ; rbd_dev-parent; rbd_dev = rbd_dev-parent) { + struct rbd_spec *spec = rbd_dev-parent_spec; + + count += sprintf(buf[count], %s + pool_id %llu\npool_name %s\n + image_id %s\nimage_name %s\n + snap_id %llu\nsnap_name %s\n + overlap %llu\n, + !count ? : \n, /* first? */ + spec-pool_id, spec-pool_name, + spec-image_id, spec-image_name ?: (unknown), + spec-snap_id, spec-snap_name, + rbd_dev-parent_overlap); + } + + return count; } static ssize_t rbd_image_refresh(struct device *dev, -- To unsubscribe from this list: send the line
Re: [PATCH 2/8] rbd: introduce rbd_dev_header_info()
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 24 ++-- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 7847fbb949ff..0d3be608f16f 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -514,7 +514,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev); static int rbd_dev_refresh(struct rbd_device *rbd_dev); static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev); -static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev); +static int rbd_dev_header_info(struct rbd_device *rbd_dev); static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev, u64 snap_id); static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, @@ -3506,13 +3506,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) u64 mapping_size; int ret; - rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); down_write(rbd_dev-header_rwsem); mapping_size = rbd_dev-mapping.size; - if (rbd_dev-image_format == 1) - ret = rbd_dev_v1_header_info(rbd_dev); - else - ret = rbd_dev_v2_header_info(rbd_dev); + + ret = rbd_dev_header_info(rbd_dev); /* If it's a mapped snapshot, validate its EXISTS flag */ @@ -4501,6 +4498,16 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) return ret; } +static int rbd_dev_header_info(struct rbd_device *rbd_dev) +{ + rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); + + if (rbd_dev-image_format == 1) + return rbd_dev_v1_header_info(rbd_dev); + + return rbd_dev_v2_header_info(rbd_dev); +} + static int rbd_bus_add_dev(struct rbd_device *rbd_dev) { struct device *dev; @@ -5149,10 +5156,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) goto out_header_name; } - if (rbd_dev-image_format == 1) - ret = rbd_dev_v1_header_info(rbd_dev); - else - ret = rbd_dev_v2_header_info(rbd_dev); + ret = rbd_dev_header_info(rbd_dev); if (ret) goto err_out_watch; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] rbd: remove unnecessary asserts in rbd_dev_image_probe()
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: spec-image_id assert doesn't buy us much and image_format is asserted in rbd_dev_header_name() and rbd_dev_header_info() anyway. Looks good. Over time I'm sure there are more assertions that can go away; they were most useful during rapid development when confidence in certain things was lower. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 0d3be608f16f..4541f6027e4a 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5143,8 +5143,6 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) ret = rbd_dev_image_id(rbd_dev); if (ret) return ret; - rbd_assert(rbd_dev-spec-image_id); - rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); ret = rbd_dev_header_name(rbd_dev); if (ret) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] rbd: split rbd_dev_spec_update() into two functions
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: rbd_dev_spec_update() has two modes of operation, with nothing in common between them. Split it into two functions, one for each mode and make our expectations more clear. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 79 +++ 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4541f6027e4a..23df1773ef77 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3798,6 +3798,9 @@ static struct rbd_spec *rbd_spec_alloc(void) spec = kzalloc(sizeof (*spec), GFP_KERNEL); if (!spec) return NULL; + + spec-pool_id = CEPH_NOPOOL; + spec-snap_id = CEPH_NOSNAP; kref_init(spec-kref); return spec; @@ -4257,18 +4260,38 @@ static u64 rbd_snap_id_by_name(struct rbd_device *rbd_dev, const char *name) } /* - * When an rbd image has a parent image, it is identified by the - * pool, image, and snapshot ids (not names). This function fills - * in the names for those ids. (It's OK if we can't figure out the - * name for an image id, but the pool and snapshot ids should always - * exist and have names.) All names in an rbd spec are dynamically - * allocated. + * An image being mapped will have everything but the snap id. + */ +static int rbd_spec_fill_snap_id(struct rbd_device *rbd_dev) +{ + struct rbd_spec *spec = rbd_dev-spec; + + rbd_assert(spec-pool_id != CEPH_NOPOOL spec-pool_name); + rbd_assert(spec-image_id spec-image_name); + rbd_assert(spec-snap_name); + + if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) { + u64 snap_id; + + snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name); + if (snap_id == CEPH_NOSNAP) + return -ENOENT; + + spec-snap_id = snap_id; + } else { + spec-snap_id = CEPH_NOSNAP; + } + + return 0; +} + +/* + * A parent image will have all ids but none of the names. * - * When an image being mapped (not a parent) is probed, we have the - * pool name and pool id, image name and image id, and the snapshot - * name. The only thing we're missing is the snapshot id. + * All names in an rbd spec are dynamically allocated. It's OK if we + * can't figure out the name for an image id. */ -static int rbd_dev_spec_update(struct rbd_device *rbd_dev) +static int rbd_spec_fill_names(struct rbd_device *rbd_dev) { struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; struct rbd_spec *spec = rbd_dev-spec; @@ -4277,24 +4300,9 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) const char *snap_name; int ret; - /* - * An image being mapped will have the pool name (etc.), but - * we need to look up the snapshot id. - */ - if (spec-pool_name) { - if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) { - u64 snap_id; - - snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name); - if (snap_id == CEPH_NOSNAP) - return -ENOENT; - spec-snap_id = snap_id; - } else { - spec-snap_id = CEPH_NOSNAP; - } - - return 0; - } + rbd_assert(spec-pool_id != CEPH_NOPOOL); + rbd_assert(spec-image_id); + rbd_assert(spec-snap_id != CEPH_NOSNAP); /* Get the pool name; we have to make our own copy of this */ @@ -4313,7 +4321,7 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) if (!image_name) rbd_warn(rbd_dev, unable to get image name); - /* Look up the snapshot name, and make a copy */ + /* Fetch the snapshot name */ snap_name = rbd_snap_name(rbd_dev, spec-snap_id); if (IS_ERR(snap_name)) { @@ -4326,10 +4334,10 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) spec-snap_name = snap_name; return 0; + out_err: kfree(image_name); kfree(pool_name); - return ret; } @@ -5158,7 +5166,16 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) if (ret) goto err_out_watch; - ret = rbd_dev_spec_update(rbd_dev); + /* + * If this image is the one being mapped, we have pool name and + * id, image name and id, and snap name - need to fill snap id. + * Otherwise this is a parent image, identified by pool, image + * and snap ids - need to fill in names for those ids. + */ + if (mapping) + ret = rbd_spec_fill_snap_id(rbd_dev); + else + ret = rbd_spec_fill_names(rbd_dev); if (ret) goto err_out_probe
Re: [PATCH 5/8] rbd: harden rbd_dev_refresh() caller
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: Recently discovered watch/notify problems showed that we really can't ignore errors in anything refresh related. Alas, currently there is not much we can do in response to those errors, except print warnings. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 23df1773ef77..5dd6f5057ef4 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2963,11 +2963,20 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) dout(%s: \%s\ notify_id %llu opcode %u\n, __func__, rbd_dev-header_name, (unsigned long long)notify_id, (unsigned int)opcode); + + /* + * Until adequate refresh error handling is in place, there is + * not much we can do here, except warn. + * + * See http://tracker.ceph.com/issues/5040 + */ ret = rbd_dev_refresh(rbd_dev); if (ret) - rbd_warn(rbd_dev, header refresh error (%d)\n, ret); + rbd_warn(rbd_dev, refresh failed: %d\n, ret); - rbd_obj_notify_ack_sync(rbd_dev, notify_id); + ret = rbd_obj_notify_ack_sync(rbd_dev, notify_id); + if (ret) + rbd_warn(rbd_dev, notify_ack ret %d\n, ret); } /* @@ -3724,9 +3733,9 @@ static ssize_t rbd_image_refresh(struct device *dev, ret = rbd_dev_refresh(rbd_dev); if (ret) - rbd_warn(rbd_dev, : manual header refresh error (%d)\n, ret); + return ret; - return ret 0 ? ret : size; + return size; } static DEVICE_ATTR(size, S_IRUGO, rbd_size_show, NULL); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 5dd6f5057ef4..16f388f2960b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -971,12 +971,6 @@ static int rbd_header_from_disk(struct rbd_device *rbd_dev, header-snap_names = snap_names; header-snap_sizes = snap_sizes; - /* Make sure mapping size is consistent with header info */ - - if (rbd_dev-spec-snap_id == CEPH_NOSNAP || first_time) - if (rbd_dev-mapping.size != header-image_size) - rbd_dev-mapping.size = header-image_size; - return 0; out_2big: ret = -EIO; @@ -3520,9 +3514,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) ret = rbd_dev_header_info(rbd_dev); - /* If it's a mapped snapshot, validate its EXISTS flag */ + if (rbd_dev-spec-snap_id == CEPH_NOSNAP) { + if (rbd_dev-mapping.size != rbd_dev-header.image_size) + rbd_dev-mapping.size = rbd_dev-header.image_size; + } else { + /* validate mapped snapshot's EXISTS flag */ + rbd_exists_validate(rbd_dev); + } - rbd_exists_validate(rbd_dev); up_write(rbd_dev-header_rwsem); if (mapping_size != rbd_dev-mapping.size) { @@ -4505,10 +4504,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) is EXPERIMENTAL!); } - if (rbd_dev-spec-snap_id == CEPH_NOSNAP) - if (rbd_dev-mapping.size != rbd_dev-header.image_size) - rbd_dev-mapping.size = rbd_dev-header.image_size; - ret = rbd_dev_v2_snap_context(rbd_dev); dout(rbd_dev_v2_snap_context returned %d\n, ret); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On 07/24/2014 08:46 AM, Ilya Dryomov wrote: On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote: On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: No. Previously it was updating the mapping size both on the inital map and on refresh (of the base image). Doing that on the inital map doesn't make sense: not only struct rbd_mapping fields aren't properly initialized at that point - rbd_dev_mapping_set() is called much later in the map sequence, snap_id isn't initialized either (at least in the format 2 case, I haven't looked too closely at the format 1 case). And just in general, trying to update something before it had a chance to change is bogus.. OK, now I see it. Hopefully this makes sense. Here is how it was previously structured: rbd_add() do_rbd_add() |rbd_dev_image_probe() | rbd_dev_header_info() |rbd_dev_v1_header_info()orrbd_dev_v2_header_info() | rbd_header_from_disk()update mapping |update mapping | . . . |rbd_dev_device_setup() | rbd_dev_mapping_set() So neither rbd_header_from_disk() nor rbd_dev_v2_header_info() should be using the values in the rbd_dev-mapping field during initial image probe, because they have not yet been initialized at that point. Meanwhile, the only reason we need to *update* a mapping size is if we learn it may have changed, which will be covered by a refresh, so doing so in rbd_dev_refresh() makes sense. And as long as you know whether it's mapping the base image you might as well do the rbd_exists_validate() call conditionally. OK, this all looks good and makes good sense to me. Thank you for explaining it. Reviewed-by: Alex Elder el...@linaro.org Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] rbd: take snap_id into account when reading in parent info
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: If we are mapping a snapshot, we must read in the parent_overlap value of that snapshot instead of that of the base image. Not doing so may in particular result in us returning zeros instead of user data: # cat overlap-snap.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=/dev/rbd0 bs=1M /dev/null echo Base image dd if=$FOO_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd snap create bar@snap BAR_DEV=$(rbd map bar@snap) echo Snapshot dd if=$BAR_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd rbd resize --allow-shrink --size 4 bar echo Snapshot after base image resize dd if=$BAR_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd # ./overlap-snap.sh Base image 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K%`4(E..6. Snapshot 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K%`4(E..6. Resizing image: 100% complete...done. Snapshot after base image resize 000: e781 e33b d34b 2225 ...;.K% Even though bar@snap was taken with the old bar parent_overlap (8M), reads from bar@snap beyond the new bar parent_overlap (4M) return zeroes. Fix it. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index c4606987e9d1..cbc89fa9a677 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4020,7 +4020,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) goto out_err; } - snapid = cpu_to_le64(CEPH_NOSNAP); + snapid = cpu_to_le64(rbd_dev-spec-snap_id); Well that's just an outright bug. It's been there since the original commit that added parent support: 86b00e0 rbd: get parent spec for version 2 images Parent images *must* be snapshots, so this was never right. I bet that was hard to figure out... Looks good. Reviewed-by: Alex Elder el...@linaro.org ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name, rbd, get_parent, snapid, sizeof (snapid), -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted
On 07/08/2014 06:18 AM, Ilya Dryomov wrote: The only question that leaves me with is, does ceph_osdc_cancel_request() need to include the call to complete_request() that's present in ceph_osdc_wait_request()? I don't think so - I mentioned it in the ceph_osdc_cancel_request() function comment. ceph_osdc_cancel_request() is supposed to be used by higher layers - rbd, cephfs - and exactly because their completion logic is decoupled from libceph completions (as you have brilliantly explained above) it's the higher layers who should be taking care of it. IOW higher layers are in charge and are supposed to know what and when they are cancelling. I noticed that comment only after sending my message. RBD doesn't use the safe completion, only the FS client does, and I was pretty focused on RBD behavior while looking at this. I was trying to conceptualize how (from the perspective of the upper layer) the safe completion differs from the normal completion. It's possible that an I have your request (normal completion) *also* carries with it the your request has completed (safe completion) indication, but the higher layer caller has no way of knowing that. Maybe I should flip my question around, and ask, why should the ceph_osdc_cancel_request() include the call to complete_request()? The answer lies in details of the file system client, and I'm not in a position right now to dive into that. Whether it's called in ceph_osdc_cancel_request() or not has no effect on RBD. Anyway, your response is fine with me, thank you. -Alex -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()
On 07/08/2014 06:15 AM, Ilya Dryomov wrote: Are you OK with your Reviewed-by for this patch? Reviewed-by: Alex Elder el...@linaro.org -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()
On 06/30/2014 09:34 AM, Ilya Dryomov wrote: On Mon, Jun 30, 2014 at 5:39 PM, Alex Elder el...@ieee.org wrote: On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Introduce ceph_osdc_cancel_request() intended for canceling requests from the higher layers (rbd and cephfs). Because higher layers are in charge and are supposed to know what and when they are canceling, the request is not completed, only unref'ed and removed from the libceph data structures. This seems reasonable. But you make two changes here that I'd like to see a little more explanation on. They seem significant enough to warrant a little more attention in the commit description. - __cancel_request() is no longer called when ceph_osdc_wait_request() fails due to an an interrupt. This is my main concern, and I plan to work through it but I'm in a small hurry right now. Perhaps it should have been a separate commit. __unregister_request() revokes r_request, so I opted for not trying to do it twice. As for OK, that makes sense--revoking the request is basically all __cancel_request() does anyway. You ought to have mentioned that in the description anyway, even if it wasn't a separate commit. the r_sent condition and assignment, it doesn't seem to make much of a difference, given that the request is about to be unregistered anyway. If the request is getting canceled (from a caller outside libceph) it can't take into account whether it was sent or not. As you said, it is getting revoked unconditionally by __unregister_request(). To be honest though, *that* revoke call should have been zeroing the r_sent value also. It apparently won't matter, but I think it's wrong. The revoke drops a reference, it doesn't necessarily free the structure (though I expect it may always do so anyway). - __unregister_linger_request() is now called when a request is canceled, but it wasn't before. (Since we're consistent about setting the r_linger flag this should be fine, but it *is* a behavior change.) The general goal here is to make ceph_osdc_cancel_request() cancel *any* request correctly, so if r_linger is set, which means that the request in question *could* be lingering, __unregister_linger_request() is called. The goal is good. Note that __unregister_linger_request() drops a reference to the request though. There are three other callers of this function. Two of them drop a reference that had just been added by a call to __register_request(). The other one is in ceph_osdc_unregister_linger_request(), initiated by a higher layer. In that last case, r_linger will be zeroed, so calling it again should be harmless. I guess I'm going to move on... I expect all of this discussion will be moot and you will have made everything work right and better by the time I get to the end of the series. -Alex Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/14] rbd: rbd_obj_request_wait() should cancel the request if interrupted
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: rbd_obj_request_wait() should cancel the underlying OSD request if interrupted. Otherwise libceph will hold onto it indefinitely, causing assert failures or leaking the original object request. At first I didn't understand this. Let me see if I've got it now, though. Each OSD request has a completion associated with it. An OSD request is started via ceph_osdc_start_request(), which registers the request and takes a reference to it. One can call ceph_osdc_wait_request() after the request has been successfully started. Whether the wait succeeds or not, by the time ceph_osdc_wait_request() returns the request should have been cleaned up, back to the state it was in before the start_request call. That means the request needs to be unregistered and its reference dropped, etc. Similarly, each RBD object request has a completion associated with it. It is distinct from the OSD request associated with the RBD object request because there may be more to do for RBD request to complete than just complete one object request. An RBD object request is started by a call to rbd_obj_request_submit(), and once that's successfully returned, one can wait for it to complete using a call to rbd_obj_request_wait(). And as above, that call should return state to (roughly) where it was before the submit call, whether the wait request succeeded or not. Now, RBD doesn't actually wait for its object requests to complete--all its OSD requests complete asynchronously. Instead, it relies on the OSD client to call the callback function (always rbd_osd_req_callback()) when it has completed. That function will lead to the RBD request's completion being signaled when appropriate. So... What happens when an interrupt occurs after rbd_obj_request_submit() has returned successfully? That function is a simple wrapper for ceph_osdc_start_request(), so a successful return means the request was mapped and put on a target's unsent list (or the OSD client's no target list). Nobody waits for the OSD request, so an interrupt won't get the benefit of the cleanup done in ceph_osdc_wait_request(). Therefore the RBD layer needs to be responsible for doing that. In other words, when rbd_obj_request_wait() gets an interrupt while waiting for the completion, it needs to clean up (end) the interrupted request, and rbd_obj_request_end() sounds right. And what that cleanup function should do is basically the same as what ceph_osdc_wait_request() should do in that situation, which is call ceph_osdc_cancel_request(). The only question that leaves me with is, does ceph_osdc_cancel_request() need to include the call to complete_request() that's present in ceph_osdc_wait_request()? I'll leave it at that. I think I've convinced myself this is a good change. So I await your confirmation that I understand it right, and your answer to my question above. But in any case you can consider this: Reviewed-by: Alex Elder el...@linaro.org Sorry it's taking me so long to get through these. This also adds an rbd wrapper around ceph_osdc_cancel_request() to match rbd_obj_request_submit() and rbd_obj_request_wait(). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 39 --- 1 file changed, 28 insertions(+), 11 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b2c98c1bc037..20147aec86f3 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1527,11 +1527,37 @@ static bool obj_request_type_valid(enum obj_request_type type) static int rbd_obj_request_submit(struct ceph_osd_client *osdc, struct rbd_obj_request *obj_request) { - dout(%s: osdc %p obj %p\n, __func__, osdc, obj_request); - + dout(%s %p\n, __func__, obj_request); return ceph_osdc_start_request(osdc, obj_request-osd_req, false); } +static void rbd_obj_request_end(struct rbd_obj_request *obj_request) +{ + dout(%s %p\n, __func__, obj_request); + ceph_osdc_cancel_request(obj_request-osd_req); +} + +/* + * Wait for an object request to complete. If interrupted, cancel the + * underlying osd request. + */ +static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) +{ + int ret; + + dout(%s %p\n, __func__, obj_request); + + ret = wait_for_completion_interruptible(obj_request-completion); + if (ret 0) { + dout(%s %p interrupted\n, __func__, obj_request); + rbd_obj_request_end(obj_request); + return ret; + } + + dout(%s %p done\n, __func__, obj_request); + return 0; +} + static void rbd_img_request_complete(struct rbd_img_request *img_request) { @@ -1558,15 +1584,6 @@ static void rbd_img_request_complete(struct rbd_img_request *img_request) rbd_img_request_put(img_request); } -/* Caller is responsible for rbd_obj_request_destroy(obj_request
Re: [PATCH 13/14] libceph: nuke ceph_osdc_unregister_linger_request()
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Remove now unused ceph_osdc_unregister_linger_request(). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- include/linux/ceph/osd_client.h |2 -- net/ceph/osd_client.c | 12 2 files changed, 14 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index de09cad7b7c7..03aeb27fcc69 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -325,8 +325,6 @@ extern struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *, extern void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc, struct ceph_osd_request *req); -extern void ceph_osdc_unregister_linger_request(struct ceph_osd_client *osdc, - struct ceph_osd_request *req); extern void ceph_osdc_get_request(struct ceph_osd_request *req); extern void ceph_osdc_put_request(struct ceph_osd_request *req); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index ef44cc0bbc6a..30f6faf3584f 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1281,18 +1281,6 @@ static void __unregister_linger_request(struct ceph_osd_client *osdc, ceph_osdc_put_request(req); } -void ceph_osdc_unregister_linger_request(struct ceph_osd_client *osdc, - struct ceph_osd_request *req) -{ - mutex_lock(osdc-request_mutex); - if (req-r_linger) { - req-r_linger = 0; - __unregister_linger_request(osdc, req); - } - mutex_unlock(osdc-request_mutex); -} -EXPORT_SYMBOL(ceph_osdc_unregister_linger_request); - void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc, struct ceph_osd_request *req) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 12/14] rbd: use rbd_obj_watch_request_helper() helper
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Switch rbd_dev_header_{un,}watch_sync() to use the new helper and fix rbd_dev_header_unwatch_sync() to destroy watch_request structures before queuing watch-remove message while at it. This mistake slipped into commit b30a01f2a307 (rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()) and could lead to image still in use errors on image removal. You can see why--even if they were entangled--both setting up and taking down the watch request were done in the same function. So much of what they do is common, just slightly different. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 115 --- 1 file changed, 17 insertions(+), 98 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 02cf7aba7679..d99aa81774f8 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3040,130 +3040,49 @@ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev) if (ret 0) return ret; - rbd_assert(rbd_dev-watch_event); - - obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0, - OBJ_REQUEST_NODATA); - if (!obj_request) { - ret = -ENOMEM; - goto out_cancel; + obj_request = rbd_obj_watch_request_helper(rbd_dev, true); + if (IS_ERR(obj_request)) { + ceph_osdc_cancel_event(rbd_dev-watch_event); + rbd_dev-watch_event = NULL; + return PTR_ERR(obj_request); } - obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, 1, - obj_request); - if (!obj_request-osd_req) { - ret = -ENOMEM; - goto out_put; - } - - ceph_osdc_set_request_linger(osdc, obj_request-osd_req); - - osd_req_op_watch_init(obj_request-osd_req, 0, CEPH_OSD_OP_WATCH, - rbd_dev-watch_event-cookie, 0, 1); - rbd_osd_req_format_write(obj_request); - - ret = rbd_obj_request_submit(osdc, obj_request); - if (ret) - goto out_linger; - - ret = rbd_obj_request_wait(obj_request); - if (ret) - goto out_linger; - - ret = obj_request-result; - if (ret) - goto out_linger; - /* * A watch request is set to linger, so the underlying osd * request won't go away until we unregister it. We retain * a pointer to the object request during that time (in - * rbd_dev-watch_request), so we'll keep a reference to - * it. We'll drop that reference (below) after we've - * unregistered it. + * rbd_dev-watch_request), so we'll keep a reference to it. + * We'll drop that reference after we've unregistered it in + * rbd_dev_header_unwatch_sync(). */ rbd_dev-watch_request = obj_request; return 0; - -out_linger: - ceph_osdc_unregister_linger_request(osdc, obj_request-osd_req); -out_put: - rbd_obj_request_put(obj_request); -out_cancel: - ceph_osdc_cancel_event(rbd_dev-watch_event); - rbd_dev-watch_event = NULL; - - return ret; } /* * Tear down a watch request, synchronously. */ -static int __rbd_dev_header_unwatch_sync(struct rbd_device *rbd_dev) +static void rbd_dev_header_unwatch_sync(struct rbd_device *rbd_dev) { - struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; struct rbd_obj_request *obj_request; - int ret; rbd_assert(rbd_dev-watch_event); rbd_assert(rbd_dev-watch_request); - obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0, - OBJ_REQUEST_NODATA); - if (!obj_request) { - ret = -ENOMEM; - goto out_cancel; - } - - obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, 1, - obj_request); - if (!obj_request-osd_req) { - ret = -ENOMEM; - goto out_put; - } - - osd_req_op_watch_init(obj_request-osd_req, 0, CEPH_OSD_OP_WATCH, - rbd_dev-watch_event-cookie, 0, 0); - rbd_osd_req_format_write(obj_request); - - ret = rbd_obj_request_submit(osdc, obj_request); - if (ret) - goto out_put; - - ret = rbd_obj_request_wait(obj_request); - if (ret) - goto out_put; - - ret = obj_request-result; - if (ret) - goto out_put; - - /* We have successfully torn down the watch request */ - - ceph_osdc_unregister_linger_request(osdc, - rbd_dev-watch_request-osd_req); + rbd_obj_request_end(rbd_dev-watch_request); rbd_obj_request_put(rbd_dev-watch_request); rbd_dev-watch_request
Re: [PATCH 11/14] rbd: add rbd_obj_watch_request_helper() helper
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: In the past, rbd_dev_header_watch_sync() used to handle both watch and unwatch requests and was entangled and leaky. Commit b30a01f2a307 (rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()) split it into two separate functions. This commit cleanly abstracts the common bits, relying on the fixed rbd_obj_request_wait(). Adding this without calling it leads to an unused function warning in the build, I'm sure. You could probably squash this into the next patch. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 53 +++ 1 file changed, 53 insertions(+) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 20147aec86f3..02cf7aba7679 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2971,6 +2971,59 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) } /* + * Send a (un)watch request and wait for the ack. Return a request + * with a ref held on success or error. + */ +static struct rbd_obj_request *rbd_obj_watch_request_helper( + struct rbd_device *rbd_dev, + bool watch) +{ + struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; + struct rbd_obj_request *obj_request; + int ret; + + obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0, + OBJ_REQUEST_NODATA); + if (!obj_request) + return ERR_PTR(-ENOMEM); + + obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, 1, + obj_request); + if (!obj_request-osd_req) { + ret = -ENOMEM; + goto out; + } + + osd_req_op_watch_init(obj_request-osd_req, 0, CEPH_OSD_OP_WATCH, + rbd_dev-watch_event-cookie, 0, watch); + rbd_osd_req_format_write(obj_request); + + if (watch) + ceph_osdc_set_request_linger(osdc, obj_request-osd_req); + + ret = rbd_obj_request_submit(osdc, obj_request); + if (ret) + goto out; + + ret = rbd_obj_request_wait(obj_request); + if (ret) + goto out; + + ret = obj_request-result; + if (ret) { + if (watch) + rbd_obj_request_end(obj_request); + goto out; + } + + return obj_request; + +out: + rbd_obj_request_put(obj_request); + return ERR_PTR(ret); +} + +/* * Initiate a watch request, synchronously. */ static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 14/14] libceph: drop osd ref when canceling con work
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: queue_con() bumps osd ref count. We should do the reverse when canceling con work. Kind of unrelated to the rest of the series, but it looks good. Good to have a same-level-of-abstraction function for it as well. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/messenger.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 8bffa5b90fef..e51cad0db580 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -174,6 +174,7 @@ static struct lock_class_key socket_class; #define SKIP_BUF_SIZE1024 static void queue_con(struct ceph_connection *con); +static void cancel_con(struct ceph_connection *con); static void con_work(struct work_struct *); static void con_fault(struct ceph_connection *con); @@ -680,7 +681,7 @@ void ceph_con_close(struct ceph_connection *con) reset_connection(con); con-peer_global_seq = 0; - cancel_delayed_work(con-work); + cancel_con(con); con_close_socket(con); mutex_unlock(con-mutex); } @@ -2667,19 +2668,16 @@ static int queue_con_delay(struct ceph_connection *con, unsigned long delay) { if (!con-ops-get(con)) { dout(%s %p ref count 0\n, __func__, con); - return -ENOENT; } if (!queue_delayed_work(ceph_msgr_wq, con-work, delay)) { dout(%s %p - already queued\n, __func__, con); con-ops-put(con); - return -EBUSY; } dout(%s %p %lu\n, __func__, con, delay); - return 0; } @@ -2688,6 +2686,14 @@ static void queue_con(struct ceph_connection *con) (void) queue_con_delay(con, 0); } +static void cancel_con(struct ceph_connection *con) +{ + if (cancel_delayed_work(con-work)) { + dout(%s %p\n, __func__, con); + con-ops-put(con); + } +} + static bool con_sock_closed(struct ceph_connection *con) { if (!con_flag_test_and_clear(con, CON_FLAG_SOCK_CLOSED)) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: do not leak image_id in rbd_dev_v2_parent_info()
On 06/30/2014 04:45 AM, Ilya Dryomov wrote: image_id is leaked if the parent happens to have been recorded already. Fix it. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index d99aa81774f8..adedb393b374 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4072,6 +4072,8 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) parent_spec-snap_id = snap_id; rbd_dev-parent_spec = parent_spec; parent_spec = NULL; /* rbd_dev now owns this */ + } else { + kfree(image_id); } /* -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: handle parent_overlap on writes correctly
On 06/11/2014 11:40 AM, Ilya Dryomov wrote: The following check in rbd_img_obj_request_submit() rbd_dev-parent_overlap = obj_request-img_offset allows the fall through to the non-layered write case even if both parent_overlap and obj_request-img_offset belong to the same RADOS object. This leads to data corruption, because the area to the left of parent_overlap ends up unconditionally zero-filled instead of being populated with parent data. Suppose we want to write 1M to offset 6M of image bar, which is a clone of foo@snap; object_size is 4M, parent_overlap is 5M: rbd_data.id.0001 -|--| | should be copyup'ed | should be zeroed out | write ... -|--| 4M5M 6M parent_overlapobj_request-img_offset 4..5M should be copyup'ed from foo, yet it is zero-filled, just like 5..6M is. Given that the only striping mode kernel client currently supports is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round parent_overlap up to the next object boundary for the purposes of the overlap check. I know Josh already reviewed this, but I wanted to make sure I understood it. Now I do. Previously the first layered write request starting at the should be zeroed out region would be treated as a normal write. Consequently the child object would get created, but only the affected portion would get written, meaning the should be copyup'ed region in the child object would be zero filled. Later reads to the should be copyup'ed would see that child object existed, so the parent data will never get filled into the should be copyup'ed area as it must be. Rounding up like you did for this test makes sense--it makes the granularity of this overlap test match the granularity of the exists test. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 8295b3afa8e0..813e673d49df 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1366,6 +1366,14 @@ static bool obj_request_exists_test(struct rbd_obj_request *obj_request) return test_bit(OBJ_REQ_EXISTS, obj_request-flags) != 0; } +static bool obj_request_overlaps_parent(struct rbd_obj_request *obj_request) +{ + struct rbd_device *rbd_dev = obj_request-img_request-rbd_dev; + + return obj_request-img_offset + round_up(rbd_dev-parent_overlap, rbd_obj_bytes(rbd_dev-header)); +} + static void rbd_obj_request_get(struct rbd_obj_request *obj_request) { dout(%s: obj %p (was %d)\n, __func__, obj_request, @@ -2683,7 +2691,7 @@ static int rbd_img_obj_request_submit(struct rbd_obj_request *obj_request) */ if (!img_request_write_test(img_request) || !img_request_layered_test(img_request) || - rbd_dev-parent_overlap = obj_request-img_offset || + !obj_request_overlaps_parent(obj_request) || ((known = obj_request_known_test(obj_request)) obj_request_exists_test(obj_request))) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/14] libceph: rename ceph_osd_request::r_linger_osd to r_linger_osd_item
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: So that: req-r_osd_item -- osd-o_requests list req-r_linger_osd_item -- osd-o_linger_requests list Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com This looks good and I prefer it too. Reviewed-by: Alex Elder el...@linaro.org --- include/linux/ceph/osd_client.h |2 +- net/ceph/osd_client.c |8 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 94ec69672164..7490a03ac163 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -117,7 +117,7 @@ struct ceph_osd_request { struct list_head r_req_lru_item; struct list_head r_osd_item; struct list_head r_linger_item; - struct list_head r_linger_osd; + struct list_head r_linger_osd_item; struct ceph_osd *r_osd; struct ceph_pg r_pgid; int r_pg_osds[CEPH_PG_MAX_SIZE]; diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 05be0c181695..d5d2be3bd113 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -364,7 +364,7 @@ struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client *osdc, RB_CLEAR_NODE(req-r_node); INIT_LIST_HEAD(req-r_unsafe_item); INIT_LIST_HEAD(req-r_linger_item); - INIT_LIST_HEAD(req-r_linger_osd); + INIT_LIST_HEAD(req-r_linger_osd_item); INIT_LIST_HEAD(req-r_req_lru_item); INIT_LIST_HEAD(req-r_osd_item); @@ -916,7 +916,7 @@ static void __kick_osd_requests(struct ceph_osd_client *osdc, * list at the end to keep things in tid order. */ list_for_each_entry_safe(req, nreq, osd-o_linger_requests, - r_linger_osd) { + r_linger_osd_item) { /* * reregister request prior to unregistering linger so * that r_osd is preserved. @@ -1218,7 +1218,7 @@ static void __register_linger_request(struct ceph_osd_client *osdc, ceph_osdc_get_request(req); list_add_tail(req-r_linger_item, osdc-req_linger); if (req-r_osd) - list_add_tail(req-r_linger_osd, + list_add_tail(req-r_linger_osd_item, req-r_osd-o_linger_requests); } @@ -1228,7 +1228,7 @@ static void __unregister_linger_request(struct ceph_osd_client *osdc, dout(__unregister_linger_request %p\n, req); list_del_init(req-r_linger_item); if (req-r_osd) { - list_del_init(req-r_linger_osd); + list_del_init(req-r_linger_osd_item); if (list_empty(req-r_osd-o_requests) list_empty(req-r_osd-o_linger_requests)) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/14] libceph: add maybe_move_osd_to_lru() and switch to it
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Abstract out __move_osd_to_lru() logic from __unregister_request() and __unregister_linger_request(). Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- net/ceph/osd_client.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index d5d2be3bd113..6202923b41ff 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1029,12 +1029,23 @@ static void remove_all_osds(struct ceph_osd_client *osdc) static void __move_osd_to_lru(struct ceph_osd_client *osdc, struct ceph_osd *osd) { - dout(__move_osd_to_lru %p\n, osd); + dout(%s %p\n, __func__, osd); BUG_ON(!list_empty(osd-o_osd_lru)); + list_add_tail(osd-o_osd_lru, osdc-osd_lru); osd-lru_ttl = jiffies + osdc-client-options-osd_idle_ttl * HZ; } +static void maybe_move_osd_to_lru(struct ceph_osd_client *osdc, + struct ceph_osd *osd) +{ + dout(%s %p\n, __func__, osd); + + if (list_empty(osd-o_requests) + list_empty(osd-o_linger_requests)) + __move_osd_to_lru(osdc, osd); +} + static void __remove_osd_from_lru(struct ceph_osd *osd) { dout(__remove_osd_from_lru %p\n, osd); @@ -1182,11 +1193,7 @@ static void __unregister_request(struct ceph_osd_client *osdc, ceph_msg_revoke(req-r_request); list_del_init(req-r_osd_item); - if (list_empty(req-r_osd-o_requests) - list_empty(req-r_osd-o_linger_requests)) { - dout(moving osd to %p lru\n, req-r_osd); - __move_osd_to_lru(osdc, req-r_osd); - } + maybe_move_osd_to_lru(osdc, req-r_osd); if (list_empty(req-r_linger_item)) req-r_osd = NULL; } @@ -1229,12 +1236,7 @@ static void __unregister_linger_request(struct ceph_osd_client *osdc, list_del_init(req-r_linger_item); if (req-r_osd) { list_del_init(req-r_linger_osd_item); - - if (list_empty(req-r_osd-o_requests) - list_empty(req-r_osd-o_linger_requests)) { - dout(moving osd to %p lru\n, req-r_osd); - __move_osd_to_lru(osdc, req-r_osd); - } + maybe_move_osd_to_lru(osdc, req-r_osd); if (list_empty(req-r_osd_item)) req-r_osd = NULL; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/14] libceph: move and add dout()s to ceph_msg_{get,put}()
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Add dout()s to ceph_msg_{get,put}(). Also move them to .c and turn kref release callback into a static function. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com This is all very good. I have one suggestion though, below, but regardless: Reviewed-by: Alex Elder el...@linaro.org --- include/linux/ceph/messenger.h | 14 ++ net/ceph/messenger.c | 31 ++- 2 files changed, 24 insertions(+), 21 deletions(-) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index d21f2dba0731..40ae58e3e9db 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -285,19 +285,9 @@ extern void ceph_msg_data_add_bio(struct ceph_msg *msg, struct bio *bio, extern struct ceph_msg *ceph_msg_new(int type, int front_len, gfp_t flags, bool can_fail); -extern void ceph_msg_kfree(struct ceph_msg *m); - -static inline struct ceph_msg *ceph_msg_get(struct ceph_msg *msg) -{ - kref_get(msg-kref); - return msg; -} -extern void ceph_msg_last_put(struct kref *kref); -static inline void ceph_msg_put(struct ceph_msg *msg) -{ - kref_put(msg-kref, ceph_msg_last_put); -} +extern struct ceph_msg *ceph_msg_get(struct ceph_msg *msg); +extern void ceph_msg_put(struct ceph_msg *msg); extern void ceph_msg_dump(struct ceph_msg *msg); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 1948d592aa54..8bffa5b90fef 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -3269,24 +3269,21 @@ static int ceph_con_in_msg_alloc(struct ceph_connection *con, int *skip) /* * Free a generically kmalloc'd message. */ -void ceph_msg_kfree(struct ceph_msg *m) +static void ceph_msg_free(struct ceph_msg *m) { - dout(msg_kfree %p\n, m); + dout(%s %p\n, __func__, m); ceph_kvfree(m-front.iov_base); kmem_cache_free(ceph_msg_cache, m); } -/* - * Drop a msg ref. Destroy as needed. - */ -void ceph_msg_last_put(struct kref *kref) +static void ceph_msg_release(struct kref *kref) { struct ceph_msg *m = container_of(kref, struct ceph_msg, kref); LIST_HEAD(data); struct list_head *links; struct list_head *next; - dout(ceph_msg_put last one on %p\n, m); + dout(%s %p\n, __func__, m); WARN_ON(!list_empty(m-list_head)); /* drop middle, data, if any */ @@ -3308,9 +3305,25 @@ void ceph_msg_last_put(struct kref *kref) if (m-pool) ceph_msgpool_put(m-pool, m); else - ceph_msg_kfree(m); + ceph_msg_free(m); +} + +struct ceph_msg *ceph_msg_get(struct ceph_msg *msg) +{ + dout(%s %p (was %d)\n, __func__, msg, + atomic_read(msg-kref.refcount)); + kref_get(msg-kref); I suggest you do the dout() *after* you call kref_get(). I'm sure it doesn't matter in practice here, but my reasoning is that you get the reference immediately, and you'll have the reference when reading the refcount value. It also makes the dout() calls here and in ceph_msg_put() end up getting symmetrically wrapped by the get kref get and put. (You have a race reading the updated refcount value either way, but it's debug code.) + return msg; +} +EXPORT_SYMBOL(ceph_msg_get); + +void ceph_msg_put(struct ceph_msg *msg) +{ + dout(%s %p (was %d)\n, __func__, msg, + atomic_read(msg-kref.refcount)); + kref_put(msg-kref, ceph_msg_release); } -EXPORT_SYMBOL(ceph_msg_last_put); +EXPORT_SYMBOL(ceph_msg_put); void ceph_msg_dump(struct ceph_msg *msg) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/14] libceph: move and add dout()s to ceph_osdc_request_{get,put}()
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Add dout()s to ceph_osdc_request_{get,put}(). Also move them to .c and turn kref release callback into a static function. You can pretty much take the identical comments from what I said on [PATCH 03/14]. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- include/linux/ceph/osd_client.h | 11 ++- net/ceph/osd_client.c | 26 ++ 2 files changed, 24 insertions(+), 13 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 7490a03ac163..a8d5652f589d 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -328,15 +328,8 @@ extern void ceph_osdc_set_request_linger(struct ceph_osd_client *osdc, extern void ceph_osdc_unregister_linger_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req); -static inline void ceph_osdc_get_request(struct ceph_osd_request *req) -{ - kref_get(req-r_kref); -} -extern void ceph_osdc_release_request(struct kref *kref); -static inline void ceph_osdc_put_request(struct ceph_osd_request *req) -{ - kref_put(req-r_kref, ceph_osdc_release_request); -} +extern void ceph_osdc_get_request(struct ceph_osd_request *req); +extern void ceph_osdc_put_request(struct ceph_osd_request *req); extern int ceph_osdc_start_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req, diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 6202923b41ff..7406046212dc 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -297,12 +297,15 @@ static void osd_req_op_data_release(struct ceph_osd_request *osd_req, /* * requests */ -void ceph_osdc_release_request(struct kref *kref) +static void ceph_osdc_release_request(struct kref *kref) { - struct ceph_osd_request *req; + struct ceph_osd_request *req = container_of(kref, + struct ceph_osd_request, r_kref); unsigned int which; - req = container_of(kref, struct ceph_osd_request, r_kref); + dout(%s %p (r_request %p r_reply %p)\n, __func__, req, + req-r_request, req-r_reply); + if (req-r_request) ceph_msg_put(req-r_request); if (req-r_reply) { @@ -320,7 +323,22 @@ void ceph_osdc_release_request(struct kref *kref) kmem_cache_free(ceph_osd_request_cache, req); } -EXPORT_SYMBOL(ceph_osdc_release_request); + +void ceph_osdc_get_request(struct ceph_osd_request *req) +{ + dout(%s %p (was %d)\n, __func__, req, + atomic_read(req-r_kref.refcount)); + kref_get(req-r_kref); +} +EXPORT_SYMBOL(ceph_osdc_get_request); + +void ceph_osdc_put_request(struct ceph_osd_request *req) +{ + dout(%s %p (was %d)\n, __func__, req, + atomic_read(req-r_kref.refcount)); + kref_put(req-r_kref, ceph_osdc_release_request); +} +EXPORT_SYMBOL(ceph_osdc_put_request); struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client *osdc, struct ceph_snap_context *snapc, -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/14] libceph: assert both regular and lingering lists in __remove_osd()
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: It is important that both regular and lingering requests lists are empty when the OSD is removed. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/osd_client.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 052eb8bfcc74..a9b7ea7bfdc6 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1032,6 +1032,8 @@ static void __remove_osd(struct ceph_osd_client *osdc, struct ceph_osd *osd) { dout(__remove_osd %p\n, osd); BUG_ON(!list_empty(osd-o_requests)); + BUG_ON(!list_empty(osd-o_linger_requests)); + rb_erase(osd-o_node, osdc-osds); list_del_init(osd-o_osd_lru); ceph_con_close(osd-o_con); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/14] libceph: harden ceph_osdc_request_release() a bit
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Add some WARN_ONs to alert us when we try to destroy requests that are still registered. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com Good idea. Especially the RB_CLEAR_NODE() call. Reviewed-by: Alex Elder el...@linaro.org --- net/ceph/osd_client.c |7 +++ 1 file changed, 7 insertions(+) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 7406046212dc..052eb8bfcc74 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -305,6 +305,12 @@ static void ceph_osdc_release_request(struct kref *kref) dout(%s %p (r_request %p r_reply %p)\n, __func__, req, req-r_request, req-r_reply); + WARN_ON(!RB_EMPTY_NODE(req-r_node)); + WARN_ON(!list_empty(req-r_req_lru_item)); + WARN_ON(!list_empty(req-r_osd_item)); + WARN_ON(!list_empty(req-r_linger_item)); + WARN_ON(!list_empty(req-r_linger_osd_item)); + WARN_ON(req-r_osd); if (req-r_request) ceph_msg_put(req-r_request); @@ -1204,6 +1210,7 @@ static void __unregister_request(struct ceph_osd_client *osdc, dout(__unregister_request %p tid %lld\n, req, req-r_tid); rb_erase(req-r_node, osdc-requests); + RB_CLEAR_NODE(req-r_node); osdc-num_requests--; if (req-r_osd) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/14] libceph: unregister only registered linger requests
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Linger requests that have not yet been registered should not be unregistered by __unregister_linger_request(). This messes up ref count and leads to use-after-free. This makes sense. The problem can occur when updating the OSD map. An OSD *client* has its list of linger requests, but they are not all necessarily registered as associated with the *OSD*. So the __unregister_linger_request() call in kick_requests() might pass a not-yet-registered linger request. It could also occur if a client (like RBD) gets an error after setting a request to linger but the request has completed successfully. Anyway, looks good. This explains why the rename of the r_linger_osd_item field was helpful. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/osd_client.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index a9b7ea7bfdc6..12ec553a7e76 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1248,7 +1248,9 @@ static void __cancel_request(struct ceph_osd_request *req) static void __register_linger_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req) { - dout(__register_linger_request %p\n, req); + dout(%s %p tid %llu\n, __func__, req, req-r_tid); + WARN_ON(!req-r_linger); + ceph_osdc_get_request(req); list_add_tail(req-r_linger_item, osdc-req_linger); if (req-r_osd) @@ -1259,8 +1261,17 @@ static void __register_linger_request(struct ceph_osd_client *osdc, static void __unregister_linger_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req) { - dout(__unregister_linger_request %p\n, req); + WARN_ON(!req-r_linger); + + if (list_empty(req-r_linger_item)) { + dout(%s %p tid %llu not registered\n, __func__, req, + req-r_tid); + return; + } + + dout(%s %p tid %llu\n, __func__, req, req-r_tid); list_del_init(req-r_linger_item); + if (req-r_osd) { list_del_init(req-r_linger_osd_item); maybe_move_osd_to_lru(osdc, req-r_osd); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/14] libceph: fix linger request check in __unregister_request()
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: We should check if request is on the linger request list of any of the OSDs, not whether request is registered or not. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com That was a difficult to spot bug. Very good. Reviewed-by: Alex Elder el...@linaro.org --- net/ceph/osd_client.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 12ec553a7e76..8104db24bc09 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1221,7 +1221,7 @@ static void __unregister_request(struct ceph_osd_client *osdc, list_del_init(req-r_osd_item); maybe_move_osd_to_lru(osdc, req-r_osd); - if (list_empty(req-r_linger_item)) + if (list_empty(req-r_linger_osd_item)) req-r_osd = NULL; } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/14] libceph: introduce ceph_osdc_cancel_request()
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Introduce ceph_osdc_cancel_request() intended for canceling requests from the higher layers (rbd and cephfs). Because higher layers are in charge and are supposed to know what and when they are canceling, the request is not completed, only unref'ed and removed from the libceph data structures. This seems reasonable. But you make two changes here that I'd like to see a little more explanation on. They seem significant enough to warrant a little more attention in the commit description. - __cancel_request() is no longer called when ceph_osdc_wait_request() fails due to an an interrupt. This is my main concern, and I plan to work through it but I'm in a small hurry right now. - __unregister_linger_request() is now called when a request is canceled, but it wasn't before. (Since we're consistent about setting the r_linger flag this should be fine, but it *is* a behavior change.) I'm going to keep looking at this; I have about 20 minutes before I have to leave for a while. -Alex Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- include/linux/ceph/osd_client.h |1 + net/ceph/osd_client.c | 31 +-- 2 files changed, 26 insertions(+), 6 deletions(-) diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index a8d5652f589d..de09cad7b7c7 100644 --- a/include/linux/ceph/osd_client.h +++ b/include/linux/ceph/osd_client.h @@ -334,6 +334,7 @@ extern void ceph_osdc_put_request(struct ceph_osd_request *req); extern int ceph_osdc_start_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req, bool nofail); +extern void ceph_osdc_cancel_request(struct ceph_osd_request *req); extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req); extern void ceph_osdc_sync(struct ceph_osd_client *osdc); diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 8104db24bc09..ef44cc0bbc6a 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -2470,6 +2470,25 @@ int ceph_osdc_start_request(struct ceph_osd_client *osdc, EXPORT_SYMBOL(ceph_osdc_start_request); /* + * Unregister a registered request. The request is not completed (i.e. + * no callbacks or wakeups) - higher layers are supposed to know what + * they are canceling. + */ +void ceph_osdc_cancel_request(struct ceph_osd_request *req) +{ + struct ceph_osd_client *osdc = req-r_osdc; + + mutex_lock(osdc-request_mutex); + if (req-r_linger) + __unregister_linger_request(osdc, req); + __unregister_request(osdc, req); + mutex_unlock(osdc-request_mutex); + + dout(%s %p tid %llu canceled\n, __func__, req, req-r_tid); +} +EXPORT_SYMBOL(ceph_osdc_cancel_request); + +/* * wait for a request to complete */ int ceph_osdc_wait_request(struct ceph_osd_client *osdc, @@ -2477,18 +2496,18 @@ int ceph_osdc_wait_request(struct ceph_osd_client *osdc, { int rc; + dout(%s %p tid %llu\n, __func__, req, req-r_tid); + rc = wait_for_completion_interruptible(req-r_completion); if (rc 0) { - mutex_lock(osdc-request_mutex); - __cancel_request(req); - __unregister_request(osdc, req); - mutex_unlock(osdc-request_mutex); + dout(%s %p tid %llu interrupted\n, __func__, req, req-r_tid); + ceph_osdc_cancel_request(req); complete_request(req); - dout(wait_request tid %llu canceled/timed out\n, req-r_tid); return rc; } - dout(wait_request tid %llu result %d\n, req-r_tid, req-r_result); + dout(%s %p tid %llu result %d\n, __func__, req, req-r_tid, + req-r_result); return req-r_result; } EXPORT_SYMBOL(ceph_osdc_wait_request); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/14] libceph: unregister only registered linger requests
On 06/25/2014 12:16 PM, Ilya Dryomov wrote: Linger requests that have not yet been registered should not be unregistered by __unregister_linger_request(). This messes up ref count and leads to use-after-free. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- net/ceph/osd_client.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index a9b7ea7bfdc6..12ec553a7e76 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1248,7 +1248,9 @@ static void __cancel_request(struct ceph_osd_request *req) static void __register_linger_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req) { - dout(__register_linger_request %p\n, req); + dout(%s %p tid %llu\n, __func__, req, req-r_tid); + WARN_ON(!req-r_linger); + ceph_osdc_get_request(req); list_add_tail(req-r_linger_item, osdc-req_linger); if (req-r_osd) @@ -1259,8 +1261,17 @@ static void __register_linger_request(struct ceph_osd_client *osdc, static void __unregister_linger_request(struct ceph_osd_client *osdc, struct ceph_osd_request *req) { - dout(__unregister_linger_request %p\n, req); + WARN_ON(!req-r_linger); I just noticed something. ceph_osdc_unregister_linger_request() clears req-r_linger before calling __unregister_linger_request(), which means this warning must be tripping a lot... Just delete that assignment in ceph_osdc_unregister_linger_request() as part of this commit. -Alex + + if (list_empty(req-r_linger_item)) { + dout(%s %p tid %llu not registered\n, __func__, req, + req-r_tid); + return; + } + + dout(%s %p tid %llu\n, __func__, req, req-r_tid); list_del_init(req-r_linger_item); + if (req-r_osd) { list_del_init(req-r_linger_osd_item); maybe_move_osd_to_lru(osdc, req-r_osd); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: fix ida/idr memory leak
On 06/04/2014 11:38 AM, Ilya Dryomov wrote: ida_destroy() needs to be called on module exit to release ida caches. Looks good to me. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |1 + 1 file changed, 1 insertion(+) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 94747afc2e78..ecdd4aee173c 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5464,6 +5464,7 @@ err_out_slab: static void __exit rbd_exit(void) { + ida_destroy(rbd_dev_id_ida); rbd_sysfs_cleanup(); if (single_major) unregister_blkdev(rbd_major, RBD_DRV_NAME); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph: refactor readpage_nounlock() to make the logic clearer
On 05/28/2014 02:09 AM, Zhang Zhen wrote: If the return value of ceph_osdc_readpages() is not negative, it is certainly greater than or equal to zero. Remove the useless condition judgment and redundant braces. Signed-off-by: Zhang Zhen zhenzhang.zh...@huawei.com This looks good. Reviewed-by: Alex Elder el...@linaro.org --- fs/ceph/addr.c | 17 +++-- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index b53278c..6aa2e3f 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -211,18 +211,15 @@ static int readpage_nounlock(struct file *filp, struct page *page) SetPageError(page); ceph_fscache_readpage_cancel(inode, page); goto out; - } else { - if (err PAGE_CACHE_SIZE) { - /* zero fill remainder of page */ - zero_user_segment(page, err, PAGE_CACHE_SIZE); - } else { - flush_dcache_page(page); - } } - SetPageUptodate(page); + if (err PAGE_CACHE_SIZE) + /* zero fill remainder of page */ + zero_user_segment(page, err, PAGE_CACHE_SIZE); + else + flush_dcache_page(page); - if (err = 0) - ceph_readpage_to_fscache(inode, page); + SetPageUptodate(page); + ceph_readpage_to_fscache(inode, page); out: return err 0 ? err : 0; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: crash report: paging request errors in various krbd contexts
On 05/16/2014 01:00 AM, Hannes Landeholm wrote: A production server just locked up and had to be hard rebooted. It had these various rbd related crash signatures in it's system log within the same 10 second interval: I'll try to provide a quick summary of what's likely happened in each of these. The bottom line is that it appears that some memory used by rbd and/or libceph has become corrupted, or there is something (or more than one thing) that is being used after it's been freed. Either way this sort of thing will be difficult to try to understand; it would be great if it could be reproduced independently. kernel: BUG: unable to handle kernel paging request at kernel: IP: [812832dd] strnlen+0xd/0x40 kernel: PGD 180f067 PUD 1811067 PMD 0 kernel: Oops: [#1] PREEMPT SMP kernel: Modules linked in: xt_recent xt_limit veth ipt_MASQUERADE cbc ipt_REJECT xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_mangle ip_tables x_tables coretemp hwmon x86_pkg_temp_thermal crct10dif_pclmul crct10di f_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd microcode evdev pcspkr xen_netback xen_blkback xen_gntalloc xenfs xen_gntdev xen_evtchn rbd libceph crc32c libcrc32c ext4 crc16 mbcache jbd2 ixgbevf xen_privcmd xen_pcifront xen_netfront xen_kbdfro nt xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops xen_blkfront virtio_pci virtio_net virtio_blk virtio_ring virtio ipmi_poweroff ipmi_msghandler kernel: CPU: 3 PID: 2887 Comm: mysqld Not tainted 3.14.1-1-js #1 kernel: task: 880342ce1d70 ti: 88033e17e000 task.ti: 88033e17e000 kernel: RIP: e030:[812832dd] [812832dd] strnlen+0xd/0x40 We're calling strnlen() (ultimately) from snprintf(). The format provided will be %s.%012llx (or similar). The string provided for the %s is rbd_dev-header.object_prefix, which is a dynamically allocated string initialized once for the rbd device, which will be NUL-terminated and unchanging until the device gets mapped. Either the rbd device got unmapped while still in use, or the memory holding this rbd_dev structure got corrupted somehow. kernel: RSP: e02b:88033e17fa78 EFLAGS: 00010286 kernel: RAX: 816feed6 RBX: 8800ffb60068 RCX: fffe kernel: RDX: RSI: RDI: kernel: RBP: 88033e17fa78 R08: R09: kernel: R10: 756d696e612d736a R11: 2f30622d637a762f R12: kernel: R13: 8800ffb600cd R14: R15: kernel: FS: 7f98dec7f700() GS:8803b5d8() knlGS: kernel: CS: e033 DS: ES: CR0: 8005003b kernel: CR2: CR3: 000335dcc000 CR4: 2660 kernel: Stack: kernel: 88033e17fab0 8128562b 8800ffb60068 8800ffb600cd kernel: 88033e17fb28 a0196869 a0196869 88033e17fb18 kernel: 81286ac1 880389525960 814de243 kernel: Call Trace: kernel: [8128562b] string.isra.6+0x3b/0xf0 kernel: [81286ac1] vsnprintf+0x1c1/0x610 kernel: [814de243] ? _raw_spin_unlock_irq+0x13/0x30 kernel: [81286fd9] snprintf+0x39/0x40 kernel: [a01911e0] ? rbd_img_request_fill+0x100/0x6d0 [rbd] kernel: [a019122a] rbd_img_request_fill+0x14a/0x6d0 [rbd] kernel: [a018f4d5] ? rbd_img_request_create+0x155/0x220 [rbd] kernel: [8125cab9] ? blk_add_timer+0x19/0x20 kernel: [a0194a1d] rbd_request_fn+0x1ed/0x330 [rbd] kernel: [81252f13] __blk_run_queue+0x33/0x40 kernel: [8125411e] queue_unplugged+0x2e/0xd0 kernel: [81256cf0] blk_flush_plug_list+0x1f0/0x230 kernel: [812570a4] blk_finish_plug+0x14/0x40 kernel: [a00b9d6e] ext4_writepages+0x48e/0xd50 [ext4] kernel: [81136aae] ? generic_file_aio_write+0x5e/0xe0 kernel: [811417ae] do_writepages+0x1e/0x40 kernel: [811363d9] __filemap_fdatawrite_range+0x59/0x60 kernel: [811364da] filemap_write_and_wait_range+0x2a/0x70 kernel: [a00b149a] ext4_sync_file+0xba/0x360 [ext4] kernel: [811d50ce] do_fsync+0x4e/0x80 kernel: [811d5373] SyS_fdatasync+0x13/0x20 kernel: [814e66e9] system_call_fastpath+0x16/0x1b kernel: Code: c0 01 80 38 00 75 f7 48 29 f8 5d c3 31 c0 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 8d 4e ff 48 89 e5 74 2a 80 3f 00 74 25 48 89 f8 31 d2 eb 10 0f 1f 80 00 00 00 00 48 83 kernel: RIP [812832dd] strnlen+0xd/0x40 kernel: RSP 88033e17fa78 kernel: CR2: kernel: ---[ end trace 83a2fd2a9969b20d ]--- kernel: BUG: unable to handle kernel paging request at 177b kernel: IP: [814dd33c] down_read+0xc/0x20