Re: [ceph] locking fun with d_materialise_unique()
On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote: Yep, that is indeed a problem. I think we just need to do the r_aborted and/or r_locked_dir check in the else if condition... I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't get to its splice_dentry() and d_delete() calls in similar situations - I hadn't checked that one yet. If it isn't guaranteed, we have a problem there as well. ...and the condition guarding readdir_prepopulate(). :) I think you're reading it correctly. The main thing to keep in mind here is that we *do* need to call fill_inode() for the inode metadata on these requests to keep the mds and client state in sync. The dentry state is safe to ignore. You mean the parts under if (rinfo-head-is_dentry) { and if (rinfo-head-is_target) { in there? Because there's fill_inode() called from readdir_prepopulate() and it's a lot more problematic than those two... It would be great to have the dir i_mutex rules summarized somewhere, even if it is just a copy of the below. It took a fair bit of trial and error to infer what was going on when writing this code. :) Directory -i_mutex rules are in part documented - what VFS guarantees to hold side is in Documentation/filesystems/directory-locking. It's the other side (what locks are expected to be held by callers of dcache.c functions) that is badly missing... Ping me when you've pushed that branch and I'll take a look... To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git 01a88fa..4056362 master - master with tentative ceph patch in the very end. Should be on git.kernel.org shortly... -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/12, v2] rbd: new request tracking code
On 01/24/2013 06:08 AM, Alex Elder wrote: This is an update of the first patch in my request tracking code series. After posting it the other day I identified some problems related to reference counting of image and object requests. I also am starting to look at the details of implementing layered reads, and ended up making some substantive changes. Since I have not seen any review feedback I thought the best thing would be to just re-post the updated patch. The remaining patches in the series have changed accordingly, but they have not really changed substantively, so I am not re-posting those (but will if it's requested). The main functional change is that an image request no longer maintains an array of object request pointers, it maintains a list of object requests. This simplifies some things, and makes the image request structure fixed size. A few other functional changes: - Reference counting of object and image requests is now done sensibly. - Image requests now support a callback when complete, which will be used for layered I/O requests. - There are a few new helper functions that encapsulate tying an object request to an image request. - An distinct value is now used for the which field for object requests not associated with a image request (mainly used for validation/assertions). Other changes: - Everything that was named image_request now uses img_request instead. - A few blocks and lines of code have been rearranged. The updated series is available on the ceph-client git repository in the branch wip-rbd-review-v2. -Alex This patch fully implements the new request tracking code for rbd I/O requests. Each I/O request to an rbd image will get an rbd_image_request structure allocated to track it. This provides access to all information about the original request, as well as access to the set of one or more object requests that are initiated as a result of the image request. An rbd_obj_request structure defines a request sent to a single osd object (possibly) as part of an rbd image request. An rbd object request refers to a ceph_osd_request structure built up to represent the request; for now it will contain a single osd operation. It also provides space to hold the result status and the version of the object when the osd request completes. An rbd_obj_request structure can also stand on its own. This will be used for reading the version 1 header object, for issuing acknowledgements to event notifications, and for making object method calls. All rbd object requests now complete asynchronously with respect to the osd client--they supply a common callback routine. This resolves: http://tracker.newdream.net/issues/3741 Signed-off-by: Alex Elder el...@inktank.com --- v2: - fixed reference counting - image request callback support - image/object connection helper functions - distinct BAD_WHICH value for non-image object requests drivers/block/rbd.c | 622 ++- 1 file changed, 620 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 6689363..46a61dd 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -181,6 +181,67 @@ struct rbd_req_coll { struct rbd_req_status status[0]; }; +struct rbd_img_request; +typedef void (*rbd_img_callback_t)(struct rbd_img_request *); + +#defineBAD_WHICH U32_MAX /* Good which or bad which, which? */ + +struct rbd_obj_request; +typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *); + +enum obj_req_type { obj_req_bio }; /* More types to come */ enum labels should be capitalized. +struct rbd_obj_request { + const char *object_name; + u64 offset; /* object start byte */ + u64 length; /* bytes from offset */ + + struct rbd_img_request *img_request; + struct list_headlinks; + u32 which; /* posn image request list */ + + enum obj_req_type type; + struct bio *bio_list; + + struct ceph_osd_request *osd_req; + + u64 xferred;/* bytes transferred */ + u64 version; This version is only used (uselessly) for the watch operation. It should be removed in a future patch (along with the obj_ver in the header). + s32 result; + atomic_tdone; + + rbd_obj_callback_t callback; + + struct kref kref; +}; + +struct rbd_img_request { + struct request *rq; + struct rbd_device *rbd_dev; + u64 offset; /* starting image byte offset */ + u64 length; /* byte count from offset */ + boolwrite_request; /* false for read */ + union { +
Re: [PATCH 02/12, v2] rbd: kill rbd_rq_fn() and all other related code
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:32 AM, Alex Elder wrote: Now that the request function has been replaced by one using the new request management data structures the old one can go away. Deleting it makes rbd_dev_do_request() no longer needed, and deleting that makes other functions unneeded, and so on. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 319 --- 1 file changed, 319 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 46a61dd..7caddea 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -621,18 +621,6 @@ static void rbd_put_client(struct rbd_client *rbdc) kref_put(rbdc-kref, rbd_client_release); } -/* - * Destroy requests collection - */ -static void rbd_coll_release(struct kref *kref) -{ - struct rbd_req_coll *coll = - container_of(kref, struct rbd_req_coll, kref); - - dout(rbd_coll_release %p\n, coll); - kfree(coll); -} - static bool rbd_image_format_valid(u32 image_format) { return image_format == 1 || image_format == 2; @@ -876,28 +864,6 @@ static u64 rbd_segment_length(struct rbd_device *rbd_dev, return length; } -static int rbd_get_num_segments(struct rbd_image_header *header, - u64 ofs, u64 len) -{ - u64 start_seg; - u64 end_seg; - u64 result; - - if (!len) - return 0; - if (len - 1 U64_MAX - ofs) - return -ERANGE; - - start_seg = ofs header-obj_order; - end_seg = (ofs + len - 1) header-obj_order; - - result = end_seg - start_seg + 1; - if (result (u64) INT_MAX) - return -ERANGE; - - return (int) result; -} - /* * returns the size of an object in the image */ @@ -1216,52 +1182,6 @@ static void rbd_osd_req_op_destroy(struct ceph_osd_req_op *op) kfree(op); } -static void rbd_coll_end_req_index(struct request *rq, - struct rbd_req_coll *coll, - int index, - s32 ret, u64 len) -{ - struct request_queue *q; - int min, max, i; - - dout(rbd_coll_end_req_index %p index %d ret %d len %llu\n, -coll, index, (int)ret, (unsigned long long)len); - - if (!rq) - return; - - if (!coll) { - blk_end_request(rq, ret, len); - return; - } - - q = rq-q; - - spin_lock_irq(q-queue_lock); - coll-status[index].done = 1; - coll-status[index].rc = ret; - coll-status[index].bytes = len; - max = min = coll-num_done; - while (max coll-total coll-status[max].done) - max++; - - for (i = min; imax; i++) { - __blk_end_request(rq, (int)coll-status[i].rc, - coll-status[i].bytes); - coll-num_done++; - kref_put(coll-kref, rbd_coll_release); - } - spin_unlock_irq(q-queue_lock); -} - -static void rbd_coll_end_req(struct rbd_request *rbd_req, -s32 ret, u64 len) -{ - rbd_coll_end_req_index(rbd_req-rq, - rbd_req-coll, rbd_req-coll_index, - ret, len); -} - /* * Send ceph osd request */ @@ -1361,46 +1281,6 @@ done_osd_req: return ret; } -/* - * Ceph osd op callback - */ -static void rbd_req_cb(struct ceph_osd_request *osd_req, struct ceph_msg *msg) -{ - struct rbd_request *rbd_req = osd_req-r_priv; - struct ceph_osd_reply_head *replyhead; - struct ceph_osd_op *op; - s32 rc; - u64 bytes; - int read_op; - - /* parse reply */ - replyhead = msg-front.iov_base; - WARN_ON(le32_to_cpu(replyhead-num_ops) == 0); - op = (void *)(replyhead + 1); - rc = (s32)le32_to_cpu(replyhead-result); - bytes = le64_to_cpu(op-extent.length); - read_op = (le16_to_cpu(op-op) == CEPH_OSD_OP_READ); - - dout(rbd_req_cb bytes=%llu readop=%d rc=%d\n, - (unsigned long long) bytes, read_op, (int) rc); - - if (rc == (s32)-ENOENT read_op) { - zero_bio_chain(rbd_req-bio, 0); - rc = 0; - } else if (rc == 0 read_op bytes rbd_req-len) { - zero_bio_chain(rbd_req-bio, bytes); - bytes = rbd_req-len; - } - - rbd_coll_end_req(rbd_req, rc, bytes); - - if (rbd_req-bio) - bio_chain_put(rbd_req-bio); - - ceph_osdc_put_request(osd_req); - kfree(rbd_req); -} - static void rbd_simple_req_cb(struct ceph_osd_request *osd_req, struct ceph_msg *msg) { @@ -1448,70 +1328,6 @@ done: return ret; } -/* - * Do an asynchronous ceph osd operation - */ -static int rbd_do_op(struct request *rq, -struct rbd_device
Re: [PATCH 03/12, v2] rbd: kill rbd_req_coll and rbd_request
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:32 AM, Alex Elder wrote: The two remaining callers of rbd_do_request() always pass a null collection pointer, so the coll and coll_index parameters are not needed. There is no other use of that data structure, so it can be eliminated. Deleting them means there is no need to allocate a rbd_request structure for the callback function. And since that's the only use of *that* structure, it too can be eliminated. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 58 +++ 1 file changed, 3 insertions(+), 55 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 7caddea..3302cea 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -162,25 +162,6 @@ struct rbd_client { struct list_headnode; }; -/* - * a request completion status - */ -struct rbd_req_status { - int done; - s32 rc; - u64 bytes; -}; - -/* - * a collection of requests - */ -struct rbd_req_coll { - int total; - int num_done; - struct kref kref; - struct rbd_req_status status[0]; -}; - struct rbd_img_request; typedef void (*rbd_img_callback_t)(struct rbd_img_request *); @@ -242,18 +223,6 @@ struct rbd_img_request { #define for_each_obj_request_safe(ireq, oreq, n) \ list_for_each_entry_safe_reverse(oreq, n, ireq-obj_requests, links) -/* - * a single io request - */ -struct rbd_request { - struct request *rq;/* blk layer request */ - struct bio *bio; /* cloned bio */ - struct page **pages;/* list of used pages */ - u64 len; - int coll_index; - struct rbd_req_coll *coll; -}; - struct rbd_snap { struct device dev; const char *name; @@ -1195,21 +1164,18 @@ static int rbd_do_request(struct request *rq, int num_pages, int flags, struct ceph_osd_req_op *op, - struct rbd_req_coll *coll, - int coll_index, void (*rbd_cb)(struct ceph_osd_request *, struct ceph_msg *), u64 *ver) { struct ceph_osd_client *osdc; struct ceph_osd_request *osd_req; - struct rbd_request *rbd_req = NULL; struct timespec mtime = CURRENT_TIME; int ret; - dout(rbd_do_request object_name=%s ofs=%llu len=%llu coll=%p[%d]\n, + dout(rbd_do_request object_name=%s ofs=%llu len=%llu\n, object_name, (unsigned long long) ofs, - (unsigned long long) len, coll, coll_index); + (unsigned long long) len); osdc = rbd_dev-rbd_client-client-osdc; osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO); @@ -1223,22 +1189,8 @@ static int rbd_do_request(struct request *rq, bio_get(osd_req-r_bio); } - if (coll) { - ret = -ENOMEM; - rbd_req = kmalloc(sizeof(*rbd_req), GFP_NOIO); - if (!rbd_req) - goto done_osd_req; - - rbd_req-rq = rq; - rbd_req-bio = bio; - rbd_req-pages = pages; - rbd_req-len = len; - rbd_req-coll = coll; - rbd_req-coll_index = coll_index; - } - osd_req-r_callback = rbd_cb; - osd_req-r_priv = rbd_req; + osd_req-r_priv = NULL; strncpy(osd_req-r_oid, object_name, sizeof(osd_req-r_oid)); osd_req-r_oid_len = strlen(osd_req-r_oid); @@ -1274,8 +1226,6 @@ static int rbd_do_request(struct request *rq, done_err: if (bio) bio_chain_put(osd_req-r_bio); - kfree(rbd_req); -done_osd_req: ceph_osdc_put_request(osd_req); return ret; @@ -1314,7 +1264,6 @@ static int rbd_req_sync_op(struct rbd_device *rbd_dev, pages, num_pages, flags, op, - NULL, 0, NULL, ver); if (ret 0) @@ -1390,7 +1339,6 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev, NULL, 0, CEPH_OSD_FLAG_READ, op, - NULL, 0, rbd_simple_req_cb, NULL); rbd_osd_req_op_destroy(op); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/12, v2] rbd: implement sync object read with new code
A couple small comments, but looks good. Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:33 AM, Alex Elder wrote: Reimplement the synchronous read operation used for reading a version 1 header using the new request tracking code. Name the resulting function rbd_obj_read_sync() to better reflect that it's a full object operation, not an object request. To do this, implement a new obj_req_pages object request type. This implements a new mechanism to allow the caller to wait for completion for an rbd_obj_request by calling rbd_obj_request_wait(). This partially resolves: http://tracker.newdream.net/issues/3755 Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 96 --- 1 file changed, 92 insertions(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3302cea..742236b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -170,7 +170,7 @@ typedef void (*rbd_img_callback_t)(struct rbd_img_request *); struct rbd_obj_request; typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *); -enum obj_req_type { obj_req_bio }; /* More types to come */ +enum obj_req_type { obj_req_bio, obj_req_pages }; Should be capitalized struct rbd_obj_request { const char *object_name; @@ -182,7 +182,13 @@ struct rbd_obj_request { u32 which; /* posn image request list */ enum obj_req_type type; - struct bio *bio_list; + union { + struct bio *bio_list; + struct { + struct page **pages; + u32 page_count; + }; + }; struct ceph_osd_request *osd_req; @@ -192,6 +198,7 @@ struct rbd_obj_request { atomic_tdone; rbd_obj_callback_t callback; + struct completion completion; struct kref kref; }; @@ -1077,6 +1084,7 @@ static bool obj_req_type_valid(enum obj_req_type type) { switch (type) { case obj_req_bio: + case obj_req_pages: return true; default: return false; @@ -1291,14 +1299,23 @@ static void rbd_img_request_complete(struct rbd_img_request *img_request) rbd_img_request_put(img_request); } +/* Caller is responsible for rbd_obj_request_destroy(obj_request) */ + +static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) +{ + return wait_for_completion_interruptible(obj_request-completion); +} + static void rbd_obj_request_complete(struct rbd_obj_request *obj_request) { if (obj_request-callback) obj_request-callback(obj_request); + else + complete_all(obj_request-completion); } /* - * Request sync osd read + * Synchronously read a range from an object into a provided buffer */ static int rbd_req_sync_read(struct rbd_device *rbd_dev, const char *object_name, @@ -1556,6 +1573,11 @@ static struct ceph_osd_request *rbd_osd_req_create( /* osd client requires num pages even for bio */ osd_req-r_num_pages = calc_pages_for(offset, length); break; + case obj_req_pages: + osd_req-r_pages = obj_request-pages; + osd_req-r_num_pages = obj_request-page_count; + osd_req-r_page_alignment = offset ~PAGE_MASK; + break; } if (write_request) { @@ -1618,6 +1640,7 @@ static struct rbd_obj_request *rbd_obj_request_create(const char *object_name, obj_request-type = type; INIT_LIST_HEAD(obj_request-links); atomic_set(obj_request-done, 0); + init_completion(obj_request-completion); kref_init(obj_request-kref); return obj_request; @@ -1641,6 +1664,11 @@ static void rbd_obj_request_destroy(struct kref *kref) if (obj_request-bio_list) bio_chain_put(obj_request-bio_list); break; + case obj_req_pages: + if (obj_request-pages) + ceph_release_page_vector(obj_request-pages, + obj_request-page_count); + break; } kfree(obj_request); @@ -1988,6 +2016,65 @@ static void rbd_free_disk(struct rbd_device *rbd_dev) put_disk(disk); } +static int rbd_obj_read_sync(struct rbd_device *rbd_dev, + const char *object_name, + u64 offset, u64 length, + char *buf, u64 *version) + +{ + struct ceph_osd_req_op *op; + struct rbd_obj_request *obj_request; + struct ceph_osd_client *osdc; + struct page **pages = NULL; + u32 page_count; + int ret; + + page_count = (u32) calc_pages_for(offset,
Re: [PATCH 05/12, v2] rbd: get rid of rbd_req_sync_read()
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:33 AM, Alex Elder wrote: Delete rbd_req_sync_read() is no longer used, so get rid of it. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 24 1 file changed, 24 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 742236b..750fc73 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1315,29 +1315,6 @@ static void rbd_obj_request_complete(struct rbd_obj_request *obj_request) } /* - * Synchronously read a range from an object into a provided buffer - */ -static int rbd_req_sync_read(struct rbd_device *rbd_dev, - const char *object_name, - u64 ofs, u64 len, - char *buf, - u64 *ver) -{ - struct ceph_osd_req_op *op; - int ret; - - op = rbd_osd_req_op_create(CEPH_OSD_OP_READ, ofs, len); - if (!op) - return -ENOMEM; - - ret = rbd_req_sync_op(rbd_dev, CEPH_OSD_FLAG_READ, - op, object_name, ofs, len, buf, ver); - rbd_osd_req_op_destroy(op); - - return ret; -} - -/* * Request sync osd watch */ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev, @@ -2113,7 +2090,6 @@ rbd_dev_v1_header_read(struct rbd_device *rbd_dev, u64 *version) if (!ondisk) return ERR_PTR(-ENOMEM); - (void) rbd_req_sync_read; /* avoid a warning */ ret = rbd_obj_read_sync(rbd_dev, rbd_dev-header_name, 0, size, (char *) ondisk, version); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/12, v2] rbd: implement watch/unwatch with new code
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:33 AM, Alex Elder wrote: Implement a new function to set up or tear down a watch event for an mapped rbd image header using the new request code. Create a new object request type nodata to handle this. And define rbd_osd_trivial_callback() which simply marks a request done. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 87 +-- 1 file changed, 84 insertions(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 750fc73..7b1eddc 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -170,7 +170,7 @@ typedef void (*rbd_img_callback_t)(struct rbd_img_request *); struct rbd_obj_request; typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *); -enum obj_req_type { obj_req_bio, obj_req_pages }; +enum obj_req_type { obj_req_nodata, obj_req_bio, obj_req_pages }; struct rbd_obj_request { const char *object_name; @@ -1083,6 +1083,7 @@ static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request, static bool obj_req_type_valid(enum obj_req_type type) { switch (type) { + case obj_req_nodata: case obj_req_bio: case obj_req_pages: return true; @@ -1306,6 +1307,12 @@ static int rbd_obj_request_wait(struct rbd_obj_request *obj_request) return wait_for_completion_interruptible(obj_request-completion); } +static void rbd_osd_trivial_callback(struct rbd_obj_request *obj_request, + struct ceph_osd_op *op) +{ + atomic_set(obj_request-done, 1); +} + static void rbd_obj_request_complete(struct rbd_obj_request *obj_request) { if (obj_request-callback) @@ -1500,6 +1507,9 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, case CEPH_OSD_OP_WRITE: rbd_osd_write_callback(obj_request, op); break; + case CEPH_OSD_OP_WATCH: + rbd_osd_trivial_callback(obj_request, op); + break; default: rbd_warn(NULL, %s: unsupported op %hu\n, obj_request-object_name, (unsigned short) opcode); @@ -1543,6 +1553,8 @@ static struct ceph_osd_request *rbd_osd_req_create( rbd_assert(obj_req_type_valid(obj_request-type)); switch (obj_request-type) { + case obj_req_nodata: + break; /* Nothing to do */ case obj_req_bio: rbd_assert(obj_request-bio_list != NULL); osd_req-r_bio = obj_request-bio_list; @@ -1637,6 +1649,8 @@ static void rbd_obj_request_destroy(struct kref *kref) rbd_assert(obj_req_type_valid(obj_request-type)); switch (obj_request-type) { + case obj_req_nodata: + break; /* Nothing to do */ case obj_req_bio: if (obj_request-bio_list) bio_chain_put(obj_request-bio_list); @@ -1863,6 +1877,72 @@ static int rbd_img_request_submit(struct rbd_img_request *img_request) return 0; } +/* + * Request sync osd watch/unwatch. The value of start determines + * whether a watch request is being initiated or torn down. + */ +static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, int start) +{ + struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; + struct rbd_obj_request *obj_request; + struct ceph_osd_req_op *op; + int ret; + + rbd_assert(start ^ !!rbd_dev-watch_event); + rbd_assert(start ^ !!rbd_dev-watch_request); + + if (start) { + ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0, rbd_dev, + rbd_dev-watch_event); + if (ret 0) + return ret; + } + + ret = -ENOMEM; + obj_request = rbd_obj_request_create(rbd_dev-header_name, + 0, 0, obj_req_nodata); + if (!obj_request) + goto out_cancel; + + op = rbd_osd_req_op_create(CEPH_OSD_OP_WATCH, + rbd_dev-watch_event-cookie, + rbd_dev-header.obj_version, start); + if (!op) + goto out_cancel; + obj_request-osd_req = rbd_osd_req_create(rbd_dev, true, + obj_request, op); + rbd_osd_req_op_destroy(op); + if (!obj_request-osd_req) + goto out_cancel; + + if (start) { + rbd_dev-watch_request = obj_request-osd_req; + ceph_osdc_set_request_linger(osdc, rbd_dev-watch_request); + } + ret = rbd_obj_request_submit(osdc, obj_request); + if (ret) + goto out_cancel; + ret = rbd_obj_request_wait(obj_request); + if (ret) + goto out_cancel; + + ret = obj_request-result; + if
Re: [PATCH 07/12, v2] rbd: get rid of rbd_req_sync_watch()
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:34 AM, Alex Elder wrote: Get rid of rbd_req_sync_watch(), because it is no longer used. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 42 -- 1 file changed, 42 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 7b1eddc..8f659f3 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1367,47 +1367,6 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) rbd_req_sync_notify_ack(rbd_dev, hver, notify_id); } -/* - * Request sync osd watch/unwatch. The value of start determines - * whether a watch request is being initiated or torn down. - */ -static int rbd_req_sync_watch(struct rbd_device *rbd_dev, int start) -{ - struct ceph_osd_req_op *op; - int ret = 0; - - rbd_assert(start ^ !!rbd_dev-watch_event); - rbd_assert(start ^ !!rbd_dev-watch_request); - - if (start) { - struct ceph_osd_client *osdc; - - osdc = rbd_dev-rbd_client-client-osdc; - ret = ceph_osdc_create_event(osdc, rbd_watch_cb, 0, rbd_dev, - rbd_dev-watch_event); - if (ret 0) - return ret; - } - - op = rbd_osd_req_op_create(CEPH_OSD_OP_WATCH, - rbd_dev-watch_event-cookie, - rbd_dev-header.obj_version, start); - if (op) - ret = rbd_req_sync_op(rbd_dev, - CEPH_OSD_FLAG_WRITE | CEPH_OSD_FLAG_ONDISK, - op, rbd_dev-header_name, - 0, 0, NULL, NULL); - - /* Cancel the event if we're tearing down, or on error */ - - if (!start || !op || ret 0) { - ceph_osdc_cancel_event(rbd_dev-watch_event); - rbd_dev-watch_event = NULL; - } - rbd_osd_req_op_destroy(op); - - return ret; -} /* * Synchronous osd object method call @@ -3960,7 +3919,6 @@ static int rbd_dev_probe_finish(struct rbd_device *rbd_dev) if (ret) goto err_out_bus; - (void) rbd_req_sync_watch; /* avoid a warning */ ret = rbd_dev_header_watch_sync(rbd_dev, 1); if (ret) goto err_out_bus; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/12, v2] rbd: use new code for notify ack
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:34 AM, Alex Elder wrote: Use the new object request tracking mechanism for handling a notify_ack request. Move the callback function below the definition of this so we don't have to do a pre-declaration. This resolves: http://tracker.newdream.net/issues/3754 Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 76 +-- 1 file changed, 55 insertions(+), 21 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 8f659f3..e2b6230 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1347,27 +1347,6 @@ static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev, return ret; } -static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) -{ - struct rbd_device *rbd_dev = (struct rbd_device *)data; - u64 hver; - int rc; - - if (!rbd_dev) - return; - - dout(rbd_watch_cb %s notify_id=%llu opcode=%u\n, - rbd_dev-header_name, (unsigned long long) notify_id, - (unsigned int) opcode); - rc = rbd_dev_refresh(rbd_dev, hver); - if (rc) - rbd_warn(rbd_dev, got notification but failed to - update snaps: %d\n, rc); - - rbd_req_sync_notify_ack(rbd_dev, hver, notify_id); -} - - /* * Synchronous osd object method call */ @@ -1466,6 +1445,7 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, case CEPH_OSD_OP_WRITE: rbd_osd_write_callback(obj_request, op); break; + case CEPH_OSD_OP_NOTIFY_ACK: case CEPH_OSD_OP_WATCH: rbd_osd_trivial_callback(obj_request, op); break; @@ -1836,6 +1816,60 @@ static int rbd_img_request_submit(struct rbd_img_request *img_request) return 0; } +static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev, + u64 ver, u64 notify_id) +{ + struct rbd_obj_request *obj_request; + struct ceph_osd_req_op *op; + struct ceph_osd_client *osdc; + int ret; + + obj_request = rbd_obj_request_create(rbd_dev-header_name, 0, 0, + obj_req_nodata); + if (!obj_request) + return -ENOMEM; + + ret = -ENOMEM; + op = rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, notify_id, ver); + if (!op) + goto out; + obj_request-osd_req = rbd_osd_req_create(rbd_dev, false, + obj_request, op); + rbd_osd_req_op_destroy(op); + if (!obj_request-osd_req) + goto out; + + osdc = rbd_dev-rbd_client-client-osdc; + ret = rbd_obj_request_submit(osdc, obj_request); + if (!ret) + ret = rbd_obj_request_wait(obj_request); +out: + rbd_obj_request_put(obj_request); + + return ret; +} + +static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) +{ + struct rbd_device *rbd_dev = (struct rbd_device *)data; + u64 hver; + int rc; + + if (!rbd_dev) + return; + + dout(rbd_watch_cb %s notify_id=%llu opcode=%u\n, + rbd_dev-header_name, (unsigned long long) notify_id, + (unsigned int) opcode); + rc = rbd_dev_refresh(rbd_dev, hver); + if (rc) + rbd_warn(rbd_dev, got notification but failed to + update snaps: %d\n, rc); + + (void) rbd_req_sync_notify_ack; /* avoid a warning */ + rbd_obj_notify_ack_sync(rbd_dev, hver, notify_id); +} + /* * Request sync osd watch/unwatch. The value of start determines * whether a watch request is being initiated or torn down. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/12, v2] rbd: get rid of rbd_req_sync_notify_ack()
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:35 AM, Alex Elder wrote: Get rid rbd_req_sync_notify_ack() because it is no longer used. As a result rbd_simple_req_cb() becomes unreferenced, so get rid of that too. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 33 - 1 file changed, 33 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index e2b6230..b952b2f 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1240,12 +1240,6 @@ done_err: return ret; } -static void rbd_simple_req_cb(struct ceph_osd_request *osd_req, - struct ceph_msg *msg) -{ - ceph_osdc_put_request(osd_req); -} - /* * Do a synchronous ceph osd operation */ @@ -1322,32 +1316,6 @@ static void rbd_obj_request_complete(struct rbd_obj_request *obj_request) } /* - * Request sync osd watch - */ -static int rbd_req_sync_notify_ack(struct rbd_device *rbd_dev, - u64 ver, - u64 notify_id) -{ - struct ceph_osd_req_op *op; - int ret; - - op = rbd_osd_req_op_create(CEPH_OSD_OP_NOTIFY_ACK, notify_id, ver); - if (!op) - return -ENOMEM; - - ret = rbd_do_request(NULL, rbd_dev, NULL, CEPH_NOSNAP, - rbd_dev-header_name, 0, 0, NULL, - NULL, 0, - CEPH_OSD_FLAG_READ, - op, - rbd_simple_req_cb, NULL); - - rbd_osd_req_op_destroy(op); - - return ret; -} - -/* * Synchronous osd object method call */ static int rbd_req_sync_exec(struct rbd_device *rbd_dev, @@ -1866,7 +1834,6 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) rbd_warn(rbd_dev, got notification but failed to update snaps: %d\n, rc); - (void) rbd_req_sync_notify_ack; /* avoid a warning */ rbd_obj_notify_ack_sync(rbd_dev, hver, notify_id); } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/12, v2] rbd: send notify ack asynchronously
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:35 AM, Alex Elder wrote: When we receive notification of a change to an rbd image's header object we need to refresh our information about the image (its size and snapshot context). Once we have refreshed our rbd image we need to acknowledge the notification. This acknowledgement was previously done synchronously, but there's really no need to wait for it to complete. Change it so the caller doesn't wait for the notify acknowledgement request to complete. And change the name to reflect it's no longer synchronous. This resolves: http://tracker.newdream.net/issues/3877 Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index b952b2f..48650d1 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1784,7 +1784,7 @@ static int rbd_img_request_submit(struct rbd_img_request *img_request) return 0; } -static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev, +static int rbd_obj_notify_ack(struct rbd_device *rbd_dev, u64 ver, u64 notify_id) { struct rbd_obj_request *obj_request; @@ -1808,11 +1808,11 @@ static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev, goto out; osdc = rbd_dev-rbd_client-client-osdc; + obj_request-callback = rbd_obj_request_put; ret = rbd_obj_request_submit(osdc, obj_request); - if (!ret) - ret = rbd_obj_request_wait(obj_request); out: - rbd_obj_request_put(obj_request); + if (ret) + rbd_obj_request_put(obj_request); return ret; } @@ -1834,7 +1834,7 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) rbd_warn(rbd_dev, got notification but failed to update snaps: %d\n, rc); - rbd_obj_notify_ack_sync(rbd_dev, hver, notify_id); + rbd_obj_notify_ack(rbd_dev, hver, notify_id); } /* -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 11/12, v2] rbd: implement sync method with new code
With the version parameter removed now or in a later patch: Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:36 AM, Alex Elder wrote: Reimplement synchronous object method calls using the new request tracking code. Use the name rbd_obj_method_sync() Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 111 +++ 1 file changed, 94 insertions(+), 17 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 48650d1..5ad2ac2 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1413,6 +1413,7 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req, case CEPH_OSD_OP_WRITE: rbd_osd_write_callback(obj_request, op); break; + case CEPH_OSD_OP_CALL: case CEPH_OSD_OP_NOTIFY_ACK: case CEPH_OSD_OP_WATCH: rbd_osd_trivial_callback(obj_request, op); @@ -1903,6 +1904,81 @@ done: return ret; } +/* + * Synchronous osd object method call + */ +static int rbd_obj_method_sync(struct rbd_device *rbd_dev, +const char *object_name, +const char *class_name, +const char *method_name, +const char *outbound, +size_t outbound_size, +char *inbound, +size_t inbound_size, +u64 *version) You can get rid of the version parameter. It's not useful to any current or planned rbd code that I can think of. +{ + struct rbd_obj_request *obj_request; + struct ceph_osd_client *osdc; + struct ceph_osd_req_op *op; + struct page **pages; + u32 page_count; + int ret; + + /* +* Method calls are ultimately read operations but they +* don't involve object data (so no offset or length). +* The result should placed into the inbound buffer +* provided. They also supply outbound data--parameters for +* the object method. Currently if this is present it will +* be a snapshot id. +*/ + page_count = (u32) calc_pages_for(0, inbound_size); + pages = ceph_alloc_page_vector(page_count, GFP_KERNEL); + if (IS_ERR(pages)) + return PTR_ERR(pages); + + ret = -ENOMEM; + obj_request = rbd_obj_request_create(object_name, 0, 0, obj_req_pages); + if (!obj_request) + goto out; + + obj_request-pages = pages; + obj_request-page_count = page_count; + + op = rbd_osd_req_op_create(CEPH_OSD_OP_CALL, class_name, + method_name, outbound, outbound_size); + if (!op) + goto out; + obj_request-osd_req = rbd_osd_req_create(rbd_dev, false, + obj_request, op); + rbd_osd_req_op_destroy(op); + if (!obj_request-osd_req) + goto out; + + osdc = rbd_dev-rbd_client-client-osdc; + ret = rbd_obj_request_submit(osdc, obj_request); + if (ret) + goto out; + ret = rbd_obj_request_wait(obj_request); + if (ret) + goto out; + + ret = obj_request-result; + if (ret 0) + goto out; + ret = ceph_copy_from_page_vector(pages, inbound, 0, + obj_request-xferred); + if (version) + *version = obj_request-version; Here too +out: + if (obj_request) + rbd_obj_request_put(obj_request); + else + ceph_release_page_vector(pages, page_count); + + return ret; +} + static void rbd_request_fn(struct request_queue *q) { struct rbd_device *rbd_dev = q-queuedata; @@ -2753,11 +2829,12 @@ static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, __le64 size; } __attribute__ ((packed)) size_buf = { 0 }; - ret = rbd_req_sync_exec(rbd_dev, rbd_dev-header_name, + (void) rbd_req_sync_exec; /* Avoid a warning */ + ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name, rbd, get_size, (char *) snapid, sizeof (snapid), (char *) size_buf, sizeof (size_buf), NULL); - dout(%s: rbd_req_sync_exec returned %d\n, __func__, ret); + dout(%s: rbd_obj_method_sync returned %d\n, __func__, ret); if (ret 0) return ret; @@ -2788,14 +2865,14 @@ static int rbd_dev_v2_object_prefix(struct rbd_device *rbd_dev) if (!reply_buf) return -ENOMEM; - ret = rbd_req_sync_exec(rbd_dev, rbd_dev-header_name, + ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name, rbd, get_object_prefix, NULL, 0,
Re: [PATCH 12/12, v2] rbd: get rid of rbd_req_sync_exec()
Reviewed-by: Josh Durgin josh.dur...@inktank.com On 01/24/2013 08:36 AM, Alex Elder wrote: Get rid rbd_req_sync_exec() because it is no longer used. That eliminates the last use of rbd_req_sync_op(), so get rid of that too. And finally, that leaves rbd_do_request() unreferenced, so get rid of that. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 160 --- 1 file changed, 160 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 5ad2ac2..c807a4c 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1160,126 +1160,6 @@ static void rbd_osd_req_op_destroy(struct ceph_osd_req_op *op) kfree(op); } -/* - * Send ceph osd request - */ -static int rbd_do_request(struct request *rq, - struct rbd_device *rbd_dev, - struct ceph_snap_context *snapc, - u64 snapid, - const char *object_name, u64 ofs, u64 len, - struct bio *bio, - struct page **pages, - int num_pages, - int flags, - struct ceph_osd_req_op *op, - void (*rbd_cb)(struct ceph_osd_request *, -struct ceph_msg *), - u64 *ver) -{ - struct ceph_osd_client *osdc; - struct ceph_osd_request *osd_req; - struct timespec mtime = CURRENT_TIME; - int ret; - - dout(rbd_do_request object_name=%s ofs=%llu len=%llu\n, - object_name, (unsigned long long) ofs, - (unsigned long long) len); - - osdc = rbd_dev-rbd_client-client-osdc; - osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_NOIO); - if (!osd_req) - return -ENOMEM; - - osd_req-r_flags = flags; - osd_req-r_pages = pages; - if (bio) { - osd_req-r_bio = bio; - bio_get(osd_req-r_bio); - } - - osd_req-r_callback = rbd_cb; - osd_req-r_priv = NULL; - - strncpy(osd_req-r_oid, object_name, sizeof(osd_req-r_oid)); - osd_req-r_oid_len = strlen(osd_req-r_oid); - - osd_req-r_file_layout = rbd_dev-layout; /* struct */ - osd_req-r_num_pages = calc_pages_for(ofs, len); - osd_req-r_page_alignment = ofs ~PAGE_MASK; - - ceph_osdc_build_request(osd_req, ofs, len, 1, op, - snapc, snapid, mtime); - - if (op-op == CEPH_OSD_OP_WATCH op-watch.flag) { - ceph_osdc_set_request_linger(osdc, osd_req); - rbd_dev-watch_request = osd_req; - } - - ret = ceph_osdc_start_request(osdc, osd_req, false); - if (ret 0) - goto done_err; - - if (!rbd_cb) { - u64 version; - - ret = ceph_osdc_wait_request(osdc, osd_req); - version = le64_to_cpu(osd_req-r_reassert_version.version); - if (ver) - *ver = version; - dout(reassert_ver=%llu\n, (unsigned long long) version); - ceph_osdc_put_request(osd_req); - } - return ret; - -done_err: - if (bio) - bio_chain_put(osd_req-r_bio); - ceph_osdc_put_request(osd_req); - - return ret; -} - -/* - * Do a synchronous ceph osd operation - */ -static int rbd_req_sync_op(struct rbd_device *rbd_dev, - int flags, - struct ceph_osd_req_op *op, - const char *object_name, - u64 ofs, u64 inbound_size, - char *inbound, - u64 *ver) -{ - int ret; - struct page **pages; - int num_pages; - - rbd_assert(op != NULL); - - num_pages = calc_pages_for(ofs, inbound_size); - pages = ceph_alloc_page_vector(num_pages, GFP_KERNEL); - if (IS_ERR(pages)) - return PTR_ERR(pages); - - ret = rbd_do_request(NULL, rbd_dev, NULL, CEPH_NOSNAP, - object_name, ofs, inbound_size, NULL, - pages, num_pages, - flags, - op, - NULL, - ver); - if (ret 0) - goto done; - - if ((flags CEPH_OSD_FLAG_READ) inbound) - ret = ceph_copy_from_page_vector(pages, inbound, ofs, ret); - -done: - ceph_release_page_vector(pages, num_pages); - return ret; -} - static int rbd_obj_request_submit(struct ceph_osd_client *osdc, struct rbd_obj_request *obj_request) { @@ -1315,45 +1195,6 @@ static void rbd_obj_request_complete(struct rbd_obj_request *obj_request) complete_all(obj_request-completion); } -/* - * Synchronous osd object method call - */ -static int
RE: Ceph Production Environment Setup and Configurations?
[The following views only behalf of myself, not relate with Intel..] Looking forward for the performance data on Atom. Atom perform badly in Swift, but since Ceph is slightly efficient than Swift, it must be better. I have some concern about weather Atom can support such high throughput( you have 16 disks, assuming 50MB/s per disk, you would like to have 50*16 *2(include journal write)=1600MB/s ,together with corresponding network throughput, say 10GbE. The total IO throughput seems too high for Atom. Baidu (www.baidu.com) use Arm based storage node(not using ceph,but their own DFS), a node with a quad-core Arm together with 4 disks, and a 2U box can put up to 6 nodes, the 6 nodes share a 10Gb NIC. Although Atom is different with Arm, but you can take Baidu as a reference. -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gandalf Corvotempesta Sent: 2013年1月29日 17:25 To: femi anjorin Cc: ceph-devel@vger.kernel.org; Ross Turk Subject: Re: Ceph Production Environment Setup and Configurations? 2013/1/29 femi anjorin femi.anjo...@gmail.com: CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz Atom? For which kind of role ? -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph Production Environment Setup and Configurations?
On 01/29/2013 11:25 AM, Chen, Xiaoxi wrote: [The following views only behalf of myself, not relate with Intel..] Looking forward for the performance data on Atom. Atom perform badly in Swift, but since Ceph is slightly efficient than Swift, it must be better. I have some concern about weather Atom can support such high throughput( you have 16 disks, assuming 50MB/s per disk, you would like to have 50*16 *2(include journal write)=1600MB/s ,together with corresponding network throughput, say 10GbE. The total IO throughput seems too high for Atom. Baidu (www.baidu.com) use Arm based storage node(not using ceph,but their own DFS), a node with a quad-core Arm together with 4 disks, and a 2U box can put up to 6 nodes, the 6 nodes share a 10Gb NIC. Although Atom is different with Arm, but you can take Baidu as a reference. -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gandalf Corvotempesta Sent: 2013年1月29日 17:25 To: femi anjorin Cc: ceph-devel@vger.kernel.org; Ross Turk Subject: Re: Ceph Production Environment Setup and Configurations? 2013/1/29 femi anjorin femi.anjo...@gmail.com: CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz Atom? For which kind of role ? -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html N�Р骒r��yb�X�肚�v�^�)藓{.n�+���z�]z鳐�{ay��,j��f"�h���z��wア� ⒎�j:+v���w�j�m������赙zZ+�茛j��!tml= FWIW, Wido had been testing Ceph on Atoms a while back. I don't know what conclusions he reached, or even if he has already reached any conclusions at all. I, for one, would be really interested in knowing how well (or how poorly) Atoms deal with Ceph. Maybe Wido can share some thoughts on this one? :-) -Joao -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] utime: fix narrowing conversion compiler warning in sleep()
Hi, could someone cherry-pick this patch to the bobtail branch? I see the same issue there. Thanks! Danny Am 27.01.2013 21:57, schrieb Danny Al-Gaaf: Fix compiler warning: ./include/utime.h: In member function 'void utime_t::sleep()': ./include/utime.h:139:50: warning: narrowing conversion of '((utime_t*)this)-utime_t::tv.utime_t::anonymous struct::tv_sec' from '__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is ill-formed in C++11 [-Wnarrowing] ./include/utime.h:139:50: warning: narrowing conversion of '((utime_t*)this)-utime_t::tv.utime_t::anonymous struct::tv_nsec' from '__u32 {aka unsigned int}' to 'long int' inside { } is ill-formed in C++11 [-Wnarrowing] Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de --- src/include/utime.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/include/utime.h b/src/include/utime.h index 526dec5..f433fff 100644 --- a/src/include/utime.h +++ b/src/include/utime.h @@ -136,7 +136,7 @@ public: } void sleep() { -struct timespec ts = { tv.tv_sec, tv.tv_nsec }; +struct timespec ts = { (__time_t)tv.tv_sec, (long)tv.tv_nsec }; nanosleep(ts, ts); } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph Production Environment Setup?
On 01/29/2013 02:56 AM, femi anjorin wrote: Please can anyone an advise on how exactly a CEPH production environment should look like? and what the configuration files should be. My hardwares include the following: Server A, B, C configuration CPU - Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz RAM - 16GB Hard drive - 500GB SSD - 120GB Server D,E,F,G,H,J configuration CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz RAM - 4 GB Boot drive - 320gb SSD - 120 GB Storage drives - 16 X 2 TB I am thinking of these configurations but i am not sure. Server A - MDS and MON Server B - MON Server C - MON Server D, E,F,G,H,J - OSD Those 16GB RAM on the monitor nodes vs the 4GB RAM on the osd nodes seem to be a bit wrong to me. The OSDs tend to require much more RAM, for instance for recovery, while the monitor is not as heavy on the memory -- if a cluster grows significantly large, the in-memory maps may grow a lot too, but that reason alone shouldn't be the reason you would give a monitor 16GB RAM and 4GB for an osd. Furthermore, I see you have 16x2TB storage drives. Is that per OSD node? I'm assuming that's what you're aiming to do, so how many OSDs were you thinking of running on the same host? Usually we go for 1 OSD per drive, but you might have something else on your mind. I am not an expert on server configuration, but my point is that, if you are going to have more than one OSD on the same host, your RAM sure looks smaller than what I would envision. BTW, not sure if you're placing SSDs on the monitor/mds nodes with the same intent as when you place it on the OSD nodes (keeping the osd journal, maybe?), but if you indeed intend to keep the daemons journals in them I think you should know that the monitor and the mds don't keep a journal. The monitors do keep a store on disk, but the mds don't even do that, instead keeping its data directly on the osds and whatever it needs in-memory. -Joao -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph Production Environment Setup?
There is also the hardware recommendation page in the ceph docs ( http://ceph.com/docs/master/install/hardware-recommendations/ ) Basically they recommend something like ~ 1GHz CPU (or 1 core/osd), 500M-1GB RAM pr OSD daemon. Also most run with 1 OSD daemon pr. disk (so if you put 16x disk pr. node you'll vastly overpower your atom cpu) Overall, while the cluster chugs along happily the hw specs are relatively modest; as soon as it starts to recover you'll see high cpu/mem usage. Cheers, Martin On Tue, Jan 29, 2013 at 3:56 AM, femi anjorin femi.anjo...@gmail.com wrote: Please can anyone an advise on how exactly a CEPH production environment should look like? and what the configuration files should be. My hardwares include the following: Server A, B, C configuration CPU - Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz RAM - 16GB Hard drive - 500GB SSD - 120GB Server D,E,F,G,H,J configuration CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz RAM - 4 GB Boot drive - 320gb SSD - 120 GB Storage drives - 16 X 2 TB I am thinking of these configurations but i am not sure. Server A - MDS and MON Server B - MON Server C - MON Server D, E,F,G,H,J - OSD Regards. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph Production Environment Setup and Configurations?
On 01/29/2013 05:34 AM, Joao Eduardo Luis wrote: On 01/29/2013 11:25 AM, Chen, Xiaoxi wrote: [The following views only behalf of myself, not relate with Intel..] Looking forward for the performance data on Atom. Atom perform badly in Swift, but since Ceph is slightly efficient than Swift, it must be better. I have some concern about weather Atom can support such high throughput( you have 16 disks, assuming 50MB/s per disk, you would like to have 50*16 *2(include journal write)=1600MB/s ,together with corresponding network throughput, say 10GbE. The total IO throughput seems too high for Atom. Baidu (www.baidu.com) use Arm based storage node(not using ceph,but their own DFS), a node with a quad-core Arm together with 4 disks, and a 2U box can put up to 6 nodes, the 6 nodes share a 10Gb NIC. Although Atom is different with Arm, but you can take Baidu as a reference. -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gandalf Corvotempesta Sent: 2013年1月29日 17:25 To: femi anjorin Cc: ceph-devel@vger.kernel.org; Ross Turk Subject: Re: Ceph Production Environment Setup and Configurations? 2013/1/29 femi anjorin femi.anjo...@gmail.com: CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz Atom? For which kind of role ? -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html N�Р骒r��yb�X�肚�v�^�)藓{.n�+���z�]z鳐�{ay��,j��f"�h���z��wア� ⒎�j:+v���w�j�m������赙zZ+�茛j��!tml= FWIW, Wido had been testing Ceph on Atoms a while back. I don't know what conclusions he reached, or even if he has already reached any conclusions at all. I, for one, would be really interested in knowing how well (or how poorly) Atoms deal with Ceph. Maybe Wido can share some thoughts on this one? :-) -Joao -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Just FYI, I don't expect you'll be able to get anywhere close to 10GbE performance on the Atom. You'll almost certainly be CPU bound long before that. I actually would be curious how much throughput you could push to those disks just doing straight fio tests and how much CPU overhead you'd see. I imagine you could max out the CPU without even having Ceph involved. I'm guessing you probably are going to be limited to about 0.75-1 OSD per atom core, and that's assuming that your network and SAS controllers aren't CPU hogs. Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ceph Production Environment Setup?
Thanks. I have upgraded all the systems to quad core processor machine with 32 GB RAM although i still have 16 hard drives on each of the storage nodes. 16 hard drives means i should have 16 OSD daemon but i dont know what the OSD configuration should look like in ceph.conf. I mounted the disk under the OSD data directory according to http://ceph.com/docs/master/rados/deployment/mkcephfs/ the mount looks like this: /dev/sda1 on /var/lib/ceph/osd/ceph-0 /dev/sdb1 on /var/lib/ceph/osd/ceph-1 /dev/sdc1 on /var/lib/ceph/osd/ceph-2 /dev/sdd1 on /var/lib/ceph/osd/ceph-3 /dev/sde1 on /var/lib/ceph/osd/ceph-4 /dev/sdf1 on /var/lib/ceph/osd/ceph-5 /dev/sdg1 on /var/lib/ceph/osd/ceph-6 /dev/sdh1 on /var/lib/ceph/osd/ceph-7 /dev/sdi1 on /var/lib/ceph/osd/ceph-8 /dev/sdj1 on /var/lib/ceph/osd/ceph-9 /dev/sdk1 on /var/lib/ceph/osd/ceph-10 /dev/sdl1 on /var/lib/ceph/osd/ceph-11 /dev/sdm1 on /var/lib/ceph/osd/ceph-12 /dev/sdn1 on /var/lib/ceph/osd/ceph-13 /dev/sdo1 on /var/lib/ceph/osd/ceph-14 /dev/sdp1 on /var/lib/ceph/osd/ceph-15 BUT I dont know how the OSD configuration should look like? I see the following in this ceph reference : http://ceph.com/docs/master/rados/deployment/mkcephfs/ For each [osd.n] section of your configuration file, specify the storage device. For example: [osd.1] devs = /dev/sda [osd.2] devs = /dev/sdb I guess this is a configuration for one hard drive. What should the OSD config be with 16 drives in one host? Regards, Femi. On Tue, Jan 29, 2013 at 1:39 PM, Martin B Nielsen mar...@unity3d.com wrote: There is also the hardware recommendation page in the ceph docs ( http://ceph.com/docs/master/install/hardware-recommendations/ ) Basically they recommend something like ~ 1GHz CPU (or 1 core/osd), 500M-1GB RAM pr OSD daemon. Also most run with 1 OSD daemon pr. disk (so if you put 16x disk pr. node you'll vastly overpower your atom cpu) Overall, while the cluster chugs along happily the hw specs are relatively modest; as soon as it starts to recover you'll see high cpu/mem usage. Cheers, Martin On Tue, Jan 29, 2013 at 3:56 AM, femi anjorin femi.anjo...@gmail.com wrote: Please can anyone an advise on how exactly a CEPH production environment should look like? and what the configuration files should be. My hardwares include the following: Server A, B, C configuration CPU - Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz RAM - 16GB Hard drive - 500GB SSD - 120GB Server D,E,F,G,H,J configuration CPU - Intel(R) Atom(TM) CPU D525 @ 1.80GHz RAM - 4 GB Boot drive - 320gb SSD - 120 GB Storage drives - 16 X 2 TB I am thinking of these configurations but i am not sure. Server A - MDS and MON Server B - MON Server C - MON Server D, E,F,G,H,J - OSD Regards. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Testing nonsubscriber posts
ignore -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: page allocation failures on osd nodes
On Mon, Jan 28, 2013 at 8:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote: On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote: Ahem. once on almost empty node same trace produced by qemu process(which was actually pinned to the specific numa node), so seems that`s generally is a some scheduler/mm bug, not directly related to the osd processes. In other words, the less percentage of memory actually is an RSS, the more is a probability of such allocation failure. This might be a known bug in xen for your kernel? The xen users list might be able to help. -sam It is vanilla-3.4, I really wonder from where comes paravirt bits in the trace. Bug exposed only in 3.4 and really harmless, at least in ways I have tested that. Ceph-osd memory allocation behavior more likely to trigger those messages than most other applications in ``same'' conditions. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: page allocation failures on osd nodes
On Tue, Jan 29, 2013 at 2:21 PM, Andrey Korolyov and...@xdel.ru wrote: On Mon, Jan 28, 2013 at 8:55 PM, Andrey Korolyov and...@xdel.ru wrote: On Mon, Jan 28, 2013 at 5:48 PM, Sam Lang sam.l...@inktank.com wrote: On Sun, Jan 27, 2013 at 2:52 PM, Andrey Korolyov and...@xdel.ru wrote: Ahem. once on almost empty node same trace produced by qemu process(which was actually pinned to the specific numa node), so seems that`s generally is a some scheduler/mm bug, not directly related to the osd processes. In other words, the less percentage of memory actually is an RSS, the more is a probability of such allocation failure. This might be a known bug in xen for your kernel? The xen users list might be able to help. -sam It is vanilla-3.4, I really wonder from where comes paravirt bits in the trace. Bug exposed only in 3.4 and really harmless, at least in ways I have tested that. Ceph-osd memory allocation behavior more likely to trigger those messages than most other applications in ``same'' conditions. Good to know. Thanks! -sam -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] utime: fix narrowing conversion compiler warning in sleep()
On Tue, 29 Jan 2013, Danny Al-Gaaf wrote: Hi, could someone cherry-pick this patch to the bobtail branch? I see the same issue there. Cherry-picked, thanks! s Thanks! Danny Am 27.01.2013 21:57, schrieb Danny Al-Gaaf: Fix compiler warning: ./include/utime.h: In member function 'void utime_t::sleep()': ./include/utime.h:139:50: warning: narrowing conversion of '((utime_t*)this)-utime_t::tv.utime_t::anonymous struct::tv_sec' from '__u32 {aka unsigned int}' to '__time_t {aka long int}' inside { } is ill-formed in C++11 [-Wnarrowing] ./include/utime.h:139:50: warning: narrowing conversion of '((utime_t*)this)-utime_t::tv.utime_t::anonymous struct::tv_nsec' from '__u32 {aka unsigned int}' to 'long int' inside { } is ill-formed in C++11 [-Wnarrowing] Signed-off-by: Danny Al-Gaaf danny.al-g...@bisect.de --- src/include/utime.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/include/utime.h b/src/include/utime.h index 526dec5..f433fff 100644 --- a/src/include/utime.h +++ b/src/include/utime.h @@ -136,7 +136,7 @@ public: } void sleep() { -struct timespec ts = { tv.tv_sec, tv.tv_nsec }; +struct timespec ts = { (__time_t)tv.tv_sec, (long)tv.tv_nsec }; nanosleep(ts, ts); } -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph] locking fun with d_materialise_unique()
Hi Al, On Tue, 29 Jan 2013, Al Viro wrote: On Mon, Jan 28, 2013 at 09:20:34PM -0800, Sage Weil wrote: Yep, that is indeed a problem. I think we just need to do the r_aborted and/or r_locked_dir check in the else if condition... I'm not sure if we are guaranteed that ceph_readdir_prepopulate() won't get to its splice_dentry() and d_delete() calls in similar situations - I hadn't checked that one yet. If it isn't guaranteed, we have a problem there as well. ...and the condition guarding readdir_prepopulate(). :) I think you're reading it correctly. The main thing to keep in mind here is that we *do* need to call fill_inode() for the inode metadata on these requests to keep the mds and client state in sync. The dentry state is safe to ignore. You mean the parts under if (rinfo-head-is_dentry) { and if (rinfo-head-is_target) { in there? Because there's fill_inode() called from readdir_prepopulate() and it's a lot more problematic than those two... Yep, patch below. It would be great to have the dir i_mutex rules summarized somewhere, even if it is just a copy of the below. It took a fair bit of trial and error to infer what was going on when writing this code. :) Directory -i_mutex rules are in part documented - what VFS guarantees to hold side is in Documentation/filesystems/directory-locking. It's the other side (what locks are expected to be held by callers of dcache.c functions) that is badly missing... Ping me when you've pushed that branch and I'll take a look... To gitol...@ra.kernel.org:/pub/scm/linux/kernel/git/viro/vfs.git 01a88fa..4056362 master - master with tentative ceph patch in the very end. Should be on git.kernel.org shortly... We should drop teh mds_client.c hunk from your patch, and then do something like the below. I'll put it in the ceph tree so we can do some basic testing. Unfortunately, the abort case is a bit annoying to trigger.. we'll need to write a new test for it. Thanks- sage From 80d70ef02d5277af8e1355db74fffe71f7217e76 Mon Sep 17 00:00:00 2001 From: Sage Weil s...@inktank.com Date: Tue, 29 Jan 2013 13:01:15 -0800 Subject: [PATCH] ceph: prepopulate inodes only when request is aborted If r_aborted is true, we do not hold the dir i_mutex, and cannot touch the dcache. However, we still need to update the inodes with the state returned by the MDS. Reported-by: Al Viro v...@zeniv.linux.org.uk Signed-off-by: Sage Weil s...@inktank.com --- fs/ceph/inode.c | 36 1 file changed, 36 insertions(+) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 77b1b18..b51653e 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -1195,6 +1195,39 @@ done: /* * Prepopulate our cache with readdir results, leases, etc. */ +static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req, + struct ceph_mds_session *session) +{ + struct ceph_mds_reply_info_parsed *rinfo = req-r_reply_info; + int i, err = 0; + + for (i = 0; i rinfo-dir_nr; i++) { + struct ceph_vino vino; + struct inode *in; + int rc; + + vino.ino = le64_to_cpu(rinfo-dir_in[i].in-ino); + vino.snap = le64_to_cpu(rinfo-dir_in[i].in-snapid); + + in = ceph_get_inode(req-r_dentry-d_sb, vino); + if (IS_ERR(in)) { + err = PTR_ERR(in); + dout(new_inode badness got %d\n, err); + continue; + } + rc = fill_inode(in, rinfo-dir_in[i], NULL, session, + req-r_request_started, -1, + req-r_caps_reservation); + if (rc 0) { + pr_err(fill_inode badness on %p got %d\n, in, rc); + err = rc; + continue; + } + } + + return err; +} + int ceph_readdir_prepopulate(struct ceph_mds_request *req, struct ceph_mds_session *session) { @@ -1209,6 +1242,9 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req, u64 frag = le32_to_cpu(rhead-args.readdir.frag); struct ceph_dentry_info *di; + if (req-r_aborted) + return readdir_prepopulate_inodes_only(req, session); + if (le32_to_cpu(rinfo-head-op) == CEPH_MDS_OP_LSSNAP) { snapdir = ceph_get_snapdir(parent-d_inode); parent = d_find_alias(snapdir); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 01/12, v2] rbd: new request tracking code
On 01/29/2013 04:43 AM, Josh Durgin wrote: On 01/24/2013 06:08 AM, Alex Elder wrote: This is an update of the first patch in my request tracking code series. After posting it the other day I identified some problems related to reference counting of image and object requests. I also am starting to look at the details of implementing layered reads, and ended up making some substantive changes. Since I have not seen any review feedback I thought the best thing would be to just re-post the updated patch. The remaining patches in the series have changed accordingly, but they have not really changed substantively, so I am not re-posting those (but will if it's requested). The main functional change is that an image request no longer maintains an array of object request pointers, it maintains a list of object requests. This simplifies some things, and makes the image request structure fixed size. A few other functional changes: - Reference counting of object and image requests is now done sensibly. - Image requests now support a callback when complete, which will be used for layered I/O requests. - There are a few new helper functions that encapsulate tying an object request to an image request. - An distinct value is now used for the which field for object requests not associated with a image request (mainly used for validation/assertions). Other changes: - Everything that was named image_request now uses img_request instead. - A few blocks and lines of code have been rearranged. The updated series is available on the ceph-client git repository in the branch wip-rbd-review-v2. -Alex This patch fully implements the new request tracking code for rbd I/O requests. Responses to your review comments are below. Thank you very much for careful, thorough, and thoughtful work on this. I believe it's very important and you do a good job of it. -Alex Each I/O request to an rbd image will get an rbd_image_request structure allocated to track it. This provides access to all information about the original request, as well as access to the set of one or more object requests that are initiated as a result of the image request. An rbd_obj_request structure defines a request sent to a single osd object (possibly) as part of an rbd image request. An rbd object request refers to a ceph_osd_request structure built up to represent the request; for now it will contain a single osd operation. It also provides space to hold the result status and the version of the object when the osd request completes. An rbd_obj_request structure can also stand on its own. This will be used for reading the version 1 header object, for issuing acknowledgements to event notifications, and for making object method calls. All rbd object requests now complete asynchronously with respect to the osd client--they supply a common callback routine. This resolves: http://tracker.newdream.net/issues/3741 Signed-off-by: Alex Elder el...@inktank.com --- v2: - fixed reference counting - image request callback support - image/object connection helper functions - distinct BAD_WHICH value for non-image object requests drivers/block/rbd.c | 622 ++- 1 file changed, 620 insertions(+), 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 6689363..46a61dd 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -181,6 +181,67 @@ struct rbd_req_coll { struct rbd_req_statusstatus[0]; }; +struct rbd_img_request; +typedef void (*rbd_img_callback_t)(struct rbd_img_request *); + +#defineBAD_WHICHU32_MAX/* Good which or bad which, which? */ + +struct rbd_obj_request; +typedef void (*rbd_obj_callback_t)(struct rbd_obj_request *); + +enum obj_req_type { obj_req_bio };/* More types to come */ enum labels should be capitalized. Not where I come from (at least not always), but I'll convert these to be all caps. +struct rbd_obj_request { +const char*object_name; +u64offset;/* object start byte */ +u64length;/* bytes from offset */ + +struct rbd_img_request*img_request; +struct list_headlinks; +u32which;/* posn image request list */ + +enum obj_req_typetype; +struct bio*bio_list; + +struct ceph_osd_request*osd_req; + +u64xferred;/* bytes transferred */ +u64version; This version is only used (uselessly) for the watch operation. It should be removed in a future patch (along with the obj_ver in the header). I'm not sure whether the description is quite right, but: http://tracker.ceph.com/issues/3952 +s32result; +atomic_tdone; + +rbd_obj_callback_t