Re: [PATCH] ceph: convert comma to semicolon
On Wed, Jul 23, 2014 at 6:41 PM, Himangi Saraogi himangi...@gmail.com wrote: Replace a comma between expression statements by a semicolon. This changes the semantics of the code, but given the current indentation appears to be what is intended. A simplified version of the Coccinelle semantic patch that performs this transformation is as follows: // smpl @r@ expression e1,e2; @@ e1 -, +; e2; // /smpl Signed-off-by: Himangi Saraogi himangi...@gmail.com Acked-by: Julia Lawall julia.law...@lip6.fr Applied. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] rbd: Use rbd_segment_name_free
On Thu, Jul 24, 2014 at 1:47 AM, Himangi Saraogi himangi...@gmail.com wrote: Free memory allocated using kmem_cache_zalloc using kmem_cache_free rather than kfree. The helper rbd_segment_name_free does the job here. Its position is shifted above the calling function. The Coccinelle semantic patch that detects this change is as follows: // smpl @@ expression x,E,c; @@ x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...) ... when != x = E when != x ?-kfree(x) +kmem_cache_free(c,x) // /smpl Signed-off-by: Himangi Saraogi himangi...@gmail.com Acked-by: Julia Lawall julia.law...@lip6.fr Applied. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/8] wip-overlap
Hi, This series fixes a bug reported by Sandisk in the Read from clones thread about a week ago. The description along with a simplified reproducer are in 7/8 and 8/8. 1/8 touches on our sysfs interface. The rest are cleanups. Thanks, Ilya Ilya Dryomov (8): rbd: show the entire chain of parent images rbd: introduce rbd_dev_header_info() rbd: remove unnecessary asserts in rbd_dev_image_probe() rbd: split rbd_dev_spec_update() into two functions rbd: harden rbd_dev_refresh() caller rbd: update mapping size only on refresh rbd: do not read in parent info before snap context rbd: take snap_id into account when reading in parent info Documentation/ABI/testing/sysfs-bus-rbd |4 +- drivers/block/rbd.c | 261 --- 2 files changed, 137 insertions(+), 128 deletions(-) -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] rbd: introduce rbd_dev_header_info()
A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 24 ++-- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 7847fbb949ff..0d3be608f16f 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -514,7 +514,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev); static int rbd_dev_refresh(struct rbd_device *rbd_dev); static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev); -static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev); +static int rbd_dev_header_info(struct rbd_device *rbd_dev); static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev, u64 snap_id); static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, @@ -3506,13 +3506,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) u64 mapping_size; int ret; - rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); down_write(rbd_dev-header_rwsem); mapping_size = rbd_dev-mapping.size; - if (rbd_dev-image_format == 1) - ret = rbd_dev_v1_header_info(rbd_dev); - else - ret = rbd_dev_v2_header_info(rbd_dev); + + ret = rbd_dev_header_info(rbd_dev); /* If it's a mapped snapshot, validate its EXISTS flag */ @@ -4501,6 +4498,16 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) return ret; } +static int rbd_dev_header_info(struct rbd_device *rbd_dev) +{ + rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); + + if (rbd_dev-image_format == 1) + return rbd_dev_v1_header_info(rbd_dev); + + return rbd_dev_v2_header_info(rbd_dev); +} + static int rbd_bus_add_dev(struct rbd_device *rbd_dev) { struct device *dev; @@ -5149,10 +5156,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) goto out_header_name; } - if (rbd_dev-image_format == 1) - ret = rbd_dev_v1_header_info(rbd_dev); - else - ret = rbd_dev_v2_header_info(rbd_dev); + ret = rbd_dev_header_info(rbd_dev); if (ret) goto err_out_watch; -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] rbd: show the entire chain of parent images
Make /sys/bus/rbd/devices/id/parent show the entire chain of parent images. While at it, kernel sprintf() doesn't return negative values, casting to unsigned long long is no longer necessary and there is no good reason to split into multiple sprintf() calls. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- Documentation/ABI/testing/sysfs-bus-rbd |4 +-- drivers/block/rbd.c | 56 +-- 2 files changed, 25 insertions(+), 35 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-rbd b/Documentation/ABI/testing/sysfs-bus-rbd index 501adc2a9ec7..2ddd680929d8 100644 --- a/Documentation/ABI/testing/sysfs-bus-rbd +++ b/Documentation/ABI/testing/sysfs-bus-rbd @@ -94,5 +94,5 @@ current_snap parent - Information identifying the pool, image, and snapshot id for - the parent image in a layered rbd image (format 2 only). + Information identifying the chain of parent images in a layered rbd + image. Entries are separated by empty lines. diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 703b728e05fa..7847fbb949ff 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3685,46 +3685,36 @@ static ssize_t rbd_snap_show(struct device *dev, } /* - * For an rbd v2 image, shows the pool id, image id, and snapshot id - * for the parent image. If there is no parent, simply shows - * (no parent image). + * For a v2 image, shows the chain of parent images, separated by empty + * lines. For v1 images or if there is no parent, shows (no parent + * image). */ static ssize_t rbd_parent_show(struct device *dev, -struct device_attribute *attr, -char *buf) + struct device_attribute *attr, + char *buf) { struct rbd_device *rbd_dev = dev_to_rbd_dev(dev); - struct rbd_spec *spec = rbd_dev-parent_spec; - int count; - char *bufp = buf; + ssize_t count = 0; - if (!spec) + if (!rbd_dev-parent) return sprintf(buf, (no parent image)\n); - count = sprintf(bufp, pool_id %llu\npool_name %s\n, - (unsigned long long) spec-pool_id, spec-pool_name); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, image_id %s\nimage_name %s\n, spec-image_id, - spec-image_name ? spec-image_name : (unknown)); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, snap_id %llu\nsnap_name %s\n, - (unsigned long long) spec-snap_id, spec-snap_name); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, overlap %llu\n, rbd_dev-parent_overlap); - if (count 0) - return count; - bufp += count; - - return (ssize_t) (bufp - buf); + for ( ; rbd_dev-parent; rbd_dev = rbd_dev-parent) { + struct rbd_spec *spec = rbd_dev-parent_spec; + + count += sprintf(buf[count], %s + pool_id %llu\npool_name %s\n + image_id %s\nimage_name %s\n + snap_id %llu\nsnap_name %s\n + overlap %llu\n, + !count ? : \n, /* first? */ + spec-pool_id, spec-pool_name, + spec-image_id, spec-image_name ?: (unknown), + spec-snap_id, spec-snap_name, + rbd_dev-parent_overlap); + } + + return count; } static ssize_t rbd_image_refresh(struct device *dev, -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/8] rbd: harden rbd_dev_refresh() caller
Recently discovered watch/notify problems showed that we really can't ignore errors in anything refresh related. Alas, currently there is not much we can do in response to those errors, except print warnings. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 23df1773ef77..5dd6f5057ef4 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2963,11 +2963,20 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) dout(%s: \%s\ notify_id %llu opcode %u\n, __func__, rbd_dev-header_name, (unsigned long long)notify_id, (unsigned int)opcode); + + /* +* Until adequate refresh error handling is in place, there is +* not much we can do here, except warn. +* +* See http://tracker.ceph.com/issues/5040 +*/ ret = rbd_dev_refresh(rbd_dev); if (ret) - rbd_warn(rbd_dev, header refresh error (%d)\n, ret); + rbd_warn(rbd_dev, refresh failed: %d\n, ret); - rbd_obj_notify_ack_sync(rbd_dev, notify_id); + ret = rbd_obj_notify_ack_sync(rbd_dev, notify_id); + if (ret) + rbd_warn(rbd_dev, notify_ack ret %d\n, ret); } /* @@ -3724,9 +3733,9 @@ static ssize_t rbd_image_refresh(struct device *dev, ret = rbd_dev_refresh(rbd_dev); if (ret) - rbd_warn(rbd_dev, : manual header refresh error (%d)\n, ret); + return ret; - return ret 0 ? ret : size; + return size; } static DEVICE_ATTR(size, S_IRUGO, rbd_size_show, NULL); -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] rbd: update mapping size only on refresh
There is no sense in trying to update the mapping size before it's even been set. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 5dd6f5057ef4..16f388f2960b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -971,12 +971,6 @@ static int rbd_header_from_disk(struct rbd_device *rbd_dev, header-snap_names = snap_names; header-snap_sizes = snap_sizes; - /* Make sure mapping size is consistent with header info */ - - if (rbd_dev-spec-snap_id == CEPH_NOSNAP || first_time) - if (rbd_dev-mapping.size != header-image_size) - rbd_dev-mapping.size = header-image_size; - return 0; out_2big: ret = -EIO; @@ -3520,9 +3514,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) ret = rbd_dev_header_info(rbd_dev); - /* If it's a mapped snapshot, validate its EXISTS flag */ + if (rbd_dev-spec-snap_id == CEPH_NOSNAP) { + if (rbd_dev-mapping.size != rbd_dev-header.image_size) + rbd_dev-mapping.size = rbd_dev-header.image_size; + } else { + /* validate mapped snapshot's EXISTS flag */ + rbd_exists_validate(rbd_dev); + } - rbd_exists_validate(rbd_dev); up_write(rbd_dev-header_rwsem); if (mapping_size != rbd_dev-mapping.size) { @@ -4505,10 +4504,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) is EXPERIMENTAL!); } - if (rbd_dev-spec-snap_id == CEPH_NOSNAP) - if (rbd_dev-mapping.size != rbd_dev-header.image_size) - rbd_dev-mapping.size = rbd_dev-header.image_size; - ret = rbd_dev_v2_snap_context(rbd_dev); dout(rbd_dev_v2_snap_context returned %d\n, ret); -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/8] rbd: take snap_id into account when reading in parent info
If we are mapping a snapshot, we must read in the parent_overlap value of that snapshot instead of that of the base image. Not doing so may in particular result in us returning zeros instead of user data: # cat overlap-snap.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=/dev/rbd0 bs=1M /dev/null echo Base image dd if=$FOO_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd snap create bar@snap BAR_DEV=$(rbd map bar@snap) echo Snapshot dd if=$BAR_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd rbd resize --allow-shrink --size 4 bar echo Snapshot after base image resize dd if=$BAR_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd # ./overlap-snap.sh Base image 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K%`4(E..6. Snapshot 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K%`4(E..6. Resizing image: 100% complete...done. Snapshot after base image resize 000: e781 e33b d34b 2225 ...;.K% Even though bar@snap was taken with the old bar parent_overlap (8M), reads from bar@snap beyond the new bar parent_overlap (4M) return zeroes. Fix it. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index c4606987e9d1..cbc89fa9a677 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4020,7 +4020,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) goto out_err; } - snapid = cpu_to_le64(CEPH_NOSNAP); + snapid = cpu_to_le64(rbd_dev-spec-snap_id); ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name, rbd, get_parent, snapid, sizeof (snapid), -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] rbd: split rbd_dev_spec_update() into two functions
rbd_dev_spec_update() has two modes of operation, with nothing in common between them. Split it into two functions, one for each mode and make our expectations more clear. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 79 +++ 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4541f6027e4a..23df1773ef77 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3798,6 +3798,9 @@ static struct rbd_spec *rbd_spec_alloc(void) spec = kzalloc(sizeof (*spec), GFP_KERNEL); if (!spec) return NULL; + + spec-pool_id = CEPH_NOPOOL; + spec-snap_id = CEPH_NOSNAP; kref_init(spec-kref); return spec; @@ -4257,18 +4260,38 @@ static u64 rbd_snap_id_by_name(struct rbd_device *rbd_dev, const char *name) } /* - * When an rbd image has a parent image, it is identified by the - * pool, image, and snapshot ids (not names). This function fills - * in the names for those ids. (It's OK if we can't figure out the - * name for an image id, but the pool and snapshot ids should always - * exist and have names.) All names in an rbd spec are dynamically - * allocated. + * An image being mapped will have everything but the snap id. + */ +static int rbd_spec_fill_snap_id(struct rbd_device *rbd_dev) +{ + struct rbd_spec *spec = rbd_dev-spec; + + rbd_assert(spec-pool_id != CEPH_NOPOOL spec-pool_name); + rbd_assert(spec-image_id spec-image_name); + rbd_assert(spec-snap_name); + + if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) { + u64 snap_id; + + snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name); + if (snap_id == CEPH_NOSNAP) + return -ENOENT; + + spec-snap_id = snap_id; + } else { + spec-snap_id = CEPH_NOSNAP; + } + + return 0; +} + +/* + * A parent image will have all ids but none of the names. * - * When an image being mapped (not a parent) is probed, we have the - * pool name and pool id, image name and image id, and the snapshot - * name. The only thing we're missing is the snapshot id. + * All names in an rbd spec are dynamically allocated. It's OK if we + * can't figure out the name for an image id. */ -static int rbd_dev_spec_update(struct rbd_device *rbd_dev) +static int rbd_spec_fill_names(struct rbd_device *rbd_dev) { struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; struct rbd_spec *spec = rbd_dev-spec; @@ -4277,24 +4300,9 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) const char *snap_name; int ret; - /* -* An image being mapped will have the pool name (etc.), but -* we need to look up the snapshot id. -*/ - if (spec-pool_name) { - if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) { - u64 snap_id; - - snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name); - if (snap_id == CEPH_NOSNAP) - return -ENOENT; - spec-snap_id = snap_id; - } else { - spec-snap_id = CEPH_NOSNAP; - } - - return 0; - } + rbd_assert(spec-pool_id != CEPH_NOPOOL); + rbd_assert(spec-image_id); + rbd_assert(spec-snap_id != CEPH_NOSNAP); /* Get the pool name; we have to make our own copy of this */ @@ -4313,7 +4321,7 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) if (!image_name) rbd_warn(rbd_dev, unable to get image name); - /* Look up the snapshot name, and make a copy */ + /* Fetch the snapshot name */ snap_name = rbd_snap_name(rbd_dev, spec-snap_id); if (IS_ERR(snap_name)) { @@ -4326,10 +4334,10 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) spec-snap_name = snap_name; return 0; + out_err: kfree(image_name); kfree(pool_name); - return ret; } @@ -5158,7 +5166,16 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) if (ret) goto err_out_watch; - ret = rbd_dev_spec_update(rbd_dev); + /* +* If this image is the one being mapped, we have pool name and +* id, image name and id, and snap name - need to fill snap id. +* Otherwise this is a parent image, identified by pool, image +* and snap ids - need to fill in names for those ids. +*/ + if (mapping) + ret = rbd_spec_fill_snap_id(rbd_dev); + else + ret = rbd_spec_fill_names(rbd_dev); if (ret) goto err_out_probe; -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to
[PATCH 7/8] rbd: do not read in parent info before snap context
Currently rbd_dev_v2_header_info() reads in parent info before the snap context is read in. This is wrong, because we may need to look at the the parent_overlap value of the snapshot instead of that of the base image, for example when mapping a snapshot - see next commit. (When mapping a snapshot, all we got is its name and we need the snap context to translate that name into an id to know which parent info to look for). The approach taken here is to make sure rbd_dev_v2_parent_info() is called after the snap context has been read in. The other approach would be to add a parent_overlap field to struct rbd_mapping and maintain it the same way rbd_mapping::size is maintained. The reason I chose the first approach is that the value of keeping around both base image values and the actual mapping values is unclear to me. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 64 --- 1 file changed, 30 insertions(+), 34 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 16f388f2960b..c4606987e9d1 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -515,6 +515,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev); static int rbd_dev_refresh(struct rbd_device *rbd_dev); static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev); static int rbd_dev_header_info(struct rbd_device *rbd_dev); +static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev); static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev, u64 snap_id); static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, @@ -3513,6 +3514,18 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) mapping_size = rbd_dev-mapping.size; ret = rbd_dev_header_info(rbd_dev); + if (ret) + return ret; + + /* +* If there is a parent, see if it has disappeared due to the +* mapped image getting flattened. +*/ + if (rbd_dev-parent) { + ret = rbd_dev_v2_parent_info(rbd_dev); + if (ret) + return ret; + } if (rbd_dev-spec-snap_id == CEPH_NOSNAP) { if (rbd_dev-mapping.size != rbd_dev-header.image_size) @@ -3524,11 +3537,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) up_write(rbd_dev-header_rwsem); - if (mapping_size != rbd_dev-mapping.size) { + if (mapping_size != rbd_dev-mapping.size) rbd_dev_update_size(rbd_dev); - } - return ret; + return 0; } static int rbd_init_disk(struct rbd_device *rbd_dev) @@ -4477,33 +4489,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) return ret; } - /* -* If the image supports layering, get the parent info. We -* need to probe the first time regardless. Thereafter we -* only need to if there's a parent, to see if it has -* disappeared due to the mapped image getting flattened. -*/ - if (rbd_dev-header.features RBD_FEATURE_LAYERING - (first_time || rbd_dev-parent_spec)) { - bool warn; - - ret = rbd_dev_v2_parent_info(rbd_dev); - if (ret) - return ret; - - /* -* Print a warning if this is the initial probe and -* the image has a parent. Don't print it if the -* image now being probed is itself a parent. We -* can tell at this point because we won't know its -* pool name yet (just its pool id). -*/ - warn = rbd_dev-parent_spec rbd_dev-spec-pool_name; - if (first_time warn) - rbd_warn(rbd_dev, WARNING: kernel layering - is EXPERIMENTAL!); - } - ret = rbd_dev_v2_snap_context(rbd_dev); dout(rbd_dev_v2_snap_context returned %d\n, ret); @@ -5183,14 +5168,28 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) if (ret) goto err_out_probe; + if (rbd_dev-header.features RBD_FEATURE_LAYERING) { + ret = rbd_dev_v2_parent_info(rbd_dev); + if (ret) + goto err_out_probe; + + /* +* Need to warn users if this image is the one being +* mapped and has a parent. +*/ + if (mapping rbd_dev-parent_spec) + rbd_warn(rbd_dev, +WARNING: kernel layering is EXPERIMENTAL!); + } + ret = rbd_dev_probe_parent(rbd_dev); if (ret) goto err_out_probe; dout(discovered format %u image, header name is %s\n, rbd_dev-image_format,
[PATCH 3/8] rbd: remove unnecessary asserts in rbd_dev_image_probe()
spec-image_id assert doesn't buy us much and image_format is asserted in rbd_dev_header_name() and rbd_dev_header_info() anyway. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 0d3be608f16f..4541f6027e4a 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5143,8 +5143,6 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) ret = rbd_dev_image_id(rbd_dev); if (ret) return ret; - rbd_assert(rbd_dev-spec-image_id); - rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); ret = rbd_dev_header_name(rbd_dev); if (ret) -- 1.7.10.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Read from clones
On Tue, Jul 15, 2014 at 9:44 AM, Lakshminarayana Mavunduri lakshminarayana.mavund...@sandisk.com wrote: Hi, Following the below set of steps, we are seeing data loss while reading from clones. 1)Create an image with image format 2 (in this case we have made the size to be 1024MB). rbd create image1 --size 1024 --image-format 2 2)Map the image and write 1024MB worth of data to it. 3)Create a snapshot for the image created in step 1) rbd snap create image1@snap1 4)Create a clone for the snapshot created in step 3) rbd clone image1@snap1 clone1 5)Create a snapshot for the clone created in step 4) rbd snap create clone1@snap2 6)Create a clone for the snapshot created in step 5) rbd clone clone1@snap2 clone2 7)Shrink the size of the clone created in step 4) (in this case we have made it to half of its size) rbd resize -s 512 --allow-shrink clone1 8)Map the clone created in step 6) and try reading 1024MB worth of data. 9)Our observation is that, only the first 512MB worth data is intact, the rest is not copied over. (In fact, it's only the parent overlap worth data of clone1 that is always copied over!) After the above set of steps, the parent overlap for clone2 would be 1024MB, whereas the parent overlap for clone1 would be 512MB. Our understanding is, since clone2's parent snapshot is taken before a shrink is performed on clone1, any reads worth parent overlap data on clone2 should be serviced from it's parent (at least, as long as there are no overwrites done on clone2, which is the case here), and we are not finding that to be true in this case. To augment our theory, if the parent image (a base RBD image, which is not a clone) is shrinked, any reads on the clones that are created before the shrink, are as expected to our understanding. Wanted to check if this is indeed a bug or if we are missing anything here. The tests are run across ceph version 0.80. FWIW, I just pushed a fix for this - wip-overlap in ceph-client.git. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Read from clones
Thanks Ilya! -Original Message- From: Ilya Dryomov [mailto:ilya.dryo...@inktank.com] Sent: Thursday, July 24, 2014 2:27 PM To: Lakshminarayana Mavunduri Cc: ceph-devel@vger.kernel.org; Pavan Rallabhandi Subject: Re: Read from clones On Tue, Jul 15, 2014 at 9:44 AM, Lakshminarayana Mavunduri lakshminarayana.mavund...@sandisk.com wrote: Hi, Following the below set of steps, we are seeing data loss while reading from clones. 1)Create an image with image format 2 (in this case we have made the size to be 1024MB). rbd create image1 --size 1024 --image-format 2 2)Map the image and write 1024MB worth of data to it. 3)Create a snapshot for the image created in step 1) rbd snap create image1@snap1 4)Create a clone for the snapshot created in step 3) rbd clone image1@snap1 clone1 5)Create a snapshot for the clone created in step 4) rbd snap create clone1@snap2 6)Create a clone for the snapshot created in step 5) rbd clone clone1@snap2 clone2 7)Shrink the size of the clone created in step 4) (in this case we have made it to half of its size) rbd resize -s 512 --allow-shrink clone1 8)Map the clone created in step 6) and try reading 1024MB worth of data. 9)Our observation is that, only the first 512MB worth data is intact, the rest is not copied over. (In fact, it's only the parent overlap worth data of clone1 that is always copied over!) After the above set of steps, the parent overlap for clone2 would be 1024MB, whereas the parent overlap for clone1 would be 512MB. Our understanding is, since clone2's parent snapshot is taken before a shrink is performed on clone1, any reads worth parent overlap data on clone2 should be serviced from it's parent (at least, as long as there are no overwrites done on clone2, which is the case here), and we are not finding that to be true in this case. To augment our theory, if the parent image (a base RBD image, which is not a clone) is shrinked, any reads on the clones that are created before the shrink, are as expected to our understanding. Wanted to check if this is indeed a bug or if we are missing anything here. The tests are run across ceph version 0.80. FWIW, I just pushed a fix for this - wip-overlap in ceph-client.git. Thanks, Ilya PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
Re: First attempt at rocksdb monitor store stress testing
Hi Xinxin, Thanks! I wonder as well if it might be interesting to expose the options related to universal compaction? It looks like rocksdb provides a lot of interesting knobs you can adjust! Mark On 07/24/2014 12:08 AM, Shu, Xinxin wrote: Hi mark, I think this maybe related to 'verify_checksums' config option ,when ReadOptions is initialized, default this option is true , all data read from underlying storage will be verified against corresponding checksums, however, this option cannot be configured in wip-rocksdb branch. I will modify code to make this option configurable . Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Thursday, July 24, 2014 7:14 AM To: ceph-devel@vger.kernel.org Subject: First attempt at rocksdb monitor store stress testing Hi Guys, So I've been interested lately in leveldb 99th percentile latency (and the amount of write amplification we are seeing) with leveldb. Joao mentioned he has written a tool called mon-store-stress in wip-leveldb-misc to try to provide a means to roughly guess at what's happening on the mons under heavy load. I cherry-picked it over to wip-rocksdb and after a couple of hacks was able to get everything built and running with some basic tests. There was little tuning done and I don't know how realistic this workload is, so don't assume this means anything yet, but some initial results are here: http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf Command that was used to run the tests: ./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb --write-min-size 50K --write-max-size 2M --percent-write 70 --percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 foo The most interesting bit right now is that rocksdb seems to be hanging in the middle of the test (left it running for several hours). CPU usage on one core was quite high during the hang. Profiling using perf with dwarf symbols I see: - 49.14% ceph-test-mon-s ceph-test-mon-store-stress [.] unsigned int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char const*, unsigned long) - unsigned int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char const*, unsigned long) 51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*, rocksdb::Footer const, rocksdb::ReadOptions const, rocksdb::BlockHandle const, rocksdb::BlockContents*, rocksdb::Env*, bool) 48.30% rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const, rocksdb::CompressionType, rocksdb::BlockHandle*) Not sure what's going on yet, may need to try to enable logging/debugging in rocksdb. Thoughts/Suggestions welcome. :) Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] rbd: show the entire chain of parent images
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: Make /sys/bus/rbd/devices/id/parent show the entire chain of parent images. While at it, kernel sprintf() doesn't return negative values, casting to unsigned long long is no longer necessary and there is no good reason to split into multiple sprintf() calls. I like the use of a single snprintf() call, it is much less cumbersome. The reason I always cast u64 values to (unsigned long long) in printf() calls is because on some 64-bit architectures, u64 is defined as simply (unsigned long), so without the cast they spit out warnings. I hate this, but it is reality (see include/asm-generic/{int-l64.h,int-ll64.h}). You can alternatively use %PRIu64 rather than %llu in your format strings, but I think I hate that more. Anyway, if you want to avoid warnings on all architectures you should fix that. As another aside, I've been too timid to use the ?: conditional expression without its middle operand. I have no objection to it at all, I like it. I bring it up because I recently got burned for using a gcc feature that wasn't supported on older compilers (unnamed struct/union fields)--specifically a version newer than gcc 3.2, which is the minimum supported version for the kernel (see Documentation/Changes). But fear not! That extension is supported in gcc 3.2: https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Conditionals.html#Conditionals Just FYI... Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- Documentation/ABI/testing/sysfs-bus-rbd |4 +-- drivers/block/rbd.c | 56 +-- 2 files changed, 25 insertions(+), 35 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-bus-rbd b/Documentation/ABI/testing/sysfs-bus-rbd index 501adc2a9ec7..2ddd680929d8 100644 --- a/Documentation/ABI/testing/sysfs-bus-rbd +++ b/Documentation/ABI/testing/sysfs-bus-rbd @@ -94,5 +94,5 @@ current_snap parent - Information identifying the pool, image, and snapshot id for - the parent image in a layered rbd image (format 2 only). + Information identifying the chain of parent images in a layered rbd + image. Entries are separated by empty lines. diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 703b728e05fa..7847fbb949ff 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3685,46 +3685,36 @@ static ssize_t rbd_snap_show(struct device *dev, } /* - * For an rbd v2 image, shows the pool id, image id, and snapshot id - * for the parent image. If there is no parent, simply shows - * (no parent image). + * For a v2 image, shows the chain of parent images, separated by empty + * lines. For v1 images or if there is no parent, shows (no parent + * image). */ static ssize_t rbd_parent_show(struct device *dev, - struct device_attribute *attr, - char *buf) +struct device_attribute *attr, +char *buf) { struct rbd_device *rbd_dev = dev_to_rbd_dev(dev); - struct rbd_spec *spec = rbd_dev-parent_spec; - int count; - char *bufp = buf; + ssize_t count = 0; - if (!spec) + if (!rbd_dev-parent) return sprintf(buf, (no parent image)\n); - count = sprintf(bufp, pool_id %llu\npool_name %s\n, - (unsigned long long) spec-pool_id, spec-pool_name); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, image_id %s\nimage_name %s\n, spec-image_id, - spec-image_name ? spec-image_name : (unknown)); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, snap_id %llu\nsnap_name %s\n, - (unsigned long long) spec-snap_id, spec-snap_name); - if (count 0) - return count; - bufp += count; - - count = sprintf(bufp, overlap %llu\n, rbd_dev-parent_overlap); - if (count 0) - return count; - bufp += count; - - return (ssize_t) (bufp - buf); + for ( ; rbd_dev-parent; rbd_dev = rbd_dev-parent) { + struct rbd_spec *spec = rbd_dev-parent_spec; + + count += sprintf(buf[count], %s + pool_id %llu\npool_name %s\n + image_id %s\nimage_name %s\n + snap_id %llu\nsnap_name %s\n + overlap %llu\n, + !count ? : \n, /* first? */ + spec-pool_id, spec-pool_name, + spec-image_id, spec-image_name ?: (unknown), + spec-snap_id, spec-snap_name, + rbd_dev-parent_overlap); + } + + return count; } static ssize_t rbd_image_refresh(struct device *dev, -- To unsubscribe from this list: send the line
Re: [PATCH 2/8] rbd: introduce rbd_dev_header_info()
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: A wrapper around rbd_dev_v{1,2}_header_info() to reduce duplication. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 24 ++-- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 7847fbb949ff..0d3be608f16f 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -514,7 +514,7 @@ static void rbd_dev_remove_parent(struct rbd_device *rbd_dev); static int rbd_dev_refresh(struct rbd_device *rbd_dev); static int rbd_dev_v2_header_onetime(struct rbd_device *rbd_dev); -static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev); +static int rbd_dev_header_info(struct rbd_device *rbd_dev); static const char *rbd_dev_v2_snap_name(struct rbd_device *rbd_dev, u64 snap_id); static int _rbd_dev_v2_snap_size(struct rbd_device *rbd_dev, u64 snap_id, @@ -3506,13 +3506,10 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) u64 mapping_size; int ret; - rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); down_write(rbd_dev-header_rwsem); mapping_size = rbd_dev-mapping.size; - if (rbd_dev-image_format == 1) - ret = rbd_dev_v1_header_info(rbd_dev); - else - ret = rbd_dev_v2_header_info(rbd_dev); + + ret = rbd_dev_header_info(rbd_dev); /* If it's a mapped snapshot, validate its EXISTS flag */ @@ -4501,6 +4498,16 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) return ret; } +static int rbd_dev_header_info(struct rbd_device *rbd_dev) +{ + rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); + + if (rbd_dev-image_format == 1) + return rbd_dev_v1_header_info(rbd_dev); + + return rbd_dev_v2_header_info(rbd_dev); +} + static int rbd_bus_add_dev(struct rbd_device *rbd_dev) { struct device *dev; @@ -5149,10 +5156,7 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) goto out_header_name; } - if (rbd_dev-image_format == 1) - ret = rbd_dev_v1_header_info(rbd_dev); - else - ret = rbd_dev_v2_header_info(rbd_dev); + ret = rbd_dev_header_info(rbd_dev); if (ret) goto err_out_watch; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] rbd: remove unnecessary asserts in rbd_dev_image_probe()
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: spec-image_id assert doesn't buy us much and image_format is asserted in rbd_dev_header_name() and rbd_dev_header_info() anyway. Looks good. Over time I'm sure there are more assertions that can go away; they were most useful during rapid development when confidence in certain things was lower. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 0d3be608f16f..4541f6027e4a 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -5143,8 +5143,6 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) ret = rbd_dev_image_id(rbd_dev); if (ret) return ret; - rbd_assert(rbd_dev-spec-image_id); - rbd_assert(rbd_image_format_valid(rbd_dev-image_format)); ret = rbd_dev_header_name(rbd_dev); if (ret) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] rbd: show the entire chain of parent images
On Thu, Jul 24, 2014 at 4:31 PM, Alex Elder el...@ieee.org wrote: On 07/24/2014 03:42 AM, Ilya Dryomov wrote: Make /sys/bus/rbd/devices/id/parent show the entire chain of parent images. While at it, kernel sprintf() doesn't return negative values, casting to unsigned long long is no longer necessary and there is no good reason to split into multiple sprintf() calls. I like the use of a single snprintf() call, it is much less cumbersome. The reason I always cast u64 values to (unsigned long long) in printf() calls is because on some 64-bit architectures, u64 is defined as simply (unsigned long), so without the cast they spit out warnings. I hate this, but it is reality (see include/asm-generic/{int-l64.h,int-ll64.h}). You can alternatively use %PRIu64 rather than %llu in your format strings, but I think I hate that more. Anyway, if you want to avoid warnings on all architectures you should fix that. IIRC all architectures nowadays use int-ll64.h. int-l64.h is used only in userspace for compatibility. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] rbd: split rbd_dev_spec_update() into two functions
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: rbd_dev_spec_update() has two modes of operation, with nothing in common between them. Split it into two functions, one for each mode and make our expectations more clear. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com Looks good. Reviewed-by: Alex Elder el...@linaro.org --- drivers/block/rbd.c | 79 +++ 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 4541f6027e4a..23df1773ef77 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -3798,6 +3798,9 @@ static struct rbd_spec *rbd_spec_alloc(void) spec = kzalloc(sizeof (*spec), GFP_KERNEL); if (!spec) return NULL; + + spec-pool_id = CEPH_NOPOOL; + spec-snap_id = CEPH_NOSNAP; kref_init(spec-kref); return spec; @@ -4257,18 +4260,38 @@ static u64 rbd_snap_id_by_name(struct rbd_device *rbd_dev, const char *name) } /* - * When an rbd image has a parent image, it is identified by the - * pool, image, and snapshot ids (not names). This function fills - * in the names for those ids. (It's OK if we can't figure out the - * name for an image id, but the pool and snapshot ids should always - * exist and have names.) All names in an rbd spec are dynamically - * allocated. + * An image being mapped will have everything but the snap id. + */ +static int rbd_spec_fill_snap_id(struct rbd_device *rbd_dev) +{ + struct rbd_spec *spec = rbd_dev-spec; + + rbd_assert(spec-pool_id != CEPH_NOPOOL spec-pool_name); + rbd_assert(spec-image_id spec-image_name); + rbd_assert(spec-snap_name); + + if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) { + u64 snap_id; + + snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name); + if (snap_id == CEPH_NOSNAP) + return -ENOENT; + + spec-snap_id = snap_id; + } else { + spec-snap_id = CEPH_NOSNAP; + } + + return 0; +} + +/* + * A parent image will have all ids but none of the names. * - * When an image being mapped (not a parent) is probed, we have the - * pool name and pool id, image name and image id, and the snapshot - * name. The only thing we're missing is the snapshot id. + * All names in an rbd spec are dynamically allocated. It's OK if we + * can't figure out the name for an image id. */ -static int rbd_dev_spec_update(struct rbd_device *rbd_dev) +static int rbd_spec_fill_names(struct rbd_device *rbd_dev) { struct ceph_osd_client *osdc = rbd_dev-rbd_client-client-osdc; struct rbd_spec *spec = rbd_dev-spec; @@ -4277,24 +4300,9 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) const char *snap_name; int ret; - /* - * An image being mapped will have the pool name (etc.), but - * we need to look up the snapshot id. - */ - if (spec-pool_name) { - if (strcmp(spec-snap_name, RBD_SNAP_HEAD_NAME)) { - u64 snap_id; - - snap_id = rbd_snap_id_by_name(rbd_dev, spec-snap_name); - if (snap_id == CEPH_NOSNAP) - return -ENOENT; - spec-snap_id = snap_id; - } else { - spec-snap_id = CEPH_NOSNAP; - } - - return 0; - } + rbd_assert(spec-pool_id != CEPH_NOPOOL); + rbd_assert(spec-image_id); + rbd_assert(spec-snap_id != CEPH_NOSNAP); /* Get the pool name; we have to make our own copy of this */ @@ -4313,7 +4321,7 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) if (!image_name) rbd_warn(rbd_dev, unable to get image name); - /* Look up the snapshot name, and make a copy */ + /* Fetch the snapshot name */ snap_name = rbd_snap_name(rbd_dev, spec-snap_id); if (IS_ERR(snap_name)) { @@ -4326,10 +4334,10 @@ static int rbd_dev_spec_update(struct rbd_device *rbd_dev) spec-snap_name = snap_name; return 0; + out_err: kfree(image_name); kfree(pool_name); - return ret; } @@ -5158,7 +5166,16 @@ static int rbd_dev_image_probe(struct rbd_device *rbd_dev, bool mapping) if (ret) goto err_out_watch; - ret = rbd_dev_spec_update(rbd_dev); + /* + * If this image is the one being mapped, we have pool name and + * id, image name and id, and snap name - need to fill snap id. + * Otherwise this is a parent image, identified by pool, image + * and snap ids - need to fill in names for those ids. + */ + if (mapping) + ret = rbd_spec_fill_snap_id(rbd_dev); + else + ret = rbd_spec_fill_names(rbd_dev); if (ret) goto err_out_probe; -- To
Re: [PATCH 5/8] rbd: harden rbd_dev_refresh() caller
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: Recently discovered watch/notify problems showed that we really can't ignore errors in anything refresh related. Alas, currently there is not much we can do in response to those errors, except print warnings. Looks good. Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 23df1773ef77..5dd6f5057ef4 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2963,11 +2963,20 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data) dout(%s: \%s\ notify_id %llu opcode %u\n, __func__, rbd_dev-header_name, (unsigned long long)notify_id, (unsigned int)opcode); + + /* + * Until adequate refresh error handling is in place, there is + * not much we can do here, except warn. + * + * See http://tracker.ceph.com/issues/5040 + */ ret = rbd_dev_refresh(rbd_dev); if (ret) - rbd_warn(rbd_dev, header refresh error (%d)\n, ret); + rbd_warn(rbd_dev, refresh failed: %d\n, ret); - rbd_obj_notify_ack_sync(rbd_dev, notify_id); + ret = rbd_obj_notify_ack_sync(rbd_dev, notify_id); + if (ret) + rbd_warn(rbd_dev, notify_ack ret %d\n, ret); } /* @@ -3724,9 +3733,9 @@ static ssize_t rbd_image_refresh(struct device *dev, ret = rbd_dev_refresh(rbd_dev); if (ret) - rbd_warn(rbd_dev, : manual header refresh error (%d)\n, ret); + return ret; - return ret 0 ? ret : size; + return size; } static DEVICE_ATTR(size, S_IRUGO, rbd_size_show, NULL); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: Reviewed-by: Alex Elder el...@linaro.org Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 5dd6f5057ef4..16f388f2960b 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -971,12 +971,6 @@ static int rbd_header_from_disk(struct rbd_device *rbd_dev, header-snap_names = snap_names; header-snap_sizes = snap_sizes; - /* Make sure mapping size is consistent with header info */ - - if (rbd_dev-spec-snap_id == CEPH_NOSNAP || first_time) - if (rbd_dev-mapping.size != header-image_size) - rbd_dev-mapping.size = header-image_size; - return 0; out_2big: ret = -EIO; @@ -3520,9 +3514,14 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev) ret = rbd_dev_header_info(rbd_dev); - /* If it's a mapped snapshot, validate its EXISTS flag */ + if (rbd_dev-spec-snap_id == CEPH_NOSNAP) { + if (rbd_dev-mapping.size != rbd_dev-header.image_size) + rbd_dev-mapping.size = rbd_dev-header.image_size; + } else { + /* validate mapped snapshot's EXISTS flag */ + rbd_exists_validate(rbd_dev); + } - rbd_exists_validate(rbd_dev); up_write(rbd_dev-header_rwsem); if (mapping_size != rbd_dev-mapping.size) { @@ -4505,10 +4504,6 @@ static int rbd_dev_v2_header_info(struct rbd_device *rbd_dev) is EXPERIMENTAL!); } - if (rbd_dev-spec-snap_id == CEPH_NOSNAP) - if (rbd_dev-mapping.size != rbd_dev-header.image_size) - rbd_dev-mapping.size = rbd_dev-header.image_size; - ret = rbd_dev_v2_snap_context(rbd_dev); dout(rbd_dev_v2_snap_context returned %d\n, ret); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote: On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: No. Previously it was updating the mapping size both on the inital map and on refresh (of the base image). Doing that on the inital map doesn't make sense: not only struct rbd_mapping fields aren't properly initialized at that point - rbd_dev_mapping_set() is called much later in the map sequence, snap_id isn't initialized either (at least in the format 2 case, I haven't looked too closely at the format 1 case). And just in general, trying to update something before it had a chance to change is bogus.. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] rgw: list all available options during help()
Adding the available help arguments from the man page Fixes: #8112 Signed-off-by: Abhishek Lekshmanan abhishek.lekshma...@gmail.com --- src/rgw/rgw_main.cc | 6 ++ 1 file changed, 6 insertions(+) diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc index 51cfebe..29d6577 100644 --- a/src/rgw/rgw_main.cc +++ b/src/rgw/rgw_main.cc @@ -726,7 +726,13 @@ int usage() cerr options:\n; cerr--rgw-region=region region in which radosgw runs\n; cerr--rgw-zone=zone zone in which radosgw runs\n; + cerr--rgw-socket-path=path specify a unix domain socket path\n; generic_server_usage(); + cerr-m monaddress[:port] connect to specified monitor\n; + cerr--keyring=path path to radosgw keyring\n; + cerr--logfile=logfile file to log debug output\n; + cerr--debug-rgw=log-level/memory-level Set radosgw debug level\n; + return 0; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] rgw: format help options to align with the rest
Whitespace removal to make all help options align in a similar fashion Signed-off-by: Abhishek Lekshmanan abhishek.lekshma...@gmail.com --- src/rgw/rgw_main.cc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/rgw/rgw_main.cc b/src/rgw/rgw_main.cc index f6257d7..51cfebe 100644 --- a/src/rgw/rgw_main.cc +++ b/src/rgw/rgw_main.cc @@ -724,8 +724,8 @@ int usage() { cerr usage: radosgw [options...] std::endl; cerr options:\n; - cerr --rgw-region=region region in which radosgw runs\n; - cerr --rgw-zone=zone zone in which radosgw runs\n; + cerr--rgw-region=region region in which radosgw runs\n; + cerr--rgw-zone=zone zone in which radosgw runs\n; generic_server_usage(); return 0; } -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] WIP-8112 radosgw usage man update
Hi, This is a series of patches to update radosgw usage and man pages with all the available help options. 1/3 is a minor format fix to display the options in the same line. Abhishek Lekshmanan (3): rgw: format help options to align with the rest rgw: list all available options during help() doc: update radosgw man page with available opts doc/man/8/radosgw.rst | 28 src/rgw/rgw_main.cc | 10 -- 2 files changed, 36 insertions(+), 2 deletions(-) -- 1.9.1 Thanks, Abhishek -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] doc: update radosgw man page with available opts
Fixes:#8112 Signed-off-by: Abhishek Lekshmanan abhishek.lekshma...@gmail.com --- doc/man/8/radosgw.rst | 28 1 file changed, 28 insertions(+) diff --git a/doc/man/8/radosgw.rst b/doc/man/8/radosgw.rst index 5e5e3c0..7aa9655 100644 --- a/doc/man/8/radosgw.rst +++ b/doc/man/8/radosgw.rst @@ -32,10 +32,38 @@ Options Connect to specified monitor (instead of looking through ``ceph.conf``). +.. option:: -i ID, --id ID + + Set the ID portion of name for radosgw + +.. option:: -n TYPE.ID, --name TYPE.ID + + Set the name for rados gateway (eg. client.radosgw.gateway) + +.. option:: --cluster NAME + + Set the cluster name (default: ceph) + +.. option:: -d + + Run in foreground, log to stderr + +.. option:: -f + + Run in foreground, log to usual location + .. option:: --rgw-socket-path=path Specify a unix domain socket path. +.. option:: --rgw-region=region + + The region where radosgw runs + +.. option:: --rgw-zone=zone + + The zone where radosgw runs + Configuration = -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: help: how to get ceph daemon debug info from ceph-rest-api?
Hi Zhu Qiang, Unless I am mistaken, it looks like the command “dump_ops_in_flight” is only available as an admin-socket command: https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L1371-L1373 The commands available to ceph-rest-api are, to my knowledge, usually declared like so: https://github.com/ceph/ceph/blob/master/src/mon/MonCommands.h I put the ceph-devel mailing list in copy, perhaps someone else knows if there is another way to retrieve that information. -- David Moreau Simard On Jul 24, 2014, at 4:48 AM, zhu qiang zhu_qiang...@foxmail.com wrote: Hello Mr. Simard I want to use ceph-rest-api to view some debug details from ceph daemons. On linux shell I can get this message from below: # ceph daemon osd.0 dump_ops_in_flight | python -m json.tool { num_ops: 0, ops: []} This is my question: Can I get this output from ceph-rest-api ? Until now I tried some method : curl, python-cephclient ,but I did not get the right respone. Best regarding. N�r��yb�X��ǧv�^�){.n�+���z�]z���{ay�ʇڙ�,j��f���h���z��w��� ���j:+v���w�j�mzZ+�ݢj��!�i
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On Thu, Jul 24, 2014 at 5:46 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote: On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote: On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: No. Previously it was updating the mapping size both on the inital map and on refresh (of the base image). Doing that on the inital map doesn't make sense: not only struct rbd_mapping fields aren't properly initialized at that point - rbd_dev_mapping_set() is called much later in the map sequence, snap_id isn't initialized either (at least in the format 2 case, I haven't looked too closely at the format 1 case). And just in general, trying to update something before it had a chance to change is bogus.. BTW, while we are on the subject of struct rbd_mapping, is there any particular reason to keep around both the base image values and actual mapping values? I am tempted to change the mapping sequence so that 1) snap context is read in immediately after watch setup, before anything else is done, 2) supplied snap name is resolved into an id and 3) the actual (i.e. based on snap_id) mapping size, features, parent_overlap, etc are read in directly into struct rbd_image_header. That would allow to rip struct rbd_mapping entirely and make the code more clear. Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/8] rbd: update mapping size only on refresh
On 07/24/2014 08:46 AM, Ilya Dryomov wrote: On Thu, Jul 24, 2014 at 5:25 PM, Alex Elder el...@ieee.org wrote: On 07/24/2014 03:42 AM, Ilya Dryomov wrote: There is no sense in trying to update the mapping size before it's even been set. It took me a bit to follow this. But basically there is no mapping unless it's mapped. So previously this was updating the mapping information even for unmapped parent (or other) images. There's no need to update the mapping size for a snapshot--it'll never change. Is that right? If not, please advise; otherwise: No. Previously it was updating the mapping size both on the inital map and on refresh (of the base image). Doing that on the inital map doesn't make sense: not only struct rbd_mapping fields aren't properly initialized at that point - rbd_dev_mapping_set() is called much later in the map sequence, snap_id isn't initialized either (at least in the format 2 case, I haven't looked too closely at the format 1 case). And just in general, trying to update something before it had a chance to change is bogus.. OK, now I see it. Hopefully this makes sense. Here is how it was previously structured: rbd_add() do_rbd_add() |rbd_dev_image_probe() | rbd_dev_header_info() |rbd_dev_v1_header_info()orrbd_dev_v2_header_info() | rbd_header_from_disk()update mapping |update mapping | . . . |rbd_dev_device_setup() | rbd_dev_mapping_set() So neither rbd_header_from_disk() nor rbd_dev_v2_header_info() should be using the values in the rbd_dev-mapping field during initial image probe, because they have not yet been initialized at that point. Meanwhile, the only reason we need to *update* a mapping size is if we learn it may have changed, which will be covered by a refresh, so doing so in rbd_dev_refresh() makes sense. And as long as you know whether it's mapping the base image you might as well do the rbd_exists_validate() call conditionally. OK, this all looks good and makes good sense to me. Thank you for explaining it. Reviewed-by: Alex Elder el...@linaro.org Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
sizeof (struct tYpO *) : it is just a typo or rather a bug ?
Inspired by this typo fix http://article.gmane.org/gmane.linux.kernel/1754640 I grep'ed the current git tree of linus for similar issues. For these 4 places I'm wondering where the appropriate struct definition is located : arch/ia64/sn/kernel/io_acpi_init.c: sizeof(struct pci_devdev_info *)) { tools/perf/builtin-sched.c: sched-tasks = realloc(sched-tasks, sched-nr_tasks * sizeof(struct task_task *)); fs/ceph/xattr.c:xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *), fs/ceph/xattr.c:memset(xattrs, 0, numattr*sizeof(struct ceph_xattr *)); Nevertheless I was told, that gcc doesn't complain about such things due to eventually evaluating it to sizeof(null). I'm however curious if at least a warning should be emitted in such a case, or? -- Toralf -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: sizeof (struct tYpO *) : it is just a typo or rather a bug ?
On Thu, Jul 24, 2014 at 10:20 PM, Toralf Förster toralf.foers...@gmx.de wrote: Inspired by this typo fix http://article.gmane.org/gmane.linux.kernel/1754640 I grep'ed the current git tree of linus for similar issues. For these 4 places I'm wondering where the appropriate struct definition is located : arch/ia64/sn/kernel/io_acpi_init.c: sizeof(struct pci_devdev_info *)) { tools/perf/builtin-sched.c: sched-tasks = realloc(sched-tasks, sched-nr_tasks * sizeof(struct task_task *)); fs/ceph/xattr.c:xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *), fs/ceph/xattr.c:memset(xattrs, 0, numattr*sizeof(struct ceph_xattr *)); Heh, the ceph one is a five year old typo.. Looks like it should be struct ceph_inode_xattr, I'll fix it up. I'm curious though, how did you grep for these? Thanks, Ilya -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: sizeof (struct tYpO *) : it is just a typo or rather a bug ?
On 07/24/2014 08:33 PM, Ilya Dryomov wrote: On Thu, Jul 24, 2014 at 10:20 PM, Toralf Förster toralf.foers...@gmx.de wrote: Inspired by this typo fix http://article.gmane.org/gmane.linux.kernel/1754640 I grep'ed the current git tree of linus for similar issues. For these 4 places I'm wondering where the appropriate struct definition is located : arch/ia64/sn/kernel/io_acpi_init.c: sizeof(struct pci_devdev_info *)) { tools/perf/builtin-sched.c: sched-tasks = realloc(sched-tasks, sched-nr_tasks * sizeof(struct task_task *)); fs/ceph/xattr.c:xattrs = kcalloc(numattr, sizeof(struct ceph_xattr *), fs/ceph/xattr.c:memset(xattrs, 0, numattr*sizeof(struct ceph_xattr *)); Heh, the ceph one is a five year old typo.. Looks like it should be struct ceph_inode_xattr, I'll fix it up. I'm curious though, how did you grep for these? Thanks, Ilya 1: grep -Hr sizeof[ *(]struct .* \*.) | cut -f2- -d':' | tee ~/tmp/out 2: cat ~/tmp/out | perl -wane 'chomp(); my ($left, $right) = split (/sizeof\(/); print $right, \n;' | cut -f2 -d' ' | sort -u | cut -f1 -d')' | grep -v '^+' | while read i; do echo $i; git grep -q struct $i { || echo error; echo; done 3: ignore false positives -- Toralf -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 8/8] rbd: take snap_id into account when reading in parent info
On 07/24/2014 03:42 AM, Ilya Dryomov wrote: If we are mapping a snapshot, we must read in the parent_overlap value of that snapshot instead of that of the base image. Not doing so may in particular result in us returning zeros instead of user data: # cat overlap-snap.sh #!/bin/bash rbd create --size 10 --image-format 2 foo FOO_DEV=$(rbd map foo) dd if=/dev/urandom of=/dev/rbd0 bs=1M /dev/null echo Base image dd if=$FOO_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd rbd snap create foo@snap rbd snap protect foo@snap rbd clone foo@snap bar rbd snap create bar@snap BAR_DEV=$(rbd map bar@snap) echo Snapshot dd if=$BAR_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd rbd resize --allow-shrink --size 4 bar echo Snapshot after base image resize dd if=$BAR_DEV bs=1 count=16 skip=$(((4 20) - 8)) 2/dev/null | xxd # ./overlap-snap.sh Base image 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K%`4(E..6. Snapshot 000: e781 e33b d34b 2225 6034 2845 a2e3 36ed ...;.K%`4(E..6. Resizing image: 100% complete...done. Snapshot after base image resize 000: e781 e33b d34b 2225 ...;.K% Even though bar@snap was taken with the old bar parent_overlap (8M), reads from bar@snap beyond the new bar parent_overlap (4M) return zeroes. Fix it. Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com --- drivers/block/rbd.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index c4606987e9d1..cbc89fa9a677 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -4020,7 +4020,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev) goto out_err; } - snapid = cpu_to_le64(CEPH_NOSNAP); + snapid = cpu_to_le64(rbd_dev-spec-snap_id); Well that's just an outright bug. It's been there since the original commit that added parent support: 86b00e0 rbd: get parent spec for version 2 images Parent images *must* be snapshots, so this was never right. I bet that was hard to figure out... Looks good. Reviewed-by: Alex Elder el...@linaro.org ret = rbd_obj_method_sync(rbd_dev, rbd_dev-header_name, rbd, get_parent, snapid, sizeof (snapid), -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] WIP-8112 radosgw usage man update
Applied (with very slight modifications). Thanks, Yehuda On Thu, Jul 24, 2014 at 8:00 AM, Abhishek Lekshmanan abhishek.lekshma...@gmail.com wrote: Hi, This is a series of patches to update radosgw usage and man pages with all the available help options. 1/3 is a minor format fix to display the options in the same line. Abhishek Lekshmanan (3): rgw: format help options to align with the rest rgw: list all available options during help() doc: update radosgw man page with available opts doc/man/8/radosgw.rst | 28 src/rgw/rgw_main.cc | 10 -- 2 files changed, 36 insertions(+), 2 deletions(-) -- 1.9.1 Thanks, Abhishek -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: First attempt at rocksdb monitor store stress testing
Earlier today I modified the rocksdb options so I could enable universal compaction. Over all performance is lower but I don't see the hang/stall in the middle of the test either. Instead the disk is basically pegged with 100% writes. I suspect average latency is higher than leveldb, but the highest latency is about 5-6s while we were seeing 30s spikes for leveldb with levelled (heh) compaction. I haven't done much tuning either way yet. It may be that if we keep level 0 and level 1 roughly the same size we can reduce stalls in the levelled setups. It will also be interesting to see what happens in these tests on SSDs. Mark On 07/24/2014 06:13 AM, Mark Nelson wrote: Hi Xinxin, Thanks! I wonder as well if it might be interesting to expose the options related to universal compaction? It looks like rocksdb provides a lot of interesting knobs you can adjust! Mark On 07/24/2014 12:08 AM, Shu, Xinxin wrote: Hi mark, I think this maybe related to 'verify_checksums' config option ,when ReadOptions is initialized, default this option is true , all data read from underlying storage will be verified against corresponding checksums, however, this option cannot be configured in wip-rocksdb branch. I will modify code to make this option configurable . Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Thursday, July 24, 2014 7:14 AM To: ceph-devel@vger.kernel.org Subject: First attempt at rocksdb monitor store stress testing Hi Guys, So I've been interested lately in leveldb 99th percentile latency (and the amount of write amplification we are seeing) with leveldb. Joao mentioned he has written a tool called mon-store-stress in wip-leveldb-misc to try to provide a means to roughly guess at what's happening on the mons under heavy load. I cherry-picked it over to wip-rocksdb and after a couple of hacks was able to get everything built and running with some basic tests. There was little tuning done and I don't know how realistic this workload is, so don't assume this means anything yet, but some initial results are here: http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf Command that was used to run the tests: ./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb --write-min-size 50K --write-max-size 2M --percent-write 70 --percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 foo The most interesting bit right now is that rocksdb seems to be hanging in the middle of the test (left it running for several hours). CPU usage on one core was quite high during the hang. Profiling using perf with dwarf symbols I see: - 49.14% ceph-test-mon-s ceph-test-mon-store-stress [.] unsigned int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char const*, unsigned long) - unsigned int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char const*, unsigned long) 51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*, rocksdb::Footer const, rocksdb::ReadOptions const, rocksdb::BlockHandle const, rocksdb::BlockContents*, rocksdb::Env*, bool) 48.30% rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const, rocksdb::CompressionType, rocksdb::BlockHandle*) Not sure what's going on yet, may need to try to enable logging/debugging in rocksdb. Thoughts/Suggestions welcome. :) Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: First attempt at rocksdb monitor store stress testing
Hi mark, I am looking forward to your results on SSDs . rocksdb generates a crc of data to be written , this cannot be switch off (but can be subsititued with xxhash), there are two options , Option. verify_checksums_in_compaction and ReadOptions. verify_checksums, If we disable these two options , i think cpu usage will goes down . If we use universal compaction , this is not friendly with read operation. Btw , can you list your rocksdb configuration? Cheers, xinxin -Original Message- From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Friday, July 25, 2014 7:45 AM To: Shu, Xinxin; ceph-devel@vger.kernel.org Subject: Re: First attempt at rocksdb monitor store stress testing Earlier today I modified the rocksdb options so I could enable universal compaction. Over all performance is lower but I don't see the hang/stall in the middle of the test either. Instead the disk is basically pegged with 100% writes. I suspect average latency is higher than leveldb, but the highest latency is about 5-6s while we were seeing 30s spikes for leveldb with levelled (heh) compaction. I haven't done much tuning either way yet. It may be that if we keep level 0 and level 1 roughly the same size we can reduce stalls in the levelled setups. It will also be interesting to see what happens in these tests on SSDs. Mark On 07/24/2014 06:13 AM, Mark Nelson wrote: Hi Xinxin, Thanks! I wonder as well if it might be interesting to expose the options related to universal compaction? It looks like rocksdb provides a lot of interesting knobs you can adjust! Mark On 07/24/2014 12:08 AM, Shu, Xinxin wrote: Hi mark, I think this maybe related to 'verify_checksums' config option ,when ReadOptions is initialized, default this option is true , all data read from underlying storage will be verified against corresponding checksums, however, this option cannot be configured in wip-rocksdb branch. I will modify code to make this option configurable . Cheers, xinxin -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Thursday, July 24, 2014 7:14 AM To: ceph-devel@vger.kernel.org Subject: First attempt at rocksdb monitor store stress testing Hi Guys, So I've been interested lately in leveldb 99th percentile latency (and the amount of write amplification we are seeing) with leveldb. Joao mentioned he has written a tool called mon-store-stress in wip-leveldb-misc to try to provide a means to roughly guess at what's happening on the mons under heavy load. I cherry-picked it over to wip-rocksdb and after a couple of hacks was able to get everything built and running with some basic tests. There was little tuning done and I don't know how realistic this workload is, so don't assume this means anything yet, but some initial results are here: http://nhm.ceph.com/mon-store-stress/First%20Attempt.pdf Command that was used to run the tests: ./ceph-test-mon-store-stress --mon-keyvaluedb leveldb|rocksdb --write-min-size 50K --write-max-size 2M --percent-write 70 --percent-read 30 --keep-state --test-seed 1406137270 --stop-at 5000 foo The most interesting bit right now is that rocksdb seems to be hanging in the middle of the test (left it running for several hours). CPU usage on one core was quite high during the hang. Profiling using perf with dwarf symbols I see: - 49.14% ceph-test-mon-s ceph-test-mon-store-stress [.] unsigned int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char const*, unsigned long) - unsigned int rocksdb::crc32c::ExtendImplrocksdb::crc32c::Fast_CRC32(unsigned int, char const*, unsigned long) 51.70% rocksdb::ReadBlockContents(rocksdb::RandomAccessFile*, rocksdb::Footer const, rocksdb::ReadOptions const, rocksdb::BlockHandle const, rocksdb::BlockContents*, rocksdb::Env*, bool) 48.30% rocksdb::BlockBasedTableBuilder::WriteRawBlock(rocksdb::Slice const, rocksdb::CompressionType, rocksdb::BlockHandle*) Not sure what's going on yet, may need to try to enable logging/debugging in rocksdb. Thoughts/Suggestions welcome. :) Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html