Re: [ceph-users] Data still in OSD directories after removing
On Wednesday, May 21, 2014 at 18:20 -0700, Josh Durgin wrote:
> On 05/21/2014 03:03 PM, Olivier Bonvalet wrote:
>> On Wednesday, May 21, 2014 at 08:20 -0700, Sage Weil wrote:
>>> You're certain that that is the correct prefix for the rbd image you
>>> removed?  Do you see the objects listed when you do 'rados -p rbd ls - |
>>> grep prefix'?
>>
>> I'm pretty sure, yes: since I didn't see a lot of space freed by the
>> "rbd snap purge" command, I looked at the RBD prefix before doing the
>> "rbd rm" (it's not the first time I've seen this problem, but the
>> previous times I didn't have the RBD prefix, so I was not able to
>> check). So:
>>
>> - "rados -p sas3copies ls - | grep rb.0.14bfb5a.238e1f29" returns
>>   nothing at all
>> - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
>>   error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.0002f026: No such file or directory
>> - # rados stat -p sas3copies rb.0.14bfb5a.238e1f29.
>>   error stat-ing sas3copies/rb.0.14bfb5a.238e1f29.: No such file or directory
>> - # ls -al /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9
>>   -rw-r--r-- 1 root root 4194304 Oct  8  2013 /var/lib/ceph/osd/ceph-67/current/9.1fe_head/DIR_E/DIR_F/DIR_1/DIR_7/rb.0.14bfb5a.238e1f29.0002f026__a252_E68871FE__9
>>
>>> If the objects really are orphaned, the way to clean them up is via
>>> 'rados -p rbd rm objectname'.  I'd like to get to the bottom of how
>>> they ended up that way first, though!
>>
>> I suppose the problem came from me, by doing CTRL+C while "rbd snap
>> purge $IMG" was running. "rados rm -p sas3copies
>> rb.0.14bfb5a.238e1f29.0002f026" doesn't remove those files, and just
>> answers with "No such file or directory".
>
> Those files are all for snapshots, which are removed by the OSDs
> asynchronously in a process called 'snap trimming'. There's no way to
> directly remove them via rados. Since you stopped 'rbd snap purge'
> partway through, it may have removed the reference to the snapshot
> before removing the snapshot itself.
>
> You can get a list of snapshot ids for the remaining objects via the
> 'rados listsnaps' command, and use rados_ioctx_selfmanaged_snap_remove()
> (no convenient wrapper unfortunately) on each of those snapshot ids to
> be sure they are all scheduled for asynchronous deletion.
>
> Josh

Great: rados listsnaps sees it:

# rados listsnaps -p sas3copies rb.0.14bfb5a.238e1f29.0002f026
rb.0.14bfb5a.238e1f29.0002f026:
cloneid  snaps  size     overlap
41554    35746  4194304  []

So, I have to write and compile a wrapper for
rados_ioctx_selfmanaged_snap_remove(), and find a way to obtain a list
of all orphan objects?

I also tried to recreate the object (rados put) and then remove it
(rados rm), but the snapshots are still there.

Olivier
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
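The cleanup loop implied by this exchange (list the orphan objects, run listsnaps on each, then remove each snap id) can be scripted. Below is a rough sketch of the parsing half only, assuming the column layout shown in the paste above; the removal step would still need rados_ioctx_selfmanaged_snap_remove() from librados, e.g. via a small C wrapper or ctypes, which is not shown here:

```python
# Sketch: collect snapshot ids from "rados listsnaps" output so they can
# later be fed to rados_ioctx_selfmanaged_snap_remove().  The column
# layout (cloneid / snaps / size / overlap) is assumed from the paste
# above and may differ between Ceph versions.

def parse_listsnaps(output):
    """Return the set of snapshot ids referenced by an object's clones."""
    snap_ids = set()
    for line in output.splitlines():
        fields = line.split()
        # data rows start with a numeric cloneid, e.g. "41554 35746 4194304 []"
        if fields and fields[0].isdigit():
            # second column is a comma-separated list of snap ids
            for snap in fields[1].split(","):
                if snap.isdigit():
                    snap_ids.add(int(snap))
    return snap_ids

sample = """rb.0.14bfb5a.238e1f29.0002f026:
cloneid snaps size overlap
41554 35746 4194304 []"""

print(sorted(parse_listsnaps(sample)))   # [35746]
```

Feeding this per-object over the output of a directory walk on the OSD (or `rados ls`, if the objects were still listed) would yield the full set of snap ids to schedule for trimming.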
Re: SMART monitoring
On Fri, Dec 27, 2013 at 9:09 PM, Andrey Korolyov and...@xdel.ru wrote:
> On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:
>> On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil s...@inktank.com wrote:
>>> I think the question comes down to whether Ceph should take some
>>> internal action based on the information, or whether that is better
>>> handled by some external monitoring agent. For example, an external
>>> agent might collect SMART info into graphite, and every so often do
>>> some predictive analysis and mark out disks that are expected to fail
>>> soon. I'd love to see some consensus form around what this should
>>> look like...
>>
>> My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN
>> flag if there is a SMART failure on a physical drive that contains an
>> OSD. Yes, you could build the monitoring into a separate system, but I
>> think it'd be really useful to combine it into the cluster health
>> assessment. -- justin
>
> Hi,
>
> Judging from my personal experience, SMART failures can be dangerous
> when they are not bad enough to completely tear down an OSD: the OSD
> will not flap and will not be marked down in time, but cluster
> performance is greatly affected. I don't think the SMART monitoring
> task is really related to Ceph, because separate monitoring of
> predictive failure counters can do the job well, and in the case of
> sudden errors a SMART query may not work at all, since the system may
> have issued a lot of bus resets and the disk can be entirely
> inaccessible. So I propose two sets of strategies: do regular
> scattered background checks, and monitor OSD responsiveness to work
> around cases of performance degradation due to read/write errors.

Some necromancer work for this thread..

Considering a year of experience with Hitachi 4T disks, there are a lot of failures which cannot be handled by SMART at all: speed degradation and sudden disk death. Although the second case rules itself out by kicking out the stuck OSD, it is not very easy to check which disks are about to die without thorough dmesg monitoring for bus errors and periodic speed calibration. Introducing something like an idle-priority speed measurement for OSDs, without dramatically increasing overall wearout, may be useful enough to implement together with an additional OSD perf metric, like seek_time in SMART - though SMART may report a good value for it when performance has already slowed to a crawl. It would also catch most performance-impacting problems that may not be exposed to the host OS at all, such as correctable bus errors. By the way, although 1T Seagates have a much higher failure rate, they always die with an 'appropriate' set of attributes in SMART; Hitachi tends to die without warning :)

Hope that it'll be helpful for someone.
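Justin's HEALTH_WARN suggestion could be prototyped entirely in an external agent, as Sage and Andrey describe. The sketch below is one hedged interpretation: map raw SMART counters for drives backing OSDs to a cluster-style health flag. The attribute names and thresholds are purely illustrative assumptions, not an agreed Ceph policy:

```python
# Sketch of the external-agent approach discussed above: derive a
# cluster-style health flag from per-OSD SMART counters.  Attribute
# names and warning thresholds are illustrative only.

WARN_THRESHOLDS = {
    "reallocated_sector_count": 10,   # grown defects
    "current_pending_sector": 1,      # sectors waiting for remap
    "udma_crc_error_count": 100,      # cabling / bus trouble
}

def osd_smart_health(osd_attrs):
    """osd_attrs: {osd_id: {attr_name: raw_value}} -> (flag, offenders)."""
    offenders = []
    for osd_id, attrs in sorted(osd_attrs.items()):
        for name, limit in sorted(WARN_THRESHOLDS.items()):
            if attrs.get(name, 0) >= limit:
                offenders.append((osd_id, name))
    return ("HEALTH_WARN" if offenders else "HEALTH_OK"), offenders

flag, bad = osd_smart_health({
    12: {"reallocated_sector_count": 37},
    13: {"reallocated_sector_count": 0},
})
print(flag, bad)   # HEALTH_WARN [(12, 'reallocated_sector_count')]
```

An agent like this would still miss the "slow but not failing" disks Andrey describes, which is why he pairs it with responsiveness monitoring rather than relying on SMART alone.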
[PATCH] rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
osd_request, along with r_request and r_reply messages attached to it,
are leaked in __rbd_dev_header_watch_sync() if the requested image
doesn't exist. This is because lingering requests are special and get
an extra ref in the reply path. Fix it by unregistering the linger
request on the error path, and split __rbd_dev_header_watch_sync()
into two functions to make it maintainable.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 drivers/block/rbd.c | 123 ++++++++++++++++++++++++++++++---------------
 1 file changed, 85 insertions(+), 38 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 552a2edcaa74..55c34b70842a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2872,56 +2872,55 @@ static void rbd_watch_cb(u64 ver, u64 notify_id, u8 opcode, void *data)
 }
 
 /*
- * Request sync osd watch/unwatch.  The value of start determines
- * whether a watch request is being initiated or torn down.
+ * Initiate a watch request, synchronously.
  */
-static int __rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, bool start)
+static int rbd_dev_header_watch_sync(struct rbd_device *rbd_dev)
 {
 	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
 	struct rbd_obj_request *obj_request;
 	int ret;
 
-	rbd_assert(start ^ !!rbd_dev->watch_event);
-	rbd_assert(start ^ !!rbd_dev->watch_request);
+	rbd_assert(!rbd_dev->watch_event);
+	rbd_assert(!rbd_dev->watch_request);
 
-	if (start) {
-		ret = ceph_osdc_create_event(osdc, rbd_watch_cb, rbd_dev,
-						&rbd_dev->watch_event);
-		if (ret < 0)
-			return ret;
-		rbd_assert(rbd_dev->watch_event != NULL);
-	}
+	ret = ceph_osdc_create_event(osdc, rbd_watch_cb, rbd_dev,
+				     &rbd_dev->watch_event);
+	if (ret < 0)
+		return ret;
+
+	rbd_assert(rbd_dev->watch_event);
 
-	ret = -ENOMEM;
 	obj_request = rbd_obj_request_create(rbd_dev->header_name, 0, 0,
-							OBJ_REQUEST_NODATA);
-	if (!obj_request)
+					     OBJ_REQUEST_NODATA);
+	if (!obj_request) {
+		ret = -ENOMEM;
 		goto out_cancel;
+	}
 
 	obj_request->osd_req = rbd_osd_req_create(rbd_dev, true, 1, obj_request);
-	if (!obj_request->osd_req)
-		goto out_cancel;
+	if (!obj_request->osd_req) {
+		ret = -ENOMEM;
+		goto out_put;
+	}
 
-	if (start)
-		ceph_osdc_set_request_linger(osdc, obj_request->osd_req);
-	else
-		ceph_osdc_unregister_linger_request(osdc,
-					rbd_dev->watch_request->osd_req);
+	ceph_osdc_set_request_linger(osdc, obj_request->osd_req);
 
 	osd_req_op_watch_init(obj_request->osd_req, 0, CEPH_OSD_OP_WATCH,
-				rbd_dev->watch_event->cookie, 0, start ? 1 : 0);
+			      rbd_dev->watch_event->cookie, 0, 1);
 	rbd_osd_req_format_write(obj_request);
 
 	ret = rbd_obj_request_submit(osdc, obj_request);
 	if (ret)
-		goto out_cancel;
+		goto out_linger;
+
 	ret = rbd_obj_request_wait(obj_request);
 	if (ret)
-		goto out_cancel;
+		goto out_linger;
+
 	ret = obj_request->result;
 	if (ret)
-		goto out_cancel;
+		goto out_linger;
 
 	/*
 	 * A watch request is set to linger, so the underlying osd
@@ -2931,36 +2930,84 @@ static int __rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, bool start)
 	 * it.  We'll drop that reference (below) after we've
 	 * unregistered it.
 	 */
-	if (start) {
-		rbd_dev->watch_request = obj_request;
+	rbd_dev->watch_request = obj_request;
 
-		return 0;
+	return 0;
+
+out_linger:
+	ceph_osdc_unregister_linger_request(osdc, obj_request->osd_req);
+out_put:
+	rbd_obj_request_put(obj_request);
+out_cancel:
+	ceph_osdc_cancel_event(rbd_dev->watch_event);
+	rbd_dev->watch_event = NULL;
+
+	return ret;
+}
+
+/*
+ * Tear down a watch request, synchronously.
+ */
+static int __rbd_dev_header_unwatch_sync(struct rbd_device *rbd_dev)
+{
+	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
+	struct rbd_obj_request *obj_request;
+	int ret;
+
+	rbd_assert(rbd_dev->watch_event);
+	rbd_assert(rbd_dev->watch_request);
+
+	obj_request = rbd_obj_request_create(rbd_dev->header_name, 0, 0,
+					     OBJ_REQUEST_NODATA);
+	if (!obj_request) {
+		ret = -ENOMEM;
+		goto out_cancel;
+	}
+
+	obj_request->osd_req = rbd_osd_req_create(rbd_dev,
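The core of this fix is that each later failure point unwinds one more acquired resource: the linger registration first, then the object request reference, then the event (out_linger, out_put, out_cancel). As a language-neutral illustration of that LIFO unwind discipline, here is a small sketch where the resource names are stand-ins for the kernel objects, not the actual rbd/libceph API:

```python
# Illustration of the LIFO cleanup the patch introduces: each acquired
# resource pushes an undo action; on failure, everything acquired so
# far is undone in reverse order (out_linger -> out_put -> out_cancel).
# Names are stand-ins for the kernel objects, not real API calls.

def watch_sync(fail_at=None):
    log = []
    undo = []
    try:
        log.append("create_event")
        undo.append("cancel_event")          # out_cancel
        if fail_at == "create_request":
            raise RuntimeError(fail_at)
        log.append("create_request")
        undo.append("put_request")           # out_put
        log.append("set_linger")
        undo.append("unregister_linger")     # out_linger
        if fail_at == "submit":
            raise RuntimeError(fail_at)
        log.append("submit")
        return log, []
    except RuntimeError:
        return log, list(reversed(undo))

print(watch_sync(fail_at="submit")[1])
# ['unregister_linger', 'put_request', 'cancel_event']
```

The pre-patch bug corresponds to skipping the "unregister_linger" undo entry: the lingering request kept its extra reply-path reference, so the request and its messages leaked when the image did not exist.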
[PATCH v3 0/4] rbd: make sure we have latest osdmap on 'rbd map'
Hello,

This is a fix for #8184 that makes use of the updated MMonGetVersionReply
userspace code, which will now populate its tid with the tid of the
original MMonGetVersion request.

Thanks,

		Ilya

Ilya Dryomov (4):
  libceph: recognize poolop requests in debugfs
  libceph: mon_get_version request infrastructure
  libceph: add ceph_monc_wait_osdmap()
  rbd: make sure we have latest osdmap on 'rbd map'

 drivers/block/rbd.c             |  36 +++++++-
 include/linux/ceph/mon_client.h |  11 ++-
 net/ceph/ceph_common.c          |   2 +
 net/ceph/debugfs.c              |   8 ++-
 net/ceph/mon_client.c           | 150 ++++++++++++++++++++++++++++++--
 5 files changed, 194 insertions(+), 13 deletions(-)

-- 
1.7.10.4
[PATCH v3 1/4] libceph: recognize poolop requests in debugfs
Recognize poolop requests in the debugfs monc dump, and fix the printk
format specifiers - tid is unsigned.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 net/ceph/debugfs.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 10421a4b76f8..8903dcee8d8e 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -126,9 +126,11 @@ static int monc_show(struct seq_file *s, void *p)
 		req = rb_entry(rp, struct ceph_mon_generic_request, node);
 		op = le16_to_cpu(req->request->hdr.type);
 		if (op == CEPH_MSG_STATFS)
-			seq_printf(s, "%lld statfs\n", req->tid);
+			seq_printf(s, "%llu statfs\n", req->tid);
+		else if (op == CEPH_MSG_POOLOP)
+			seq_printf(s, "%llu poolop\n", req->tid);
 		else
-			seq_printf(s, "%lld unknown\n", req->tid);
+			seq_printf(s, "%llu unknown\n", req->tid);
 	}
 	mutex_unlock(&monc->mutex);
-- 
1.7.10.4
[PATCH v3 3/4] libceph: add ceph_monc_wait_osdmap()
Add ceph_monc_wait_osdmap(), which will block until the osdmap with the
specified epoch is received or a timeout occurs. Export both of these,
as they are going to be needed by rbd.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 include/linux/ceph/mon_client.h |  2 ++
 net/ceph/mon_client.c           | 27 +++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/ceph/mon_client.h b/include/linux/ceph/mon_client.h
index 585ef9450e9d..deb47e45ac7c 100644
--- a/include/linux/ceph/mon_client.h
+++ b/include/linux/ceph/mon_client.h
@@ -104,6 +104,8 @@ extern int ceph_monc_got_mdsmap(struct ceph_mon_client *monc, u32 have);
 extern int ceph_monc_got_osdmap(struct ceph_mon_client *monc, u32 have);
 
 extern void ceph_monc_request_next_osdmap(struct ceph_mon_client *monc);
+extern int ceph_monc_wait_osdmap(struct ceph_mon_client *monc, u32 epoch,
+				 unsigned long timeout);
 
 extern int ceph_monc_do_statfs(struct ceph_mon_client *monc,
 			       struct ceph_statfs *buf);
diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index 6b46f1205ceb..ecfd65c05f49 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -296,6 +296,33 @@ void ceph_monc_request_next_osdmap(struct ceph_mon_client *monc)
 		__send_subscribe(monc);
 	mutex_unlock(&monc->mutex);
 }
+EXPORT_SYMBOL(ceph_monc_request_next_osdmap);
+
+int ceph_monc_wait_osdmap(struct ceph_mon_client *monc, u32 epoch,
+			  unsigned long timeout)
+{
+	unsigned long started = jiffies;
+	int ret;
+
+	mutex_lock(&monc->mutex);
+	while (monc->have_osdmap < epoch) {
+		mutex_unlock(&monc->mutex);
+
+		if (timeout != 0 && time_after_eq(jiffies, started + timeout))
+			return -ETIMEDOUT;
+
+		ret = wait_event_interruptible_timeout(monc->client->auth_wq,
+					monc->have_osdmap >= epoch, timeout);
+		if (ret < 0)
+			return ret;
+
+		mutex_lock(&monc->mutex);
+	}
+
+	mutex_unlock(&monc->mutex);
+	return 0;
+}
+EXPORT_SYMBOL(ceph_monc_wait_osdmap);
 
 /*
-- 
1.7.10.4
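The same wait-until-epoch logic can be sketched in userspace with a condition variable. This is only a rough analogue of ceph_monc_wait_osdmap() (the kernel version uses a wait queue, jiffies, and the monitor mutex), with illustrative class and method names:

```python
# Userspace analogue of ceph_monc_wait_osdmap(): block until a shared
# epoch counter reaches the requested value, or time out.
import threading

class MapWatcher:
    def __init__(self):
        self.epoch = 0
        self.cond = threading.Condition()

    def got_map(self, epoch):
        # called when a new map arrives; wake up any waiters
        with self.cond:
            self.epoch = max(self.epoch, epoch)
            self.cond.notify_all()

    def wait_for_epoch(self, epoch, timeout=None):
        with self.cond:
            ok = self.cond.wait_for(lambda: self.epoch >= epoch, timeout)
        return 0 if ok else -110      # -ETIMEDOUT, as in the patch

w = MapWatcher()
threading.Timer(0.05, w.got_map, args=(7,)).start()
print(w.wait_for_epoch(7, timeout=2))     # 0
print(w.wait_for_epoch(9, timeout=0.05))  # -110
```

As in the kernel code, a timeout of None/0 can be read as "wait forever", and the predicate is re-checked under the lock after every wakeup to avoid lost notifications.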
[PATCH v3 4/4] rbd: make sure we have latest osdmap on 'rbd map'
Given an existing idle mapping (img1), mapping an image (img2) in a
newly created pool (pool2) fails:

    $ ceph osd pool create pool1 8 8
    $ rbd create --size 1000 pool1/img1
    $ sudo rbd map pool1/img1
    $ ceph osd pool create pool2 8 8
    $ rbd create --size 1000 pool2/img2
    $ sudo rbd map pool2/img2
    rbd: sysfs write failed
    rbd: map failed: (2) No such file or directory

This is because client instances are shared by default and we don't
request an osdmap update when bumping a ref on an existing client. The
fix is to use the mon_get_version request to see if the osdmap we have
is the latest, and block until the requested update is received if it's
not.

Fixes: http://tracker.ceph.com/issues/8184

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
v2:
 - send mon_get_version request and wait for a reply only if we were
   unable to locate the pool (i.e. don't hurt the common case)
v3:
 - make use of the updated MMonGetVersionReply userspace code, which
   will now populate MMonGetVersionReply tid with the tid of the
   original MMonGetVersion request

 drivers/block/rbd.c | 36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 552a2edcaa74..daf7b4659b4a 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4683,6 +4683,38 @@ out_err:
 }
 
 /*
+ * Return pool id (>= 0) or a negative error code.
+ */
+static int rbd_add_get_pool_id(struct rbd_client *rbdc, const char *pool_name)
+{
+	u64 newest_epoch;
+	unsigned long timeout = rbdc->client->options->mount_timeout * HZ;
+	int tries = 0;
+	int ret;
+
+again:
+	ret = ceph_pg_poolid_by_name(rbdc->client->osdc.osdmap, pool_name);
+	if (ret == -ENOENT && tries++ < 1) {
+		ret = ceph_monc_do_get_version(&rbdc->client->monc, "osdmap",
+					       &newest_epoch);
+		if (ret < 0)
+			return ret;
+
+		if (rbdc->client->osdc.osdmap->epoch < newest_epoch) {
+			ceph_monc_request_next_osdmap(&rbdc->client->monc);
+			(void) ceph_monc_wait_osdmap(&rbdc->client->monc,
+						     newest_epoch, timeout);
+			goto again;
+		} else {
+			/* the osdmap we have is new enough */
+			return -ENOENT;
+		}
+	}
+
+	return ret;
+}
+
+/*
  * An rbd format 2 image has a unique identifier, distinct from the
  * name given to it by the user.  Internally, that identifier is
  * what's used to specify the names of objects related to the image.
@@ -5053,7 +5085,6 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	struct rbd_options *rbd_opts = NULL;
 	struct rbd_spec *spec = NULL;
 	struct rbd_client *rbdc;
-	struct ceph_osd_client *osdc;
 	bool read_only;
 	int rc = -ENOMEM;
 
@@ -5075,8 +5106,7 @@ static ssize_t do_rbd_add(struct bus_type *bus,
 	}
 
 	/* pick the pool */
-	osdc = &rbdc->client->osdc;
-	rc = ceph_pg_poolid_by_name(osdc->osdmap, spec->pool_name);
+	rc = rbd_add_get_pool_id(rbdc, spec->pool_name);
 	if (rc < 0)
 		goto err_out_client;
 	spec->pool_id = (u64)rc;
-- 
1.7.10.4
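The control flow of rbd_add_get_pool_id() generalizes to a simple pattern: try the lookup, and on "not found" refresh the map at most once before giving up, so the common case pays no extra round trip. A generic sketch, where lookup() and refresh() are stand-ins for the osdmap/monitor calls:

```python
# Sketch of the rbd_add_get_pool_id() control flow: try the lookup, and
# on "not found" refresh the map at most once before returning ENOENT.
# lookup() and refresh() are stand-ins for the osdmap/monitor calls.
ENOENT = -2

def get_pool_id(lookup, refresh, max_tries=2):
    for attempt in range(max_tries):
        ret = lookup()
        if ret != ENOENT or attempt == max_tries - 1:
            return ret
        if not refresh():            # map already new enough: give up
            return ENOENT
    return ENOENT

# simulate the bug scenario: the pool only appears after a map refresh
maps = [{}, {"pool2": 1}]            # state before / after the refresh
state = {"i": 0}
lookup = lambda: maps[state["i"]].get("pool2", ENOENT)
refresh = lambda: state.update(i=1) or True
print(get_pool_id(lookup, refresh))  # 1
```

The single-retry bound mirrors the `tries++ < 1` check in the patch: one refresh is enough to rule out a stale map, and looping further would only mask a genuinely missing pool.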
[PATCH v3 2/4] libceph: mon_get_version request infrastructure
Add support for mon_get_version requests to libceph. This reuses much
of the ceph_mon_generic_request infrastructure, with one exception.
Older OSDs don't set the mon_get_version reply hdr->tid even if the
original request had a non-zero tid, which makes it impossible to look
up ceph_mon_generic_request contexts by tid in get_generic_reply() for
such replies. As a workaround, we allocate a reply message on the
reply path. This can probably interfere with revoke, but I don't see a
better way.

Signed-off-by: Ilya Dryomov ilya.dryo...@inktank.com
---
 include/linux/ceph/mon_client.h |   9 ++-
 net/ceph/ceph_common.c          |   2 +
 net/ceph/debugfs.c              |   2 +
 net/ceph/mon_client.c           | 123 ++++++++++++++++++++++++++++++--
 4 files changed, 128 insertions(+), 8 deletions(-)

diff --git a/include/linux/ceph/mon_client.h b/include/linux/ceph/mon_client.h
index a486f390dfbe..585ef9450e9d 100644
--- a/include/linux/ceph/mon_client.h
+++ b/include/linux/ceph/mon_client.h
@@ -40,9 +40,9 @@ struct ceph_mon_request {
 };
 
 /*
- * ceph_mon_generic_request is being used for the statfs and poolop requests
- * which are bening done a bit differently because we need to get data back
- * to the caller
+ * ceph_mon_generic_request is being used for the statfs, poolop and
+ * mon_get_version requests which are being done a bit differently
+ * because we need to get data back to the caller
  */
 struct ceph_mon_generic_request {
 	struct kref kref;
@@ -108,6 +108,9 @@ extern void ceph_monc_request_next_osdmap(struct ceph_mon_client *monc);
 extern int ceph_monc_do_statfs(struct ceph_mon_client *monc,
 			       struct ceph_statfs *buf);
 
+extern int ceph_monc_do_get_version(struct ceph_mon_client *monc,
+				    const char *what, u64 *newest);
+
 extern int ceph_monc_open_session(struct ceph_mon_client *monc);
 
 extern int ceph_monc_validate_auth(struct ceph_mon_client *monc);
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 67d7721d237e..1675021d8c12 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -72,6 +72,8 @@ const char *ceph_msg_type_name(int type)
 	case CEPH_MSG_MON_SUBSCRIBE_ACK: return "mon_subscribe_ack";
 	case CEPH_MSG_STATFS: return "statfs";
 	case CEPH_MSG_STATFS_REPLY: return "statfs_reply";
+	case CEPH_MSG_MON_GET_VERSION: return "mon_get_version";
+	case CEPH_MSG_MON_GET_VERSION_REPLY: return "mon_get_version_reply";
 	case CEPH_MSG_MDS_MAP: return "mds_map";
 	case CEPH_MSG_CLIENT_SESSION: return "client_session";
 	case CEPH_MSG_CLIENT_RECONNECT: return "client_reconnect";
diff --git a/net/ceph/debugfs.c b/net/ceph/debugfs.c
index 8903dcee8d8e..d1a62c69a9f4 100644
--- a/net/ceph/debugfs.c
+++ b/net/ceph/debugfs.c
@@ -129,6 +129,8 @@ static int monc_show(struct seq_file *s, void *p)
 			seq_printf(s, "%llu statfs\n", req->tid);
 		else if (op == CEPH_MSG_POOLOP)
 			seq_printf(s, "%llu poolop\n", req->tid);
+		else if (op == CEPH_MSG_MON_GET_VERSION)
+			seq_printf(s, "%llu mon_get_version", req->tid);
 		else
 			seq_printf(s, "%llu unknown\n", req->tid);
 	}
diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
index 2ac9ef35110b..6b46f1205ceb 100644
--- a/net/ceph/mon_client.c
+++ b/net/ceph/mon_client.c
@@ -477,14 +477,13 @@ static struct ceph_msg *get_generic_reply(struct ceph_connection *con,
 	return m;
 }
 
-static int do_generic_request(struct ceph_mon_client *monc,
-			      struct ceph_mon_generic_request *req)
+static int __do_generic_request(struct ceph_mon_client *monc, u64 tid,
+				struct ceph_mon_generic_request *req)
 {
 	int err;
 
 	/* register request */
-	mutex_lock(&monc->mutex);
-	req->tid = ++monc->last_tid;
+	req->tid = tid != 0 ? tid : ++monc->last_tid;
 	req->request->hdr.tid = cpu_to_le64(req->tid);
 	__insert_generic_request(monc, req);
 	monc->num_generic_requests++;
@@ -496,13 +495,24 @@ static int do_generic_request(struct ceph_mon_client *monc,
 	mutex_lock(&monc->mutex);
 	rb_erase(&req->node, &monc->generic_request_tree);
 	monc->num_generic_requests--;
-	mutex_unlock(&monc->mutex);
 
 	if (!err)
 		err = req->result;
 	return err;
 }
 
+static int do_generic_request(struct ceph_mon_client *monc,
+			      struct ceph_mon_generic_request *req)
+{
+	int err;
+
+	mutex_lock(&monc->mutex);
+	err = __do_generic_request(monc, 0, req);
+	mutex_unlock(&monc->mutex);
+
+	return err;
+}
+
 /*
  * statfs
  */
@@ -579,6 +589,96 @@ out:
 }
 EXPORT_SYMBOL(ceph_monc_do_statfs);
 
+static void handle_get_version_reply(struct ceph_mon_client *monc,
+				     struct ceph_msg *msg)
+{
+	struct
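The workaround in the commit message comes down to reply-to-request matching. A toy model of the tid lookup done in get_generic_reply() (none of these names are the libceph API) shows why a reply arriving without the request's tid cannot be routed through the generic lookup and forces the per-request fallback:

```python
# Toy model of reply-to-request matching by tid, as in the monitor
# client's generic request tree.  Names are illustrative, not the
# libceph API.

def route_reply(pending, reply_tid):
    """pending: {tid: request name}.  Return the matched request, or
    None when the tid is zero/unknown and a fallback path is needed."""
    if reply_tid == 0:
        return None                  # older peers leave hdr.tid unset
    return pending.get(reply_tid)

pending = {7: "mon_get_version('osdmap')"}
print(route_reply(pending, 7))   # matched by tid
print(route_reply(pending, 0))   # None -> allocate reply on reply path
```

This is exactly the case the patch handles by allocating a reply message on the reply path instead of relying on the tid lookup, at the cost of the possible revoke interference mentioned above.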
collectd / graphite / grafana .. calamari?
Hi.

I saw the thread a couple of days ago on ceph-users regarding collectd... and yes, I've been working on something similar for the last few days :)

https://github.com/rochaporto/collectd-ceph

It has a set of collectd plugins pushing metrics which mostly map what the ceph commands return. In our setup it pushes them to graphite, and the displays rely on grafana (check the link above for a screenshot). As it relies on common building blocks, it's easily extensible, and we'll come up with new dashboards soon - things like plotting osd data against the metrics from the collectd disk plugin, which we also deploy.

This email is mostly to share the work, but also to ask about Calamari. I asked Patrick after the RedHat/Inktank news; I have no idea what it provides, but I'm sure it comes with lots of extra sauce - he suggested asking on the list. What's the timeline to have it open sourced? It would be great to have a look at it, and as there's work from different people in this area, maybe we can start working together on some fancier monitoring tools.

Regards,
  Ricardo