Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
On 02/26/2013 01:00 PM, Sage Weil wrote: On Tue, 26 Feb 2013, Yan, Zheng wrote:

It looks to me like truncates can get queued for later, so that's not the case? And how could the client receive a truncate while in the middle of writing? Either it's got the write caps (in which case nobody else can truncate), or it shouldn't be writing... I suppose if somebody enables lazyIO that would do it, but again in that case I imagine we'd rather do the truncate before writing:

Commit 22cddde104 makes buffered writes get/put caps for individual pages, so the MDS can revoke the write caps in the middle of a write. I think it's better to do the vmtruncate after the write finishes.

What if you're writing past the truncate point? Then your written data would be truncated off even though it should have been a file extension.

I was wrong; I thought the order of write and truncate was not important. Now I'm worried about the correctness of commit 22cddde104 - we probably should revert that commit.

I'm sorry I haven't had time to dig into this thread. Do you think we should revert that commit for this merge window?

No, that commit fixes an i_size update bug; it is more important than the deadlock. But I think the eventual fix is to revert that commit and also remove the optimization that drops Fw caps early (to avoid slow cap revocation caused by balance_dirty_pages).

My gut feeling is that the whole locking scheme here needs to be reworked. I suspect where we'll end up is something more like XFS, where there is an explicit truncate (or other) mutex that is separate from i_mutex. The fact that we were relying on i_mutex serialization from the VFS was fragile and (as we can see) ultimately a flawed approach. And the locking and races between writes, mmap, and truncation in the page cache code are fragile and annoying. The only good thing going for us is that fsx is pretty good at stress testing the code. :)

If we want to keep the write operation atomic, write and vmtruncate need to be protected by a mutex.
I don't think introducing a new mutex makes things simpler.

Oh, you're right about the locking (I think, anyway?). However, i_size doesn't protect us - we might for instance have truncated to zero and then back up to the original file size. :( But that doesn't look like it's handled correctly anyway (racy) - am I misreading that badly or is the infrastructure in fact broken in that case?

You are right. Probably we can fix the race by disallowing both read and write when there is a pending truncate.

FWIW, I think the high-level strategy from before should still be basically sound: we can queue a truncation without blocking, and any write path will apply a pending truncation under the appropriate lock. But I think it's going to mean carefully documenting the locking requirements for the various paths and working out a new locking structure. Yan, is this something you are able to put some time into? I would like to work on it, but it's going to be hard for me to find the time in the next several weeks to get to it.

OK, I will work on it.

Regards
Yan, Zheng
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
maintenance on osd host
Hi list,

how can I do a short maintenance, like a kernel upgrade, on an OSD host? Right now ceph starts to backfill immediately if I say:

ceph osd out 41
...

Without the ceph osd out command, all clients hang for the time ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't result in any difference.

Stefan
Re: maintenance on osd host
On Tue, Feb 26, 2013 at 6:56 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:

Hi list, how can I do a short maintenance, like a kernel upgrade, on an OSD host? Right now ceph starts to backfill immediately if I say: ceph osd out 41 ... Without the ceph osd out command, all clients hang for the time ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't result in any difference.

Hi Stefan, in my practice nodown will freeze all I/O until the OSD returns; killing the osd process and setting ``mon osd down out interval'' large enough will do the trick - you'll get only two small freezes from the peering process, at the start and at the end. Also, it is very strange that your clients hang for a long time - I have set non-optimal values on purpose and was not able to observe a re-peering process longer than a minute.
Re: OpenStack summit : Ceph design session
It's been an embryonic internal Inktank conversation, and Nick Barcet at eNovance mentioned some ideas when we last met. Will try and put together a blueprint soon.

Neil

On Mon, Feb 25, 2013 at 2:04 AM, Loic Dachary l...@dachary.org wrote:

Hi Neil,

I've added "RBD backups to secondary clusters within OpenStack" to the list of blueprints. Do you have links to mail threads / chat logs related to this topic? I moved the content of the session to an etherpad for collaborative editing https://etherpad.openstack.org/roadmap-for-ceph-integration-with-openstack and it is now linked from http://summit.openstack.org/cfp/details/38

Cheers

On 02/25/2013 07:12 AM, Neil Levine wrote:

Thanks for taking the lead on this, Loic. As a blueprint, I'd like to look at RBD backups to secondary clusters within OpenStack. Nick Barcet and others have mentioned ideas for this now that Cinder is multi-cluster aware.

Neil

On Sun, Feb 24, 2013 at 3:16 PM, Josh Durgin josh.dur...@inktank.com wrote:

On 02/23/2013 02:33 AM, Loic Dachary wrote:

Hi,

In anticipation of the next OpenStack summit http://www.openstack.org/summit/portland-2013/, I proposed a session to discuss OpenStack and Ceph integration. Our meeting during FOSDEM earlier this month was a great experience, although it was planned at the last minute. I hope we can organize something even better for the summit. For developers and contributors to both Ceph and OpenStack such as myself, it would be a great opportunity to figure out a sensible roadmap for the next six months. I realize this roadmap is already clear for Josh Durgin and other Ceph / OpenStack developers who have been passionately invested in both projects for a long time. However, I am new to both projects and such a session would be a precious guide and highly motivating. http://summit.openstack.org/cfp/details/38 What do you think?

Sounds like a great idea! Thanks for putting together the session!
Josh

--
Loïc Dachary, Artisan Logiciel Libre
Re: maintenance on osd host
On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:

Hi list, how can I do a short maintenance, like a kernel upgrade, on an OSD host? Right now ceph starts to backfill immediately if I say: ceph osd out 41 ... Without the ceph osd out command, all clients hang for the time ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't result in any difference.

For a temporary event like this, you want the osd to be down (so that io can continue with the remaining replicas) but NOT marked out (so that data doesn't get rebalanced). The simplest way to do that is:

ceph osd set noout
killall ceph-osd
.. reboot ..

Just remember to do "ceph osd unset noout" when you are done, so that future osds that fail will get marked out on their own after the 5 minute (default) interval.

sage
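Sage's procedure above, spelled out as one maintenance sequence (a sketch only - these commands assume a running cluster with admin credentials, and the reboot step is site-specific):

```shell
# Prevent the cluster from marking any down OSD "out" (no rebalancing/backfill)
ceph osd set noout

# Stop the OSD daemons on this host; I/O continues on the remaining replicas
killall ceph-osd

# ... kernel upgrade, reboot, OSDs restart and rejoin ...

# Restore normal behaviour: failed OSDs will again be marked out
# automatically after the (default) 5 minute interval
ceph osd unset noout
```

The important property is that noout is a cluster-wide flag: while it is set, a down OSD keeps its data placement, so only peering happens when it returns, not a full backfill.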
Re: maintenance on osd host
But that results in a 1-3s hiccup for all KVM VMs. This is not what I want.

Stefan

On 26.02.2013 at 18:06, Sage Weil s...@inktank.com wrote:

On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote: Hi list, how can I do a short maintenance, like a kernel upgrade, on an OSD host? Right now ceph starts to backfill immediately if I say: ceph osd out 41 ... Without the ceph osd out command, all clients hang for the time ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't result in any difference.

For a temporary event like this, you want the osd to be down (so that io can continue with the remaining replicas) but NOT marked out (so that data doesn't get rebalanced). The simplest way to do that is: ceph osd set noout, killall ceph-osd, .. reboot .. Just remember to do "ceph osd unset noout" when you are done, so that future osds that fail will get marked out on their own after the 5 minute (default) interval.

sage
Re: maintenance on osd host
On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote:

But that results in a 1-3s hiccup for all KVM VMs. This is not what I want.

You can do

kill $pid
ceph osd down $osdid

(or even reverse the order, if the sequence is quick enough) to avoid waiting for the failure detection delay. But if the OSDs are going down, then the peering has to happen one way or another.

sage

Stefan

On 26.02.2013 at 18:06, Sage Weil s...@inktank.com wrote: On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote: Hi list, how can I do a short maintenance, like a kernel upgrade, on an OSD host? Right now ceph starts to backfill immediately if I say: ceph osd out 41 ... Without the ceph osd out command, all clients hang for the time ceph does not know that the host was rebooted. I tried ceph osd set nodown and ceph osd set noout, but this doesn't result in any difference. For a temporary event like this, you want the osd to be down (so that io can continue with the remaining replicas) but NOT marked out (so that data doesn't get rebalanced). The simplest way to do that is: ceph osd set noout, killall ceph-osd, .. reboot .. Just remember to do "ceph osd unset noout" when you are done, so that future osds that fail will get marked out on their own after the 5 minute (default) interval. sage
Re: Crash and strange things on MDS
On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:

On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:

On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289).

Thanks for the analysis. We use only one ceph-fuse client at this time, which runs all the high-load commands like rsync, tar and cp on a huge number of files. Well, I will replace it with the kernel client.

Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well.

After setting the mds cache size to 900k, the storms are gone. However, we continue to observe high latency on some clients (always the same clients): each IO takes between 40 and 90ms (for example with Wordpress, it takes ~20 seconds to load all needed files...). With a non-laggy client, IO requests take less than 1ms.

--
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
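The cache bump mentioned above ("setting the mds cache size to 900k") would look roughly like this in ceph.conf (a sketch; 900000 is the value from this thread, and the right number depends on how many dentries your clients actually pin):

```
[mds]
    ; default is 100000 inodes; raise it above the number of
    ; dentries the clients keep pinned to avoid cache pressure storms
    mds cache size = 900000
```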
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote:

On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote: On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analysis. We use only one ceph-fuse client at this time, which runs all the high-load commands like rsync, tar and cp on a huge number of files. Well, I will replace it with the kernel client. Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well.

After setting the mds cache size to 900k, the storms are gone. However, we continue to observe high latency on some clients (always the same clients): each IO takes between 40 and 90ms (for example with Wordpress, it takes ~20 seconds to load all needed files...). With a non-laggy client, IO requests take less than 1ms.

I can't be sure from that description, but it sounds like you've got one client which is generally holding the RW caps on the files, and then another client which comes in occasionally to read those same files. That requires the first client to drop its caps, and involves a couple round-trip messages and is going to take some time — this is an unavoidable consequence if you have clients sharing files, although there's probably still room for us to optimize. Can you describe your client workload in a bit more detail?
-Greg
Re: [PATCH] libceph: fix a osd request memory leak
Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 02/25/2013 02:36 PM, Alex Elder wrote:

If an invalid layout is provided to ceph_osdc_new_request(), its call to calc_layout() might return an error. At that point in the function we've already allocated an osd request structure, so we need to free it (drop a reference) in the event such an error occurs. The only other value calc_layout() will return is 0, so make that explicit in the successful case.

This resolves: http://tracker.ceph.com/issues/4240

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/osd_client.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 39629b6..5daced2 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -109,7 +109,7 @@ static int calc_layout(struct ceph_vino vino,
 	snprintf(req->r_oid, sizeof(req->r_oid), "%llx.%08llx", vino.ino, bno);
 	req->r_oid_len = strlen(req->r_oid);
 
-	return r;
+	return 0;
 }
 
 /*
@@ -437,8 +437,10 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
 
 	/* calculate max write size */
 	r = calc_layout(vino, layout, off, plen, req, ops);
-	if (r < 0)
+	if (r < 0) {
+		ceph_osdc_put_request(req);
 		return ERR_PTR(r);
+	}
 
 	req->r_file_layout = *layout;	/* keep a copy */
 
 	/* in case it differs from natural (file) alignment that
Re: [PATCH] libceph: make ceph_msg-bio_seg be unsigned
Reviewed-by: Josh Durgin josh.dur...@inktank.com

On 02/25/2013 02:40 PM, Alex Elder wrote:

The bio_seg field is used by the ceph messenger in iterating through a bio. It should never have a negative value, so make it an unsigned. Change variables used to hold bio_seg values to all be unsigned as well. Change two variable names in init_bio_iter() to match the convention used everywhere else.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |  2 +-
 net/ceph/messenger.c           | 16 +++++++++-------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 60903e0..8297288 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -86,7 +86,7 @@ struct ceph_msg {
 #ifdef CONFIG_BLOCK
 	struct bio *bio;		/* instead of pages/pagelist */
 	struct bio *bio_iter;		/* bio iterator */
-	int bio_seg;			/* current bio segment */
+	unsigned int bio_seg;		/* current bio segment */
 #endif /* CONFIG_BLOCK */
 	struct ceph_pagelist *trail;	/* the trailing part of the data */
 	bool front_is_vmalloc;
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 2c0669f..c06f940 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -697,18 +697,19 @@ static void con_out_kvec_add(struct ceph_connection *con,
 }
 
 #ifdef CONFIG_BLOCK
-static void init_bio_iter(struct bio *bio, struct bio **iter, int *seg)
+static void init_bio_iter(struct bio *bio, struct bio **bio_iter,
+			  unsigned int *bio_seg)
 {
 	if (!bio) {
-		*iter = NULL;
-		*seg = 0;
+		*bio_iter = NULL;
+		*bio_seg = 0;
 		return;
 	}
-	*iter = bio;
-	*seg = bio->bi_idx;
+	*bio_iter = bio;
+	*bio_seg = (unsigned int) bio->bi_idx;
 }
 
-static void iter_bio_next(struct bio **bio_iter, int *seg)
+static void iter_bio_next(struct bio **bio_iter, unsigned int *seg)
 {
 	if (*bio_iter == NULL)
 		return;
@@ -1818,7 +1819,8 @@ static int read_partial_message_pages(struct ceph_connection *con,
 #ifdef CONFIG_BLOCK
 static int read_partial_message_bio(struct ceph_connection *con,
-				    struct bio **bio_iter, int *bio_seg,
+				    struct bio **bio_iter,
+				    unsigned int *bio_seg,
 				    unsigned int data_len, bool do_datacrc)
 {
 	struct bio_vec *bv = bio_iovec_idx(*bio_iter, *bio_seg);
Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
On Mon, Feb 25, 2013 at 4:01 PM, Gregory Farnum g...@inktank.com wrote:

On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng zheng.z@intel.com wrote:

On 02/23/2013 02:54 AM, Gregory Farnum wrote:

I haven't spent that much time in the kernel client, but this patch isn't working out for me. In particular, I'm pretty sure we need to preserve this:

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5d5c32b..b9d8417 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2067,12 +2067,6 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
 	}
 
 	have = __ceph_caps_issued(ci, &implemented);
-	/*
-	 * disallow writes while a truncate is pending
-	 */
-	if (ci->i_truncate_pending)
-		have &= ~CEPH_CAP_FILE_WR;
-
 	if ((have & need) == need) {
 		/*
 		 * Look at (implemented & ~have & not) so that we keep waiting

Because if there's a pending truncate, we really can't write. You do handle it in the particular case of doing buffered file writes, but these caps are a marker of permission, and the client shouldn't have write permission to a file until it's up to date on the truncates. Or am I misunderstanding something?

A pending vmtruncate is only relevant to the buffered write case. If the client doesn't have the 'b' cap, the page cache is empty, and __ceph_do_pending_vmtruncate is a no-op. For buffered writes, this patch only affects the situation where the client receives a truncate request from the MDS in the middle of a write.

It looks to me like truncates can get queued for later, so that's not the case? And how could the client receive a truncate while in the middle of writing? Either it's got the write caps (in which case nobody else can truncate), or it shouldn't be writing... I suppose if somebody enables lazyIO that would do it, but again in that case I imagine we'd rather do the truncate before writing:

I think it's better to do the vmtruncate after the write finishes.

What if you're writing past the truncate point? Then your written data would be truncated off even though it should have been a file extension.
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index a1e5b81..bf7849a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -653,7 +653,6 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
 	dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
 	     inode, ceph_vinop(inode), pos, (unsigned)len, inode);
 again:
-	__ceph_do_pending_vmtruncate(inode);

There doesn't seem to be any kind of replacement for this? We need to do any pending truncates before reading or we might get stale data back.

generic_file_aio_read checks i_size when copying data to the user buffer, so the user program can't get stale data. This __ceph_do_pending_vmtruncate is not protected by i_mutex; it's a potential bug, and that's the reason I removed it.

Oh, you're right about the locking (I think, anyway?). However, i_size doesn't protect us — we might for instance have truncated to zero and then back up to the original file size. :( But that doesn't look like it's handled correctly anyway (racy) — am I misreading that badly or is the infrastructure in fact broken in that case?

Sage pointed out in standup today that we only send a truncate message for truncate-downs, and he thinks that the more complicated cases (truncate to zero, write out 2KB, truncate to 1KB) are okay thanks to the capabilities bits. I haven't thought it through, but that seems plausible to me.
-Greg
Re: [PATCH 0/3] libceph: focus calc_layout() on filling in the osd op
On 02/25/2013 03:09 PM, Alex Elder wrote:

This series refactors the code involved with identifying the details of the name, offset, and length of an object involved with an osd request based on a file layout. It makes the focus of calc_layout() be filling in an osd op structure based on the file layout it is provided. The caller (ceph_osdc_new_request()) is then responsible for filling in fields related to the request, such as the name of the target object.

-Alex

[PATCH 1/3] libceph: pass object number back to calc_layout() caller
[PATCH 2/3] libceph: format target object name in caller
[PATCH 3/3] libceph: don't pass request to calc_layout()

These all look good.

Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: [PATCH 0/4] libceph: abstract setting message data info
On 02/25/2013 03:40 PM, Alex Elder wrote:

This series makes the fields related to the data portion of a ceph message not get manipulated by code outside the ceph messenger. It implements some interface functions that can be used to assign data-related fields. Doing this will allow the way message data is managed to be changed independent of the users of the messenger module.

-Alex

[PATCH 1/4] libceph: distinguish page array and pagelist count
[PATCH 2/4] libceph: set page alignment in start_request()
[PATCH 3/4] libceph: isolate message page field manipulation
[PATCH 4/4] libceph: isolate other message data fields

These all look good.

Reviewed-by: Josh Durgin josh.dur...@inktank.com
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
Hi Sage,

On 02/20/2013 05:12 PM, Sage Weil wrote:

Hi Jim, I'm resurrecting an ancient thread here, but: we've just observed this on another big cluster and remembered that this hasn't actually been fixed.

Sorry for the delayed reply - I missed this in a backlog of unread email...

I think the right solution is to make an option that will setsockopt on SO_RCVBUF to some value (say, 256KB). I pushed a branch that does this, wip-tcp. Do you mind checking to see if this addresses the issue (without manually adjusting things in /proc)?

I'll be happy to test it out...

And perhaps we should consider making this default to 256KB...

That's the value I've been using with my /proc adjustments since I figured out what was going on. My servers use a 10 GbE port for each of the cluster and public networks, with cephfs clients using 1 GbE, and I've not detected any issues resulting from that value. So, it seems like a decent starting point for a default...

-- Jim

Thanks!
sage
Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load
On Tue, 26 Feb 2013, Jim Schutt wrote:

I think the right solution is to make an option that will setsockopt on SO_RCVBUF to some value (say, 256KB). I pushed a branch that does this, wip-tcp. Do you mind checking to see if this addresses the issue (without manually adjusting things in /proc)?

I'll be happy to test it out...

That would be great! It's branch wip-tcp, and the setting is 'ms tcp rcvbuf'.

Thanks!
sage
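For reference, the option being discussed would be sketched in ceph.conf roughly like this (the option name 'ms tcp rcvbuf' is from Sage's mail; 256 KB is the value Jim reports using from his /proc tuning, not a tested recommendation):

```
[global]
    ; cap the kernel socket receive buffer (SO_RCVBUF) on ceph TCP
    ; connections; 262144 bytes = 256 KB
    ms tcp rcvbuf = 262144
```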
Re: maintanance on osd host
Hi Sage,

On 26.02.2013 18:24, Sage Weil wrote:

On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote: But that results in a 1-3s hiccup for all KVM VMs. This is not what I want.

You can do

kill $pid
ceph osd down $osdid

(or even reverse the order, if the sequence is quick enough) to avoid waiting for the failure detection delay. But if the OSDs are going down, then the peering has to happen one way or another.

But exactly this results in starting backfill immediately. My idea was to first mark the osd down so the mon knows about this fact, so no I/O is stalled, and then reboot the whole host. But exactly this does not work as expected, as backfilling starts immediately after setting the osd to down ;-(

Greets,
Stefan
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 10:10:06AM -0800, Gregory Farnum wrote:

On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote: On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analysis. We use only one ceph-fuse client at this time, which runs all the high-load commands like rsync, tar and cp on a huge number of files. Well, I will replace it with the kernel client. Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well. After setting the mds cache size to 900k, the storms are gone. However, we continue to observe high latency on some clients (always the same clients): each IO takes between 40 and 90ms (for example with Wordpress, it takes ~20 seconds to load all needed files...). With a non-laggy client, IO requests take less than 1ms.

I can't be sure from that description, but it sounds like you've got one client which is generally holding the RW caps on the files, and then another client which comes in occasionally to read those same files. That requires the first client to drop its caps, and involves a couple round-trip messages and is going to take some time — this is an unavoidable consequence if you have clients sharing files, although there's probably still room for us to optimize. Can you describe your client workload in a bit more detail?

We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, additional clients spawn to mount the same folder and access the same files.
In the following test, only one client was used to serve the application (a website using Wordpress). I ran the test under strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern:

...
[pid 4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
[pid 4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
[pid 4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
[pid 4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
[pid 4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
[pid 4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
...

~250 files were accessed for only one request (thanks, Wordpress.).

The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache.

I have a debug (debug_mds=20) log of the active MDS during this test if you want.

--
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com
Re: maintenance on osd host
On Tue, 26 Feb 2013, Stefan Priebe wrote:

Hi Sage, on 26.02.2013 18:24, Sage Weil wrote: On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote: But that results in a 1-3s hiccup for all KVM VMs. This is not what I want. You can do kill $pid, ceph osd down $osdid (or even reverse the order, if the sequence is quick enough) to avoid waiting for the failure detection delay. But if the OSDs are going down, then the peering has to happen one way or another. But exactly this results in starting backfill immediately. My idea was to first mark the osd down so the mon knows about this fact, so no I/O is stalled, and then reboot the whole host. But exactly this does not work as expected, as backfilling starts immediately after setting the osd to down ;-(

Backfilling should not happen on down, unless you have reconfigured 'mon osd down out interval = 0' or something along those lines. Setting the 'noout' flag will also prevent the osds from being marked out.

As for limiting the IO stall: you could also do 'ceph osd set noup', then mark them down, then kill the daemon, and you won't have to worry about racing with the daemon marking itself back up (as it normally does).

sage
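The noup sequence Sage describes can be sketched as a per-host script (an untested sketch; the OSD ids and the pid-file path are hypothetical placeholders - adapt them to your deployment):

```shell
# Keep restarting OSDs from marking themselves back "up" mid-procedure,
# and keep down OSDs from being marked "out" (which would start backfill)
ceph osd set noup
ceph osd set noout

# Explicitly mark each OSD on this host down, then stop its daemon, so
# clients redirect immediately instead of waiting for failure detection
for id in 40 41 42; do                       # hypothetical OSD ids on this host
    ceph osd down $id
    kill $(cat /var/run/ceph/osd.$id.pid)    # hypothetical pid-file path
done

# ... reboot / kernel upgrade ...

# Let the OSDs rejoin and restore normal failure handling
ceph osd unset noup
ceph osd unset noout
```

The point of noup here is ordering: without it, a daemon that is still shutting down (or restarting) can race you and mark itself up again right after the explicit 'ceph osd down'.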
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? -Greg
Re: maintenance on osd host
On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Sage, Am 26.02.2013 18:24, schrieb Sage Weil: On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote: But that results in a 1-3s hiccup for all KVM vms. This is not what I want. You can do kill $pid ceph osd down $osdid (or even reverse the order, if the sequence is quick enough) to avoid waiting for the failure detection delay. But if the OSDs are going down, then the peering has to happen one way or another. But exactly this results in starting backfill immediately. My idea was to first mark the osd down so the mon knows about this fact. So no I/O is stalled. And then reboot the whole host but exactly this does not work like expected as backfilling is starting immediately after setting the osd to down ;-( out and down are quite different — are you sure you tried down and not out? (You reference out in your first email, rather than down.) -Greg
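The down/out distinction Greg draws can be made concrete; a brief sketch (the OSD id is a placeholder):

```shell
ceph osd down 3   # "down": stops serving I/O; PGs re-peer, but data is NOT moved
ceph osd out 3    # "out": removed from CRUSH placement; backfill starts
```

Only `out` triggers data movement, which is why confusing the two makes backfill appear to start "on down". This is an operational sketch and needs a live cluster to run.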
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? 
Upload in progress :-) -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com
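For anyone wanting to try the comparison discussed above, the option is toggled at mount time; a hedged sketch (the monitor address, client name, and secretfile path are placeholders):

```shell
# mount without the client dentry-cache optimization (nodcache, the default here)
mount -t ceph 192.168.0.1:6789:/ /mnt/ceph \
      -o name=admin,secretfile=/etc/ceph/admin.secret,nodcache

# re-mount with dcache enabled for comparison, so repeated lookups can be
# satisfied locally instead of forcing a round trip to the MDS
umount /mnt/ceph
mount -t ceph 192.168.0.1:6789:/ /mnt/ceph \
      -o name=admin,secretfile=/etc/ceph/admin.secret,dcache
```

This needs a live cluster and root; option spellings follow the mount.ceph options quoted in the thread.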
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). Oh right — I forgot Sage had enabled that by default; I don't recall how necessary it is. (Sage?) 
I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) Looking forward to it. ;)
Re: Crash and strange things on MDS
On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote: On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). Oh right — I forgot Sage had enabled that by default; I don't recall how necessary it is. 
(Sage?) That code is buggy, see ceph_dir_test_complete(), it always returns false. Yan, Zheng I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) Looking forward to it. ;)
Re: Crash and strange things on MDS
On Wed, 27 Feb 2013, Yan, Zheng wrote: On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote: On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). Oh right —
I forgot Sage had enabled that by default; I don't recall how necessary it is. (Sage?) That code is buggy, see ceph_dir_test_complete(), it always returns false. FWIW, I think #4023 may be the root of those original bugs. I think that should be up next (as far as fs/ceph goes) after this i_mutex stuff is sorted out. sage Yan, Zheng I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) Looking forward to it. ;)
[GIT PULL] Ceph updates for 3.9-rc1
Hi Linus, Please pull the following Ceph updates from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus A few groups of patches here. Alex has been hard at work improving the RBD code, laying groundwork for understanding the new formats and doing layering. Most of the infrastructure is now in place for the final bits that will come with the next window. There are a few changes to the data layout. Jim Schutt's patch fixes some non-ideal CRUSH behavior, and a set of patches from me updates the client to speak a newer version of the protocol and implement an improved hashing strategy across storage nodes (when the server side supports it too). A pair of patches from Sam Lang fix the atomicity of open+create operations. Several patches from Yan, Zheng fix various mds/client issues that turned up during multi-mds torture tests. A final set of patches expose file layouts via virtual xattrs, and allow the policies to be set on directories via xattrs as well (avoiding the awkward ioctl interface and providing a consistent interface for both kernel mount and ceph-fuse users). Thanks!
sage Alex Elder (118): libceph: reformat __reset_osd() rbd: document rbd_spec structure rbd: kill rbd_spec->image_name_len rbd: kill rbd_spec->image_id_len rbd: use kmemdup() ceph: define ceph_encode_8_safe() rbd: define and use rbd_warn() rbd: add warning messages for missing arguments rbd: add a warning in bio_chain_clone_range() rbd: add warnings to rbd_dev_probe_update_spec() rbd: standardize rbd_request variable names rbd: standardize ceph_osd_request variable names rbd: be picky about osd request status type rbd: encapsulate handling for a single request rbd: end request on error in rbd_do_request() caller rbd: a little more cleanup of rbd_rq_fn() rbd: make exists flag atomic rbd: only get snap context for write requests rbd: separate layout init rbd: drop oid parameters from ceph_osdc_build_request() rbd: drop snapid parameter from rbd_req_sync_read() rbd: drop flags parameter from rbd_req_sync_exec() rbd: kill rbd_req_sync_op() snapc and snapid parameters rbd: don't bother setting snapid in rbd_do_request() libceph: always allow trail in osd request libceph: kill op_needs_trail() libceph: pass length to ceph_osdc_build_request() libceph: pass length to ceph_calc_file_object_mapping() libceph: drop snapid in ceph_calc_raw_layout() libceph: drop osdc from ceph_calc_raw_layout() libceph: don't set flags in ceph_osdc_alloc_request() libceph: don't set pages or bio in ceph_osdc_alloc_request() rbd: pass num_op with ops array libceph: pass num_op with ops rbd: there is really only one op rbd: assume single op in a request rbd: kill ceph_osd_req_op->flags rbd: pull in ceph_calc_raw_layout() rbd: open code rbd_calc_raw_layout() rbd: don't bother calculating file mapping rbd: use a common layout for each device rbd: combine rbd sync watch/unwatch functions rbd: don't leak rbd_req on synchronous requests rbd: don't leak rbd_req for rbd_req_sync_notify_ack() rbd: don't assign extent info in rbd_do_request() rbd: don't assign extent info in rbd_req_sync_op() rbd: move
call osd op setup into rbd_osd_req_op_create() rbd: move remaining osd op setup into rbd_osd_req_op_create() rbd: assign watch request more directly rbd: fix type of snap_id in rbd_dev_v2_snap_info() rbd: small changes rbd: check for overflow in rbd_get_num_segments() rbd: don't retry setting up header watch Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing libceph: fix messenger CONFIG_BLOCK dependencies rbd: new request tracking code rbd: kill rbd_rq_fn() and all other related code rbd: kill rbd_req_coll and rbd_request rbd: implement sync object read with new code rbd: get rid of rbd_req_sync_read() rbd: implement watch/unwatch with new code rbd: get rid of rbd_req_sync_watch() rbd: use new code for notify ack rbd: get rid of rbd_req_sync_notify_ack() rbd: send notify ack asynchronously rbd: implement sync method with new code rbd: get rid of rbd_req_sync_exec() rbd: unregister linger in watch sync routine rbd: track object rather than osd request for watch rbd: decrement obj request count when deleting rbd: don't drop watch requests on completion rbd: define flags field, use it for exists flag rbd: prevent open for image being removed libceph: add a compatibility check interface rbd: don't take extra bio reference for osd client libceph: don't require r_num_pages for bio requests
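The virtual-xattr layout interface mentioned in the pull-request summary can be exercised with the standard attr tools; a hedged sketch (paths and values are placeholders, and the attribute names follow the ceph.file.layout/ceph.dir.layout scheme this window introduces):

```shell
# read the layout of an existing file via the virtual xattr
getfattr -n ceph.file.layout /mnt/ceph/somefile

# set a layout policy on a directory; new files created under it inherit it,
# replacing the older ioctl-based interface
setfattr -n ceph.dir.layout.stripe_count -v 2       /mnt/ceph/somedir
setfattr -n ceph.dir.layout.object_size  -v 4194304 /mnt/ceph/somedir
```

This needs a mounted CephFS (kernel client or ceph-fuse) to run; the per-field attribute names are an assumption based on the summary above.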
Re: maintenance on osd host
Hi Greg, Hi Sage, Am 26.02.2013 21:27, schrieb Gregory Farnum: On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: out and down are quite different — are you sure you tried down and not out? (You reference out in your first email, rather than down.) -Greg Sorry, that's it: I misread down / out. Wouldn't it make sense to mark the osd automatically down when shutting down via the init script? It doesn't seem to make sense to hope for the automatic detection when somebody uses the init script. Stefan