A bug in rolling back to a degraded object
Hi Sage and all,

Another bug discovered during the proxy write teuthology testing is when rolling back to a degraded object. This doesn't seem specific to proxy write. I see a scenario like the one below in the log file:

- A rollback op comes in and is enqueued.
- Several other ops on the same object come in and are enqueued.
- The rollback op dispatches and finds that the object it rolls back to is degraded, so the op is pushed back onto a list to wait for the degraded object to recover.
- The later ops are handled and responded to the client.
- The degraded object recovers. The rollback op is enqueued again and finally responded to the client.

This breaks the op order.

A fix for this is to maintain a map to track the (source, destination) pairs. When an op on the source dispatches, if such a pair exists, queue the op in the destination's degraded waiting list. A drawback of this approach is that some entries in the 'waiting_for_degraded_object' list of the destination object may not actually be accessing the destination, but the source.

Does this make sense?
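To make the proposal concrete, here is a minimal C++ sketch of the tracking map, assuming it lives in ReplicatedPG; the member and function names (degraded_rollback_waits, dispatch_op) are illustrative, not the actual code - only waiting_for_degraded_object is a real field:

    // Sketch only: track rollbacks waiting on a degraded clone.
    // Maps the rollback's target object (source) to the degraded
    // clone it rolls back to (destination).
    std::map<hobject_t, hobject_t> degraded_rollback_waits;

    void dispatch_op(OpRequestRef op, const hobject_t &oid)
    {
      // If a rollback on this object is parked behind a degraded clone,
      // park later ops on the same object behind the same clone so the
      // client-visible op order is preserved.
      std::map<hobject_t, hobject_t>::iterator p =
        degraded_rollback_waits.find(oid);
      if (p != degraded_rollback_waits.end()) {
        waiting_for_degraded_object[p->second].push_back(op);
        return;
      }
      // ... normal dispatch path ...
    }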
Re: Nightly Kernel branch not found errors
On Mon, Jun 1, 2015 at 1:19 AM, Sage Weil s...@newdream.net wrote:

On Sun, 31 May 2015, Gregory Farnum wrote:

On Sun, May 31, 2015 at 1:40 PM, Ilya Dryomov idryo...@gmail.com wrote:

On Sun, May 31, 2015 at 10:54 PM, Gregory Farnum g...@gregs42.com wrote:

We are getting this error in what looks like everything that specifies the testing kernel. (That turns out to be almost all of the FS tests and a surprising number of the non-rados runs; e.g. rgw.) I've checked that the testing branch of ceph-client.git still exists, and when looking at the teuthology git logs the recent ping-pong of commits on kernel flavors et al stands out. Any ideas? :)

Which lab is this in? Latest rgw and fs in sepia look fine. The kernel task was broken the entire last week, but in a different way - when scheduled with teuthology-suite it wouldn't install anything even if you told it to install e.g. testing. I fixed that on Friday. Looks like this is coming from create_initial_config(). Could be an environment issue, like down gitbuilders or a problem with the requests module? The kernel branch is checked before the others, so it may not have anything to do with it at all.

It's happening across labs for tests that were supposed to be scheduled starting on the 29th (at least, that I've noticed). I think this issue is before they get into pulpito, which is why the latest fs suite run there was scheduled on May 27. :( Looking at the gitbuilders I do see that the CentOS6 testing branch is red and rhel7 appears to be down... maybe we're checking on more of them now and then failing when those don't appear? :/

We're half-way through creating the centos7 kernel builder to replace the rhel ones, so I expect things are broken on the rpm side. Not sure if that is the root cause here, but we can probably wait for that to get fixed first before looking further.

I think it's definitely related to this. Commit f2ce5e1ed3d4 ("Treat RHEL as CentOS when scheduling") by Zack makes it so centos7 gitbuilders are poked in case the distro wasn't specified explicitly, which is how I assume all those runs are scheduled. The kernel centos6 gitbuilder is down, and AFAICT the centos7 one doesn't exist.

Thanks,
Ilya
Re: [ceph-users] Discuss: New default recovery config settings
On 06/01/15 09:43, Jan Schermer wrote:

We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. With minimal scrubbing and recovery settings, everything is mostly good. Turned out many issues we had were due to too few PGs - once we increased them from 4K to 16K everything sped up nicely (because the chunks are smaller), but during heavy activity we are still getting some “slow IOs”.

I believe there is an ionice knob in newer versions (we still run Dumpling), and that should do the trick no matter how much additional “load” is put on the OSDs. Everybody’s bottleneck will be different - we run all flash so disk IO is not a problem but an OSD daemon is - no ionice setting will help with that, it just needs to be faster ;-)

If you are interested, I'm currently testing a ruby script which schedules the deep scrubs one at a time, trying to simultaneously make them fit in a given time window, avoid successive scrubs on the same OSD, and space the deep scrubs according to the amount of data scrubbed. I use it because Ceph by itself can't prevent multiple scrubs from happening simultaneously on the network, and that can severely impact our VM performance. I can clean it up and post it on Github.

Best regards,
Lionel
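For reference, a ceph.conf sketch of the throttling knobs touched on in this thread; the osd_disk_thread_ioprio_* options are the "ionice knob" available in newer releases (not in Dumpling), so treat the exact names and availability as version-dependent assumptions to verify:

    [osd]
        osd max scrubs = 1                 # at most one scrub per OSD at a time
        osd scrub load threshold = 0.5     # skip scheduled scrubs when the host is loaded
        osd deep scrub interval = 604800   # deep scrub once a week (seconds)
        # ionice-style knob in newer releases: run the disk thread,
        # which performs deep scrubs, at idle I/O priority
        osd disk thread ioprio class = idle
        osd disk thread ioprio priority = 7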
Re: Nightly Kernel branch not found errors
On Mon, Jun 1, 2015 at 1:28 AM, Gregory Farnum g...@gregs42.com wrote:

On Sun, May 31, 2015 at 3:19 PM, Sage Weil s...@newdream.net wrote:

On Sun, 31 May 2015, Gregory Farnum wrote:

On Sun, May 31, 2015 at 1:40 PM, Ilya Dryomov idryo...@gmail.com wrote:

On Sun, May 31, 2015 at 10:54 PM, Gregory Farnum g...@gregs42.com wrote:

We are getting this error in what looks like everything that specifies the testing kernel. (That turns out to be almost all of the FS tests and a surprising number of the non-rados runs; e.g. rgw.) I've checked that the testing branch of ceph-client.git still exists, and when looking at the teuthology git logs the recent ping-pong of commits on kernel flavors et al stands out. Any ideas? :)

Which lab is this in? Latest rgw and fs in sepia look fine. The kernel task was broken the entire last week, but in a different way - when scheduled with teuthology-suite it wouldn't install anything even if you told it to install e.g. testing. I fixed that on Friday. Looks like this is coming from create_initial_config(). Could be an environment issue, like down gitbuilders or a problem with the requests module? The kernel branch is checked before the others, so it may not have anything to do with it at all.

It's happening across labs for tests that were supposed to be scheduled starting on the 29th (at least, that I've noticed). I think this issue is before they get into pulpito, which is why the latest fs suite run there was scheduled on May 27. :( Looking at the gitbuilders I do see that the CentOS6 testing branch is red and rhel7 appears to be down... maybe we're checking on more of them now and then failing when those don't appear? :/

We're half-way through creating the centos7 kernel builder to replace the rhel ones, so I expect things are broken on the rpm side. Not sure if that is the root cause here, but we can probably wait for that to get fixed first before looking further.

*None* of our FS tests are running while this problem persists (and they're not alone). That's not the sort of thing we can wait on... Maybe we have some gratuitous use of the testing kernel we can remove (I'm not sure), but that's the sort of thing that needs to be discussed across teams so we can deal with it proactively instead of just finding out when the nightlies start failing.

It's not the kernel to blame here, so unless there is a specific reason let's not remove use of the testing kernel. What I think we need to do is make sure that the teuthology test suite includes teuthology-suite --dry-run ... with --kernel testing at the very least. It would be good to also verify that the testing kernel is actually going to be installed by the kernel task, to avoid last week's breakage, but that's a matter of adding unit tests and is a different issue. Unless I'm missing something, the only problem here is really that the commit I mentioned in another mail got pushed before the gitbuilder was set up.

Thanks,
Ilya
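A sketch of the dry-run check Ilya suggests; the suite and branch arguments here are illustrative, and the exact flags should be checked against the teuthology-suite of your deployment:

    teuthology-suite --dry-run --suite fs --kernel testing --ceph master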
Re: [ceph-users] Discuss: New default recovery config settings
Slow requests are not exactly tied to the PG number, but we were getting slow requests whenever backfills or recoveries fired up - increasing the number of PGs helped with this, as the “blocks” of work are much smaller than before. We have roughly the same number of OSDs as you but only one really important pool (“volumes”); we ended up with 16384 PGs for that one. The number of threads increased exponentially, some latencies went down, some went up; in the end it works just as well as before, with the added benefit of better data distribution and a better behaving cluster. But YMMV - once you go up, you can’t go down.

Jan

On 01 Jun 2015, at 10:57, huang jun hjwsm1...@gmail.com wrote:

hi, jan

2015-06-01 15:43 GMT+08:00 Jan Schermer j...@schermer.cz:

We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. With minimal scrubbing and recovery settings, everything is mostly good. Turned out many issues we had were due to too few PGs - once we increased them from 4K to 16K everything sped up nicely (because the chunks are smaller), but during heavy activity we are still getting some “slow IOs”.

How many PGs do you set? We get slow requests many times, but didn't relate it to the PG number. We follow the equation below for every pool:

    Total PGs = (OSDs * 100) / pool size

Our cluster has 157 OSDs and 3 pools; we set pg_num to 8192 for every pool, but OSD CPU utilization goes up to 300% after a restart - we think it's loading PGs during that period. We will try a different PG number when we get slow requests.

thanks!

I believe there is an ionice knob in newer versions (we still run Dumpling), and that should do the trick no matter how much additional “load” is put on the OSDs. Everybody’s bottleneck will be different - we run all flash so disk IO is not a problem but an OSD daemon is - no ionice setting will help with that, it just needs to be faster ;-)

Jan

On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:

On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:

Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:

- osd_max_backfills to 1 (from 10)
- osd_recovery_max_active to 3 (from 15)
- osd_recovery_op_priority to 1 (from 10)
- osd_recovery_max_single_start to 1 (from 5)

I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
-Greg

--
thanks
huangjun
Re: [ceph-users] Discuss: New default recovery config settings
We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. With minimal scrubbing and recovery settings, everything is mostly good. Turned out many issues we had were due to too few PGs - once we increased them from 4K to 16K everything sped up nicely (because the chunks are smaller), but during heavy activity we are still getting some “slow IOs”.

I believe there is an ionice knob in newer versions (we still run Dumpling), and that should do the trick no matter how much additional “load” is put on the OSDs. Everybody’s bottleneck will be different - we run all flash so disk IO is not a problem but an OSD daemon is - no ionice setting will help with that, it just needs to be faster ;-)

Jan

On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:

On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:

Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:

- osd_max_backfills to 1 (from 10)
- osd_recovery_max_active to 3 (from 15)
- osd_recovery_op_priority to 1 (from 10)
- osd_recovery_max_single_start to 1 (from 5)

I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
-Greg
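For anyone who wants to try the proposed values before any change of defaults ships, a ceph.conf sketch using exactly the numbers from Sam's list:

    [osd]
        osd max backfills = 1
        osd recovery max active = 3
        osd recovery op priority = 1
        osd recovery max single start = 1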
Re: [ceph-users] Discuss: New default recovery config settings
hi, jan

2015-06-01 15:43 GMT+08:00 Jan Schermer j...@schermer.cz:

We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. With minimal scrubbing and recovery settings, everything is mostly good. Turned out many issues we had were due to too few PGs - once we increased them from 4K to 16K everything sped up nicely (because the chunks are smaller), but during heavy activity we are still getting some “slow IOs”.

How many PGs do you set? We get slow requests many times, but didn't relate it to the PG number. We follow the equation below for every pool:

    Total PGs = (OSDs * 100) / pool size

Our cluster has 157 OSDs and 3 pools; we set pg_num to 8192 for every pool, but OSD CPU utilization goes up to 300% after a restart - we think it's loading PGs during that period. We will try a different PG number when we get slow requests.

thanks!

I believe there is an ionice knob in newer versions (we still run Dumpling), and that should do the trick no matter how much additional “load” is put on the OSDs. Everybody’s bottleneck will be different - we run all flash so disk IO is not a problem but an OSD daemon is - no ionice setting will help with that, it just needs to be faster ;-)

Jan

On 30 May 2015, at 01:17, Gregory Farnum g...@gregs42.com wrote:

On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:

Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:

- osd_max_backfills to 1 (from 10)
- osd_recovery_max_active to 3 (from 15)
- osd_recovery_op_priority to 1 (from 10)
- osd_recovery_max_single_start to 1 (from 5)

I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
-Greg

--
thanks
huangjun
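As a worked example of the rule of thumb above (my arithmetic, not from the original mail), plugging in this cluster's numbers gives:

    Total PGs = (157 * 100) / 3 = 5233 (rounded down)

Rounding 5233 up to the next power of two gives 8192, the pg_num huang jun reports using per pool.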
Re: [PATCH] libceph: use kvfree() in ceph_put_page_vector()
On Mon, Jun 1, 2015 at 5:36 PM, Geliang Tang geliangt...@163.com wrote:

Use kvfree() instead of open-coding it.

Signed-off-by: Geliang Tang geliangt...@163.com
---
 net/ceph/pagevec.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 096d914..d4f5f22 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -51,10 +51,7 @@ void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty)
 			set_page_dirty_lock(pages[i]);
 		put_page(pages[i]);
 	}
-	if (is_vmalloc_addr(pages))
-		vfree(pages);
-	else
-		kfree(pages);
+	kvfree(pages);
 }
 EXPORT_SYMBOL(ceph_put_page_vector);

Already fixed in testing, wasn't pushed to linux-next though, sorry!

Thanks,
Ilya
Re: Adding chance_test_backfill_full thrasher in the ec tasks
I filed http://tracker.ceph.com/issues/11831 and will work on it.

On 01/06/2015 17:18, Samuel Just wrote:

Yep, that should be included in the ec thrashing tests.
-Sam

----- Original Message -----
From: shylesh kumar shylesh.mo...@gmail.com
To: Samuel Just sj...@redhat.com, l...@dachary.org
Cc: ceph-devel@vger.kernel.org
Sent: Saturday, May 30, 2015 8:27:20 AM
Subject: Adding chance_test_backfill_full thrasher in the ec tasks

Hi,

As per discussion with loic, none of the ec tests have the chance_test_backfill_full thrasher in their tasks. Do you think it would be a good idea to add it, since it can simulate low-disk-space scenarios?

--
Loïc Dachary, Artisan Logiciel Libre
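For context, a sketch of what enabling it in a suite yaml might look like, assuming the chance_test_backfill_full knob exposed by the thrashosds task; the key name and value scale should be double-checked against ceph-qa-suite:

    tasks:
    - thrashosds:
        chance_test_backfill_full: 0.5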
Re: Adding chance_test_backfill_full thrasher in the ec tasks
Yep, that should be included in the ec thrashing tests.
-Sam

----- Original Message -----
From: shylesh kumar shylesh.mo...@gmail.com
To: Samuel Just sj...@redhat.com, l...@dachary.org
Cc: ceph-devel@vger.kernel.org
Sent: Saturday, May 30, 2015 8:27:20 AM
Subject: Adding chance_test_backfill_full thrasher in the ec tasks

Hi,

As per discussion with loic, none of the ec tests have the chance_test_backfill_full thrasher in their tasks. Do you think it would be a good idea to add it, since it can simulate low-disk-space scenarios?

--
Thanks
Shylesh
[PATCH] libceph: use kvfree() in ceph_put_page_vector()
Use kvfree() instead of open-coding it.

Signed-off-by: Geliang Tang geliangt...@163.com
---
 net/ceph/pagevec.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
index 096d914..d4f5f22 100644
--- a/net/ceph/pagevec.c
+++ b/net/ceph/pagevec.c
@@ -51,10 +51,7 @@ void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty)
 			set_page_dirty_lock(pages[i]);
 		put_page(pages[i]);
 	}
-	if (is_vmalloc_addr(pages))
-		vfree(pages);
-	else
-		kfree(pages);
+	kvfree(pages);
 }
 EXPORT_SYMBOL(ceph_put_page_vector);
--
2.3.4
ceph branch status
-- All Branches --

Adam Crume adamcr...@gmail.com
    2014-12-01 20:45:58 -0800  wip-doc-rbd-replay

Alfredo Deza ad...@redhat.com
    2015-03-23 16:39:48 -0400  wip-11212
    2015-03-25 10:10:43 -0400  wip-11065

Alfredo Deza alfredo.d...@inktank.com
    2014-07-08 13:58:35 -0400  wip-8679
    2014-09-04 13:58:14 -0400  wip-8366
    2014-10-13 11:10:10 -0400  wip-9730

Boris Ranto bra...@redhat.com
    2015-04-13 13:51:32 +0200  wip-fix-ceph-dencoder-build
    2015-04-14 13:51:49 +0200  wip-fix-ceph-dencoder-build-master
    2015-05-15 15:26:05 +0200  wip-selinux-policy

Chi Xinze xmdx...@gmail.com
    2015-05-15 21:47:44 +0000  XinzeChi-wip-ec-read

Dan Mick dan.m...@inktank.com
    2013-07-16 23:00:06 -0700  wip-5634

Danny Al-Gaaf danny.al-g...@bisect.de
    2015-04-23 16:32:00 +0200  wip-da-SCA-20150421
    2015-04-23 17:18:57 +0200  wip-nosetests
    2015-04-23 18:20:16 +0200  wip-unify-num_objects_degraded
    2015-04-30 12:34:08 +0200  wip-user
    2015-06-01 06:59:26 +0200  wip-da-SCA-20150427

David Zafman dzaf...@redhat.com
    2014-08-29 10:41:23 -0700  wip-libcommon-rebase
    2014-11-26 09:41:50 -0800  wip-9403
    2014-12-02 21:20:17 -0800  wip-zafman-docfix
    2015-01-08 15:07:45 -0800  wip-vstart-kvs
    2015-02-20 16:13:43 -0800  wip-10883-firefly
    2015-02-20 16:14:57 -0800  wip-10883-dumpling
    2015-04-23 10:22:09 -0700  wip-11454-80.8
    2015-04-24 13:14:23 -0700  wip-cot-giant
    2015-05-21 15:29:48 -0700  wip-11511
    2015-05-28 13:41:30 -0700  wip-10794

Dongmao Zhang deanracc...@gmail.com
    2014-11-14 19:14:34 +0800  thesues-master

Greg Farnum gfar...@redhat.com
    2015-04-19 18:03:41 -0700  greg-testing-quota-full
    2015-04-29 21:44:11 -0700  wip-init-names
    2015-05-31 16:30:59 -0700  greg-fs-testing

Greg Farnum g...@inktank.com
    2014-10-23 13:33:44 -0700  wip-forward-scrub

Gregory Meno gm...@redhat.com
    2015-02-25 17:30:33 -0800  wip-fix-typo-troubleshooting

Guang Yang ygu...@yahoo-inc.com
    2014-08-08 10:41:12 +0000  wip-guangyy-pg-splitting
    2014-09-25 00:47:46 +0000  wip-9008
    2014-09-30 10:36:39 +0000  guangyy-wip-9614

Haomai Wang haomaiw...@gmail.com
    2014-07-27 13:37:49 +0800  wip-flush-set
    2015-04-20 00:47:59 +0800  update-organization
    2015-04-20 00:48:42 +0800  update-organization-1

Ilya Dryomov ilya.dryo...@inktank.com
    2014-09-05 16:15:10 +0400  wip-rbd-notify-errors

James Page james.p...@ubuntu.com
    2013-02-27 22:50:38 +0000  wip-debhelper-8

Jason Dillaman dilla...@redhat.com
    2015-04-14 14:55:50 -0400  wip-11056-firefly
    2015-05-12 11:04:36 -0400  wip-librbd-helgrind
    2015-05-15 11:19:33 -0400  wip-11537
    2015-05-21 16:27:18 -0400  wip-11579
    2015-05-22 00:52:20 -0400  wip-11625
    2015-05-27 00:41:42 -0400  wip-librbd-perf-counters
    2015-05-28 15:23:33 -0400  wip-11405-next

Jenkins jenk...@inktank.com
    2014-07-29 05:24:39 -0700  wip-nhm-hang
    2015-02-02 10:35:28 -0800  wip-sam-v0.92
    2015-04-13 13:24:40 -0700  rhcs-v0.80.8
    2015-05-27 16:44:15 -0700  rhcs-v0.94.1-ubuntu

Joao Eduardo Luis jec...@gmail.com
    2014-09-10 09:39:23 +0100  wip-leveldb-get.dumpling

Joao Eduardo Luis joao.l...@gmail.com
    2014-07-22 15:41:42 +0100  wip-leveldb-misc

Joao Eduardo Luis joao.l...@inktank.com
    2014-09-02 17:19:52 +0100  wip-leveldb-get
    2014-10-17 16:20:11 +0100  wip-paxos-fix
    2014-10-21 21:32:46 +0100  wip-9675.dumpling

Joao Eduardo Luis j...@redhat.com
    2014-11-17 16:43:53 +0000  wip-mon-osdmap-cleanup
    2014-12-15 16:18:56 +0000  wip-giant-mon-backports
    2014-12-17 17:13:57 +0000  wip-mon-backports.firefly
    2014-12-17 23:15:10 +0000  wip-mon-sync-fix.dumpling
    2015-01-07 23:01:00 +0000  wip-mon-blackhole-mlog-0.87.7
    2015-01-10 02:40:42 +0000  wip-dho-joao
    2015-01-10 02:46:31 +0000  wip-mon-paxos-fix
    2015-01-26 13:00:09 +0000  wip-mon-datahealth-fix
    2015-02-04 22:36:14 +0000  wip-10643
    2015-02-26 14:54:01 +0000  wip-10507

Joao Eduardo Luis j...@suse.de
    2015-05-27 23:48:45 +0100  wip-mon-scrub
    2015-05-28 08:12:48 +0100  wip-11786
    2015-05-29 12:21:43 +0100  wip-11545
    2015-06-01 12:18:01 +0100  wip-joao-testing

John Spray john.sp...@redhat.com
    2015-02-18 14:04:18 +0000  wip10649
    2015-04-06 17:25:02 +0100  wip-progress-events
    2015-05-05 14:29:16 +0100  wip-live-query
    2015-05-06 16:21:25 +0100  wip-11541-hammer-workaround
    2015-05-22 10:48:32 +0100  wip-offline-backward
    2015-05-22 14:17:20 +0100  wip-damaged-fixes
    2015-05-28 13:31:32 +0100  wip-9963
    2015-05-29 13:59:03 +0100  wip-9964-intrapg
    2015-05-29
Re: MDS auth caps for cephfs
I have a pull request posted at https://github.com/ceph/ceph/pull/4809 that updates the mds cap parser and defines a check method. Please take a look and see if this makes sense.

For the path restrictions, I think the next steps are something like:

- generate a path string and pass it in to that method
- write some simple tests
- make sure the hook is called from everywhere it needs to be (all of the other request handlers in Server.cc)
- call the hook from the cap writeback path (tricky)
- figure out how to handle files in the stray dir (tricky)

For the user-based restrictions:

- I think we need to expand the allows() method so that it has a couple of output arguments (uid and gid list) that subsequent permissions should be validated against (a rough sketch of this shape follows at the end of this mail). Then we can change the function in Server.cc so that when those are populated it does an actual unix permissions check. This seems like the simplest thing to me, although it does have the slightly odd property that if you are doing an operation that requires permission on two inodes (directory and file, say) you might have two different 'allow ...' lines granting access to each. (I think this is both painful to avoid and also harmless?)
- same items as above to call into the check method in the appropriate places.
- extend the client/mds protocol to pass a credential struct. This should piggyback on Goldwyn's work to fix up these structures for namespaces.
- mark caps with credentials on clients, fix writeback order, etc.

In any case, the first item on both those lists seems like the place to start.

sage
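Purely as an illustration of the output-argument idea for allows(), a hypothetical C++ sketch - this is not the code in the pull request, and the grant fields and prefix-match semantics are assumptions:

    // Hypothetical: return whether some grant matches the path, and
    // report the uid/gid list that the subsequent unix permissions
    // check in Server.cc should be validated against.
    bool MDSAuthCaps::allows(const std::string &path,
                             int *out_uid,              // -1 means any user
                             std::vector<int> *out_gids)
    {
      for (std::vector<MDSCapGrant>::iterator g = grants.begin();
           g != grants.end(); ++g) {
        // path restriction: the grant's path must be a prefix of the
        // request path
        if (!g->path.empty() &&
            path.compare(0, g->path.size(), g->path) != 0)
          continue;
        if (out_uid)
          *out_uid = g->uid;
        if (out_gids)
          *out_gids = g->gids;
        return true;
      }
      return false;
    }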
Re: A bug in rolling back to a degraded object
Hi Zhiqiang,

On Mon, 1 Jun 2015, Wang, Zhiqiang wrote:

Hi Sage and all,

Another bug discovered during the proxy write teuthology testing is when rolling back to a degraded object. This doesn't seem specific to proxy write. I see a scenario like the one below in the log file:

- A rollback op comes in and is enqueued.
- Several other ops on the same object come in and are enqueued.
- The rollback op dispatches and finds that the object it rolls back to is degraded, so the op is pushed back onto a list to wait for the degraded object to recover.
- The later ops are handled and responded to the client.
- The degraded object recovers. The rollback op is enqueued again and finally responded to the client.

Yep!

This breaks the op order. A fix for this is to maintain a map to track the (source, destination) pairs, and when an op on the source dispatches, if such a pair exists, queue the op in the destination's degraded waiting list. A drawback of this approach is that some entries in the 'waiting_for_degraded_object' list of the destination object may not actually be accessing the destination, but the source. Does this make sense?

Yeah, and I think it's fine for the op to appear in the other object's list. In fact there is already a mechanism in place that does something similar: obc->blocked_by. It was added for the clone operation, which unfortunately I don't think is exercised in any of our tests.. but I think it does exactly what you need. If you set the head's blocked_by to the clone (and the clone's blocks set to include the head), then anybody trying to write to the head will queue up on the clone's degraded list (see the check for this in ReplicatedPG::do_op()).

I think this mostly amounts to making the _rollback_to() method get the clone's obc, set up the blocked_by/blocks relationship, start recovery of that object immediately, and queue itself on the waiting list.

Does that make sense?

sage
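A rough sketch of what Sage describes, assuming hammer-era ReplicatedPG internals; maybe_kick_recovery() is a hypothetical helper name for starting recovery of the clone immediately, and the other calls should be checked against the actual code:

    // Illustrative only - inside ReplicatedPG::_rollback_to(ctx, ...):
    ObjectContextRef clone_obc = get_object_context(clone_oid, false);
    if (is_degraded_object(clone_oid)) {
      // Block the head on the degraded clone: later writers to the head
      // will find head's blocked_by set and queue on the clone's
      // degraded list (see the check in ReplicatedPG::do_op()).
      ctx->obc->blocked_by = clone_obc;
      clone_obc->blocks.insert(ctx->obc);

      maybe_kick_recovery(clone_oid);   // hypothetical helper
      wait_for_degraded_object(clone_oid, ctx->op);
      return;
    }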
Re: osd crash with object store set to newstore
I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash with that branch with 'debug newstore = 20' and send us the log? (You can just do 'ceph-post-file filename'.)

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

Hi Sage,

The assertion failed at line 1639; here is the log message:

2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)

Just before the crash, here are the debug statements printed by the method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0

Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

Hi Sage and all,

I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests, the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and running.

ceph version b8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When this crash happens I noticed that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash seems to happen when two OSDs try to communicate and replicate a particular PG.

Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed in what function and at what line number. I haven't seen this crash so I'm not sure offhand what it is.

Thanks!
sage
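Both steps Sage asks for, spelled out (the log path is illustrative):

    # in ceph.conf on the affected OSD host, then restart the osd
    [osd]
        debug newstore = 20

    # upload the resulting log for the developers to fetch
    ceph-post-file /var/log/ceph/ceph-osd.7.log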
RE: A bug in rolling back to a degraded object
That's great we already have such a field. I'll make use of it to fix this. Thanks for pointing it out.

-----Original Message-----
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, June 2, 2015 7:42 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: A bug in rolling back to a degraded object

Hi Zhiqiang,

On Mon, 1 Jun 2015, Wang, Zhiqiang wrote:

Hi Sage and all,

Another bug discovered during the proxy write teuthology testing is when rolling back to a degraded object. This doesn't seem specific to proxy write. I see a scenario like the one below in the log file:

- A rollback op comes in and is enqueued.
- Several other ops on the same object come in and are enqueued.
- The rollback op dispatches and finds that the object it rolls back to is degraded, so the op is pushed back onto a list to wait for the degraded object to recover.
- The later ops are handled and responded to the client.
- The degraded object recovers. The rollback op is enqueued again and finally responded to the client.

Yep!

This breaks the op order. A fix for this is to maintain a map to track the (source, destination) pairs, and when an op on the source dispatches, if such a pair exists, queue the op in the destination's degraded waiting list. A drawback of this approach is that some entries in the 'waiting_for_degraded_object' list of the destination object may not actually be accessing the destination, but the source. Does this make sense?

Yeah, and I think it's fine for the op to appear in the other object's list. In fact there is already a mechanism in place that does something similar: obc->blocked_by. It was added for the clone operation, which unfortunately I don't think is exercised in any of our tests.. but I think it does exactly what you need. If you set the head's blocked_by to the clone (and the clone's blocks set to include the head), then anybody trying to write to the head will queue up on the clone's degraded list (see the check for this in ReplicatedPG::do_op()).

I think this mostly amounts to making the _rollback_to() method get the clone's obc, set up the blocked_by/blocks relationship, start recovery of that object immediately, and queue itself on the waiting list.

Does that make sense?

sage
osd crash with object store set to newstore
Hi Sage and all,

I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests, the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and running.

ceph version b8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When this crash happens I noticed that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash seems to happen when two OSDs try to communicate and replicate a particular PG.

Regards
Srikanth
Re: osd crash with object store set to newstore
Hi Sage,

The assertion failed at line 1639; here is the log message:

2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)

Just before the crash, here are the debug statements printed by the method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0

Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

Hi Sage and all,

I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests, the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and running.

ceph version b8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When this crash happens I noticed that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash seems to happen when two OSDs try to communicate and replicate a particular PG.

Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed in what function and at what line number. I haven't seen this crash so I'm not sure offhand what it is.

Thanks!
sage
RE: Discuss: New default recovery config settings
On Fri, May 29, 2015 at 4:18 PM, Gregory Farnum g...@gregs42.com wrote:

On Fri, May 29, 2015 at 2:47 PM, Samuel Just sj...@redhat.com wrote:

Many people have reported that they need to lower the osd recovery config options to minimize the impact of recovery on client io. We are talking about changing the defaults as follows:

- osd_max_backfills to 1 (from 10)
- osd_recovery_max_active to 3 (from 15)
- osd_recovery_op_priority to 1 (from 10)
- osd_recovery_max_single_start to 1 (from 5)

I'm under the (possibly erroneous) impression that reducing the number of max backfills doesn't actually reduce recovery speed much (but will reduce memory use), but that dropping the op priority can. I'd rather we make users manually adjust values which can have a material impact on their data safety, even if most of them choose to do so. After all, even under our worst behavior we're still doing a lot better than a resilvering RAID array. ;)
-Greg

--

Greg,

When we set:

    osd recovery max active = 1
    osd max backfills = 1

we see rebalance times go down by more than half and client write performance increase significantly while rebalancing. We initially played with these settings to improve client IO, expecting recovery time to get worse, but we got a 2-for-1. This was with firefly using replication, downing an entire node with lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and osd_recovery_max_single_start at their defaults. We dropped osd_recovery_max_active and osd_max_backfills together. If you're right, do you think osd_recovery_max_active=1 is the primary reason for the improvement? (A higher osd_max_backfills helps recovery time with erasure coding.)

-Paul
Re: Discuss: New default recovery config settings
On Mon, Jun 1, 2015 at 6:39 PM, Paul Von-Stamwitz pvonstamw...@us.fujitsu.com wrote:

Greg,

When we set:

    osd recovery max active = 1
    osd max backfills = 1

we see rebalance times go down by more than half and client write performance increase significantly while rebalancing. We initially played with these settings to improve client IO, expecting recovery time to get worse, but we got a 2-for-1. This was with firefly using replication, downing an entire node with lots of SAS drives. We left osd_recovery_threads, osd_recovery_op_priority, and osd_recovery_max_single_start at their defaults. We dropped osd_recovery_max_active and osd_max_backfills together. If you're right, do you think osd_recovery_max_active=1 is the primary reason for the improvement? (A higher osd_max_backfills helps recovery time with erasure coding.)

Well, recovery max active and max backfills are similar in many ways. Both are about moving data into a new or outdated copy of the PG — the difference is that recovery refers to our log-based recovery (where we compare the PG logs and move over the objects which have changed), whereas backfill requires us to incrementally move through the entire PG's hash space and compare. I suspect dropping down max backfills is more important than reducing max recovery (gathering recovery metadata happens largely in memory), but I don't really know either way. My comment was meant to convey that I'd prefer we not reduce the recovery op priority levels. :)
-Greg
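For experimenting along the lines Paul describes without restarting OSDs, a hedged example using injectargs (option names from this thread; runtime-injection behavior may vary by release):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'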
Re: osd crash with object store set to newstore
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

Hi Sage and all,

I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests, the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and running.

ceph version b8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When this crash happens I noticed that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash seems to happen when two OSDs try to communicate and replicate a particular PG.

Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed in what function and at what line number. I haven't seen this crash so I'm not sure offhand what it is.

Thanks!
sage
Re: osd crash with object store set to newstore
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the backfill tool. I have not seen the osd crash yet on the cluster. I am monitoring the OSDs and will update you once I see the crash. With the new backfill run I have reduced the rps by half; not sure if this is the reason for not seeing the crash yet.

Regards
Srikanth

On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:

I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash with that branch with 'debug newstore = 20' and send us the log? (You can just do 'ceph-post-file filename'.)

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

Hi Sage,

The assertion failed at line 1639; here is the log message:

2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In function 'virtual int NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*, ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)

Just before the crash, here are the debug statements printed by the method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range --.7fb4.. to --.7fb4.0800. and --.804b.. to --.804b.0800. start -1/0//0/0

Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

Hi Sage and all,

I built the ceph code from wip-newstore on RHEL7 and am running performance tests to compare with filestore. After a few hours of running the tests, the osd daemons started to crash. Here is the stack trace; the osd crashes immediately after the restart, so I could not get the osd up and running.

ceph version b8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know the cause of this crash. When this crash happens I noticed that two osds on separate machines are down. I can bring one osd up, but restarting the other osd causes both OSDs to crash. My understanding is that the crash seems to happen when two OSDs try to communicate and replicate a particular PG.

Can you include the log lines that precede the dump above? In particular, there should be a line that tells you what assertion failed in what function and at what line number. I haven't seen this crash so I'm not sure offhand what it is.

Thanks!
sage