Re: init script bug with multiple clusters
Am 17.04.2015 um 03:01 schrieb Gregory Farnum:
> This looks good to me, but we need an explicit sign-off from you for
> it. If you can submit it as a PR on Github that's easiest for us, but
> if not can you send it in git email patch form? :)

Attached patch against next branch in git email form - hope this is as expected. Our devel system cannot send mail directly.

Amon Ott
--
Dr. Amon Ott
m-privacy GmbH          Tel: +49 30 24342334
Werner-Voß-Damm 62      Fax: +49 30 99296856
12101 Berlin            http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
Dipl.-Kfm. Holger Maczkowsky,
Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649

From 1e4d9f4fcd688fcbe275f2cff55b272dfeec2e45 Mon Sep 17 00:00:00 2001
From: Amon Ott
Date: Fri, 17 Apr 2015 08:42:58 +0200
Subject: [PATCH] init script bug with multiple clusters

The Ceph init script (src/init-ceph.in) creates pid files without
cluster names. This means that only one cluster can run at a time.
The solution is simple and works fine here: add "$cluster-" as usual.

Signed-off-by: Amon Ott
---
 src/init-ceph.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/init-ceph.in b/src/init-ceph.in
index 2ff98c7..d88ca58 100644
--- a/src/init-ceph.in
+++ b/src/init-ceph.in
@@ -227,7 +227,7 @@ for name in $what; do
     get_conf run_dir "/var/run/ceph" "run dir"
-    get_conf pid_file "$run_dir/$type.$id.pid" "pid file"
+    get_conf pid_file "$run_dir/$cluster-$type.$id.pid" "pid file"

     if [ "$command" = "start" ]; then
         if [ -n "$pid_file" ]; then
--
1.7.10.4
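To make the collision concrete, here is a small shell sketch of the pid file paths before and after the patch. The second cluster name "backup" and the osd.3 daemon are hypothetical values, not from the thread:

```shell
# Hypothetical illustration of the pid file naming the patch changes.
run_dir=/var/run/ceph
type=osd
id=3

# Before the patch, every cluster's osd.3 would use the same pid file,
# so a second cluster could not start its daemon:
echo "$run_dir/$type.$id.pid"           # -> /var/run/ceph/osd.3.pid

# After the patch, each cluster gets a distinct pid file:
for cluster in ceph backup; do
    echo "$run_dir/$cluster-$type.$id.pid"
done
# -> /var/run/ceph/ceph-osd.3.pid
# -> /var/run/ceph/backup-osd.3.pid
```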
Re: Regarding newstore performance
On Fri, Apr 17, 2015 at 8:38 AM, Sage Weil wrote: > On Thu, 16 Apr 2015, Mark Nelson wrote: >> On 04/16/2015 01:17 AM, Somnath Roy wrote: >> > Here is the data with omap separated to another SSD and after 1000GB of fio >> > writes (same profile).. >> > >> > omap writes: >> > - >> > >> > Total host writes in this period = 551020111 -- ~2101 GB >> > >> > Total flash writes in this period = 1150679336 >> > >> > data writes: >> > --- >> > >> > Total host writes in this period = 302550388 --- ~1154 GB >> > >> > Total flash writes in this period = 600238328 >> > >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those >> > getting ~3.2 WA overall. > > This all suggests that getting rocksdb to not rewrite the wal > entries at all will be the big win. I think Xiaoxi had tunable > suggestions for that? I didn't grok the rocksdb terms immediately so > they didn't make a lot of sense at the time.. this is probably a good > place to focus, though. The rocksdb compaction stats should help out > there. > > But... today I ignored this entirely and put rocksdb in tmpfs and focused > just on the actual wal IOs done to the fragments files after the fact. > For simplicity I focused just on 128k random writes into 4mb objects. > > fio can get ~18 mb/sec on my disk with iodepth=1. Interestingly, setting > iodepth=16 makes no different *until* I also set thinktime=10 (us, or > almost any value really) and thinktime_blocks=16, at which point it goes > up with the iodepth. I'm not quite sure what is going on there but it > seems to be preventing the elevator and/or disk from reordering writes and > make more efficient sweeps across the disk. In any case, though, with > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64. > Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec, > which is basically what I was getting from newstore. 
Here's my fio > config: > > http://fpaste.org/212110/42923089/ > > Conclusion: we need multiple threads (or libaio) to get lots of IOs in > flight so that the block layer and/or disk can reorder and be efficient. > I added a threadpool for doing wal work (newstore wal threads = 8 by > default) and it makes a big difference. Now I am getting more like > 19mb/sec w/ 4 threads and client (smalliobench) qd 16. It's not going up > much from there as I scale threads or qd, strangely; not sure why yet.

Do you mean this PR (https://github.com/ceph/ceph/pull/4318)? I posted a simple benchmark in a comment on that PR.

> > But... that's a big improvement over a few days ago (~8mb/sec). And on > this drive filestore with journal on ssd gets ~8.5mb/sec. So we're > winning, yay! > > I tabled the libaio patch for now since it was getting spurious EINVAL and > would consistently SIGBUG from io_getevents() when ceph-osd did dlopen() > on the rados plugins (weird!). > > Mark, at this point it is probably worth checking that you can reproduce > these results? If so, we can redo the io size sweep. I picked 8 wal > threads since that was enough to help and going higher didn't seem to make > much difference, but at some point we'll want to be more careful about > picking that number. We could also use libaio here, but I'm not sure it's > worth it. And this approach is somewhat orthogonal to the idea of > efficiently passing the kernel things to fdatasync.

Agreed; this time I think we should focus on the data store only. Maybe I'm missing it, but what's your overlay config value in this test?

> > Anyway, next up is probably wrangling rocksdb's log! > > sage

--
Best Regards,
Wheat
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: init script bug with multiple clusters
This looks good to me, but we need an explicit sign-off from you for it. If you can submit it as a PR on Github that's easiest for us, but if not can you send it in git email patch form? :)
-Greg

On Wed, Apr 8, 2015 at 2:58 AM, Amon Ott wrote:
> Hello Ceph!
>
> The Ceph init script (src/init-ceph.in) creates pid files without
> cluster names. This means that only one cluster can run at a time. The
> solution is simple and works fine here, patch against 0.94 is attached.
>
> Amon Ott
> --
> Dr. Amon Ott
> m-privacy GmbH Tel: +49 30 24342334
> Werner-Voß-Damm 62 Fax: +49 30 99296856
> 12101 Berlin http://www.m-privacy.de
>
> Amtsgericht Charlottenburg, HRB 84946
>
> Geschäftsführer:
> Dipl.-Kfm. Holger Maczkowsky,
> Roman Maczkowsky
>
> GnuPG-Key-ID: 0x2DD3A649
RE: Regarding newstore performance
Agree - Threadpool/Queue/Locking is generally bad for latency. Can we just make the newstore backend as synchronous as possible and get the parallelism from a higher #OSD_OP_THREAD? Hopefully we could then have better latency in the low-#QD case.

-Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Friday, April 17, 2015 8:48 AM To: Sage Weil Cc: Mark Nelson; Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel Subject: Re: Regarding newstore performance

On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil wrote: > On Thu, 16 Apr 2015, Mark Nelson wrote: >> On 04/16/2015 01:17 AM, Somnath Roy wrote: >> > Here is the data with omap separated to another SSD and after >> > 1000GB of fio writes (same profile).. >> > >> > omap writes: >> > - >> > >> > Total host writes in this period = 551020111 -- ~2101 GB >> > >> > Total flash writes in this period = 1150679336 >> > >> > data writes: >> > --- >> > >> > Total host writes in this period = 302550388 --- ~1154 GB >> > >> > Total flash writes in this period = 600238328 >> > >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and >> > adding those getting ~3.2 WA overall. > > This all suggests that getting rocksdb to not rewrite the wal entries > at all will be the big win. I think Xiaoxi had tunable suggestions > for that? I didn't grok the rocksdb terms immediately so they didn't > make a lot of sense at the time.. this is probably a good place to > focus, though. The rocksdb compaction stats should help out there. > > But... today I ignored this entirely and put rocksdb in tmpfs and > focused just on the actual wal IOs done to the fragments files after the fact. > For simplicity I focused just on 128k random writes into 4mb objects. > > fio can get ~18 mb/sec on my disk with iodepth=1. Interestingly, > setting > iodepth=16 makes no different *until* I also set thinktime=10 (us, or > almost any value really) and thinktime_blocks=16, at which point it > goes up with the iodepth. 
I'm not quite sure what is going on there > but it seems to be preventing the elevator and/or disk from reordering > writes and make more efficient sweeps across the disk. In any case, > though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec > with qd 64. > Similarly, with qa 1 and thinktime of 250us, it drops to like > 15mb/sec, which is basically what I was getting from newstore. Here's > my fio > config: > > http://fpaste.org/212110/42923089/ > > Conclusion: we need multiple threads (or libaio) to get lots of IOs in > flight so that the block layer and/or disk can reorder and be efficient. > I added a threadpool for doing wal work (newstore wal threads = 8 by > default) and it makes a big difference. Now I am getting more like > 19mb/sec w/ 4 threads and client (smalliobench) qd 16. It's not going > up much from there as I scale threads or qd, strangely; not sure why yet. > > But... that's a big improvement over a few days ago (~8mb/sec). And > on this drive filestore with journal on ssd gets ~8.5mb/sec. So we're > winning, yay! > > I tabled the libaio patch for now since it was getting spurious EINVAL > and would consistently SIGBUG from io_getevents() when ceph-osd did > dlopen() on the rados plugins (weird!). > > Mark, at this point it is probably worth checking that you can > reproduce these results? If so, we can redo the io size sweep. I > picked 8 wal threads since that was enough to help and going higher > didn't seem to make much difference, but at some point we'll want to > be more careful about picking that number. We could also use libaio > here, but I'm not sure it's worth it. And this approach is somewhat > orthogonal to the idea of efficiently passing the kernel things to fdatasync. Adding another thread switch to the IO path is going to make us very sad in the future, so I think this'd be a bad prototype version to have escape into the wild. 
I keep hearing Sam's talk about needing to get down to 1 thread switch if we're ever to hope for 100usec writes. So consider this one vote for making libaio work, and sooner rather than later. :) -Greg
Re: Regarding newstore performance
On Thu, 16 Apr 2015, Gregory Farnum wrote: > On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil wrote: > > On Thu, 16 Apr 2015, Mark Nelson wrote: > >> On 04/16/2015 01:17 AM, Somnath Roy wrote: > >> > Here is the data with omap separated to another SSD and after 1000GB of > >> > fio > >> > writes (same profile).. > >> > > >> > omap writes: > >> > - > >> > > >> > Total host writes in this period = 551020111 -- ~2101 GB > >> > > >> > Total flash writes in this period = 1150679336 > >> > > >> > data writes: > >> > --- > >> > > >> > Total host writes in this period = 302550388 --- ~1154 GB > >> > > >> > Total flash writes in this period = 600238328 > >> > > >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding > >> > those > >> > getting ~3.2 WA overall. > > > > This all suggests that getting rocksdb to not rewrite the wal > > entries at all will be the big win. I think Xiaoxi had tunable > > suggestions for that? I didn't grok the rocksdb terms immediately so > > they didn't make a lot of sense at the time.. this is probably a good > > place to focus, though. The rocksdb compaction stats should help out > > there. > > > > But... today I ignored this entirely and put rocksdb in tmpfs and focused > > just on the actual wal IOs done to the fragments files after the fact. > > For simplicity I focused just on 128k random writes into 4mb objects. > > > > fio can get ~18 mb/sec on my disk with iodepth=1. Interestingly, setting > > iodepth=16 makes no different *until* I also set thinktime=10 (us, or > > almost any value really) and thinktime_blocks=16, at which point it goes > > up with the iodepth. I'm not quite sure what is going on there but it > > seems to be preventing the elevator and/or disk from reordering writes and > > make more efficient sweeps across the disk. In any case, though, with > > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64. 
> > Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec, > > which is basically what I was getting from newstore. Here's my fio > > config: > > > > http://fpaste.org/212110/42923089/ > > > > Conclusion: we need multiple threads (or libaio) to get lots of IOs in > > flight so that the block layer and/or disk can reorder and be efficient. > > I added a threadpool for doing wal work (newstore wal threads = 8 by > > default) and it makes a big difference. Now I am getting more like > > 19mb/sec w/ 4 threads and client (smalliobench) qd 16. It's not going up > > much from there as I scale threads or qd, strangely; not sure why yet. > > > > But... that's a big improvement over a few days ago (~8mb/sec). And on > > this drive filestore with journal on ssd gets ~8.5mb/sec. So we're > > winning, yay! > > > > I tabled the libaio patch for now since it was getting spurious EINVAL and > > would consistently SIGBUG from io_getevents() when ceph-osd did dlopen() > > on the rados plugins (weird!). > > > > Mark, at this point it is probably worth checking that you can reproduce > > these results? If so, we can redo the io size sweep. I picked 8 wal > > threads since that was enough to help and going higher didn't seem to make > > much difference, but at some point we'll want to be more careful about > > picking that number. We could also use libaio here, but I'm not sure it's > > worth it. And this approach is somewhat orthogonal to the idea of > > efficiently passing the kernel things to fdatasync. > > Adding another thread switch to the IO path is going to make us very > sad in the future, so I think this'd be a bad prototype version to > have escape into the wild. I keep hearing Sam's talk about needing to > get down to 1 thread switch if we're ever to hope for 100usec writes. Yeah, for fast memory we'll want to take a totally different synchronous path through the code. 
Right now I'm targeting general purpose (spinning disk and current-generation SSDs) usage (and this is the async post-commit cleanup work). But yeah... I'll bite the bullet and do aio soon. I suspect I just screwed up the buffer alignment and that's where EINVAL was coming from before.

sage
Re: Regarding newstore performance
On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil wrote: > On Thu, 16 Apr 2015, Mark Nelson wrote: >> On 04/16/2015 01:17 AM, Somnath Roy wrote: >> > Here is the data with omap separated to another SSD and after 1000GB of fio >> > writes (same profile).. >> > >> > omap writes: >> > - >> > >> > Total host writes in this period = 551020111 -- ~2101 GB >> > >> > Total flash writes in this period = 1150679336 >> > >> > data writes: >> > --- >> > >> > Total host writes in this period = 302550388 --- ~1154 GB >> > >> > Total flash writes in this period = 600238328 >> > >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those >> > getting ~3.2 WA overall. > > This all suggests that getting rocksdb to not rewrite the wal > entries at all will be the big win. I think Xiaoxi had tunable > suggestions for that? I didn't grok the rocksdb terms immediately so > they didn't make a lot of sense at the time.. this is probably a good > place to focus, though. The rocksdb compaction stats should help out > there. > > But... today I ignored this entirely and put rocksdb in tmpfs and focused > just on the actual wal IOs done to the fragments files after the fact. > For simplicity I focused just on 128k random writes into 4mb objects. > > fio can get ~18 mb/sec on my disk with iodepth=1. Interestingly, setting > iodepth=16 makes no different *until* I also set thinktime=10 (us, or > almost any value really) and thinktime_blocks=16, at which point it goes > up with the iodepth. I'm not quite sure what is going on there but it > seems to be preventing the elevator and/or disk from reordering writes and > make more efficient sweeps across the disk. In any case, though, with > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64. > Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec, > which is basically what I was getting from newstore. 
Here's my fio > config: > > http://fpaste.org/212110/42923089/ > > Conclusion: we need multiple threads (or libaio) to get lots of IOs in > flight so that the block layer and/or disk can reorder and be efficient. > I added a threadpool for doing wal work (newstore wal threads = 8 by > default) and it makes a big difference. Now I am getting more like > 19mb/sec w/ 4 threads and client (smalliobench) qd 16. It's not going up > much from there as I scale threads or qd, strangely; not sure why yet. > > But... that's a big improvement over a few days ago (~8mb/sec). And on > this drive filestore with journal on ssd gets ~8.5mb/sec. So we're > winning, yay! > > I tabled the libaio patch for now since it was getting spurious EINVAL and > would consistently SIGBUG from io_getevents() when ceph-osd did dlopen() > on the rados plugins (weird!). > > Mark, at this point it is probably worth checking that you can reproduce > these results? If so, we can redo the io size sweep. I picked 8 wal > threads since that was enough to help and going higher didn't seem to make > much difference, but at some point we'll want to be more careful about > picking that number. We could also use libaio here, but I'm not sure it's > worth it. And this approach is somewhat orthogonal to the idea of > efficiently passing the kernel things to fdatasync. Adding another thread switch to the IO path is going to make us very sad in the future, so I think this'd be a bad prototype version to have escape into the wild. I keep hearing Sam's talk about needing to get down to 1 thread switch if we're ever to hope for 100usec writes. So consider this one vote for making libaio work, and sooner rather than later. :) -Greg
Re: Regarding newstore performance
On Thu, 16 Apr 2015, Mark Nelson wrote:
> On 04/16/2015 01:17 AM, Somnath Roy wrote:
> > Here is the data with omap separated to another SSD and after 1000GB of fio
> > writes (same profile)..
> >
> > omap writes:
> > -
> >
> > Total host writes in this period = 551020111 -- ~2101 GB
> >
> > Total flash writes in this period = 1150679336
> >
> > data writes:
> > ---
> >
> > Total host writes in this period = 302550388 --- ~1154 GB
> >
> > Total flash writes in this period = 600238328
> >
> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
> > getting ~3.2 WA overall.

This all suggests that getting rocksdb to not rewrite the wal entries at all will be the big win. I think Xiaoxi had tunable suggestions for that? I didn't grok the rocksdb terms immediately so they didn't make a lot of sense at the time.. this is probably a good place to focus, though. The rocksdb compaction stats should help out there.

But... today I ignored this entirely and put rocksdb in tmpfs and focused just on the actual wal IOs done to the fragments files after the fact. For simplicity I focused just on 128k random writes into 4mb objects.

fio can get ~18 mb/sec on my disk with iodepth=1. Interestingly, setting iodepth=16 makes no difference *until* I also set thinktime=10 (us, or almost any value really) and thinktime_blocks=16, at which point it goes up with the iodepth. I'm not quite sure what is going on there but it seems to be preventing the elevator and/or disk from reordering writes and making more efficient sweeps across the disk. In any case, though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64. Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec, which is basically what I was getting from newstore. Here's my fio config:

http://fpaste.org/212110/42923089/

Conclusion: we need multiple threads (or libaio) to get lots of IOs in flight so that the block layer and/or disk can reorder and be efficient. 
I added a threadpool for doing wal work (newstore wal threads = 8 by default) and it makes a big difference. Now I am getting more like 19mb/sec w/ 4 threads and client (smalliobench) qd 16. It's not going up much from there as I scale threads or qd, strangely; not sure why yet.

But... that's a big improvement over a few days ago (~8mb/sec). And on this drive filestore with journal on ssd gets ~8.5mb/sec. So we're winning, yay!

I tabled the libaio patch for now since it was getting spurious EINVAL and would consistently SIGBUS from io_getevents() when ceph-osd did dlopen() on the rados plugins (weird!).

Mark, at this point it is probably worth checking that you can reproduce these results? If so, we can redo the io size sweep. I picked 8 wal threads since that was enough to help and going higher didn't seem to make much difference, but at some point we'll want to be more careful about picking that number. We could also use libaio here, but I'm not sure it's worth it. And this approach is somewhat orthogonal to the idea of efficiently passing the kernel things to fdatasync.

Anyway, next up is probably wrangling rocksdb's log!

sage
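The fpaste link in this mail has since expired. As a rough sketch only, a fio job file matching the workload described here (128k random writes into 4mb objects, with the thinktime/thinktime_blocks tweak) might look like the following; every value is an assumption reconstructed from the prose of the mail, not the original paste:

```ini
; Hypothetical reconstruction -- NOT the original fpaste contents.
; 128k random writes into a set of 4mb files, mimicking newstore fragments.
[global]
ioengine=sync
direct=1
rw=randwrite
bs=128k
filesize=4m
nrfiles=1024
; The tweak described above: a tiny thinktime every 16 blocks reportedly
; lets the elevator/disk reorder writes so throughput scales with iodepth.
thinktime=10
thinktime_blocks=16

[wal-replay]
directory=/mnt/newstore-test   ; assumed scratch directory
iodepth=16
size=4g
```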
Re: Ceph master - build broken unless --enable-debug specified
On 17/04/15 12:27, Gregory Farnum wrote:
On Sat, Apr 11, 2015 at 8:42 PM, Mark Kirkwood wrote:
Hi,

Building without --enable-debug produces:

ceph_fuse.cc: In member function ‘virtual void* main(int, const char**, const char**)::RemountTest::entry()’:
ceph_fuse.cc:146:15: warning: ignoring return value of ‘int system(const char*)’, declared with attribute warn_unused_result [-Wunused-result]
  system(buf);
  ^
  CXX ceph_osd.o
  CXX ceph_mds.o
make[3]: *** No rule to make target '../src/gmock/lib/libgmock_main.la', needed by 'unittest_librbd'. Stop.
make[3]: *** Waiting for unfinished jobs
  CXX test/erasure-code/ceph_erasure_code_non_regression.o
make[3]: Leaving directory '/home/markir/develop/c/ceph/src'
Makefile:20716: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/home/markir/develop/c/ceph/src'
Makefile:8977: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/home/markir/develop/c/ceph/src'
Makefile:467: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1

Adding in --enable-debug gives a successful build.

This is on Ubuntu 14.10 64 bit, and the build procedure is:

$ git pull
$ git submodule update --init
$ ./autogen.sh
$ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var \
  [--with-debug \ ]
  --with-nss \
  --with-radosgw \
  --with-librocksdb-static=check \

$ make [ -j4 ]

Yep, looks like the unittest_librbd binary is in the noinst_PROGRAMS target (whatever that is) rather than the check_PROGRAMS target. Changing that seems to work — I pushed a branch wip-nodebug-build fixing it, but if you have your own fix a PR is welcome. If not I'll make a PR in the next couple days. 
I had not looked very closely at what the exact problem was - your analysis looks good to me, I'll leave you to file a PR :-)

Cheers

Mark
Re: Ceph master - build broken unless --enable-debug specified
On Sat, Apr 11, 2015 at 8:42 PM, Mark Kirkwood wrote:
> Hi,
>
> Building without --enable-debug produces:
>
> ceph_fuse.cc: In member function ‘virtual void* main(int, const char**,
> const char**)::RemountTest::entry()’:
> ceph_fuse.cc:146:15: warning: ignoring return value of ‘int system(const
> char*)’, declared with attribute warn_unused_result [-Wunused-result]
> system(buf);
> ^
> CXX ceph_osd.o
> CXX ceph_mds.o
> make[3]: *** No rule to make target '../src/gmock/lib/libgmock_main.la',
> needed by 'unittest_librbd'. Stop.
> make[3]: *** Waiting for unfinished jobs
> CXX test/erasure-code/ceph_erasure_code_non_regression.o
> make[3]: Leaving directory '/home/markir/develop/c/ceph/src'
> Makefile:20716: recipe for target 'all-recursive' failed
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory '/home/markir/develop/c/ceph/src'
> Makefile:8977: recipe for target 'all' failed
> make[1]: *** [all] Error 2
> make[1]: Leaving directory '/home/markir/develop/c/ceph/src'
> Makefile:467: recipe for target 'all-recursive' failed
> make: *** [all-recursive] Error 1
>
>
> Adding in --enable-debug gives a successful build.
>
> This is on Ubuntu 14.10 64 bit, and the build procedure is:
>
> $ git pull
> $ git submodule update --init
> $ ./autogen.sh
> $ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var \
> [--with-debug \ ]
> --with-nss \
> --with-radosgw \
> --with-librocksdb-static=check \
>
> $ make [ -j4 ]

Yep, looks like the unittest_librbd binary is in the noinst_PROGRAMS target (whatever that is) rather than the check_PROGRAMS target. Changing that seems to work — I pushed a branch wip-nodebug-build fixing it, but if you have your own fix a PR is welcome. If not I'll make a PR in the next couple days.
-Greg
Re: CephFS and the next giant release v0.87.2
On Thu, Apr 16, 2015 at 4:16 PM, Loic Dachary wrote:
> Hi Greg,
>
> On 17/04/2015 00:44, Gregory Farnum wrote:
>> On Wed, Apr 15, 2015 at 2:37 AM, Loic Dachary wrote:
>>> Hi Greg,
>>>
>>> The next giant release as found at https://github.com/ceph/ceph/tree/giant
>>> passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think
>>> it is ready for QE to start their own round of testing ?
>>>
>>> Note that it will be the last giant release.
>>>
>>> Cheers
>>>
>>> P.S. http://tracker.ceph.com/issues/11153#Release-information has direct
>>> links to the pull requests merged into giant since v0.87.1 in case you need
>>> more context about one of them.
>>
>> All those PRs look like fine ones to release.
>>
>> I remember you went through a big purge of giant-tagged backports at
>> one point though (when we thought we weren't going to do any more
>> releases at all). Did that get somehow undone and all of those dealt
>> with?
>
> I inadvertendly closed a few issue that I should not have (less than 10 more
> than 5 IIRC). I carefully reviewed all of them and they have been dealt with
> indeed. There now is a HOWTO to prevent that kind of mistake, hopefully.
> http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_resolve_issues_that_are_Pending_Backport
>
>> Some of them were more important than others and it would be
>> some work for me to reconstruct the list, but I'll need to do that if
>> you haven't already. :/
>
> There are a four giant issues that are candidate for backporting :
> http://tracker.ceph.com/projects/ceph/issues?query_id=68, only one of which
> is related to CephFS. Do you think it must be included in this last giant
> release ?

Nope, not that one! This all looks good to me then, thanks.
-Greg
Re: CephFS and the next giant release v0.87.2
Hi Greg,

On 17/04/2015 00:44, Gregory Farnum wrote:
> On Wed, Apr 15, 2015 at 2:37 AM, Loic Dachary wrote:
>> Hi Greg,
>>
>> The next giant release as found at https://github.com/ceph/ceph/tree/giant
>> passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think
>> it is ready for QE to start their own round of testing ?
>>
>> Note that it will be the last giant release.
>>
>> Cheers
>>
>> P.S. http://tracker.ceph.com/issues/11153#Release-information has direct
>> links to the pull requests merged into giant since v0.87.1 in case you need
>> more context about one of them.
>
> All those PRs look like fine ones to release.
>
> I remember you went through a big purge of giant-tagged backports at
> one point though (when we thought we weren't going to do any more
> releases at all). Did that get somehow undone and all of those dealt
> with?

I inadvertently closed a few issues that I should not have (fewer than 10, more than 5 IIRC). I carefully reviewed all of them and they have been dealt with indeed. There now is a HOWTO to prevent that kind of mistake, hopefully.
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_resolve_issues_that_are_Pending_Backport

> Some of them were more important than others and it would be
> some work for me to reconstruct the list, but I'll need to do that if
> you haven't already. :/

There are four giant issues that are candidates for backporting:
http://tracker.ceph.com/projects/ceph/issues?query_id=68, only one of which is related to CephFS. Do you think it must be included in this last giant release ?

Cheers

> -Greg

--
Loïc Dachary, Artisan Logiciel Libre
Re: partial acks when send reply to client to reduce write latency
On Thu, Apr 9, 2015 at 11:38 PM, 池信泽 wrote:
> hi, all:
>
> Now, ceph should received all ack message from remote and then
> reply ack to client, What
>
> about directly reply to client if primary has been received some of
> them. Below is the request
>
> trace among osd. Primary wait for second sub_op_commit_rec msg for a long
> time.
>
> Does it make sense?

It makes sense on one level, but unfortunately it's just not feasible. It would change how peering needs to work — right now, we need to contact at least one OSD that is active in any interval. If we allowed commits to happen without having hit disk on every OSD, we need to talk to all the OSDs in every interval (or at least, {num_OSDs} - {number_we_require_ack} + 1 of them), which would be pretty bad for our failure resiliency.

This comes up every so often as a suggestion and is a lot more feasible with erasure coding — Yahoo has already implemented the read-side version of this (http://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at), but doing it on the write side would still take a lot of work.
-Greg
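Greg's interval arithmetic can be made concrete with a trivial sketch. The variable names and the 3-replica example are illustrative only, not from any Ceph source:

```shell
# Illustrative arithmetic only. With num_osds replicas and an ack sent to
# the client after required_acks commits, peering must contact at least
#   num_osds - required_acks + 1
# OSDs from every past interval to be certain of seeing every acked write.
num_osds=3

required_acks=3   # current behaviour: ack only when all replicas commit
echo $(( num_osds - required_acks + 1 ))   # -> 1 (any single OSD suffices)

required_acks=2   # proposed: ack after 2 of 3
echo $(( num_osds - required_acks + 1 ))   # -> 2 OSDs needed per interval
```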
Re: CephFS and the next giant release v0.87.2
On Wed, Apr 15, 2015 at 2:37 AM, Loic Dachary wrote:
> Hi Greg,
>
> The next giant release as found at https://github.com/ceph/ceph/tree/giant
> passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think
> it is ready for QE to start their own round of testing ?
>
> Note that it will be the last giant release.
>
> Cheers
>
> P.S. http://tracker.ceph.com/issues/11153#Release-information has direct
> links to the pull requests merged into giant since v0.87.1 in case you need
> more context about one of them.

All those PRs look like fine ones to release.

I remember you went through a big purge of giant-tagged backports at one point though (when we thought we weren't going to do any more releases at all). Did that get somehow undone and all of those dealt with? Some of them were more important than others and it would be some work for me to reconstruct the list, but I'll need to do that if you haven't already. :/
-Greg
Re: client/cluster compatibility testing
Yea, Sage, that sounds reasonable. I added a ticket to capture this plan (http://tracker.ceph.com/issues/11413) and will add those tests soon. Please add your comments to the ticket above.

I am assuming that it will look something like this for dumpling, firefly and hammer:

dumpling(stable) -> client-x
firefly(stable) -> client-x
hammer(stable) -> client-x

and the reverse:

dumpling-client(stable) -> cluster-x
firefly-client(stable) -> cluster-x
hammer-client(stable) -> cluster-x

Yes?

Thx
YuriW

----- Original Message -----
From: "Sage Weil"
To: ceph-devel@vger.kernel.org
Sent: Thursday, April 16, 2015 9:42:29 AM
Subject: client/cluster compatibility testing

Now that there are several different vendors shipping and supporting Ceph in their products, we'll invariably have people running different versions of Ceph who are interested in interoperability. If we focus just on client <-> cluster compatibility, I think the issues are (1) compatibility between upstream ceph versions (firefly vs hammer) and (2) ensuring that any downstream changes a vendor makes don't break that compatibility.

I think the simplest way to address this is to talk about compatibility in terms of the upstream stable releases (firefly, hammer, etc.), and test that compatibility with teuthology tests from ceph-qa-suite.git.

We have some basic inter-version client/cluster tests already in suites/upgrade/client-upgrade. Currently these test new (version "x") clients against a given release (dumpling, firefly). I think we just need to add hammer to that mix, and then add a second set of tests that do the reverse: test clients from a given release (dumpling, firefly, hammer) against an arbitrary cluster version ("x").

We'll obviously run these tests on upstream releases to ensure that we are not breaking compatibility (or are doing so in known, explicit ways). Downstream folks can run the same test suites against any changes they make to ensure that their product is "compatible with firefly clients," or whatever.

Does that sound reasonable?
sage
Re: client/cluster compatibility testing
On 04/16/2015 09:42 AM, Sage Weil wrote:
> I think the simplest way to address this is to talk about compatibility
> in terms of the upstream stable releases (firefly, hammer, etc.), and
> test that compatibility with teuthology tests from ceph-qa-suite.git.
>
> We have some basic inter-version client/cluster tests already in
> suites/upgrade/client-upgrade. Currently these test new (version "x")
> clients against a given release (dumpling, firefly). I think we just
> need to add hammer to that mix, and then add a second set of tests that
> do the reverse: test clients from a given release (dumpling, firefly,
> hammer) against an arbitrary cluster version ("x").

The suites in suites/upgrade/$version-x do this, and use a mixed-version cluster rather than a purely version-x cluster. It seems like people would want that intra-cluster version coverage for smooth upgrades. We just need to add hammer-x there too (Yuri's renaming the client ones to $version-client-x for less confusion).

Also, I think we'll want to start doing mixed-client-version tests, particularly for things like rbd's exclusive locking: http://tracker.ceph.com/issues/11405

Josh
Re: Ceph.com
We've fixed it so that 404 handling isn't done by wordpress/php and things are muuuch happier. We've also moved all of the git stuff to git.ceph.com. There is a redirect from http://ceph.com/git to git.ceph.com (tho no https on the new site yet) and a proxy for git://ceph.com.

Please let us know if anything still appears to be broken or slow!

Thanks-
sage
Re: Regarding newstore performance
On 04/16/2015 01:17 AM, Somnath Roy wrote:
> Here is the data with omap separated to another SSD and after 1000GB of
> fio writes (same profile)..
>
> omap writes:
> -----------
>
> Total host writes in this period = 551020111 -- ~2101 GB
> Total flash writes in this period = 1150679336
>
> data writes:
> -----------
>
> Total host writes in this period = 302550388 -- ~1154 GB
> Total flash writes in this period = 600238328
>
> So, actual data write WA is ~1.1 but omap overhead is ~2.1, and adding
> those we get ~3.2 WA overall.

Looks like we can get quite a bit of data out of the rocksdb log as well. Here's a stats dump after a full benchmark run from an SSD-backed OSD with newstore, fdatasync, and Xiaoxi's tunables to increase buffer sizes:

http://www.fpaste.org/212007/raw/

It appears that in this test at least, a lot of data gets moved to L3 and L4 with associated WA. Notice the crazy amount of reads as well!

Mark
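As a sanity check, the write-amplification figures quoted above follow from simple division (a sketch; the constants are the numbers from the post, WA is measured against the 1000 GB of application writes issued by fio, and "overall" is taken as the sum of the two components, matching how the post adds them):

```python
# Reproducing the WA arithmetic: host writes reported per SSD divided by
# the application writes fio actually issued.
fio_gb = 1000.0        # application writes from the fio run
data_host_gb = 1154.0  # host writes on the data SSD (~1154 GB quoted)
omap_host_gb = 2101.0  # host writes on the omap SSD (~2101 GB quoted)

wa_data = data_host_gb / fio_gb  # ~1.15, quoted as ~1.1
wa_omap = omap_host_gb / fio_gb  # ~2.10, quoted as ~2.1
wa_total = wa_data + wa_omap     # ~3.26, quoted as ~3.2

print(f"data WA {wa_data:.2f}, omap WA {wa_omap:.2f}, overall {wa_total:.2f}")
```

The additional flash-write counters would give the SSD-internal WA on top of this (e.g. 1150679336 / 551020111 ≈ 2.1x internal amplification on the omap device), but the post's ~3.2 figure is host-level only.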
Re: make check bot paused
Hi,

It is back :-)

Cheers

On 15/04/2015 13:55, Loic Dachary wrote:
> Hi,
>
> The make check bot [1] that executes run-make-check.sh [2] on pull requests
> and reports results as comments [3] is experiencing problems. It may be a
> hardware issue, and the bot is paused while the issue is investigated [4] to
> avoid sending confusing false negatives. In the meantime the
> run-make-check.sh [2] script can be run locally, before sending the pull
> request, to confirm the commits to be sent do not break anything. It is
> expected to run in less than 15 minutes, including compilation, on a fast
> machine with an SSD (or RAM disk), 8 cores and 32GB of RAM, and may take up
> to two hours on a machine with a spinner and two cores.
>
> Thanks for your patience.
>
> Cheers
>
> [1] bot running on pull requests http://jenkins.ceph.dachary.org/job/ceph/
> [2] run-make-check.sh http://workbench.dachary.org/ceph/ceph/blob/master/run-make-check.sh
> [3] make check results example: https://github.com/ceph/ceph/pull/3946#issuecomment-93286840
> [4] possible RAM failure http://tracker.ceph.com/issues/11399

--
Loïc Dachary, Artisan Logiciel Libre
client/cluster compatibility testing
Now that there are several different vendors shipping and supporting Ceph in their products, we'll invariably have people running different versions of Ceph who are interested in interoperability. If we focus just on client <-> cluster compatibility, I think the issues are (1) compatibility between upstream ceph versions (firefly vs hammer) and (2) ensuring that any downstream changes a vendor makes don't break that compatibility.

I think the simplest way to address this is to talk about compatibility in terms of the upstream stable releases (firefly, hammer, etc.), and test that compatibility with teuthology tests from ceph-qa-suite.git.

We have some basic inter-version client/cluster tests already in suites/upgrade/client-upgrade. Currently these test new (version "x") clients against a given release (dumpling, firefly). I think we just need to add hammer to that mix, and then add a second set of tests that do the reverse: test clients from a given release (dumpling, firefly, hammer) against an arbitrary cluster version ("x").

We'll obviously run these tests on upstream releases to ensure that we are not breaking compatibility (or are doing so in known, explicit ways). Downstream folks can run the same test suites against any changes they make to ensure that their product is "compatible with firefly clients," or whatever.

Does that sound reasonable?
sage
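The coverage Sage describes is a small cross-product of the stable releases against an arbitrary version "x" on the other side. It can be enumerated like this (a sketch; the generated names are illustrative labels, not actual ceph-qa-suite directory paths):

```python
# Enumerate the client/cluster compatibility matrix from the proposal:
# new ("x") clients against each stable cluster release, plus the reverse
# direction of stable clients against an "x" cluster.
releases = ["dumpling", "firefly", "hammer"]

client_x_vs_stable = [(f"{r}-cluster(stable)", "client-x") for r in releases]
stable_vs_cluster_x = [(f"{r}-client(stable)", "cluster-x") for r in releases]

for stable_side, x_side in client_x_vs_stable + stable_vs_cluster_x:
    print(f"{stable_side} <-> {x_side}")
```

With three stable releases this is six test combinations; mixed-version clusters (as in suites/upgrade/$version-x) and mixed-client-version tests would multiply this further.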
Re: leaking mons on a latest dumpling
On Thu, 16 Apr 2015, Joao Eduardo Luis wrote:
> On 04/15/2015 05:38 PM, Andrey Korolyov wrote:
> > Hello,
> >
> > there is a slow leak which is present in all ceph versions, I assume,
> > but it is only positively exposed over large time spans and on large
> > clusters. It looks like the lower a monitor is placed in the quorum
> > hierarchy, the higher the leak is:
> >
> > {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}
> >
> > ceph heap stats -m 10.0.1.95:6789 | grep Actual
> > MALLOC: =427626648 ( 407.8 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.94:6789 | grep Actual
> > MALLOC: =289550488 ( 276.1 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.93:6789 | grep Actual
> > MALLOC: =230592664 ( 219.9 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.92:6789 | grep Actual
> > MALLOC: =253710488 ( 242.0 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.91:6789 | grep Actual
> > MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)
> >
> > for almost the same uptime, the data difference is:
> > rd KB 55365750505
> > wr KB 82719722467
> >
> > The leak itself is not very critical, but it of course requires some
> > script work to restart monitors at least once per month on a 300TB
> > cluster to prevent >1GB memory consumption by monitor processes. Given
> > the current status of dumpling, it would probably be possible to
> > identify the leak source and then forward-port the fix to the newer
> > releases, as the freshest version I am running at large scale is the
> > top of the dumpling branch; otherwise it would require an enormous
> > amount of time to check fix proposals.
>
> There have been numerous reports of a slow leak in the monitors on
> dumpling and firefly. I'm sure there's a ticket for that, but I wasn't
> able to find it.
>
> Many hours were spent chasing down this leak to no avail, despite
> plugging several leaks throughout the code (especially in firefly; those
> should have been backported to dumpling at some point or other).
>
> This was mostly hard to figure out because it tends to require a
> long-term cluster to show up, and the bigger the cluster is, the larger
> the probability of triggering it. This behavior has me believing that
> it is somewhere in the message-dispatching workflow and, given that it's
> the leader that suffers the most, probably somewhere in the read-write
> message dispatching (PaxosService::prepare_update()). But despite code
> inspections, I don't think we ever found the cause -- or that any fixed
> leak was ever flagged as the root of the problem.
>
> Anyway, since Giant, most complaints (if not all!) went away. Maybe I
> missed them, or maybe people suffering from this just stopped
> complaining. I'm hoping it's the former rather than the latter and, as
> luck would have it, maybe the fix was a fortunate side-effect of some
> other change.

Perhaps we should try to run one of the sepia lab cluster mons through valgrind massif. The slowdown shouldn't impact anything important, and it's a real cluster with real load (running hammer).

sage
Ceph.com
Hey cephers,

As most of you have no doubt noticed, ceph.com has been having some...er..."issues" lately. Unfortunately this is some of the holdover infrastructure from being a startup without a big-boy ops plan.

The current setup has ceph.com sharing a host with some of the nightly build stuff, to make it easier for gitbuilder tasks (which also build the website doc) to coexist. Was this smart? No, probably not. Was it the quick-and-dirty way for us to get stuff rolling when we were tiny? Yep.

So, now that things are continuing to grow (website traffic load, ceph-deploy key requests, number of simultaneous builds), we are hitting the end of what one hard-working box can handle. I am in the process of moving ceph.com to a new host so that build explosions won't slag things like the Ceph Day pages and the blog, but the doc may lag behind a bit. Hopefully, since I'm starting with the website, it won't hose up too many of the other tasks, but bear with us while we split routing for a bit.

If you have any questions please feel free to poke me. Thanks.

--
Best Regards,
Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com || http://community.redhat.com
@scuttlemonkey || @ceph
Re: [Issue]Ceph cluster hang due to network partition
Hi Sage,

Thanks for your reply. We finally fixed the network and ceph went back to HEALTH_OK. We will improve our ops to prevent network partitions and avoid this problem.

Ketor

On Tue, Apr 14, 2015 at 11:18 PM, Sage Weil wrote:
> On Tue, 14 Apr 2015, Ketor D wrote:
>> Hi Sage,
>> We recently met a network partition problem that left our ceph
>> cluster unable to serve rbd traffic.
>> We are running 0.67.5 on our customer's cluster. The network
>> partition was such that 3 OSDs could connect to the mon but could not
>> connect to any other OSDs.
>> Many PGs then fell into the peering state, and rbd I/O hung.
>>
>> Before we operated on the cluster, I set the noout flag and stopped
>> the 3 OSDs. After working on the 3 OSDs' memory and booting the OS
>> back up, the network was partitioned. The 3 OSDs started, and many
>> PGs went to peering. I stopped the 3 OSD processes, but the PGs
>> stayed in peering.
>
> One possibility is that those PGs were all on the partitioned side; in
> that case you would have seen stale+peering+... states.
> Another possibility is that there was not sufficient PG [meta]data on
> the other side of the partition and the PGs got stuck in down+... or
> incomplete+... states.
>
> Or, there was another partition somewhere, or confusion such that there
> were OSDs that were unreachable but still in the 'up' state.
>
> sage
>
>> After the network partition was fixed, all PGs became active+clean,
>> and all was OK.
>>
>> I can't explain this, because I think the OSDs can judge whether
>> another OSD is alive, and I can see the 3 OSDs marked down in 'ceph
>> osd tree'. Why did these PGs stay in peering?
>>
>> Thanks!
>> Ketor
RE: tcmalloc issue
Thanks, James! We will try this out.

Regards
Somnath

-----Original Message-----
From: James Page [mailto:james.p...@ubuntu.com]
Sent: Thursday, April 16, 2015 4:48 AM
To: Chaitanya Huilgol; Somnath Roy; Sage Weil; ceph-maintain...@ceph.com
Cc: ceph-devel@vger.kernel.org
Subject: Re: tcmalloc issue

Hi Folks

The proposed fix is now in the 'proposed' pocket for 14.04 - see:

https://bugs.launchpad.net/ubuntu/+source/google-perftools/+bug/1439277

on how to enable that pocket for testing if you'd like to confirm that you are able to use the required feature now.

Cheers

James

--
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org
Re: tcmalloc issue
Hi Folks

The proposed fix is now in the 'proposed' pocket for 14.04 - see:

https://bugs.launchpad.net/ubuntu/+source/google-perftools/+bug/1439277

on how to enable that pocket for testing if you'd like to confirm that you are able to use the required feature now.

Cheers

James

--
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org
Re: leaking mons on a latest dumpling
On Thu, Apr 16, 2015 at 11:30 AM, Joao Eduardo Luis wrote:
> On 04/15/2015 05:38 PM, Andrey Korolyov wrote:
>> Hello,
>>
>> there is a slow leak which is present in all ceph versions, I assume,
>> but it is only positively exposed over large time spans and on large
>> clusters. It looks like the lower a monitor is placed in the quorum
>> hierarchy, the higher the leak is:
>>
>> {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}
>>
>> ceph heap stats -m 10.0.1.95:6789 | grep Actual
>> MALLOC: =427626648 ( 407.8 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.94:6789 | grep Actual
>> MALLOC: =289550488 ( 276.1 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.93:6789 | grep Actual
>> MALLOC: =230592664 ( 219.9 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.92:6789 | grep Actual
>> MALLOC: =253710488 ( 242.0 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.91:6789 | grep Actual
>> MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)
>>
>> for almost the same uptime, the data difference is:
>> rd KB 55365750505
>> wr KB 82719722467
>>
>> The leak itself is not very critical, but it of course requires some
>> script work to restart monitors at least once per month on a 300TB
>> cluster to prevent >1GB memory consumption by monitor processes. Given
>> the current status of dumpling, it would probably be possible to
>> identify the leak source and then forward-port the fix to the newer
>> releases, as the freshest version I am running at large scale is the
>> top of the dumpling branch; otherwise it would require an enormous
>> amount of time to check fix proposals.
>
> There have been numerous reports of a slow leak in the monitors on
> dumpling and firefly. I'm sure there's a ticket for that, but I wasn't
> able to find it.
>
> Many hours were spent chasing down this leak to no avail, despite
> plugging several leaks throughout the code (especially in firefly; those
> should have been backported to dumpling at some point or other).
>
> This was mostly hard to figure out because it tends to require a
> long-term cluster to show up, and the bigger the cluster is, the larger
> the probability of triggering it. This behavior has me believing that
> it is somewhere in the message-dispatching workflow and, given that it's
> the leader that suffers the most, probably somewhere in the read-write
> message dispatching (PaxosService::prepare_update()). But despite code
> inspections, I don't think we ever found the cause -- or that any fixed
> leak was ever flagged as the root of the problem.
>
> Anyway, since Giant, most complaints (if not all!) went away. Maybe I
> missed them, or maybe people suffering from this just stopped
> complaining. I'm hoping it's the former rather than the latter and, as
> luck would have it, maybe the fix was a fortunate side-effect of some
> other change.
>
> -Joao

Thanks for the explanation; I accidentally reversed the logical order describing leadership placement above. I'll go through the non-ported commits for firefly and port the most promising ones when I have spare time, checking whether the leak disappears (it takes about a week to see the difference for my workloads). Would dumped structures be helpful for developers to ring a bell for deterministic suggestions?
Re: leaking mons on a latest dumpling
On 04/15/2015 05:38 PM, Andrey Korolyov wrote:
> Hello,
>
> there is a slow leak which is present in all ceph versions, I assume,
> but it is only positively exposed over large time spans and on large
> clusters. It looks like the lower a monitor is placed in the quorum
> hierarchy, the higher the leak is:
>
> {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}
>
> ceph heap stats -m 10.0.1.95:6789 | grep Actual
> MALLOC: =427626648 ( 407.8 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.94:6789 | grep Actual
> MALLOC: =289550488 ( 276.1 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.93:6789 | grep Actual
> MALLOC: =230592664 ( 219.9 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.92:6789 | grep Actual
> MALLOC: =253710488 ( 242.0 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.91:6789 | grep Actual
> MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap)
>
> for almost the same uptime, the data difference is:
> rd KB 55365750505
> wr KB 82719722467
>
> The leak itself is not very critical, but it of course requires some
> script work to restart monitors at least once per month on a 300TB
> cluster to prevent >1GB memory consumption by monitor processes. Given
> the current status of dumpling, it would probably be possible to
> identify the leak source and then forward-port the fix to the newer
> releases, as the freshest version I am running at large scale is the
> top of the dumpling branch; otherwise it would require an enormous
> amount of time to check fix proposals.

There have been numerous reports of a slow leak in the monitors on dumpling and firefly. I'm sure there's a ticket for that, but I wasn't able to find it.

Many hours were spent chasing down this leak to no avail, despite plugging several leaks throughout the code (especially in firefly; those should have been backported to dumpling at some point or other).

This was mostly hard to figure out because it tends to require a long-term cluster to show up, and the bigger the cluster is, the larger the probability of triggering it. This behavior has me believing that it is somewhere in the message-dispatching workflow and, given that it's the leader that suffers the most, probably somewhere in the read-write message dispatching (PaxosService::prepare_update()). But despite code inspections, I don't think we ever found the cause -- or that any fixed leak was ever flagged as the root of the problem.

Anyway, since Giant, most complaints (if not all!) went away. Maybe I missed them, or maybe people suffering from this just stopped complaining. I'm hoping it's the former rather than the latter and, as luck would have it, maybe the fix was a fortunate side-effect of some other change.

-Joao
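The "script work" Andrey mentions (restarting monitors before the leak grows past ~1GB) could be sketched roughly as follows. This is a hypothetical helper, not an existing tool: it assumes the `MALLOC: = N ( X MiB) Actual memory used` line format shown in the heap stats above, and the actual restart step would depend on your init system.

```python
import re
import subprocess

LEAK_LIMIT_BYTES = 1 << 30  # restart monitors whose heap exceeds ~1 GiB


def parse_heap_bytes(stats_output: str) -> int:
    """Pull the 'Actual memory used' byte count out of tcmalloc heap stats."""
    m = re.search(r"MALLOC:\s*=?\s*(\d+)\s*\(.*Actual memory used",
                  stats_output)
    if not m:
        raise ValueError("unexpected heap stats output")
    return int(m.group(1))


def mon_heap_bytes(mon_addr: str) -> int:
    """Run `ceph heap stats -m <addr>` and parse its output."""
    out = subprocess.run(["ceph", "heap", "stats", "-m", mon_addr],
                         capture_output=True, text=True, check=True).stdout
    return parse_heap_bytes(out)


def mons_to_restart(addrs, heap_bytes=mon_heap_bytes):
    """Return the monitor addresses whose heap has crossed the threshold."""
    return [a for a in addrs if heap_bytes(a) > LEAK_LIMIT_BYTES]
```

A cron job could feed this the monmap addresses and restart (or `ceph heap release` first, to rule out tcmalloc caching) only the monitors flagged by `mons_to_restart`.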