Re: init script bug with multiple clusters

2015-04-16 Thread Amon Ott
Am 17.04.2015 um 03:01 schrieb Gregory Farnum:
> This looks good to me, but we need an explicit sign-off from you for
> it. If you can submit it as a PR on Github that's easiest for us, but
> if not can you send it in git email patch form? :)

Attached is the patch against the next branch in git email form - I hope this is
as expected. Our devel system cannot send mail directly.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH   Tel: +49 30 24342334
Werner-Voß-Damm 62   Fax: +49 30 99296856
12101 Berlin http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649

From 1e4d9f4fcd688fcbe275f2cff55b272dfeec2e45 Mon Sep 17 00:00:00 2001
From: Amon Ott 
Date: Fri, 17 Apr 2015 08:42:58 +0200
Subject: [PATCH] init script bug with multiple clusters

The Ceph init script (src/init-ceph.in) creates pid files without
cluster names. This means that only one cluster can run at a time.
The solution is simple and works fine here: add "$cluster-" as usual.

Signed-off-by: Amon Ott 
---
 src/init-ceph.in |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/init-ceph.in b/src/init-ceph.in
index 2ff98c7..d88ca58 100644
--- a/src/init-ceph.in
+++ b/src/init-ceph.in
@@ -227,7 +227,7 @@ for name in $what; do
 
 get_conf run_dir "/var/run/ceph" "run dir"
 
-get_conf pid_file "$run_dir/$type.$id.pid" "pid file"
+get_conf pid_file "$run_dir/$cluster-$type.$id.pid" "pid file"
 
 if [ "$command" = "start" ]; then
 	if [ -n "$pid_file" ]; then
-- 
1.7.10.4
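
For illustration, with the patch applied the pid file names become per-cluster.
Assuming the default run dir and two clusters named "ceph" and "backup" (the
cluster names here are chosen purely as an example), osd.0 in each cluster gets:

  /var/run/ceph/ceph-osd.0.pid      # cluster "ceph"
  /var/run/ceph/backup-osd.0.pid    # cluster "backup" -- no longer collides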





Re: Regarding newstore performance

2015-04-16 Thread Haomai Wang
On Fri, Apr 17, 2015 at 8:38 AM, Sage Weil  wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated to another SSD and after 1000GB of fio
>> > writes (same profile)..
>> >
>> > omap writes:
>> > -
>> >
>> > Total host writes in this period = 551020111 -- ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > ---
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
>> > getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal
> entries at all will be the big win.  I think Xiaoxi had tunable
> suggestions for that?  I didn't grok the rocksdb terms immediately so
> they didn't make a lot of sense at the time.. this is probably a good
> place to focus, though.  The rocksdb compaction stats should help out
> there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and focused
> just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> iodepth=16 makes no difference *until* I also set thinktime=10 (us, or
> almost any value really) and thinktime_blocks=16, at which point it goes
> up with the iodepth.  I'm not quite sure what is going on there but it
> seems to be preventing the elevator and/or disk from reordering writes and
> making more efficient sweeps across the disk.  In any case, though, with
> that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec,
> which is basically what I was getting from newstore.  Here's my fio
> config:
>
> http://fpaste.org/212110/42923089/
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> much from there as I scale threads or qd, strangely; not sure why yet.

Do you mean this PR (https://github.com/ceph/ceph/pull/4318)? I have a
simple benchmark in the comments of that PR.

>
> But... that's a big improvement over a few days ago (~8mb/sec).  And on
> this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL and
> would consistently SIGBUS from io_getevents() when ceph-osd did dlopen()
> on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can reproduce
> these results?  If so, we can redo the io size sweep.  I picked 8 wal
> threads since that was enough to help and going higher didn't seem to make
> much difference, but at some point we'll want to be more careful about
> picking that number.  We could also use libaio here, but I'm not sure it's
> worth it.  And this approach is somewhat orthogonal to the idea of
> efficiently passing the kernel things to fdatasync.

Agreed; this time I think we need to focus on the data store only. Maybe I'm
missing it - what's your overlay config value in this test?

>
> Anyway, next up is probably wrangling rocksdb's log!
>
> sage



-- 
Best Regards,

Wheat


Re: init script bug with multiple clusters

2015-04-16 Thread Gregory Farnum
This looks good to me, but we need an explicit sign-off from you for
it. If you can submit it as a PR on Github that's easiest for us, but
if not can you send it in git email patch form? :)
-Greg

On Wed, Apr 8, 2015 at 2:58 AM, Amon Ott  wrote:
> Hello Ceph!
>
> The Ceph init script (src/init-ceph.in) creates pid files without
> cluster names. This means that only one cluster can run at a time. The
> solution is simple and works fine here, patch against 0.94 is attached.
>
> Amon Ott
> --
> Dr. Amon Ott
> m-privacy GmbH   Tel: +49 30 24342334
> Werner-Voß-Damm 62   Fax: +49 30 99296856
> 12101 Berlin http://www.m-privacy.de
>
> Amtsgericht Charlottenburg, HRB 84946
>
> Geschäftsführer:
>  Dipl.-Kfm. Holger Maczkowsky,
>  Roman Maczkowsky
>
> GnuPG-Key-ID: 0x2DD3A649
>


RE: Regarding newstore performance

2015-04-16 Thread Chen, Xiaoxi
Agree. Threadpool/queue/locking is generally bad for latency. Can we just
make the newstore backend as synchronous as possible and get the parallelism
from more OSD op threads (#OSD_OP_THREAD)? Hopefully we could have better
latency in the low-#QD case.
 

-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com] 
Sent: Friday, April 17, 2015 8:48 AM
To: Sage Weil
Cc: Mark Nelson; Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
Subject: Re: Regarding newstore performance

On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil  wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated to another SSD and after 
>> > 1000GB of fio writes (same profile)..
>> >
>> > omap writes:
>> > -
>> >
>> > Total host writes in this period = 551020111 -- ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > ---
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and 
>> > adding those getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal entries 
> at all will be the big win.  I think Xiaoxi had tunable suggestions 
> for that?  I didn't grok the rocksdb terms immediately so they didn't 
> make a lot of sense at the time.. this is probably a good place to 
> focus, though.  The rocksdb compaction stats should help out there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and 
> focused just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, 
> setting
> iodepth=16 makes no difference *until* I also set thinktime=10 (us, or 
> almost any value really) and thinktime_blocks=16, at which point it 
> goes up with the iodepth.  I'm not quite sure what is going on there 
> but it seems to be preventing the elevator and/or disk from reordering 
> writes and making more efficient sweeps across the disk.  In any case, 
> though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec 
> with qd 64.
> Similarly, with qd 1 and thinktime of 250us, it drops to like 
> 15mb/sec, which is basically what I was getting from newstore.  Here's 
> my fio
> config:
>
> http://fpaste.org/212110/42923089/
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like 
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going 
> up much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And 
> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL 
> and would consistently SIGBUG from io_getevents() when ceph-osd did 
> dlopen() on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can 
> reproduce these results?  If so, we can redo the io size sweep.  I 
> picked 8 wal threads since that was enough to help and going higher 
> didn't seem to make much difference, but at some point we'll want to 
> be more careful about picking that number.  We could also use libaio 
> here, but I'm not sure it's worth it.  And this approach is somewhat 
> orthogonal to the idea of efficiently passing the kernel things to fdatasync.

Adding another thread switch to the IO path is going to make us very sad in the 
future, so I think this'd be a bad prototype version to have escape into the 
wild. I keep hearing Sam's talk about needing to get down to 1 thread switch if 
we're ever to hope for 100usec writes.

So consider this one vote for making libaio work, and sooner rather than later. 
:) -Greg

Re: Regarding newstore performance

2015-04-16 Thread Sage Weil
On Thu, 16 Apr 2015, Gregory Farnum wrote:
> On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil  wrote:
> > On Thu, 16 Apr 2015, Mark Nelson wrote:
> >> On 04/16/2015 01:17 AM, Somnath Roy wrote:
> >> > Here is the data with omap separated to another SSD and after 1000GB of 
> >> > fio
> >> > writes (same profile)..
> >> >
> >> > omap writes:
> >> > -
> >> >
> >> > Total host writes in this period = 551020111 -- ~2101 GB
> >> >
> >> > Total flash writes in this period = 1150679336
> >> >
> >> > data writes:
> >> > ---
> >> >
> >> > Total host writes in this period = 302550388 --- ~1154 GB
> >> >
> >> > Total flash writes in this period = 600238328
> >> >
> >> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding 
> >> > those
> >> > getting ~3.2 WA overall.
> >
> > This all suggests that getting rocksdb to not rewrite the wal
> > entries at all will be the big win.  I think Xiaoxi had tunable
> > suggestions for that?  I didn't grok the rocksdb terms immediately so
> > they didn't make a lot of sense at the time.. this is probably a good
> > place to focus, though.  The rocksdb compaction stats should help out
> > there.
> >
> > But... today I ignored this entirely and put rocksdb in tmpfs and focused
> > just on the actual wal IOs done to the fragments files after the fact.
> > For simplicity I focused just on 128k random writes into 4mb objects.
> >
> > fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> > iodepth=16 makes no difference *until* I also set thinktime=10 (us, or
> > almost any value really) and thinktime_blocks=16, at which point it goes
> > up with the iodepth.  I'm not quite sure what is going on there but it
> > seems to be preventing the elevator and/or disk from reordering writes and
> > making more efficient sweeps across the disk.  In any case, though, with
> > that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> > Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec,
> > which is basically what I was getting from newstore.  Here's my fio
> > config:
> >
> > http://fpaste.org/212110/42923089/
> >
> > Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> > flight so that the block layer and/or disk can reorder and be efficient.
> > I added a threadpool for doing wal work (newstore wal threads = 8 by
> > default) and it makes a big difference.  Now I am getting more like
> > 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> > much from there as I scale threads or qd, strangely; not sure why yet.
> >
> > But... that's a big improvement over a few days ago (~8mb/sec).  And on
> > this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> > winning, yay!
> >
> > I tabled the libaio patch for now since it was getting spurious EINVAL and
> > would consistently SIGBUS from io_getevents() when ceph-osd did dlopen()
> > on the rados plugins (weird!).
> >
> > Mark, at this point it is probably worth checking that you can reproduce
> > these results?  If so, we can redo the io size sweep.  I picked 8 wal
> > threads since that was enough to help and going higher didn't seem to make
> > much difference, but at some point we'll want to be more careful about
> > picking that number.  We could also use libaio here, but I'm not sure it's
> > worth it.  And this approach is somewhat orthogonal to the idea of
> > efficiently passing the kernel things to fdatasync.
> 
> Adding another thread switch to the IO path is going to make us very
> sad in the future, so I think this'd be a bad prototype version to
> have escape into the wild. I keep hearing Sam's talk about needing to
> get down to 1 thread switch if we're ever to hope for 100usec writes.

Yeah, for fast memory we'll want to take a totally different synchronous 
path through the code.  Right now I'm targeting general purpose (spinning 
disk and current-generation SSDs) usage (and this is the async post-commit 
cleanup work).

But yeah... I'll bite the bullet and do aio soon.  I suspect I just 
screwed up the buffer alignment and that's where EINVAL was coming from 
before.

sage


Re: Regarding newstore performance

2015-04-16 Thread Gregory Farnum
On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil  wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated to another SSD and after 1000GB of fio
>> > writes (same profile)..
>> >
>> > omap writes:
>> > -
>> >
>> > Total host writes in this period = 551020111 -- ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > ---
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
>> > getting ~3.2 WA overall.
>
> This all suggests that getting rocksdb to not rewrite the wal
> entries at all will be the big win.  I think Xiaoxi had tunable
> suggestions for that?  I didn't grok the rocksdb terms immediately so
> they didn't make a lot of sense at the time.. this is probably a good
> place to focus, though.  The rocksdb compaction stats should help out
> there.
>
> But... today I ignored this entirely and put rocksdb in tmpfs and focused
> just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
> iodepth=16 makes no difference *until* I also set thinktime=10 (us, or
> almost any value really) and thinktime_blocks=16, at which point it goes
> up with the iodepth.  I'm not quite sure what is going on there but it
> seems to be preventing the elevator and/or disk from reordering writes and
> making more efficient sweeps across the disk.  In any case, though, with
> that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
> Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec,
> which is basically what I was getting from newstore.  Here's my fio
> config:
>
> http://fpaste.org/212110/42923089/
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
> much from there as I scale threads or qd, strangely; not sure why yet.
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And on
> this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL and
> would consistently SIGBUS from io_getevents() when ceph-osd did dlopen()
> on the rados plugins (weird!).
>
> Mark, at this point it is probably worth checking that you can reproduce
> these results?  If so, we can redo the io size sweep.  I picked 8 wal
> threads since that was enough to help and going higher didn't seem to make
> much difference, but at some point we'll want to be more careful about
> picking that number.  We could also use libaio here, but I'm not sure it's
> worth it.  And this approach is somewhat orthogonal to the idea of
> efficiently passing the kernel things to fdatasync.

Adding another thread switch to the IO path is going to make us very
sad in the future, so I think this'd be a bad prototype version to
have escape into the wild. I keep hearing Sam's talk about needing to
get down to 1 thread switch if we're ever to hope for 100usec writes.

So consider this one vote for making libaio work, and sooner rather
than later. :)
-Greg


Re: Regarding newstore performance

2015-04-16 Thread Sage Weil
On Thu, 16 Apr 2015, Mark Nelson wrote:
> On 04/16/2015 01:17 AM, Somnath Roy wrote:
> > Here is the data with omap separated to another SSD and after 1000GB of fio
> > writes (same profile)..
> > 
> > omap writes:
> > -
> > 
> > Total host writes in this period = 551020111 -- ~2101 GB
> > 
> > Total flash writes in this period = 1150679336
> > 
> > data writes:
> > ---
> > 
> > Total host writes in this period = 302550388 --- ~1154 GB
> > 
> > Total flash writes in this period = 600238328
> > 
> > So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
> > getting ~3.2 WA overall.

This all suggests that getting rocksdb to not rewrite the wal 
entries at all will be the big win.  I think Xiaoxi had tunable 
suggestions for that?  I didn't grok the rocksdb terms immediately so 
they didn't make a lot of sense at the time.. this is probably a good 
place to focus, though.  The rocksdb compaction stats should help out 
there.

But... today I ignored this entirely and put rocksdb in tmpfs and focused 
just on the actual wal IOs done to the fragments files after the fact.  
For simplicity I focused just on 128k random writes into 4mb objects.

fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting 
iodepth=16 makes no difference *until* I also set thinktime=10 (us, or 
almost any value really) and thinktime_blocks=16, at which point it goes 
up with the iodepth.  I'm not quite sure what is going on there but it 
seems to be preventing the elevator and/or disk from reordering writes and 
making more efficient sweeps across the disk.  In any case, though, with 
that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.  
Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec, 
which is basically what I was getting from newstore.  Here's my fio 
config:

http://fpaste.org/212110/42923089/
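
A rough reconstruction of the kind of fio job described above, since the fpaste
link's contents are not reproduced here; everything beyond the stated 128k block
size, 4MB file size, iodepth and thinktime values (engine, paths, runtime, job
count) is an assumption:

cat > newstore-wal-sim.fio <<'EOF'
[global]
# 128k random writes into 4MB files, roughly mimicking wal IO into fragment files
ioengine=libaio
direct=1
rw=randwrite
bs=128k
size=4m
iodepth=16
# a tiny think time between IOs so several IOs stay in flight for the elevator
thinktime=10
thinktime_blocks=16
time_based
runtime=60
group_reporting

[wal-frags]
# assumed scratch directory on the disk under test
directory=/mnt/test
numjobs=4
EOF
fio newstore-wal-sim.fio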

Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
flight so that the block layer and/or disk can reorder and be efficient.  
I added a threadpool for doing wal work (newstore wal threads = 8 by 
default) and it makes a big difference.  Now I am getting more like 
19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up 
much from there as I scale threads or qd, strangely; not sure why yet.

But... that's a big improvement over a few days ago (~8mb/sec).  And on 
this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
winning, yay!

I tabled the libaio patch for now since it was getting spurious EINVAL and 
would consistently SIGBUS from io_getevents() when ceph-osd did dlopen() 
on the rados plugins (weird!).

Mark, at this point it is probably worth checking that you can reproduce 
these results?  If so, we can redo the io size sweep.  I picked 8 wal 
threads since that was enough to help and going higher didn't seem to make 
much difference, but at some point we'll want to be more careful about 
picking that number.  We could also use libaio here, but I'm not sure it's 
worth it.  And this approach is somewhat orthogonal to the idea of 
efficiently passing the kernel things to fdatasync.

Anyway, next up is probably wrangling rocksdb's log!

sage


Re: Ceph master - build broken unless --enable-debug specified

2015-04-16 Thread Mark Kirkwood

On 17/04/15 12:27, Gregory Farnum wrote:

On Sat, Apr 11, 2015 at 8:42 PM, Mark Kirkwood
 wrote:

Hi,

Building without --enable-debug produces:

ceph_fuse.cc: In member function ‘virtual void* main(int, const char**,
const char**)::RemountTest::entry()’:
ceph_fuse.cc:146:15: warning: ignoring return value of ‘int system(const
char*)’, declared with attribute warn_unused_result [-Wunused-result]
 system(buf);
^
   CXX  ceph_osd.o
   CXX  ceph_mds.o
make[3]: *** No rule to make target '../src/gmock/lib/libgmock_main.la',
needed by 'unittest_librbd'.  Stop.
make[3]: *** Waiting for unfinished jobs
   CXX  test/erasure-code/ceph_erasure_code_non_regression.o
make[3]: Leaving directory '/home/markir/develop/c/ceph/src'
Makefile:20716: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/home/markir/develop/c/ceph/src'
Makefile:8977: recipe for target 'all' failed
make[1]: *** [all] Error 2
make[1]: Leaving directory '/home/markir/develop/c/ceph/src'
Makefile:467: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1


Adding in --enable-debug gives a successful build.

This is on Ubuntu 14.10 64 bit, and the build procedure is:

$ git pull
$ git submodule update --init
$ ./autogen.sh
$ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var \
   [--with-debug \ ]
   --with-nss \
   --with-radosgw \
   --with-librocksdb-static=check \

$ make [ -j4 ]


Yep, looks like the unittest_librbd binary is in the noinst_PROGRAMS
target (whatever that is) rather than the check_PROGRAMS target.
Changing that seems to work — I pushed a branch wip-nodebug-build
fixing it, but if you have your own fix a PR is welcome. If not I'll
make a PR in the next couple days.


I had not looked very closely at what the exact problem was - your 
analysis looks good to me, I'll leave you to file a PR :-)


Cheers

Mark



Re: Ceph master - build broken unless --enable-debug specified

2015-04-16 Thread Gregory Farnum
On Sat, Apr 11, 2015 at 8:42 PM, Mark Kirkwood
 wrote:
> Hi,
>
> Building without --enable-debug produces:
>
> ceph_fuse.cc: In member function ‘virtual void* main(int, const char**,
> const char**)::RemountTest::entry()’:
> ceph_fuse.cc:146:15: warning: ignoring return value of ‘int system(const
> char*)’, declared with attribute warn_unused_result [-Wunused-result]
> system(buf);
>^
>   CXX  ceph_osd.o
>   CXX  ceph_mds.o
> make[3]: *** No rule to make target '../src/gmock/lib/libgmock_main.la',
> needed by 'unittest_librbd'.  Stop.
> make[3]: *** Waiting for unfinished jobs
>   CXX  test/erasure-code/ceph_erasure_code_non_regression.o
> make[3]: Leaving directory '/home/markir/develop/c/ceph/src'
> Makefile:20716: recipe for target 'all-recursive' failed
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory '/home/markir/develop/c/ceph/src'
> Makefile:8977: recipe for target 'all' failed
> make[1]: *** [all] Error 2
> make[1]: Leaving directory '/home/markir/develop/c/ceph/src'
> Makefile:467: recipe for target 'all-recursive' failed
> make: *** [all-recursive] Error 1
>
>
> Adding in --enable-debug gives a successful build.
>
> This is on Ubuntu 14.10 64 bit, and the build procedure is:
>
> $ git pull
> $ git submodule update --init
> $ ./autogen.sh
> $ ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var \
>   [--with-debug \ ]
>   --with-nss \
>   --with-radosgw \
>   --with-librocksdb-static=check \
>
> $ make [ -j4 ]

Yep, looks like the unittest_librbd binary is in the noinst_PROGRAMS
target (whatever that is) rather than the check_PROGRAMS target.
Changing that seems to work — I pushed a branch wip-nodebug-build
fixing it, but if you have your own fix a PR is welcome. If not I'll
make a PR in the next couple days.
-Greg
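
For context on the "whatever that is": in automake, anything listed in
noinst_PROGRAMS is built by a plain "make" (it just isn't installed), while
check_PROGRAMS are only built when the test targets run. That is why the missing
gmock rule only bites a normal build. Roughly (this describes generic automake
behaviour, not the actual ceph Makefile.am contents):

  make         # builds all/noinst programs -- trips over unittest_librbd's gmock dependency
  make check   # only here are check_PROGRAMS (and their gmock deps) expected to be built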


Re: CephFS and the next giant release v0.87.2

2015-04-16 Thread Gregory Farnum
On Thu, Apr 16, 2015 at 4:16 PM, Loic Dachary  wrote:
> Hi Greg,
>
> On 17/04/2015 00:44, Gregory Farnum wrote:
>> On Wed, Apr 15, 2015 at 2:37 AM, Loic Dachary  wrote:
>>> Hi Greg,
>>>
>>> The next giant release as found at https://github.com/ceph/ceph/tree/giant 
>>> passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think 
>>> it is ready for QE to start their own round of testing ?
>>>
>>> Note that it will be the last giant release.
>>>
>>> Cheers
>>>
>>> P.S. http://tracker.ceph.com/issues/11153#Release-information has direct 
>>> links to the pull requests merged into giant since v0.87.1 in case you need 
>>> more context about one of them.
>>
>> All those PRs look like fine ones to release.
>>
>> I remember you went through a big purge of giant-tagged backports at
>> one point though (when we thought we weren't going to do any more
>> releases at all). Did that get somehow undone and all of those dealt
>> with?
>
> I inadvertently closed a few issues that I should not have (fewer than 10, more 
> than 5 IIRC). I carefully reviewed all of them and they have indeed been dealt 
> with. There is now a HOWTO to prevent that kind of mistake, hopefully. 
> http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_resolve_issues_that_are_Pending_Backport
>
>> Some of them were more important than others and it would be
>> some work for me to reconstruct the list, but I'll need to do that if
>> you haven't already. :/
>
> There are four giant issues that are candidates for backporting: 
> http://tracker.ceph.com/projects/ceph/issues?query_id=68, only one of which 
> is related to CephFS. Do you think it must be included in this last giant 
> release?

Nope, not that one!

This all looks good to me then, thanks.
-Greg


Re: CephFS and the next giant release v0.87.2

2015-04-16 Thread Loic Dachary
Hi Greg,

On 17/04/2015 00:44, Gregory Farnum wrote:
> On Wed, Apr 15, 2015 at 2:37 AM, Loic Dachary  wrote:
>> Hi Greg,
>>
>> The next giant release as found at https://github.com/ceph/ceph/tree/giant 
>> passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think 
>> it is ready for QE to start their own round of testing ?
>>
>> Note that it will be the last giant release.
>>
>> Cheers
>>
>> P.S. http://tracker.ceph.com/issues/11153#Release-information has direct 
>> links to the pull requests merged into giant since v0.87.1 in case you need 
>> more context about one of them.
> 
> All those PRs look like fine ones to release.
> 
> I remember you went through a big purge of giant-tagged backports at
> one point though (when we thought we weren't going to do any more
> releases at all). Did that get somehow undone and all of those dealt
> with? 

I inadvertently closed a few issues that I should not have (fewer than 10, more 
than 5 IIRC). I carefully reviewed all of them and they have indeed been dealt 
with. There is now a HOWTO to prevent that kind of mistake, hopefully. 
http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_resolve_issues_that_are_Pending_Backport

> Some of them were more important than others and it would be
> some work for me to reconstruct the list, but I'll need to do that if
> you haven't already. :/

There are four giant issues that are candidates for backporting: 
http://tracker.ceph.com/projects/ceph/issues?query_id=68, only one of which is 
related to CephFS. Do you think it must be included in this last giant release?

Cheers

> -Greg
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: partial acks when send reply to client to reduce write latency

2015-04-16 Thread Gregory Farnum
On Thu, Apr 9, 2015 at 11:38 PM, 池信泽  wrote:
> Hi all:
>
> Right now, ceph has to receive all the ack messages from the replicas and
> only then reply an ack to the client. What about replying to the client
> directly once the primary has received some of them? Below is the request
> trace among the OSDs; the primary waits for the second sub_op_commit_rec
> msg for a long time.
>
> Does it make sense?

It makes sense on one level, but unfortunately it's just not feasible.
It would change how peering needs to work — right now, we need to
contact at least one OSD that is active in any interval. If we allowed
commits to happen without having hit disk on every OSD, we need to
talk to all the OSDs in every interval (or at least, {num_OSDs} -
{number_we_require_ack} + 1 of them), which would be pretty bad for
our failure resiliency.
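
A quick worked example of that constraint (the numbers are purely illustrative):
with 3 OSDs in the acting set and the client acked after only 2 of them have
committed,

  {num_OSDs} - {number_we_require_ack} + 1 = 3 - 2 + 1 = 2

so peering could no longer rely on reaching any single OSD from a past interval;
it would have to reach at least 2 of that interval's 3 members to be certain of
seeing every write that was acknowledged to a client.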

This comes up every so often as a suggestion and is a lot more
feasible with erasure coding — Yahoo has already implemented the
read-side version of this
(http://yahooeng.tumblr.com/post/116391291701/yahoo-cloud-object-store-object-storage-at),
but doing it on the write side would still take a lot of work.
-Greg


Re: CephFS and the next giant release v0.87.2

2015-04-16 Thread Gregory Farnum
On Wed, Apr 15, 2015 at 2:37 AM, Loic Dachary  wrote:
> Hi Greg,
>
> The next giant release as found at https://github.com/ceph/ceph/tree/giant 
> passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think 
> it is ready for QE to start their own round of testing ?
>
> Note that it will be the last giant release.
>
> Cheers
>
> P.S. http://tracker.ceph.com/issues/11153#Release-information has direct 
> links to the pull requests merged into giant since v0.87.1 in case you need 
> more context about one of them.

All those PRs look like fine ones to release.

I remember you went through a big purge of giant-tagged backports at
one point though (when we thought we weren't going to do any more
releases at all). Did that get somehow undone and all of those dealt
with? Some of them were more important than others and it would be
some work for me to reconstruct the list, but I'll need to do that if
you haven't already. :/
-Greg


Re: client/cluster compatibility testing

2015-04-16 Thread Yuri Weinstein
Yea, Sage, that sounds reasonable.

I added a ticket to capture this plan (http://tracker.ceph.com/issues/11413) 
and will add those tests soon.

Please add your comments to the ticket above.

I am assuming that it will look something like this for dumpling, firefly and 
hammer:

dumpling(stable) -> client-x
firefly(stable) -> client-x
hammer(stable) -> client-x

and reverse

dumpling-client(stable) -> cluster-x
firefly-cluster(stable) -> cluster-x
hammer-cluster(stable) -> cluster-x

Yes?

Thx
YuriW

- Original Message -
From: "Sage Weil" 
To: ceph-devel@vger.kernel.org
Sent: Thursday, April 16, 2015 9:42:29 AM
Subject: client/cluster compatibility testing

Now that there are several different vendors shipping and supporting Ceph 
in their products, we'll invariably have people running different 
versions of Ceph that are interested in interoperability.  If we focus 
just on client <-> cluster compatability, I think the issues are (1) 
compatibility between upstream ceph versions (firefly vs hammer) and 
(2) ensuring that any downstream changes the vendor makes don't break that 
compatibility.

I think the simplest way to address this is to talk about compatibility in 
terms of the upstream stable releases (firefly, hammer, etc.), and test 
that compatibility with teuthology tests from ceph-qa-suite.git.  We have 
some basic inter-version client/cluster tests already in 
suites/upgrade/client-upgrade.  Currently these test new (version "x") 
clients against a given release (dumpling, firefly).  I think we just need 
to add hammer to that mix, and then add a second set of tests that do the 
reverse: test clients from a given release (dumpling, firefly, hammer) 
against an arbitrary cluster version ("x").

We'll obviously run these tests on upstream releases to ensure that we are 
not breaking compatibility (or are doing so in known, explicit ways).  
Downstream folks can run the same test suites against any changes they 
make as well to ensure that their product is "compatible with firefly 
clients," or whatever.

Does that sound reasonable?
sage



Re: client/cluster compatibility testing

2015-04-16 Thread Josh Durgin

On 04/16/2015 09:42 AM, Sage Weil wrote:

I think the simplest way to address this is to talk about compatibility in
terms of the upstream stable releases (firefly, hammer, etc.), and test
that compatibility with teuthology tests from ceph-qa-suite.git.  We have
some basic inter-version client/cluster tests already in
suites/upgrade/client-upgrade.  Currently these test new (version "x")
clients against a given release (dumpling, firefly).  I think we just need
to add hammer to that mix, and then add a second set of tests that do the
reverse: test clients from a given release (dumpling, firefly, hammer)
against an arbitrary cluster version ("x").


The suites in suites/upgrade/$version-x do this, and use a mixed
version cluster rather than a purely version x cluster. It seems like
people would want that intra-cluster version coverage for smooth
upgrades.

Just need to add hammer-x there too (Yuri's renaming the client ones to
be $version-client-x for less confusion).

Also I think we'll want to start doing mixed-client-version tests,
particularly for things like rbd's exclusive locking:

http://tracker.ceph.com/issues/11405

Josh


Re: Ceph.com

2015-04-16 Thread Sage Weil
We've fixed it so that 404 handling isn't done by wordpress/php and 
things are muuuch happier.  We've also moved all of the git stuff to 
git.ceph.com.  There is a redirect from http://ceph.com/git to 
git.ceph.com (tho no https on the new site yet) and a proxy for 
git://ceph.com.

Please let us know if anything still appears to be broken or slow!

Thanks-
sage


Re: Regarding newstore performance

2015-04-16 Thread Mark Nelson

On 04/16/2015 01:17 AM, Somnath Roy wrote:

Here is the data with omap separated to another SSD and after 1000GB of fio 
writes (same profile)..

omap writes:
-

Total host writes in this period = 551020111 -- ~2101 GB

Total flash writes in this period = 1150679336

data writes:
---

Total host writes in this period = 302550388 --- ~1154 GB

Total flash writes in this period = 600238328

So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those 
getting ~3.2 WA overall.
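
For reference, the overall figure is just the arithmetic on the totals above,
taking the 1000GB of fio writes as the baseline:

  omap WA:  ~2101 GB host writes / 1000 GB fio writes ~= 2.1
  data WA:  ~1154 GB host writes / 1000 GB fio writes ~= 1.15
  overall:  (2101 + 1154) GB / 1000 GB ~= 3.26, i.e. the ~3.2 quoted above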


Looks like we can get quite a bit of data out of the rocksdb log as 
well.  Here's a stats dump after a full benchmark run from an SSD-backed 
OSD with newstore, fdatasync, and Xiaoxi's tunables to increase buffer sizes:


http://www.fpaste.org/212007/raw/

It appears that in this test at least, a lot of data gets moved to L3 
and L4 with associated WA.  Notice the crazy amount of reads as well!


Mark


Re: make check bot paused

2015-04-16 Thread Loic Dachary
Hi,

It is back :-)

Cheers

On 15/04/2015 13:55, Loic Dachary wrote:
> Hi,
> 
> The make check bot [1] that executes run-make-check.sh [2] on pull requests 
> and reports results as comments [3] is experiencing problems. It may be a 
> hardware issue and the bot is paused while the issue is investigated [4] to 
> avoid sending confusing false negatives. In the meantime the 
> run-make-check.sh [2] script can be run locally, before sending the pull 
> request, to confirm the commits to be sent do not break them. It is expected 
> to run in less than 15 minutes including compilation on a fast machine with a 
> SSD (or RAM disk) and 8 cores and 32GB of RAM and may take up to two hours on 
> a machine with a spinner and two cores.
> 
> Thanks for your patience.
> 
> Cheers
> 
> [1] bot running on pull requests http://jenkins.ceph.dachary.org/job/ceph/
> [2] run-make-check.sh 
> http://workbench.dachary.org/ceph/ceph/blob/master/run-make-check.sh
> [3] make check results example : 
> https://github.com/ceph/ceph/pull/3946#issuecomment-93286840
> [4] possible RAM failure http://tracker.ceph.com/issues/11399
> 

-- 
Loïc Dachary, Artisan Logiciel Libre





client/cluster compatibility testing

2015-04-16 Thread Sage Weil
Now that there are several different vendors shipping and supporting Ceph 
in their products, we'll invariably have people running different 
versions of Ceph that are interested in interoperability.  If we focus 
just on client <-> cluster compatability, I think the issues are (1) 
compatibility between upstream ceph versions (firefly vs hammer) and 
(2) ensuring that any downstream changes the vendor makes don't break that 
compatibility.

I think the simplest way to address this is to talk about compatibility in 
terms of the upstream stable releases (firefly, hammer, etc.), and test 
that compatibility with teuthology tests from ceph-qa-suite.git.  We have 
some basic inter-version client/cluster tests already in 
suites/upgrade/client-upgrade.  Currently these test new (version "x") 
clients against a given release (dumpling, firefly).  I think we just need 
to add hammer to that mix, and then add a second set of tests that do the 
reverse: test clients from a given release (dumpling, firefly, hammer) 
against an arbitrary cluster version ("x").

We'll obviously run these tests on upstream releases to ensure that we are 
not breaking compatibility (or are doing so in known, explicit ways).  
Downstream folks can run the same test suites against any changes they 
make as well to ensure that their product is "compatible with firefly 
clients," or whatever.

Does that sound reasonable?
sage



Re: leaking mons on a latest dumpling

2015-04-16 Thread Sage Weil
On Thu, 16 Apr 2015, Joao Eduardo Luis wrote:
> On 04/15/2015 05:38 PM, Andrey Korolyov wrote:
> > Hello,
> > 
> > there is a slow leak which I assume is present in all ceph versions, but
> > it is only clearly exposed over long time spans and on large
> > clusters. It looks like the lower a monitor is placed in the quorum
> > hierarchy, the higher the leak is:
> > 
> > 
> > {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05
> > 13:48:54.696784","created":"2015-03-05
> > 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}
> > 
> > ceph heap stats -m 10.0.1.95:6789 | grep Actual
> > MALLOC: =427626648 (  407.8 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.94:6789 | grep Actual
> > MALLOC: =289550488 (  276.1 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.93:6789 | grep Actual
> > MALLOC: =230592664 (  219.9 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.92:6789 | grep Actual
> > MALLOC: =253710488 (  242.0 MiB) Actual memory used (physical + swap)
> > ceph heap stats -m 10.0.1.91:6789 | grep Actual
> > MALLOC: = 97112216 (   92.6 MiB) Actual memory used (physical + swap)
> > 
> > for almost same uptime, the data difference is:
> > rd KB 55365750505
> > wr KB 82719722467
> > 
> > The leak itself is not very critical but of course requires some
> > script work to restart monitors at least once per month on a 300Tb
> > cluster to prevent >1G memory consumption by monitor processes. Given
> > the current status of dumpling, it would probably be possible to
> > identify the leak source and then forward-port the fix to the newer releases,
> > as the freshest version I am running at large scale is the top of the
> > dumpling branch; otherwise it would require an enormous amount of time to
> > check fix proposals.
> 
> There have been numerous reports of a slow leak in the monitors on
> dumpling and firefly.  I'm sure there's a ticket for that but I wasn't
> able to find it.
> 
> Many hours were spent chasing down this leak to no avail, despite
> plugging several leaks throughout the code (especially in firefly; those
> should have been backported to dumpling at some point or other).
> 
> This was mostly hard to figure out because it tends to require a
> long-term cluster to show up, and the bigger the cluster is, the larger
> the probability of triggering it.  This behavior has me believing that
> this should be somewhere in the message dispatching workflow and, given
> it's the leader that suffers the most, should be somewhere in the
> read-write message dispatching (PaxosService::prepare_update()).  But
> despite code inspections, I don't think we ever found the cause -- or
> that any fixed leak was ever flagged as the root of the problem.
> 
> Anyway, since Giant, most complaints (if not all!) went away.  Maybe I
> missed them, or maybe people suffering from this just stopped
> complaining.  I'm hoping it's the first rather than the latter and, as
> luck has it, maybe the fix was a fortunate side-effect of some other change.

Perhaps we should try to run one of the sepia lab cluster mons through 
valgrind massif.  The slowdown shouldn't impact anything important and 
it's a real cluster with real load (running hammer).

sage
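
As a rough sketch of what that could look like (the mon id, service commands and
paths below are assumptions, not a tested procedure):

  service ceph stop mon.a
  valgrind --tool=massif --massif-out-file=/var/log/ceph/massif.mon.a.%p \
      ceph-mon -i a -f
  # ...let it run under normal load for a while, then stop it and inspect:
  ms_print /var/log/ceph/massif.mon.a.*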


Ceph.com

2015-04-16 Thread Patrick McGarry
Hey cephers,

As most of you have no doubt noticed, ceph.com has been having
some...er..."issues" lately. Unfortunately this is some of the
holdover infrastructure stuff from being a startup without a big-boy
ops plan.

The current setup has ceph.com sharing a host with some of the nightly
build stuff to make it easier for gitbuilder tasks (that also build
the website doc) to coexist. Was this smart? No, probably not. Was it
the quick-and-dirty way for us to get stuff rolling when we were tiny?
Yep.

So, now that things are continuing to grow (website traffic load,
ceph-deploy key requests, number of simultaneous builds) we are
hitting the end of what one hard-working box can handle. I am in the
process of moving ceph.com to a new host so that build explosions won't
slag things like Ceph Day pages and the blog, but the doc may lag
behind a bit.

Hopefully since I'm starting with the website it won't hose up too many
of the other tasks, but bear with us while we split routing for a bit.
If you have any questions please feel free to poke me. Thanks.

-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Issue]Ceph cluster hang due to network partition

2015-04-16 Thread Ketor D
Hi Sage,
  Thanks for your reply.
  Finally we fixed the network and ceph went back to HEALTH_OK.
  We will improve our ops to avoid network partitions and prevent
this problem from recurring.

Ketor


On Tue, Apr 14, 2015 at 11:18 PM, Sage Weil  wrote:
> On Tue, 14 Apr 2015, Ketor D wrote:
>> Hi Sage,
>>   We recently hit a network partition problem that left our ceph
>> cluster unable to serve rbd.
>>   We are running 0.67.5 on our customer's cluster, and the partition
>> meant 3 OSDs could reach the mons but could not reach any of the other
>> OSDs.
>>   Then many PGs fell into the peering state, and rbd I/O hung.
>>
>> Before we operated on the cluster, I set the noout flag and stopped the 3
>> OSDs. After working on the 3 OSDs' memory and rebooting the OS, the network
>> was still partitioned. The 3 OSDs started, then many PGs went to peering.
>> I stopped the 3 OSD processes, but the PGs stayed in peering.
>
> One possibility is that those PGs were all on the partitioned side; in
> that case you would have seen stale+peering+... states.
> Another possibility is that there was not sufficient PG [meta]data on
> the other side of the partition and the PGs got stuck in down+... or
> incomplete+... states.
>
> Or, there was another partition somewhere or confusion such that there were
> OSDs that were unreachable but still in the 'up' state.
>
> sage
>
>
>> After the network partition was fixed, all PGs went active+clean and all was OK.
>>
>> I can't explain this, because I think the OSDs can judge whether the
>> other OSDs are alive, and I can see the 3 OSDs marked down in 'ceph osd
>> tree'.
>> Why did these PGs get stuck in peering?
>>
>> Thanks!
>> Ketor
>>
>>
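
For reference, the "set noout, stop the OSDs, do the maintenance, bring them
back" flow described above looks roughly like this; the osd ids and the
sysvinit-style service commands are assumptions for a 0.67-era cluster:

  ceph osd set noout                    # keep the stopped osds from being marked out
  for id in 10 11 12; do
      /etc/init.d/ceph stop osd.$id
  done
  # ...do the memory/OS work and bring the hosts back up, then:
  for id in 10 11 12; do
      /etc/init.d/ceph start osd.$id
  done
  ceph osd unset noout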


RE: tcmalloc issue

2015-04-16 Thread Somnath Roy
Thanks James !
We will try this out.

Regards
Somnath

-Original Message-
From: James Page [mailto:james.p...@ubuntu.com]
Sent: Thursday, April 16, 2015 4:48 AM
To: Chaitanya Huilgol; Somnath Roy; Sage Weil; ceph-maintain...@ceph.com
Cc: ceph-devel@vger.kernel.org
Subject: Re: tcmalloc issue


Hi Folks

The proposed fix is now in the 'proposed' pocket for 14.04 - see:

https://bugs.launchpad.net/ubuntu/+source/google-perftools/+bug/1439277

on how to enable that pocket for testing if you'd like to confirm that you are 
able to use the required feature now.

Cheers

James

--
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org






Re: tcmalloc issue

2015-04-16 Thread James Page

Hi Folks

The proposed fix is now in the 'proposed' pocket for 14.04 - see:

https://bugs.launchpad.net/ubuntu/+source/google-perftools/+bug/1439277

on how to enable that pocket for testing if you'd like to confirm that
you are able to use the required feature now.

Cheers

James

-- 
James Page
Ubuntu and Debian Developer
james.p...@ubuntu.com
jamesp...@debian.org


Re: leaking mons on a latest dumpling

2015-04-16 Thread Andrey Korolyov
On Thu, Apr 16, 2015 at 11:30 AM, Joao Eduardo Luis  wrote:
> On 04/15/2015 05:38 PM, Andrey Korolyov wrote:
>> Hello,
>>
>> there is a slow leak which I assume is present in all ceph versions, but
>> it is only clearly exposed over long time spans and on large
>> clusters. It looks like the lower a monitor is placed in the quorum
>> hierarchy, the higher the leak is:
>>
>>
>> {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05
>> 13:48:54.696784","created":"2015-03-05
>> 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}
>>
>> ceph heap stats -m 10.0.1.95:6789 | grep Actual
>> MALLOC: =427626648 (  407.8 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.94:6789 | grep Actual
>> MALLOC: =289550488 (  276.1 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.93:6789 | grep Actual
>> MALLOC: =230592664 (  219.9 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.92:6789 | grep Actual
>> MALLOC: =253710488 (  242.0 MiB) Actual memory used (physical + swap)
>> ceph heap stats -m 10.0.1.91:6789 | grep Actual
>> MALLOC: = 97112216 (   92.6 MiB) Actual memory used (physical + swap)
>>
>> for almost same uptime, the data difference is:
>> rd KB 55365750505
>> wr KB 82719722467
>>
>> The leak itself is not very critical but of course requires some
>> script work to restart monitors at least once per month on a 300Tb
>> cluster to prevent >1G memory consumption by monitor processes. Given
>> the current status of dumpling, it would probably be possible to
>> identify the leak source and then forward-port the fix to the newer releases,
>> as the freshest version I am running at large scale is the top of the
>> dumpling branch; otherwise it would require an enormous amount of time to
>> check fix proposals.
>
> There have been numerous reports of a slow leak in the monitors on
> dumpling and firefly.  I'm sure there's a ticket for that but I wasn't
> able to find it.
>
> Many hours were spent chasing down this leak to no avail, despite of
> plugging several leaks throughout the code (especially in firefly, that
> should have been backported to dumpling at some point or the other).
>
> This was mostly hard to figure out because it tends to require a
> long-term cluster to show up, and the biggest the cluster is the larger
> the probability of triggering it.  This behavior has me believing that
> this should be somewhere in the message dispatching workflow and, given
> it's the leader that suffers the most, should be somewhere in the
> read-write message dispatching (PaxosService::prepare_update()).  But
> despite code inspections, I don't think we ever found the cause -- or
> that any fixed leak was ever flagged as the root of the problem.
>
> Anyway, since Giant, most complaints (if not all!) went away.  Maybe I
> missed them, or maybe people suffering from this just stopped
> complaining.  I'm hoping it's the first rather than the latter and, as
> luck has it, maybe the fix was a fortunate side-effect of some other change.
>
>   -Joao
>

Thanks for the explanation; I accidentally reversed the logical order
describing leadership placement above. I'll go through the non-ported
commits for firefly and port the most promising ones when I have some spare
time, checking whether the leak disappears (it takes about a
week to see the difference for my workloads). Would heap dump data
be helpful for developers, to ring a bell and allow more deterministic
suggestions?


Re: leaking mons on a latest dumpling

2015-04-16 Thread Joao Eduardo Luis
On 04/15/2015 05:38 PM, Andrey Korolyov wrote:
> Hello,
> 
> there is a slow leak which I assume is present in all ceph versions, but
> it is only clearly exposed over long time spans and on large
> clusters. It looks like the lower a monitor is placed in the quorum
> hierarchy, the higher the leak is:
> 
> 
> {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05
> 13:48:54.696784","created":"2015-03-05
> 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}
> 
> ceph heap stats -m 10.0.1.95:6789 | grep Actual
> MALLOC: =427626648 (  407.8 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.94:6789 | grep Actual
> MALLOC: =289550488 (  276.1 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.93:6789 | grep Actual
> MALLOC: =230592664 (  219.9 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.92:6789 | grep Actual
> MALLOC: =253710488 (  242.0 MiB) Actual memory used (physical + swap)
> ceph heap stats -m 10.0.1.91:6789 | grep Actual
> MALLOC: = 97112216 (   92.6 MiB) Actual memory used (physical + swap)
> 
> for almost same uptime, the data difference is:
> rd KB 55365750505
> wr KB 82719722467
> 
> The leak itself is not very critical but of course requires some
> script work to restart monitors at least once per month on a 300Tb
> cluster to prevent >1G memory consumption by monitor processes. Given
> the current status of dumpling, it would probably be possible to
> identify the leak source and then forward-port the fix to the newer releases,
> as the freshest version I am running at large scale is the top of the
> dumpling branch; otherwise it would require an enormous amount of time to
> check fix proposals.

There have been numerous reports of a slow leak in the monitors on
dumpling and firefly.  I'm sure there's a ticket for that but I wasn't
able to find it.

Many hours were spent chasing down this leak to no avail, despite
plugging several leaks throughout the code (especially in firefly; those
should have been backported to dumpling at some point or other).

This was mostly hard to figure out because it tends to require a
long-term cluster to show up, and the bigger the cluster is, the larger
the probability of triggering it.  This behavior has me believing that
this should be somewhere in the message dispatching workflow and, given
it's the leader that suffers the most, should be somewhere in the
read-write message dispatching (PaxosService::prepare_update()).  But
despite code inspections, I don't think we ever found the cause -- or
that any fixed leak was ever flagged as the root of the problem.

Anyway, since Giant, most complaints (if not all!) went away.  Maybe I
missed them, or maybe people suffering from this just stopped
complaining.  I'm hoping it's the first rather than the latter and, as
luck has it, maybe the fix was a fortunate side-effect of some other change.

  -Joao
