[ceph-users] Weird behaviour of mon_osd_down_out_subtree_limit=host

2015-07-24 Thread Jan Schermer
“Friday fun”… not!

We set mon_osd_down_out_subtree_limit=host some time ago. Now we needed to take 
down all OSDs on one host and as expected nothing happened (noout was _not_ 
set). All the PGs showed as stuck degraded.
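For anyone reproducing this, a minimal sketch of how to verify the setting and inspect the stuck PGs (the mon id and admin-socket path are assumptions; adjust to your deployment):

```shell
# Ask a running mon what it thinks the limit is (via its admin socket).
ceph daemon mon.$(hostname -s) config get mon_osd_down_out_subtree_limit

# Summarise cluster health, including the degraded PG count.
ceph health detail

# List PGs stuck in an unclean state (includes degraded ones).
ceph pg dump_stuck unclean
```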

Then we brought 3 OSDs on that host up and then down again because of slow-request madness.

Since then there’s some weirdness I don’t have an explanation for:

1) there are 8 active+remapped PGs (hosted on completely different hosts from 
the one we were working on). Why?

2) How does mon_osd_down_out_subtree_limit even work? How does it tell the 
whole host is down? If I start just one OSD, is the host still down? Will it 
“out” all the other OSDs?
Doesn’t look like it, because I just started one OSD and it didn’t out all the 
others.

3) after starting the one OSD, some backfills are occurring, even though I set 
“nobackfill”
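A sketch for double-checking the flag is actually set cluster-wide; note also that, as far as I know, “nobackfill” only stops backfill — log-based recovery still runs and can look similar, and “norecover” would be needed to stop that too:

```shell
# The osdmap flags line should include "nobackfill" if it took effect.
ceph osd dump | grep flags

# The status output also lists active cluster flags.
ceph -s
```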

4) the one OSD I started on this host now consumes 6.5 GB of memory (RSS). All 
other OSDs in the cluster consume ~1.2-1.5 GB. No idea why…
(and it’s the vanilla tcmalloc version)
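A sketch for comparing per-OSD resident memory and getting allocator-level detail (the `<id>` is a placeholder for the affected OSD’s id):

```shell
# Sort running ceph-osd processes by resident set size, largest first.
ps -C ceph-osd -o pid,rss,cmd --sort=-rss | head

# On tcmalloc builds, dump the daemon's heap statistics for more detail.
ceph tell osd.<id> heap stats
```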

Doh…

Any ideas welcome. I can’t even start all the OSDs if they start consuming this 
amount of memory.


Jan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird behaviour of mon_osd_down_out_subtree_limit=host

2015-07-24 Thread Jan Schermer
Turns out that when we started the 3 OSDs, the mon did “out” the rest on the 
same host, so their reweight was 0.
Thus when I started the single OSD on that host, Ceph tried to move all the PGs 
from the other (outed) OSDs onto this one, which failed for lack of disk space, 
and because of that the daemon also consumed much more memory.
I had to reweight all the OSDs back (since we don’t usually run them with a 
reweight of 1 because of poor balancing) and I am starting them one by one…
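A sketch of that recovery procedure, assuming (hypothetically) that osd.10 through osd.15 live on the affected host and the usual reweight is 0.85 — substitute your own ids and values:

```shell
# Restore the reweights that got zeroed when the OSDs were marked out.
for id in 10 11 12 13 14 15; do
    ceph osd reweight "$id" 0.85
done

# Start the daemons one at a time (SysV syntax of 2015-era ceph packages),
# waiting for peering/backfill to settle before starting the next.
service ceph start osd.10
ceph -s
```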

I think mon_osd_down_out_subtree_limit should be a bit smarter and only “out” 
the OSDs once they themselves have been started at least once - not when one 
other OSD on the host starts. I don’t think I ever want to start all the OSDs 
at once, so that pretty much makes it unusable.

Jan

> On 24 Jul 2015, at 13:53, Jan Schermer  wrote:
