Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-06 Thread Igor Fedotov

Hi Kenneth,

Mimic 13.2.5 has the previous version of the bitmap allocator, which isn't 
recommended for use. Please revert.



The new bitmap allocator will be available starting with 13.2.6.
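For example, reverting would look roughly like this (a sketch, assuming the 
allocator was switched via ceph.conf and that "stupid" is the pre-bitmap 
default on 13.2.5):

[osd]
# either remove the bitmap lines entirely, or set them back explicitly:
bluestore_allocator = stupid
bluefs_allocator = stupid

and then restart the affected OSDs (e.g. systemctl restart ceph-osd@3).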


Thanks,

Igor

On 5/6/2019 4:19 PM, Kenneth Waegeman wrote:

Hi all,

I am also switching OSDs to the new bitmap allocator on 13.2.5. That 
went quite smoothly so far, except for one OSD that keeps segfaulting 
when I enable the bitmap allocator. Each time I disable the bitmap 
allocator on it again, the OSD is fine again. Segfault error of the OSD:




[... quoted segfault log snipped; it appears in full in Kenneth's original message below ...]

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-06 Thread Kenneth Waegeman

Hi all,

I am also switching OSDs to the new bitmap allocator on 13.2.5. That 
went quite smoothly so far, except for one OSD that keeps segfaulting 
when I enable the bitmap allocator. Each time I disable the bitmap 
allocator on it again, the OSD is fine again. Segfault error of the OSD:




--- begin dump of recent events ---
  -319> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perfcounters_dump hook 0x55b2155b60d0
  -318> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command 1 hook 0x55b2155b60d0
  -317> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perf dump hook 0x55b2155b60d0
  -316> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perfcounters_schema hook 0x55b2155b60d0
  -315> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perf histogram dump hook 0x55b2155b60d0
  -314> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command 2 hook 0x55b2155b60d0
  -313> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perf schema hook 0x55b2155b60d0
  -312> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perf histogram schema hook 0x55b2155b60d0
  -311> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command perf reset hook 0x55b2155b60d0
  -310> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config show hook 0x55b2155b60d0
  -309> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config help hook 0x55b2155b60d0
  -308> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config set hook 0x55b2155b60d0
  -307> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config unset hook 0x55b2155b60d0
  -306> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config get hook 0x55b2155b60d0
  -305> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config diff hook 0x55b2155b60d0
  -304> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command config diff get hook 0x55b2155b60d0
  -303> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command log flush hook 0x55b2155b60d0
  -302> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command log dump hook 0x55b2155b60d0
  -301> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command log reopen hook 0x55b2155b60d0
  -300> 2019-05-06 15:11:45.172 7f28f4321d80  5 asok(0x55b2155de5a0) 
register_command dump_mempools hook 0x55b2155ec2c8
  -299> 2019-05-06 15:11:45.182 7f28f4321d80 10 monclient: 
get_monmap_and_config
  -298> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient: 
build_initial_monmap
  -297> 2019-05-06 15:11:45.222 7f28e3a19700  2 Event(0x55b215911080 
nevent=5000 time_id=1).set_owner idx=1 owner=139813594437376
  -296> 2019-05-06 15:11:45.222 7f28e421a700  2 Event(0x55b215910c80 
nevent=5000 time_id=1).set_owner idx=0 owner=139813602830080
  -295> 2019-05-06 15:11:45.222 7f28e3218700  2 Event(0x55b215911880 
nevent=5000 time_id=1).set_owner idx=2 owner=139813586044672

  -294> 2019-05-06 15:11:45.222 7f28f4321d80  1  Processor -- start
  -293> 2019-05-06 15:11:45.222 7f28f4321d80  1 -- - start start
  -292> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient: init
  -291> 2019-05-06 15:11:45.222 7f28f4321d80  5 adding auth protocol: 
cephx
  -290> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient: 
auth_supported 2 method cephx
  -289> 2019-05-06 15:11:45.222 7f28f4321d80  2 auth: KeyRing::load: 
loaded key file /var/lib/ceph/osd/ceph-3/keyring
  -288> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient: 
_reopen_session rank -1
  -287> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient(hunting): 
picked mon.noname-c con 0x55b2159e2600 addr 10.141.16.3:6789/0
  -286> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient(hunting): 
picked mon.noname-b con 0x55b2159e2c00 addr 10.141.16.2:6789/0
  -285> 2019-05-06 15:11:45.222 7f28f4321d80  1 -- - --> 
10.141.16.2:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- 
0x55b2155b1200 con 0
  -284> 2019-05-06 15:11:45.222 7f28f4321d80  1 -- - --> 
10.141.16.3:6789/0 -- auth(proto 0 26 bytes epoch 0) v1 -- 
0x55b2155b1440 con 0
  -283> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient(hunting): 
_renew_subs
  -282> 2019-05-06 15:11:45.222 7f28f4321d80 10 monclient(hunting): 
authenticate will time out at 2019-05-06 15:16:45.237660
  -281> 2019-05-06 15:11:45.222 7f28e3a19700  1 -- 
10.141.16.3:0/3652030958 learned_addr learned my addr 
10.141.16.3:0/3652030958
  -280> 2019-05-06 15:11:45.222 7f28e3a19700  2 -- 
10.141.16.3:0/3652030958 >> 10.141.16.3:6789/0 conn(0x55b2159e2600 :-1 
s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=0)._process_connection 
got newly_acked_seq 0 vs out_seq 0
  -279> 2019-05-06 15:11:45.222 7f28e3218700  2 -- 

Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Igor Podlesny
On Fri, 3 May 2019 at 21:39, Mark Nelson  wrote:
[...]
> > [osd]
> > ...
> > bluestore_allocator = bitmap
> > bluefs_allocator = bitmap
> >
> > I would restart the nodes one by one and see, what happens.
>
> If you are using 12.2.11 you likely still have the old bitmap allocator

Would those config changes just be ignored, or would the OSD fail to start instead?
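One way to answer that empirically (a sketch; osd.0 is just an example id) is to
ask a running OSD what it actually ended up with, via its admin socket:

ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator
ceph daemon osd.0 config diff    # shows values that differ from the defaults

If the running OSD reports something other than what was set, the option was
effectively ignored rather than fatal.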

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Mark Nelson


On 5/3/19 1:38 AM, Denny Fuchs wrote:

hi,

I had never noticed the Debian /etc/default/ceph before :-)

=
# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728


That is what is active now.



Yep, if you profile the OSD under a small write workload you can see how 
changing this affects tcmalloc's behavior.  This was especially 
important back before we had the async messenger, but I believe we've 
seen evidence that we're still getting some benefit with larger thread 
cache even now.
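If you want to see that in practice, tcmalloc's view of the heap can be sampled
from a running OSD through the tell interface (a sketch, using osd.1 as an
arbitrary id; it requires an OSD built against tcmalloc):

ceph tell osd.1 heap start_profiler   # begin sampling allocations
# ...apply a small write workload...
ceph tell osd.1 heap dump             # write a heap profile (analyzable with google-perftools' pprof)
ceph tell osd.1 heap stop_profiler
ceph tell osd.1 heap stats            # summary of in-use vs. freed-but-unmapped memory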






Huge pages:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never


# dpkg -S  /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1


So the file exists on Proxmox 5.x (Ceph version 12.2.11-pve1).


If I understand correctly, I should try to set the bitmap allocator:


[osd]
...
bluestore_allocator = bitmap
bluefs_allocator = bitmap

I would restart the nodes one by one and see what happens.



If you are using 12.2.11 you likely still have the old bitmap allocator 
which we do not recommend using at all.  Igor Fedotov backported his 
excellent new bitmap allocator to 12.2.12 though:



https://github.com/ceph/ceph/tree/v12.2.12/src/os/bluestore
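Before switching it may be worth double-checking that every daemon is actually
running >= 12.2.12, for example:

ceph versions              # per-daemon-type summary of running versions
ceph tell 'osd.*' version  # ask each OSD individually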


Mark




cu denny


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Igor Podlesny
On Fri, 3 May 2019 at 13:38, Denny Fuchs  wrote:
[...]
> If I understand correct: I should try to set bitmap allocator

That's one of the options I mentioned.

Another one was to try using jemalloc (re-read my emails).

> [osd]
> ...
> bluestore_allocator = bitmap
> bluefs_allocator = bitmap
>
> I would restart the nodes one by one and see, what happens.

Right.

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-03 Thread Denny Fuchs

hi,

I had never noticed the Debian /etc/default/ceph before :-)

=
# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728


That is what is active now.


Huge pages:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never


# dpkg -S  /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1


So the file exists on Proxmox 5.x (Ceph version 12.2.11-pve1).


If I understand correctly, I should try to set the bitmap allocator:


[osd]
...
bluestore_allocator = bitmap
bluefs_allocator = bitmap

I would restart the nodes one by one and see what happens.
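After each restart, the running value can be verified through the admin socket,
the same way the memory target is checked elsewhere in this thread (a sketch):

ceph daemon osd.31 config show | grep allocator
# both bluestore_allocator and bluefs_allocator should now report "bitmap"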

cu denny


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Fri, 3 May 2019 at 05:12, Mark Nelson  wrote:
[...]
> > -- https://www.kernel.org/doc/Documentation/vm/transhuge.txt
>
> Why are you quoting the description for the madvise setting when that's
> clearly not what was set in the case I just showed you?

Similarly, why are you telling us it must be due to THPs if:

1) by default they're not used unless madvise()'ed, and
2) neither jemalloc nor tcmalloc madvise()s for them by default either.

[...]
> previously|malloc|'ed. Because the machine used transparent huge pages,

Is that from DigitalOcean's blog? I read it quite a long time ago. It was
also written long ago, referring to an ancient release of jemalloc and,
more importantly, to a system that had THP activated.

But I've shown you that using THP is not the kernel's default setting --
unless madvise() tells the kernel to.
Your CentOS example isn't relevant, because the person who started this
thread uses Debian (Proxmox, to be more precise).
Moreover, something tells me that even default CentOS installs also have
THP set to madvise()-only.

> I'm not going to argue with you about this.

I'm not arguing with you.
I'm merely showing you that instead of making baseless claims (or wild
guesswork), it's worth checking the facts first.
Checking whether THPs are used at all (although that might be due not to
the OSDs but to, say, KVM) is as simple as looking into /proc/meminfo.
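For instance (a sketch; smaps_rollup needs a reasonably recent kernel):

grep AnonHugePages /proc/meminfo                            # node-wide THP usage
grep AnonHugePages /proc/$(pidof -s ceph-osd)/smaps_rollup  # one OSD process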

> Test it if you want or don't.

I didn't start this thread. ;)
As for me -- I've played enough with all kinds of allocators and THP settings. :)

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson


On 5/2/19 1:51 PM, Igor Podlesny wrote:

On Fri, 3 May 2019 at 01:29, Mark Nelson  wrote:

On 5/2/19 11:46 AM, Igor Podlesny wrote:

On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]

FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it's
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.
Transparent huge pages can potentially have an effect here both with tcmalloc 
and with
jemalloc so it's not certain that switching the allocator will fix it entirely.

Most likely wrong -- the kernel's default setting for THP is "madvise".
Neither tcmalloc nor jemalloc will madvise() to make that happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.


  From one of our centos nodes with no special actions taken to change
THP settings (though it's possible it was inherited from something else):


$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt



Why are you quoting the description for the madvise setting when that's 
clearly not what was set in the case I just showed you?






And regarding madvise and alternate memory allocators:
https:

[...]

did you ever read any of it?

one link's info:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"



"It turns out that|jemalloc(3)|uses|madvise(2)|extensively to notify the 
operating system that it's done with a range of memory which it had 
previously|malloc|'ed. Because the machine used transparent huge pages, 
the page size was 2MB. As such, a lot of the memory which was being 
marked with|madvise(..., MADV_DONTNEED)|was within ranges substantially 
smaller than 2MB. This meant that the operating system never was able to 
evict pages which had ranges marked as|MADV_DONTNEED|because the entire 
page would have to be unneeded to allow it to be reused.


So despite initially looking like a leak, the operating system itself 
was unable to free memory because of|madvise(2)|and transparent huge 
pages.^4 
 
This led to sustained memory pressure on the machine 
and|redis-server|eventually getting OOM killed."



I'm not going to argue with you about this.  Test it if you want or don't.


Mark




(and I've said

None of tcmalloc or jemalloc would madvise() to make it happen.
With fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

before)




Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Fri, 3 May 2019 at 01:29, Mark Nelson  wrote:
> On 5/2/19 11:46 AM, Igor Podlesny wrote:
> > On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
> > [...]
> >> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> >> looking at the heap stats to see how much memory tcmalloc thinks it's
> >> allocated vs how much RSS memory is being used by the process.  It's
> >> quite possible that there is memory that has been unmapped but that the
> >> kernel can't (or has decided not yet to) reclaim.
> >> Transparent huge pages can potentially have an effect here both with 
> >> tcmalloc and with
> >> jemalloc so it's not certain that switching the allocator will fix it 
> >> entirely.
> > Most likely wrong. -- Default kernel's settings in regards of THP are 
> > "madvise".
> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
>
>
>  From one of our centos nodes with no special actions taken to change
> THP settings (though it's possible it was inherited from something else):
>
>
> $ cat /etc/redhat-release
> CentOS Linux release 7.5.1804 (Core)
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never

"madvise" will enter direct reclaim like "always" but only for regions
that are have used madvise(MADV_HUGEPAGE). This is the default behaviour.

-- https://www.kernel.org/doc/Documentation/vm/transhuge.txt

> And regarding madvise and alternate memory allocators:
> https:
[...]

did you ever read any of it?

one link's info:

"By default jemalloc does not use huge pages for heap memory (there is
opt.metadata_thp which uses THP for internal metadata though)"

(and I've said
> > None of tcmalloc or jemalloc would madvise() to make it happen.
> > With fresh enough jemalloc you could have it, but it needs special
> > malloc.conf'ing.
before)

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Mark Nelson



On 5/2/19 11:46 AM, Igor Podlesny wrote:

On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]

FWIW, if you still have an OSD up with tcmalloc, it's probably worth
looking at the heap stats to see how much memory tcmalloc thinks it's
allocated vs how much RSS memory is being used by the process.  It's
quite possible that there is memory that has been unmapped but that the
kernel can't (or has decided not yet to) reclaim.
Transparent huge pages can potentially have an effect here both with tcmalloc 
and with
jemalloc so it's not certain that switching the allocator will fix it entirely.

Most likely wrong -- the kernel's default setting for THP is "madvise".
Neither tcmalloc nor jemalloc will madvise() to make that happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.



From one of our centos nodes with no special actions taken to change 
THP settings (though it's possible it was inherited from something else):



$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never


And regarding madvise and alternate memory allocators:


https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/

https://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb

https://github.com/gperftools/gperftools/issues/1073

https://github.com/jemalloc/jemalloc/issues/1243

https://github.com/jemalloc/jemalloc/issues/1128





First I would just get the heap stats and then after that I would be
very curious if disabling transparent huge pages helps. Alternately,
it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-02 Thread Igor Podlesny
On Thu, 2 May 2019 at 05:02, Mark Nelson  wrote:
[...]
> FWIW, if you still have an OSD up with tcmalloc, it's probably worth
> looking at the heap stats to see how much memory tcmalloc thinks it's
> allocated vs how much RSS memory is being used by the process.  It's
> quite possible that there is memory that has been unmapped but that the
> kernel can't (or has decided not yet to) reclaim.

> Transparent huge pages can potentially have an effect here both with tcmalloc 
> and with
> jemalloc so it's not certain that switching the allocator will fix it 
> entirely.

Most likely wrong -- the kernel's default setting for THP is "madvise".
Neither tcmalloc nor jemalloc will madvise() to make that happen.
With a fresh enough jemalloc you could have it, but it needs special
malloc.conf'ing.

> First I would just get the heap stats and then after that I would be
> very curious if disabling transparent huge pages helps. Alternately,
> it's always possible it's a memory leak. :D

RedHat can do better (hopefully). ;-P

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Mark Nelson

On 5/1/19 12:59 AM, Igor Podlesny wrote:

On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:

On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]

Any suggestions ?

-- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), could you please elaborate on this?



FWIW, if you still have an OSD up with tcmalloc, it's probably worth 
looking at the heap stats to see how much memory tcmalloc thinks it's 
allocated vs how much RSS memory is being used by the process.  It's 
quite possible that there is memory that has been unmapped but that the 
kernel can't (or has decided not yet to) reclaim.  Transparent huge 
pages can potentially have an effect here both with tcmalloc and with 
jemalloc so it's not certain that switching the allocator will fix it 
entirely.
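For example (a sketch; osd.31 is just an id borrowed from elsewhere in the thread):

ceph tell osd.31 heap stats     # tcmalloc's view: in-use vs. freed-but-unmapped bytes
ceph tell osd.31 heap release   # ask tcmalloc to return freed pages to the kernel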



First I would just get the heap stats and then after that I would be 
very curious if disabling transparent huge pages helps. Alternately, 
it's always possible it's a memory leak. :D



Mark



Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Igor Fedotov

Hi Igor,

Yeah, BlueStore allocators are fully interchangeable; you can switch 
between them freely.



Thanks,

Igor


On 5/1/2019 8:59 AM, Igor Podlesny wrote:

On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:

On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]

Any suggestions ?

-- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), could you please elaborate on this?




Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-05-01 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 20:56, Igor Podlesny  wrote:
> On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
> [..]
> > Any suggestions ?
>
> -- Try different allocator.

Ah, BTW, besides the memory allocator there's another option: the recently
backported bitmap allocator.
Igor Fedotov wrote that it's expected to have a smaller memory footprint
over time:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034299.html

Also, I'm not sure whether it's okay to switch existing OSDs "on the fly"
-- changing the config and restarting the OSDs.
Igor (Fedotov), could you please elaborate on this?

-- 
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-04-30 Thread Igor Podlesny
On Tue, 30 Apr 2019 at 19:10, Denny Fuchs  wrote:
[..]
> Any suggestions ?

-- Try different allocator.

In Proxmox 4 they had this by default in /etc/default/ceph {{

## use jemalloc instead of tcmalloc
#
# jemalloc is generally faster for small IO workloads and when
# ceph-osd is backed by SSDs.  However, memory usage is usually
# higher by 200-300mb.
#
#LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

}},

So you may try using it in the same way; the package is still there in
Proxmox 5:

  libjemalloc1: /usr/lib/x86_64-linux-gnu/libjemalloc.so.1

No one can tell for sure if it would help, but jemalloc "...

is a general purpose malloc(3) implementation that emphasizes
fragmentation avoidance and scalable concurrency support.

..." -- http://jemalloc.net/

I noticed OSDs with jemalloc tend to have a much bigger VSZ over time, but
RSS should be fine.
Looking forward to hearing about your experience with it.
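A minimal sketch of what enabling it could look like on such a node, assuming
the Debian/Proxmox packaging still reads /etc/default/ceph:

# /etc/default/ceph
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

followed by restarting the OSDs on that node (e.g. systemctl restart
ceph-osd.target) and watching RSS/VSZ afterwards.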

--
End of message. Next message?


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2019-04-30 Thread Denny Fuchs

hi,

I'd like to add a memory problem as well.

What we have:

* Ceph version 12.2.11
* 5 x 512GB Samsung 850 Evo
* 5 x 1TB WD Red (5.4k)
* OS Debian Stretch ( Proxmox VE 5.x )
* 2 x CPU E5-2620 v4
* Memory 64GB DDR4

I've added to ceph.conf

...

[osd]
  osd memory target = 3221225472
...

Which is active:


===
# ceph daemon osd.31 config show | grep memory_target
"osd_memory_target": "3221225472",
===

The problem is that the OSD processes are eating my memory:

==
# free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         52G        7.8G        693M        2.2G         50G
Swap:          8.0G        5.8M        8.0G
==

As an example, osd.31, which is an HDD (WD Red):


==
# ceph daemon osd.31 dump_mempools

...

"bluestore_alloc": {
"items": 40379056,
"bytes": 40379056
},
"bluestore_cache_data": {
"items": 1613,
"bytes": 130048000
},
"bluestore_cache_onode": {
"items": 64888,
"bytes": 43604736
},
"bluestore_cache_other": {
"items": 7043426,
"bytes": 209450352
},
...
"total": {
"items": 48360478,
"bytes": 633918931
}
=


=
# ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -30
 6.5  1.8 5040944 6594 /usr/bin/ceph-osd -f --cluster ceph --id 31 
--setuser ceph --setgroup ceph
 6.4  2.4 5053492 6819 /usr/bin/ceph-osd -f --cluster ceph --id 1 
--setuser ceph --setgroup ceph
 6.4  2.3 5044144 5454 /usr/bin/ceph-osd -f --cluster ceph --id 4 
--setuser ceph --setgroup ceph
 6.2  1.9 4927248 6082 /usr/bin/ceph-osd -f --cluster ceph --id 5 
--setuser ceph --setgroup ceph
 6.1  2.2 4839988 7684 /usr/bin/ceph-osd -f --cluster ceph --id 3 
--setuser ceph --setgroup ceph
 6.1  2.1 4876572 8155 /usr/bin/ceph-osd -f --cluster ceph --id 2 
--setuser ceph --setgroup ceph
 5.9  1.3 4652608 5760 /usr/bin/ceph-osd -f --cluster ceph --id 32 
--setuser ceph --setgroup ceph
 5.8  1.9 4699092 8374 /usr/bin/ceph-osd -f --cluster ceph --id 0 
--setuser ceph --setgroup ceph
 5.8  1.4 4562480 5623 /usr/bin/ceph-osd -f --cluster ceph --id 30 
--setuser ceph --setgroup ceph
 5.7  1.3 4491624 7268 /usr/bin/ceph-osd -f --cluster ceph --id 34 
--setuser ceph --setgroup ceph
 5.5  1.2 4430164 6201 /usr/bin/ceph-osd -f --cluster ceph --id 33 
--setuser ceph --setgroup ceph
 5.4  1.4 4319480 6405 /usr/bin/ceph-osd -f --cluster ceph --id 29 
--setuser ceph --setgroup ceph
 1.0  0.8 1094500 4749 /usr/bin/ceph-mon -f --cluster ceph --id 
fc-r02-ceph-osd-01 --setuser ceph --setgroup ceph
 0.2  4.8 948764  4803 /usr/bin/ceph-mgr -f --cluster ceph --id 
fc-r02-ceph-osd-01 --setuser ceph --setgroup ceph

=
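To compare what the mempools report against actual RSS across the whole node,
a quick loop like this can help (a sketch; it assumes admin sockets under
/var/run/ceph, and the jq path is '.total.bytes' on 12.2.x vs
'.mempool.total.bytes' on newer releases, as seen elsewhere in this thread):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    ceph daemon "$sock" dump_mempools | jq '.total.bytes'
done | awk '{ sum += $1 } END { printf "mempool total: %.1f GiB\n", sum/1073741824 }'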

After a reboot, the node uses roughly 30GB, but over a month it is 
again over 50GB and growing.


Any suggestions ?

cu denny


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2018-11-08 Thread Wido den Hollander



On 11/8/18 12:28 PM, Hector Martin wrote:
> On 11/8/18 5:52 PM, Wido den Hollander wrote:
>> [osd]
>> bluestore_cache_size_ssd = 1G
>>
>> The BlueStore Cache size for SSD has been set to 1GB, so the OSDs
>> shouldn't use more then that.
>>
>> When dumping the mem pools each OSD claims to be using between 1.8GB and
>> 2.2GB of memory.
>>
>> $ ceph daemon osd.X dump_mempools|jq '.total.bytes'
>>
>> Summing up all the values I get to a total of 15.8GB and the system is
>> using 22GB.
>>
>> Looking at 'ps aux --sort rss' I see OSDs using almost 10% of the
>> memory, which would be ~3GB for a single daemon.
> 
> This is similar to what I see on a memory-starved host with the OSDs
> configured with very little cache:
> 
> [osd]
>   bluestore cache size = 18000
> 
> $ ceph daemon osd.13 dump_mempools|jq '.mempool.total.bytes'
> 163117861
> 

Interesting. Looking at my OSD in this case (cache = 1GB) I see
BlueStore reporting 1548288000 bytes at bluestore_cache_data.

That's 1.5GB while 1GB has been set.

This OSD claims to be using 2GB in total at mempool.total.bytes.

So that's 1.5GB for BlueStore's cache and then 512M for the rest?

PGLog and OSDMaps aren't using that much memory.

Wido

> That adds up, but ps says:
> 
> USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> ceph 234576  2.6  6.2 1236200 509620 ?  Ssl  20:10   0:16
> /usr/bin/ceph-osd -i 13 --pid-file /run/ceph/osd.13.pid -c
> /etc/ceph/ceph.conf --foreground
> 
> So ~500MB RSS for this one. Due to an emergency situation that made me
> lose half of the RAM on this host, I'm actually resorting to killing the
> oldest OSD every 5 minutes right now to keep the server from OOMing
> (this will be fixed soon).
> 
> I would very much like to know if this OSD memory usage outside of the
> bluestore cache size can be bounded or reduced somehow. I don't
> particularly care about performance, so it would be useful to be able to
> tune it lower. This would help single-host and smaller Ceph use cases; I
> think Ceph's properties make it a very interesting alternative to things
> like btrfs and zfs, but dedicating several GB of RAM per disk/OSD is not
> always viable. Right now it seems that besides the cache, OSDs will
> creep up in memory usage up to some threshold, and I'm not sure what
> determines what that baseline usage is or whether it can be controlled.
> 
> 


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2018-11-08 Thread Hector Martin
On 11/8/18 5:52 PM, Wido den Hollander wrote:
> [osd]
> bluestore_cache_size_ssd = 1G
> 
> The BlueStore Cache size for SSD has been set to 1GB, so the OSDs
> shouldn't use more then that.
> 
> When dumping the mem pools each OSD claims to be using between 1.8GB and
> 2.2GB of memory.
> 
> $ ceph daemon osd.X dump_mempools|jq '.total.bytes'
> 
> Summing up all the values I get to a total of 15.8GB and the system is
> using 22GB.
> 
> Looking at 'ps aux --sort rss' I see OSDs using almost 10% of the
> memory, which would be ~3GB for a single daemon.

This is similar to what I see on a memory-starved host with the OSDs
configured with very little cache:

[osd]
  bluestore cache size = 18000

$ ceph daemon osd.13 dump_mempools|jq '.mempool.total.bytes'
163117861

That adds up, but ps says:

USERPID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
ceph 234576  2.6  6.2 1236200 509620 ?  Ssl  20:10   0:16
/usr/bin/ceph-osd -i 13 --pid-file /run/ceph/osd.13.pid -c
/etc/ceph/ceph.conf --foreground

So ~500MB RSS for this one. Due to an emergency situation that made me
lose half of the RAM on this host, I'm actually resorting to killing the
oldest OSD every 5 minutes right now to keep the server from OOMing
(this will be fixed soon).

I would very much like to know if this OSD memory usage outside of the
bluestore cache size can be bounded or reduced somehow. I don't
particularly care about performance, so it would be useful to be able to
tune it lower. This would help single-host and smaller Ceph use cases; I
think Ceph's properties make it a very interesting alternative to things
like btrfs and zfs, but dedicating several GB of RAM per disk/OSD is not
always viable. Right now it seems that besides the cache, OSDs will
creep up in memory usage up to some threshold, and I'm not sure what
determines what that baseline usage is or whether it can be controlled.
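For reference, the memory-related knobs that come up elsewhere in this thread
(values are illustrative only, copied from the other messages):

[osd]
# BlueStore cache cap for SSD-backed OSDs:
bluestore_cache_size_ssd = 1G
# overall per-OSD memory target, used on the 12.2.11 setup elsewhere in this thread:
osd memory target = 3221225472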


-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2018-11-08 Thread Wido den Hollander



On 11/8/18 11:34 AM, Stefan Kooman wrote:
> Quoting Wido den Hollander (w...@42on.com):
>> Hi,
>>
>> Recently I've seen a Ceph cluster experience a few outages due to memory
>> issues.
>>
>> The machines:
>>
>> - Intel Xeon E3 CPU
>> - 32GB Memory
>> - 8x 1.92TB SSD
>> - Ubuntu 16.04
>> - Ceph 12.2.8
> 
> What kernel version is running? What network card is being used?
> 

4.15.0-38-generic

The NIC in this case is 1GbE, an Intel I210.

It's this SuperMicro mainboard:
https://www.supermicro.com/products/motherboard/xeon/c220/x10sl7-f.cfm

So the kernel is already at 4.15 and it's not using any of those NICs.

Thanks!

Wido

> We hit a mem leak bug in Intel driver (i40e) (Intel x710) which has been
> fixed (mostly) in 4.13 and up [1].
> 
> Gr. Stefan
> 
> [1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1748408
> 


Re: [ceph-users] Unexplainable high memory usage OSD with BlueStore

2018-11-08 Thread Stefan Kooman
Quoting Wido den Hollander (w...@42on.com):
> Hi,
> 
> Recently I've seen a Ceph cluster experience a few outages due to memory
> issues.
> 
> The machines:
> 
> - Intel Xeon E3 CPU
> - 32GB Memory
> - 8x 1.92TB SSD
> - Ubuntu 16.04
> - Ceph 12.2.8

What kernel version is running? What network card is being used?

We hit a memory leak bug in the Intel driver (i40e, for the Intel X710), which
has been fixed (mostly) in 4.13 and up [1].

Gr. Stefan

[1]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1748408
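For anyone following along, both can be checked quickly (a sketch; the interface
name is just an example):

uname -r                        # running kernel version
lspci -nn | grep -i ethernet    # NIC model(s)
ethtool -i eno1                 # driver name and version for a given interface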

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl