Hi,

 

I didn’t do the maths, so maybe 7GB isn’t worth tuning for, although every little helps ;-)

 

I don’t believe peering or recovery should affect this value, but other things will consume memory during recovery, and I’m not aware of whether that can be limited or tuned.
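Purely as an aside, and an assumption on my part rather than something tested here: the usual knobs for throttling how much recovery/backfill work an OSD does at once are along these lines; whether they meaningfully bound memory in this situation is a guess.

[osd]
# illustrative values only -- throttle concurrent backfill/recovery per OSD
osd max backfills = 1
osd recovery max active = 1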

 

Yes, the write and read caches will consume memory and may limit Linux’s ability to react quickly enough in tight memory conditions. I believe you can be in a state where it looks like you have more memory potentially available than is actually usable at that point in time. vm.min_free_kbytes can help here.
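For illustration only, and as an assumption rather than something tested in this thread: the amount of dirty write cache Linux is allowed to accumulate before flushing can be capped with the vm.dirty_* sysctls, which may make tight-memory situations less abrupt.

# illustrative values, not recommendations
vm.dirty_background_ratio = 5   # start background writeback once dirty pages reach 5% of memory
vm.dirty_ratio = 10             # writers block once dirty pages reach 10% of memory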

 

From: Craig Chi [mailto:craig...@synology.com] 
Sent: 25 November 2016 01:46
To: Brad Hubbard <bhubb...@redhat.com>
Cc: Nick Fisk <n...@fisk.me.uk>; Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph OSDs cause kernel unresponsive

 

Hi Nick,

 

I have seen the report before. If I understand correctly, osd_map_cache_size generally introduces a fixed amount of memory usage. We are using the default value of 200, and a single OSD map I got from our cluster is 404KB.

 

That is 404KB * 200 * 90 (OSDs) = about 7GB in total on each node.
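For reference, one way to dump the current OSD map and check its size (the path is just an example):

ceph osd getmap -o /tmp/osdmap
ls -lh /tmp/osdmap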

 

Will the memory consumption from this factor grow during unstable peering or recovery? If not, we still need to find the root cause of why free memory drops uncontrollably.

 

Does anyone know what the relation is between the filestore or journal configuration and an OSD's memory consumption? Is it possible that the filestore queue or journal queue occupies huge numbers of memory pages and makes the filesystem cache hard to release (resulting in OOM)?
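As a rough upper bound, using the values from the ceph.conf quoted further down in this thread: filestore queue max bytes and journal queue max bytes are both set to 1048576000 (~1GB) per OSD, so if both queues on all 90 OSDs ever filled completely, that alone could be roughly 2GB * 90 = ~180GB, before counting page cache or OSD map caches. Whether the queues actually fill to that level is only a guess.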

 

Lastly, about nobarrier: I fully understand the consequences and am testing this option carefully. I sincerely appreciate your kindness and useful suggestions.

 

Sincerely,
Craig Chi

On 2016-11-25 07:23, Brad Hubbard <bhubb...@redhat.com> wrote:

Two of these appear to be hung task timeouts and the other is an invalid opcode.

There is no evidence here of memory exhaustion (it remains to be seen whether this is a factor, but I'd expect to see evidence of shrinker activity in the stacks if it were), and I would speculate that the increased memory utilisation is due to the issues with the OSD tasks.

I would suggest that the next step here is to work out specifically why the 
invalid opcode happened and/or why kernel tasks are hanging for > 120 seconds.

To do that you may need to capture a vmcore and analyse it and/or engage your 
kernel support team to investigate further.
 

 

On Fri, Nov 25, 2016 at 8:26 AM, Nick Fisk <n...@fisk.me.uk> wrote:

There are a couple of things you can do to reduce memory usage by limiting the number of OSD maps each OSD stores, but you will still be pushing up against the limits of the RAM you have available. There is a CERN 30PB test (it should be on Google) which gives details on some of the settings, as sketched below, but quite a few are no longer relevant in Jewel.
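The map-related settings usually mentioned in that context are along these lines; the values here are purely illustrative (not the CERN ones), and as noted some may no longer have any effect on Jewel:

[osd]
osd map cache size = 50
osd map max advance = 25
osd map share max epochs = 25
osd pg epoch persisted max stale = 25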

 

One other thing: I saw you have nobarrier set in your mount options. Please please please understand the consequences of this option!!!!
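For reference, the same mount option line quoted further down in this thread with nobarrier simply removed would be:

osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,inode64,logbsize=256k

Barriers are on by default for XFS, so dropping the option is enough to re-enable them.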

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Craig Chi
Sent: 24 November 2016 10:37
To: Nick Fisk <n...@fisk.me.uk>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph OSDs cause kernel unresponsive

 

Hi Nick,

 

Thank you for your helpful information.

 

I know that Ceph recommends 1GB of RAM per 1TB of storage, but we are not going to change the hardware architecture now.

Are there any methods to limit the resources a single OSD can consume?

 

And for your question, we currently set system configuration as:

 

vm.swappiness=10
kernel.pid_max=4194303
fs.file-max=26234859
vm.zone_reclaim_mode=0
vm.vfs_cache_pressure=50
vm.min_free_kbytes=4194303

 

I will try configuring vm.min_free_kbytes to a larger value and test.

I would be grateful if anyone can share their experience of tuning these values for Ceph.

 

Sincerely,
Craig Chi

 

On 2016-11-24 17:48, Nick Fisk <n...@fisk.me.uk> wrote:

Hi Craig,

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Craig Chi
Sent: 24 November 2016 08:34
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph OSDs cause kernel unresponsive

 

Hi Cephers,

We have encountered a kernel hanging issue on our Ceph cluster. Just like http://imgur.com/a/U2Flz , http://imgur.com/a/lyEko or http://imgur.com/a/IGXdu .

We believe it is caused by running out of memory, because we observed that when the OSDs went crazy, the available memory on each node decreased rapidly (from 50% available to lower than 10%). The nodes running Ceph OSDs then became unresponsive, with the console showing hung_task_timeout, slab_out_of_memory, etc. The only thing we could do at that point was hard reset the units.

It is hard to predict when the kernel hanging issue will happen. In my past experience, it usually happened after a long-term benchmark run, followed by a manual trigger such as 1) rebooting a node, 2) restarting all OSDs, or 3) modifying the CRUSH map.

Currently the cluster is back to normal, but we want to figure out the root cause so this doesn't happen again. We think the high values in our ceph.conf are pretty suspicious, but without tracing the code it is hard for us to gauge the impact of these values on memory consumption.

Many thanks if you have any suggestions.

 

I think you are probably running out of memory. 90 x 8TB disks is 720TB of storage per node, which will need a lot of RAM to run, and the fact that the problems occur when PGs start moving around after a node failure also suggests this.
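To put rough numbers on it, with the commonly cited guideline of 1GB of RAM per 1TB of storage (also mentioned elsewhere in this thread): 720TB per node would suggest on the order of 720GB of RAM, against the 256GB installed in each node.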

 

Have you adjusted your vm.vfs_cache_pressure?

 

You might also want to try setting vm.min_free_kbytes to 8-16GB to try and keep 
some memory free and avoid fragmentation.
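A minimal sketch of how that might be set, with an illustrative 8GB value (vm.min_free_kbytes is specified in kilobytes, and the file name below is just an example):

sysctl -w vm.min_free_kbytes=8388608
echo "vm.min_free_kbytes = 8388608" >> /etc/sysctl.d/90-ceph.conf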

 


=================================================================================


Following is our ceph cluster architecture:

OS: Ubuntu 16.04.1 LTS (4.4.0-31-generic #50-Ubuntu x86_64 GNU/Linux)
Ceph: Jewel 10.2.3

3 Ceph Monitors running on 3 dedicated machines
630 Ceph OSDs running on 7 storage machines (each machine has 256GB RAM and 90 
units of 8TB hard drives)

There are 4 pools with following settings:
vms     512  pg x 3 replica
images  512  pg x 3 replica
volumes 8192 pg x 3 replica
objects 4096 pg x (17,3) erasure code profile

==> average 173.92 pgs per OSD
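(That average comes straight from the pool settings above: (512 + 512 + 8192) x 3 replica copies + 4096 x (17+3) erasure-code chunks = 27,648 + 81,920 = 109,568 PG copies, spread across 630 OSDs = ~173.9 per OSD.)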

We tuned our ceph.conf by referencing many performance tuning resources online (mainly slide 38 of https://goo.gl/Idkh41 )

[global]
osd pool default pg num = 4096
osd pool default pgp num = 4096
err to syslog = true
log to syslog = true
osd pool default size = 3
max open files = 131072
fsid = 1c33bf75-e080-4a70-9fd8-860ff216f595
osd crush chooseleaf type = 1

[mon.mon1]
host = mon1
mon addr = 172.20.1.2

[mon.mon2]
host = mon2
mon addr = 172.20.1.3

[mon.mon3]
host = mon3
mon addr = 172.20.1.4

[mon]
mon osd full ratio = 0.85
mon osd nearfull ratio = 0.7
mon osd down out interval = 600
mon osd down out subtree limit = host
mon allow pool delete = true
mon compact on start = true

[osd]
public_network = 172.20.3.1/21
cluster_network = 172.24.0.1/24
osd disk threads = 4
osd mount options xfs = 
rw,noexec,nodev,noatime,nodiratime,nobarrier,inode64,logbsize=256k
osd crush update on start = false
osd op threads = 20
osd mkfs options xfs = -f -i size=2048
osd max write size = 512
osd mkfs type = xfs
osd journal size = 5120
filestore max inline xattrs = 6
filestore queue committing max bytes = 1048576000
filestore queue committing max ops = 5000
filestore queue max bytes = 1048576000
filestore op threads = 32
filestore max inline xattr size = 254
filestore max sync interval = 15
filestore min sync interval = 10
journal max write bytes = 1048576000
journal max write entries = 1000
journal queue max ops = 3000
journal queue max bytes = 1048576000
ms dispatch throttle bytes = 1048576000

 

Sincerely,
Craig Chi

 

 





--

Cheers,
Brad

 

 


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
