Hi,

Your settings

osd_max_backfills = 10
osd_recovery_max_active = 10

are above the defaults and are too high. These limits are per OSD, so a single disk may be doing 10 backfills (actually 20, counting both incoming and outgoing) at the same time.

Try lowering them dynamically:

ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.2'
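If the lower values help, they could also be persisted in ceph.conf so they survive OSD restarts (a minimal sketch reusing the same keys already in your [osd] section):

[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_sleep = 0.2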

Even if the injectargs commands return a message saying a restart is required, they do take effect. You can double-check by querying a specific OSD locally (example: osd.1):
ceph daemon osd.1 config get osd_max_backfills
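To check every OSD on a node at once, a minimal sketch over the admin sockets (this assumes the default /var/run/ceph/ socket path; containerized OSDs may expose the sockets elsewhere):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "$sock:"
    ceph daemon "$sock" config get osd_max_backfills   # query via the admin socket
done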

You can also assign lower CRUSH weights to the newly added OSDs, for example 0.5 of their final value, to reduce data movement, then bump them back up once the cluster has balanced, as in the sketch below.
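For example (a sketch only; osd.540 is a hypothetical id, and the weights assume the usual size-based CRUSH weighting in TiB for an 8 TB drive):

ceph osd crush reweight osd.540 3.64    # bring the new OSD in at roughly half weight
# ...wait for rebalancing to settle, then restore the full weight:
ceph osd crush reweight osd.540 7.28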

/Maged


On 12/01/2019 04:56, Subhachandra Chandra wrote:
Hi,

    We have a cluster with 9 hosts and 540 HDDs using Bluestore and containerized OSDs running Luminous 12.2.4. While trying to add new nodes, the cluster collapsed as it could not keep up with establishing enough TCP connections. We fixed sysctl settings to handle more connections and to recycle TIME_WAIT sockets faster. Currently, as we try to restart the cluster by bringing up a few OSDs at a time, some of the OSDs get very busy after around 360 of them come up. iostat shows that the busy OSDs are constantly reading from the Bluestore partition. The number of busy OSDs per node varies, norecover is set, and there are no active clients. OSD logs don't show anything other than cephx: verify_authorizer errors, which happen on both busy and idle OSDs and don't seem to be related to the drive reads.

  How can we figure out why the OSDs are busy reading from the drives? If it is some kind of recovery, is there a way to track its progress? Output of ceph -s and logs from a busy and an idle OSD are copied below.

Thanks
Chandra

Uptime stats with load averages show variance across the 9 older nodes.

02:43:44 up 19:21,        0 users,  load average: 0.88, 1.03, 1.06
02:43:44 up  7:58,        0 users,  load average: 16.91, 13.49, 12.43
02:43:44 up 1 day, 14 min, 0 users,  load average: 7.67, 6.70, 6.35
02:43:45 up  7:01,        0 users,  load average: 84.40, 84.20, 83.73
02:43:45 up  6:40,        1 user,   load average: 17.08, 17.40, 20.05
02:43:45 up 19:46,        0 users,  load average: 15.58, 11.93, 11.44
02:43:45 up 20:39,        0 users,  load average: 7.88, 6.50, 5.69
02:43:46 up 1 day,  1:20,  0 users,  load average: 5.03, 3.81, 3.49
02:43:46 up 1 day, 58 min, 0 users,  load average: 0.62, 1.00, 1.38


Ceph Config

--------------

[global]
cluster network = 192.168.13.0/24
fsid = <>
mon host = 172.16.13.101,172.16.13.102,172.16.13.103
mon initial members = ctrl1,ctrl2,ctrl3
mon_max_pg_per_osd = 750
mon_osd_backfillfull_ratio = 0.92
mon_osd_down_out_interval = 900
mon_osd_full_ratio = 0.95
mon_osd_nearfull_ratio = 0.85
osd_crush_chooseleaf_type = 3
osd_heartbeat_grace = 900
mon_osd_laggy_max_interval = 900
osd_max_pg_per_osd_hard_ratio = 1.0
public network = 172.16.13.0/24

[mon]
mon_compact_on_start = true

[osd]
osd_deep_scrub_interval = 2419200
osd_deep_scrub_stride = 4194304
osd_max_backfills = 10
osd_max_object_size = 276824064
osd_max_scrubs = 1
osd_max_write_size = 264
osd_pool_erasure_code_stripe_unit = 2097152
osd_recovery_max_active = 10
osd_heartbeat_interval = 15


Data nodes sysctl params
-----------------------------

fs.aio-max-nr=1048576
kernel.pid_max=4194303
kernel.threads-max=2097152
net.core.netdev_max_backlog=65536
net.core.optmem_max=1048576
net.core.rmem_max=8388608
net.core.rmem_default=8388608
net.core.somaxconn=2048
net.core.wmem_max=8388608
net.core.wmem_default=8388608
vm.max_map_count=524288
vm.min_free_kbytes=262144

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_syn_backlog=16384
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0
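(For reference, a minimal sketch of how such settings are typically applied and persisted; the sysctl.d file name here is arbitrary:

# after writing the settings above into /etc/sysctl.d/90-ceph-net.conf:
sysctl --system                   # reload settings from all configuration files
sysctl net.ipv4.tcp_fin_timeout   # spot-check a single value
)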



Ceph -s output

---------------


root@ctrl1:/# ceph -s
  cluster:
    id:     06126476-6deb-4baa-b7ca-50f5ccfacb68
    health: HEALTH_ERR
            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
            704 osds down
            9 hosts (540 osds) down
            71 nearfull osd(s)
            2 pool(s) nearfull
            780664/74163111 objects misplaced (1.053%)
            7724/8242239 objects unfound (0.094%)
            396 PGs pending on creation
            Reduced data availability: 32597 pgs inactive, 29764 pgs down, 820 pgs peering, 74 pgs incomplete, 1 pg stale
            Degraded data redundancy: 679158/74163111 objects degraded (0.916%), 1250 pgs degraded, 1106 pgs undersized
            33 slow requests are blocked > 32 sec
            9 stuck requests are blocked > 4096 sec
            mons ctrl1,ctrl2,ctrl3 are using a lot of disk space

  services:
    mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3
    mgr: ctrl1(active), standbys: ctrl2, ctrl3
    osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs
         flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

  data:
    pools:   2 pools, 33280 pgs
    objects: 8049k objects, 2073 TB
    usage:   2277 TB used, 458 TB / 2736 TB avail
    pgs:     3.585% pgs unknown
             94.363% pgs not active
             679158/74163111 objects degraded (0.916%)
             780664/74163111 objects misplaced (1.053%)
             7724/8242239 objects unfound (0.094%)
             29754 down
             1193  unknown
             535   peering
             496   activating+undersized+degraded+remapped
             284   remapped+peering
             258   active+undersized+degraded+remapped
             161   activating+degraded+remapped
             143   active+recovering+undersized+degraded+remapped
             89    active+undersized+degraded
             76    active+clean+remapped
             71    incomplete
             48    active+undersized+remapped
             46    undersized+degraded+peered
             34    active+recovering+degraded+remapped
             26    active+clean
             21    activating+remapped
             13    activating+undersized+degraded
             10    down+remapped
             9     active+recovery_wait+undersized+degraded+remapped
             4     activating
             3     remapped+incomplete
             2     activating+undersized+remapped
             1     stale+peering
             1     undersized+peered
             1     undersized+remapped+peered
             1     activating+degraded

root@ctrl1:/# ceph version
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)


osd.86 - busy
-------------------

2019-01-12 02:30:02.582363 7f00ea0f4700  0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
2019-01-12 02:30:02.582383 7f00ea0f4700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6868/770039 conn(0x55c4a8d89000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:04.005541 7f00ea8f5700  0 auth: could not find secret_id=7544
2019-01-12 02:30:04.005554 7f00ea8f5700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
2019-01-12 02:30:04.005557 7f00ea8f5700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.3:6836/405613 conn(0x55c4f3617800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:07.910816 7f00e98f3700  0 auth: could not find secret_id=7550
2019-01-12 02:30:07.910864 7f00e98f3700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7550
2019-01-12 02:30:07.910884 7f00e98f3700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6824/767660 conn(0x55c4ec462800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:16.982636 7f00ea8f5700  0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
2019-01-12 02:30:16.982640 7f00ea8f5700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.6:6816/349322 conn(0x55c4aee6f000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer


osd.132 - idle

---------------

2019-01-12 02:31:38.370478 7fa508450700  0 auth: could not find secret_id=7551
2019-01-12 02:31:38.370487 7fa508450700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7551
2019-01-12 02:31:38.370489 7fa508450700  0 -- 192.168.13.5:6854/356774 >> 192.168.13.8:6872/1201589 conn(0x563b3e46f000 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:31:41.121603 7fa509452700  0 auth: could not find secret_id=7544
2019-01-12 02:31:41.121672 7fa509452700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
2019-01-12 02:31:41.121707 7fa509452700  0 -- 192.168.13.5:6854/356774 >> 192.168.13.9:6808/515991 conn(0x563b53997800 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer





--
Maged Mokhtar
CEO PetaSAN
4 Emad El Deen Kamel
Cairo 11371, Egypt
www.petasan.org
+201006979931
skype: maged.mokhtar
