Hi,

We have a cluster with 9 hosts and 540 HDDs using BlueStore and containerized OSDs, running luminous 12.2.4. While trying to add new nodes, the cluster collapsed because it could not keep up with establishing enough TCP connections. We tuned sysctl to handle more connections and to recycle TIME_WAIT sockets faster. Currently, as we try to restart the cluster by bringing up a few OSDs at a time, some of the OSDs become very busy after around 360 of them come up. iostat shows that the busy OSDs are constantly reading from the BlueStore partition. The number of busy OSDs per node varies, norecover is set, and there are no active clients. The OSD logs don't show anything other than cephx: verify_authorizer errors, which happen on both busy and idle OSDs and don't seem to be related to the drive reads.
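For context, this is roughly how we are spotting the busy OSDs and mapping them back to their devices (just a sketch; the container path and metadata field names are examples and may differ slightly per node/release, and osd.86 is one of the busy OSDs from the logs below):

  # spot the drives with constant reads (high rkB/s and %util)
  iostat -x 5 3

  # map an OSD id back to its BlueStore device
  ceph osd metadata 86 | grep -E 'bluestore_bdev|devices'

  # or follow the block symlink inside the OSD's container
  readlink -f /var/lib/ceph/osd/ceph-86/block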
How can we figure out why the OSDs are busy reading from the drives? If it is some kind of recovery, is there a way to track progress? Output of ceph -s and logs from a busy and an idle OSD are copied below; a sketch of the commands we plan to run against a busy OSD is appended after the logs.

Thanks,
Chandra

Uptime stats with load averages show the variance across the 9 older nodes:

02:43:44 up 19:21, 0 users, load average: 0.88, 1.03, 1.06
02:43:44 up 7:58, 0 users, load average: 16.91, 13.49, 12.43
02:43:44 up 1 day, 14 min, 0 users, load average: 7.67, 6.70, 6.35
02:43:45 up 7:01, 0 users, load average: 84.40, 84.20, 83.73
02:43:45 up 6:40, 1 user, load average: 17.08, 17.40, 20.05
02:43:45 up 19:46, 0 users, load average: 15.58, 11.93, 11.44
02:43:45 up 20:39, 0 users, load average: 7.88, 6.50, 5.69
02:43:46 up 1 day, 1:20, 0 users, load average: 5.03, 3.81, 3.49
02:43:46 up 1 day, 58 min, 0 users, load average: 0.62, 1.00, 1.38

Ceph Config
--------------
[global]
cluster network = 192.168.13.0/24
fsid = <>
mon host = 172.16.13.101,172.16.13.102,172.16.13.103
mon initial members = ctrl1,ctrl2,ctrl3
mon_max_pg_per_osd = 750
mon_osd_backfillfull_ratio = 0.92
mon_osd_down_out_interval = 900
mon_osd_full_ratio = 0.95
mon_osd_nearfull_ratio = 0.85
osd_crush_chooseleaf_type = 3
osd_heartbeat_grace = 900
mon_osd_laggy_max_interval = 900
osd_max_pg_per_osd_hard_ratio = 1.0
public network = 172.16.13.0/24

[mon]
mon_compact_on_start = true

[osd]
osd_deep_scrub_interval = 2419200
osd_deep_scrub_stride = 4194304
osd_max_backfills = 10
osd_max_object_size = 276824064
osd_max_scrubs = 1
osd_max_write_size = 264
osd_pool_erasure_code_stripe_unit = 2097152
osd_recovery_max_active = 10
osd_heartbeat_interval = 15

Data nodes Sysctl params
-----------------------------
fs.aio-max-nr=1048576
kernel.pid_max=4194303
kernel.threads-max=2097152
net.core.netdev_max_backlog=65536
net.core.optmem_max=1048576
net.core.rmem_max=8388608
net.core.rmem_default=8388608
net.core.somaxconn=2048
net.core.wmem_max=8388608
net.core.wmem_default=8388608
vm.max_map_count=524288
vm.min_free_kbytes=262144
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_syn_backlog=16384
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0

Ceph -s output
---------------
root@ctrl1:/# ceph -s
  cluster:
    id:     06126476-6deb-4baa-b7ca-50f5ccfacb68
    health: HEALTH_ERR
            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
            704 osds down
            9 hosts (540 osds) down
            71 nearfull osd(s)
            2 pool(s) nearfull
            780664/74163111 objects misplaced (1.053%)
            7724/8242239 objects unfound (0.094%)
            396 PGs pending on creation
            Reduced data availability: 32597 pgs inactive, 29764 pgs down, 820 pgs peering, 74 pgs incomplete, 1 pg stale
            Degraded data redundancy: 679158/74163111 objects degraded (0.916%), 1250 pgs degraded, 1106 pgs undersized
            33 slow requests are blocked > 32 sec
            9 stuck requests are blocked > 4096 sec
            mons ctrl1,ctrl2,ctrl3 are using a lot of disk space

  services:
    mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3
    mgr: ctrl1(active), standbys: ctrl2, ctrl3
    osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs
         flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

  data:
    pools:   2 pools, 33280 pgs
    objects: 8049k objects, 2073 TB
    usage:   2277 TB used, 458 TB / 2736 TB avail
    pgs:     3.585% pgs unknown
             94.363% pgs not active
             679158/74163111 objects degraded (0.916%)
             780664/74163111 objects misplaced (1.053%)
             7724/8242239 objects unfound (0.094%)
             29754 down
             1193  unknown
             535   peering
             496   activating+undersized+degraded+remapped
             284   remapped+peering
             258   active+undersized+degraded+remapped
             161   activating+degraded+remapped
             143   active+recovering+undersized+degraded+remapped
             89    active+undersized+degraded
             76    active+clean+remapped
             71    incomplete
             48    active+undersized+remapped
             46    undersized+degraded+peered
             34    active+recovering+degraded+remapped
             26    active+clean
             21    activating+remapped
             13    activating+undersized+degraded
             10    down+remapped
             9     active+recovery_wait+undersized+degraded+remapped
             4     activating
             3     remapped+incomplete
             2     activating+undersized+remapped
             1     stale+peering
             1     undersized+peered
             1     undersized+remapped+peered
             1     activating+degraded

root@ctrl1:/# ceph version
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)

osd.86 - busy
-------------------
2019-01-12 02:30:02.582363 7f00ea0f4700 0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
2019-01-12 02:30:02.582383 7f00ea0f4700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6868/770039 conn(0x55c4a8d89000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:04.005541 7f00ea8f5700 0 auth: could not find secret_id=7544
2019-01-12 02:30:04.005554 7f00ea8f5700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
2019-01-12 02:30:04.005557 7f00ea8f5700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.3:6836/405613 conn(0x55c4f3617800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:07.910816 7f00e98f3700 0 auth: could not find secret_id=7550
2019-01-12 02:30:07.910864 7f00e98f3700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7550
2019-01-12 02:30:07.910884 7f00e98f3700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6824/767660 conn(0x55c4ec462800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:16.982636 7f00ea8f5700 0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
2019-01-12 02:30:16.982640 7f00ea8f5700 0 -- 192.168.13.5:6806/351612 >> 192.168.13.6:6816/349322 conn(0x55c4aee6f000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer

osd.132 - idle
---------------
2019-01-12 02:31:38.370478 7fa508450700 0 auth: could not find secret_id=7551
2019-01-12 02:31:38.370487 7fa508450700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7551
2019-01-12 02:31:38.370489 7fa508450700 0 -- 192.168.13.5:6854/356774 >> 192.168.13.8:6872/1201589 conn(0x563b3e46f000 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:31:41.121603 7fa509452700 0 auth: could not find secret_id=7544
2019-01-12 02:31:41.121672 7fa509452700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
2019-01-12 02:31:41.121707 7fa509452700 0 -- 192.168.13.5:6854/356774 >> 192.168.13.9:6808/515991 conn(0x563b53997800 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
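As mentioned above, this is the kind of probing we plan to do next on a busy OSD (only a sketch; osd.86 is the busy OSD whose log is above, and <osd86-container> is a placeholder for the container name in our setup):

  # ask the busy OSD what it is doing via its admin socket
  docker exec <osd86-container> ceph daemon osd.86 status
  docker exec <osd86-container> ceph daemon osd.86 dump_ops_in_flight
  docker exec <osd86-container> ceph daemon osd.86 perf dump

  # summarize PG states over time to see whether peering/recovery is making progress
  ceph pg dump pgs_brief 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -rn

If there is a better way to see what the busy OSDs are actually reading, or to track progress while norecover is set, pointers would be much appreciated.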