I recently updated one of the hosts (an older Dell PowerEdge R515) in my Ceph 
Quincy (17.2.6) cluster. I needed to change the IP address, so I removed the 
host from the cluster (gracefully removed OSDs and daemons, then removed the 
host). I also took the opportunity to upgrade the host from Rocky 8.7 to 9.2 
before re-joining it to the cluster with cephadm. I zapped the storage, so for all intents and purposes it should have been a completely clean install, and the re-join itself went smoothly. I have two other hosts (new Dell PowerEdge R450s) running Rocky 9.2 with no problems, and before the upgrade the R515 host was well-behaved and unremarkable.

Our cluster is connected to our internal network, and has a 10G private network 
used for interconnect between the nodes.
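
For completeness, the networking sanity checks I plan to re-run look roughly like this (the subnet/interface details are ours, so adjust to taste):

# from an admin node:
ceph config get osd public_network
ceph config get osd cluster_network

# on the rebuilt host:
ip -br addr show                  # does the 10G interface have the expected address?
ss -tlnp | grep ceph-osd          # are the OSDs listening on both networks?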

Since the upgrade, the OSDs on the R515 host regularly drop out of the cluster after a period of minutes to hours (usually a few hours). Restarting the OSDs brings them straight back into the cluster, which returns to HEALTH_OK after a short period. The OSD logs show:

Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: log_channel(cluster) log [WRN] : Monitor daemon marked osd.9 down, but it is still running
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: log_channel(cluster) log [DBG] : map e17993 wrongly marked me down at e17988
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 17993 start_waiting_for_healthy
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 pg_epoch: 17988 pg[16.f( v 16315'2765702 (15197'2757704,16315'2765702] local-lis/les=17902/17903 n=188 ec=211/211 lis/c=17902/17902 les/c/f=17903/17903/0 sis=17988 pruub=8.000660896s) [23,18] r=-1 lpr=1798>
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 17993 is_healthy false -- only 0/10 up peers (less than 33%)
Aug 15 06:53:57 ceph99.cecnet.gmu.edu ceph-osd[193725]: osd.9 17993 not healthy; waiting to boot

The MON logs show:

Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.9 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.9 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.10 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.10 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.3)
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.9 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.10 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.3
Aug 15 06:53:53 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 reported immediately failed by osd.3
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: mon.os-storage-1@1(peon).osd e17989 e17989: 26 total, 23 up, 26 in
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: mon.os-storage-1@1(peon).osd e17989 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 339738624 full_alloc: 356515840 kv_alloc: 318767104
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: 15.13 scrub starts
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Health check failed: 4 osds down (OSD_DOWN)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Health check failed: 2 hosts (4 osds) down (OSD_HOST_DOWN)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osdmap e17988: 26 total, 22 up, 26 in
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 marked itself dead as of e17988
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: from='mgr.16700599 10.192.126.85:0/2567473893' entity='mgr.os-storage.cecnet.gmu.edu.mouglb' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: pgmap v121406: 374 pgs: 3 peering, 20 stale+active+clean, 3 active+remapped+backfilling, 348 active+clean; 2.1 TiB data, 4.1 TiB used, 76 TiB / 80 TiB avail; 102 B/s rd, 338 KiB/s wr, 6 op/s; 36715/1268774 objects >
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Health check cleared: OSD_HOST_DOWN (was: 2 hosts (4 osds) down)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: from='mgr.16700599 10.192.126.85:0/2567473893' entity='mgr.os-storage.cecnet.gmu.edu.mouglb' cmd=[{"prefix": "osd metadata", "id": 11}]: dispatch
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 [v2:10.192.126.76:6808/3768579449,v1:10.192.126.76:6809/3768579449] boot
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osdmap e17989: 26 total, 23 up, 26 in
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.12 marked itself dead as of e17989
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.24
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 failed (root=default,pod=openstack,host=ceph99) (connection refused reported by osd.24)
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.5
Aug 15 06:53:54 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: osd.11 reported immediately failed by osd.3
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: mon.os-storage-1@1(peon).osd e17990 e17990: 26 total, 23 up, 26 in
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Monitor daemon marked osd.11 down, but it is still running
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: map e17988 wrongly marked me down at e17988
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: 25.16 continuing backfill to osd.22 from (17451'16681348,17987'16685608] 25:6977376b:::rbd_data.4040a64151e028.000000000000a3c3:head to 17987'16685608
Aug 15 06:53:55 os-storage-1.cecnet.gmu.edu ceph-mon[4684]: Monitor daemon marked osd.12 down, but it is still running

The system logs show no problems around the time of the drop.
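
Since the peers report "connection refused" rather than a heartbeat timeout, my plan for the next occurrence is to check, from a healthy host, whether the marked-down OSD's ports are actually reachable, and to look at that OSD's own view of its heartbeat peers. Roughly (the address/ports below are osd.11's from the boot line above; they change on restart, so I'll grab the current ones from the log or from ceph osd find at the time):

# from a healthy host, while the OSDs are marked down:
nc -zv 10.192.126.76 6808         # osd.11's v2 messenger port from the boot line above
nc -zv 10.192.126.76 6809         # osd.11's v1 messenger port

# on ceph99 (assuming I have the cephadm incantation right):
cephadm shell --name osd.11 -- ceph daemon osd.11 dump_osd_network 0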

My best guess at the moment is a networking issue related to podman, but I haven't found any evidence of one. The ethtool counters on the NIC look clean:

[root@ceph99 ~]# ethtool -S enp3s0f0 | egrep 'error|drop|timeout'
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_timeout_count: 0
     rx_length_errors: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_csum_offload_errors: 0
     tx_hwtstamp_timeouts: 0
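
Since "connection refused" generally means an active TCP reset (nothing listening on that port, or a firewall REJECT) rather than packets being dropped, the other things I want to rule out on the fresh Rocky 9 install are firewalld/nftables getting in the way (cephadm is supposed to open the OSD ports when firewalld is running, but it's cheap to confirm) and, less likely for a refusal, an MTU mismatch on the 10G link:

# on ceph99:
firewall-cmd --state
firewall-cmd --list-all
nft list ruleset | grep -ci reject
ip link show enp3s0f0 | grep -o 'mtu [0-9]*'
ping -M do -s 8972 -c 3 <peer 10G address>    # only relevant if we're meant to be running jumbo frames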

So far I have:
- Updated the Intel X550-T NIC's firmware and driver (ixgbe) to the latest from 
Intel
- Reverted the kernel, podman and NetworkManager packages to match the other 
Rocky 9.2 hosts that are working
- Reverted the Intel driver to the ixgbe included in the kernel
- Sworn, cried, and pleaded with the gods to spare me further anguish, to no avail.
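
The next time the OSDs drop I'm also planning to leave a capture running, so I can see where any TCP resets are coming from (resets generated on the host itself would point at a firewall rule or the daemons). Something along these lines, repeated for the 10G interface; 6800-7300 is the default OSD port range:

# on ceph99:
tcpdump -ni enp3s0f0 -w /tmp/osd-drop.pcap 'portrange 6800-7300 and tcp[tcpflags] & tcp-rst != 0'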

Other, possibly relevant, information:
- There are no MON or MGR daemons on this host; just node exporter, crash, alerter and promtail.
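
Since my suspicion keeps coming back to podman, the last cheap check on my list is confirming that cephadm and podman agree on what's deployed there, and that the OSD containers really are using host networking rather than a podman bridge (ceph orch from an admin node; the container name is whatever podman ps reports for an OSD):

# from an admin node:
ceph orch ps ceph99

# on ceph99:
podman ps --format '{{.Names}}'
podman inspect <osd container name> | grep -i networkmode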

Is there anything else I should be looking at before I remove the host from the 
cluster and re-install Rocky 8.7 (and hope it works again)? This host was used 
while we were standing up the cluster and is due to be retired as we repurpose 
some of our other storage (standalone NFS servers with RAID6) and move them 
into the cluster.