[ceph-users] Re: Issues after a shutdown
I use Ubiquiti equipment, mainly because I'm not a network admin... I rebooted the 10G switches and now everything is working and recovering. I hate it when there's no definitive answer, but that's kind of the deal when you use Ubiquiti stuff.

Thank you, Sean and Frank. Frank, you were right. It made no sense because, from a very basic point of view, the network seemed fine, but Sean's ping test revealed that it clearly wasn't. Thank you!

-jeremy

On Mon, Jul 25, 2022 at 3:08 PM Sean Redmond wrote:

> Yea, assuming you can ping with a lower MTU, check the MTU on your
> switching.
>
> On Mon, 25 Jul 2022, 23:05 Jeremy Hansen, wrote:
>
>> That results in packet loss:
>>
>> [root@cn01 ~]# ping -M do -s 8972 192.168.30.14
>> PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
>> ^C
>> --- 192.168.30.14 ping statistics ---
>> 3 packets transmitted, 0 received, 100% packet loss, time 2062ms
>>
>> That's very weird... but this gives me something to figure out. Hmmm.
>> Thank you.
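For anyone cleaning up after a similar rack move: a quick loop like the sketch below, run from any one node, repeats the don't-fragment jumbo ping Sean suggested against every cluster address, so a bad switch port or VLAN shows up immediately. The address list is just this cluster's 192.168.30.x hosts; adjust it for your own environment.

    for ip in 192.168.30.11 192.168.30.12 192.168.30.13 192.168.30.14 192.168.30.15 192.168.30.16; do
        # -M do sets don't-fragment, -s 8972 fills a 9000-byte frame, -c 3 sends three probes
        ping -M do -s 8972 -c 3 -q "$ip" >/dev/null && echo "$ip jumbo OK" || echo "$ip jumbo FAILED"
    done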
[ceph-users] Re: Issues after a shutdown
Yea, assuming you can ping with a lower MTU, check the MTU on your switching.

On Mon, 25 Jul 2022, 23:05 Jeremy Hansen, wrote:

> That results in packet loss:
>
> [root@cn01 ~]# ping -M do -s 8972 192.168.30.14
> PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
> ^C
> --- 192.168.30.14 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 2062ms
>
> That's very weird... but this gives me something to figure out. Hmmm.
> Thank you.
>
> On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond wrote:
>
>> Looks good, just confirm it with a large ping with don't fragment flag
>> set between each host.
>>
>> ping -M do -s 8972 [destination IP]
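Side note on where 8972 comes from: it's the largest ICMP payload that still fits in a 9000-byte IP packet once the headers are subtracted, which is why a clean reply proves the whole path really carries jumbo frames (ping shows this as "8972(9000)"):

    8972 bytes payload + 8 bytes ICMP header + 20 bytes IPv4 header = 9000 bytes (the interface MTU)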
[ceph-users] Re: Issues after a shutdown
That results in packet loss:

[root@cn01 ~]# ping -M do -s 8972 192.168.30.14
PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
^C
--- 192.168.30.14 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2062ms

That's very weird... but this gives me something to figure out. Hmmm. Thank you.

On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond wrote:

> Looks good, just confirm it with a large ping with don't fragment flag set
> between each host.
>
> ping -M do -s 8972 [destination IP]
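One way to narrow down where the big packets die, assuming it's a per-hop MTU mismatch rather than a host setting: tracepath reports the path MTU it discovers toward a destination, so running it between the hosts that fail the jumbo ping points at the offending segment.

    [root@cn01 ~]# tracepath -n 192.168.30.14

A reported pmtu below 9000 on the first hop generally means the switch port or VLAN in between isn't passing jumbo frames.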
[ceph-users] Re: Issues after a shutdown
Does Ceph do any kind of I/O fencing if it notices an anomaly? Do I need to do something to re-enable these hosts if they get marked as bad?
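As far as I understand it, not fencing as such: the monitors mark unreachable OSDs down, and after mon_osd_down_out_interval (10 minutes by default) they are marked out; once the daemons can talk to the cluster again they normally rejoin on their own. A rough sketch of the checks, assuming cephadm-managed OSDs and using an illustrative OSD ID:

    # see which OSDs/hosts the cluster still considers down or out
    ceph osd tree
    # mark an OSD back in if it comes up but stays "out"
    ceph osd in 7
    # restart the daemon on its host if it never comes back by itself
    ceph orch daemon restart osd.7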
[ceph-users] Re: Issues after a shutdown
MTU is the same across all hosts:

- cn01.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.11  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:72:8d  txqueuelen 1000  (Ethernet)
        RX packets 3163785  bytes 213625 (1.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6890933  bytes 40233267272 (37.4 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn02.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.12  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:ff0c  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:ff:0c  txqueuelen 1000  (Ethernet)
        RX packets 3976256  bytes 2761764486 (2.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9270324  bytes 56984933585 (53.0 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn03.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.13  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:feba  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:fe:ba  txqueuelen 1000  (Ethernet)
        RX packets 13081847  bytes 93614795356 (87.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4001854  bytes 2536322435 (2.3 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn04.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.14  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:6f89  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:6f:89  txqueuelen 1000  (Ethernet)
        RX packets 60018  bytes 5622542 (5.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 59889  bytes 17463794 (16.6 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn05.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.15  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:7245  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:72:45  txqueuelen 1000  (Ethernet)
        RX packets 69163  bytes 8085511 (7.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 73539  bytes 17069869 (16.2 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn06.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.16  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:feab  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:fe:ab  txqueuelen 1000  (Ethernet)
        RX packets 23570  bytes 2251531 (2.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22268  bytes 16186794 (15.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

10G.

On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond wrote:

> Is the MTU in the new rack set correctly?
>
> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, wrote:
>
>> I transitioned some servers to a new rack and now I'm having major issues
>> with Ceph upon bringing things back up.
>>
>> I believe the issue may be related to the ceph nodes coming back up with
>> different IPs before VLANs were set. That's just a guess because I can't
>> think of any other reason this would happen.
>>
>> Current state:
>>
>> Every 2.0s: ceph -s          cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
>>
>>   cluster:
>>     id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>>     health: HEALTH_WARN
>>             1 filesystem is degraded
>>             2 MDSs report slow metadata IOs
>>             2/5 mons down, quorum cn02,cn03,cn01
>>             9 osds down
>>             3 hosts (17 osds) down
>>             Reduced data availability: 97 pgs inactive, 9 pgs down
>>             Degraded data redundancy: 13860144/30824413 objects degraded
>> (44.965%), 411 pgs degraded, 482 pgs undersized
>>
>>   services:
>>     mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
>>     mgr: cn02.arszct(active, since 5m)
>>     mds: 2/2 daemons up, 2 standby
>>     osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
>>
>>   data:
>>     volumes: 1/2 healthy, 1 recovering
>>     pools:   8 pools, 545 pgs
>>     objects: 7.71M objects, 6.7 TiB
>>     usage:   15 TiB used, 39 TiB / 54 TiB avail
>>     pgs:     0.367% pgs unknown
>>              17.431% pgs not active
>>              13860144/30824413 objects degraded (44.965%)
>>              1137693/30824413 objects misplaced (3.691%)
>>              280 active+undersized+degraded
>>              67  undersized+degraded+remapped+backfilling+peered
>>              57  active+undersized+remapped
>>              45  active+clean+remapped
>>              44  a
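Given that some nodes briefly came up with the wrong addresses, one general check worth running (a sketch, not cluster-specific advice) is confirming that the monitor map and the addresses the OSDs last registered still point at the intended 192.168.30.x network:

    # mon map: each monitor's v2/v1 address
    ceph mon dump
    # the address each OSD registered the last time it booted
    ceph osd dump | grep "v2:"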
[ceph-users] Re: Issues after a shutdown
Here's some more info:

HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2 filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down, quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy: 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs undersized

[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on cn01.ceph is in error state
    daemon osd.2 on cn01.ceph is in error state

[WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check
    host cn04.ceph (192.168.30.14) failed check: Failed to connect to cn04.ceph (192.168.30.14).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    To add the cephadm SSH key to the host:
    > ceph cephadm get-pub-key > ~/ceph.pub
    > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14
    To check that the host is reachable open a new shell with the --no-hosts flag:
    > cephadm shell --no-hosts
    Then run the following:
    > ceph cephadm get-ssh-config > ssh_config
    > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
    > chmod 0600 ~/cephadm_private_key
    > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14
    host cn06.ceph (192.168.30.16) failed check: Failed to connect to cn06.ceph (192.168.30.16).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    To add the cephadm SSH key to the host:
    > ceph cephadm get-pub-key > ~/ceph.pub
    > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.16
    To check that the host is reachable open a new shell with the --no-hosts flag:
    > cephadm shell --no-hosts
    Then run the following:
    > ceph cephadm get-ssh-config > ssh_config
    > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
    > chmod 0600 ~/cephadm_private_key
    > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.16
    host cn05.ceph (192.168.30.15) failed check: Failed to connect to cn05.ceph (192.168.30.15).
    Please make sure that the host is reachable and accepts connections using the cephadm SSH key
    To add the cephadm SSH key to the host:
    > ceph cephadm get-pub-key > ~/ceph.pub
    > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.15
    To check that the host is reachable open a new shell with the --no-hosts flag:
    > cephadm shell --no-hosts
    Then run the following:
    > ceph cephadm get-ssh-config > ssh_config
    > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
    > chmod 0600 ~/cephadm_private_key
    > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.15

[WRN] FS_DEGRADED: 2 filesystems are degraded
    fs coldlogix is degraded
    fs btc is degraded

[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 2096 secs

[WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
    mon.cn05 (rank 0) addr [v2:192.168.30.15:3300/0,v1:192.168.30.15:6789/0] is down (out of quorum)
    mon.cn04 (rank 3) addr [v2:192.168.30.14:3300/0,v1:192.168.30.14:6789/0] is down (out of quorum)

[WRN] OSD_DOWN: 10 osds down
    osd.0 (root=default,host=cn05) is down
    osd.1 (root=default,host=cn06) is down
    osd.7 (root=default,host=cn04) is down
    osd.13 (root=default,host=cn06) is down
    osd.15 (root=default,host=cn05) is down
    osd.18 (root=default,host=cn04) is down
    osd.20 (root=default,host=cn04) is down
    osd.33 (root=default,host=cn06) is down
    osd.34 (root=default,host=cn06) is down
    osd.36 (root=default,host=cn05) is down

[WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
    host cn04 (root=default) (6 osds) is down
    host cn05 (root=default) (5 osds) is down
    host cn06 (root=default) (6 osds) is down

[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs down
    pg 9.3a is down, acting [8]
    pg 9.7a is down, acting [8]
    pg 9.ba is down, acting [8]
    pg 9.fa is down, acting [8]
    pg 11.3 is stuck inactive for 39h, current state undersized+degraded+peered, last acting [11]
    pg 11.11 is down, acting [19,9]
    pg 11.1f is stuck inactive for 13h, current state undersized+degraded+peered, last acting [10]
    pg 12.36 is down, acting [21,16]
    pg 12.59 is down, acting [26,5]
    pg 12.66 is down, acting [5]
    pg 19.4 is stuck inactive for 39h, current state undersized+degraded+peered, last acting [6]
    pg 19.1c is down, acting [21,16,11]
    pg 21.1 is stuck inactive for 36m, current state unknown, last acting []

[WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs undersized
    pg 9.75 is stuck undersized for 34m, current state active+undersized+remapped, last acting [4,8,35]
    pg 9.76 is stuck undersized for 35m, current state active+undersized+degraded, last acting [35,10,21]
    pg 9.77 is stuck undersized for 34m, current state active+undersized+remapped, last acting [32,35,4]
    pg 9.
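Those three failing host checks line up with the nodes that can't pass jumbo frames, so, as a sketch assuming the usual cephadm workflow, once the network is fixed the same state can be re-checked and the two errored OSD daemons on cn01 kicked without touching any SSH keys:

    ceph health detail                 # CEPHADM_HOST_CHECK_FAILED should clear once hosts are reachable
    ceph orch host ls                  # confirms cn04/cn05/cn06 respond to cephadm again
    ceph orch ps --daemon-type osd     # shows per-host daemon state
    ceph orch daemon restart osd.2     # the two daemons stuck in "error" on cn01
    ceph orch daemon restart osd.3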
[ceph-users] Re: Issues after a shutdown
Pretty desperate here. Can someone suggest what I might be able to do to get these OSDs back up? It looks like my recovery has stalled.

On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri wrote:

> Do your values for public and cluster network include the new addresses on
> all nodes?

This cluster only has one network. There is no separation between public and cluster. Three of the nodes momentarily came up using a different IP address.

I've also noticed that on one of the nodes that did not move or have any IP issue, the dashboard names the same device for two different OSDs:

2  cn01  out  destroyed  hdd  TOSHIBA_MG04SCA40EE_21M0A0CKFWZB          Unknown  sdb  osd.2
3  cn01  out  destroyed  ssd  SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159   Unknown  sdb  osd.3

[ceph: root@cn01 /]# ceph-volume inventory

Device Path    Size       rotates  available  Model name
/dev/sda       3.64 TB    True     False      MG04SCA40EE
/dev/sdb       3.49 TB    False    False      MZILT3T8HBLS/007
/dev/sdc       3.64 TB    True     False      MG04SCA40EE
/dev/sdd       3.64 TB    True     False      MG04SCA40EE
/dev/sde       3.49 TB    False    False      MZILT3T8HBLS/007
/dev/sdf       3.64 TB    True     False      MG04SCA40EE
/dev/sdg       698.64 GB  True     False      SEAGATE ST375064

[ceph: root@cn01 /]# ceph osd info
osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 last_clean_interval [25500,30228) [v2:192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2:192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a
osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 last_clean_interval [25518,30321) [v2:192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2:192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7
osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 last_clean_interval [31218,31296) [v2:192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2:192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] destroyed,exists
osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 last_clean_interval [31254,31256) [v2:192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2:192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] destroyed,exists
osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 last_clean_interval [31320,31338) [v2:192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2:192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] exists,up 3afd06db-b91d-44fe-9305-5eb95f7a59b9
osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 last_clean_interval [31311,31338) [v2:192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2:192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] exists,up 063c2ccf-02ce-4f5e-8252-dddfbb258a95
osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 last_clean_interval [30978,31195) [v2:192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2:192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] exists,up 94250ea2-f12e-4dc6-9135-b626086ccffd
osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 last_clean_interval [25533,30349) [v2:192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2:192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579
osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 last_clean_interval [30983,31195) [v2:192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2:192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] exists,up 51f665b4-fa5b-4b17-8390-ed130145ef04
osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 last_clean_interval [31315,31338) [v2:192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2:192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] exists,up 985f1127-d126-4629-b8cd-03cf2d914d99
osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 last_clean_interval [30980,31195) [v2:192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2:192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] exists,up c7fca03e-4bd5-4485-a090-658ca967d5f6
osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 last_clean_interval [30978,31195) [v2:192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2:192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] exists,up 81074bd7-ad9f-4e56-8885-cca4745f6c95
osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 last_clean_interval [30975,31195) [v2:192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910
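A few read-only checks that may help untangle this, offered as a sketch rather than cluster-specific advice; the OSD IDs are just the ones from the listing above. The first pair answers Anthony's question about which networks the cluster is configured to use, the second shows which host and physical device each OSD record actually points at, which should clear up the duplicate "sdb" entries in the dashboard.

    # what network(s) the cluster thinks it should use (empty output means it was never set)
    ceph config get mon public_network
    ceph config dump | grep -i network

    # which host/device a given OSD was created on
    ceph osd metadata 2 | grep -E '"hostname"|"devices"'
    ceph osd metadata 3 | grep -E '"hostname"|"devices"'

    # on the node itself, list the LVM volumes backing each OSD
    cephadm shell -- ceph-volume lvm list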