[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
I use Ubiquiti equipment, mainly because I'm not a network admin...  I
rebooted the 10G switches and now everything is working and recovering.  I
hate when there's not a definitive answer but that's kind of the deal when
you use Ubiquiti stuff.  Thank you Sean and Frank.  Frank, you were right.
It made no sense because from a very basic point of view the network seemed
fine, but Sean's ping revealed that it clearly wasn't.

Thank you!
-jeremy


On Mon, Jul 25, 2022 at 3:08 PM Sean Redmond 
wrote:

> Yea, assuming you can ping with a lower MTU, check the MTU on your
> switching.
>
> On Mon, 25 Jul 2022, 23:05 Jeremy Hansen, 
> wrote:
>
>> That results in packet loss:
>>
>> [root@cn01 ~]# ping -M do -s 8972 192.168.30.14
>> PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
>> ^C
>> --- 192.168.30.14 ping statistics ---
>> 3 packets transmitted, 0 received, 100% packet loss, time 2062ms
>>
>> That's very weird...  but this gives me something to figure out.  Hmmm.
>> Thank you.
>>
>> On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond 
>> wrote:
>>
>>> Looks good, just confirm it with a large ping with the don't-fragment
>>> flag set between each host.
>>>
>>> ping -M do -s 8972 [destination IP]
>>>
>>>

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Sean Redmond
Yea, assuming you can ping with a lower MTU, check the MTU on your
switching.
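
A rough sketch of that follow-up check, assuming iputils `ping` on Linux: binary-search the largest don't-fragment payload that still gets a reply. The `probe` and `max_payload` names are made up for illustration. Adding 28 bytes (20 for the IPv4 header, 8 for the ICMP header) to the result gives the MTU the path actually carries.

```shell
#!/bin/sh
# Illustrative helpers; the names are hypothetical, not from the thread.

# probe SIZE HOST: succeeds if one don't-fragment ping with SIZE payload
# bytes gets a reply within one second (iputils ping options).
probe() {
    ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1
}

# max_payload HOST LO HI: print the largest payload in [LO, HI] for which
# probe succeeds, or 0 if none does. Assumes that if a size passes,
# every smaller size passes too.
max_payload() (
    host=$1 lo=$2 hi=$3 best=0
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if probe "$mid" "$host"; then
            best=$mid
            lo=$(( mid + 1 ))
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$best"
)

# Example: max_payload 192.168.30.14 64 8972
```

A result of 1472, for instance, would mean something on the path is still limited to a 1500-byte MTU.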

On Mon, 25 Jul 2022, 23:05 Jeremy Hansen, 
wrote:

> That results in packet loss:
>
> [root@cn01 ~]# ping -M do -s 8972 192.168.30.14
> PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
> ^C
> --- 192.168.30.14 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 2062ms
>
> That's very weird...  but this gives me something to figure out.  Hmmm.
> Thank you.
>
> On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond 
> wrote:
>
>> Looks good, just confirm it with a large ping with the don't-fragment
>> flag set between each host.
>>
>> ping -M do -s 8972 [destination IP]
>>
>>
>> On Mon, 25 Jul 2022, 22:56 Jeremy Hansen, 
>> wrote:
>>
>>> MTU is the same across all hosts:
>>>
>>> - cn01.ceph.la1.clx.corp-
>>> enp2s0: flags=4163  mtu 9000
>>> inet 192.168.30.11  netmask 255.255.255.0  broadcast
>>> 192.168.30.255
>>> inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20
>>> ether 3c:8c:f8:ed:72:8d  txqueuelen 1000  (Ethernet)
>>> RX packets 3163785  bytes 213625 (1.9 GiB)
>>> RX errors 0  dropped 0  overruns 0  frame 0
>>> TX packets 6890933  bytes 40233267272 (37.4 GiB)
>>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> - cn02.ceph.la1.clx.corp-
>>> enp2s0: flags=4163  mtu 9000
>>> inet 192.168.30.12  netmask 255.255.255.0  broadcast
>>> 192.168.30.255
>>> inet6 fe80::3e8c:f8ff:feed:ff0c  prefixlen 64  scopeid 0x20
>>> ether 3c:8c:f8:ed:ff:0c  txqueuelen 1000  (Ethernet)
>>> RX packets 3976256  bytes 2761764486 (2.5 GiB)
>>> RX errors 0  dropped 0  overruns 0  frame 0
>>> TX packets 9270324  bytes 56984933585 (53.0 GiB)
>>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> - cn03.ceph.la1.clx.corp-
>>> enp2s0: flags=4163  mtu 9000
>>> inet 192.168.30.13  netmask 255.255.255.0  broadcast
>>> 192.168.30.255
>>> inet6 fe80::3e8c:f8ff:feed:feba  prefixlen 64  scopeid 0x20
>>> ether 3c:8c:f8:ed:fe:ba  txqueuelen 1000  (Ethernet)
>>> RX packets 13081847  bytes 93614795356 (87.1 GiB)
>>> RX errors 0  dropped 0  overruns 0  frame 0
>>> TX packets 4001854  bytes 2536322435 (2.3 GiB)
>>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> - cn04.ceph.la1.clx.corp-
>>> enp2s0: flags=4163  mtu 9000
>>> inet 192.168.30.14  netmask 255.255.255.0  broadcast
>>> 192.168.30.255
>>> inet6 fe80::3e8c:f8ff:feed:6f89  prefixlen 64  scopeid 0x20
>>> ether 3c:8c:f8:ed:6f:89  txqueuelen 1000  (Ethernet)
>>> RX packets 60018  bytes 5622542 (5.3 MiB)
>>> RX errors 0  dropped 0  overruns 0  frame 0
>>> TX packets 59889  bytes 17463794 (16.6 MiB)
>>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> - cn05.ceph.la1.clx.corp-
>>> enp2s0: flags=4163  mtu 9000
>>> inet 192.168.30.15  netmask 255.255.255.0  broadcast
>>> 192.168.30.255
>>> inet6 fe80::3e8c:f8ff:feed:7245  prefixlen 64  scopeid 0x20
>>> ether 3c:8c:f8:ed:72:45  txqueuelen 1000  (Ethernet)
>>> RX packets 69163  bytes 8085511 (7.7 MiB)
>>> RX errors 0  dropped 0  overruns 0  frame 0
>>> TX packets 73539  bytes 17069869 (16.2 MiB)
>>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> - cn06.ceph.la1.clx.corp-
>>> enp2s0: flags=4163  mtu 9000
>>> inet 192.168.30.16  netmask 255.255.255.0  broadcast
>>> 192.168.30.255
>>> inet6 fe80::3e8c:f8ff:feed:feab  prefixlen 64  scopeid 0x20
>>> ether 3c:8c:f8:ed:fe:ab  txqueuelen 1000  (Ethernet)
>>> RX packets 23570  bytes 2251531 (2.1 MiB)
>>> RX errors 0  dropped 0  overruns 0  frame 0
>>> TX packets 22268  bytes 16186794 (15.4 MiB)
>>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>
>>> 10G.
>>>

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
That results in packet loss:

[root@cn01 ~]# ping -M do -s 8972 192.168.30.14
PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data.
^C
--- 192.168.30.14 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2062ms

That's very weird...  but this gives me something to figure out.  Hmmm.
Thank you.
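
For reference, the 8972 in that command is the 9000-byte MTU minus the 20-byte IPv4 header and the 8-byte ICMP header, which is why ping prints "8972(9000) bytes of data". Below is a minimal sketch for repeating the same test against every host. The helper names are invented for illustration, and it assumes iputils `ping` and IPv4 without header options.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header, no options) minus 8 bytes (ICMP header).
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# check_jumbo MTU HOST...: send three full-size don't-fragment pings to
# each host and report which hosts answer.
check_jumbo() (
    mtu=$1; shift
    size=$(icmp_payload "$mtu")
    for host in "$@"; do
        if ping -M do -c 3 -W 2 -s "$size" "$host" >/dev/null 2>&1; then
            echo "$host: OK at MTU $mtu"
        else
            echo "$host: FAILED at MTU $mtu"
        fi
    done
)

# Example, using the addresses from the ifconfig output in this thread:
# check_jumbo 9000 192.168.30.11 192.168.30.12 192.168.30.13 \
#                  192.168.30.14 192.168.30.15 192.168.30.16
```

Running it from each node in turn covers every host pair, which is what "between each host" asks for.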

On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond 
wrote:

> Looks good, just confirm it with a large ping with the don't-fragment flag
> set between each host.
>
> ping -M do -s 8972 [destination IP]
>
>
> On Mon, 25 Jul 2022, 22:56 Jeremy Hansen, 
> wrote:
>
>> MTU is the same across all hosts:
>>
>> - cn01.ceph.la1.clx.corp-
>> enp2s0: flags=4163  mtu 9000
>> inet 192.168.30.11  netmask 255.255.255.0  broadcast
>> 192.168.30.255
>> inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20
>> ether 3c:8c:f8:ed:72:8d  txqueuelen 1000  (Ethernet)
>> RX packets 3163785  bytes 213625 (1.9 GiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 6890933  bytes 40233267272 (37.4 GiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> - cn02.ceph.la1.clx.corp-
>> enp2s0: flags=4163  mtu 9000
>> inet 192.168.30.12  netmask 255.255.255.0  broadcast
>> 192.168.30.255
>> inet6 fe80::3e8c:f8ff:feed:ff0c  prefixlen 64  scopeid 0x20
>> ether 3c:8c:f8:ed:ff:0c  txqueuelen 1000  (Ethernet)
>> RX packets 3976256  bytes 2761764486 (2.5 GiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 9270324  bytes 56984933585 (53.0 GiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> - cn03.ceph.la1.clx.corp-
>> enp2s0: flags=4163  mtu 9000
>> inet 192.168.30.13  netmask 255.255.255.0  broadcast
>> 192.168.30.255
>> inet6 fe80::3e8c:f8ff:feed:feba  prefixlen 64  scopeid 0x20
>> ether 3c:8c:f8:ed:fe:ba  txqueuelen 1000  (Ethernet)
>> RX packets 13081847  bytes 93614795356 (87.1 GiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 4001854  bytes 2536322435 (2.3 GiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> - cn04.ceph.la1.clx.corp-
>> enp2s0: flags=4163  mtu 9000
>> inet 192.168.30.14  netmask 255.255.255.0  broadcast
>> 192.168.30.255
>> inet6 fe80::3e8c:f8ff:feed:6f89  prefixlen 64  scopeid 0x20
>> ether 3c:8c:f8:ed:6f:89  txqueuelen 1000  (Ethernet)
>> RX packets 60018  bytes 5622542 (5.3 MiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 59889  bytes 17463794 (16.6 MiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> - cn05.ceph.la1.clx.corp-
>> enp2s0: flags=4163  mtu 9000
>> inet 192.168.30.15  netmask 255.255.255.0  broadcast
>> 192.168.30.255
>> inet6 fe80::3e8c:f8ff:feed:7245  prefixlen 64  scopeid 0x20
>> ether 3c:8c:f8:ed:72:45  txqueuelen 1000  (Ethernet)
>> RX packets 69163  bytes 8085511 (7.7 MiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 73539  bytes 17069869 (16.2 MiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> - cn06.ceph.la1.clx.corp-
>> enp2s0: flags=4163  mtu 9000
>> inet 192.168.30.16  netmask 255.255.255.0  broadcast
>> 192.168.30.255
>> inet6 fe80::3e8c:f8ff:feed:feab  prefixlen 64  scopeid 0x20
>> ether 3c:8c:f8:ed:fe:ab  txqueuelen 1000  (Ethernet)
>> RX packets 23570  bytes 2251531 (2.1 MiB)
>> RX errors 0  dropped 0  overruns 0  frame 0
>> TX packets 22268  bytes 16186794 (15.4 MiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> 10G.
>>
>> On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond 
>> wrote:
>>
>>> Is the MTU in the new rack set correctly?
>>>

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
Does Ceph do any kind of I/O fencing if it notices an anomaly?  Do I need to
do something to re-enable these hosts if they get marked as bad?

On Mon, Jul 25, 2022 at 2:56 PM Jeremy Hansen 
wrote:

> MTU is the same across all hosts:
>
> - cn01.ceph.la1.clx.corp-
> enp2s0: flags=4163  mtu 9000
> inet 192.168.30.11  netmask 255.255.255.0  broadcast 192.168.30.255
> inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20
> ether 3c:8c:f8:ed:72:8d  txqueuelen 1000  (Ethernet)
> RX packets 3163785  bytes 213625 (1.9 GiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 6890933  bytes 40233267272 (37.4 GiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> - cn02.ceph.la1.clx.corp-
> enp2s0: flags=4163  mtu 9000
> inet 192.168.30.12  netmask 255.255.255.0  broadcast 192.168.30.255
> inet6 fe80::3e8c:f8ff:feed:ff0c  prefixlen 64  scopeid 0x20
> ether 3c:8c:f8:ed:ff:0c  txqueuelen 1000  (Ethernet)
> RX packets 3976256  bytes 2761764486 (2.5 GiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 9270324  bytes 56984933585 (53.0 GiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> - cn03.ceph.la1.clx.corp-
> enp2s0: flags=4163  mtu 9000
> inet 192.168.30.13  netmask 255.255.255.0  broadcast 192.168.30.255
> inet6 fe80::3e8c:f8ff:feed:feba  prefixlen 64  scopeid 0x20
> ether 3c:8c:f8:ed:fe:ba  txqueuelen 1000  (Ethernet)
> RX packets 13081847  bytes 93614795356 (87.1 GiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 4001854  bytes 2536322435 (2.3 GiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> - cn04.ceph.la1.clx.corp-
> enp2s0: flags=4163  mtu 9000
> inet 192.168.30.14  netmask 255.255.255.0  broadcast 192.168.30.255
> inet6 fe80::3e8c:f8ff:feed:6f89  prefixlen 64  scopeid 0x20
> ether 3c:8c:f8:ed:6f:89  txqueuelen 1000  (Ethernet)
> RX packets 60018  bytes 5622542 (5.3 MiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 59889  bytes 17463794 (16.6 MiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> - cn05.ceph.la1.clx.corp-
> enp2s0: flags=4163  mtu 9000
> inet 192.168.30.15  netmask 255.255.255.0  broadcast 192.168.30.255
> inet6 fe80::3e8c:f8ff:feed:7245  prefixlen 64  scopeid 0x20
> ether 3c:8c:f8:ed:72:45  txqueuelen 1000  (Ethernet)
> RX packets 69163  bytes 8085511 (7.7 MiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 73539  bytes 17069869 (16.2 MiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> - cn06.ceph.la1.clx.corp-
> enp2s0: flags=4163  mtu 9000
> inet 192.168.30.16  netmask 255.255.255.0  broadcast 192.168.30.255
> inet6 fe80::3e8c:f8ff:feed:feab  prefixlen 64  scopeid 0x20
> ether 3c:8c:f8:ed:fe:ab  txqueuelen 1000  (Ethernet)
> RX packets 23570  bytes 2251531 (2.1 MiB)
> RX errors 0  dropped 0  overruns 0  frame 0
> TX packets 22268  bytes 16186794 (15.4 MiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>
> 10G.
>
> On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond 
> wrote:
>
>> Is the MTU in the new rack set correctly?
>>
>> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, 
>> wrote:
>>
>>> I transitioned some servers to a new rack and now I'm having major issues
>>> with Ceph upon bringing things back up.
>>>
>>> I believe the issue may be related to the ceph nodes coming back up with
>>> different IPs before VLANs were set.  That's just a guess because I can't
>>> think of any other reason this would happen.
>>>
>>> Current state:
>>>
>>> Every 2.0s: ceph -s
>>>cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
>>>
>>>   cluster:
>>> id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>>> health: HEALTH_WARN
>>> 1 filesystem is degraded
>>> 2 MDSs report slow metadata IOs
>>> 2/5 mons down, quorum cn02,cn03,cn01
>>> 9 osds down
>>> 3 hosts (17 osds) down
>>> Reduced data availability: 97 pgs inactive, 9 pgs down
>>> Degraded data redundancy: 13860144/30824413 objects degraded
>>> (44.965%), 411 pgs degraded, 482 pgs undersized
>>>
>>>   services:
>>> mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05,
>>> cn04
>>> mgr: cn02.arszct(active, since 5m)
>>> mds: 2/2 daemons up, 2 standby
>>> osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
>>>
>>>   data:
>>> volumes: 1/2 healthy, 1 recovering
>>> pools:   8 pools, 545 pgs
>>> objects: 7.71M objects, 6.7 TiB
>>> usage:   15 TiB used, 39 TiB / 54 TiB avail
>>> pgs: 0.367% pgs unknown
>>>  

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
MTU is the same across all hosts:

- cn01.ceph.la1.clx.corp-
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.30.11  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20<link>
        ether 3c:8c:f8:ed:72:8d  txqueuelen 1000  (Ethernet)
        RX packets 3163785  bytes 213625 (1.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6890933  bytes 40233267272 (37.4 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

- cn02.ceph.la1.clx.corp-
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.30.12  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:ff0c  prefixlen 64  scopeid 0x20<link>
        ether 3c:8c:f8:ed:ff:0c  txqueuelen 1000  (Ethernet)
        RX packets 3976256  bytes 2761764486 (2.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9270324  bytes 56984933585 (53.0 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

- cn03.ceph.la1.clx.corp-
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.30.13  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:feba  prefixlen 64  scopeid 0x20<link>
        ether 3c:8c:f8:ed:fe:ba  txqueuelen 1000  (Ethernet)
        RX packets 13081847  bytes 93614795356 (87.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4001854  bytes 2536322435 (2.3 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

- cn04.ceph.la1.clx.corp-
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.30.14  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:6f89  prefixlen 64  scopeid 0x20<link>
        ether 3c:8c:f8:ed:6f:89  txqueuelen 1000  (Ethernet)
        RX packets 60018  bytes 5622542 (5.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 59889  bytes 17463794 (16.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

- cn05.ceph.la1.clx.corp-
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.30.15  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:7245  prefixlen 64  scopeid 0x20<link>
        ether 3c:8c:f8:ed:72:45  txqueuelen 1000  (Ethernet)
        RX packets 69163  bytes 8085511 (7.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 73539  bytes 17069869 (16.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

- cn06.ceph.la1.clx.corp-
enp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.30.16  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:feab  prefixlen 64  scopeid 0x20<link>
        ether 3c:8c:f8:ed:fe:ab  txqueuelen 1000  (Ethernet)
        RX packets 23570  bytes 2251531 (2.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22268  bytes 16186794 (15.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

10G.
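
A per-host listing like the one above could be gathered in one pass with something along these lines. This is only a sketch: it assumes passwordless SSH as root to each node and the `enp2s0` interface name shown above, and the function names are invented for illustration. Note that it only confirms what each host's interface is configured for; it cannot tell whether the switches in between pass frames of that size.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Read ifconfig- or "ip link"-style output on stdin and print the MTU value.
parse_mtu() {
    sed -n 's/.*[[:space:]]mtu[[:space:]]\{1,\}\([0-9]\{1,\}\).*/\1/p'
}

# report_mtu IFACE HOST...: print "host mtu" for IFACE on each host,
# or "host unreachable" if nothing could be read.
report_mtu() (
    iface=$1; shift
    for host in "$@"; do
        mtu=$(ssh -n -o BatchMode=yes "root@$host" \
              ip -o link show "$iface" 2>/dev/null | parse_mtu)
        echo "$host ${mtu:-unreachable}"
    done
)

# Example: report_mtu enp2s0 192.168.30.11 192.168.30.12 192.168.30.13
```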

On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond 
wrote:

> Is the MTU in the new rack set correctly?
>
> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, 
> wrote:
>
>> I transitioned some servers to a new rack and now I'm having major issues
>> with Ceph upon bringing things back up.
>>
>> I believe the issue may be related to the ceph nodes coming back up with
>> different IPs before VLANs were set.  That's just a guess because I can't
>> think of any other reason this would happen.
>>
>> Current state:
>>
>> Every 2.0s: ceph -s
>>cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
>>
>>   cluster:
>> id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>> health: HEALTH_WARN
>> 1 filesystem is degraded
>> 2 MDSs report slow metadata IOs
>> 2/5 mons down, quorum cn02,cn03,cn01
>> 9 osds down
>> 3 hosts (17 osds) down
>> Reduced data availability: 97 pgs inactive, 9 pgs down
>> Degraded data redundancy: 13860144/30824413 objects degraded
>> (44.965%), 411 pgs degraded, 482 pgs undersized
>>
>>   services:
>> mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05,
>> cn04
>> mgr: cn02.arszct(active, since 5m)
>> mds: 2/2 daemons up, 2 standby
>> osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
>>
>>   data:
>> volumes: 1/2 healthy, 1 recovering
>> pools:   8 pools, 545 pgs
>> objects: 7.71M objects, 6.7 TiB
>> usage:   15 TiB used, 39 TiB / 54 TiB avail
>> pgs: 0.367% pgs unknown
>>  17.431% pgs not active
>>  13860144/30824413 objects degraded (44.965%)
>>  1137693/30824413 objects misplaced (3.691%)
>>  280 active+undersized+degraded
>>  67  undersized+degraded+remapped+backfilling+peered
>>  57  active+undersized+remapped
>>  45  active+clean+remapped
>>  44  

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
Here's some more info:

HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2
filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down,
quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data
availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy:
8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs
undersized
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
daemon osd.3 on cn01.ceph is in error state
daemon osd.2 on cn01.ceph is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check
host cn04.ceph (192.168.30.14) failed check: Failed to connect to
cn04.ceph (192.168.30.14).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14

To check that the host is reachable open a new shell with the --no-hosts
flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14
host cn06.ceph (192.168.30.16) failed check: Failed to connect to
cn06.ceph (192.168.30.16).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.16

To check that the host is reachable open a new shell with the --no-hosts
flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.16
host cn05.ceph (192.168.30.15) failed check: Failed to connect to
cn05.ceph (192.168.30.15).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.15

To check that the host is reachable open a new shell with the --no-hosts
flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.15
[WRN] FS_DEGRADED: 2 filesystems are degraded
fs coldlogix is degraded
fs btc is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30
secs, oldest blocked for 2096 secs
[WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
mon.cn05 (rank 0) addr [v2:192.168.30.15:3300/0,v1:192.168.30.15:6789/0]
is down (out of quorum)
mon.cn04 (rank 3) addr [v2:192.168.30.14:3300/0,v1:192.168.30.14:6789/0]
is down (out of quorum)
[WRN] OSD_DOWN: 10 osds down
osd.0 (root=default,host=cn05) is down
osd.1 (root=default,host=cn06) is down
osd.7 (root=default,host=cn04) is down
osd.13 (root=default,host=cn06) is down
osd.15 (root=default,host=cn05) is down
osd.18 (root=default,host=cn04) is down
osd.20 (root=default,host=cn04) is down
osd.33 (root=default,host=cn06) is down
osd.34 (root=default,host=cn06) is down
osd.36 (root=default,host=cn05) is down
[WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
host cn04 (root=default) (6 osds) is down
host cn05 (root=default) (5 osds) is down
host cn06 (root=default) (6 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs
down
pg 9.3a is down, acting [8]
pg 9.7a is down, acting [8]
pg 9.ba is down, acting [8]
pg 9.fa is down, acting [8]
pg 11.3 is stuck inactive for 39h, current state
undersized+degraded+peered, last acting [11]
pg 11.11 is down, acting [19,9]
pg 11.1f is stuck inactive for 13h, current state
undersized+degraded+peered, last acting [10]
pg 12.36 is down, acting [21,16]
pg 12.59 is down, acting [26,5]
pg 12.66 is down, acting [5]
pg 19.4 is stuck inactive for 39h, current state
undersized+degraded+peered, last acting [6]
pg 19.1c is down, acting [21,16,11]
pg 21.1 is stuck inactive for 36m, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects
degraded (27.593%), 326 pgs degraded, 447 pgs undersized
pg 9.75 is stuck undersized for 34m, current state
active+undersized+remapped, last acting [4,8,35]
pg 9.76 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [35,10,21]
pg 9.77 is stuck undersized for 34m, current state
active+undersized+remapped, last acting [32,35,4]
pg 
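
Output of this kind (it looks like `ceph health detail`) is easier to read once the down OSDs are tallied per host. A small sketch, with an invented function name, that counts the `osd.N (...) is down` lines by host:

```shell
#!/bin/sh
# Sketch only: the function name is hypothetical.

# Read health-detail text on stdin and print "host count" for every host
# that has at least one "osd.N (root=...,host=H) is down" line.
# The "host H (...) is down" summary lines are deliberately not counted.
down_osds_per_host() {
    sed -n 's/^ *osd\.[0-9]* (.*host=\([^)]*\)) is down$/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}

# Example: ceph health detail | down_osds_per_host
```

Applied to the OSD_DOWN list above, this gives cn04 3, cn05 3 and cn06 4, which are exactly the three hosts that cephadm reports as unreachable.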

[ceph-users] Re: Issues after a shutdown

2022-07-25 Thread Jeremy Hansen
Pretty desperate here.  Can someone suggest what I might be able to do to
get these OSDs back up?  It looks like my recovery has stalled.


On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri 
wrote:

> Do your values for public and cluster network include the new addresses on
> all nodes?
>

This cluster only has one network.  There is no separation between
public and cluster.  Three of the nodes momentarily came up using a
different IP address.

I've also noticed that on one of the nodes that did not move or have any IP
issue, the dashboard names the same device (sdb) for two different OSDs:

2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb osd.2

3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown
sdb osd.3


[ceph: root@cn01 /]# ceph-volume inventory

Device Path   Size       rotates  available  Model name
/dev/sda      3.64 TB    True     False      MG04SCA40EE
/dev/sdb      3.49 TB    False    False      MZILT3T8HBLS/007
/dev/sdc      3.64 TB    True     False      MG04SCA40EE
/dev/sdd      3.64 TB    True     False      MG04SCA40EE
/dev/sde      3.49 TB    False    False      MZILT3T8HBLS/007
/dev/sdf      3.64 TB    True     False      MG04SCA40EE
/dev/sdg      698.64 GB  True     False      SEAGATE ST375064

[ceph: root@cn01 /]# ceph osd info
osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688
last_clean_interval [25500,30228) [v2:
192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2:
192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421]
autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a
osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697
last_clean_interval [25518,30321) [v2:
192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2:
192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831]
autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7
osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317
last_clean_interval [31218,31296) [v2:
192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2:
192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880]
destroyed,exists
osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268
last_clean_interval [31254,31256) [v2:
192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2:
192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535]
destroyed,exists
osd.4 up   in  weight 1 up_from 31356 up_thru 31581 down_at 31339
last_clean_interval [31320,31338) [v2:
192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2:
192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] exists,up
3afd06db-b91d-44fe-9305-5eb95f7a59b9
osd.5 up   in  weight 1 up_from 31347 up_thru 31699 down_at 31339
last_clean_interval [31311,31338) [v2:
192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2:
192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] exists,up
063c2ccf-02ce-4f5e-8252-dddfbb258a95
osd.6 up   in  weight 1 up_from 31218 up_thru 31711 down_at 31217
last_clean_interval [30978,31195) [v2:
192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2:
192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] exists,up
94250ea2-f12e-4dc6-9135-b626086ccffd
osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688
last_clean_interval [25533,30349) [v2:
192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2:
192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061]
autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579
osd.8 up   in  weight 1 up_from 31226 up_thru 31668 down_at 31225
last_clean_interval [30983,31195) [v2:
192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2:
192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] exists,up
51f665b4-fa5b-4b17-8390-ed130145ef04
osd.9 up   in  weight 1 up_from 31351 up_thru 31673 down_at 31340
last_clean_interval [31315,31338) [v2:
192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2:
192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] exists,up
985f1127-d126-4629-b8cd-03cf2d914d99
osd.10 up   in  weight 1 up_from 31219 up_thru 31639 down_at 31218
last_clean_interval [30980,31195) [v2:
192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2:
192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] exists,up
c7fca03e-4bd5-4485-a090-658ca967d5f6
osd.11 up   in  weight 1 up_from 31234 up_thru 31659 down_at 31223
last_clean_interval [30978,31195) [v2:
192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2:
192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] exists,up
81074bd7-ad9f-4e56-8885-cca4745f6c95
osd.12 up   in  weight 1 up_from 31230 up_thru 31717 down_at 31223
last_clean_interval [30975,31195) [v2:
192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2: