[ceph-users] Module 'telemetry' has experienced an error

2020-02-20 Thread alexander . v . litvak
This evening I was awakened by an error message:

    cluster:
      id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
      health: HEALTH_ERR
              Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))
    services:

I have not seen any other problems with anything
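For anyone hitting the same HEALTH_ERR: once connectivity to the telemetry endpoint is back, the usual recovery is simply to bounce the mgr module. A minimal sketch, not from the thread, using only stock ceph CLI commands:

    # show the failure recorded against the module
    ceph health detail

    # restart the telemetry module to clear its failed state
    ceph mgr module disable telemetry
    ceph mgr module enable telemetry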

[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2020-02-20 Thread Troy Ablan
Dan, This has happened to HDDs also, and NVMe most recently. CentOS 7.7; usually the kernel is within 6 months of current updates. We try to stay relatively up to date. -Troy On 2/20/20 5:28 PM, Dan van der Ster wrote: Another thing... in your thread you said that only the *SSDs* in

[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2020-02-20 Thread Dan van der Ster
Another thing... in your thread you said that only the *SSDs* in your cluster had crashed, but not the HDDs. Both SSDs and HDDs were bluestore? Did the HDDs ever crash subsequently? Which OS/kernel do you run? We're on CentOS 7 with quite some uptime. On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan

[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2020-02-20 Thread Troy Ablan
I hope I don't sound too happy to hear that you've run into this same problem, but I'm still glad to see that it's not just a one-off on our side. :) We're still running Mimic. I haven't yet deployed Nautilus anywhere. Thanks -Troy On 2/20/20 2:14 PM, Dan van der Ster wrote: Thanks Troy

[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2020-02-20 Thread Dan van der Ster
Thanks Troy for the quick response. Are you still running Mimic on that cluster? Are you seeing the crashes in Nautilus too? Our cluster is also quite old -- so it could very well be memory or network gremlins. Cheers, Dan On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan wrote: > > Dan, > > Yes, I have had

[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2020-02-20 Thread Troy Ablan
Dan, Yes, I have had this happen several times since, but fortunately the last couple of times it has only happened to one or two OSDs at a time, so it didn't take down entire pools. The remedy has been the same. I had been holding off on further investigation because I thought the source

[ceph-users] Re: RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2020-02-20 Thread Dan van der Ster
Hi Troy, Looks like we hit the same thing today -- Sage posted some observations here: https://tracker.ceph.com/issues/39525#note-6 Did it happen again in your cluster? Cheers, Dan On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan wrote: > > While I'm still unsure how this happened, this is what was done

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander wrote: > > > On 20 Feb 2020 at 19:54, Dan van der Ster wrote the following: > > > > For those following along, the issue is here: > > https://tracker.ceph.com/issues/39525#note-6 > > > > Somehow single bits are getting flipped in

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Wido den Hollander
> On 20 Feb 2020 at 19:54, Dan van der Ster wrote the following: > > For those following along, the issue is here: > https://tracker.ceph.com/issues/39525#note-6 > > Somehow single bits are getting flipped in the osdmaps -- maybe > network, maybe memory, maybe a bug. > Weird!

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
For those following along, the issue is here: https://tracker.ceph.com/issues/39525#note-6 Somehow single bits are getting flipped in the osdmaps -- maybe network, maybe memory, maybe a bug. To get an osd starting, we have to extract the full osdmap from the mon, and set it into the crashing osd.
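The workaround Dan describes maps onto two stock commands. A minimal sketch, assuming an illustrative crashing osd.666 with a corrupt copy of epoch 2982809 and the default data path (stop the OSD before touching its store):

    # fetch the full map for the bad epoch from the mons
    ceph osd getmap 2982809 -o osdmap.2982809

    # overwrite the OSD's stored copy with the good one, then restart the OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-666 \
        --op set-osdmap --file osdmap.2982809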

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
osd.680 is at epoch 2983572; osd.666 crashes at 2982809 or 2982808:

    -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl 2982809 612069 bytes
    -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) ** in thread 7f4d931b5b80 thread_name:ceph-osd

So I grabbed 2982809 and 2982808
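A sketch of how the two copies of an epoch can be pulled and compared, again with illustrative ids and paths; ceph-objectstore-tool reads the map straight out of the stopped OSD's store, and cmp -l then prints one line per differing byte, so a lone entry points at a flipped bit:

    # the suspect epoch as the crashing OSD stored it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-666 \
        --op get-osdmap --epoch 2982809 --file osd666-2982809.map

    # the same epoch as the mons know it
    ceph osd getmap 2982809 -o mon-2982809.map

    # list differing bytes between the two copies
    cmp -l osd666-2982809.map mon-2982809.map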

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Wido den Hollander
On 2/20/20 12:40 PM, Dan van der Ster wrote: > Hi, > > My turn. > We suddenly have a big outage which is similar/identical to > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html > > Some of the osds are runnable, but most crash when they start -- crc > error in osdmap

[ceph-users] osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
Hi, My turn. We suddenly have a big outage which is similar/identical to http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html Some of the osds are runnable, but most crash when they start -- crc error in osdmap::decode. I'm able to extract an osd map from a good osd and it
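A sketch of that extraction step, with an illustrative healthy osd.680 (the OSD has to be stopped while ceph-objectstore-tool holds its store):

    # dump the osdmap from a good OSD's store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-680 \
        --op get-osdmap --file good.map

    # confirm the extracted map decodes cleanly before injecting it anywhere
    osdmaptool --print good.map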