This evening I was awakened by an error message:
  cluster:
    id:     9b4468b7-5bf2-4964-8aec-4b2f4bee87ad
    health: HEALTH_ERR
            Module 'telemetry' has failed: ('Connection aborted.', error(101, 'Network is unreachable'))
  services:
I have not seen any other problems with anything
Dan,
This has happened to HDDs also, and NVMe most recently. CentOS 7.7;
the kernel is usually within 6 months of current updates. We try to
stay relatively up to date.
-Troy
On 2/20/20 5:28 PM, Dan van der Ster wrote:
Another thing... in your thread you said that only the *SSDs* in
your cluster had crashed, but not the HDDs.
Both SSDs and HDDs were bluestore? Did the hdds ever crash subsequently?
Which OS/kernel do you run? We're on CentOS 7 with quite a lot of uptime.
On Thu, Feb 20, 2020 at 10:29 PM Troy Ablan wrote:
I hope I don't sound too happy to hear that you've run into this same
problem, but I'm glad to see it's not just a one-off with us. :)
We're still running Mimic. I haven't yet deployed Nautilus anywhere.
Thanks
-Troy
On 2/20/20 2:14 PM, Dan van der Ster wrote:
Thanks Troy for the quick response.
Are you still running mimic on that cluster? Seeing the crashes in nautilus too?
Our cluster is also quite old -- so it could very well be memory or
network gremlins.
Cheers, Dan
On Thu, Feb 20, 2020 at 10:11 PM Troy Ablan wrote:
Dan,
Yes, I have had this happen several times since, but fortunately the
last couple of times it has only hit one or two OSDs at a time, so
it didn't take down entire pools. The remedy has been the same.
I had been holding off on too much further investigation because I
thought the source
Hi Troy,
Looks like we hit the same today -- Sage posted some observations
here: https://tracker.ceph.com/issues/39525#note-6
Did it happen again in your cluster?
Cheers, Dan
On Tue, Aug 20, 2019 at 2:18 AM Troy Ablan wrote:
>
> While I'm still unsure how this happened, this is what was done
On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander wrote:
> On 20 Feb 2020 at 19:54, Dan van der Ster wrote
> the following:
>
> For those following along, the issue is here:
> https://tracker.ceph.com/issues/39525#note-6
>
> Somehow single bits are getting flipped in the osdmaps -- maybe
> network, maybe memory, maybe a bug.
>
Weird!
For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6
Somehow single bits are getting flipped in the osdmaps -- maybe
network, maybe memory, maybe a bug.
To get an osd starting, we have to extract the full osdmap from the
mon, and set it into the crashing osd.
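Once you have both copies of the same map epoch on disk (the full map from the mon and the corrupted copy stored on the crashing OSD), a byte-wise XOR locates the flipped bit. A minimal sketch, assuming the two blobs are already extracted to files or bytes; the helper name is ours, not a Ceph tool:

```python
# Locate flipped bits between two copies of the same osdmap epoch,
# e.g. the full map fetched from the mon vs. the copy stored on the
# crashing OSD. A healthy comparison returns an empty list.

def find_bit_flips(good: bytes, bad: bytes):
    """Return (byte_offset, bit_index) for every bit that differs."""
    assert len(good) == len(bad), "maps must be the same length"
    flips = []
    for off, (a, b) in enumerate(zip(good, bad)):
        diff = a ^ b
        while diff:
            bit = (diff & -diff).bit_length() - 1  # lowest set bit
            flips.append((off, bit))
            diff &= diff - 1  # clear that bit and keep scanning

    return flips

if __name__ == "__main__":
    good = bytes.fromhex("aabbccdd")
    bad = bytearray(good)
    bad[2] ^= 0x04  # simulate a single-bit flip in byte 2
    print(find_bit_flips(good, bytes(bad)))  # [(2, 2)]
```

A single entry in the result is consistent with the memory/network corruption theory; many scattered entries would point at a bug in the encode path instead.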
osd.680 is at epoch 2983572
osd.666 crashes at 2982809 or 2982808
-407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl
2982809 612069 bytes
-407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
in thread 7f4d931b5b80 thread_name:ceph-osd
So I grabbed 2982809 and 298
On 2/20/20 12:40 PM, Dan van der Ster wrote:
Hi,
My turn.
We suddenly have a big outage which is similar/identical to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
Some of the osds are runnable, but most crash when they start -- crc
error in osdmap::decode.
I'm able to extract an osd map from a good osd and it
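The crc error in osdmap::decode is exactly what a one-bit flip should produce: the encoded map carries a checksum, and any single-bit change makes it mismatch, so decode aborts rather than apply a corrupt map. A rough illustration of that property (Ceph actually uses crc32c; `zlib.crc32` is a stand-in here with the same sensitivity, and the payload is made up):

```python
import zlib

# Illustration: flipping one bit in an encoded blob changes its CRC,
# which is why the OSD refuses to decode the corrupted osdmap.
payload = b"pretend this is an encoded osdmap"
stored_crc = zlib.crc32(payload)

corrupted = bytearray(payload)
corrupted[7] ^= 0x10  # flip a single bit

# The recomputed checksum no longer matches the stored one.
assert zlib.crc32(bytes(corrupted)) != stored_crc
print("crc mismatch detected -> decode would abort")
```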