[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Wido den Hollander
On 2/20/20 12:40 PM, Dan van der Ster wrote:
> Hi,
>
> My turn.
> We suddenly have a big outage which is similar/identical to
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
>
> Some of the osds are runnable, but most crash when they start -- crc
> error in osdmap

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
680 is epoch 2983572
666 crashes at 2982809 or 2982808

-407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl 2982809 612069 bytes
-407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) ** in thread 7f4d931b5b80 thread_name:ceph-osd

So I grabbed 2982809 and 298

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
For those following along, the issue is here:
https://tracker.ceph.com/issues/39525#note-6

Somehow single bits are getting flipped in the osdmaps -- maybe network, maybe memory, maybe a bug.

To get an osd starting, we have to extract the full osdmap from the mon, and set it into the crashing osd.
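The bit-flip diagnosis and the recovery step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure from the tracker issue: the helper names `flipped_bits` and `recovery_commands`, the `osdmap.<epoch>` file name, and the `/var/lib/ceph/osd/ceph-<id>` data path are all hypothetical choices made for the example. The second helper only builds the command lines (`ceph osd getmap` and `ceph-objectstore-tool --op set-osdmap`); it does not run them.

```python
from typing import List, Tuple


def flipped_bits(good: bytes, bad: bytes) -> List[Tuple[int, int]]:
    """Return (byte_offset, bit_index) for every bit that differs.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(good) != len(bad):
        raise ValueError("blobs differ in length; not a pure bit flip")
    diffs = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        xor = g ^ b  # set bits mark positions where the two copies disagree
        for bit in range(8):
            if xor & (1 << bit):
                diffs.append((offset, bit))
    return diffs


def recovery_commands(epoch: int, osd_id: int) -> List[List[str]]:
    """Build (but do not run) the two commands for one OSD and one epoch."""
    map_file = f"osdmap.{epoch}"  # hypothetical local file name
    # Assumption: the OSD uses the default data directory layout.
    data_path = f"/var/lib/ceph/osd/ceph-{osd_id}"
    return [
        # Fetch the full osdmap for this epoch from the monitors.
        ["ceph", "osd", "getmap", str(epoch), "-o", map_file],
        # Write that map into the (stopped) OSD's object store.
        ["ceph-objectstore-tool", "--data-path", data_path,
         "--op", "set-osdmap", "--file", map_file],
    ]


if __name__ == "__main__":
    # Toy data standing in for a good map from the mon and a corrupted copy.
    print(flipped_bits(b"\x00\x01", b"\x00\x03"))
    for cmd in recovery_commands(2982809, 666):
        print(" ".join(cmd))
```

Comparing the map extracted from a crashing osd against the copy fetched from the mon with `flipped_bits` would show whether the two differ by a single bit. The osd has to be stopped while `ceph-objectstore-tool` operates on its data path.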

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Wido den Hollander
> On 20 Feb 2020 at 19:54, Dan van der Ster wrote:
>
> For those following along, the issue is here:
> https://tracker.ceph.com/issues/39525#note-6
>
> Somehow single bits are getting flipped in the osdmaps -- maybe
> network, maybe memory, maybe a bug.
>

Weird!

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Dan van der Ster
On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander wrote:
>
> > On 20 Feb 2020 at 19:54, Dan van der Ster wrote:
> >
> > For those following along, the issue is here:
> > https://tracker.ceph.com/issues/39525#note-6
> >
> > Somehow single bits are getting flipped in

[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-27 Thread Dan van der Ster
FTR, the root cause is now understood: https://tracker.ceph.com/issues/39525#note-21

-- dan

On Thu, Feb 20, 2020 at 9:24 PM Dan van der Ster wrote:
>
> On Thu, Feb 20, 2020 at 9:20 PM Wido den Hollander wrote:
> >
> > > On 20 Feb 2020 at 19:54, Dan van der Ster wrote