[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Ashley Merrick
Quickly checking the code that calls that assert:

    if (version > omap_version) {
        omap_version = version;
        omap_num_objs = num_objs;
        omap_num_items.resize(omap_num_objs);
        journal_state = jstate;
    } else if (version == omap_version) {
        ceph_assert(omap_num_objs == num_objs);
        if (jstate > journa
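For context, the branch the archive preview cuts off appears to end, as I read OpenFileTable::_load_finish in the Nautilus source tree (an assumption worth verifying against your exact release), with:

        if (jstate > journal_state)
            journal_state = jstate;
    }

In other words, when the on-disk header version matches the in-memory one, the MDS insists the number of openfiles omap objects has not changed; that ceph_assert(omap_num_objs == num_objs) is the check firing here.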

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Hi Ashley, Thanks for your response. Nothing that I can think of would have happened. We are using max_mds = 1. We do have 4 MDS daemons, so we used to have 3 standbys. Within minutes they all crash. On Fri, May 1, 2020 at 2:21 PM Ashley Merrick wrote: > Quickly checking the code that calls that assert > > >

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Also seeing errors such as this:
[2020-05-01 13:15:20,970][systemd][WARNING] command returned non-zero exit status: 1
[2020-05-01 13:15:20,970][systemd][WARNING] failed activating OSD, retries left: 11
[2020-05-01 13:15:20,974][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Paul Emmerich
The OpenFileTable objects are safe to delete while the MDS is offline anyways, the RADOS object names are mds*_openfiles*

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585
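If anyone wants to do that programmatically instead of with the rados CLI, here is a minimal librados C++ sketch. The pool name cephfs_metadata and the object name mds0_openfiles.0 are illustrative assumptions — list your metadata pool first to confirm which mds*_openfiles* objects exist, and only remove them while the corresponding MDS rank is down:

    // Build with: g++ remove_openfiles.cc -lrados -o remove_openfiles
    #include <rados/librados.hpp>
    #include <iostream>

    int main() {
      librados::Rados cluster;
      // Default cluster name "ceph" and client.admin credentials.
      if (cluster.init2("client.admin", "ceph", 0) < 0) return 1;
      cluster.conf_read_file(nullptr);   // read the usual ceph.conf locations
      if (cluster.connect() < 0) return 1;

      librados::IoCtx ioctx;
      // "cephfs_metadata" is a placeholder; use your CephFS metadata pool name.
      if (cluster.ioctx_create("cephfs_metadata", ioctx) < 0) return 1;

      // Object names follow the mds*_openfiles* pattern mentioned above.
      int r = ioctx.remove("mds0_openfiles.0");
      std::cout << "remove returned " << r << std::endl;

      ioctx.close();
      cluster.shutdown();
      return 0;
    }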

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Hi Paul, I appreciate the response, but as I'm fairly new to Ceph I'm not sure I understand. Are you saying that you believe the issue is due to the number of open files? If so, what are you suggesting as the solution? Thanks. On Fri, May 1, 2020 at 3:27 PM Paul Emmerich wrote

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Paul Emmerich
On Fri, May 1, 2020 at 9:27 PM Paul Emmerich wrote: > The OpenFileTable objects are safe to delete while the MDS is offline > anyways, the RADOS object names are mds*_openfiles* > I should clarify this a little bit: you shouldn't touch the CephFS internal state or data structures unless you know

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Understood Paul, thanks. In case this helps to shed any further light... Digging through logs I'm also seeing this:
2020-05-01 10:06:55.984 7eff10cc3700 1 mds.prdceph01 Updating MDS map to version 1487236 from mon.2
2020-05-01 10:06:56.398 7eff0e4be700 0 log_channel(cluster) log [WRN] : 17 slow

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Thanks everyone, I was able to address the issue, at least temporarily. The filesystem and MDSes are for the time being staying online, and the PGs are being remapped. What I'm not sure about is the best tuning for the MDS given our use case, nor am I sure exactly what caused the OSDs to flap as they did,
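On the tuning question: one knob that commonly comes up for metadata-heavy CephFS workloads is the MDS cache size. A minimal ceph.conf sketch, purely illustrative — the 8 GiB figure is an assumption, not a recommendation for this cluster, and the limit is not a hard cap, so watch the MDS host's actual memory use after raising it:

    [mds]
        # MDS cache target in bytes; this example sets 8 GiB.
        mds_cache_memory_limit = 8589934592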

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-03 Thread Sasha Litvak
Marco, Could you please share what was done to make your cluster stable again? On Fri, May 1, 2020 at 4:47 PM Marco Pizzolo wrote: > > Thanks Everyone, > > I was able to address the issue at least temporarily. The filesystem and > MDSes are for the time staying online and the pgs are being rema