A reboot of the host has fixed the problem but I still want to find the root cause.
Looking at the logs, I can see the original mon went down because the Docker engine shut down in response to a network event. That network event appears to be related to a systemd wait-on-network timeout and a daily apt update check happening at the same time.

When the mon was rebooted and came back up, this seemed to trigger the OSDs on a separate server to shut down. Only the OSD containers shut down, not the mon/mgr or mds containers. Again, this seems to be a requested shutdown rather than a crash.

I need to do some more digging. Any thoughts would be appreciated.

A

On 6 Aug 2021, at 09:20, David Caro <dc...@wikimedia.org> wrote:

On 08/06 07:59, Andrew Walker-Brown wrote:
> Hi Marc,
>
> Yes, I’m probably doing just that.
>
> The ceph admin guides aren’t exactly helpful on this. The cluster was
> deployed using cephadm and it’s been running perfectly until now.
>
> Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs
> for osd.5 on that host?

On my containerized setup, the services that cephadm created are:

  dcaro@node1:~ $ sudo systemctl list-units | grep ceph
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service       loaded active running Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service  loaded active running Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service         loaded active running Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service             loaded active running Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service             loaded active running Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
  system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice  loaded active active system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
  ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target                    loaded active active Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
  ceph.target                                                         loaded active active All Ceph clusters and services

where the string after 'ceph-' is the fsid of the cluster.

Hope that helps (you can also use systemctl list-units to search for the specific ones on yours).

>
> Cheers,
> A
>
> From: Marc<mailto:m...@f1-outsourcing.eu>
> Sent: 06 August 2021 08:54
> To: Andrew Walker-Brown<mailto:andrew_jbr...@hotmail.com>;
> ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> Subject: RE: All OSDs on one host down
>
>>
>> I’ve tried restarting one of the osds but that fails, journalctl shows
>> osd not found.....not convinced I’ve got the systemctl command right.
>>
>
> You are not mixing 'not container commands' with 'container commands'? As in,
> if you execute this journalctl outside of the container it will not find
> anything of course.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
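
For reference, the unit naming in David's listing can be sketched as a small shell helper. This is only an illustration, assuming cephadm units follow the ceph-<fsid>@<daemon>.service pattern seen in that listing. The function name cephadm_unit is hypothetical (it is not part of cephadm), and the fsid below is the one from David's cluster, so substitute your own (for example from `ceph fsid`).

```shell
#!/bin/sh
# Build the systemd unit name for a cephadm-managed daemon.
# Assumed pattern, taken from the listing in this thread:
#   ceph-<fsid>@<daemon>.service
# cephadm_unit is a hypothetical helper name, not a cephadm command.
cephadm_unit() {
    fsid="$1"      # cluster fsid
    daemon="$2"    # daemon name, e.g. osd.5 or mon.node1
    printf 'ceph-%s@%s.service\n' "$fsid" "$daemon"
}

# Example using the fsid from David's listing; replace it with your own.
unit=$(cephadm_unit d49b287a-b680-11eb-95d4-e45f010c03a8 osd.3)

# Only print the command here; run it yourself on the host that owns the daemon.
echo "journalctl -u $unit"
```

Applied to the original question, the unit for osd.5 on ceph-004 would then take the form ceph-<fsid>@osd.5 rather than ceph-osd@5, which would explain why journalctl found nothing under the latter name.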