[ceph-users] Re: All OSDs on one host down

Andrew Walker-Brown Fri, 06 Aug 2021 05:51:57 -0700

A reboot of the host has fixed the problem but I still want to find the root 
cause.


Looking at the logs, I can see the original mon went down because the Docker 
engine shut down in response to a network event. That network event appears 
to be related to a systemd wait-on-network timeout and a daily apt update 
check happening at the same time. 

When the mon was rebooted and came back up, this seemed to trigger the OSDs on 
a separate server to shut down. Only the OSD containers shut down, not the 
mon/mgr or mds containers. Again, this seems to be a requested shutdown rather 
than a crash.

Need to do some more digging.

Any thoughts would be appreciated. 

A

On 6 Aug 2021, at 09:20, David Caro <dc...@wikimedia.org> wrote:

On 08/06 07:59, Andrew Walker-Brown wrote:
> Hi Marc,
> 
> Yes, I’m probably doing just that.
> 
> The ceph admin guides aren’t exactly helpful on this.  The cluster was 
> deployed using cephadm and it’s been running perfectly until now.
> 
> Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs 
> for osd.5 on that host?

On my containerized setup, the services that cephadm created are:

dcaro@node1:~ $ sudo systemctl list-units | grep ceph
 ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service       loaded active running  Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
 ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service  loaded active running  Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
 ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service         loaded active running  Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
 ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service             loaded active running  Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
 ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service             loaded active running  Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
 system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice  loaded active active  system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
 ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target                    loaded active active   Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
 ceph.target                                                         loaded active active   All Ceph clusters and services

where the string after 'ceph-' is the fsid of the cluster.
Hope that helps (you can also run systemctl list-units on your own hosts to 
find the specific units there).
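
As a rough sketch of how the naming pattern in the listing above could be used, 
the snippet below builds a unit name of the form ceph-<fsid>@<daemon>.service. 
The helper name ceph_unit_name is made up for illustration, the fsid is the one 
from my cluster (yours will differ), and the journalctl/systemctl lines are 
left commented out since they need to run on the host that carries the daemon.

```shell
# Hypothetical helper: build the systemd unit name that appears in the
# listing above, i.e. ceph-<fsid>@<daemon>.service
ceph_unit_name() {
    fsid="$1"      # cluster fsid (the string after 'ceph-')
    daemon="$2"    # daemon name as shown in the listing, e.g. osd.3
    printf 'ceph-%s@%s.service\n' "$fsid" "$daemon"
}

# Example using the fsid from the listing above; substitute your own.
unit=$(ceph_unit_name d49b287a-b680-11eb-95d4-e45f010c03a8 osd.3)
echo "$unit"

# On the host running that daemon, the logs and status could then be
# queried with the full unit name (assumption: run with sufficient rights):
# sudo journalctl -u "$unit"
# sudo systemctl status "$unit"
```

So for osd.5 on ceph-004, the unit to look for would follow the same pattern 
with that cluster's fsid, rather than the plain ceph-osd@5 name.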


> 
> Cheers,
> A
> 
> From: Marc<mailto:m...@f1-outsourcing.eu>
> Sent: 06 August 2021 08:54
> To: Andrew Walker-Brown<mailto:andrew_jbr...@hotmail.com>; 
> ceph-users@ceph.io<mailto:ceph-users@ceph.io>
> Subject: RE: All OSDs on one host down
> 
>> 
>> I’ve tried restarting one of the OSDs but that fails; journalctl shows
>> osd not found. I’m not convinced I’ve got the systemctl command right.
>> 
> 
> Are you not mixing 'non-container commands' with 'container commands'? As in, 
> if you execute this journalctl outside of the container, it will of course 
> not find anything.
> 
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."