Really not sure where to go with this one.  Firstly, a description of my 
cluster.  Yes, I know there are a lot of "not ideals" here but this is what I 
inherited.

The cluster is running Jewel and has two storage/mon nodes and an additional 
mon-only node, with a pool size of 2.  Today, we had some power issues in the 
data centre and we very ungracefully lost both storage servers at the same 
time.  Node 1 came back online before node 2, but I could see there were a few 
OSDs that were down.  When node 2 came back, I started trying to get OSDs up.  
Each node has 14 OSDs and I managed to get all OSDs up and in on node 2, but 
one of the OSDs on node 1 keeps starting and then crashing, and won't stay up.  
I'm not finding the OSD log output to be much use.  Current health status looks 
like this:

# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are blocked > 32 sec
# ceph status
    cluster e2391bbf-15e0-405f-af12-943610cb4909
     health HEALTH_ERR
            26 pgs are stuck inactive for more than 300 seconds
            26 pgs down
            26 pgs peering
            26 pgs stuck inactive
            26 pgs stuck unclean
            5 requests are blocked > 32 sec

Any clues as to what I should be looking for or what sort of action I should be 
taking to troubleshoot this?  Unfortunately, I'm a complete novice with Ceph.

Here's a snippet from the OSD log that means little to me...

--- begin dump of recent events ---
     0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
 in thread 7f2e23921ac0 thread_name:ceph-osd

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f2e24330c2a]
 2: (()+0xf5d0) [0x7f2e21ee95d0]
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
 11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
 12: (()+0x3c8847) [0x7f2e23d07847]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
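
To make the trace a little easier to read, here is a minimal Python sketch that 
strips the frames down to bare function names.  The helper name `frame_symbols` 
and the regular expression are hypothetical, written only for illustration; 
they are not part of Ceph.  Read from the bottom up, the names suggest the 
abort comes from an assert inside FileJournal::read_entry while the journal is 
being replayed during FileStore::mount at OSD start-up.

```python
import re

# Frames copied from the trace above, with the wrapped lines rejoined.
TRACE = """\
 3: (gsignal()+0x37) [0x7f2e2049f207]
 4: (abort()+0x148) [0x7f2e204a08f8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
 6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
 7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
 8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
 9: (OSD::init()+0x27d) [0x7f2e23d5828d]
 10: (main()+0x2c18) [0x7f2e23c71088]
"""

# Matches "N: (symbol(args)+0xOFFSET) [0xADDR]".
# Frames with no symbol name, such as "(()+0xf5d0)", do not match and are skipped.
FRAME_RE = re.compile(
    r"^\s*(\d+): \(([\w:~]+)\(.*\)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$"
)


def frame_symbols(trace):
    """Return (frame number, function name) pairs in the order they appear."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


if __name__ == "__main__":
    for number, name in frame_symbols(TRACE):
        print(number, name)
```

This only tidies up what is already in the log; it does not resolve the 
unnamed frames, which still need the executable as the NOTE says.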

Thanks in advance,
Mark
