Hello,
I am seeing the following after running 'ceph health detail':
[WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive, 3 pgs down
pg 0.1a is down, acting [234,35]
pg 0.20 is down, acting [226,267]
pg 0.2f is down, acting [227,161]
When I query each of those PGs I see the same message on all of them:
"peering_blocked_by": [
{
"osd": 233,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us
proceed"
}
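For reference, that comes from the recovery_state section of the query output for each PG, e.g.:

ceph pg 0.1a query

(and likewise for 0.20 and 0.2f).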
osd.233 crashed a while ago, and when I try to start it the log shows what looks
like a filesystem issue; the OSD aborts while replaying the FileStore journal
during mount:
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)
1: (()+0x12980) [0x7f2779617980]
2: (gsignal()+0xc7) [0x7f27782c9fb7]
3: (abort()+0x141) [0x7f27782cb921]
4: (ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >
const&)+0x1b2) [0x556ebe773ddf]
5: (FileStore::_do_transaction(ceph::os::Transaction&, unsigned long, int,
ThreadPool::TPHandle*, char const*)+0x62b3) [0x556ebebe2753]
6: (FileStore::_do_transactions(std::vector<ceph::os::Transaction,
std::allocator<ceph::os::Transaction> >&, unsigned long, ThreadPool::TPHandle*,
char const*)+0x48) [0x556ebebe3f38]
7: (JournalingObjectStore::journal_replay(unsigned long)+0x105a)
[0x556ebebfc56a]
8: (FileStore::mount()+0x438a) [0x556ebebda82a]
9: (OSD::init()+0x4d1) [0x556ebe80fdc1]
10: (main()+0x3f8c) [0x556ebe77ad2c]
11: (__libc_start_main()+0xe7) [0x7f27782acbf7]
12: (_start()+0x2a) [0x556ebe78fc4a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
At this point I am thinking about either running xfs_repair on osd.233 and
seeing whether I can get it back up (once the PGs are healthy again I would
likely zap/re-add or replace the drive).
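In case it clarifies the plan, roughly what I had in mind (the device name
/dev/sdX1 is just a placeholder for osd.233's data partition):

systemctl stop ceph-osd@233         # make sure the OSD is not running
umount /var/lib/ceph/osd/ceph-233   # xfs_repair needs the fs unmounted
xfs_repair -n /dev/sdX1             # dry run first, report only
xfs_repair /dev/sdX1                # actual repair
mount /dev/sdX1 /var/lib/ceph/osd/ceph-233
systemctl start ceph-osd@233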
Another option, it sounds like, is to mark the OSD as lost.
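If I go that route, I assume the command would be:

ceph osd lost 233 --yes-i-really-mean-it

though my understanding is that this tells the cluster to give up on any data
that only exists on osd.233, which is exactly what I am trying to avoid.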
I am just looking for advice on what exactly to do next to minimize the chance
of any data loss.
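One other approach I have seen mentioned is exporting the down PGs from the
dead OSD with ceph-objectstore-tool and importing them into a healthy (stopped)
OSD, something along these lines (paths and the target OSD id are placeholders):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-233 \
    --journal-path /var/lib/ceph/osd/ceph-233/journal \
    --op export --pgid 0.1a --file /root/0.1a.export

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN \
    --op import --file /root/0.1a.export

though I am not sure that will even work here if the journal replay itself is
what is crashing.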
Here is the query output for each of those pgs:
https://pastebin.com/YbfnpZGC
Thank you,
Shain