Hi

We have a lot of OSDs flapping during recovery, and eventually they don't come up again until kicked with "ceph orch daemon restart osd.x".
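
Typically I find the down ones and kick them one by one, something like this (osd.13 just as an example):

ceph osd tree down                 # show only OSDs currently marked down
ceph orch daemon restart osd.13    # kick a stuck one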

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

6 hosts connected by 2 x 10 GbE. Most of the data is in an EC 2+2 RBD pool.

"
# ceph -s
  cluster:
    id:     3b7736c6-00e4-11ec-a3c5-3cecef467984
    health: HEALTH_WARN
            2 host(s) running different kernel versions
            noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Degraded data redundancy: 95909/1023542135 objects degraded (0.009%), 10 pgs degraded, 6 pgs undersized
            484 pgs not deep-scrubbed in time
            725 pgs not scrubbed in time
            11 daemons have recently crashed
            4 slow ops, oldest one blocked for 178 sec, daemons [osd.13,osd.15,osd.19,osd.46,osd.50] have slow ops.

  services:
    mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 51s)
    mgr:        dcn-ceph-01.dzercj(active, since 20h), standbys: dcn-ceph-03.lrhaxo
    mds:        1/1 daemons up, 2 standby
    osd:        118 osds: 118 up (since 3m), 118 in (since 86m); 137 remapped pgs
                flags noscrub,nodeep-scrub,nosnaptrim
    rbd-mirror: 2 daemons active (2 hosts)
    rbd-mirror: 2 daemons active (2 hosts)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 2737 pgs
    objects: 257.30M objects, 334 TiB
    usage:   673 TiB used, 680 TiB / 1.3 PiB avail
    pgs:     95909/1023542135 objects degraded (0.009%)
             7498287/1023542135 objects misplaced (0.733%)
             2505 active+clean
             98   active+remapped+backfilling
             85   active+clean+snaptrim_wait
             32   active+remapped+backfill_wait
             6    active+clean+laggy
             5    active+undersized+degraded+remapped+backfilling
             4    active+recovering+degraded
             1    active+recovering+degraded+remapped
             1    active+undersized+remapped+backfilling

  io:
    client:   45 KiB/s rd, 1.3 MiB/s wr, 52 op/s rd, 91 op/s wr
    recovery: 1.2 GiB/s, 467 objects/s

  progress:
    Global Recovery Event (20h)
      [==========================..] (remaining: 65m)
"

Crash info for one OSD:

"
2022-08-03T10:34:50.179+0000 7fdedd02f700 -1 *** Caught signal (Aborted) **
 in thread 7fdedd02f700 thread_name:tp_osd_tp

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12c20) [0x7fdf00a78c20]
 2: pread64()
 3: (KernelDevice::read_random(unsigned long, unsigned long, char*, bool)+0x40d) [0x55701b4c0f0d]
 4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)+0x60d) [0x55701b05ee7d]
 5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const+0x24) [0x55701b08e6d4]
 6: (rocksdb::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const+0x26) [0x55701b529396]
 7: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long, rocksdb::Slice*, char*, bool) const+0xdc7) [0x55701b745267]
 8: (rocksdb::BlockFetcher::ReadBlockContents()+0x4b5) [0x55701b69fa45]
 9: (rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const+0x919) [0x55701b691ba9]
 10: (rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool) const+0x286) [0x55701b691f86]
 11: (rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*)+0xf1) [0x55701b760891]
 12: (rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*) const+0xfe) [0x55701b760b5e]
 13: (rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const+0x43) [0x55701b69aae3]
 14: (rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::ReadOptions const&, rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransform const*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const+0x9e) [0x55701b68118e]
 15: (rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bool)+0x180) [0x55701b6829a0]
 16: (rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, rocksdb::HistogramImpl*, bool, int)+0x170) [0x55701b5e2370]
 17: (rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool)+0x3ad) [0x55701b5ffbbd]
 18: (rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions)+0x537) [0x55701b50fbe7]
 19: (rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)+0x4d) [0x55701b5104fd]
 20: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x35b) [0x55701b4d617b]
 21: (BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)+0xf7) [0x55701af50b57]
 22: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0xeaf) [0x55701af908cf]
 23: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x3a8) [0x55701af91f68]
 24: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xce) [0x55701af9289e]
 25: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x22c) [0x55701af9441c]
 26: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x1fcf) [0x55701afb8dbf]
 27: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x316) [0x55701afba1a6]
 28: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x55701aad7765]
 29: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0xf3) [0x55701aa6c6d3]
 30: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8) [0x55701aa9ee98]
 31: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, ThreadPool::TPHandle&)+0xc8) [0x55701aa9f038]
 32: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x55701aa90d48]
 33: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55701b1025b4]
 34: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55701b105254]
 35: /lib64/libpthread.so.0(+0x817a) [0x7fdf00a6e17a]
 36: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

"

Some errors seen in the journal on the host:

"
Aug 04 06:57:59 dcn-ceph-01 bash[5683]: debug 2022-08-04T06:57:59.735+0000 7fa687612700 -1 librbd::SnapshotRemoveRequest: 0x559541832780 handle_trash_snap: failed to move snapshot to trash: (16) Device or resource busy

Aug 04 06:57:59 dcn-ceph-01 bash[7700]: debug 2022-08-04T06:57:59.827+0000 7f6f7695e700 -1 osd.25 138094 heartbeat_check: no reply from 172.21.15.55:6844 osd.100 since back 2022-08-04T06:57:22.475054+0000 front 2022-08-04T06:57:22.475002+0000 (oldest deadline 2022-08-04T06:57:46.573685+0000)

Aug 04 06:59:34 dcn-ceph-01 bash[8385]: debug 2022-08-04T06:59:34.398+0000 7f4b440c4700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f4b25676700' had timed out after 15.000000954s

Aug 04 06:59:31 dcn-ceph-01 bash[5230]: debug 2022-08-04T06:59:31.619+0000 7fab9ede5700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fab9ede5700' had timed out after 0.000000000s

Aug 04 06:59:29 dcn-ceph-01 bash[5230]: debug 2022-08-04T06:59:29.808+0000 7fab9cde1700 1 mon.dcn-ceph-01@4(electing) e21 collect_metadata md0: no unique device id for md0: fallback method has no model nor serial'
"

There are a lot of those 15s heartbeat timeouts.
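
They're easy to count on a host with something like this (time window arbitrary):

journalctl --since "2 hours ago" | grep -c "had timed out after 15"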

I'm not sure what to look for. The cluster is quite busy, but OSDs shouldn't flap or crash like that, should they? Perhaps some heartbeat/timeout values could be tweaked?
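
Something along these lines is what I had in mind, if that's a sane approach at all (the values are pure guesses):

# the op thread timeout that keeps showing up as 15 s in the logs (default 15)
ceph config set osd osd_op_thread_timeout 30
# grace period before peers report an OSD down to the mons (default 20)
ceph config set osd osd_heartbeat_grace 30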

Best regards,

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk