Hi!

We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
then, our largest cluster is experiencing random crashes on OSDs attached to the
mon hosts.

This is the crash we are seeing (cut for brevity, see links in post scriptum):

   {
       "ceph_version": "15.2.4",
       "utsname_release": "4.15.0-72-generic",
       "assert_condition": "r == 0",
       "assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, 
bool)",
       "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc 
<http://bluestore.cc/>",
       "assert_line": 11430,
       "assert_thread_name": "bstore_kv_sync",
       "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc 
<http://bluestore.cc/>: In function 'void 
BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 
time 
2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc
 <http://bluestore.cc/>: 11430: FAILED ceph_assert(r == 0)\n",
       "backtrace": [
           "(()+0x12890) [0x7fc576875890]",
           "(gsignal()+0xc7) [0x7fc575527e97]",
           "(abort()+0x141) [0x7fc575529801]",
           "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1a5) [0x559ef9ae97b5]",
           "(ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x559ef9ae993f]",
           "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) 
[0x559efa0245b0]",
           "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
           "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
           "(()+0x76db) [0x7fc57686a6db]",
           "(clone()+0x3f) [0x7fc57560a88f]"
       ]
   }

Right before the crash occurs, we see the following message in the crash log:

       -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
       -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:

In short, whenever this happens, RocksDB reports a block checksum mismatch
(corruption) during background compaction right before the OSD aborts.
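
For reference, we gauge how widespread these errors are by scanning the OSD logs
for that message. Below is a minimal Python sketch that counts the checksum
errors per log file; the /var/log/ceph/ceph-osd.*.log naming is an assumption
based on a default deployment, so adjust the glob for other setups:

    #!/usr/bin/env python3
    # Count RocksDB "block checksum mismatch" errors per OSD log file.
    # Minimal sketch; the /var/log/ceph/ceph-osd.*.log naming is an assumption
    # (default deployments); adjust the glob for other setups.
    import glob
    from collections import Counter

    counts = Counter()
    for path in glob.glob("/var/log/ceph/ceph-osd.*.log"):
        with open(path, errors="replace") as f:
            counts[path] = sum("block checksum mismatch" in line for line in f)

    for path, n in counts.most_common():
        if n:
            print(f"{n:5d}  {path}")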

When an OSD crashes, which happens about 10-15 times a day, it restarts and
resumes work without any further problems.

We are pretty confident that this is not a hardware issue, due to the following 
facts:

* The crashes occur on 5 different hosts over 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* Each crash usually hits a different OSD, i.e. one that has not crashed before.

Still, we checked the following on a few OSDs/hosts:

* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
* We manually compacted a number of OSDs (see the sketch after this list); one of
  them still crashed hours later.
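
For completeness, the online compaction pass over a batch of OSDs amounts to
something like the sketch below; the OSD ids are placeholders, and it assumes
the standard "ceph tell osd.<id> compact" admin command for online compaction:

    #!/usr/bin/env python3
    # Online-compact a batch of OSDs one at a time and report failures.
    # Minimal sketch; the OSD ids are placeholders, and it assumes the standard
    # "ceph tell osd.<id> compact" admin command for online compaction.
    import subprocess

    osd_ids = [12, 47, 103]  # placeholder ids

    for osd_id in osd_ids:
        result = subprocess.run(
            ["ceph", "tell", f"osd.{osd_id}", "compact"],
            capture_output=True, text=True,
        )
        status = "ok" if result.returncode == 0 else f"FAILED: {result.stderr.strip()}"
        print(f"osd.{osd_id}: {status}")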

The only pattern we have noticed so far: it only happens to OSDs that are attached
to a mon host. *None* of the OSDs on non-mon hosts have crashed!
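
To double-check that correlation, something along these lines can pull the host
out of every recorded crash and compare it with the mon hosts. The mon hostnames
are placeholders, and the "crash_id"/"utsname_hostname" field names plus the
"-f json" output of "ceph crash ls"/"ceph crash info" are assumptions based on
what our crash dumps contain:

    #!/usr/bin/env python3
    # Tally recorded OSD crashes per host and flag whether the host runs a mon.
    # Minimal sketch; MON_HOSTS is a placeholder set, and the JSON layout of
    # "ceph crash ls" / "ceph crash info" (crash_id, utsname_hostname fields)
    # is assumed from the crash metadata we see.
    import json
    import subprocess
    from collections import Counter

    MON_HOSTS = {"mon1", "mon2", "mon3"}  # placeholder hostnames

    def ceph_json(*args):
        out = subprocess.run(["ceph", *args, "-f", "json"],
                             capture_output=True, text=True, check=True).stdout
        return json.loads(out)

    crashes_per_host = Counter()
    for entry in ceph_json("crash", "ls"):
        info = ceph_json("crash", "info", entry["crash_id"])
        crashes_per_host[info.get("utsname_hostname", "unknown")] += 1

    for host, n in crashes_per_host.most_common():
        kind = "mon host" if host in MON_HOSTS else "non-mon host"
        print(f"{n:4d} crashes on {host} ({kind})")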

Does anyone have a hint as to what could be causing this? We currently have no
good theory that explains it, much less a fix or workaround.

Any help would be greatly appreciated.

Denis

Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
