Hi Igor

Just to clarify:

>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>> occurrences I could find were the ones that precede the crashes.
> 
> To be precise: were you able to find multiple _verify_csum occurrences?

There are no “_verify_csum” entries whatsoever; my earlier statement was a mistake.
I could only find “checksum mismatch”, right when the crashes happen.

Sorry for the confusion.

I will keep tracking those counters and set up memory tracking for the mon and
OSD processes on those hosts; a rough sketch of what I have in mind is below.
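
This is only a sketch, not something battle-tested: it assumes the OSD admin
sockets live under /var/run/ceph, that the counter shows up in the "bluestore"
section of "perf dump", and that summing VmRSS by process name is good enough for
a first look. Paths, process names and the interval would need adjusting.

    #!/usr/bin/env python3
    # Rough polling sketch: bluestore_reads_with_retries per OSD plus the
    # resident memory of the ceph-osd/ceph-mon processes on this node.
    import glob
    import json
    import subprocess
    import time

    def reads_with_retries(asok):
        # "ceph daemon <asok> perf dump" returns JSON; the counter is assumed
        # to live in the "bluestore" section.
        out = subprocess.check_output(["ceph", "daemon", asok, "perf", "dump"])
        return json.loads(out).get("bluestore", {}).get("bluestore_reads_with_retries")

    def rss_kb(process_name):
        # Sum VmRSS (kB) over all processes whose comm matches process_name.
        total = 0
        for status in glob.glob("/proc/[0-9]*/status"):
            try:
                lines = open(status).read().splitlines()
            except OSError:
                continue  # process went away between glob and read
            if "Name:\t" + process_name not in lines:
                continue
            for line in lines:
                if line.startswith("VmRSS:"):
                    total += int(line.split()[1])
        return total

    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        for asok in sorted(glob.glob("/var/run/ceph/*osd*.asok")):
            print(stamp, asok, "reads_with_retries =", reads_with_retries(asok))
        for name in ("ceph-osd", "ceph-mon"):
            print(stamp, name, "rss_kb =", rss_kb(name))
        time.sleep(300)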

Cheers,

Denis

> On 27 Aug 2020, at 13:39, Igor Fedotov <ifedo...@suse.de> wrote:
> 
> Hi Denis
> 
> please see my comments inline.
> 
> 
> Thanks,
> 
> Igor
> 
> On 8/27/2020 10:06 AM, Denis Krienbühl wrote:
>> Hi Igor,
>> 
>> Thanks for your input. I tried to gather as much information as I could to
>> answer your questions. Hopefully we can get to the bottom of this.
>> 
>>> 0) What is the backing disk layout for the OSDs in question (main device 
>>> type? additional DB/WAL devices?).
>> Everything is on a single Intel NVMe P4510 using dmcrypt with 2 OSDs per NVMe
>> device. There is no additional DB/WAL device and there are no HDDs involved.
>> 
>> Also note that we use 40 OSDs per host with an osd_memory_target of 6'174'015'488 bytes.
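>> 
>> As a back-of-the-envelope check: 6'174'015'488 bytes is exactly 5.75 GiB, so
>> the OSD memory targets alone add up to 40 x 5.75 GiB = 230 GiB, which is most
>> of the 256 GB (~238 GiB) in each host before the mon and everything else is
>> accounted for.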
>> 
>>> 1) Please check all the existing logs for OSDs at "failing" nodes for other 
>>> checksum errors (as per my comment #38)
>> I grepped the logs for "checksum mismatch" and "_verify_csum". The only
>> occurrences I could find were the ones that precede the crashes.
> 
> To be precise: were you able to find multiple _verify_csum occurrences?
> 
> If so, this means data read failures were observed for user data, not just for 
> RocksDB, which backs the hypothesis of interim disk read errors as the root 
> cause. User data reads go through quite a different access stack and can be 
> retried after such errors, hence they aren't as visible.
> 
> But having checksum failures for both the DB and user data points to the same 
> root cause at a lower layer (kernel, I/O stack, etc.).
> 
> It would be interesting to know whether the _verify_csum and RocksDB checksum 
> errors happened around the same period of time, not necessarily for a single 
> OSD, but across different OSDs of the same node.
> 
> That might indicate the node was suffering from some wider problem at the time. 
> Anything suspicious in the system-wide logs for that period?
> 
>> 
>>> 2) Check if BlueFS spillover is observed for any failing OSDs.
>> As everything is on the same device, there can be no spillover, right?
> Right
>> 
>>> 3) Check the "bluestore_reads_with_retries" performance counter for all OSDs 
>>> at the nodes in question. See comments 38-42 for the details. Any non-zero 
>>> values?
>> I monitored this overnight by repeatedly polling this performance counter over 
>> all OSDs on the mon hosts. Only one OSD, which has crashed in the past, has had 
>> a value of 1 since I started measuring. All the other OSDs, including the ones 
>> that crashed overnight, have a value of 0, both before and after their crashes.
> 
> Even a single occurrence isn't expected; this counter should always be equal to 
> 0. And presumably peak hours are when the cluster is exposed to the issue the 
> most; night is likely not the peak period. So please keep tracking...
> 
> 
>> 
>>> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
>> The memory use of those nodes is pretty constant, with ~6 GB free and ~25 GB 
>> available out of 256 GB. Only a handful of pages get swapped, if any at all.
>> 
>>> a hypothesis for why only the mon hosts are affected: higher memory 
>>> utilization at these nodes is what causes the disk read failures to appear. 
>>> A RAM leak (or excessive utilization) in the MON processes, or something?
>> Since the memory usage is rather constant, I'm not sure this is the case; I 
>> think we would see more of an up/down pattern. However, we are not yet 
>> monitoring all processes. That is something I'd like to get data on, but I'm 
>> not sure it's the right course of action at the moment.
> 
> Given that colocation with the monitors is probably the clue, I suggest 
> tracking at least the MON and OSD processes.
> 
> And high memory pressure is just a working hypothesis for the root cause of 
> these disk failures. Something else (e.g. high disk utilization) might be the 
> trigger, or the hypothesis might simply be wrong...
> 
> So please just pay some attention to this.
> 
>> 
>> What do you think: is it still plausible that we are seeing a memory 
>> utilization problem, even though there's little variance in the memory usage 
>> patterns?
>> 
>> The approaches we are currently considering are to upgrade our kernel and to 
>> lower the memory target somewhat.
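>> 
>> If we do lower the target, my understanding is that it can be changed centrally 
>> with something along the lines of the following (the value is just an 
>> illustration, not a recommendation):
>> 
>>     ceph config set osd osd_memory_target 4294967296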
>> 
>> Cheers,
>> 
>> Denis
>> 
>> 
>>> On 26 Aug 2020, at 15:29, Igor Fedotov <ifedo...@suse.de> wrote:
>>> 
>>> Hi Denis,
>>> 
>>> this reminds me the following ticket: https://tracker.ceph.com/issues/37282
>>> 
>>> Please note that they mention co-location with a mon in comment #29.
>>> 
>>> 
>>> The working hypothesis for this ticket is interim disk read failures which 
>>> cause RocksDB checksum failures. Earlier we observed such a problem for the 
>>> main device. Presumably it's heavy memory pressure that causes the kernel to 
>>> fail this way. See my comment #38 there.
>>> 
>>> So I'd like to see answers/comments for the following questions:
>>> 
>>> 0) What is the backing disk layout for the OSDs in question (main device 
>>> type? additional DB/WAL devices?).
>>> 
>>> 1) Please check all the existing logs for OSDs at "failing" nodes for other 
>>> checksum errors (as per my comment #38)
>>> 
>>> 2) Check if BlueFS spillover is observed for any failing OSDs.
>>> 
>>> 3) Check the "bluestore_reads_with_retries" performance counter for all OSDs 
>>> at the nodes in question. See comments 38-42 for the details. Any non-zero 
>>> values?
>>> 
>>> 4) Start monitoring RAM usage and swapping for these nodes. Comment 39.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 8/26/2020 3:47 PM, Denis Krienbühl wrote:
>>>> Hi!
>>>> 
>>>> We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). 
>>>> Since
>>>> then, our largest cluster is experiencing random crashes on OSDs attached 
>>>> to the
>>>> mon hosts.
>>>> 
>>>> This is the crash we are seeing (cut for brevity, see links in post 
>>>> scriptum):
>>>> 
>>>>    {
>>>>        "ceph_version": "15.2.4",
>>>>        "utsname_release": "4.15.0-72-generic",
>>>>        "assert_condition": "r == 0",
>>>>        "assert_func": "void 
>>>> BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
>>>>        "assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc",
>>>>        "assert_line": 11430,
>>>>        "assert_thread_name": "bstore_kv_sync",
>>>>        "assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: In 
>>>> function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' 
>>>> thread 7fc56311a700 time 
>>>> 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc: 
>>>> 11430: FAILED ceph_assert(r == 0)\n",
>>>>        "backtrace": [
>>>>            "(()+0x12890) [0x7fc576875890]",
>>>>            "(gsignal()+0xc7) [0x7fc575527e97]",
>>>>            "(abort()+0x141) [0x7fc575529801]",
>>>>            "(ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>>> const*)+0x1a5) [0x559ef9ae97b5]",
>>>>            "(ceph::__ceph_assertf_fail(char const*, char const*, int, char 
>>>> const*, char const*, ...)+0) [0x559ef9ae993f]",
>>>>            "(BlueStore::_txc_apply_kv(BlueStore::TransContext*, 
>>>> bool)+0x3a0) [0x559efa0245b0]",
>>>>            "(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
>>>>            "(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
>>>>            "(()+0x76db) [0x7fc57686a6db]",
>>>>            "(clone()+0x3f) [0x7fc57560a88f]"
>>>>        ]
>>>>    }
>>>> 
>>>> Right before the crash occurs, we see the following message in the crash 
>>>> log:
>>>> 
>>>>        -3> 2020-08-26T08:52:24.787+0200 7fc569b2d700  2 rocksdb: 
>>>> [db/db_impl_compaction_flush.cc:2212] Waiting after background 
>>>> compaction error: Corruption: block checksum mismatch: expected 
>>>> 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808, 
>>>> Accumulated background error counts: 1
>>>>        -2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: 
>>>> submit_common error: Corruption: block checksum mismatch: expected 
>>>> 2548200440, got 2324967102  in db/815839.sst offset 67107066 size 3808 
>>>> code = 2 Rocksdb transaction:
>>>> 
>>>> In short, when this happens we see a RocksDB corruption error after 
>>>> background compaction.
>>>> 
>>>> When an OSD crashes, which happens about 10-15 times a day, it restarts and
>>>> resumes work without any further problems.
>>>> 
>>>> We are pretty confident that this is not a hardware issue, due to the 
>>>> following facts:
>>>> 
>>>> * The crashes occur on 5 different hosts over 3 different racks.
>>>> * There is no smartctl/dmesg output that could explain it.
>>>> * It usually happens to a different OSD that did not crash before.
>>>> 
>>>> Still we checked the following on a few OSDs/hosts:
>>>> 
>>>> * We can do a manual compaction, both offline and online (see the commands 
>>>> after this list).
>>>> * We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the 
>>>> OSDs.
>>>> * We manually compacted a number of OSDs, one of which crashed hours later.
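>>>> 
>>>> For anyone who wants to check the same things, the commands are roughly the 
>>>> following (OSD IDs and paths are placeholders; the offline variants require 
>>>> the OSD to be stopped):
>>>> 
>>>>     ceph tell osd.<id> compact
>>>>     ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
>>>>     ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-<id>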
>>>> 
>>>> The only thing we have noticed so far: It only happens to OSDs that are 
>>>> attached
>>>> to a mon host. *None* of the non-mon host OSDs have had a crash!
>>>> 
>>>> Does anyone have a hint as to what could be causing this? We currently have 
>>>> no good theory that explains it, much less a fix or workaround.
>>>> 
>>>> Any help would be greatly appreciated.
>>>> 
>>>> Denis
>>>> 
>>>> Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
>>>> Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
>>>> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
