[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Igor Fedotov
Hi! Can you share OSD logs demonstrating such a restart? Thanks, Igor On 20/09/2023 20:16, sbeng...@gmail.com wrote: Since upgrading to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures, making the cluster unusable. Has anyone else seen this behavior? Upgrade path:

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Travis Nielsen
If there is nothing obvious in the OSD logs such as failing to start, and if the OSDs appear to be running until the liveness probe restarts them, you could disable or change the timeouts on the liveness probe. See https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#health-settings . B
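For reference, a minimal sketch of what that change could look like, assuming a Rook-operated CephCluster named rook-ceph in the rook-ceph namespace; the field paths follow the health-settings doc linked above, but verify them against your Rook version:

  # Disable the OSD liveness probe entirely (revert once the root cause is found)
  kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"disabled":true}}}}}'

  # Or keep the probe but give it more slack
  kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"healthCheck":{"livenessProbe":{"osd":{"probe":{"timeoutSeconds":10,"periodSeconds":30,"failureThreshold":6}}}}}}'

The operator reconciles the change into the OSD deployments, so no manual pod edits are needed.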

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-21 Thread Sudhin Bengeri
Igor, Travis, Thanks for your attention to this issue. We extended the timeout for the liveness probe yesterday, and also extended the time after which a down OSD deployment is deleted by the operator. Once all the OSD deployments were recreated by the operator, we observed two OSD restarts - whi

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-22 Thread Peter Goron
Hi, For the record, in the past we faced a similar issue with OSDs being killed one after another every day, starting at midnight. The root cause was linked to the device_health_check launched by the mgr on each OSD. While an OSD is running device_health_check, its admin socket is busy and can't answer to
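A hedged way to check for, and temporarily rule out, that interaction using standard Ceph CLI commands; the devicehealth option names are mgr module settings and the values shown are only examples:

  # See whether device-health scraping is enabled and how often it runs
  ceph config get mgr mgr/devicehealth/enable_monitoring
  ceph config get mgr mgr/devicehealth/scrape_frequency   # in seconds; the default is daily

  # Temporarily switch device monitoring off and watch whether the probe failures stop
  ceph device monitoring off

  # Re-enable it after the test
  ceph device monitoring on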

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-26 Thread sbengeri
Hi Igor, Please let me know where I can upload the OSD logs. Thanks. Sudhin

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-26 Thread Igor Fedotov
Hi Sudhin, any publicly available cloud storage, e.g. Google Drive, should work. Thanks, Igor On 26/09/2023 22:52, sbeng...@gmail.com wrote: Hi Igor, Please let me know where I can upload the OSD logs. Thanks. Sudhin

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-27 Thread sbengeri
Hi Igor, I have copied three OSD logs to https://drive.google.com/file/d/1aQxibFJR6Dzvr3RbuqnpPhaSMhPSL--F/view?usp=sharing Hopefully they include some meaningful information. Thank you. Sudhin

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-28 Thread Igor Fedotov
Hi Sudhin, It looks like manual DB compactions are (periodically?) issued via the admin socket for your OSDs, which (my working hypothesis) triggers DB access stalls. Here are the log lines indicating such calls: debug 2023-09-22T11:24:55.234+ 7fc4efa20700  1 osd.1 1192508 triggering manual
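For context, the kind of admin-socket call that leaves a "triggering manual compaction" line in the OSD log is an on-demand compaction such as the following; osd.1 is used here only because it is the daemon in the quoted log, and the point is to check whether any cron job, probe, or tooling on the host issues something like this:

  # Via the admin socket on the host running the daemon
  ceph daemon osd.1 compact

  # Or remotely through the cluster
  ceph tell osd.1 compact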

[ceph-users] Re: After upgrading from 17.2.6 to 18.2.0, OSDs are very frequently restarting due to livenessprobe failures

2023-09-28 Thread Mark Nelson
There is some pretty strange compaction behavior happening in these logs.  For instance, in osd0, we see an O-1 CF L1 compaction that's taking ~204 seconds: 2023-09-21T20:03:59.378+ 7f16a286c700  4 rocksdb: (Original Log Time 2023/09/21-20:03:59.381808) EVENT_LOG_v1 {"time_micros": 169532