Hi,
The mgr of my cluster logs this every few seconds:
[progress WARNING root] complete: ev 7de5bb74-790b-4fda-8838-e4af4af18c62 does not exist
[progress WARNING root] complete: ev fff93fce-b630-4141-81ee-19e7a3e61483 does not exist
[progress WARNING root] complete: ev a02f6966-5b9f-49e8-89c4-b4fb8e6
I have an 8-node cluster on old hardware. A week ago four nodes went down and
the Ceph cluster went nuts.
All PGs became unknown and the monitors took too long to get in sync,
so I reduced the number of mons to one, and mgrs to one as well.
Now the recovery starts with 100% unknown PGs and then PGs start
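As a possible mitigation for the stale warnings (assuming they come from the mgr "progress" module, which the "[progress WARNING root]" prefix suggests), the module's event list can be cleared, or the module toggled off; a sketch, not a confirmed fix:

```shell
# Drop all events tracked by the mgr progress module; the stale
# "ev ... does not exist" entries should disappear with them:
ceph progress clear

# If the warnings come back, the module can be toggled off and on:
ceph progress off
ceph progress on
```

Note that `ceph progress off` also hides legitimate recovery progress in `ceph status`, so it is best re-enabled once the cluster settles.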
There is some pretty strange compaction behavior happening in these
logs. For instance, in osd0 we see an O-1 CF L1 compaction that is
taking ~204 seconds:
2023-09-21T20:03:59.378+ 7f16a286c700 4 rocksdb: (Original Log Time 2023/09/21-20:03:59.381808) EVENT_LOG_v1 {"time_micros": 169532
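To gauge how much these slow compactions hurt the OSD, its perf counters can be inspected over the admin socket. The `rocksdb` and `bluestore` logger names below are standard OSD perf-counter sections, but the exact counter names vary by release, so treat this as a sketch:

```shell
# Dump RocksDB-level latency counters for the affected OSD:
ceph daemon osd.0 perf dump rocksdb

# BlueStore-level kv_* latencies show whether OSD I/O is stalling on the DB:
ceph daemon osd.0 perf dump bluestore
```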
Hi Milind, Team,
Thank you for the response @Milind.
> Snap-schedule no longer accepts a --subvol argument,
Thank you for the information.
Currently, we are using the following commands to create the snap-schedules:
Syntax:
ceph fs snap-schedule add ///
ceph fs snap-schedule retention add
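For reference, a complete invocation looks like the following; the path /volumes/_nogroup/myvol and the 1h / 24h4w specs are made-up illustrative values, not taken from this thread:

```shell
# Snapshot the given CephFS path every hour:
ceph fs snap-schedule add /volumes/_nogroup/myvol 1h

# Keep 24 hourly and 4 weekly snapshots:
ceph fs snap-schedule retention add /volumes/_nogroup/myvol 24h4w

# Verify what is configured for that path:
ceph fs snap-schedule list /volumes/_nogroup/myvol
```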
Hi Venky and cephers,
Thanks for the reply.
No config changes had been made before the issues occurred; we suspect a
client bug. Please see the following message about the log segment
accumulation waiting to be trimmed. For the moment the problematic client
nodes cannot be rebooted; evicting the client will definite
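If eviction does become necessary, this is a sketch of the usual procedure; the MDS name mds.a and the client id 4305 below are placeholders, not values from this thread:

```shell
# List sessions on the MDS to identify the problematic client and its id:
ceph tell mds.a client ls

# Evict that client; its mount will hang or fail until it reconnects:
ceph tell mds.a client evict id=4305
```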
Hi Sudhin,
It looks like manual DB compactions are being issued (periodically?) via the
admin socket for your OSDs, which (my working hypothesis) triggers DB access
stalls.
Here are the log lines indicating such calls:
debug 2023-09-22T11:24:55.234+ 7fc4efa20700 1 osd.1 1192508 triggering manual
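For context, these are the two usual ways such a manual compaction gets triggered; if a cron job or an operator runs either of them, that would explain the "triggering manual" log lines (a hypothesis, not something visible in the excerpt):

```shell
# Via the local admin socket on the OSD host:
ceph daemon osd.1 compact

# Or remotely over the network:
ceph tell osd.1 compact
```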