Nothing special is going on on that OSD as far as I can tell, and the OSD number of each op is different. The config isn't entirely default, but we have been using it successfully for quite a while. It basically just redirects everything to journald so that we don't get log creep. I reverted it nonetheless.
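Reverting is essentially just dropping the overrides again, roughly like this (a sketch only, with the option names taken from the log config quoted further down; adjust as needed):

    # Remove the logging overrides so the built-in defaults apply again.
    for opt in clog_to_syslog_level err_to_syslog log_to_file log_to_stderr \
               log_to_syslog mon_cluster_log_file_level mon_cluster_log_to_file \
               mon_cluster_log_to_stderr mon_cluster_log_to_syslog \
               mon_cluster_log_to_syslog_level; do
        ceph config rm global "$opt"
    done

    # Keep an eye on the MON store size afterwards
    # (path for a typical non-containerized deployment).
    du -sh /var/lib/ceph/mon/*/store.db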
The MONs have a stable quorum, but the store size is so large now (35GB by this time) that I am seeing the first daemon restarts.

> On 25. Feb 2021, at 21:10, Dan van der Ster <d...@vanderster.com> wrote:
>
> > "source": "osd.104...
>
> What's happening on that osd? Is it something new which corresponds to when
> your mon started growing? Are other OSDs also flooding the mons with logs?
>
> I'm mobile so can't check... Are those logging configs the defaults? If not
> .... revert to default...
>
> BTW do your mons have stable quorum or are they flapping with this load?
>
> .. dan
>
>
> On Thu, Feb 25, 2021, 8:58 PM Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:
> Thanks, Dan.
>
> On the first MON, the command doesn't even return, but I was able to get a
> dump from the one I restarted most recently. The oldest ops look like this:
>
>     {
>         "description": "log(1000 entries from seq 17876238 at 2021-02-25T15:13:20.306487+0100)",
>         "initiated_at": "2021-02-25T20:40:34.698932+0100",
>         "age": 183.762551121,
>         "duration": 183.762599201,
>         "type_data": {
>             "events": [
>                 { "time": "2021-02-25T20:40:34.698932+0100", "event": "initiated" },
>                 { "time": "2021-02-25T20:40:34.698636+0100", "event": "throttled" },
>                 { "time": "2021-02-25T20:40:34.698932+0100", "event": "header_read" },
>                 { "time": "2021-02-25T20:40:34.701407+0100", "event": "all_read" },
>                 { "time": "2021-02-25T20:40:34.701455+0100", "event": "dispatched" },
>                 { "time": "2021-02-25T20:40:34.701458+0100", "event": "mon:_ms_dispatch" },
>                 { "time": "2021-02-25T20:40:34.701459+0100", "event": "mon:dispatch_op" },
>                 { "time": "2021-02-25T20:40:34.701459+0100", "event": "psvc:dispatch" },
>                 { "time": "2021-02-25T20:40:34.701490+0100", "event": "logm:wait_for_readable" },
>                 { "time": "2021-02-25T20:40:34.701491+0100", "event": "logm:wait_for_readable/paxos" },
>                 { "time": "2021-02-25T20:40:34.701496+0100", "event": "paxos:wait_for_readable" },
>                 { "time": "2021-02-25T20:40:34.989198+0100", "event": "callback finished" },
>                 { "time": "2021-02-25T20:40:34.989199+0100", "event": "psvc:dispatch" },
>                 { "time": "2021-02-25T20:40:34.989208+0100", "event": "logm:preprocess_query" },
>                 { "time": "2021-02-25T20:40:34.989208+0100", "event": "logm:preprocess_log" },
>                 { "time": "2021-02-25T20:40:34.989278+0100", "event": "forward_request_leader" },
>                 { "time": "2021-02-25T20:40:34.989344+0100", "event": "forwarded" },
>                 { "time": "2021-02-25T20:41:58.658022+0100", "event": "resend forwarded message to leader" },
>                 { "time": "2021-02-25T20:42:27.735449+0100", "event": "resend forwarded message to leader" }
>             ],
>             "info": {
>                 "seq": 41550,
>                 "src_is_mon": false,
>                 "source": "osd.104 v2:XXX:6864/16579",
>                 "forwarded_to_leader": true
>             }
>
> Any idea what that might be about? It almost looks like this:
> https://tracker.ceph.com/issues/24180
>
> I set debug_mon to 0, but I keep getting a lot of log spill in the journals. It's
> about 1-2 messages per second, mostly RocksDB stuff, but nothing that actually
> looks serious or even log-worthy. I also noticed before that, despite logging
> being set to warning level, the cluster log keeps being written to the MON log.
> But that shouldn't cause such massive stability issues, should it?
> The date on the log op is also weird: 15:13+0100 was hours ago.
>
> Here's my log config:
>
>     global  advanced  clog_to_syslog_level             warning
>     global  basic     err_to_syslog                    true
>     global  basic     log_to_file                      false
>     global  basic     log_to_stderr                    false
>     global  basic     log_to_syslog                    true
>     global  advanced  mon_cluster_log_file_level       error
>     global  advanced  mon_cluster_log_to_file          false
>     global  advanced  mon_cluster_log_to_stderr        false
>     global  advanced  mon_cluster_log_to_syslog        false
>     global  advanced  mon_cluster_log_to_syslog_level  warning
>
> Ceph version is 15.2.8.
>
> Janek
>
>
>> On 25. Feb 2021, at 20:33, Dan van der Ster <d...@vanderster.com> wrote:
>>
>> ceph daemon mon.`hostname -s` ops
>>
>> That should show you the accumulating ops.
>>
>> .. dan
>>
>>
>> On Thu, Feb 25, 2021, 8:23 PM Janek Bevendorff <janek.bevendo...@uni-weimar.de> wrote:
>> Hi,
>>
>> All of a sudden, we are experiencing very concerning MON behaviour. We have
>> five MONs, and all of them have thousands up to tens of thousands of slow
>> ops, the oldest one blocking basically indefinitely (at least the timer
>> keeps creeping up). Additionally, the MON stores keep inflating heavily.
>> Under normal circumstances we have about 450-550MB there. Right now it's
>> 27GB and growing (rapidly).
>>
>> I tried restarting all MONs, disabled auto-scaling (just in case), and
>> checked the system load and hardware. I also restarted the MGR and MDS
>> daemons, but to no avail.
>>
>> Is there any way I can debug this properly? I can't seem to find out how I
>> can actually view which ops are causing this and what client (if any) may
>> be responsible for it.
>>
>> Thanks
>> Janek
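A minimal sketch of tallying those accumulating ops by source, building on the "ceph daemon mon.`hostname -s` ops" suggestion above (this assumes jq is available and that the output uses the field layout shown in the op dump earlier in the thread; the exact JSON path may differ):

    # Count queued MON ops per originating daemon (run on a MON host).
    ceph daemon mon.$(hostname -s) ops |
        jq -r '.ops[] | .type_data.info.source? // "unknown"' |
        sort | uniq -c | sort -rn | head

    # Age of the oldest queued op, to check whether it is still creeping up.
    ceph daemon mon.$(hostname -s) ops | jq '[.ops[].age] | max'

If most of the queued ops turn out to be log() messages from a handful of OSDs, that would point back at the cluster-log settings discussed above rather than at client traffic.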