On Sat, Apr 10, 2021 at 2:10 AM Robert LeBlanc <rob...@leblancnet.us> wrote:
>
> On Fri, Apr 9, 2021 at 4:04 PM Dan van der Ster <d...@vanderster.com> wrote:
> >
> > Here's what you should look for, with debug_mon=10. It shows clearly
> > that it takes the mon 23 seconds to run through get_removed_snaps_range.
> > So if this is happening every 30s, it explains at least part of why
> > this mon is busy.
> >
> > 2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader) e45 handle_subscribe mon_subscribe({mdsmap=3914079+,monmap=0+,osdmap=1170448})
> > 2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 check_osdmap_sub 0x55e2e2133de0 next 1170448 (onetime)
> > 2021-04-09 17:07:27.238 7f9fc83e4700 5 mon.sun-storemon01@0(leader).osd e1987355 send_incremental [1170448..1987355] to client.131831153
> > 2021-04-09 17:07:28.590 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 0 [1~3]
> > 2021-04-09 17:07:29.898 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 5 []
> > 2021-04-09 17:07:31.258 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 6 []
> > 2021-04-09 17:07:32.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 20 []
> > 2021-04-09 17:07:33.866 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 21 []
> > 2021-04-09 17:07:35.162 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 22 []
> > 2021-04-09 17:07:36.470 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 23 []
> > 2021-04-09 17:07:37.778 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 24 []
> > 2021-04-09 17:07:39.090 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 25 []
> > 2021-04-09 17:07:40.398 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 26 []
> > 2021-04-09 17:07:41.706 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 27 []
> > 2021-04-09 17:07:43.006 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 28 []
> > 2021-04-09 17:07:44.322 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 29 []
> > 2021-04-09 17:07:45.630 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 30 []
> > 2021-04-09 17:07:46.938 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 31 []
> > 2021-04-09 17:07:48.246 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 32 []
> > 2021-04-09 17:07:49.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 34 []
> > 2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 35 []
> > 2021-04-09 17:07:50.862 7f9fc83e4700 20 mon.sun-storemon01@0(leader).osd e1987355 send_incremental starting with base full 1986745 664086 bytes
> > 2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 build_incremental [1986746..1986785] with features 107b84a842aca
> >
> > So have a look for that client again or other similar traces.
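
(In case it's useful to others hitting this: bumping the mon debug level and grepping for those calls can be done roughly as below -- just a sketch, and the mon id and log path are placeholders for a default deployment.)

    # raise mon debug logging cluster-wide (Nautilus centralized config)
    ceph config set mon debug_mon 10

    # then watch the leader mon's log for slow get_removed_snaps_range calls
    # (default log location; adjust to your deployment)
    grep get_removed_snaps_range /var/log/ceph/ceph-mon.<mon-id>.log | tail -n 50

    # revert to the default debug level afterwards
    ceph config rm mon debug_mon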
>
> So, even though I blacklisted the client and we remounted the file
> system on it, that wasn't enough to stop it from performing the same bad
> requests. We found another node that had two sessions to the same
> mount point. We rebooted both nodes and the CPU is now back at a
> reasonable 4-6% and the cluster is running at full performance again.
> I've added both MONs back so all 3 mons are in the system and
> there are no more elections. Thank you for helping us track down the
> bad clients out of over 2,000 clients.
>
> > > Maybe if that code path isn't needed in Nautilus it can be removed in
> > > the next point release?
> >
> > I think there were other major changes in this area that might make
> > such a backport difficult. And we should expect nautilus to be nearing
> > its end...
>
> But ... we just got to Nautilus... :)
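
(For reference, hunting down a suspect CephFS client like this generally looks something like the following -- a rough sketch only; the mds name, session id, and client address are placeholders, not values from Robert's cluster.)

    # list client sessions on the active MDS and look for duplicate or stale mounts
    ceph tell mds.<mds-name> session ls

    # evict the offending session (by default this also blacklists the client)
    ceph tell mds.<mds-name> session evict id=<session-id>

    # inspect or extend the OSD blacklist directly if needed
    ceph osd blacklist ls
    ceph osd blacklist add <client-addr:port/nonce>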
Ouch, we just suffered this or a similar issue on our big prod block storage cluster running 14.2.19. But in our case it wasn't related to an old client -- rather, we had 100% mon CPU and election storms, plus huge tcmalloc memory usage, all following the recreation of a couple of OSDs. We wrote the details here: https://tracker.ceph.com/issues/50587

-- Dan
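
PS: for anyone chasing the memory side of this, the mon's tcmalloc heap can be inspected through its admin socket, along these lines (run on the mon host; the mon id is a placeholder):

    # show tcmalloc heap statistics for the monitor
    ceph daemon mon.<id> heap stats

    # ask tcmalloc to return freed memory to the OS
    ceph daemon mon.<id> heap release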