Re: [ceph-users] dashboard hangs
Hi,

On 2019-11-20 15:55, thoralf schulze wrote:
> hi,
>
> we were able to track this down to the auto balancer: disabling the auto
> balancer and cleaning out old (and probably not very meaningful)
> upmap-entries via ceph osd rm-pg-upmap-items brought back stable mgr
> daemons and a usable dashboard.

I can confirm that. In our case, I see this on a 14.2.4 cluster (which started its life with an earlier Nautilus version and developed this issue over the past weeks), and running

    ceph balancer off

has been sufficient to make the mgrs stable again (i.e. I left the upmap-items in place).

Interestingly, we did not see this with Luminous or Mimic on different clusters (which, however, have a more stable number of OSDs).

@devs: If there's any more info needed to track this down, please let us know.

Cheers,
Oliver

> the not-so-sensible upmap-entries might or might not have been caused by
> us updating from mimic to nautilus - it's too late to debug this now.
>
> this seems to be consistent with bryan stillwell's findings ("mgr hangs
> with upmap balancer").
>
> thank you very much & with kind regards,
> thoralf.
Re: [ceph-users] dashboard hangs
hi,

we were able to track this down to the auto balancer: disabling the auto balancer and cleaning out old (and probably not very meaningful) upmap-entries via ceph osd rm-pg-upmap-items brought back stable mgr daemons and a usable dashboard.

the not-so-sensible upmap-entries might or might not have been caused by us updating from mimic to nautilus - it's too late to debug this now.

this seems to be consistent with bryan stillwell's findings ("mgr hangs with upmap balancer").

thank you very much & with kind regards,
thoralf.
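The cleanup described above can be sketched in Python. This is only a sketch under stated assumptions: it assumes that the JSON output of `ceph osd dump --format json` contains a `pg_upmap_items` list whose entries carry a `pgid` field, and the function name `upmap_removal_commands` and the sample data are made up for illustration. It only builds the command strings, it does not run them, and it covers every upmap entry rather than only the "old" ones, so the list should be reviewed before anything is executed.

```python
import json
from typing import List


def upmap_removal_commands(osd_dump_json: str) -> List[str]:
    """Return one `ceph osd rm-pg-upmap-items <pgid>` command per upmap entry.

    Assumption: the JSON document has a top-level "pg_upmap_items" list
    whose entries each contain a "pgid" string.
    """
    dump = json.loads(osd_dump_json)
    items = dump.get("pg_upmap_items", [])
    return ["ceph osd rm-pg-upmap-items {}".format(item["pgid"]) for item in items]


# Made-up sample input (not taken from the cluster discussed in this thread).
SAMPLE_DUMP = json.dumps({
    "pg_upmap_items": [
        {"pgid": "1.7", "mappings": [{"from": 3, "to": 12}]},
        {"pgid": "2.1f", "mappings": [{"from": 0, "to": 5}]},
    ]
})

for command in upmap_removal_commands(SAMPLE_DUMP):
    print(command)
```

On a real cluster the input would come from the saved output of `ceph osd dump --format json`, and the balancer would be switched off with `ceph balancer off` beforehand so that it does not recreate the entries.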
Re: [ceph-users] dashboard hangs
hi Lenz,

On 11/13/19 6:38 PM, Lenz Grimmer wrote:
> there have been several reports about Ceph mgr modules (not just the
> dashboard) experiencing hangs and freezes recently. The thread "mgr
> daemons becoming unresponsive" might give you some additional insight.
>
> Is the "device health metrics" module enabled on your cluster? Could you
> try disabling it to see if that fixes the issue?

thank you for your answer … i should have mentioned that we tried with nautilus 14.2.2 and 14.2.4, with and without the patch to src/pybind/mgr/devicehealth/module.py provided by Sage in the thread mentioned above. while the patch apparently fixed the issue for other people, it didn't help in our case.

regarding the modules: currently, we have dashboard, iostat, pg_autoscaler, prometheus and restful enabled. disabling them one by one until only dashboard is left helps, albeit only for a short while - i guess this is due to the mgr respawning itself.

with kind regards,
t.
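The "disable them one by one" test above could be scripted roughly as follows. This is a hedged sketch, not a tested procedure: it assumes that `ceph mgr module ls` prints JSON with an `enabled_modules` list of module names, and the helper name `disable_commands` and its `keep` parameter are hypothetical. The sample list simply mirrors the modules named in the message. The commands are only generated, not executed.

```python
import json
from typing import Iterable, List


def disable_commands(module_ls_json: str, keep: Iterable[str] = ("dashboard",)) -> List[str]:
    """Return `ceph mgr module disable <name>` for every enabled module not in `keep`.

    Assumption: the JSON document has a top-level "enabled_modules" list
    of plain module names.
    """
    keep_set = set(keep)
    enabled = json.loads(module_ls_json).get("enabled_modules", [])
    return ["ceph mgr module disable {}".format(name)
            for name in enabled if name not in keep_set]


# Sample input built from the module names mentioned in the message above.
SAMPLE_MODULES = json.dumps({
    "enabled_modules": ["dashboard", "iostat", "pg_autoscaler", "prometheus", "restful"]
})

for command in disable_commands(SAMPLE_MODULES):
    print(command)
```

Running the generated commands one at a time, and checking the dashboard after each, would show whether a single module triggers the hang. As noted in the message, any improvement may only reflect the mgr respawning.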
Re: [ceph-users] dashboard hangs
Hi Thoralf,

there have been several reports about Ceph mgr modules (not just the dashboard) experiencing hangs and freezes recently. The thread "mgr daemons becoming unresponsive" might give you some additional insight.

Is the "device health metrics" module enabled on your cluster? Could you try disabling it to see if that fixes the issue?

Lenz

On 11/13/19 4:01 PM, thoralf schulze wrote:
> the dashboard of our moderately used cluster with 3 mon/mgr-nodes gets
> stuck about 30 seconds after a mgr becomes active. the dashboard is not
> usable anymore (i.e. the mgr daemon does not respond to http requests
> anymore), although it comes back from the dead occasionally for a few
> seconds. the same happens to the prometheus module: grafana only shows a
> few data points here and there.
>
> other mgr-related stuff (e.g. ceph pg dump) continues to work just fine.
> forcing a switchover to another mgr or enabling / disabling mgr modules
> helps for a short while, until the whole thing gets stuck again.
>
> a mgr log with debugging enabled for both mgr and mgrc at level 20 can
> be found at
> http://www.user.tu-berlin.de/thoralf.schulze/ceph-mgr-2019113.log.xz -
> in this case, the hang occurred shortly before 14:55.
>
> any hints would be greatly appreciated …
>
> thank you very much & with kind regards,
>

--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)