Re: [ceph-users] dashboard hangs

2019-11-22 Thread Oliver Freyermuth

Hi,

On 2019-11-20 15:55, thoralf schulze wrote:

> hi,
>
> we were able to track this down to the auto balancer: disabling the auto
> balancer and cleaning out old (and probably not very meaningful)
> upmap-entries via ceph osd rm-pg-upmap-items brought back stable mgr
> daemons and a usable dashboard.


I can confirm that. In our case, I see this on a 14.2.4 cluster (which started
its life with an earlier Nautilus version and developed this issue over the
past weeks), and running:
 ceph balancer off
has been sufficient to make the mgrs stable again (i.e. I left the upmap-items
in place).
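
A minimal sketch of that workaround as a script, for anyone who wants to apply it only when the balancer is actually running. It assumes that `ceph balancer status` prints a JSON object with a boolean `active` field; the helper names (`ceph`, `balancer_is_active`, `disable_balancer_if_active`) are made up for illustration and are not part of Ceph.

```python
import json
import subprocess


def ceph(*args):
    """Run a ceph CLI command and return its stdout as text."""
    result = subprocess.run(["ceph", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def balancer_is_active(status_json):
    """Return True if the balancer status reports the balancer as active.

    Assumption: the status is a JSON object with a boolean "active" key.
    """
    return bool(json.loads(status_json).get("active", False))


def disable_balancer_if_active(run=ceph):
    """Switch the balancer off if it is on; return True if it was switched off."""
    if balancer_is_active(run("balancer", "status")):
        run("balancer", "off")
        return True
    return False
```

Calling `disable_balancer_if_active()` on a node with an admin keyring would run `ceph balancer status` and, if needed, `ceph balancer off`. The `run` parameter exists only so the logic can be tried without a cluster.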

Interestingly, we did not see this with Luminous or Mimic on other clusters
(which, however, have a more stable number of OSDs).

@devs: If there's any more info needed to track this down, please let us know.

Cheers,
Oliver



> the not-so-sensible upmap-entries might or might not have been caused by
> us updating from mimic to nautilus - it's too late to debug this now.
> this seems to be consistent with bryan stillwell's findings ("mgr hangs
> with upmap balancer").
>
> thank you very much & with kind regards,
> thoralf.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







Re: [ceph-users] dashboard hangs

2019-11-20 Thread thoralf schulze
hi,

we were able to track this down to the auto balancer: disabling the auto
balancer and cleaning out old (and probably not very meaningful)
upmap-entries via ceph osd rm-pg-upmap-items brought back stable mgr
daemons and a usable dashboard.
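
A rough sketch of how the cleanup could be scripted, in case it helps others. It only builds the `ceph osd rm-pg-upmap-items <pgid>` command lines for review and does not execute anything. It assumes that `ceph osd dump --format json` contains a top-level `pg_upmap_items` list whose entries carry a `pgid` field; the function name `upmap_removal_commands` is hypothetical.

```python
import json


def upmap_removal_commands(osd_dump_json):
    """Return one `ceph osd rm-pg-upmap-items <pgid>` command per upmap entry.

    Assumption: the dump has a top-level "pg_upmap_items" list whose
    entries carry a "pgid" field. Nothing is executed here.
    """
    dump = json.loads(osd_dump_json)
    return [["ceph", "osd", "rm-pg-upmap-items", item["pgid"]]
            for item in dump.get("pg_upmap_items", [])]
```

Feeding it the saved output of `ceph osd dump --format json` and printing each entry with `" ".join(cmd)` gives a list that can be checked by hand before running anything against the cluster.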

the not-so-sensible upmap-entries might or might not have been caused by
us updating from mimic to nautilus - it's too late to debug this now.
this seems to be consistent with bryan stillwell's findings ("mgr hangs
with upmap balancer").

thank you very much & with kind regards,
thoralf.





Re: [ceph-users] dashboard hangs

2019-11-14 Thread thoralf schulze
hi Lenz,

On 11/13/19 6:38 PM, Lenz Grimmer wrote:
> there have been several reports about Ceph mgr modules (not just the
> dashboard) experiencing hangs and freezes recently. The thread "mgr
> daemons becoming unresponsive" might give you some additional insight.
> 
> Is the "device health metrics" module enabled on your cluster? Could you
> try disabling it to see if that fixes the issue?

thank you for your answer … i should have mentioned that we tried with
nautilus 14.2.2 and 14.2.4, with and without the patch to
src/pybind/mgr/devicehealth/module.py provided by Sage in the thread
mentioned above. while the patch apparently fixed the issue for other
people, it didn't help in our case.

regarding the modules: currently, we have dashboard, iostat,
pg_autoscaler, prometheus and restful enabled. disabling them one by one
until only dashboard is left helps, albeit for a short while only - i
guess this is due to the mgr respawning itself.
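
For reference, the one-by-one disabling can be sketched as below. This assumes that `ceph mgr module ls` (or `ceph mgr module ls --format json` where plain output is not JSON) reports an `enabled_modules` list of module names; the function names are hypothetical, and the commands are only generated, not run.

```python
import json


def modules_to_disable(module_ls_json, keep=("dashboard",)):
    """List enabled mgr modules other than those in `keep`, in reported order.

    Assumption: the JSON has an "enabled_modules" list of module names.
    """
    enabled = json.loads(module_ls_json).get("enabled_modules", [])
    return [name for name in enabled if name not in keep]


def disable_commands(module_ls_json, keep=("dashboard",)):
    """Build one `ceph mgr module disable <name>` command per module to drop."""
    return [["ceph", "mgr", "module", "disable", name]
            for name in modules_to_disable(module_ls_json, keep)]
```

With the module set listed above, this yields disable commands for iostat, pg_autoscaler, prometheus and restful, leaving only dashboard enabled.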

with kind regards,
t.





Re: [ceph-users] dashboard hangs

2019-11-13 Thread Lenz Grimmer
Hi Thoralf,

there have been several reports about Ceph mgr modules (not just the
dashboard) experiencing hangs and freezes recently. The thread "mgr
daemons becoming unresponsive" might give you some additional insight.

Is the "device health metrics" module enabled on your cluster? Could you
try disabling it to see if that fixes the issue?

Lenz

On 11/13/19 4:01 PM, thoralf schulze wrote:

> the dashboard of our moderately used cluster with 3 mon/mgr-nodes gets
> stuck about 30 seconds after a mgr becomes active. the dashboard is not
> usable anymore (i.e. the mgr daemon does not respond to http requests
> anymore), although it comes back from the dead occasionally for a few
> seconds. the same happens to the prometheus module: grafana only shows a
> few data points here and there.
> 
> other mgr-related stuff (e.g. ceph pg dump) continues to work just fine.
> forcing a switchover to another mgr or enabling / disabling mgr modules
> helps for a short while, until the whole thing gets stuck again.
> 
> a mgr log with debugging enabled for both mgr and mgrc at level 20 can
> be found at
> http://www.user.tu-berlin.de/thoralf.schulze/ceph-mgr-2019113.log.xz -
> in this case, the hang occurred shortly before 14:55.
> 
> any hints would be greatly appreciated …
> 
> thank you very much & with kind regards,

-- 
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)


