[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-04-01 Thread Janek Bevendorff
> I’m actually very curious how well this is performing for you as I’ve definitely not seen a deployment this large. How do you use it? What exactly do you mean? Our cluster has 11PiB capacity of which about 15% is used at the moment (web-scale corpora and such). We have deployed 5 MONs and

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Jarett DeAngelis
I’m actually very curious how well this is performing for you as I’ve definitely not seen a deployment this large. How do you use it? > On Mar 27, 2020, at 11:47 AM, shubjero wrote: > > I've reported stability problems with ceph-mgr w/ prometheus plugin > enabled on all versions we ran in

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread shubjero
I've reported stability problems with ceph-mgr w/ prometheus plugin enabled on all versions we ran in production, which were several versions of Luminous and Mimic. Our solution was to disable the prometheus exporter. I am using Zabbix instead. Our cluster is 1404 OSDs in size with about 9PB raw

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Janek Bevendorff
Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were failing constantly due to the prometheus module doing something funny. On 26/03/2020 18:10, Paul Choi wrote: > I won't speculate more into the MDS's stability, but I do wonder about > the same thing. > There is one file served

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Paul Choi
I won't speculate more into the MDS's stability, but I do wonder about the same thing. There is one file served by the MDS that would cause the ceph-fuse client to hang. It was a file that many people in the company relied on for data updates, so very noticeable. The only fix was to fail over the

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Janek Bevendorff
If there is actually a connection, then it's no wonder our MDS kept crashing. Our Ceph has 9.2PiB of available space at the moment. On 26/03/2020 17:32, Paul Choi wrote: > I can't quite explain what happened, but the Prometheus endpoint > became stable after the free disk space for the largest

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Paul Choi
I can't quite explain what happened, but the Prometheus endpoint became stable after the free disk space for the largest pool went substantially lower than 1PB. I wonder if there's some metric that exceeds the maximum size for some int, double, etc? -Paul On Mon, Mar 23, 2020 at 9:50 AM Janek
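The question about a metric exceeding "the maximum size for some int, double, etc" can at least be checked arithmetically. The sketch below compares the capacity figures quoted in this thread (1PB, 9.2PiB, 11PiB), taken as byte counts, against a few common numeric limits. It is only a calculation under that assumption: the names `LIMITS` and `limits_exceeded` are made up for illustration, and nothing here shows that any of these limits is actually involved in the MGR problem.

```python
PB = 10 ** 15   # bytes in one petabyte (decimal)
PIB = 2 ** 50   # bytes in one pebibyte (binary)

DOUBLE_EXACT = "double exact-integer limit (2**53)"

# Common numeric limits; the dict keeps insertion order (smallest first).
LIMITS = {
    "int32 max": 2 ** 31 - 1,
    "uint32 max": 2 ** 32 - 1,
    DOUBLE_EXACT: 2 ** 53,
    "int64 max": 2 ** 63 - 1,
}


def limits_exceeded(n_bytes):
    """Return the names of all limits that n_bytes is strictly above."""
    return [name for name, limit in LIMITS.items() if n_bytes > limit]


# Sizes mentioned in the thread, interpreted as byte counts (an assumption).
SIZES = {
    "1PB": PB,
    "9.2PiB": int(9.2 * PIB),
    "11PiB": 11 * PIB,
}

for label, size in SIZES.items():
    print(label, size, limits_exceeded(size))
```

Under that assumption, a byte count of 1PB still fits in the range where a double represents every integer exactly, while 9.2PiB and 11PiB are above 2**53 (which equals 8PiB); all three are far below the signed 64-bit maximum.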

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I dug up this issue report, where the problem has been reported before: https://tracker.ceph.com/issues/39264 Unfortunately, the issue hasn't got much (or any) attention yet. So let's get this fixed; the prometheus module is unusable in its current state. On 23/03/2020 17:50, Janek Bevendorff

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I haven't seen any MGR hangs so far since I disabled the prometheus module. It seems like the module is not only slow, but kills the whole MGR when the cluster is sufficiently large, so these two issues are most likely connected. The issue has become much, much worse with 14.2.8. On 23/03/2020

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I am running the very latest version of Nautilus. I will try setting up an external exporter today and see if that fixes anything. Our cluster is somewhat large-ish with 1248 OSDs, so I expect stat collection to take "some" time, but it definitely shouldn't crash the MGRs all the time. On

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Paul Choi
Hi Janek, What version of Ceph are you using? We also have a much smaller cluster running Nautilus, with no MDS. No Prometheus issues there. I won't speculate further than this but perhaps Nautilus doesn't have the same issue as Mimic? On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Janek Bevendorff
I think this is related to my previous post to this list about MGRs failing regularly and being overall quite slow to respond. The problem has existed before, but the new version has made it way worse. My MGRs keep dying every few hours and need to be restarted. The Prometheus plugin works, but

[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Paul Choi
If I run "curl http://localhost:9283/metrics" and wait long enough, I get this: it says "No MON connection". But the mons are healthy and the cluster is functioning fine. That said, the mons' rocksdb sizes are fairly big because there's lots of rebalancing going on. The Prometheus endpoint
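To put a number on "slow reply", the curl check above could be wrapped in a small timed probe. This is a minimal sketch using only the Python standard library; the function name `probe_metrics` and the 10-second default timeout are illustrative choices, and the URL is the one from the message above.

```python
import time
import urllib.error
import urllib.request


def probe_metrics(url="http://localhost:9283/metrics", timeout=10.0):
    """Fetch a metrics endpoint once and time the request.

    Returns (ok, seconds, detail). On success, detail is the number of
    non-comment lines in the response; on failure it is the error text.
    """
    # Ignore any proxy settings from the environment: the endpoint is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, HTTP error statuses and timeouts.
        return False, time.monotonic() - start, str(exc)
    elapsed = time.monotonic() - start
    # Lines starting with '#' are HELP/TYPE comments in the text format.
    samples = [ln for ln in body.splitlines() if ln and not ln.startswith("#")]
    return True, elapsed, len(samples)
```

Calling `probe_metrics()` periodically and logging the returned tuple would show whether the endpoint is timing out, returning an error, or merely answering slowly.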