[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Paul Choi
If I "curl http://localhost:9283/metrics"; and wait sufficiently long
enough, I get this - says "No MON connection". But the mons are health and
the cluster is functioning fine.
That said, the mons' rocksdb sizes are fairly big because there's lots of
rebalancing going on. The Prometheus endpoint hanging seems to happen
regardless of the mon size anyhow.

mon.woodenbox0 is 41 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox2 is 26 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox4 is 42 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox3 is 43 GiB >= mon_data_size_warn (15 GiB)
mon.woodenbox1 is 38 GiB >= mon_data_size_warn (15 GiB)
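For anyone trying to reproduce this, a probe with a hard client-side timeout
makes it easy to tell a real hang from a merely slow response (the timeout
value below is just an example, not a recommendation):

curl --max-time 30 -sS -o /dev/null -w '%{http_code} %{time_total}s\n' \
    http://localhost:9283/metrics

When the endpoint is healthy this prints "200" plus the scrape time; when it
hangs, curl gives up after 30s instead of blocking forever.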

# fg
curl -H "Connection: close" http://localhost:9283/metrics

503 Service Unavailable
No MON connection

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 217, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 61, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 704, in metrics
    return self._metrics(instance)
  File "/usr/lib/ceph/mgr/prometheus/module.py", line 721, in _metrics
    raise cherrypy.HTTPError(503, 'No MON connection')
HTTPError: (503, 'No MON connection')

(HTML 503 error page generated by CherryPy 3.5.0; markup and CSS omitted.)

On Fri, Mar 20, 2020 at 6:33 AM Paul Choi  wrote:

> Hello,
>
> We are running Mimic 13.2.8 with our cluster, and since upgrading to
> 13.2.8 the Prometheus plugin seems to hang a lot. It used to respond under
> 10s but now it often hangs. Restarting the mgr processes helps temporarily
> but within minutes it gets stuck again.
>
> The active mgr doesn't exit when doing `systemctl stop ceph-mgr.target`
> and needs to be kill -9'ed.
>
> Is there anything I can do to address this issue, or at least get better
> visibility into the issue?
>
> We only have a few plugins enabled:
> $ ceph mgr module ls
> {
>     "enabled_modules": [
>         "balancer",
>         "prometheus",
>         "zabbix"
>     ],
>
> 3 mgr processes, but it's a pretty large cluster (near 4000 OSDs) and it's
> a busy one with lots of rebalancing. (I don't know if a busy cluster would
> seriously affect the mgr's performance, but just throwing it out there)
>
>   services:
>     mon: 5 daemons, quorum woodenbox0,woodenbox2,woodenbox4,woodenbox3,woodenbox1
>     mgr: woodenbox2(active), standbys: woodenbox0, woodenbox1
>     mds: cephfs-1/1/1 up {0=woodenbox6=up:active}, 1 up:standby-replay
>     osd: 3964 osds: 3928 up, 3928 in; 831 remapped pgs
>     rgw: 4 daemons active
>
> Thanks in advance for your help,
>
> -Paul Choi
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Janek Bevendorff
I think this is related to my previous post to this list about MGRs
failing regularly and being overall quite slow to respond. The problem
existed before, but the new version has made it way worse. My MGRs
keep dying every few hours and need to be restarted. The Prometheus
plugin works, but it's pretty slow, and so is the dashboard.
Unfortunately, nobody seems to have a solution for this, and I wonder
why more people aren't complaining about it.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-20 Thread Paul Choi
Hi Janek,

What version of Ceph are you using?
We also have a much smaller cluster running Nautilus, with no MDS. No
Prometheus issues there.
I won't speculate further than this but perhaps Nautilus doesn't have the
same issue as Mimic?
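If it helps narrow things down, `ceph versions` prints a per-daemon-type
summary (mon/mgr/osd/mds/rgw) of what's actually running, which is a quick
way to confirm exactly which release each cluster is on:

$ ceph versions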

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I am running the very latest version of Nautilus. I will try setting up
an external exporter today and see if that fixes anything. Our cluster
is somewhat large-ish with 1248 OSDs, so I expect stat collection to
take "some" time, but it definitely shouldn't crush the MGRs all the time.


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I haven't seen any MGR hangs so far since I disabled the prometheus
module. It seems like the module is not only slow, but kills the whole
MGR when the cluster is sufficiently large, so these two issues are most
likely connected. The issue has become much, much worse with 14.2.8.
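For reference, toggling the module on and off is just the standard mgr
module commands:

ceph mgr module disable prometheus
# and later, to turn it back on:
ceph mgr module enable prometheus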



[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-23 Thread Janek Bevendorff
I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264

Unfortunately, the issue hasn't gotten much (or any) attention yet. So
let's get this fixed; the prometheus module is unusable in its current
state.



[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Paul Choi
I can't quite explain what happened, but the Prometheus endpoint became
stable after the free disk space for the largest pool dropped substantially
below 1 PB.
I wonder if there's some metric that exceeds the maximum value of some int,
double, etc.?

-Paul
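In case anyone wants to check their own numbers against that theory, the
pool-level figures are visible directly in the exporter output. The metric
names below are what our Mimic mgr exports; they may differ slightly on
other releases:

curl -s http://localhost:9283/metrics | grep -E '^ceph_pool_(max_avail|bytes_used)'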


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Janek Bevendorff
If there is actually a connection, then it's no wonder our MDS kept
crashing. Our Ceph has 9.2PiB of available space at the moment.
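For anyone wanting to see where they stand relative to that ~1 PB threshold,
the per-pool MAX AVAIL figures (which should roughly correspond to the
exported free-space metric) are shown by:

ceph df detail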



[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-26 Thread Paul Choi
I won't speculate more about the MDS's stability, but I do wonder about the
same thing.
There is one file served by the MDS that would cause the ceph-fuse client
to hang. It was a file that many people in the company relied on for data
updates, so very noticeable. The only fix was to fail over the MDS.

Since the free disk space dropped, I haven't heard anyone complain...
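For anyone hitting the same thing: failing over is just a matter of marking
the active MDS rank failed so the standby(-replay) daemon takes over,
something like:

ceph mds fail 0    # rank 0 is the single active rank in our cephfs-1/1/1 setup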



[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Janek Bevendorff
Sorry, I meant MGR of course. The MDSs are fine for me. But the MGRs were
failing constantly because the prometheus module was doing something funny.



[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread shubjero
I've reported stability problems with ceph-mgr with the prometheus plugin
enabled on every version we've run in production, which includes several
releases of Luminous and Mimic. Our solution was to disable the prometheus
exporter; I am using Zabbix instead. Our cluster is 1404 OSDs in size, with
about 9PB raw and around 35% utilization.
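In case it's useful to anyone going the same route, the zabbix mgr module
only needs a couple of settings once zabbix_sender is available on the mgr
hosts; the hostname and identifier below are obviously placeholders for your
own environment:

ceph mgr module enable zabbix
ceph zabbix config-set zabbix_host zabbix.example.com
ceph zabbix config-set identifier our-ceph-cluster
ceph zabbix send    # push a first batch of stats to verify the setup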


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-03-27 Thread Jarett DeAngelis
I’m actually very curious how well this is performing for you as I’ve 
definitely not seen a deployment this large. How do you use it?


[ceph-users] Re: No reply or very slow reply from Prometheus plugin - ceph-mgr 13.2.8 mimic

2020-04-01 Thread Janek Bevendorff

> I’m actually very curious how well this is performing for you as I’ve 
> definitely not seen a deployment this large. How do you use it?

What exactly do you mean? Our cluster has 11 PiB of capacity, of which about
15% is used at the moment (web-scale corpora and such). We have deployed
5 MONs and 5 MGRs (on the same hosts) and it works totally fine overall. We
have some MDS performance issues here and there, but those aren't too bad
anymore after a few upstream patches. And then we have this annoying
Prometheus MGR problem, which reliably kills our MGRs after a few hours.
