> I’m actually very curious how well this is performing for you as I’ve
> definitely not seen a deployment this large. How do you use it?
What exactly do you mean? Our cluster has 11PiB capacity, of which about
15% is used at the moment (web-scale corpora and such). We have
deployed 5 MONs and
I’m actually very curious how well this is performing for you as I’ve
definitely not seen a deployment this large. How do you use it?
> On Mar 27, 2020, at 11:47 AM, shubjero wrote:
>
> I've reported stability problems with ceph-mgr w/ prometheus plugin
> enabled on all versions we ran in
I've reported stability problems with ceph-mgr w/ prometheus plugin
enabled on all versions we ran in production which were several
versions of Luminous and Mimic. Our solution was to disable the
prometheus exporter. I am using Zabbix instead. Our cluster is 1404
OSDs in size with about 9PB raw
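For reference, that switch can be made from the CLI; a rough sketch, with
the Zabbix server address as a placeholder:

    # stop the mgr prometheus exporter entirely
    ceph mgr module disable prometheus

    # enable the Zabbix module and point it at your Zabbix server or proxy
    ceph mgr module enable zabbix
    ceph zabbix config-set zabbix_host zabbix.example.com

    # push one round of data right away to verify the setup
    ceph zabbix send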
Sorry, I meant MGR of course. MDS are fine for me. But the MGRs were
failing constantly due to the prometheus module doing something funny.
On 26/03/2020 18:10, Paul Choi wrote:
> I won't speculate more into the MDS's stability, but I do wonder about
> the same thing.
> There is one file served
I won't speculate more into the MDS's stability, but I do wonder about the
same thing.
There is one file served by the MDS that would cause the ceph-fuse client
to hang. It was a file that many people in the company relied on for data
updates, so very noticeable. The only fix was to fail over the
If there is actually a connection, then it's no wonder our MDS kept
crashing. Our Ceph has 9.2PiB of available space at the moment.
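For reference, the kind of MDS failover described above can be forced from
the CLI; a rough sketch, assuming a standby MDS is configured and using
rank 0 as an example:

    # show active and standby MDS daemons for the filesystem
    ceph fs status

    # mark the active MDS (rank 0 here) as failed so a standby takes over
    ceph mds fail 0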
On 26/03/2020 17:32, Paul Choi wrote:
> I can't quite explain what happened, but the Prometheus endpoint
> became stable after the free disk space for the largest
I can't quite explain what happened, but the Prometheus endpoint became
stable after the free disk space for the largest pool went substantially
lower than 1PB.
I wonder if there's some metric that exceeds the maximum size for some int,
double, etc?
-Paul
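As a rough sanity check of that hypothesis (example figures, not values read
from any cluster): pool sizes in the petabyte range, expressed in bytes, do
sit close to common numeric limits. Bash arithmetic is enough to see it:

    echo $(( 9 * 2**50 ))    # 10133099161583616 bytes (~9 PiB)
    echo $(( 2**53 ))        # 9007199254740992, largest integer a double holds exactly
    echo $(( 2**31 - 1 ))    # 2147483647, signed 32-bit maximum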
On Mon, Mar 23, 2020 at 9:50 AM Janek
I dug up this issue report, where the problem has been reported before:
https://tracker.ceph.com/issues/39264
Unfortunately, the issue hasn't got much (or any) attention yet. So
let's get this fixed; the prometheus module is unusable in its current
state.
On 23/03/2020 17:50, Janek Bevendorff
I haven't seen any MGR hangs so far since I disabled the prometheus
module. It seems like the module is not only slow, but kills the whole
MGR when the cluster is sufficiently large, so these two issues are most
likely connected. The issue has become much, much worse with 14.2.8.
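When a mgr wedges like that, failing over to a standby from the CLI is
usually quicker than waiting for it to recover; a rough sketch, with the
daemon name as a placeholder:

    # check which mgr is currently active (shown in the "mgr:" line)
    ceph -s

    # mark the active mgr as failed so a standby takes over;
    # mgr-host-a is a placeholder for the active daemon's name
    ceph mgr fail mgr-host-a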
On 23/03/2020
I am running the very latest version of Nautilus. I will try setting up
an external exporter today and see if that fixes anything. Our cluster
is somewhat large-ish with 1248 OSDs, so I expect stat collection to
take "some" time, but it definitely shouldn't crash the MGRs all the time.
Hi Janek,
What version of Ceph are you using?
We also have a much smaller cluster running Nautilus, with no MDS. No
Prometheus issues there.
I won't speculate further than this, but perhaps Nautilus doesn't have the
same issue as Mimic?
On Fri, Mar 20, 2020 at 12:23 PM Janek Bevendorff <
I think this is related to my previous post to this list about MGRs
failing regularly and being overall quite slow to respond. The problem
has existed before, but the new version has made it way worse. My MGRs
keep dying every few hours and need to be restarted. The Prometheus
plugin works, but
If I "curl http://localhost:9283/metrics; and wait sufficiently long
enough, I get this - says "No MON connection". But the mons are health and
the cluster is functioning fine.
That said, the mons' rocksdb sizes are fairly big because there's lots of
rebalancing going on. The Prometheus endpoint