We are heavily impacted by this issue with MGR in Pacific.
It has to be fixed.
As someone suggested in the issue tracker, we limited the memory usage
of the MGR in its systemd unit (MemoryLimit=16G) so that the MGR is
killed before it consumes all of the server's memory and impacts other
services.
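For reference, a cap like the one described above can be set with a systemd drop-in. This is a sketch only: the unit name and path are assumptions and depend on how the cluster was deployed (cephadm clusters use templated units such as ceph-<fsid>@mgr.<name>.service).

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/ceph-mgr@.service.d/override.conf
[Service]
# Hard cap: the kernel oom-kills the mgr once it exceeds 16 GiB
MemoryLimit=16G
# On systemd >= 231, MemoryMax= is the preferred, non-deprecated directive
```

After adding the drop-in, run `systemctl daemon-reload` and restart the mgr unit for the limit to take effect.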
I have to say that not including a fix for a serious issue in the last
minor release of Pacific is a rather odd decision.
/Z
On Thu, 25 Jan 2024 at 09:00, Konstantin Shalygin wrote:
> Hi,
>
> The backport to Pacific was rejected [1]; you may switch to Reef once [2] is
> merged and released
>
>
Hi,
The backport to Pacific was rejected [1]; you may switch to Reef once [2] is
merged and released
[1] https://github.com/ceph/ceph/pull/55109
[2] https://github.com/ceph/ceph/pull/55110
k
Sent from my iPhone
> On Jan 25, 2024, at 04:12, changzhi tan <544463...@qq.com> wrote:
>
> Is there an
I found that quickly restarting the affected mgr every 2 days is an okay
kludge. It takes less than a second to restart, and the mgr never reaches
the dangerous sizes at which it randomly starts ballooning.
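A periodic restart like this can be automated; a minimal sketch, assuming failing over to a standby mgr is acceptable and that `ceph mgr fail` with no arguments fails the active mgr (true on recent releases; older ones require the mgr name as an argument):

```shell
# Hypothetical root crontab entry: fail over the active mgr every 2 days at 03:00
0 3 */2 * * /usr/bin/ceph mgr fail
```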
/Z
On Thu, 25 Jan 2024, 03:12 changzhi tan, <544463...@qq.com> wrote:
> Is there any way to sol
Is there any way to solve this problem? Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
Hi,
Today, after 3 weeks of normal operation, the mgr reached a memory usage of
1600 MB, quickly ballooned to over 100 GB for no apparent reason, and got
oom-killed again. There were no suspicious messages in the logs until the
message indicating that the mgr had failed to allocate more memory. Any
thoughts?
Hi,
Another update: after 2 more weeks the mgr process grew to ~1.5 GB, which
again was expected:
mgr.ceph01.vankui  ceph01  *:8443,9283  running (2w)  102s ago  2y  1519M  -  16.2.14  fc0182d6cda5  3451f8c6c07e
mgr.ceph02.shsinf  ceph02  *:8443,9283  running (2w)  102s ago  7M
Hi,
A small update: after disabling the 'progress' module, the active mgr (on
ceph01) used up ~1.3 GB of memory in 3 days, which was expected:
mgr.ceph01.vankui  ceph01  *:8443,9283  running (3d)  9m ago  2y  1284M  -  16.2.14  fc0182d6cda5  3451f8c6c07e
mgr.ceph02.shsinf  ceph02  *:8
Thanks for this. This looks similar to what we're observing, although we
don't use the API apart from its use by the Ceph deployment itself, which I
guess still counts.
/Z
On Wed, 22 Nov 2023, 15:22 Adrien Georget wrote:
> Hi,
>
> This memory leak with ceph-mgr seems to be due to a change in Ce
Yes, we use docker, though we haven't had any issues because of it. I don't
think that docker itself can cause mgr memory leaks.
/Z
On Wed, 22 Nov 2023, 15:14 Eugen Block wrote:
> One other difference is you use docker, right? We use podman, could it
> be some docker restriction?
>
> Zitat von
Hi,
This memory leak with ceph-mgr seems to be due to a change in Ceph 16.2.12.
Check this issue : https://tracker.ceph.com/issues/59580
We are also affected by this, with or without containerized services.
Cheers,
Adrien
On 22/11/2023 at 14:14, Eugen Block wrote:
One other difference is you use docker, right? We use podman, could it
be some docker restriction?
Zitat von Zakhar Kirpichenko :
It's a 6-node cluster with 96 OSDs and not much I/O. Each node has 384
GB of RAM, each OSD has a memory target of 16 GB, and about 100 GB of memory,
give or take, is available (mostly used by page cache) on each node during
normal operation. Nothing unusual there, tbh.
No unusual mgr modules or settings either.
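For context, the per-OSD memory target mentioned above is the `osd_memory_target` option; a sketch of setting and verifying it cluster-wide (the 16 GB figure is from this thread, expressed here in bytes):

```shell
ceph config set osd osd_memory_target 17179869184   # 16 GiB
ceph config get osd.0 osd_memory_target             # verify on one OSD
```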
What does your hardware look like memory-wise? Just for comparison,
one customer cluster has 4.5 GB in use (middle-sized cluster for
OpenStack, 280 OSDs):
PID   USER  PR  NI  VIRT     RES     SHR    S  %CPU   %MEM  TIME+  COMMAND
6077  ceph  20  0   6357560  4.522g  22316  S  12.00  1.79
I've disabled the progress module entirely and will see how it goes.
Otherwise, mgr memory usage keeps increasing slowly; from past experience
it stabilizes at around 1.5-1.6 GB. Other than this event warning, it's
unclear what could have caused the random memory ballooning.
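For anyone wanting to try the same thing, the progress module can be turned off from the CLI. Note that in Pacific `progress` is, to my understanding, an always-on mgr module, so it is switched off with its own command rather than `ceph mgr module disable`:

```shell
ceph progress off   # stop the progress module's event tracking
ceph progress on    # re-enable it later
```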
/Z
On Wed, 22 Nov 2023
I see these progress messages all the time, I don't think they cause
it, but I might be wrong. You can disable it just to rule that out.
Zitat von Zakhar Kirpichenko :
Unfortunately, I don't have a full stack trace because there's no crash
when the mgr gets oom-killed. There's just the mgr log, which looks
completely normal until about 2-3 minutes before the oom-kill, when
tcmalloc warnings show up.
I'm not sure that it's the same issue as the one described in the tracker.
Do you have the full stack trace? The pastebin only contains the
"tcmalloc: large alloc" messages (same as in the tracker issue). Maybe
comment in the tracker issue directly since Radek asked for someone
with a similar problem in a newer release.
Zitat von Zakhar Kirpichenko :
Thanks, Eugen. It is similar in the sense that the mgr is getting
OOM-killed.
It started happening in our cluster after the upgrade to 16.2.14. We
haven't had this issue with earlier Pacific releases.
/Z
On Tue, 21 Nov 2023, 21:53 Eugen Block, wrote:
> Just checking it on the phone, but isn’t
I encountered mgr ballooning multiple times with Luminous, but have not since.
At the time, I could often achieve relief by sending the admin socket a heap
release - it would show large amounts of memory unused but not yet released.
That experience is one reason I got Rook recently to allow pro
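The heap-release trick mentioned above uses the tcmalloc heap hooks that Ceph daemons expose via `ceph tell`; a sketch (command names as I recall them, so verify against your release):

```shell
ceph tell mgr heap stats    # show bytes in use vs. held by tcmalloc
ceph tell mgr heap release  # return freed-but-retained pages to the OS
```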
Just checking it on the phone, but isn’t this quite similar?
https://tracker.ceph.com/issues/45136
Zitat von Zakhar Kirpichenko :
Hi,
I'm facing a rather new issue with our Ceph cluster: from time to time,
ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over
100 GB of RAM:
[