[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-25 Thread Adrien Georget
We are heavily impacted by this issue with the MGR in Pacific. This has to be fixed. As someone suggested in the issue tracker, we limited the memory usage of the MGR in its systemd unit (MemoryLimit=16G) so that the MGR is killed before it consumes all the memory of the server and impacts other services
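A minimal sketch of the systemd cap described above, assuming a cephadm-style deployment where the mgr unit is named something like ceph-<fsid>@mgr.<id>.service (check systemctl list-units 'ceph*mgr*' for the exact name on your node); MemoryMax= is the current systemd spelling of the older MemoryLimit= directive:

# Drop-in that caps the mgr's memory so only the mgr gets killed, not the whole node
mkdir -p /etc/systemd/system/ceph-<fsid>@mgr.<id>.service.d
cat > /etc/systemd/system/ceph-<fsid>@mgr.<id>.service.d/memory.conf <<'EOF'
[Service]
MemoryLimit=16G
EOF
systemctl daemon-reload
systemctl restart ceph-<fsid>@mgr.<id>.service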

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Zakhar Kirpichenko
I have to say that not including a fix for a serious issue in the last minor release of Pacific is a rather odd decision. /Z On Thu, 25 Jan 2024 at 09:00, Konstantin Shalygin wrote: > Hi, > > The backport to pacific was rejected [1], you may switch to reef, when [2] > merged and released > >

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Konstantin Shalygin
Hi, The backport to Pacific was rejected [1]; you may switch to Reef when [2] is merged and released. [1] https://github.com/ceph/ceph/pull/55109 [2] https://github.com/ceph/ceph/pull/55110 k Sent from my iPhone > On Jan 25, 2024, at 04:12, changzhi tan <544463...@qq.com> wrote: > > Is there an

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread Zakhar Kirpichenko
I found that quickly restarting the affected mgr every 2 days is an okay kludge. It takes less than a second to restart, and it never grows to the dangerous sizes it reaches when it randomly starts ballooning. /Z On Thu, 25 Jan 2024, 03:12 changzhi tan, <544463...@qq.com> wrote: > Is there any way to sol
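A hedged sketch of this periodic-restart kludge, assuming the cephadm orchestrator and a host that holds an admin keyring; the daemon name mgr.ceph01.vankui is only an example taken from output quoted elsewhere in this thread:

# /etc/cron.d/ceph-mgr-restart: restart the mgr every other day at 03:00
0 3 */2 * * root /usr/bin/ceph orch daemon restart mgr.ceph01.vankui
# (alternatively, something like "ceph mgr fail" forces a failover to the standby mgr)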

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2024-01-24 Thread changzhi tan
Is there any way to solve this problem? Thanks

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-12-18 Thread Zakhar Kirpichenko
Hi, Today, after 3 weeks of normal operation, the mgr reached a memory usage of 1600 MB, quickly ballooned to over 100 GB for no apparent reason, and got oom-killed again. There were no suspicious messages in the logs until the message indicating that the mgr had failed to allocate more memory. Any thought

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-12-11 Thread Zakhar Kirpichenko
Hi, Another update: after 2 more weeks the mgr process grew to ~1.5 GB, which again was expected:
mgr.ceph01.vankui ceph01 *:8443,9283 running (2w) 102s ago 2y 1519M - 16.2.14 fc0182d6cda5 3451f8c6c07e
mgr.ceph02.shsinf ceph02 *:8443,9283 running (2w) 102s ago 7M

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-24 Thread Zakhar Kirpichenko
Hi, A small update: after disabling the 'progress' module, the active mgr (on ceph01) used ~1.3 GB of memory in 3 days, which was expected:
mgr.ceph01.vankui ceph01 *:8443,9283 running (3d) 9m ago 2y 1284M - 16.2.14 fc0182d6cda5 3451f8c6c07e
mgr.ceph02.shsinf ceph02 *:8

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
Thanks for this. This looks similar to what we're observing, although we don't use the API apart from its use by the Ceph deployment itself - which I guess still counts. /Z On Wed, 22 Nov 2023, 15:22 Adrien Georget, wrote: > Hi, > > This memory leak with ceph-mgr seems to be due to a change in Ce

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
Yes, we use docker, though we haven't had any issues because of it. I don't think that docker itself can cause mgr memory leaks. /Z On Wed, 22 Nov 2023, 15:14 Eugen Block, wrote: > One other difference is you use docker, right? We use podman, could it > be some docker restriction? > > Zitat von

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Adrien Georget
Hi, This memory leak with ceph-mgr seems to be due to a change in Ceph 16.2.12. Check this issue: https://tracker.ceph.com/issues/59580 We are also affected by this, with or without containerized services. Cheers, Adrien On 22/11/2023 at 14:14, Eugen Block wrote: One other difference is you

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Eugen Block
One other difference is you use docker, right? We use podman; could it be some docker restriction? Quoting Zakhar Kirpichenko: It's a 6-node cluster with 96 OSDs, not much I/O, mgr . Each node has 384 GB of RAM, each OSD has a memory target of 16 GB, about 100 GB of memory, give or take, i

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
It's a 6-node cluster with 96 OSDs, not much I/O, mgr . Each node has 384 GB of RAM, each OSD has a memory target of 16 GB, and about 100 GB of memory, give or take, is available (mostly used by page cache) on each node during normal operation. Nothing unusual there, tbh. No unusual mgr modules or set
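For reference, a hedged way to confirm the per-OSD memory target mentioned above from the config database (the value is reported in bytes):

ceph config get osd osd_memory_target   # 16 GB corresponds to 17179869184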

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Eugen Block
What does your hardware look like memory-wise? Just for comparison, one customer cluster has 4,5 GB in use (middle-sized cluster for OpenStack, 280 OSDs):
PID   USER  PR  NI  VIRT     RES     SHR    S  %CPU   %MEM  TIME+  COMMAND
6077  ceph  20  0   6357560  4,522g  22316  S  12,00  1,79

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
I've disabled the progress module entirely and will see how it goes. Otherwise, mgr memory usage keeps increasing slowly; from past experience it will stabilize at around 1.5-1.6 GB. Other than this event warning, it's unclear what could have caused the random memory ballooning. /Z On Wed, 22 Nov 202
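A hedged sketch of disabling the progress module as described above; on Pacific it is an always-on module, so it is switched off rather than unloaded and can be re-enabled later with ceph progress on:

ceph progress off     # stop tracking and reporting progress events
ceph progress clear   # drop any events already accumulated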

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Eugen Block
I see these progress messages all the time; I don't think they cause it, but I might be wrong. You can disable the module just to rule that out. Quoting Zakhar Kirpichenko: Unfortunately, I don't have a full stack trace because there's no crash when the mgr gets oom-killed. There's just the mgr lo

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Zakhar Kirpichenko
Unfortunately, I don't have a full stack trace because there's no crash when the mgr gets oom-killed. There's just the mgr log, which looks completely normal until about 2-3 minutes before the oom-kill, when tcmalloc warnings show up. I'm not sure that it's the same issue that is described in the t
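For anyone trying to spot the same pattern, a hedged sketch of pulling those messages out of journald (adjust the unit name to your deployment, or grep the mgr log file if you log to disk):

journalctl -k | grep -i 'out of memory'                   # kernel OOM-killer records
journalctl -u 'ceph*mgr*' | grep 'tcmalloc: large alloc'  # the warnings mentioned above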

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-22 Thread Eugen Block
Do you have the full stack trace? The pastebin only contains the "tcmalloc: large alloc" messages (same as in the tracker issue). Maybe comment in the tracker issue directly, since Radek asked for someone with a similar problem in a newer release. Quoting Zakhar Kirpichenko: Thanks, Eug

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-21 Thread Zakhar Kirpichenko
Thanks, Eugen. It is similar in the sense that the mgr is getting OOM-killed. It started happening in our cluster after the upgrade to 16.2.14. We haven't had this issue with earlier Pacific releases. /Z On Tue, 21 Nov 2023, 21:53 Eugen Block, wrote: > Just checking it on the phone, but isn’t

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-21 Thread Anthony D'Atri
I encountered mgr ballooning multiple times with Luminous, but have not since. At the time, I could often achieve relief by sending the admin socket a heap release - it would show large amounts of memory unused but not yet released. That experience is one reason I got Rook recently to allow pro
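A hedged sketch of that admin-socket heap release, run on the node hosting the active mgr; the daemon id below is only an example taken from output quoted elsewhere in this thread:

ceph daemon mgr.ceph01.vankui heap stats     # shows memory tcmalloc holds but has not returned to the OS
ceph daemon mgr.ceph01.vankui heap release   # asks tcmalloc to hand those free pages back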

[ceph-users] Re: Ceph 16.2.14: ceph-mgr getting oom-killed

2023-11-21 Thread Eugen Block
Just checking it on the phone, but isn’t this quite similar? https://tracker.ceph.com/issues/45136 Quoting Zakhar Kirpichenko: Hi, I'm facing a rather new issue with our Ceph cluster: from time to time ceph-mgr on one of the two mgr nodes gets oom-killed after consuming over 100 GB RAM: [