[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-13 Thread Konstantin Shalygin
Hi,

> On Jan 12, 2024, at 12:01, Frédéric Nass wrote:
> 
> Hard to tell for sure since this bug hit different major versions of the 
> kernel, at least RHEL's from what I know. 

In which RH kernel release was this issue fixed?


Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-12 Thread Frédéric Nass

Samuel, 
  
Hard to tell for sure since this bug hit different major versions of the 
kernel, at least RHEL's from what I know. The only way to tell is to check for 
num_cgroups in /proc/cgroups:

 
 
$ cat /proc/cgroups | grep -e subsys -e blkio | column -t 
   #subsys_name  hierarchy  num_cgroups  enabled 
   blkio         4          1099         1  
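Since the symptom is num_cgroups *growing over time*, one way to check is to sample that column periodically. A minimal sketch (the helper name and the sampling approach are my own, not from the thread; the column layout follows /proc/cgroups):

```shell
# Extract the num_cgroups column for a given controller from
# /proc/cgroups-style input (columns: subsys_name hierarchy num_cgroups enabled).
num_cgroups_of() {
  awk -v c="$1" '$1 == c { print $3 }'
}

# On a live node you would loop over the real file, e.g.:
#   while sleep 60; do num_cgroups_of blkio < /proc/cgroups; done
# Here a captured sample line stands in for /proc/cgroups:
echo "blkio 4 1099 1" | num_cgroups_of blkio
```

If the value climbs steadily while the set of containers and services on the node is stable, that matches the symptom described here.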
Otherwise, you'd have to check the sources of the kernel you're using against 
the patch that fixed this bug. Unfortunately, I can't spot the upstream patch 
that fixed this issue since RH BZs related to this bug are private. Maybe 
someone here can spot it.

Regards,
Frédéric.

  

-----Original Message-----

From: huxiaoyu
To: Frédéric
Cc: ceph-users
Sent: Friday, January 12, 2024 09:25 CET
Subject: Re: Re: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

 
Dear Frederic, 
  
Thanks a lot for the suggestions. We are using the vanilla Linux 4.19 LTS
kernel. Do you think we may be suffering from the same bug?
  
best regards, 
  
Samuel 
  
huxia...@horebdata.cn

From: Frédéric Nass
Date: 2024-01-12 09:19
To: huxiaoyu
CC: ceph-users
Subject: Re: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

Hello,

We've had a similar situation recently where OSDs would use way more memory
than osd_memory_target and get OOM killed by the kernel. It was due to a
kernel bug related to cgroups [1].

If num_cgroups below keeps increasing then you may hit this bug.

$ cat /proc/cgroups | grep -e subsys -e blkio | column -t
   #subsys_name  hierarchy  num_cgroups  enabled
   blkio         4          1099         1
  
If you hit this bug, upgrading the OSD nodes' kernels should get you through.
If you can't access the Red Hat KB [1], let me know your current nodes' kernel
version and I'll check for you.

Regards,
Frédéric.

[1] https://access.redhat.com/solutions/7014337
From: huxiaoyu
To: ceph-users
Sent: Wednesday, January 10, 2024 19:21 CET
Subject: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

Dear Ceph folks, 

I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version,
one with replication 3 and the other with EC 4+2. After around 400 days running
quietly and smoothly, the two clusters recently ran into similar problems:
some OSDs consume ca. 18 GB while the memory target is set at 2 GB.

What could be going wrong in the background? Does it mean there are slow OSD
memory leak issues with 14.2.22 that I do not know about yet?

I would highly appreciate it if someone could provide any clues, ideas, or
comments.

best regards, 

Samuel 



huxia...@horebdata.cn 


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-12 Thread huxia...@horebdata.cn
Dear Frederic,

Thanks a lot for the suggestions. We are using the vanilla Linux 4.19 LTS
kernel. Do you think we may be suffering from the same bug?

best regards,

Samuel



huxia...@horebdata.cn
 
From: Frédéric Nass
Date: 2024-01-12 09:19
To: huxiaoyu
CC: ceph-users
Subject: Re: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?
Hello,
 
We've had a similar situation recently where OSDs would use way more memory 
than osd_memory_target and get OOM killed by the kernel.
It was due to a kernel bug related to cgroups [1].
 
If num_cgroups below keeps increasing then you may hit this bug.
 
$ cat /proc/cgroups | grep -e subsys -e blkio | column -t
   #subsys_name  hierarchy  num_cgroups  enabled
   blkio         4          1099         1
 
If you hit this bug, upgrading the OSD nodes' kernels should get you through.
If you can't access the Red Hat KB [1], let me know your current nodes' kernel
version and I'll check for you.
 
Regards,
Frédéric.
 
[1] https://access.redhat.com/solutions/7014337


From: huxiaoyu
To: ceph-users
Sent: Wednesday, January 10, 2024 19:21 CET
Subject: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

Dear Ceph folks, 

I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version,
one with replication 3 and the other with EC 4+2. After around 400 days running
quietly and smoothly, the two clusters recently ran into similar problems:
some OSDs consume ca. 18 GB while the memory target is set at 2 GB.

What could be going wrong in the background? Does it mean there are slow OSD
memory leak issues with 14.2.22 that I do not know about yet?

I would highly appreciate it if someone could provide any clues, ideas, or
comments.

best regards, 

Samuel 



huxia...@horebdata.cn 


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-12 Thread Frédéric Nass

Hello, 
  
We've had a similar situation recently where OSDs would use way more memory 
than osd_memory_target and get OOM killed by the kernel. 
It was due to a kernel bug related to cgroups [1]. 
  
If num_cgroups below keeps increasing then you may hit this bug.
 
  
$ cat /proc/cgroups | grep -e subsys -e blkio | column -t 
   #subsys_name  hierarchy  num_cgroups  enabled 
   blkio         4          1099         1 
  
If you hit this bug, upgrading the OSD nodes' kernels should get you through.
If you can't access the Red Hat KB [1], let me know your current nodes' kernel
version and I'll check for you.

Regards,
Frédéric.
 
  
[1] https://access.redhat.com/solutions/7014337 

-----Original Message-----

From: huxiaoyu
To: ceph-users
Sent: Wednesday, January 10, 2024 19:21 CET
Subject: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

Dear Ceph folks, 

I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version,
one with replication 3 and the other with EC 4+2. After around 400 days running
quietly and smoothly, the two clusters recently ran into similar problems:
some OSDs consume ca. 18 GB while the memory target is set at 2 GB.

What could be going wrong in the background? Does it mean there are slow OSD
memory leak issues with 14.2.22 that I do not know about yet?

I would highly appreciate it if someone could provide any clues, ideas, or
comments.

best regards, 

Samuel 



huxia...@horebdata.cn 


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-10 Thread Janne Johansson
On Wed, 10 Jan 2024 at 19:20, huxia...@horebdata.cn wrote:
> Dear Ceph folks,
>
> I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version,
> one with replication 3 and the other with EC 4+2. After around 400 days
> running quietly and smoothly, the two clusters recently ran into similar
> problems: some OSDs consume ca. 18 GB while the memory target is set at 2 GB.
>
> What could be going wrong in the background? Does it mean there are slow OSD
> memory leak issues with 14.2.22 that I do not know about yet?

While I am sorry not to be able to help you with the actual problem, I just
wanted to comment that the memory target and other user-selectable RAM sizes
cover only part of what the OSD will use (even with no bugs or memory leaks).
You can tell the OSD to aim for 2, 6, or 12 G and it will work towards that
goal for the resizable buffers, caches, and so on, but there are parts you
don't get to control, and those (at least during recovery and the like) can
eat tons of RAM regardless of your preferences. The OSD will try to allocate
as much as it feels is needed to fix itself, without considering the targets
you may have set for it.

It seems very strange that your OSDs would all jump to 9x the expected RAM at
the same time, and I really hope you can figure out why and get back to normal
operations. I just wanted to add my comment so that others sizing their OSD
hosts don't think "if I set the target to 2 G, then I can run 8 OSDs on this
16 G RAM box and all will be fine, since no OSD will use more than 2". It
works like that when everything is fine and healthy, and while the cluster is
new and almost empty, but not in all situations. When you do get memory leaks,
having tons of RAM only delays the inevitable, so buying too much RAM isn't a
solution either.

I would at least try setting noout+norebalance, restarting one of the
problematic OSDs, and seeing whether it quickly returns to this huge memory
overdraw or not.
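As a shell sketch, that test could look roughly like this (the OSD id 123, the
systemd unit name, and the CEPH/SYSTEMCTL indirection for dry-running are my
assumptions, not from the thread):

```shell
# Restart one bloated OSD without triggering data movement, then watch its RSS.
# CEPH and SYSTEMCTL are overridable so the sequence can be dry-run as 'echo'.
CEPH="${CEPH:-ceph}"
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_osd_pinned() {
  local id="$1"
  "$CEPH" osd set noout          # don't mark the OSD out while it is down
  "$CEPH" osd set norebalance    # don't move PGs around during the restart
  "$SYSTEMCTL" restart "ceph-osd@${id}"
  "$CEPH" osd unset norebalance
  "$CEPH" osd unset noout
}

# On an affected node: restart_osd_pinned 123
# then watch the daemon's memory (e.g. with top or ps) to see whether it
# climbs back toward the huge overdraw.
```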

-- 
May the most significant bit of your life be positive.


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-10 Thread Dan van der Ster
Hi Samuel,

It can be a few things. A good place to start is to dump the mempools of one of
those bloated OSDs:

`ceph daemon osd.123 dump_mempools`
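To make that dump easier to read, the by_pool section can be sorted by bytes so
the biggest consumer floats to the top. A small sketch (the helper name and the
use of python3 for the JSON parsing are my own, not part of the message above):

```shell
# Sort an OSD's mempools by bytes, largest first.
# Feed it the output of: ceph daemon osd.<id> dump_mempools
mempool_top() {
  python3 -c 'import json, sys
pools = json.load(sys.stdin)["mempool"]["by_pool"]
for name, stats in sorted(pools.items(), key=lambda kv: -kv[1]["bytes"]):
    print(name, stats["bytes"], stats["items"], sep="\t")'
}

# On a node: ceph daemon osd.123 dump_mempools | mempool_top | head
# A sample document for illustration:
echo '{"mempool":{"by_pool":{"osd_pglog":{"items":5,"bytes":50},"buffer_anon":{"items":2,"bytes":20}}}}' | mempool_top
```

Whichever pool dominates (for example osd_pglog versus the BlueStore caches)
narrows down where the extra memory is going.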

Cheers, Dan


--
Dan van der Ster
CTO

Clyso GmbH
p: +49 89 215252722 | a: Vancouver, Canada
w: https://clyso.com | e: dan.vanders...@clyso.com

We are hiring: https://www.clyso.com/jobs/



On Wed, Jan 10, 2024 at 10:20 AM huxia...@horebdata.cn <
huxia...@horebdata.cn> wrote:

> Dear Ceph folks,
>
> I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version,
> one with replication 3 and the other with EC 4+2. After around 400 days
> running quietly and smoothly, the two clusters recently ran into similar
> problems: some OSDs consume ca. 18 GB while the memory target is set at 2 GB.
>
> What could be going wrong in the background? Does it mean there are slow OSD
> memory leak issues with 14.2.22 that I do not know about yet?
>
> I would highly appreciate it if someone could provide any clues, ideas, or
> comments.
>
> best regards,
>
> Samuel
>
>
>
> huxia...@horebdata.cn