Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Gregory Farnum
On Wed, Oct 9, 2019 at 10:58 AM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:

> Best I can tell, automatic cache sizing is enabled and all related
> settings are at their default values.
>
> Looking through cache tunables, I came across
> osd_memory_expected_fragmentation, which the docs define as "estimate
> the percent of memory fragmentation". What's the formula to compute the
> actual percentage of memory fragmentation?
>
> Based on /proc/buddyinfo, I suspect that our memory fragmentation is a
> lot worse than the osd_memory_expected_fragmentation default of 0.15. Could
> this be related to many OSDs' RSSes far exceeding osd_memory_target?
>
> So far high memory consumption hasn't been a problem for us. (I guess
> it's possible that the kernel simply sees no need to reclaim unmapped
> memory until there is actually real memory pressure?)


Oh, well, that you can check on the admin socket using the “heap” family of
commands. It’ll tell you how much the daemon is actually using out of
what’s allocated, and IIRC how much it has given back to the OS but maybe
hasn’t actually been reclaimed yet.
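
A rough example against one of the affected daemons (osd.12 is just a
placeholder id; run it on the host that owns the OSD, and the exact output
may differ a bit by release):

    ceph daemon osd.12 heap stats      # tcmalloc view: bytes in use vs. allocated vs. unmapped
    ceph daemon osd.12 heap release    # ask tcmalloc to hand freed pages back to the OS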

> It's just a little
> scary not understanding why this started happening when memory usage had
> been so stable before.


>
> Thanks,
>
> Vlad
>
>
>
> On 10/9/19 11:51 AM, Gregory Farnum wrote:
> > On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
> >  wrote:
> >>
> >>   > Do you have statistics on the size of the OSDMaps or count of them
> >>   > which were being maintained by the OSDs?
> >> No, I don't think so. How can I find this information?
> >
> > Hmm I don't know if we directly expose the size of maps. There are
> > perfcounters which expose the range of maps being kept around but I
> > don't know their names off-hand.
> >
> > Maybe it's something else involving the bluestore cache or whatever;
> > if you're not using the newer memory limits I'd switch to those but
> > otherwise I dunno.
> > -Greg
> >
> >>
> >> Memory consumption started to climb again:
> >> https://icecube.wisc.edu/~vbrik/graph-3.png
> >>
> >> Some more info (not sure if relevant or not):
> >>
> >> I increased the size of the swap on the servers to 10GB and it's being
> >> completely utilized, even though there is still quite a bit of free
> >> memory.
> >>
> >> It appears that memory is highly fragmented on NUMA node 0 of all
> >> the servers. Some of the servers have no free pages higher than order 0.
> >> (Memory on NUMA node 1 of the servers appears much less fragmented.)
> >>
> >> The servers have 192GB of RAM, 2 NUMA nodes.
> >>
> >>
> >> Vlad
> >>
> >>
> >>
> >> On 10/4/19 6:09 PM, Gregory Farnum wrote:
> >>> Do you have statistics on the size of the OSDMaps or count of them
> >>> which were being maintained by the OSDs? I'm not sure why having noout
> >>> set would change that if all the nodes were alive, but that's my bet.
> >>> -Greg
> >>>
> >>> On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
> >>>  wrote:
> 
>  And, just as unexpectedly, things have returned to normal overnight
>  https://icecube.wisc.edu/~vbrik/graph-1.png
> 
>  The change seems to have coincided with the beginning of Rados Gateway
>  activity (before, it was essentially zero). I can see nothing in the
>  logs that would explain what happened though.
> 
>  Vlad
> 
> 
> 
>  On 10/2/19 3:43 PM, Vladimir Brik wrote:
> > Hello
> >
> > I am running a Ceph 14.2.2 cluster and a few days ago, memory
> > consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> > after being stable for about 6 months.
> >
> > Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> > Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >
> > I am not sure what changed to cause this. Cluster usage has been very
> > light (typically <10 iops) during this period, and the number of objects
> > stayed about the same.
> >
> > The only unusual occurrence was the reboot of one of the nodes the day
> > before (a firmware update). For the reboot, I ran "ceph osd set noout",
> > but forgot to unset it until several days later. Unsetting noout did not
> > stop the increase in memory consumption.
> >
> > I don't see anything unusual in the logs.
> >
> > Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
> > 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> > don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> > utilized, with 101-104 PGs.
> >
> > Does anybody know what might be the problem here and how to address or
> > debug it?
> >
> >
> > Thanks very much,
> >
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Vladimir Brik
Best I can tell, automatic cache sizing is enabled and all related 
settings are at their default values.
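
(For the record, this is what I'm checking them with; osd.0 is just whichever
daemon I happen to query, and the grep pattern only covers the options I
think are relevant:

    ceph daemon osd.0 config show | \
        egrep 'bluestore_cache_autotune|osd_memory_(target|cache_min|expected_fragmentation)'
)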


Looking through cache tunables, I came across 
osd_memory_expected_fragmentation, which the docs define as "estimate 
the percent of memory fragmentation". What's the formula to compute the
actual percentage of memory fragmentation?


Based on /proc/buddyinfo, I suspect that our memory fragmentation is a 
lot worse than the osd_memory_expected_fragmentation default of 0.15. Could
this be related to many OSDs' RSSes far exceeding osd_memory_target?
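
(The crude number I've been eyeballing is just the share of free memory that
is sitting in order-0 pages, summed over every node/zone line in
/proc/buddyinfo; this is my own back-of-the-envelope, assuming the usual 11
orders, and quite possibly not what osd_memory_expected_fragmentation
actually measures:

    awk '{ for (i = 0; i <= 10; i++) { b = $(i+5) * 2^i; total += b; if (i == 0) low += b } }
         END { printf "free memory in order-0 pages: %.1f%%\n", 100 * low / total }' /proc/buddyinfo
)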


So far high memory consumption hasn't been a problem for us. (I guess 
it's possible that the kernel simply sees no need to reclaim unmapped 
memory until there is actually real memory pressure?) It's just a little 
scary not understanding why this started happening when memory usage had 
been so stable before.


Thanks,

Vlad



On 10/9/19 11:51 AM, Gregory Farnum wrote:

On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
 wrote:


  > Do you have statistics on the size of the OSDMaps or count of them
  > which were being maintained by the OSDs?
No, I don't think so. How can I find this information?


Hmm I don't know if we directly expose the size of maps. There are
perfcounters which expose the range of maps being kept around but I
don't know their names off-hand.

Maybe it's something else involving the bluestore cache or whatever;
if you're not using the newer memory limits I'd switch to those but
otherwise I dunno.
-Greg



Memory consumption started to climb again:
https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure if relevant or not):

I increased the size of the swap on the servers to 10GB and it's being
completely utilized, even though there is still quite a bit of free memory.

It appears that memory is highly fragmented on NUMA node 0 of all
the servers. Some of the servers have no free pages higher than order 0.
(Memory on NUMA node 1 of the servers appears much less fragmented.)

The servers have 192GB of RAM, 2 NUMA nodes.


Vlad



On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:


And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway
activity (before, it was essentially zero). I can see nothing in the
logs that would explain what happened though.

Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory
consumption of our OSDs started to unexpectedly grow on all 5 nodes,
after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very
light (typically <10 iops) during this period, and the number of objects
stayed about the same.

The only unusual occurrence was the reboot of one of the nodes the day
before (a firmware update). For the reboot, I ran "ceph osd set noout",
but forgot to unset it until several days later. Unsetting noout did not
stop the increase in memory consumption.

I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
don't know why there is such a big spread. All HDDs are 10TB, 72-76%
utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or
debug it?


Thanks very much,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Gregory Farnum
On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
 wrote:
>
>  > Do you have statistics on the size of the OSDMaps or count of them
>  > which were being maintained by the OSDs?
> No, I don't think so. How can I find this information?

Hmm I don't know if we directly expose the size of maps. There are
perfcounters which expose the range of maps being kept around but I
don't know their names off-hand.
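
Something like this should at least show the epoch range, if not the sizes
(treat it as a sketch; the status fields and counter names can vary a bit
between releases):

    ceph daemon osd.0 status                    # IIRC includes oldest_map / newest_map held by this OSD
    ceph daemon osd.0 perf dump | grep -i map   # map-related perfcounters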

Maybe it's something else involving the bluestore cache or whatever;
if you're not using the newer memory limits I'd switch to those but
otherwise I dunno.
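
(By the newer limits I mean osd_memory_target; if you wanted to set it
cluster-wide, the command would be roughly the following. The 4 GiB value is
just an example; size it to your actual RAM budget:

    ceph config set osd osd_memory_target 4294967296
)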
-Greg

>
> Memory consumption started to climb again:
> https://icecube.wisc.edu/~vbrik/graph-3.png
>
> Some more info (not sure if relevant or not):
>
> I increased the size of the swap on the servers to 10GB and it's being
> completely utilized, even though there is still quite a bit of free memory.
>
> It appears that memory is highly fragmented on NUMA node 0 of all
> the servers. Some of the servers have no free pages higher than order 0.
> (Memory on NUMA node 1 of the servers appears much less fragmented.)
>
> The servers have 192GB of RAM, 2 NUMA nodes.
>
>
> Vlad
>
>
>
> On 10/4/19 6:09 PM, Gregory Farnum wrote:
> > Do you have statistics on the size of the OSDMaps or count of them
> > which were being maintained by the OSDs? I'm not sure why having noout
> > set would change that if all the nodes were alive, but that's my bet.
> > -Greg
> >
> > On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
> >  wrote:
> >>
> >> And, just as unexpectedly, things have returned to normal overnight
> >> https://icecube.wisc.edu/~vbrik/graph-1.png
> >>
> >> The change seems to have coincided with the beginning of Rados Gateway
> >> activity (before, it was essentially zero). I can see nothing in the
> >> logs that would explain what happened though.
> >>
> >> Vlad
> >>
> >>
> >>
> >> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> >>> Hello
> >>>
> >>> I am running a Ceph 14.2.2 cluster and a few days ago, memory
> >>> consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> >>> after being stable for about 6 months.
> >>>
> >>> Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> >>> Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >>>
> >>> I am not sure what changed to cause this. Cluster usage has been very
> >>> light (typically <10 iops) during this period, and the number of objects
> >>> stayed about the same.
> >>>
> >>> The only unusual occurrence was the reboot of one of the nodes the day
> >>> before (a firmware update). For the reboot, I ran "ceph osd set noout",
> >>> but forgot to unset it until several days later. Unsetting noout did not
> >>> stop the increase in memory consumption.
> >>>
> >>> I don't see anything unusual in the logs.
> >>>
> >>> Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
> >>> 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> >>> don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> >>> utilized, with 101-104 PGs.
> >>>
> >>> Does anybody know what might be the problem here and how to address or
> >>> debug it?
> >>>
> >>>
> >>> Thanks very much,
> >>>
> >>> Vlad
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-07 Thread Vladimir Brik

> Do you have statistics on the size of the OSDMaps or count of them
> which were being maintained by the OSDs?
No, I don't think so. How can I find this information?

Memory consumption started to climb again:
https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure if relevant or not):

I increased the size of the swap on the servers to 10GB and it's being
completely utilized, even though there is still quite a bit of free memory.
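
(I haven't yet confirmed how much of that swap actually belongs to the
ceph-osd processes themselves; a quick check I plan to run on each host,
nothing Ceph-specific:

    for p in $(pgrep ceph-osd); do echo -n "$p "; grep VmSwap /proc/$p/status; done
)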


It appears that memory is highly fragmented on NUMA node 0 of all
the servers. Some of the servers have no free pages higher than order 0. 
(Memory on NUMA node 1 of the servers appears much less fragmented.)


The servers have 192GB of RAM, 2 NUMA nodes.


Vlad



On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:


And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway
activity (before, it was essentially zero). I can see nothing in the
logs that would explain what happened though.

Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory
consumption of our OSDs started to unexpectedly grow on all 5 nodes,
after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very
light (typically <10 iops) during this period, and the number of objects
stayed about the same.

The only unusual occurrence was the reboot of one of the nodes the day
before (a firmware update). For the reboot, I ran "ceph osd set noout",
but forgot to unset it until several days later. Unsetting noout did not
stop the increase in memory consumption.

I don't see anything unusual in the logs.

> > Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
don't know why there is such a big spread. All HDDs are 10TB, 72-76%
utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or
debug it?


Thanks very much,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-04 Thread Gregory Farnum
Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
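
(If you want to look at the size of a map directly, the quickest thing I can
think of is to dump the current one to a file and check how big it is; just
a rough way to get a number:

    ceph osd getmap -o /tmp/osdmap && ls -lh /tmp/osdmap
)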
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:
>
> And, just as unexpectedly, things have returned to normal overnight
> https://icecube.wisc.edu/~vbrik/graph-1.png
>
> The change seems to have coincided with the beginning of Rados Gateway
> activity (before, it was essentially zero). I can see nothing in the
> logs that would explain what happened though.
>
> Vlad
>
>
>
> On 10/2/19 3:43 PM, Vladimir Brik wrote:
> > Hello
> >
> > I am running a Ceph 14.2.2 cluster and a few days ago, memory
> > consumption of our OSDs started to unexpectedly grow on all 5 nodes,
> > after being stable for about 6 months.
> >
> > Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
> > Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png
> >
> > I am not sure what changed to cause this. Cluster usage has been very
> > light (typically <10 iops) during this period, and the number of objects
> > stayed about the same.
> >
> > The only unusual occurrence was the reboot of one of the nodes the day
> > before (a firmware update). For the reboot, I ran "ceph osd set noout",
> > but forgot to unset it until several days later. Unsetting noout did not
> > stop the increase in memory consumption.
> >
> > I don't see anything unusual in the logs.
> >
> > Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
> > 3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I
> > don't know why there is such a big spread. All HDDs are 10TB, 72-76%
> > utilized, with 101-104 PGs.
> >
> > Does anybody know what might be the problem here and how to address or
> > debug it?
> >
> >
> > Thanks very much,
> >
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-03 Thread Vladimir Brik

And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway 
activity (before, it was essentially zero). I can see nothing in the 
logs that would explain what happened though.


Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory 
consumption of our OSDs started to unexpectedly grow on all 5 nodes, 
after being stable for about 6 months.


Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very 
light (typically <10 iops) during this period, and the number of objects 
stayed about the same.


The only unusual occurrence was the reboot of one of the nodes the day 
before (a firmware update). For the reboot, I ran "ceph osd set noout", 
but forgot to unset it until several days later. Unsetting noout did not 
stop the increase in memory consumption.


I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I 
don't know why there is such a big spread. All HDDs are 10TB, 72-76% 
utilized, with 101-104 PGs.


Does anybody know what might be the problem here and how to address or 
debug it?



Thanks very much,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-02 Thread Vladimir Brik

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory 
consumption of our OSDs started to unexpectedly grow on all 5 nodes, 
after being stable for about 6 months.


Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very 
light (typically <10 iops) during this period, and the number of objects 
stayed about the same.


The only unusual occurrence was the reboot of one of the nodes the day 
before (a firmware update). For the reboot, I ran "ceph osd set noout", 
but forgot to unset it until several days later. Unsetting noout did not 
stop the increase in memory consumption.


I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. Resident set size of SSD OSDs is about
3.7GB. Resident set size of HDD OSDs varies from about 5GB to 12GB. I 
don't know why there is such a big spread. All HDDs are 10TB, 72-76% 
utilized, with 101-104 PGs.
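
(In case it's useful, this is roughly how I've been comparing them; <id>
stands for whichever OSD I'm poking at, and the mempool dump is just my
guess at where to look for where the memory goes:

    ps -o pid,rss,cmd -C ceph-osd --sort rss    # per-OSD resident set, largest last
    ceph daemon osd.<id> dump_mempools          # bluestore/bluefs memory accounting
)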


Does anybody know what might be the problem here and how to address or 
debug it?



Thanks very much,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com