[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-07 Thread Rok Jaklič
Hi,

We have also experienced several ceph-mgr OOM kills on Ceph v16.2.13, on a
cluster with 120T/200T of data.

Is there any tracker about the problem?

Does upgrading to 17.x "solve" the problem?

Kind regards,
Rok



On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta  wrote:

> Dear Cephers,
>
> Today brought us an eventful CLT meeting: it looks like Jitsi recently
> started
> requiring user authentication
>  (anonymous users
> will get a "Waiting for a moderator" modal), but authentication didn't work
> against Google or GitHub accounts, so we had to move to the good old Google
> Meet.
>
> As a result of this, Neha has kindly set up a new private Slack channel
> (#clt) to allow for quicker communication among CLT members (if you usually
> attend the CLT meeting and have not been added, please ping any CLT member
> to request that).
>
> Now, let's move on to the important stuff:
>
> *The latest Pacific Release (v16.2.14)*
>
> *The Bad*
> The 14th drop of the Pacific release has landed with a few hiccups:
>
>- Some .deb packages were made available to downloads.ceph.com before
>the release process completion. Although this is not the first time it
>happens, we want to ensure this is the last one, so we'd like to gather
>ideas to improve the release publishing process. Neha encouraged
> everyone
>to share ideas here:
>   - https://tracker.ceph.com/issues/62671
>   - https://tracker.ceph.com/issues/62672
>   - v16.2.14 also hit issues during the ceph-container stage. Laura
>wanted to raise awareness of its current setbacks
> and collect ideas to tackle
>them:
>   - Enforce reviews and mandatory CI checks
>   - Rework the current approach to use simple Dockerfiles
>   
>   - Call the Ceph community for help: ceph-container is currently
>   maintained part-time by a single contributor (Guillaume Abrioux).
> This
>   sub-project would benefit from the sound expertise on containers
> among Ceph
>   users. If you have ever considered contributing to Ceph, but felt a
> bit
>   intimidated by C++, Paxos and race conditions, ceph-container is a
> good
>   place to shed your fear.
>
>
> *The Good*
> Not everything about v16.2.14 was going to be bleak: David Orman brought us
> really good news. They tested v16.2.14 on a large production cluster
> (10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
> affecting RGW in Pacific.
>
> *The Ugly*
> During that testing, they noticed that ceph-mgr was occasionally OOM killed
> (nothing new to 16.2.14, as it was previously reported). They already
> tried:
>
>- Disabling modules (like the restful one, which was a suspect)
>- Enabling debug 20
>- Turning the pg autoscaler off
>
> Debugging will continue to characterize this issue:
>
>- Enable profiling (Mark Nelson)
>- Try Bloomberg's Python mem profiler
> (Matthew Leonard)
>
>
> *Infrastructure*
>
> *Reminder: Infrastructure Meeting Tomorrow, 11:30-12:30 Central Time*
>
> Patrick brought up the following topics:
>
>- Need to reduce the OVH spending ($72k/year, which is a good cut in the
>Ceph Foundation budget, that's a lot less avocado sandwiches for the
> next
>Cephalocon):
>   - Move services (e.g.: Chacra) to the Sepia lab
>   - Re-use CentOS (and any spared/unused) machines for devel purposes
>- Current Ceph sys admins are overloaded, so devel/community involvement
>would be much appreciated.
>- More to be discussed in tomorrow's meeting. Please join if you
>think you can help solve/improve the Ceph infrastructure!
>
>
> *BTW*: today's CDM will be canceled, since no topics were proposed.
>
> Kind Regards,
>
> Ernesto
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-07 Thread Mark Nelson

Hi Rok,

We're still trying to catch what's causing the memory growth, so it's hard
to guess which releases are affected.  We know it's happening
intermittently on at least one live Pacific cluster.  If you have the
ability to catch it while it's happening, there are several
approaches/tools that might aid in diagnosing it.  Container deployments
are a bit tougher to get debugging tools working in, though, which AFAIK
has slowed down existing attempts at diagnosing the issue.
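
As a concrete example of one of those approaches, here is a rough sketch
of the tcmalloc heap profiler route from the memory-profiling docs. It
assumes the active mgr is named mgr.a and that ceph-mgr honors the same
heap commands as the other daemons (worth verifying on your build; if the
tell form is rejected, the same subcommands are usually available through
the admin socket on the mgr host):

    # check the current heap usage of the active mgr
    ceph tell mgr.a heap stats

    # start the tcmalloc heap profiler, let the memory grow for a while,
    # then dump a profile and stop
    ceph tell mgr.a heap start_profiler
    ceph tell mgr.a heap dump
    ceph tell mgr.a heap stop_profiler

    # admin-socket equivalent, run on the host carrying the active mgr
    ceph daemon mgr.a heap stats

The dumped *.heap files typically land next to the mgr log and can be
inspected with google-pprof (gperftools).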


Mark

On 9/7/23 05:55, Rok Jaklič wrote:

Hi,

We have also experienced several ceph-mgr OOM kills on Ceph v16.2.13, on a
cluster with 120T/200T of data.

Is there any tracker about the problem?

Does upgrading to 17.x "solve" the problem?

Kind regards,
Rok



On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta  wrote:


Dear Cephers,

Today brought us an eventful CLT meeting: it looks like Jitsi recently
started
requiring user authentication
 (anonymous users
will get a "Waiting for a moderator" modal), but authentication didn't work
against Google or GitHub accounts, so we had to move to the good old Google
Meet.

As a result of this, Neha has kindly set up a new private Slack channel
(#clt) to allow for quicker communication among CLT members (if you usually
attend the CLT meeting and have not been added, please ping any CLT member
to request that).

Now, let's move on to the important stuff:

*The latest Pacific Release (v16.2.14)*

*The Bad*
The 14th drop of the Pacific release has landed with a few hiccups:

- Some .deb packages were made available to downloads.ceph.com before
the release process completion. Although this is not the first time it
happens, we want to ensure this is the last one, so we'd like to gather
ideas to improve the release publishing process. Neha encouraged
everyone
to share ideas here:
   - https://tracker.ceph.com/issues/62671
   - https://tracker.ceph.com/issues/62672
   - v16.2.14 also hit issues during the ceph-container stage. Laura
wanted to raise awareness of its current setbacks
 and collect ideas to tackle
them:
   - Enforce reviews and mandatory CI checks
   - Rework the current approach to use simple Dockerfiles
   
   - Call the Ceph community for help: ceph-container is currently
   maintained part-time by a single contributor (Guillaume Abrioux).
This
   sub-project would benefit from the sound expertise on containers
among Ceph
   users. If you have ever considered contributing to Ceph, but felt a
bit
   intimidated by C++, Paxos and race conditions, ceph-container is a
good
   place to shed your fear.


*The Good*
Not everything about v16.2.14 was going to be bleak: David Orman brought us
really good news. They tested v16.2.14 on a large production cluster
(10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
affecting RGW in Pacific .

*The Ugly*
During that testing, they noticed that ceph-mgr was occasionally OOM killed
(nothing new to 16.2.14, as it was previously reported). They already
tried:

- Disabling modules (like the restful one, which was a suspect)
- Enabling debug 20
- Turning the pg autoscaler off

Debugging will continue to characterize this issue:

- Enable profiling (Mark Nelson)
- Try Bloomberg's Python mem profiler
 (Matthew Leonard)


*Infrastructure*

*Reminder: Infrastructure Meeting Tomorrow. **11:30-12:30 Central Time*

Patrick brought up the following topics:

- Need to reduce the OVH spending ($72k/year, which is a good cut in the
Ceph Foundation budget, that's a lot less avocado sandwiches for the
next
Cephalocon):
   - Move services (e.g.: Chacra) to the Sepia lab
   - Re-use CentOS (and any spared/unused) machines for devel purposes
- Current Ceph sys admins are overloaded, so devel/community involvement
would be much appreciated.
- More to be discussed in tomorrow's meeting. Please join if you
think you can help solve/improve the Ceph infrastructure!


*BTW*: today's CDM will be canceled, since no topics were proposed.

Kind Regards,

Ernesto
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-08 Thread Rok Jaklič
We do not use containers.

Anything special for debugging, or should we try something from the
previous email?
   - Enable profiling (Mark Nelson)
   - Try Bloomberg's Python mem profiler (Matthew Leonard)

By profiling, do you mean the instructions from
https://docs.ceph.com/en/pacific/rados/troubleshooting/memory-profiling/ ?
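
For the Bloomberg option (the tool is memray), a purely hypothetical
session could look like the following, assuming a recent memray with
live-attach support and gdb/lldb available on the mgr host; whether it
copes with the Python interpreter embedded in ceph-mgr is exactly what
would need testing:

    # hypothetical invocation -- check "memray attach --help" on your version
    pip install memray
    memray attach --output /tmp/ceph-mgr-leak.bin $(pidof ceph-mgr)

    # after letting it record for a while, render the capture
    memray flamegraph /tmp/ceph-mgr-leak.bin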

Rok

On Thu, Sep 7, 2023 at 9:34 PM Mark Nelson  wrote:

> Hi Rok,
>
> We're still trying to catch what's causing the memory growth, so it's hard
> to guess at which releases are affected.  We know it's happening
> intermittently on a live Pacific cluster at least.  If you have the
> ability to catch it while it's happening, there are several
> approaches/tools that might aid in diagnosing it. Container deployments
> are a bit tougher to get debugging tools working in though which afaik
> has slowed down existing attempts at diagnosing the issue.
>
> Mark
>
> On 9/7/23 05:55, Rok Jaklič wrote:
> > Hi,
> >
> > We have also experienced several ceph-mgr OOM kills on Ceph v16.2.13, on a
> > cluster with 120T/200T of data.
> >
> > Is there any tracker about the problem?
> >
> > Does upgrading to 17.x "solve" the problem?
> >
> > Kind regards,
> > Rok
> >
> >
> >
> > On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta 
> wrote:
> >
> >> Dear Cephers,
> >>
> >> Today brought us an eventful CLT meeting: it looks like Jitsi recently
> >> started
> >> requiring user authentication
> >>  (anonymous
> users
> >> will get a "Waiting for a moderator" modal), but authentication didn't
> work
> >> against Google or GitHub accounts, so we had to move to the good old
> Google
> >> Meet.
> >>
> >> As a result of this, Neha has kindly set up a new private Slack channel
> >> (#clt) to allow for quicker communication among CLT members (if you
> usually
> >> attend the CLT meeting and have not been added, please ping any CLT
> member
> >> to request that).
> >>
> >> Now, let's move on to the important stuff:
> >>
> >> *The latest Pacific Release (v16.2.14)*
> >>
> >> *The Bad*
> >> The 14th drop of the Pacific release has landed with a few hiccups:
> >>
> >> - Some .deb packages were made available to downloads.ceph.com
> before
> >> the release process completion. Although this is not the first time
> it
> >> happens, we want to ensure this is the last one, so we'd like to
> gather
> >> ideas to improve the release publishing process. Neha encouraged
> >> everyone
> >> to share ideas here:
> >>- https://tracker.ceph.com/issues/62671
> >>- https://tracker.ceph.com/issues/62672
> >>- v16.2.14 also hit issues during the ceph-container stage. Laura
> >> wanted to raise awareness of its current setbacks
> >>  and collect ideas to
> tackle
> >> them:
> >>- Enforce reviews and mandatory CI checks
> >>- Rework the current approach to use simple Dockerfiles
> >>
> >>- Call the Ceph community for help: ceph-container is currently
> >>maintained part-time by a single contributor (Guillaume Abrioux).
> >> This
> >>sub-project would benefit from the sound expertise on containers
> >> among Ceph
> >>users. If you have ever considered contributing to Ceph, but
> felt a
> >> bit
> >>intimidated by C++, Paxos and race conditions, ceph-container is
> a
> >> good
> >>place to shed your fear.
> >>
> >>
> >> *The Good*
> >> Not everything about v16.2.14 was going to be bleak: David Orman
> brought us
> >> really good news. They tested v16.2.14 on a large production cluster
> >> (10gbit/s+ RGW and ~13PiB raw) and found that it solved a major issue
> >> affecting RGW in Pacific .
> >>
> >> *The Ugly*
> >> During that testing, they noticed that ceph-mgr was occasionally OOM
> killed
> >> (nothing new to 16.2.14, as it was previously reported). They already
> >> tried:
> >>
> >> - Disabling modules (like the restful one, which was a suspect)
> >> - Enabling debug 20
> >> - Turning the pg autoscaler off
> >>
> >> Debugging will continue to characterize this issue:
> >>
> >> - Enable profiling (Mark Nelson)
> >> - Try Bloomberg's Python mem profiler
> >>  (Matthew Leonard)
> >>
> >>
> >> *Infrastructure*
> >>
> >> *Reminder: Infrastructure Meeting Tomorrow. **11:30-12:30 Central Time*
> >>
> >> Patrick brought up the following topics:
> >>
> >> - Need to reduce the OVH spending ($72k/year, which is a good cut
> in the
> >> Ceph Foundation budget, that's a lot less avocado sandwiches for the
> >> next
> >> Cephalocon):
> >>- Move services (e.g.: Chacra) to the Sepia lab
> >>- Re-use CentOS (and any spared/unused) machines for devel
> purposes
> >> - Current Ceph sys admins are overloaded, so devel/community
> involvement
> >> would be much appreciated.

[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-08 Thread Loïc Tortay

On 07/09/2023 21:33, Mark Nelson wrote:

Hi Rok,

We're still trying to catch what's causing the memory growth, so it's hard 
to guess at which releases are affected.  We know it's happening 
intermittently on a live Pacific cluster at least.  If you have the 
ability to catch it while it's happening, there are several 
approaches/tools that might aid in diagnosing it. Container deployments 
are a bit tougher to get debugging tools working in though which afaik 
has slowed down existing attempts at diagnosing the issue.



Hello,
We have a cluster recently upgraded from Octopus to Pacific 16.2.13 
where the active MGR was OOM-killed a few times.


We have another cluster that was recently upgraded from 16.2.11 to 
16.2.14 and the issue also started to appear (very soon) on that cluster.

We didn't have the issue before, during the months running 16.2.11.

In short: the issue seems to be due to a change in 16.2.12 or 16.2.13.
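
For anyone trying to catch the growth in the act and correlate it with the
running release, something simple is often enough. A minimal sketch (plain
shell, no containers assumed):

    # confirm which version each daemon is actually running
    ceph versions

    # log the ceph-mgr resident set size once a minute
    while true; do
        date +%FT%T
        ps -o pid=,rss=,etime=,comm= -C ceph-mgr
        sleep 60
    done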


Loïc.
--
|   Loïc Tortay  - IN2P3 Computing Centre  |
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-08 Thread David Orman
I would suggest updating: https://tracker.ceph.com/issues/59580

We did notice it with 16.2.13 as well, after upgrading from .10, so it likely 
appeared somewhere in between those two releases.

David

On Fri, Sep 8, 2023, at 04:00, Loïc Tortay wrote:
> On 07/09/2023 21:33, Mark Nelson wrote:
>> Hi Rok,
>> 
>> We're still trying to catch what's causing the memory growth, so it's hard 
>> to guess at which releases are affected.  We know it's happening 
>> intermittently on a live Pacific cluster at least.  If you have the 
>> ability to catch it while it's happening, there are several 
>> approaches/tools that might aid in diagnosing it. Container deployments 
>> are a bit tougher to get debugging tools working in though which afaik 
>> has slowed down existing attempts at diagnosing the issue.
>> 
> Hello,
> We have a cluster recently upgraded from Octopus to Pacific 16.2.13 
> where the active MGR was OOM-killed a few times.
>
> We have another cluster that was recently upgraded from 16.2.11 to 
> 16.2.14 and the issue also started to appear (very soon) on that cluster.
> We didn't have the issue before, during the months running 16.2.11.
>
> In short: the issue seems to be due to a change in 16.2.12 or 16.2.13.
>
>
> Loïc.
> -- 
> |   Loïc Tortay  - IN2P3 Computing Centre  |
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph_leadership_team_meeting_s18e06.mkv

2023-09-11 Thread Rok Jaklič
I can confirm this, as we also did the upgrade from .10.

Rok


On Fri, Sep 8, 2023 at 5:26 PM David Orman  wrote:

> I would suggest updating: https://tracker.ceph.com/issues/59580
>
> We did notice it with 16.2.13, as well, after upgrading from .10, so
> likely in-between those two releases.
>
> David
>
> On Fri, Sep 8, 2023, at 04:00, Loïc Tortay wrote:
> > On 07/09/2023 21:33, Mark Nelson wrote:
> >> Hi Rok,
> >>
> >> We're still trying to catch what's causing the memory growth, so it's hard
> >> to guess at which releases are affected.  We know it's happening
> >> intermittently on a live Pacific cluster at least.  If you have the
> >> ability to catch it while it's happening, there are several
> >> approaches/tools that might aid in diagnosing it. Container deployments
> >> are a bit tougher to get debugging tools working in though which afaik
> >> has slowed down existing attempts at diagnosing the issue.
> >>
> > Hello,
> > We have a cluster recently upgraded from Octopus to Pacific 16.2.13
> > where the active MGR was OOM-killed a few times.
> >
> > We have another cluster that was recently upgraded from 16.2.11 to
> > 16.2.14 and the issue also started to appear (very soon) on that cluster.
> > We didn't have the issue before, during the months running 16.2.11.
> >
> > In short: the issue seems to be due to a change in 16.2.12 or 16.2.13.
> >
> >
> > Loïc.
> > --
> > |   Loïc Tortay  - IN2P3 Computing Centre  |
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io