Hi Rok,

We're still trying to pin down what's causing the memory growth, so it's hard to guess which releases are affected. We know it's happening intermittently on at least one live Pacific cluster. If you are able to catch it while it's happening, there are several approaches/tools that might help diagnose it. Container deployments are a bit tougher to get debugging tools working in, though, which afaik has slowed down existing attempts at diagnosing the issue.
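
For example, if the growth shows up in the manager's resident set size, even a small polling script can help pinpoint when it starts. The following is just a rough sketch (the process name, /proc paths, and sampling interval are my own assumptions for a plain Linux host), not a supported tool:

    #!/usr/bin/env python3
    # Rough sketch: periodically log the RSS of the ceph-mgr process so we
    # can see when the memory growth begins. Assumes a Linux host (or a
    # shell inside the container) where /proc is readable and the process
    # comm is "ceph-mgr".
    import os
    import time

    def find_mgr_pid(comm="ceph-mgr"):
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                with open(f"/proc/{entry}/comm") as f:
                    if f.read().strip() == comm:
                        return int(entry)
            except OSError:
                continue  # process exited while we were scanning
        return None

    def rss_kib(pid):
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # reported in kB
        return 0

    if __name__ == "__main__":
        pid = find_mgr_pid()
        if pid is None:
            raise SystemExit("no ceph-mgr process found")
        while True:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), rss_kib(pid), "kB",
                  flush=True)
            time.sleep(60)  # arbitrary sampling interval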

Mark

On 9/7/23 05:55, Rok Jaklič wrote:
Hi,

We have also experienced several ceph-mgr OOM kills on Ceph v16.2.13 with
120T/200T of data.

Is there any tracker about the problem?

Does upgrading to 17.x "solve" the problem?

Kind regards,
Rok



On Wed, Sep 6, 2023 at 9:36 PM Ernesto Puerta <epuer...@redhat.com> wrote:

Dear Cephers,

Today brought us an eventful CLT meeting: it looks like Jitsi recently
started
requiring user authentication
<https://jitsi.org/blog/authentication-on-meet-jit-si/> (anonymous users
will get a "Waiting for a moderator" modal), but authentication didn't work
against Google or GitHub accounts, so we had to move to the good old Google
Meet.

As a result of this, Neha has kindly set up a new private Slack channel
(#clt) to allow for quicker communication among CLT members (if you usually
attend the CLT meeting and have not been added, please ping any CLT member
to request that).

Now, let's move on to the important stuff:

*The latest Pacific Release (v16.2.14)*

*The Bad*
The 14th drop of the Pacific release has landed with a few hiccups:

    - Some .deb packages were made available on downloads.ceph.com before
    the release process was complete. Although this is not the first time
    this has happened, we want to make sure it is the last, so we'd like to
    gather ideas to improve the release publishing process. Neha encouraged
    everyone to share ideas here:
       - https://tracker.ceph.com/issues/62671
       - https://tracker.ceph.com/issues/62672
    - v16.2.14 also hit issues during the ceph-container stage. Laura
    wanted to raise awareness of its current setbacks
    <https://pad.ceph.com/p/16.2.14-struggles> and collect ideas to tackle
    them:
       - Enforce reviews and mandatory CI checks
       - Rework the current approach to use simple Dockerfiles
       <https://github.com/ceph/ceph/pull/43292>
       - Call the Ceph community for help: ceph-container is currently
       maintained part-time by a single contributor (Guillaume Abrioux).
       This sub-project would benefit from the sound expertise on containers
       among Ceph users. If you have ever considered contributing to Ceph,
       but felt a bit intimidated by C++, Paxos and race conditions,
       ceph-container is a good place to shed your fear.


*The Good*
Not everything about v16.2.14 was bleak: David Orman brought us some really
good news. They tested v16.2.14 on a large production cluster (10 Gbit/s+
RGW and ~13 PiB raw) and found that it solved a major issue affecting RGW
in Pacific <https://github.com/ceph/ceph/pull/52552>.

*The Ugly*
During that testing, they noticed that ceph-mgr was occasionally OOM-killed
(nothing new in 16.2.14, as it had been reported previously). They have
already tried:

    - Disabling modules (like the restful one, which was a suspect)
    - Enabling debug 20
    - Turning the pg autoscaler off

Debugging will continue to characterize this issue:

    - Enable profiling (Mark Nelson)
    - Try Bloomberg's Python memory profiler, memray
    <https://github.com/bloomberg/memray> (Matthew Leonard); a rough sketch
    of its Python API is shown below
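
For reference, memray can capture allocations either by attaching to a running process or by wrapping a code path with its Python API. Below is only a minimal sketch of the latter; suspect_workload() is a hypothetical stand-in, not actual mgr code:

    # Minimal memray sketch (pip install memray). suspect_workload() is a
    # hypothetical placeholder for whichever mgr code path is under
    # suspicion.
    import memray

    def suspect_workload():
        # Stand-in allocation-heavy loop, purely for illustration.
        chunks = [bytearray(1024) for _ in range(100_000)]
        return len(chunks)

    if __name__ == "__main__":
        # Writes an allocation capture that can be inspected afterwards
        # with, e.g., `memray flamegraph mgr-capture.bin`.
        with memray.Tracker("mgr-capture.bin"):
            suspect_workload()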


*Infrastructure*

*Reminder: Infrastructure Meeting Tomorrow, 11:30-12:30 Central Time*

Patrick brought up the following topics:

    - Need to reduce the OVH spending ($72k/year, which is a sizeable chunk
    of the Ceph Foundation budget; that's a lot fewer avocado sandwiches for
    the next Cephalocon):
       - Move services (e.g. Chacra) to the Sepia lab
       - Re-use CentOS (and any spare/unused) machines for devel purposes
    - Current Ceph sys admins are overloaded, so devel/community involvement
    would be much appreciated.
    - More to be discussed in tomorrow's meeting. Please join if you
    think you can help solve/improve the Ceph infrastructure!


*BTW*: today's CDM will be canceled, since no topics were proposed.

Kind Regards,

Ernesto
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
