Yeah, it's recommended to upgrade when all PGs are active+clean. Of course, this isn't always possible, but one should at least not add more workload to the upgrade process unless absolutely necessary.
I concur with your plan to wait until everything has settled.
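For example, a quick sanity check before each upgrade step (just a sketch; ceph -s shows the same information in more detail):

ceph pg stat    # should report all PGs as active+clean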

Quoting Daniel Williams <[email protected]>:

There were no degraded PGs, but in terms of upgrade state there were misplaced objects / rebalancing due to additional drives being added. Before the new drive, everything was deep-scrubbed on version 18.2.7. The new pool was added halfway through the upgrade process. I guess I'm doing too many things at the same time.

During the upgrade I added these config settings, since it seemed stuck and going badly with a bunch of "experiencing slow operations in BlueStore" warnings:

global    advanced    bdev_async_discard_threads    1
global    advanced    bdev_enable_discard           true
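In case it matters, these were applied with the config CLI, roughly like this (reconstructed from memory, not my exact shell history):

ceph config set global bdev_async_discard_threads 1
ceph config set global bdev_enable_discard true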

I've attached all the things you asked for; unfortunately it's from after I rolled my own ceph-osd without the check (OSD.cc_patch.txt). I tried it on one OSD first and it seemed happy and functional without crashing, so I brought the others up using the custom binary.

I'm too scared to do anything to pool 57 yet... I'll wait, let a full deep scrub happen, and then delete it, I guess; then revert to the unpatched version on one of the OSDs and see how it goes.
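Roughly that plan in commands, with a placeholder for the pool name since I haven't decided on the exact steps yet:

# kick off deep scrubs everywhere
ceph osd deep-scrub all
# later, once everything is clean, remove the experimental pool
# (requires mon_allow_pool_delete=true; <pool57-name> is a placeholder)
ceph osd pool rm <pool57-name> <pool57-name> --yes-i-really-really-mean-it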

On Fri, Sep 26, 2025 at 4:37 PM Eugen Block <[email protected]> wrote:

Hi,

I haven't seen this error yet. Did you upgrade while the cluster was not healthy? The more history you can provide, the better.
Can you add the output of these CLI commands?

ceph -s
ceph health detail
ceph pg ls-by-pool <pool_with_id_57>   (not the entire output, just to see if they are listed)

Before deleting a PG, I'd export it with ceph-objectstore-tool, just
in case. Then you could try to remove it from one OSD (also with
ceph-objectstore-tool) and see if that single OSD starts again. If it
works, you could do the same for the remaining PG chunks.
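Untested sketch of what that could look like, assuming the default data path and taking osd.21 and pg 57.3s7 from your log (adjust per affected OSD/shard); the OSD must be stopped first:

systemctl stop ceph-osd@21
# export the PG shard as a backup, just in case
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    --pgid 57.3s7 --op export --file /root/pg-57.3s7.export
# then remove it from that OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-21 \
    --pgid 57.3s7 --op remove --force
systemctl start ceph-osd@21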

Downgrading is generally not supported, so you might break even more.

Regards,
Eugen


Quoting Daniel Williams <[email protected]>:

> Some background: pool 57 is a new rbd pool (12 MiB used) that I was just
> experimenting with (performance of striped HDD rbd devices). I don't think
> I deleted it, but I can't say for sure since it didn't matter to me (it
> still appears in ceph df).
> This pool was created on reef, and a full deep scrub has completed several
> times since moving to reef (March 2024); likely no deep scrub has been
> done since moving to squid, since I've had lots of troubles...
>
> This error, however, has broken a 150 TiB machine, and worse, I don't
> know that a restart won't break others...
>
> After a host reboot I've lost half the OSDs on that host; they all
> refuse to start with:
>
>   -725> 2025-09-25T18:02:37.157+0000 7f93d0aab8c0 -1 Falling back to public interface
>     -2> 2025-09-25T18:02:40.033+0000 7f93d0aab8c0 -1 osd.21 2098994 init missing pg_pool_t for deleted pool 57 for pg 57.3s7; please downgrade to luminous and allow pg deletion to complete before upgrading
>     -1> 2025-09-25T18:02:40.037+0000 7f93d0aab8c0 -1 ./src/osd/OSD.cc: In function 'int OSD::init()' thread 7f93d0aab8c0 time 2025-09-25T18:02:40.040491+0000
> ./src/osd/OSD.cc: 3867: ceph_abort_msg("abort() called")
>
>  ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
>  1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xb7) [0x560b78a7056a]
>  2: /usr/bin/ceph-osd(+0x385bcb) [0x560b789f0bcb]
>  3: main()
>  4: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f93d165dd90]
>  5: __libc_start_main()
>  6: _start()
>
>      0> 2025-09-25T18:02:40.037+0000 7f93d0aab8c0 -1 *** Caught signal (Aborted) **
>  in thread 7f93d0aab8c0 thread_name:ceph-osd
>
>  ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
>  1: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f93d1676520]
>  2: pthread_kill()
>  3: raise()
>  4: abort()
>  5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x16a) [0x560b78a7061d]
>  6: /usr/bin/ceph-osd(+0x385bcb) [0x560b789f0bcb]
>  7: main()
>  8: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f93d165dd90]
>  9: __libc_start_main()
>  10: _start()
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Aborted
>
>
>
> Will deleting the PG help? Is there any way I can recover these OSDs?
> Will moving back to reef help?
>
> Daniel





_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
