[ceph-users] tuning for backup target cluster

2024-05-24 Thread Lukasz Borek
Hi Everyone,

I'm putting together an HDD cluster with an EC pool dedicated to the backup
environment. Traffic goes via S3. Version 18.2, 7 OSD nodes, 12 * 12TB HDD +
1 NVMe each, 4+2 EC pool.

Wondering if there is some general guidance for initial setup/tuning in
regards to S3 object size. Files are read from fast storage (SSD/NVMe) and
written to S3. File sizes are 10MB-1TB, so it's not standard S3 traffic.
Backups of big files take hours to complete.

My first shot would be to increase the default bluestore_min_alloc_size_hdd to
reduce the number of stored objects, but I'm not sure if that's a good
direction? Any other parameters worth checking to support such a traffic
pattern?
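A minimal sketch of how the relevant knobs could be inspected or changed (the
64K value below is only an illustration and would need benchmarking for this
workload; note that bluestore_min_alloc_size_hdd is baked in at OSD creation
time, so it only affects OSDs (re)built after the change, and recent releases
already default it to 4K):

# current value (applied at OSD mkfs time only)
ceph config get osd bluestore_min_alloc_size_hdd

# example only: set a larger allocation unit before (re)creating the HDD OSDs
ceph config set osd bluestore_min_alloc_size_hdd 65536

# RGW-side options that control how large S3 objects are split into RADOS objects
ceph config get client.rgw rgw_max_chunk_size
ceph config get client.rgw rgw_obj_stripe_size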

Thanks!

-- 
Łukasz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS Abort druing FS scrub

2024-05-24 Thread Malcolm Haak
When running a cephfs scrub the MDS will crash with the following backtrace

-1> 2024-05-25T09:00:23.028+1000 7ef2958006c0 -1
/usr/src/debug/ceph/ceph-18.2.2/src/mds/MDSRank.cc: In function 'void
MDSRank::abort(std::string_view)' thread 7ef2958006c0 time
2024-05-25T09:00:23.031373+1000
/usr/src/debug/ceph/ceph-18.2.2/src/mds/MDSRank.cc: 938:
ceph_abort_msg("abort() called")

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string,
std::allocator > const&)+0xd8) [0x7ef2a3a1ea65]
 2: (MDSRank::abort(std::basic_string_view >)+0x89) [0x5b97f9c811b9]
 3: (CDentry::check_corruption(bool)+0x4e3) [0x5b97f9ef2583]
 4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*,
CInode*, unsigned char)+0x59) [0x5b97f9d554a9]
 5: (Locker::scatter_writebehind(ScatterLock*)+0x2e9) [0x5b97f9e775c9]
 6: (Locker::simple_sync(SimpleLock*, bool*)+0x16a) [0x5b97f9e7db4a]
 7: (Locker::file_eval(ScatterLock*, bool*)+0x139) [0x5b97f9e84299]
 8: (Locker::_drop_locks(MutationImpl*, std::set, std::allocator >*, bool)+0x1b1)
[0x5b97f9e96221]
 9: (Locker::drop_locks(MutationImpl*, std::set, std::allocator >*)+0x7f) [0x5b97f9e9726f]
 10: 
(MDCache::repair_inode_stats_work(boost::intrusive_ptr&)+0x8f7)
[0x5b97f9dd2dd7]
 11: (MDCache::repair_inode_stats(CInode*)+0x90) [0x5b97f9df61e0]
 12: /usr/bin/ceph-mds(+0x460cb8) [0x5b97f9f58cb8]
 13: (Continuation::_continue_function(int, int)+0x1c3) [0x5b97f9f6a903]
 14: /usr/bin/ceph-mds(+0x46efa1) [0x5b97f9f66fa1]
 15: (Continuation::_continue_function(int, int)+0x1c3)
[0x5b97f9f6a903]
 16: (Continuation::Callback::finish(int)+0x14) [0x5b97f9f6aa14]
 17: (Context::complete(int)+0xd) [0x5b97f9c75bfd]
 18: (MDSContext::complete(int)+0x5f) [0x5b97f9fd2b0f]
 19: (MDSIOContextBase::complete(int)+0x38c) [0x5b97f9fd2ffc]
 20: (Finisher::finisher_thread_entry()+0x175) [0x7ef2a3adac95]
 21: /usr/lib/libc.so.6(+0x8b55a) [0x7ef2a32a955a]
 22: /usr/lib/libc.so.6(+0x108a3c) [0x7ef2a3326a3c]

This is after running through the "Advanced recovery procedure".

I assume I need a flag or something on the MDS to allow it to keep
running during said scrub?
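For what it's worth, frame 3 of the backtrace is CDentry::check_corruption(),
and that abort path appears to be gated by an MDS option. The option name below
is an assumption to verify against the documentation for the exact release in
use, not a confirmed fix:

# assumed knob: let the MDS mark the dentry damaged instead of aborting
ceph config set mds mds_abort_on_newly_corrupt_dentry false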
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Kai Stian Olstad

On 24.05.2024 21:07, Mazzystr wrote:
I did the obnoxious task of updating ceph.conf and restarting all my osds.

ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get osd_op_queue
{
    "osd_op_queue": "wpq"
}

I have some spare memory on my target host/osd and increased the target
memory of that OSD to 10 Gb and restarted.  No effect observed.  In fact
mem usage on the host is stable so I don't think the change took effect
even with updating ceph.conf, restart and a direct asok config set.  target
memory value is confirmed to be set via asok config get

Nothing has helped.  I still cannot break the 21 MiB/s barrier.

Does anyone have any more ideas?


For recovery you can adjust the following.

osd_max_backfills defaults to 1; on my system I get the best performance
with 3 and wpq.

The following I have not adjusted myself, but you can try:
osd_recovery_max_active defaults to 3.
osd_recovery_op_priority defaults to 3; a higher number increases the
priority of recovery relative to client ops.

All of them can be adjusted at runtime.
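A minimal sketch of what the runtime adjustment could look like (the values are
examples, not recommendations):

ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5
ceph config set osd osd_recovery_op_priority 3

# or per running daemon, without persisting in the config database:
ceph tell osd.* injectargs '--osd_max_backfills 3'

# note: under the mclock scheduler these options are ignored unless
# osd_mclock_override_recovery_settings is enabled (my understanding of the
# Reef docs); with wpq they apply directly.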


--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
Now that you're on wpq, you can try tweaking osd_max_backfills (up)
and osd_recovery_sleep (down).
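A sketch of what that could look like with wpq (example values only; the
per-device-class sleep options are usually the ones that matter for HDDs):

ceph config set osd osd_max_backfills 4
ceph config set osd osd_recovery_sleep_hdd 0
ceph config set osd osd_recovery_sleep_hybrid 0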

Josh

On Fri, May 24, 2024 at 1:07 PM Mazzystr  wrote:
>
> I did the obnoxious task of updating ceph.conf and restarting all my osds.
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get osd_op_queue
> {
> "osd_op_queue": "wpq"
> }
>
> I have some spare memory on my target host/osd and increased the target 
> memory of that OSD to 10 Gb and restarted.  No effect observed.  In fact mem 
> usage on the host is stable so I don't think the change took effect even with 
> updating ceph.conf, restart and a direct asok config set.  target memory 
> value is confirmed to be set via asok config get
>
> Nothing has helped.  I still cannot break the 21 MiB/s barrier.
>
> Does anyone have any more ideas?
>
> /C
>
> On Fri, May 24, 2024 at 10:20 AM Joshua Baergen  
> wrote:
>>
>> It requires an OSD restart, unfortunately.
>>
>> Josh
>>
>> On Fri, May 24, 2024 at 11:03 AM Mazzystr  wrote:
>> >
>> > Is that a setting that can be applied runtime or does it req osd restart?
>> >
>> > On Fri, May 24, 2024 at 9:59 AM Joshua Baergen 
>> > wrote:
>> >
>> > > Hey Chris,
>> > >
>> > > A number of users have been reporting issues with recovery on Reef
>> > > with mClock. Most folks have had success reverting to
>> > > osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
>> > > I haven't looked at the list myself yet.
>> > >
>> > > Josh
>> > >
>> > > On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
>> > > >
>> > > > Hi all,
>> > > > Goodness I'd say it's been at least 3 major releases since I had to do 
>> > > > a
>> > > > recovery.  I have disks with 60-75,000 power_on_hours.  I just updated
>> > > from
>> > > > Octopus to Reef last month and I'm hit with 3 disk failures and the
>> > > mclock
>> > > > ugliness.  My recovery is moving at a wondrous 21 mb/sec after some
>> > > serious
>> > > > hacking.  It started out at 9 mb/sec.
>> > > >
>> > > > My hosts are showing minimal cpu use.  normal mem use.  0-6% disk
>> > > > business.  Load is minimal so processes aren't blocked by disk io.
>> > > >
>> > > > I tried the changing all the sleeps and recovery_max and
>> > > > setting osd_mclock_profile high_recovery_ops to no change in 
>> > > > performance.
>> > > >
>> > > > Does anyone have any suggestions to improve performance?
>> > > >
>> > > > Thanks,
>> > > > /Chris C
>> > > > ___
>> > > > ceph-users mailing list -- ceph-users@ceph.io
>> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Mazzystr
I did the obnoxious task of updating ceph.conf and restarting all my osds.

ceph --admin-daemon /var/run/ceph/ceph-osd.*.asok config get osd_op_queue
{
"osd_op_queue": "wpq"
}

I have some spare memory on my target host/osd and increased the target
memory of that OSD to 10 Gb and restarted.  No effect observed.  In fact
mem usage on the host is stable so I don't think the change took effect
even with updating ceph.conf, restart and a direct asok config set.  target
memory value is confirmed to be set via asok config get

Nothing has helped.  I still cannot break the 21 MiB/s barrier.

Does anyone have any more ideas?

/C

On Fri, May 24, 2024 at 10:20 AM Joshua Baergen 
wrote:

> It requires an OSD restart, unfortunately.
>
> Josh
>
> On Fri, May 24, 2024 at 11:03 AM Mazzystr  wrote:
> >
> > Is that a setting that can be applied runtime or does it req osd restart?
> >
> > On Fri, May 24, 2024 at 9:59 AM Joshua Baergen <
> jbaer...@digitalocean.com>
> > wrote:
> >
> > > Hey Chris,
> > >
> > > A number of users have been reporting issues with recovery on Reef
> > > with mClock. Most folks have had success reverting to
> > > osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
> > > I haven't looked at the list myself yet.
> > >
> > > Josh
> > >
> > > On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
> > > >
> > > > Hi all,
> > > > Goodness I'd say it's been at least 3 major releases since I had to
> do a
> > > > recovery.  I have disks with 60-75,000 power_on_hours.  I just
> updated
> > > from
> > > > Octopus to Reef last month and I'm hit with 3 disk failures and the
> > > mclock
> > > > ugliness.  My recovery is moving at a wondrous 21 mb/sec after some
> > > serious
> > > > hacking.  It started out at 9 mb/sec.
> > > >
> > > > My hosts are showing minimal cpu use.  normal mem use.  0-6% disk
> > > > business.  Load is minimal so processes aren't blocked by disk io.
> > > >
> > > > I tried the changing all the sleeps and recovery_max and
> > > > setting osd_mclock_profile high_recovery_ops to no change in
> performance.
> > > >
> > > > Does anyone have any suggestions to improve performance?
> > > >
> > > > Thanks,
> > > > /Chris C
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-24 Thread Eugen Block

Hi,

I guess you mean use something like "step take DCA class hdd"  
instead of "step take default class hdd" as in:


rule rule-ec-k7m11 {
id 1
type erasure
min_size 3
max_size 18
step set_chooseleaf_tries 5
step set_choose_tries 100
step take DCA class hdd
step chooseleaf indep 9 type host
step take DCB class hdd
step chooseleaf indep 9 type host
step emit
}


Almost, yes. There needs to be an "emit" step after the first  
chooseleaf, so something like this:



step take DCA class hdd
step chooseleaf indep 9 type host
step emit
step take DCB class hdd
step chooseleaf indep 9 type host
step emit


Otherwise the placement according to crushtool would be incomplete and  
only 9 chunks would get a mapping. With this rule (omitting "default") there  
are no bad mappings reported, so that would most likely work as well. But  
having all primaries in one DC is not optimal, although for this  
specific customer it probably wouldn't make a difference. But in  
general I agree, not ideal.
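For anyone wanting to verify such a rule change offline before injecting it, a
rough sketch with crushtool (rule id 1 and k+m=18 taken from the rule above):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit the rule in crushmap.txt, then recompile and test the mappings:
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 1 --num-rep 18 --show-mappings | head
crushtool -i crushmap.new --test --rule 1 --num-rep 18 --show-bad-mappings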


In case you have time, it would be great if you could collect  
information on (reproducing) the fatal peering problem. While  
remappings might be "unexpectedly expected" it is clearly a serious  
bug that incomplete and unknown PGs show up in the process of adding  
hosts at the root.


Time wouldn't be an issue, but there's no way for me to do that on the  
customer's cluster. In my lab it doesn't behave as observed which  
isn't surprising without much data and no client load. I'm not sure  
yet how to achieve that.


Thanks,
Eugen

Zitat von Frank Schilder :


Hi Eugen,

so it is partly "unexpectedly expected" and partly buggy. I really  
wish the crush implementation was honouring a few obvious  
invariants. It is extremely counter-intuitive that mappings taken  
from a sub-set change even if both, the sub-set and the mapping  
instructions themselves don't.



- Use different root names


That's what we are doing and it works like a charm, also for draining OSDs.


more specific crush rules.


I guess you mean use something like "step take DCA class hdd"  
instead of "step take default class hdd" as in:


rule rule-ec-k7m11 {
id 1
type erasure
min_size 3
max_size 18
step set_chooseleaf_tries 5
step set_choose_tries 100
step take DCA class hdd
step chooseleaf indep 9 type host
step take DCB class hdd
step chooseleaf indep 9 type host
step emit
}

According to the documentation, this should actually work and be  
almost equivalent to your crush rule. The difference here is that it  
will make sure that the first 9 shards are from DCA and the second 9  
shards from DCB (its an ordering). Side effect is that all primary  
OSDs will be in DCA if both DCs are up. I remember people asking for  
that as a feature in multi-DC set-ups to pick the one with lowest  
latency to have the primary OSDs by default.


Can you give this crush rule a try and report back whether or not  
the behaviour when adding hosts changes?


In case you have time, it would be great if you could collect  
information on (reproducing) the fatal peering problem. While  
remappings might be "unexpectedly expected" it is clearly a serious  
bug that incomplete and unknown PGs show up in the process of adding  
hosts at the root.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

I start to think that the root cause of the remapping is just the fact
that the crush rule(s) contain(s) the "step take default" line:

  step take default class hdd

My interpretation is that crush simply tries to honor the rule:
consider everything underneath the "default" root, so PGs get remapped
if new hosts are added there (but not in their designated subtree
buckets). The effect (unknown PGs) is bad, but there are a couple of
options to avoid that:

- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything
is where it's supposed to be, then reweight the OSDs.


Zitat von Eugen Block :


Hi Frank,

thanks for looking up those trackers. I haven't looked into them
yet, I'll read your response in detail later, but I wanted to add
some new observation:

I added another root bucket (custom) to the osd tree:

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12   0  root custom
 -1 0.27698  root default
 -8 0.09399  room room1
 -3 0.04700  host host1
  7hdd  0.02299  osd.7   up   1.0  1.00

[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
It requires an OSD restart, unfortunately.

Josh

On Fri, May 24, 2024 at 11:03 AM Mazzystr  wrote:
>
> Is that a setting that can be applied runtime or does it req osd restart?
>
> On Fri, May 24, 2024 at 9:59 AM Joshua Baergen 
> wrote:
>
> > Hey Chris,
> >
> > A number of users have been reporting issues with recovery on Reef
> > with mClock. Most folks have had success reverting to
> > osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
> > I haven't looked at the list myself yet.
> >
> > Josh
> >
> > On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
> > >
> > > Hi all,
> > > Goodness I'd say it's been at least 3 major releases since I had to do a
> > > recovery.  I have disks with 60-75,000 power_on_hours.  I just updated
> > from
> > > Octopus to Reef last month and I'm hit with 3 disk failures and the
> > mclock
> > > ugliness.  My recovery is moving at a wondrous 21 mb/sec after some
> > serious
> > > hacking.  It started out at 9 mb/sec.
> > >
> > > My hosts are showing minimal cpu use.  normal mem use.  0-6% disk
> > > business.  Load is minimal so processes aren't blocked by disk io.
> > >
> > > I tried the changing all the sleeps and recovery_max and
> > > setting osd_mclock_profile high_recovery_ops to no change in performance.
> > >
> > > Does anyone have any suggestions to improve performance?
> > >
> > > Thanks,
> > > /Chris C
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Mazzystr
Is that a setting that can be applied runtime or does it req osd restart?

On Fri, May 24, 2024 at 9:59 AM Joshua Baergen 
wrote:

> Hey Chris,
>
> A number of users have been reporting issues with recovery on Reef
> with mClock. Most folks have had success reverting to
> osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
> I haven't looked at the list myself yet.
>
> Josh
>
> On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
> >
> > Hi all,
> > Goodness I'd say it's been at least 3 major releases since I had to do a
> > recovery.  I have disks with 60-75,000 power_on_hours.  I just updated
> from
> > Octopus to Reef last month and I'm hit with 3 disk failures and the
> mclock
> > ugliness.  My recovery is moving at a wondrous 21 mb/sec after some
> serious
> > hacking.  It started out at 9 mb/sec.
> >
> > My hosts are showing minimal cpu use.  normal mem use.  0-6% disk
> > business.  Load is minimal so processes aren't blocked by disk io.
> >
> > I tried the changing all the sleeps and recovery_max and
> > setting osd_mclock_profile high_recovery_ops to no change in performance.
> >
> > Does anyone have any suggestions to improve performance?
> >
> > Thanks,
> > /Chris C
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-24 Thread Joshua Baergen
Hey Chris,

A number of users have been reporting issues with recovery on Reef
with mClock. Most folks have had success reverting to
osd_op_queue=wpq. AIUI 18.2.3 should have some mClock improvements but
I haven't looked at the list myself yet.
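For reference, a rough sketch of the revert (osd_op_queue is set for all OSDs
and, as noted elsewhere in this thread, only takes effect after an OSD restart):

ceph config set osd osd_op_queue wpq
# then restart the OSDs, e.g. on package-based installs:
systemctl restart ceph-osd.target
# (cephadm deployments would restart the OSD services via ceph orch instead)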

Josh

On Fri, May 24, 2024 at 10:55 AM Mazzystr  wrote:
>
> Hi all,
> Goodness I'd say it's been at least 3 major releases since I had to do a
> recovery.  I have disks with 60-75,000 power_on_hours.  I just updated from
> Octopus to Reef last month and I'm hit with 3 disk failures and the mclock
> ugliness.  My recovery is moving at a wondrous 21 mb/sec after some serious
> hacking.  It started out at 9 mb/sec.
>
> My hosts are showing minimal cpu use.  normal mem use.  0-6% disk
> business.  Load is minimal so processes aren't blocked by disk io.
>
> I tried the changing all the sleeps and recovery_max and
> setting osd_mclock_profile high_recovery_ops to no change in performance.
>
> Does anyone have any suggestions to improve performance?
>
> Thanks,
> /Chris C
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Lousy recovery for mclock and reef

2024-05-24 Thread Mazzystr
Hi all,
Goodness I'd say it's been at least 3 major releases since I had to do a
recovery.  I have disks with 60-75,000 power_on_hours.  I just updated from
Octopus to Reef last month and I'm hit with 3 disk failures and the mclock
ugliness.  My recovery is moving at a wondrous 21 mb/sec after some serious
hacking.  It started out at 9 mb/sec.

My hosts are showing minimal cpu use, normal mem use, 0-6% disk
busyness.  Load is minimal so processes aren't blocked by disk io.

I tried changing all the sleeps and recovery_max and
setting osd_mclock_profile to high_recovery_ops, with no change in performance.
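For reference, a sketch of the knobs mentioned above. As far as I know the
classic recovery/backfill options are ignored by the mclock scheduler unless
the override flag is set; treat the option names as something to double-check
against the Reef docs rather than a definitive recipe:

ceph config set osd osd_mclock_profile high_recovery_ops
ceph config set osd osd_mclock_override_recovery_settings true
ceph config show osd.0 osd_mclock_profile   # verify what a running OSD actually uses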

Does anyone have any suggestions to improve performance?

Thanks,
/Chris C
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef RGWs stop processing requests

2024-05-24 Thread Iain Stott
Thanks Enrico,

We are only syncing metadata between sites, so I don't think that bug will be 
the cause of our issues.

I have been able to delete ~30k objects without causing the RGW to stop 
processing.


Thanks
Iain

From: Enrico Bocchi 
Sent: 22 May 2024 13:48
To: Iain Stott ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Reef RGWs stop processing requests

CAUTION: This email originates from outside THG

Hi Iain,

Can you check if it relates to this? --
https://tracker.ceph.com/issues/63373
There is a bug when bulk deleting objects, causing the RGWs to deadlock.

Cheers,
Enrico


On 5/17/24 11:24, Iain Stott wrote:
> Hi,
>
> We are running 3 clusters in multisite. All 3 were running Quincy 17.2.6 and 
> using cephadm. We upgraded one of the secondary sites to Reef 18.2.1 a couple 
> of weeks ago and were planning on doing the rest shortly afterwards.
>
> We run 3 RGW daemons on separate physical hosts behind an external HAProxy HA 
> pair for each cluster.
>
> Since we upgraded to Reef we have had issues with the RGWs stopping processing 
> requests. We can see that they don't crash, as they still have entries in the 
> logs about syncing, but as far as request processing goes, they just stop. 
> While debugging this we have 1 of the 3 RGWs running a Quincy image, and this 
> has never had an issue where it stops processing requests. Any Reef 
> containers we deploy have always stopped within 48 hrs of being deployed. We 
> have tried Reef versions 18.2.1, 18.2.2 and 18.1.3 and all exhibit the same 
> issue. We are running podman 4.6.1 on CentOS 8 with kernel 
> 4.18.0-513.24.1.el8_9.x86_64.
>
> We have enabled debug logs for the RGWs but we have been unable to find 
> anything in them that would shed light on the cause.
>
> We are just wondering if anyone had any ideas on what could be causing this 
> or how to debug it further?
>
> Thanks
> Iain
>
> Iain Stott
> OpenStack Engineer
> iain.st...@thg.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-24 Thread Frank Schilder
Hi Eugen,

so it is partly "unexpectedly expected" and partly buggy. I really wish the 
crush implementation was honouring a few obvious invariants. It is extremely 
counter-intuitive that mappings taken from a sub-set change even if both, the 
sub-set and the mapping instructions themselves don't.

> - Use different root names

That's what we are doing and it works like a charm, also for draining OSDs.

> more specific crush rules.

I guess you mean use something like "step take DCA class hdd" instead of "step 
take default class hdd" as in:

rule rule-ec-k7m11 {
id 1
type erasure
min_size 3
max_size 18
step set_chooseleaf_tries 5
step set_choose_tries 100
step take DCA class hdd
step chooseleaf indep 9 type host
step take DCB class hdd
step chooseleaf indep 9 type host
step emit
}

According to the documentation, this should actually work and be almost 
equivalent to your crush rule. The difference here is that it will make sure 
that the first 9 shards are from DCA and the second 9 shards from DCB (it's an 
ordering). A side effect is that all primary OSDs will be in DCA if both DCs are 
up. I remember people asking for that as a feature in multi-DC set-ups to pick 
the one with lowest latency to have the primary OSDs by default.

Can you give this crush rule a try and report back whether or not the behaviour 
when adding hosts changes?

In case you have time, it would be great if you could collect information on 
(reproducing) the fatal peering problem. While remappings might be 
"unexpectedly expected" it is clearly a serious bug that incomplete and unknown 
PGs show up in the process of adding hosts at the root.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Eugen Block 
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

I start to think that the root cause of the remapping is just the fact
that the crush rule(s) contain(s) the "step take default" line:

  step take default class hdd

My interpretation is that crush simply tries to honor the rule:
consider everything underneath the "default" root, so PGs get remapped
if new hosts are added there (but not in their designated subtree
buckets). The effect (unknown PGs) is bad, but there are a couple of
options to avoid that:

- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything
is where it's supposed to be, then reweight the OSDs.


Zitat von Eugen Block :

> Hi Frank,
>
> thanks for looking up those trackers. I haven't looked into them
> yet, I'll read your response in detail later, but I wanted to add
> some new observation:
>
> I added another root bucket (custom) to the osd tree:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -12   0  root custom
>  -1 0.27698  root default
>  -8 0.09399  room room1
>  -3 0.04700  host host1
>   7hdd  0.02299  osd.7   up   1.0  1.0
>  10hdd  0.02299  osd.10  up   1.0  1.0
> ...
>
> Then I tried this approach to add a new host directly to the
> non-default root:
>
> # cat host5.yaml
> service_type: host
> hostname: host5
> addr: 192.168.168.54
> location:
>   root: custom
> labels:
>- osd
>
> # ceph orch apply -i host5.yaml
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -12 0.04678  root custom
> -23 0.04678  host host5
>   1hdd  0.02339  osd.1   up   1.0  1.0
>  13hdd  0.02339  osd.13  up   1.0  1.0
>  -1 0.27698  root default
>  -8 0.09399  room room1
>  -3 0.04700  host host1
>   7hdd  0.02299  osd.7   up   1.0  1.0
>  10hdd  0.02299  osd.10  up   1.0  1.0
> ...
>
> host5 is placed directly underneath the new custom root correctly,
> but not a single PG is marked "remapped"! So this is actually what I
> (or we) expected. I'm not sure yet what to make of it, but I'm
> leaning towards using this approach in the future and add hosts
> underneath a different root first, and then move it to its
> designated location.
>
> Just to validate again, I added host6 without a location spec, so
> it's placed underneath the default root again:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -12 0.04678  root custom
> -23 0.04678  host host5
>   1hdd  0.02339  osd.1   up   1.0  1.0
>  13hdd  0.02339  osd.13  up   1.0  1.0
>  -1 0.32376  root

[ceph-users] Re: User + Dev Meetup Tomorrow!

2024-05-24 Thread Frédéric Nass
Hello Sebastian,

I just checked the survey and you're right, the issue was within the question. 
Got me a bit confused when I read it but I clicked anyway. Who doesn't like 
clicking? :-D

What best describes your deployment target? *
1/ Bare metal (RPMs/Binary)
2/ Containers (cephadm/Rook)
3/ Both

How funny is that.

Apart from that, I was thinking of the users who reported that they find the 
orchestrator a little obscure in its operation/decisions, particularly with 
regard to the creation of OSDs.

A nice feature would be a history of what the orchestrator did, with the result 
of each action and the reason in case of failure.
A 'ceph orch history' for example (or ceph orch status --details or --history 
or whatever). It would be much easier to read than the MGR's very verbose 
ceph.cephadm.log.

Like for example:

$ ceph orch history
DATE/TIME                     TASK                                                         HOSTS                        RESULT
2024-05-24T10:40:44.866148Z   Applying tuned-profile latency-performance                  voltaire,lafontaine,rimbaud  SUCCESS
2024-05-24T10:39:44.866148Z   Applying mds.cephfs spec                                     verlaine,hugo                SUCCESS
2024-05-24T10:33:44.866148Z   Applying service osd.osd_nodes_fifteen on host lamartine...  lamartine                    FAILED (host has _no_schedule label)
2024-05-24T10:28:44.866148Z   Applying service rgw.s31 spec                                eluard,baudelaire            SUCCESS

We'd just have to "watch ceph orch history" and see what the orchestrator does 
in real time.

Cheers,
Frédéric.

- Le 24 Mai 24, à 15:07, Sebastian Wagner sebastian.wag...@croit.io a écrit 
:

> Hi Frédéric,
> 
> I agree. Maybe we should re-frame things? Containers can run on
> bare-metal and containers can run virtualized. And distribution packages
> can run bare-metal and virtualized as well.
> 
> What about asking independently about:
> 
>  * Do you run containers or distribution packages?
>  * Do you run bare-metal or virtualized?
> 
> Best,
> Sebastian
> 
> Am 24.05.24 um 12:28 schrieb Frédéric Nass:
>> Hello everyone,
>>
>> Nice talk yesterday. :-)
>>
>> Regarding containers vs RPMs and orchestration, and the related discussion 
>> from
>> yesterday, I wanted to share a few things (which I wasn't able to share
>> yesterday on the call due to a headset/bluetooth stack issue) to explain why 
>> we
>> use cephadm and ceph orch these days with bare-metal clusters even though, as
>> someone said, cephadm was not supposed to work with (nor support) bare-metal
>> clusters (which actually surprised me since cephadm is all about managing
>> containers on a host, regardless of its type). I also think this explains the
>> observation that was made that half of the reports (iirc) are supposedly 
>> using
>> cephadm with bare-metal clusters.
>>
>> Over the years, we've deployed and managed bare-metal clusters with 
>> ceph-deploy
>> in Hammer, then switched to ceph-ansible (take-over-existing-cluster.yml) 
>> with
>> Jewel (or was it Luminous?), and then moved to cephadm, cephadm-ansible and
>> ceph-orch with Pacific, to manage the exact same bare-metal cluster. I guess
>> this explains why some bare-metal cluster today are managed using cephadm.
>> These are not new clusters deployed with Rook in K8s environments, but 
>> existing
>> bare-metal clusters that continue to servce brilliantly 10 years after
>> installation.
>>
>> Regarding rpms vs containers, as mentioned during the call, not sure why one
>> would still want to use rpms vs containers considering the simplicity and
>> velocity that containers offer regarding upgrades with ceph orch clever
>> automation. Some reported performance reasons between rpms vs containers,
>> meaning rpms binaries would perform better than containers. Is there any
>> evidence of that?
>>
>> Perhaps the reason why people still use RPMs is instead that they have 
>> invested
>> a lot of time and effort into developing automation tools/scripts/playbooks 
>> for
>> RPMs installations and they consider the transition to ceph orch and
>> containerized environments as a significant challenge.
>>
>> Regarding containerized Ceph, I remember asking Sage for a minimalist CephOS
>> back in 2018 (there was no containers by that time). IIRC, he said 
>> maintaining
>> a ceph-specific Linux distro would take too much time and resources, so it 
>> was
>> not something considered at that time. Now that Ceph is all containers, I
>> really hope that a minimalist rolling Ceph distro comes out one day. ceph 
>> orch
>> could even handle rare distro upgrades such as kernel upgrades as well as
>> ordered reboots. This would make ceph clusters really easier to maintain over
>> time (compared to the last complicated upgrade path from non-containerized
>> RHEL7+RHCS4.3 to containerized RHEL9+RHCS5.2 that we had to follow a year 
>> ago).
>>
>> Bests,
>> Frédéric.
>>

[ceph-users] Re: User + Dev Meetup Tomorrow!

2024-05-24 Thread Sebastian Wagner

Hi Frédéric,

I agree. Maybe we should re-frame things? Containers can run on 
bare-metal and containers can run virtualized. And distribution packages 
can run bare-metal and virtualized as well.


What about asking independently about:

 * Do you run containers or distribution packages?
 * Do you run bare-metal or virtualized?

Best,
Sebastian

Am 24.05.24 um 12:28 schrieb Frédéric Nass:

Hello everyone,

Nice talk yesterday. :-)

Regarding containers vs RPMs and orchestration, and the related discussion from 
yesterday, I wanted to share a few things (which I wasn't able to share 
yesterday on the call due to a headset/bluetooth stack issue) to explain why we 
use cephadm and ceph orch these days with bare-metal clusters even though, as 
someone said, cephadm was not supposed to work with (nor support) bare-metal 
clusters (which actually surprised me since cephadm is all about managing 
containers on a host, regardless of its type). I also think this explains the 
observation that was made that half of the reports (iirc) are supposedly using 
cephadm with bare-metal clusters.

Over the years, we've deployed and managed bare-metal clusters with ceph-deploy 
in Hammer, then switched to ceph-ansible (take-over-existing-cluster.yml) with 
Jewel (or was it Luminous?), and then moved to cephadm, cephadm-ansible and 
ceph-orch with Pacific, to manage the exact same bare-metal cluster. I guess 
this explains why some bare-metal cluster today are managed using cephadm. 
These are not new clusters deployed with Rook in K8s environments, but existing 
bare-metal clusters that continue to servce brilliantly 10 years after 
installation.

Regarding rpms vs containers, as mentioned during the call, not sure why one 
would still want to use rpms vs containers considering the simplicity and 
velocity that containers offer regarding upgrades with ceph orch clever 
automation. Some reported performance reasons between rpms vs containers, 
meaning rpms binaries would perform better than containers. Is there any 
evidence of that?

Perhaps the reason why people still use RPMs is instead that they have invested 
a lot of time and effort into developing automation tools/scripts/playbooks for 
RPMs installations and they consider the transition to ceph orch and 
containerized environments as a significant challenge.

Regarding containerized Ceph, I remember asking Sage for a minimalist CephOS 
back in 2018 (there was no containers by that time). IIRC, he said maintaining 
a ceph-specific Linux distro would take too much time and resources, so it was 
not something considered at that time. Now that Ceph is all containers, I 
really hope that a minimalist rolling Ceph distro comes out one day. ceph orch 
could even handle rare distro upgrades such as kernel upgrades as well as 
ordered reboots. This would make ceph clusters really easier to maintain over 
time (compared to the last complicated upgrade path from non-containerized 
RHEL7+RHCS4.3 to containerized RHEL9+RHCS5.2 that we had to follow a year ago).

Bests,
Frédéric.

- Le 23 Mai 24, à 15:58, Laura floreslflo...@redhat.com  a écrit :


Hi all,

The meeting will be starting shortly! Join us at this link:
https://meet.jit.si/ceph-user-dev-monthly

- Laura

On Wed, May 22, 2024 at 2:55 PM Laura Flores  wrote:


Hi all,

The User + Dev Meetup will be held tomorrow at 10:00 AM EDT. We will be
discussing the results of the latest survey, and users who attend will have
the opportunity to provide additional feedback in real time.

See you there!
Laura Flores

Meeting Details:
https://www.meetup.com/ceph-user-group/events/300883526/

--

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage

Chicago, IL

lflo...@ibm.com  |lflo...@redhat.com  
M: +17087388804




--

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage

Chicago, IL

lflo...@ibm.com  |lflo...@redhat.com  
M: +17087388804
___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io

___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io

--
Head of Software Development
E-Mail: sebastian.wag...@croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-24 Thread Eugen Block
I start to think that the root cause of the remapping is just the fact  
that the crush rule(s) contain(s) the "step take default" line:


 step take default class hdd

My interpretation is that crush simply tries to honor the rule:  
consider everything underneath the "default" root, so PGs get remapped  
if new hosts are added there (but not in their designated subtree  
buckets). The effect (unknown PGs) is bad, but there are a couple of  
options to avoid that:


- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything  
is where it's supposed to be, then reweight the OSDs.
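A rough sketch of that last option (the OSD id and weight below are placeholders):

ceph config set osd osd_crush_initial_weight 0
# new OSDs then come up with crush weight 0; once the host bucket is in its
# final location, set the real weight (roughly the device size in TiB):
ceph osd crush reweight osd.14 10.914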



Zitat von Eugen Block :


Hi Frank,

thanks for looking up those trackers. I haven't looked into them  
yet, I'll read your response in detail later, but I wanted to add  
some new observation:


I added another root bucket (custom) to the osd tree:

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12   0  root custom
 -1 0.27698  root default
 -8 0.09399  room room1
 -3 0.04700  host host1
  7hdd  0.02299  osd.7   up   1.0  1.0
 10hdd  0.02299  osd.10  up   1.0  1.0
...

Then I tried this approach to add a new host directly to the  
non-default root:


# cat host5.yaml
service_type: host
hostname: host5
addr: 192.168.168.54
location:
  root: custom
labels:
   - osd

# ceph orch apply -i host5.yaml

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12 0.04678  root custom
-23 0.04678  host host5
  1hdd  0.02339  osd.1   up   1.0  1.0
 13hdd  0.02339  osd.13  up   1.0  1.0
 -1 0.27698  root default
 -8 0.09399  room room1
 -3 0.04700  host host1
  7hdd  0.02299  osd.7   up   1.0  1.0
 10hdd  0.02299  osd.10  up   1.0  1.0
...

host5 is placed directly underneath the new custom root correctly,  
but not a single PG is marked "remapped"! So this is actually what I  
(or we) expected. I'm not sure yet what to make of it, but I'm  
leaning towards using this approach in the future and add hosts  
underneath a different root first, and then move it to its  
designated location.


Just to validate again, I added host6 without a location spec, so  
it's placed underneath the default root again:


# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12 0.04678  root custom
-23 0.04678  host host5
  1hdd  0.02339  osd.1   up   1.0  1.0
 13hdd  0.02339  osd.13  up   1.0  1.0
 -1 0.32376  root default
-25 0.04678  host host6
 14hdd  0.02339  osd.14  up   1.0  1.0
 15hdd  0.02339  osd.15  up   1.0  1.0
 -8 0.09399  room room1
 -3 0.04700  host host1
...

And this leads to remapped PGs again. I assume this must be related  
to the default root. I'm gonna investigate further.


Thanks!
Eugen


Zitat von Frank Schilder :


Hi Eugen,

just to add another strangeness observation from long ago:  
https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't  
see any reweights in your trees, so its something else. However,  
there seem to be multiple issues with EC pools and peering.


I also want to clarify:

If this is the case, it is possible that this is partly  
intentional and partly buggy.


"Partly intentional" here means the code behaviour changes when you  
add OSDs to the root outside the rooms and this change is not  
considered a bug. It is clearly *not* expected as it means you  
cannot do maintenance on a pool living on a tree A without  
affecting pools on the same device class living on an unmodified  
subtree of A.


From a ceph user's point of view everything you observe looks  
buggy. I would really like to see a good explanation why the  
mappings in the subtree *should* change when adding OSDs above that  
subtree as in your case when the expectation for good reasons is  
that they don't. This would help devising clean procedures for  
adding hosts when you (and I) want to add OSDs first without any  
peering and then move OSDs into place to have it happen separate  
from adding and not a total mess with everything in parallel.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Thursday, May 23, 2024 6:32 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in  
different subtree


Hi Eugen,

I'm at home now. Could you please check all the remapped PGs that  
they have no shards on t

[ceph-users] Re: quincy rgw with ceph orch and two realms only get answers from first realm

2024-05-24 Thread Boris
Thanks for being my rubber ducky.
Turns out I hadn't configured rgw_zonegroup in the first apply.
Adding it to the config and re-applying does not restart or
reconfigure the containers.
After doing a ceph orch restart rgw.customer it seems to work now.
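In short, the sequence that resolved it looked roughly like this (the spec file
name is just a placeholder for the yaml quoted below):

ceph orch apply -i rgw-customer.yaml   # spec now includes rgw_zonegroup
ceph orch restart rgw.customer         # apply alone did not reconfigure the running daemons
curl -v '[IPv6]:7481/somebucket'       # x-amz-request-id should now show the eu-customer-1 zone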

Happy weekend everybody.

Am Fr., 24. Mai 2024 um 14:19 Uhr schrieb Boris :

> Hi,
>
> we are currently in the process of adopting the main s3 cluster to
> orchestrator.
> We have two realms (one for us and one for the customer).
>
> The old config worked fine and depending on the port I requested, I got
> different x-amz-request-id header back:
> x-amz-request-id: tx0307170ac0d734ab4-0066508120-94aa0f66-eu-central-1
> x-amz-request-id: tx0a9d8fc1821bbe258-00665081d1-949b2447-eu-customer-1
>
> I then deployed the orchestrator config to other systems and checked, but
> now I get always an answer from the eu-central-1 zone and never from the
> customer zone.
>
> The test is very simple and works fine for the old rgw config, but fails
> for the new orchestrator config:
> curl [IPv6]:7481/somebucket -v
>
> All rgw instances are running on ceph version 17.2.7.
> Configs attached below.
>
> The old ceph.conf looked like this:
> [client.eu-central-1-s3db1-old]
> rgw_frontends = beast endpoint=[::]:7480
> rgw_region = eu
> rgw_zone = eu-central-1
> rgw_dns_name = example.com
> rgw_dns_s3website_name = eu-central-1.example.com
> [client.eu-customer-1-s3db1]
> rgw_frontends = beast endpoint=[::]:7481
> rgw_region = eu-customer
> rgw_zone = eu-customer-1
> rgw_dns_name = s3.customer.domain
> rgw_dns_s3website_name = eu-central-1.s3.customer.domain
>
> And this is the new service.yaml
> service_type: rgw
> service_id: ab12
> placement:
> label: rgw
> config:
> debug_rgw: 0
> rgw_thread_pool_size: 2048
> rgw_dns_name: example.com
> rgw_dns_s3website_name: eu-central-1.example.com
> rgw_enable_gc_threads: false
> spec:
> rgw_frontend_port: 7480
> rgw_frontend_type: beast
> rgw_realm: company
> rgw_zone: eu-central-1
> rgw_zonegroup: eu
> ---
> service_type: rgw
> service_id: customer
> placement:
> label: rgw
> config:
> debug_rgw: 0
> rgw_dns_name: s3.customer.domain
> rgw_dns_s3website_name: eu-central-1.s3.customer.domain
> rgw_enable_gc_threads: false
> spec:
> rgw_frontend_port: 7481
> rgw_frontend_type: beast
> rgw_realm: customer
> rgw_zone: eu-customer-1
> rgw_zonegroup: eu-customer
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
>


-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] quincy rgw with ceph orch and two realms only get answers from first realm

2024-05-24 Thread Boris
Hi,

we are currently in the process of adopting the main s3 cluster to
orchestrator.
We have two realms (one for us and one for the customer).

The old config worked fine and depending on the port I requested, I got
different x-amz-request-id header back:
x-amz-request-id: tx0307170ac0d734ab4-0066508120-94aa0f66-eu-central-1
x-amz-request-id: tx0a9d8fc1821bbe258-00665081d1-949b2447-eu-customer-1

I then deployed the orchestrator config to other systems and checked, but
now I get always an answer from the eu-central-1 zone and never from the
customer zone.

The test is very simple and works fine for the old rgw config, but fails
for the new orchestrator config:
curl [IPv6]:7481/somebucket -v

All rgw instances are running on ceph version 17.2.7.
Configs attached below.

The old ceph.conf looked like this:
[client.eu-central-1-s3db1-old]
rgw_frontends = beast endpoint=[::]:7480
rgw_region = eu
rgw_zone = eu-central-1
rgw_dns_name = example.com
rgw_dns_s3website_name = eu-central-1.example.com
[client.eu-customer-1-s3db1]
rgw_frontends = beast endpoint=[::]:7481
rgw_region = eu-customer
rgw_zone = eu-customer-1
rgw_dns_name = s3.customer.domain
rgw_dns_s3website_name = eu-central-1.s3.customer.domain

And this is the new service.yaml
service_type: rgw
service_id: ab12
placement:
  label: rgw
config:
  debug_rgw: 0
  rgw_thread_pool_size: 2048
  rgw_dns_name: example.com
  rgw_dns_s3website_name: eu-central-1.example.com
  rgw_enable_gc_threads: false
spec:
  rgw_frontend_port: 7480
  rgw_frontend_type: beast
  rgw_realm: company
  rgw_zone: eu-central-1
  rgw_zonegroup: eu
---
service_type: rgw
service_id: customer
placement:
  label: rgw
config:
  debug_rgw: 0
  rgw_dns_name: s3.customer.domain
  rgw_dns_s3website_name: eu-central-1.s3.customer.domain
  rgw_enable_gc_threads: false
spec:
  rgw_frontend_port: 7481
  rgw_frontend_type: beast
  rgw_realm: customer
  rgw_zone: eu-customer-1
  rgw_zonegroup: eu-customer

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

2024-05-24 Thread Eugen Block

Hi,

thanks for picking that up so quickly!

I haven't used a host spec file yet to add new hosts, but if you read  
my thread about the unknown PGs, this might be my first choice to do  
that in the future. So thanks again for bringig it to my attention. ;-)


Regards,
Eugen

Zitat von Matthew Vernon :


Hi,

On 22/05/2024 12:44, Eugen Block wrote:


you can specify the entire tree in the location statement, if you need to:


[snip]

Brilliant, that's just the ticket, thank you :)


This should be made a bit clearer in the docs [0], I added Zac.


I've opened a MR to update the docs, I hope it's at least useful as  
a starter-for-ten:

https://github.com/ceph/ceph/pull/57633

Thanks,

Matthew



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: User + Dev Meetup Tomorrow!

2024-05-24 Thread Frédéric Nass
Hello everyone,

Nice talk yesterday. :-)

Regarding containers vs RPMs and orchestration, and the related discussion from 
yesterday, I wanted to share a few things (which I wasn't able to share 
yesterday on the call due to a headset/bluetooth stack issue) to explain why we 
use cephadm and ceph orch these days with bare-metal clusters even though, as 
someone said, cephadm was not supposed to work with (nor support) bare-metal 
clusters (which actually surprised me since cephadm is all about managing 
containers on a host, regardless of its type). I also think this explains the 
observation that was made that half of the reports (iirc) are supposedly using 
cephadm with bare-metal clusters.

Over the years, we've deployed and managed bare-metal clusters with ceph-deploy 
in Hammer, then switched to ceph-ansible (take-over-existing-cluster.yml) with 
Jewel (or was it Luminous?), and then moved to cephadm, cephadm-ansible and 
ceph-orch with Pacific, to manage the exact same bare-metal cluster. I guess 
this explains why some bare-metal clusters today are managed using cephadm. 
These are not new clusters deployed with Rook in K8s environments, but existing 
bare-metal clusters that continue to serve brilliantly 10 years after 
installation.

Regarding rpms vs containers, as mentioned during the call, not sure why one 
would still want to use rpms vs containers considering the simplicity and 
velocity that containers offer regarding upgrades with ceph orch clever 
automation. Some reported performance reasons between rpms vs containers, 
meaning rpms binaries would perform better than containers. Is there any 
evidence of that?

Perhaps the reason why people still use RPMs is instead that they have invested 
a lot of time and effort into developing automation tools/scripts/playbooks for 
RPMs installations and they consider the transition to ceph orch and 
containerized environments as a significant challenge.

Regarding containerized Ceph, I remember asking Sage for a minimalist CephOS 
back in 2018 (there was no containers by that time). IIRC, he said maintaining 
a ceph-specific Linux distro would take too much time and resources, so it was 
not something considered at that time. Now that Ceph is all containers, I 
really hope that a minimalist rolling Ceph distro comes out one day. ceph orch 
could even handle rare distro upgrades such as kernel upgrades as well as 
ordered reboots. This would make ceph clusters really easier to maintain over 
time (compared to the last complicated upgrade path from non-containerized 
RHEL7+RHCS4.3 to containerized RHEL9+RHCS5.2 that we had to follow a year ago).

Bests,
Frédéric.

- Le 23 Mai 24, à 15:58, Laura Flores lflo...@redhat.com a écrit :

> Hi all,
> 
> The meeting will be starting shortly! Join us at this link:
> https://meet.jit.si/ceph-user-dev-monthly
> 
> - Laura
> 
> On Wed, May 22, 2024 at 2:55 PM Laura Flores  wrote:
> 
>> Hi all,
>>
>> The User + Dev Meetup will be held tomorrow at 10:00 AM EDT. We will be
>> discussing the results of the latest survey, and users who attend will have
>> the opportunity to provide additional feedback in real time.
>>
>> See you there!
>> Laura Flores
>>
>> Meeting Details:
>> https://www.meetup.com/ceph-user-group/events/300883526/
>>
>> --
>>
>> Laura Flores
>>
>> She/Her/Hers
>>
>> Software Engineer, Ceph Storage 
>>
>> Chicago, IL
>>
>> lflo...@ibm.com | lflo...@redhat.com 
>> M: +17087388804
>>
>>
>>
> 
> --
> 
> Laura Flores
> 
> She/Her/Hers
> 
> Software Engineer, Ceph Storage 
> 
> Chicago, IL
> 
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: unknown PGs after adding hosts in different subtree

2024-05-24 Thread Eugen Block

Hi Frank,

thanks for looking up those trackers. I haven't looked into them yet,  
I'll read your response in detail later, but I wanted to add some new  
observation:


I added another root bucket (custom) to the osd tree:

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12   0  root custom
 -1 0.27698  root default
 -8 0.09399  room room1
 -3 0.04700  host host1
  7hdd  0.02299  osd.7   up   1.0  1.0
 10hdd  0.02299  osd.10  up   1.0  1.0
...

Then I tried this approach to add a new host directly to the non-default root:

# cat host5.yaml
service_type: host
hostname: host5
addr: 192.168.168.54
location:
  root: custom
labels:
   - osd

# ceph orch apply -i host5.yaml

# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12 0.04678  root custom
-23 0.04678  host host5
  1hdd  0.02339  osd.1   up   1.0  1.0
 13hdd  0.02339  osd.13  up   1.0  1.0
 -1 0.27698  root default
 -8 0.09399  room room1
 -3 0.04700  host host1
  7hdd  0.02299  osd.7   up   1.0  1.0
 10hdd  0.02299  osd.10  up   1.0  1.0
...

host5 is placed directly underneath the new custom root correctly, but  
not a single PG is marked "remapped"! So this is actually what I (or  
we) expected. I'm not sure yet what to make of it, but I'm leaning  
towards using this approach in the future: add hosts underneath a  
different root first, and then move them to their designated location.
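A rough sketch of that second step (the target location here is hypothetical and
taken from the tree above; the real designated buckets may differ):

ceph osd crush move host5 root=default room=room1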


Just to validate again, I added host6 without a location spec, so it's  
placed underneath the default root again:


# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
-12 0.04678  root custom
-23 0.04678  host host5
  1hdd  0.02339  osd.1   up   1.0  1.0
 13hdd  0.02339  osd.13  up   1.0  1.0
 -1 0.32376  root default
-25 0.04678  host host6
 14hdd  0.02339  osd.14  up   1.0  1.0
 15hdd  0.02339  osd.15  up   1.0  1.0
 -8 0.09399  room room1
 -3 0.04700  host host1
...

And this leads to remapped PGs again. I assume this must be related to  
the default root. I'm gonna investigate further.


Thanks!
Eugen


Zitat von Frank Schilder :


Hi Eugen,

just to add another strangeness observation from long ago:  
https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't see  
any reweights in your trees, so its something else. However, there  
seem to be multiple issues with EC pools and peering.


I also want to clarify:

If this is the case, it is possible that this is partly intentional  
and partly buggy.


"Partly intentional" here means the code behaviour changes when you  
add OSDs to the root outside the rooms and this change is not  
considered a bug. It is clearly *not* expected as it means you  
cannot do maintenance on a pool living on a tree A without affecting  
pools on the same device class living on an unmodified subtree of A.


From a ceph user's point of view everything you observe looks buggy.  
I would really like to see a good explanation why the mappings in  
the subtree *should* change when adding OSDs above that subtree as  
in your case when the expectation for good reasons is that they  
don't. This would help devising clean procedures for adding hosts  
when you (and I) want to add OSDs first without any peering and then  
move OSDs into place to have it happen separate from adding and not  
a total mess with everything in parallel.


Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: Thursday, May 23, 2024 6:32 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi Eugen,

I'm at home now. Could you please check for all the remapped PGs that  
they have no shards on the new OSDs, i.e. it's just shuffling around  
mappings within the same set of OSDs under rooms?


If this is the case, it is possible that this is partly intentional  
and partly buggy. The remapping is then probably intentional and the  
method I use with a disjoint tree for new hosts prevents such  
remappings initially (the crush code sees the new OSDs in the root,  
doesn't use them but their presence does change choice orders  
resulting in remapped PGs). However, the unknown PGs should clearly  
not occur.


I'm afraid that the peering code has quite a few bugs, I reported  
something at least similarly weird a long time ago:  
https://tracker.ceph.com/issues/56995 and  
https://tracker.ceph.com/issues/46847. Might even be related. It  
looks like peering can l