[ceph-users] Cannot get backfill speed up

2023-07-05 Thread Jesper Krogh



Hi.

Fresh cluster - but despite setting:
jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
osd_recovery_max_active_ssd   50    mon   default[20]
jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
osd_max_backfills             100   mon   default[10]


I still get
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
id: 5c384430-da91-11ed-af9c-c780a5227aff
health: HEALTH_OK

  services:
mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 
(age 16h)
mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: 
dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd

mds: 2/2 daemons up, 1 standby
osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs

  data:
volumes: 2/2 healthy
pools:   9 pools, 495 pgs
objects: 24.85M objects, 60 TiB
usage:   117 TiB used, 159 TiB / 276 TiB avail
pgs: 10655690/145764002 objects misplaced (7.310%)
 474 active+clean
 15  active+remapped+backfilling
 6   active+remapped+backfill_wait

  io:
client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
recovery: 328 MiB/s, 108 objects/s

  progress:
Global Recovery Event (9h)
  [==..] (remaining: 25m)

With these settings I would expect more than 15 PGs actively backfilling, and
given the SSDs and the 2x25 Gbit network, I can also spend more resources on
recovery than 328 MiB/s.
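
For reference, this is roughly what I am checking and setting (a sketch; using
the osd-level scope with `ceph config set` is my assumption about the right
knob here, not a confirmed fix):

# confirm the values the OSD daemon actually sees, as above
sudo ceph config show osd.0 | grep -e osd_max_backfills -e osd_recovery_max_active_ssd

# set them for all OSDs at the osd scope rather than via the mon section
sudo ceph config set osd osd_max_backfills 100
sudo ceph config set osd osd_recovery_max_active_ssd 50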


Thanks.

--
Jesper Krogh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rook on bare-metal?

2023-07-05 Thread Nico Schottelius


Morning,

we are running some ceph clusters with rook on bare metal and can very
much recommend it. You should have proper k8s knowledge and know how to
change objects such as ConfigMaps or Deployments, in case things go
wrong.

In regards to stability, the rook operator is written rather
defensively: it does not change monitors or the cluster if quorum is
not met, and it checks the OSD status when OSDs are removed or added.

So TL;DR: very much usable and rather k8s native.

BR,

Nico

zs...@tuta.io writes:

> Hello!
>
> I am looking to simplify ceph management on bare metal by deploying
> Rook onto kubernetes that has been deployed on bare metal (rke). I
> have used rook in a cloud environment, but I have not used it on
> bare metal. I am wondering if anyone here runs rook on bare metal?
> Would you recommend it over cephadm, or would you steer clear of it?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg_num != pgp_num - and unable to change.

2023-07-05 Thread Jesper Krogh

Hi.

Fresh cluster - after a dance where the autoscaler did not work
(returned blank output) as described in the docs - I now seemingly have it
working. It has bumped the target to something reasonable and is slowly
incrementing pg_num and pgp_num by 2 over time (I hope this is correct?)


But...
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 
min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 
pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 
lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk 
stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application 
cephfs


pg_num = 150
pgp_num = 22

and setting pgp_num seemingly has zero effect on the system, not even
with autoscaling set to off.


jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_autoscale_mode off

set pool 22 pg_autoscale_mode to off
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pgp_num 150

set pool 22 pgp_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_num_min 128

set pool 22 pg_num_min to 128
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_num 150

set pool 22 pg_num to 150
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool set cephfs.archive.ec62data 
pg_autoscale_mode on

set pool 22 pg_autoscale_mode to on
jskr@dkcphhpcmgt028:/$ sudo ceph progress
PG autoscaler increasing pool 22 PGs from 150 to 512 (14s)
[]
jskr@dkcphhpcmgt028:/$ sudo ceph osd pool ls detail | grep 62
pool 22 'cephfs.archive.ec62data' erasure profile ecprof62 size 8 
min_size 7 crush_rule 3 object_hash rjenkins pg_num 150 pgp_num 22 
pg_num_target 512 pgp_num_target 512 autoscale_mode on last_change 9159 
lfor 0/0/9147 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk 
stripe_width 24576 pg_num_min 128 target_size_ratio 0.4 application 
cephfs


pgp_num != pg_num ?

In earlier versions of ceph (without the autoscaler), setting pg_num and
pgp_num took immediate effect in my experience?
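
For reference, my understanding (an assumption on my side, not something
confirmed in this thread) is that on recent releases setting pg_num/pgp_num
only adjusts the *_target values, and the mgr then walks the actual
pg_num/pgp_num towards those targets in small steps, throttled by the allowed
misplaced ratio. This is roughly what I am checking:

# current values for the pool (the targets are visible in 'ceph osd pool ls detail')
sudo ceph osd pool get cephfs.archive.ec62data pg_num
sudo ceph osd pool get cephfs.archive.ec62data pgp_num

# mgr option that throttles how quickly pgp_num may follow its target
sudo ceph config get mgr target_max_misplaced_ratio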


Jesper

jskr@dkcphhpcmgt028:/$ sudo ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy 
(stable)

jskr@dkcphhpcmgt028:/$ sudo ceph health
HEALTH_OK
jskr@dkcphhpcmgt028:/$ sudo ceph status
  cluster:
id: 5c384430-da91-11ed-af9c-c780a5227aff
health: HEALTH_OK

  services:
mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 
(age 15h)
mgr: dkcphhpcmgt031.afbgjx(active, since 32h), standbys: 
dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd

mds: 2/2 daemons up, 1 standby
osd: 40 osds: 40 up (since 44h), 40 in (since 39h); 33 remapped pgs

  data:
volumes: 2/2 healthy
pools:   9 pools, 495 pgs
objects: 24.85M objects, 60 TiB
usage:   117 TiB used, 158 TiB / 276 TiB avail
pgs: 13494029/145763897 objects misplaced (9.257%)
 462 active+clean
 23  active+remapped+backfilling
 10  active+remapped+backfill_wait

  io:
client:   0 B/s rd, 1.1 MiB/s wr, 0 op/s rd, 94 op/s wr
recovery: 705 MiB/s, 208 objects/s

  progress:


--
Jesper Krogh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT Meeting minutes 2023-07-05

2023-07-05 Thread Radoslaw Zarzynski
Hello!

Releasing Reef
-
* RC2 is out but we still have several PRs to go, including blockers.
* RC3 might be worth doing, but Reef shall go out before the end of the month.

Misc
---
* For the sake of unit testing dencoder interoperability, we're going
  to impose some extra work (like registering types within ceph-dencoder)
  on developers writing encodable structs. This will be discussed further
  in a CDM.
* A lab issue got fixed.

Regards
Radek
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Rook on bare-metal?

2023-07-05 Thread zssas
Hello!

I am looking to simplify ceph management on bare metal by deploying Rook onto
kubernetes that has been deployed on bare metal (rke). I have used rook in a
cloud environment, but I have not used it on bare metal. I am wondering if
anyone here runs rook on bare metal? Would you recommend it over cephadm, or
would you steer clear of it?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph quota question

2023-07-05 Thread sejun21 . kim
Hi, I'm contacting you with a question about quotas.

The situation is as follows:

1. I set the user quota to 10M.
2. Using S3 Browser, I uploaded one 12M file.
3. The upload failed as intended, but some objects remain in the pool (almost 10M),
   and S3 Browser doesn't show the failed file.

I expected nothing to be left in Ceph.

My question is: can the user or the admin remove the remaining objects?
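
If it helps to illustrate what I mean by remaining objects: my assumption is
that the leftover space comes from an incomplete multipart upload, which I
would expect to be inspected and aborted roughly like this (a sketch with
placeholder bucket/endpoint names, not verified on this cluster):

# list incomplete multipart uploads in the bucket
aws s3api list-multipart-uploads --bucket my-bucket --endpoint-url http://rgw.example:8080

# abort a specific upload so its parts become eligible for cleanup
aws s3api abort-multipart-upload --bucket my-bucket --key bigfile.bin \
    --upload-id <UploadId> --endpoint-url http://rgw.example:8080

# an admin can force an RGW garbage-collection pass afterwards
radosgw-admin gc process --include-all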
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Erasure coding and backfilling speed

2023-07-05 Thread jesper
Hi. 

I have a Ceph (NVMe) based cluster with 12 hosts and 40 OSDs. It is currently
backfilling PGs, but I cannot get it to run more than 20 backfills at the same
time (6+2 EC profile).
osd_max_backfills = 100 and osd_recovery_max_active_ssd = 50 (non-sane values),
but it still stops at 20 backfilling with 40+ PGs in backfill_wait.
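
One thing I plan to look at, though I have not confirmed it is the limiting
factor here: as far as I understand, Quincy defaults to the mClock scheduler
(osd_op_queue = mclock_scheduler), whose profile caps recovery/backfill
regardless of the classic options, so a sketch of what I would try is:

# check which op queue the OSDs are using
sudo ceph config show osd.0 | grep osd_op_queue

# give recovery/backfill a larger share of OSD capacity under mClock
sudo ceph config set osd osd_mclock_profile high_recovery_ops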

Any idea about how to speed it up? 

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-07-05 Thread Yin, Congmin
Hi Matthew,

I see "rbd with pwl cache: 5210112 ns". This latency is beyond my expectations
and I believe it is unlikely to occur; in theory, this value should be around a
few hundred microseconds. I'm not sure what went wrong in your steps. Can
you use perf for latency analysis? Hi @Ilya Dryomov, do you have any
suggestions?

Some perf commands (via the admin socket):
admin_socket = /mnt/pmem/cache.asok
ceph --admin-daemon /mnt/pmem/cache.asok perf reset all
ceph --admin-daemon /mnt/pmem/cache.asok perf dump

-Original Message-
From: Matthew Booth  
Sent: Monday, July 3, 2023 6:09 PM
To: Yin, Congmin 
Cc: Ilya Dryomov ; Giulio Fidente ; 
Tang, Guifeng ; Vikhyat Umrao ; 
Jdurgin ; John Fulton ; Francesco 
Pantano ; ceph-users@ceph.io
Subject: Re: [ceph-users] RBD with PWL cache shows poor performance compared to 
cache device

On Fri, 30 Jun 2023 at 08:50, Yin, Congmin  wrote:
>
> Hi Matthew,
>
> Due to the latency of rbd layers, the write latency of the pwl cache is more 
> than ten times that of the Raw device.
> I replied directly below the 2 questions.
>
> Best regards.
> Congmin Yin
>
>
> -Original Message-
> From: Matthew Booth 
> Sent: Thursday, June 29, 2023 7:23 PM
> To: Ilya Dryomov 
> Cc: Giulio Fidente ; Yin, Congmin 
> ; Tang, Guifeng ; 
> Vikhyat Umrao ; Jdurgin ; John 
> Fulton ; Francesco Pantano ; 
> ceph-users@ceph.io
> Subject: Re: [ceph-users] RBD with PWL cache shows poor performance 
> compared to cache device
>
> On Wed, 28 Jun 2023 at 22:44, Ilya Dryomov  wrote:
> >> ** TL;DR
> >>
> >> In testing, the write latency performance of a PWL-cache backed RBD 
> >> disk was 2 orders of magnitude worse than the disk holding the PWL 
> >> cache.
>
>
>
> PWL cache can use pmem or SSD as cache devices. Using PMEM, based on
> my test environment at that time, I can give specific numbers: the
> write latency of the pmem raw device is about 10+ us, the write
> latency of the pwl cache is about 100+ us (from the latency of the rbd
> layers), and the write latency of the ceph cluster is about
> 1000+ us (from messengers and the network). For SSDs there are many
> types and I cannot provide a specific value, but it will definitely
> be worse than pmem. So a result that is 2 orders of magnitude
> slower is worse than expected. Can you provide detailed values for
> the three for analysis? (SSD, pwl cache, ceph cluster)

I'm not entirely sure what you're asking for. Which values are you looking for?

I did provide 3 sets of test results below, is that what you mean?
* rbd no cache: 1417216 ns
* pwl cache device: 44288 ns
* rbd with pwl cache: 5210112 ns

These are all outputs from the benchmarking test. The first is executing in the 
VM writing to a ceph RBD disk *without* PWL. The second is executing on the 
host writing directly to the SSD which is being used for the PWL cache. The 
third is executing in the VM writing to the same ceph RBD disk, but this time
*with* PWL.

Incidentally, the client and server machines are identical, and the SSD used by 
the client for PWL is the same model used on the server as the OSDs. The SSDs 
are SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC H730P Mini (Embedded).

> ==
>
> >>
> >> ** Summary
> >>
> >> I was hoping that PWL cache might be a good solution to the problem 
> >> of write latency requirements of etcd when running a kubernetes 
> >> control plane on ceph. Etcd is extremely write latency sensitive 
> >> and becomes unstable if write latency is too high. The etcd 
> >> workload can be characterised by very small (~4k) writes with a queue 
> >> depth of 1.
> >> Throughput, even on a busy system, is normally very low. As etcd is 
> >> distributed and can safely handle the loss of un-flushed data from 
> >> a single node, a local ssd PWL cache for etcd looked like an ideal 
> >> solution.
> >
> >
> > Right, this is exactly the use case that the PWL cache is supposed to 
> > address.
>
> Good to know!
>
> >> My expectation was that adding a PWL cache on a local SSD to an
> >> RBD-backed VM would improve write latency to something approaching the
> >> write latency performance of the local SSD. However, in my testing,
> >> adding a PWL cache to an rbd-backed VM increased write latency by
> >> approximately 4x over not using a PWL cache. This was over 100x
> >> more than the write latency performance of the underlying SSD.
>
>
>
>
> When using an image as the VM's disk, you may have used commands like the
> following. In many cases, using parameters such as writeback will force the
> rbd cache (a memory cache) to be enabled, and it is normal for the pwl cache
> to be several times slower than that. Please confirm.
> There is currently no parameter support for using only the pwl cache instead
> of the rbd cache. I have tested the latency of using the pwl cache (pmem) by
> modifying the code myself; it is about twice as high as using the rbd cache.
>
> qemu -m 1024 -drive 
> 
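
For reference, a minimal client-side configuration for an SSD-backed PWL cache
might look like the sketch below; the option names reflect my reading of the
RBD persistent write-back cache documentation rather than the exact setup used
in this thread, and the path/size values are placeholders:

[client]
rbd_plugins = pwl_cache
rbd_persistent_cache_mode = ssd        # "rwl" would be the pmem mode
rbd_persistent_cache_path = /mnt/pwl   # must live on the local cache device
rbd_persistent_cache_size = 1G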

[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-07-05 Thread letonphat1988
I hit this issue when trying to upgrade Octopus 15.2.17 to 16.2.13 last night.
The upgrade process failed at the mgr module phase after the new MGR version
became active. I enabled debug logging (`ceph config set mgr
mgr/cephadm/log_to_cluster_level debug`) and saw the same message as @xadhoom76
about the config key "registry_credentials".

I guessed the root cause was this line,
`json.loads(str(self.mgr.get_store('registry_credentials'`, together with a
wrong value in the key store. I got an empty value from both
"ceph config-key dump | grep 'registry_credentials'" and
"ceph config-key get mgr/cephadm/registry_credentials".
Checking the `cephadm` source, the value should be JSON, like:
ceph config-key set mgr/cephadm/registry_credentials '{"url":
"registry.local:5000", "username": "user-deployer", "password":
"xxxzz"}'

After setting this key and running `ceph mgr fail` to reload, my cluster issue was gone.
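
For anyone hitting the same thing, a quick sanity check that the stored value
parses as JSON before failing the mgr (a sketch; json.tool is just one way to
validate it):

ceph config-key get mgr/cephadm/registry_credentials | python3 -m json.tool
ceph mgr fail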
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [multisite] The purpose of zonegroup

2023-07-05 Thread Casey Bodley
thanks Yixin,

On Tue, Jul 4, 2023 at 1:20 PM Yixin Jin  wrote:
>
>  Hi Casey,
> Thanks a lot for the clarification. I feel that zonegroup made great sense
> at the beginning when the multisite feature was conceived and (I suspect) zones
> always synced from all other zones within a zonegroup. However, once
> "sync_from" was introduced and later the sync policy further enhanced the
> granularity of control over data sync, it seems not much advantage is
> left for the zonegroup.

both sync_from and sync policy do offer finer-grained control over
which zones sync from which, but they can't represent a bucket's
'residency' the way that the zonegroup-based LocationConstraint does.
by redirecting requests to the bucket's resident zonegroup, the goal
is to present a single eventually-consistent set of objects per
bucket. while features like sync_from and bucket replication policy do
complicate this picture, i think this concept of residency and
redirects are important to make sense of s3's LocationConstraint. but
perhaps the sync policy model could be extended to take over the
zonegroup's role here?
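
as a concrete illustration of that residency idea (my own sketch with
placeholder names, not something from this thread): a bucket can be pinned to
a zonegroup at creation time via the S3 LocationConstraint, which rgw matches
against the zonegroup's api_name, e.g.:

# create a bucket that should reside in the "us" zonegroup
aws s3api create-bucket --bucket my-bucket \
    --create-bucket-configuration LocationConstraint=us \
    --endpoint-url http://rgw.example:8080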

> Both "sync_from" and sync policy could be moved up to the realm level while the
> isolation of datasets can still be maintained. On the other hand, if some new
> features are introduced to enable some isolation of metadata within the same
> realm, probably at the zonegroup level, its usefulness may be more justified.
> Regards, Yixin
>
> On Friday, June 30, 2023 at 11:29:16 a.m. EDT, Casey Bodley 
>  wrote:
>
>  you're correct that the distinction is between metadata and data;
> metadata like users and buckets will replicate to all zonegroups,
> while object data only replicates within a single zonegroup. any given
> bucket is 'owned' by the zonegroup that creates it (or overridden by
> the LocationConstraint on creation). requests for data in that bucket
> sent to other zonegroups should redirect to the zonegroup where it
> resides
>
> the ability to create multiple zonegroups can be useful in cases where
> you want some isolation for the datasets, but a shared namespace of
> users and buckets. you may have several connected sites sharing
> storage, but only require a single backup for purposes of disaster
> recovery. there it could make sense to create several zonegroups with
> only two zones each to avoid replicating all objects to all zones
>
> in other cases, it could make more sense to isolate things in separate
> realms with a single zonegroup each. zonegroups just provide some
> flexibility to control the isolation of data and metadata separately
>
> On Thu, Jun 29, 2023 at 5:48 PM Yixin Jin  wrote:
> >
> > Hi folks,
> > In the multisite environment, we can get one realm that contains multiple 
> > zonegroups, each in turn can have multiple zones. However, the purpose of 
> > zonegroup isn't clear to me. It seems that when a user is created, its 
> > metadata is synced to all zones within the same realm, regardless whether 
> > they are in different zonegroups or not. The same happens to buckets. 
> > Therefore, what is the purpose of having zonegroups? Wouldn't it be easier 
> > to just have realm and zones?
> > Thanks,Yixin
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [multisite] The purpose of zonegroup

2023-07-05 Thread Zac Dover
The information provided by Casey has been added to doc/radosgw/multisite.rst 
in this PR: https://github.com/ceph/ceph/pull/52324

Zac Dover
Upstream Docs
Ceph Foundation




--- Original Message ---
On Saturday, July 1st, 2023 at 1:45 AM, Casey Bodley  wrote:


> 
> 
> cc Zac, who has been working on multisite docs in
> https://tracker.ceph.com/issues/58632
> 
> On Fri, Jun 30, 2023 at 11:37 AM Alexander E. Patrakov
> patra...@gmail.com wrote:
> 
> > Thanks! This is something that should be copy-pasted at the top of
> > https://docs.ceph.com/en/latest/radosgw/multisite/
> > 
> > Actually, I reported a documentation bug for something very similar.
> > 
> > On Fri, Jun 30, 2023 at 11:30 PM Casey Bodley cbod...@redhat.com wrote:
> > 
> > > you're correct that the distinction is between metadata and data;
> > > metadata like users and buckets will replicate to all zonegroups,
> > > while object data only replicates within a single zonegroup. any given
> > > bucket is 'owned' by the zonegroup that creates it (or overridden by
> > > the LocationConstraint on creation). requests for data in that bucket
> > > sent to other zonegroups should redirect to the zonegroup where it
> > > resides
> > > 
> > > the ability to create multiple zonegroups can be useful in cases where
> > > you want some isolation for the datasets, but a shared namespace of
> > > users and buckets. you may have several connected sites sharing
> > > storage, but only require a single backup for purposes of disaster
> > > recovery. there it could make sense to create several zonegroups with
> > > only two zones each to avoid replicating all objects to all zones
> > > 
> > > in other cases, it could make more sense to isolate things in separate
> > > realms with a single zonegroup each. zonegroups just provide some
> > > flexibility to control the isolation of data and metadata separately
> > > 
> > > On Thu, Jun 29, 2023 at 5:48 PM Yixin Jin yji...@yahoo.ca wrote:
> > > 
> > > > Hi folks,
> > > > In the multisite environment, we can get one realm that contains 
> > > > multiple zonegroups, each in turn can have multiple zones. However, the 
> > > > purpose of zonegroup isn't clear to me. It seems that when a user is 
> > > > created, its metadata is synced to all zones within the same realm, 
> > > > regardless whether they are in different zonegroups or not. The same 
> > > > happens to buckets. Therefore, what is the purpose of having 
> > > > zonegroups? Wouldn't it be easier to just have realm and zones?
> > > > Thanks,Yixin
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > 
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > 
> > --
> > Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD with PWL cache shows poor performance compared to cache device

2023-07-05 Thread Mark Nelson


On 7/4/23 10:39, Matthew Booth wrote:

On Tue, 4 Jul 2023 at 10:00, Matthew Booth  wrote:

On Mon, 3 Jul 2023 at 18:33, Ilya Dryomov  wrote:

On Mon, Jul 3, 2023 at 6:58 PM Mark Nelson  wrote:


On 7/3/23 04:53, Matthew Booth wrote:

On Thu, 29 Jun 2023 at 14:11, Mark Nelson  wrote:

This container runs:
  fio --rw=write --ioengine=sync --fdatasync=1
--directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
--output-format=json --runtime=60 --time_based=1

And extracts sync.lat_ns.percentile["99.00"]

Matthew, do you have the rest of the fio output captured?  It would be 
interesting to see if it's just the 99th percentile that is bad or the PWL 
cache is worse in general.

Sure.

With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false:
https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/

Also, how's the CPU usage client side?  I would be very curious to see
if unwindpmp shows anything useful (especially lock contention):


https://github.com/markhpc/uwpmp


Just attach it to the client-side process and start out with something
like 100 samples (more are better but take longer).  You can run it like:


./unwindpmp -n 100 -p 

I've included the output in this gist:
https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d

That gist contains 4 runs: 2 with PWL enabled and 2 without, and also
a markdown file explaining the collection method.

Matt


Thanks Matt!  I looked through the output.  Looks like the symbols might
have gotten mangled.  I'm not an expert on the RBD client, but I don't
think we would really be calling into
rbd_group_snap_rollback_with_progress from
librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl.  Was it possible
you used the libdw backend for unwindpmp?  libdw sometimes gives
strange/mangled callgraphs, but I haven't seen it before with
libunwind.  Hopefully Congmin Yin or Ilya can confirm if it's garbage.
So with that said, assuming we can trust these callgraphs at all, it
looks like it might be worth looking at the latency of the
AbstractWriteLog, librbd::cache::pwl::ssd::WriteLogEntry::writeback_bl,
and possibly usage of librados::v14_2_0::IoCtx::object_list.  On the

Hi Mark,

Both rbd_group_snap_rollback_with_progress and
librados::v14_2_0::IoCtx::object_list entries don't make sense to me,
so I'd say it's garbage.

Unfortunately I'm not at all familiar with this tool. Do you know how
it obtains its symbols? I didn't install any debuginfo packages, so I
was a bit surprised to see any symbols at all.

I installed the following debuginfo packages and re-ran the tests:
elfutils-debuginfod-client-0.189-2.fc38.x86_64
elfutils-debuginfod-client-devel-0.189-2.fc38.x86_64
ceph-debuginfo-17.2.6-3.fc38.x86_64
librbd1-debuginfo-17.2.6-3.fc38.x86_64
librados2-debuginfo-17.2.6-3.fc38.x86_64
qemu-debuginfo-7.2.1-2.fc38.x86_64
qemu-system-x86-core-debuginfo-7.2.1-2.fc38.x86_64
boost-debuginfo-1.78.0-11.fc38.x86_64

Note that unwindpmp now runs considerably slower (because it re-reads
debug symbols for each sample?), so I had to reduce the number of
samples to 500.



It basically just uses libunwind or libdw to unwind the stack over and 
over and then unwindpmp turns the resulting samples into a forward or 
reverse call graph.  The libunwind backend code is here:


https://github.com/markhpc/uwpmp/blob/master/src/tracer/unwind_tracer.cc

I'm sort of amazed that it gave you symbols without the debuginfo 
packages installed.  I'll need to figure out a way to prevent that.  
Having said that, your new traces look more accurate to me.  The thing 
that sticks out to me is the (slight?) amount of contention on the PWL 
m_lock in dispatch_deferred_writes, update_root_scheduled_ops, 
append_ops, append_sync_point(), etc.


I don't know if the contention around the m_lock is enough to cause an 
increase in 99% tail latency from 1.4ms to 5.2ms, but it's the first 
thing that jumps out at me.  There appears to be a large number of 
threads (each tp_pwl thread, the io_context_pool threads, the qemu 
thread, and the bstore_aio thread) that all appear to have potential to 
contend on that lock.  You could try dropping the number of tp_pwl 
threads from 4 to 1 and see if that changes anything.



Mark




I have updated the gist with the new results:
https://gist.github.com/mdbooth/2d68b7e081a37e27b78fe396d771427d

Thanks,
Matt


--
Best Regards,
Mark Nelson
Head of R&D (USA)

Clyso GmbH
p: +49 89 21552391 12
a: Loristraße 8 | 80335 München | Germany
w: https://clyso.com | e: mark.nel...@clyso.com

We are hiring: https://www.clyso.com/jobs/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Slow ACL Changes in Secondary Zone

2023-07-05 Thread Ramin Najjarbashi
Dear Ceph community,

I'm facing an issue with ACL changes in the secondary zone of my Ceph
cluster after making modifications to the API name in the master zone of my
master zonegroup. I would appreciate any insights or suggestions on how to
resolve this problem.

Here's the background information on my setup:


   - I have two clusters that are part of a single realm.
   - Each cluster has a zone within a single zonegroup.
   - Initially, all functionality was working perfectly fine.

However, the problem arose when I changed the API name in the master zone
of my master zonegroup. Since then, all functionalities appear to be
functioning as expected, except for ACL changes, which have become
extremely slow specifically in the secondary zone. Whenever I attempt to
change the ACL in the secondary zone, it takes approximately 10 seconds or
more for a response to be received.
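
For completeness, this is roughly how I am inspecting the configuration on
both sites (a sketch; whether the api_name change still needs a period commit
to propagate is my assumption, not something I have confirmed):

# show the zonegroup as each site sees it, including api_name and endpoints
radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>

# show the current period/epoch on both sites
radosgw-admin period get

# after changing zonegroup settings on the master, commit and distribute them
radosgw-admin period update --commit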

I would like to understand why this delay is occurring and find a solution
to improve the performance of ACL changes in the secondary zone. Any
suggestions, explanations, or guidance would be greatly appreciated.

Thank you in advance for your help.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Mishap after disk replacement, db and block split into separate OSD's in ceph-volume

2023-07-05 Thread Mikael Öhman
We had a faulty disk which was causing many errors, and the replacement took a
while, so we had to try to stop ceph from using the OSD during this time.
However, I think we must have done that wrong: after the disk replacement,
ceph orch seems to have picked up /dev/sdp and automatically added a new OSD
(588) without a separate DB device (since that was still taken by the old
OSD 31, maybe? I'm not sure how to ...).
This led to issues where osd.31 of course wouldn't start, and some actions
were attempted to clear this out, which might have just caused more harm.

Long story short, we are currently in an odd position where we still have
ceph-volume lvm list osd.31 with only a [db] section:
== osd.31 ==

  [db]      /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e

      block device          /dev/ceph-48f7dbd8-4a7c-4f7e-8962-104e756ae864/osd-block-33538b36-52b3-421d-bf66-6c729a057707
      block uuid            bykFYi-z8T6-OWXp-i1OB-H7CE-uLDm-Td6QTI
      cephx lockbox secret
      cluster fsid          5406fed0-d52b-11ec-beff-7ed30a54847b
      cluster name          ceph
      crush device class    None
      db device             /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e
      db uuid               Vy3aOA-qseQ-RIDT-741e-z7o0-y376-kKTXRE
      encrypted             0
      osd fsid              33538b36-52b3-421d-bf66-6c729a057707
      osd id                31
      osdspec affinity      osd_spec
      type                  db
      vdo                   0
      devices               /dev/nvme0n1

and a separate extra osd.588 (which is running) which has taken only the
[block] device:

== osd.588 ==

  [block]   /dev/ceph-f63ef837-3b18-47a4-be55-d5c2c0db8927/osd-block-58b33b8f-9623-46b3-a86a-3061602a76b5

      block device          /dev/ceph-f63ef837-3b18-47a4-be55-d5c2c0db8927/osd-block-58b33b8f-9623-46b3-a86a-3061602a76b5
      block uuid            KYHzBq-zgJJ-Nw93-j7Jx-Oz5i-BMuU-ndtTCH
      cephx lockbox secret
      cluster fsid          5406fed0-d52b-11ec-beff-7ed30a54847b
      cluster name          ceph
      crush device class
      encrypted             0
      osd fsid              58b33b8f-9623-46b3-a86a-3061602a76b5
      osd id                588
      osdspec affinity      all-available-devices
      type                  block
      vdo                   0
      devices               /dev/sdp

I figured the best action was to clear out both of these faulty OSDs via
orch ("ceph orch osd rm XX"), but osd 31 isn't recognized:

[ceph: root@mimer-osd01 /]# ceph orch osd rm 31
Unable to find OSDs: ['31']

Deleting 588 is recognized. Should I attempt to clear out osd.31 from
ceph-volume manually?
I'd really like to get back to a situation where I have osd.31, with an osd
fsid that matches the device names, using /dev/sdp and /dev/nvme0n1, but I'm
really afraid of just breaking things even more.
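
For reference, this is the kind of manual cleanup I have in mind, based on my
reading of the docs rather than anything verified here; the zap step is
destructive, so this is only a sketch of the direction, not a recipe:

# if osd.31 still exists in the osd map, drop the cluster-side record
ceph osd purge 31 --yes-i-really-mean-it

# inside the cephadm shell on mimer-osd01, wipe the orphaned DB LV so it can
# be reused - only after double-checking it really belongs to the dead osd.31
ceph-volume lvm zap --destroy \
    /dev/ceph-1b309b1e-a4a6-4861-b16c-7c06ecde1a3d/osd-db-fb09a714-f955-4418-99f2-6bccd8c6220e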

From what I can see from files lying around, the OSD spec we have is
simply:
placement:
  host_pattern: "mimer-osd01"
service_id: osd_spec
service_type: osd
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
in case this matters. I appreciate any help or guidance.

Best regards, Mikael
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io