[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-04-12 Thread Reed Dier
Hi Jan, As someone who has been watching this thread in anticipation of planning an Octopus to Pacific upgrade, and also someone not all that interested in repaving all OSDs, which release(s) were the OSDs originally deployed with? Just trying to get a basic estimate on how recent or not these
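
For context when planning a similar upgrade: a quick, non-authoritative way to see what the OSDs currently run is the OSD metadata; the OSD id below is just an example, and which fields appear (e.g. bluestore_min_alloc_size on newer builds) depends on the release.

  # running version and (on newer builds) on-disk allocation details for one OSD
  ceph osd metadata 0 | grep -E 'ceph_version|alloc'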

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-04-12 Thread Jan-Tristan Kruse
Hi, we solved our latency problems in two clusters by redeploying all OSDs. Since then, we have not re-encountered the problems described. Greetings, Jan

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-04-02 Thread Jan-Tristan Kruse
ptimal (op_w_latency at around 30-40ms, with peaks at 1-2s sometimes), but somewhat stable. Greetings, Jan From: c...@elchaka.de Date: Saturday, April 1, 2023 at 01:12 To: ceph-users@ceph.io, Jan-Tristan Kruse Subject: Re: [ceph-users] Re: avg apply latency went up after update from octopus to p

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-31 Thread ceph
Hello Jan, I had the same on two clusters going from Nautilus to Pacific. On both it helped to run "ceph tell osd.* compact". If that had not helped, I would go for recreating the OSDs... Hth Mehmet On March 31, 2023, 10:56:42 CEST, j.kr...@profihost.ag wrote: >Hi, > >we have a very similar
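
For reference, the online compaction mentioned here is a single command per OSD, or for all of them at once; the OSD id is an example, and compaction temporarily adds load to the affected OSDs:

  # compact the RocksDB of a single running OSD
  ceph tell osd.12 compact
  # or all OSDs at once (expect temporarily higher latency while they compact)
  ceph tell osd.\* compact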

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-31 Thread j . kruse
Hi, we have a very similar situation. We updated from Nautilus -> Pacific (16.2.11) and saw a rapid increase in commit_latency and op_w_latency (>10s on some OSDs) after a few hours. We also have an almost exclusively RBD workload. After deleting old snapshots we saw an improvement, and after
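
The latencies quoted here can be watched from the monitors or from an OSD's admin socket; a minimal sketch, with the OSD id as an example:

  # per-OSD commit/apply latency as seen by the cluster
  ceph osd perf
  # op_w_latency counters for one OSD (run on the host carrying osd.3)
  ceph daemon osd.3 perf dump | grep -A 3 '"op_w_latency"'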

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-30 Thread Boris Behrens
A short correction: the IOPS from the bench in our Pacific cluster are also down to 40 again for the 4/8TB disks, but the apply latency seems to stay in the same place. I still don't understand why it is down again. Even when I synced out the OSD so that it receives no traffic, it is still slow.

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-30 Thread Boris Behrens
After some digging in the Nautilus cluster I see that the disks with the exceptionally high IOPS performance are actually SAS-attached NVMe disks (these: https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1643-pm1643a/mzilt7t6hala-7/ ), and these disks make up around 45% of the cluster capacity.

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Igor Fedotov
On 3/27/2023 12:19 PM, Boris Behrens wrote: Nonetheless, the IOPS the bench command generates are still VERY low compared to the Nautilus cluster (~150 vs ~250). But this is something I would pin to this bug: https://tracker.ceph.com/issues/58530 I've just run "ceph tell bench" against main,
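
For reference, the bench numbers discussed in this thread come from the built-in OSD benchmark; a hedged example, with the OSD id and sizes chosen only for illustration (results are mainly useful for comparing identically configured OSDs):

  # write ~12 MB in 4 KB IOs to osd.0 and report IOPS and throughput
  ceph tell osd.0 bench 12288000 4096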

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Marc
> And for your reference - IOPS numbers I'm getting in my lab with data/DB colocated:
> 1) OSD on top of Intel S4600 (SATA SSD) - ~110 IOPS
SATA SSDs on Nautilus: Micron 5100: 117, MZ7KM1T9HMJP-5: 122

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Boris Behrens
Hey Igor, we are currently using these disks - all SATA attached (is it normal to have some OSDs without a wear counter?):
# ceph device ls | awk '{print $1}' | cut -f 1,2 -d _ | sort | uniq -c
  18 SAMSUNG_MZ7KH3T8 (4TB)
 126 SAMSUNG_MZ7KM1T9 (2TB)
  24 SAMSUNG_MZ7L37T6 (8TB)
   1
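
If device health monitoring is enabled, the wear/SMART data behind that counter can be inspected per device; a sketch, where the device id is a made-up example in the VENDOR_MODEL_SERIAL form that "ceph device ls" prints:

  # list devices known to the cluster
  ceph device ls
  # dump the collected SMART/health metrics for one device
  ceph device get-health-metrics SAMSUNG_MZ7KM1T9_S0EXAMPLE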

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Marc
>> What I also see is that I have three OSDs that have quite a lot of OMAP data compared to other OSDs (~20 times higher). I don't know if this is an issue:
> I have 2TB SSDs with 2GB - 4GB OMAP data, while on the 8TB HDDs the OMAP data is only 53MB - 100MB.

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Anthony D'Atri
>> What I also see is that I have three OSDs that have quite a lot of OMAP data compared to other OSDs (~20 times higher). I don't know if this is an issue:
> I have 2TB SSDs with 2GB - 4GB OMAP data, while on the 8TB HDDs the OMAP data is only 53MB - 100MB.
> Should I

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Marc
> What I also see is that I have three OSDs that have quite a lot of OMAP data compared to other OSDs (~20 times higher). I don't know if this is an issue:
I have 2TB SSDs with 2GB - 4GB OMAP data, while on the 8TB HDDs the OMAP data is only 53MB - 100MB. Should I manually clean this?
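
For comparing OMAP footprints across OSDs without digging into individual daemons, the utilization report is usually sufficient; a minimal sketch:

  # per-OSD usage; the OMAP and META columns show the omap/RocksDB footprint
  ceph osd df tree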

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Igor Fedotov
Hi Boris, I wouldn't recommend taking absolute "osd bench" numbers too seriously. It's definitely not a full-scale, quality benchmark tool. The idea was just to make a brief comparison of the OSDs from c1 and c2. And for your reference - IOPS numbers I'm getting in my lab with data/DB colocated:

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Boris Behrens
Hello all, I've redeployed all OSDs in the cluster and did a blkdiscard before deploying them again. It now looks a lot better, even better than before on Octopus. I am waiting for confirmation from the dev and customer teams, as the value over all OSDs can be misleading, and we still have some
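
For reference, a rough sketch of the redeploy-with-discard approach described here; the OSD id, device path and tooling are assumptions only - adapt them to your deployment method, and make sure the cluster is healthy and backfill has finished before destroying an OSD:

  # drain the OSD and, once backfill is done, remove it
  ceph osd out 17
  ceph osd purge 17 --yes-i-really-mean-it
  # discard all blocks on the freed device so the SSD starts from a clean state
  blkdiscard /dev/sdk
  # recreate the OSD, e.g. with ceph-volume
  ceph-volume lvm create --data /dev/sdk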

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-26 Thread Marc
> Sadly we do not have the data from the time when c1 was on Nautilus. The RocksDB warning persisted through the recreation.
Hi Boris, I have been monitoring this thread a bit because I also still need to update from Nautilus and am interested in this performance degradation. I am happy to provide

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Boris Behrens
Hey Igor, sadly we do not have the data from the time when c1 was on Nautilus. The RocksDB warning persisted through the recreation. Here are the measurements. I've picked the same SSD models from both clusters to have some comparability. For the 8TB disks it's even the same chassis configuration

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Igor Fedotov
Hi Boris, first of all I'm not sure it's valid to compare two different clusters (Pacific vs. Nautilus, C1 vs. C2 respectively). The difference in perf numbers might be caused by a bunch of other factors: different H/W, user load, network, etc... I can see that you got a ~2x latency increase

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Boris Behrens
Might be. Josh also pointed in that direction. I am currently looking for ways to mitigate it. On Wed., March 22, 2023 at 10:30, Konstantin Shalygin <k0...@k0ste.ru> wrote: > Hi, > > Maybe [1]? > > [1] https://tracker.ceph.com/issues/58530 > k > > On 22 Mar 2023, at 16:20, Boris Behrens

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Konstantin Shalygin
Hi, Maybe [1]? [1] https://tracker.ceph.com/issues/58530 k > On 22 Mar 2023, at 16:20, Boris Behrens wrote: > > Are there any other ideas?

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-21 Thread Boris Behrens
Hi Igor, I've offline-compacted all the OSDs and re-enabled bluefs_buffered_io. It didn't change anything, and the commit and apply latencies are around 5-10 times higher than on our Nautilus cluster. The Pacific cluster has a 5-minute mean over all OSDs of 2.2ms, while the Nautilus cluster is
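
For reference, "offline compaction" here means compacting the OSD's RocksDB while the daemon is stopped; a hedged sketch, with the OSD id, unit name and data path as examples for a non-containerized deployment:

  systemctl stop ceph-osd@7
  # compact the embedded RocksDB through the bluestore-kv backend
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-7 compact
  systemctl start ceph-osd@7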

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-21 Thread Igor Fedotov
Hi Boris, additionally you might want to manually compact RocksDB for every OSD. Thanks, Igor On 3/21/2023 12:22 PM, Boris Behrens wrote: Disabling the write cache and bluefs_buffered_io did not change anything. What we see is that the larger disks seem to be the leaders in terms of

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-21 Thread Boris Behrens
Disabling the write cache and bluefs_buffered_io did not change anything. What we see is that the larger disks seem to be the leaders in terms of slowness (we have 70% 2TB, 20% 4TB and 10% 8TB SSDs in the cluster), but removing some of the 8TB disks and replacing them with 2TB ones (because it's by far
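
The write cache referred to here is the drives' volatile write cache; a sketch of how it is commonly toggled on SATA SSDs, with the device path as an example (whether disabling it helps or hurts depends on the drive model and firmware):

  # show, then disable, the volatile write cache
  hdparm -W /dev/sdc
  hdparm -W 0 /dev/sdc
  # alternative via smartctl
  smartctl -s wcache,off /dev/sdc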

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Mark Nelson
One thing to watch out for with bluefs_buffered_io is that disabling it can greatly impact certain rocksdb workloads. From what I remember it was a huge problem during certain iteration workloads for things like collection listing. I think the block cache was being invalidated or simply

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Boris Behrens
Hi Josh, thanks a lot for the breakdown and the links. I disabled the write cache, but it didn't change anything. Tomorrow I will try to disable bluefs_buffered_io. It doesn't sound like I can mitigate the problem with more SSDs. On Tue., Feb. 28, 2023 at 15:42, Josh Baergen <
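
A minimal sketch of toggling that option through the config database; whether the OSDs pick it up at runtime depends on the release, so restarting them afterwards is the safe assumption:

  ceph config get osd bluefs_buffered_io
  ceph config set osd bluefs_buffered_io false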

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Josh Baergen
Hi Boris, OK, what I'm wondering is whether https://tracker.ceph.com/issues/58530 is involved. There are two aspects to that ticket:
* A measurable increase in the number of bytes written to disk in Pacific as compared to Nautilus
* The same, but for IOPS
Per the current theory, both are due to
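
To check whether the extra write traffic described in that ticket shows up locally, comparing device-level write rates before and after the upgrade is a simple, if coarse, test; a sketch only:

  # watch w/s and wkB/s for the OSD data devices, refreshed every 5 seconds
  iostat -x 5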

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Boris Behrens
Hi Josh, we upgraded 15.2.17 -> 16.2.11 and we only run an RBD workload. On Tue., Feb. 28, 2023 at 15:00, Josh Baergen <jbaer...@digitalocean.com> wrote: > Hi Boris, > > Which version did you upgrade from and to, specifically? And what > workload are you running (RBD, etc.)? > > Josh > > On

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Josh Baergen
Hi Boris, Which version did you upgrade from and to, specifically? And what workload are you running (RBD, etc.)? Josh On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens wrote: > > Hi, > today I did the first update from octopus to pacific, and it looks like the > avg apply latency went up from 1ms
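
Both questions can be answered directly from the cluster; a minimal sketch:

  # versions currently running across mons, mgrs and OSDs
  ceph versions
  # pools and the applications (rbd, rgw, cephfs, ...) assigned to them
  ceph osd pool ls detail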