[ceph-users] Re: snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Anthony D'Atri
> How many OSDs do you have on one NVMe drive? > We increased 2/NVMe to 4/NVMe and it improved the snap-trimming quite a lot. Interesting. Most analyses I’ve seen report diminishing returns with more than two OSDs per device. There are definitely serialization bottlenecks in the PG and OSD code, so I’m
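
For reference, a minimal cephadm OSD service spec that places several OSDs on each NVMe could look like the sketch below; the service_id and the rotational filter are assumptions, not taken from the thread, and the file would be applied with "ceph orch apply -i <file>".

    service_type: osd
    service_id: nvme_multi_osd      # hypothetical name
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        rotational: 0               # assumed: NVMe/SSD devices only
      osds_per_device: 4            # four OSDs per NVMe, as discussed above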

[ceph-users] Re: slow operation observed for _collection_list

2021-11-10 Thread Сергей Процун
No, you cannot do online compaction. On Fri, 5 Nov 2021 at 17:22, Szabo, Istvan (Agoda) < istvan.sz...@agoda.com> wrote: > Seems like it can help, but after 1-2 days it comes back on different and > in some cases on the same OSD as well. > Is there any other way to compact online, as it
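
For context, offline compaction is typically run against a stopped OSD; a minimal sketch, assuming a non-containerized OSD with the hypothetical id 12:

    systemctl stop ceph-osd@12
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12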

[ceph-users] Re: 2 zones for a single RGW cluster

2021-11-10 Thread prosergey07
Yes. You just need to create a separate zone with radosgw-admin and the corresponding pool names for that RGW zone. Then on the radosgw host you need to put the rgw zone for which it will operate in ceph.conf. Sent from a Galaxy device. Original message From: J-P Methot
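
A rough sketch of the steps described above; the zone name, endpoint and RGW instance name are hypothetical:

    radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=zone-b \
        --endpoints=http://rgw-b.example.com:8080
    radosgw-admin period update --commit

    # ceph.conf on the radosgw host that should serve the new zone
    [client.rgw.rgw-b]
        rgw_zone = zone-b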

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Сергей Процун
rgw.meta contains the user, bucket, and bucket instance metadata. rgw.bucket.index contains the bucket indexes, aka shards. For example, if you have 32 shards you will have 32 objects in that pool: .dir.BUCKET_ID.0-31. Each holds part of your object listing. They should be using some sort of hash table
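
To see those shard objects for a bucket, something like the following should work; the bucket name and the index pool name (default zone naming) are assumptions:

    radosgw-admin bucket stats --bucket=mybucket | grep '"id"'
    rados -p default.rgw.buckets.index ls | grep .dir.BUCKET_ID
    rados -p default.rgw.buckets.index listomapkeys .dir.BUCKET_ID.0 | head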

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Anthony D'Atri
> > Oh. > How would one recover from that? Sounds like it basically makes no difference > if 2, 5 or 10 OSDs are in the blast radius. Perhaps. But a larger blast radius means that you lose a larger percentage of your cluster, assuming that you have a CRUSH failure domain of no smaller

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Boris
Oh. How would one recover from that? Sounds like it basically makes no difference if 2, 5 or 10 OSDs are in the blast radius. Can the omap key/values be regenerated? I always thought this data would be stored in the rgw pools. Or am I mixing things up and the bluestore metadata got omap

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Сергей Процун
No, you cannot do that, because the RocksDB for omap key/values and the WAL would be gone, meaning all xattrs and omap data will be gone too. Hence the OSD will become non-operational. But if you notice that the SSD starts throwing errors, you can start migrating the bluefs device to a new partition: ceph-bluestore-tool
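
The migration mentioned above is presumably along these lines; the OSD id and target LV are hypothetical, and the OSD has to be stopped first:

    systemctl stop ceph-osd@7
    ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/osd/ceph-7 \
        --devs-source /var/lib/ceph/osd/ceph-7/block.db \
        --dev-target /dev/vg-nvme-new/db-osd-7
    systemctl start ceph-osd@7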

[ceph-users] [Pacific] OSD Spec problem?

2021-11-10 Thread [AR] Guillaume CephML
Hello, I got something strange on a Pacific (16.2.6) cluster. I have added 8 new empty spinning disks to this running cluster, which is configured with: # ceph orch ls osd --export service_type: osd service_id: ar_osd_hdd_spec service_name: osd.ar_osd_hdd_spec placement: host_pattern: '*' spec:
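
When new disks are not picked up by a spec like this, a reasonable first step is to check what cephadm actually sees; a minimal sketch:

    ceph orch device ls                # are the new disks listed and marked available?
    ceph orch ps --daemon-type osd     # which OSD daemons were actually deployed?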

[ceph-users] Re: slow operation observed for _collection_list

2021-11-10 Thread Boris Behrens
Did someone figure this out? We are currently facing the same issue, but the OSDs more often kill themselves and need to be restarted by us. This happens to OSDs that have an SSD-backed block.db and to OSDs that have the block.db on the bluestore device. All OSDs are rotating disks of various sizes. We've

[ceph-users] Re: snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Arthur Outhenin-Chalandre
Hi, On 11/10/21 16:14, Christoph Adomeit wrote: But the cluster seemed to slowly "eat" storage space. So yesterday I decided to add 3 more NVMes, 1 for each node. The second I added the first NVMe as a Ceph OSD, the cluster was crashing. I had high loads on all OSDs and all the OSDs were

[ceph-users] Re: allocate_bluefs_freespace failed to allocate

2021-11-10 Thread mhnx
Hello Igor. Thanks for the answer. There are so many changes to read and test for me, but I will plan an upgrade to Octopus when I'm available. Is there any problem upgrading from 14.2.16 ---> 15.2.15? On Wed, 10 Nov 2021 at 17:50, Igor Fedotov wrote: > I would encourage you to

[ceph-users] Re: snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Christoph Adomeit
Thanks Stefan, I played with bluefs_buffered_io but I think the impact is not great since the NVMes are so fast. I think buffered IO on increases CPU load, while buffered IO off increases NVMe load. The problem occurred with both settings. I am not sure if require-osd-release was run. What do you think
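
For reference, both settings mentioned here can be checked at runtime; a short sketch, with osd.0 as an arbitrary example:

    ceph config get osd bluefs_buffered_io
    ceph config show osd.0 bluefs_buffered_io
    ceph osd dump | grep require_osd_release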

[ceph-users] snaptrim blocks io on ceph pacific even on fast NVMEs

2021-11-10 Thread Christoph Adomeit
I upgraded my Ceph cluster to Pacific in August and updated to Pacific 16.2.6 in September without problems. I had no performance issues at all; the cluster has 3 nodes with 64 cores each, 15 blazing fast Samsung PM1733 NVMe OSDs, a 25 Gbit/s network and around 100 VMs. The cluster was really

[ceph-users] Re: allocate_bluefs_freespace failed to allocate

2021-11-10 Thread Igor Fedotov
I would encourage you to upgrade to at least the latest Nautilus (and preferably to Octopus). There have been a bunch of allocator bugs fixed since 14.2.16. I'm not even sure all of them landed in N since it's EOL. A couple of examples (both are present in the latest Nautilus):

[ceph-users] Re: LVM support in Ceph Pacific

2021-11-10 Thread Janne Johansson
On Wed, 10 Nov 2021 at 11:27, MERZOUKI, HAMID wrote: > > Hello everybody, > > I have a misunderstanding about the LVM configuration for OSD devices: > > In the Pacific documentation cephadm/install (and it was already written in the > Octopus documentation in the section DEPLOY OSDS), it is written: > “The

[ceph-users] ceph-data-scan: Watching progress and choosing the number of threads

2021-11-10 Thread Anderson, Erik
I deleted a filesystem that should not have been deleted on a seven-node 1.2 PB cluster running Octopus. After looking through various docs and threads, I am running ‘ceph-data-scan’ to try and rebuild the metadata from the data pool. The example for ceph-data-scan in the documentation uses four
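
For what it's worth, the documented way to parallelize the scan (assuming the tool meant is cephfs-data-scan) is to run one worker per thread or host with --worker_n/--worker_m; a sketch assuming 4 workers and a data pool named cephfs_data:

    # run these four in parallel (separate shells or hosts)
    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data
    # repeat the same pattern for scan_inodes before running scan_links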

[ceph-users] Re: How to enable RDMA

2021-11-10 Thread David Majchrzak, Oderland Webbhotell AB
I think the latest docs on Ceph RDMA "support" are based on Luminous. I'd be careful using RDMA on later versions of Ceph if you're running a production cluster. Kind Regards, David Majchrzak CTO Oderland Webbhotell AB On 10 Nov 2021 at 11:47, "Mason-Williams, Gabryel (RFI,RAL,-)"

[ceph-users] Re: How to enable RDMA

2021-11-10 Thread Mason-Williams, Gabryel (RFI,RAL,-)
Hi GHui, You might find this document useful: https://support.mellanox.com/s/article/bring-up-ceph-rdma---developer-s-guide Also, I previously asked this question and there was some useful information in the thread:
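
For context, the messenger options involved are roughly the ones below; the device name is hypothetical, and as noted elsewhere in this thread, RDMA on recent releases should be treated with care:

    [global]
    ms_cluster_type = async+rdma            # or ms_type to switch all messengers
    ms_async_rdma_device_name = mlx5_0      # device name as reported by ibv_devices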

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Manuel Lausch
This is the patch I made. I think this is the wrong place to do it, but at first it worked. diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc index 9fb22e0f9ee..69341840153 100644 --- a/src/osd/PrimaryLogPG.cc +++ b/src/osd/PrimaryLogPG.cc @@ -798,6 +798,10 @@ void

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Peter Lieven
On 10.11.21 at 11:35, Manuel Lausch wrote: > Oh shit, > > I patched in a switch to deactivate the read_lease feature. This is only a > hack to test a bit around. But accidentally I had this switch enabled for my > last tests done here in this mail thread. > > The bad news: the

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Manuel Lausch
Oh shit, I patched in a switch to deactivate the read_lease feature. This is only a hack to test a bit around. But accidentally I had this switch enabled for my last tests done here in this mail thread. The bad news: require_osd_release doesn't fix the slow-op problem, only the

[ceph-users] Re: large bucket index in multisite environement (how to deal with large omap objects warning)?

2021-11-10 Thread Boris Behrens
I am just creating a bucket with a lot of files to test it. Who would have thought that uploading a million 1k files would take days? On Tue, 9 Nov 2021 at 00:50, prosergey07 wrote: > When resharding is performed I believe it's considered a bucket operation > and goes through

[ceph-users] Re: Question if WAL/block.db partition will benefit us

2021-11-10 Thread Boris Behrens
Hi, we use enterprise SSDs like the Samsung MZ7KM1T9. They work very well for our block storage. Some NVMes would be a lot nicer, but we have had good experience with these. One SSD failure taking down 10 OSDs might sound harsh, but this would be an okay-ish risk. Most of the tunables are default in our setup

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Peter Lieven
On 10.11.21 at 09:57, Manuel Lausch wrote: > Hi Sage, > > > thank you for your help. > > > My original issue with slow ops on OSD restarts is gone too, even with default > values for paxos_proposal_interval. > > > It's a bit annoying that I spent many hours debugging this and finally I > missed

[ceph-users] Re: steady increasing of osd map epoch since octopus

2021-11-10 Thread Manuel Lausch
We found the reason: after the upgrade from Nautilus we forgot to set ceph osd require-osd-release pacific. Now all is fine. Thanks, Manuel From: Manuel Lausch Sent: Monday, 8 November 2021 14:37 To: Dan van der Ster Cc: Ceph Users Subject:

[ceph-users] Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

2021-11-10 Thread Manuel Lausch
Hi Sage, thank you for your help. My original issue with slow ops on OSD restarts is gone too, even with default values for paxos_proposal_interval. It's a bit annoying that I spent many hours debugging this and finally missed only one step in the upgrade. Only during the update itself,

[ceph-users] Re: allocate_bluefs_freespace failed to allocate

2021-11-10 Thread mhnx
Yes, I don't have a separate DB/WAL. These SSDs are only used by the RGW index. The command "--command bluefs-bdev-sizes" does not work while the OSD is up and running. I need a new OSD failure to get useful output; I will check when I get one. I picked an OSD from my test environment to check the command
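
For reference, that command only works against a stopped OSD, roughly like this (the OSD id is just an example):

    systemctl stop ceph-osd@42
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-42 --command bluefs-bdev-sizes
    systemctl start ceph-osd@42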