[ceph-users] RadosGW public HA traffic - best practices?

2023-11-17 Thread Boris Behrens
Hi, I am looking for some experience on how people make their RGW public. Currently we use the following: 3 IP addresses that get distributed via keepalived between three HAproxy instances, which then balance to three RGWs. The caveat is that keepalived is a PITA to get working in distributing a set
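
A minimal keepalived sketch for one of the three floating VIPs (interface name, router id and addresses are placeholders, not from the original post):

    vrrp_instance RGW_VIP1 {
        state BACKUP                # all nodes start as BACKUP; highest priority takes the VIP
        interface eth0              # placeholder NIC
        virtual_router_id 51
        priority 100                # use a different priority per node
        advert_int 1
        virtual_ipaddress {
            192.0.2.11/24           # one of the three public VIPs
        }
    }

With three such blocks (distinct virtual_router_id and per-node priorities) the VIPs spread over the nodes, and each HAproxy behind its VIP balances to the three RGWs.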

[ceph-users] Re: Emergency, I lost 4 monitors but all osd disk are safe

2023-11-02 Thread Boris Behrens
Thanks for your help, I'm very stuck because the data is present but I > don't know how to add the old osd in the cluster to recover the data. > > > > Le jeu. 2 nov. 2023 à 11:55, Boris Behrens a écrit : > >> Hi Mohamed, >> are all mons down, or do you still have at leas

[ceph-users] Re: Emergency, I lost 4 monitors but all osd disk are safe

2023-11-02 Thread Boris Behrens
Hi Mohamed, are all mons down, or do you still have at least one that is running? AFAIK: the mons save their DB on the normal OS disks, and not within the ceph cluster. So if all mons are dead, which means the disks which contained the mon data are unrecoverably dead, you might need to bootstrap a
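
If the mon stores really are gone, a rough sketch of the documented "recovery using OSDs" path (paths are placeholders; read the disaster-recovery docs before running anything):

    # collect cluster map info from every OSD into a temporary mon store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
        --no-mon-config --op update-mon-db --mon-store-path /tmp/mon-store
    # repeat for all OSDs, then rebuild a usable mon store from it
    ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring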

[ceph-users] Re: RGW access logs with bucket name

2023-10-30 Thread Boris Behrens
og bucket name at level 1. > > Cheers, Dan > > -- > Dan van der Ster > CTO > > Clyso GmbH > p: +49 89 215252722 | a: Vancouver, Canada > w: https://clyso.com | e: dan.vanders...@clyso.com > > Try our Ceph Analyzer: https://analyzer.clyso.com > > On Thu, Mar 30,

[ceph-users] traffic by IP address / bucket / user

2023-10-18 Thread Boris Behrens
Hi, does someone have a solution ready to monitor traffic by IP address? Cheers Boris

[ceph-users] Re: Autoscaler problems in pacific

2023-10-05 Thread Boris Behrens
ug > reports to improve it. > > Zitat von Boris Behrens : > > > Hi, > > I've just upgraded to our object storages to the latest pacific version > > (16.2.14) and the autscaler is acting weird. > > On one cluster it just shows nothing: > > ~# ceph osd pool autoscal

[ceph-users] Re: Autoscaler problems in pacific

2023-10-04 Thread Boris Behrens
Also found what the 2nd problem was: When there are pools using the default replicated_ruleset while there are multiple rulesets with different device classes, the autoscaler does not produce any output. Should I open a bug for that? Am Mi., 4. Okt. 2023 um 14:36 Uhr schrieb Boris Behrens
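
A sketch of what usually unblocks the autoscaler in that situation: give each pool a device-class specific rule instead of the generic replicated rule (rule and pool names below are placeholders):

    ceph osd crush rule create-replicated replicated-ssd default host ssd
    ceph osd crush rule create-replicated replicated-hdd default host hdd
    ceph osd pool set mypool crush_rule replicated-ssd
    ceph osd pool autoscale-status    # should produce output again once no pool spans both classes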

[ceph-users] Re: Autoscaler problems in pacific

2023-10-04 Thread Boris Behrens
Found the bug for the TOO_MANY_PGS: https://tracker.ceph.com/issues/62986 But I am still not sure, why I don't have any output on that one cluster. Am Mi., 4. Okt. 2023 um 14:08 Uhr schrieb Boris Behrens : > Hi, > I've just upgraded to our object storages to the latest pacific version >

[ceph-users] Autoscaler problems in pacific

2023-10-04 Thread Boris Behrens
Hi, I've just upgraded our object storages to the latest pacific version (16.2.14) and the autoscaler is acting weird. On one cluster it just shows nothing: ~# ceph osd pool autoscale-status ~# On the other clusters it shows this when it is set to warn: ~# ceph health detail ... [WRN]

[ceph-users] multiple rgw instances with same cephx key

2023-09-22 Thread Boris Behrens
Hi, is it possible to use one cephx key for multiple parallel running RGW? Maybe I could just use the same 'name' and the same key for all of the RGW instances? I plan to start RGWs all over the place in container and let BGP handle the traffic. But I don't know how to create on demand keys, that
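
A possible sketch (key name and caps are assumptions, not from the thread): create one rgw key and start every instance with the same identity:

    ceph auth get-or-create client.rgw.anycast mon 'allow rw' osd 'allow rwx' \
        -o /etc/ceph/ceph.client.rgw.anycast.keyring
    # each container runs with the same name/key; BGP/anycast decides which one gets the traffic
    radosgw -n client.rgw.anycast --rgw-frontends 'beast port=7480' -f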

[ceph-users] ceph orch osd data_allocate_fraction does not work

2023-09-21 Thread Boris Behrens
I have a use case where I want to only use a small portion of the disk for the OSD and the documentation states that I can use data_allocation_fraction [1] But cephadm can not use this and throws this error: /usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized arguments:

[ceph-users] Re: Make ceph orch daemons reboot safe

2023-09-18 Thread Boris Behrens
> > > > I don’t have time to look into all the details, but I’m wondering how > you seem to be able to start mgr services with the orchestrator if all mgr > daemons are down. The orchestrator is a mgr module, so that’s a bit weird, > isn’t it? > > > > Zitat von Boris B

[ceph-users] Re: Make ceph orch daemons reboot safe

2023-09-16 Thread Boris Behrens
a node where I had to "play around" a bit with removed and > redeployed osd containers. At some point they didn't react to > systemctl commands anymore, but a reboot fixed that. But I haven't > seen that in a production cluster yet, so some more details would be > useful.

[ceph-users] Make ceph orch daemons reboot safe

2023-09-15 Thread Boris Behrens
Hi, is there a way to have the pods start again after reboot? Currently I need to start them by hand via ceph orch start mon/mgr/osd/... I imagine this will lead to a lot of headache when the ceph cluster gets a powercycle and the mon pods will not start automatically. I've spun up a test
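
For cephadm-managed daemons the per-FSID systemd units are what should bring everything back after a reboot; a hedged way to check and enable them (FSID and daemon id are placeholders):

    systemctl list-units 'ceph-*' --all
    systemctl enable ceph.target
    systemctl enable ceph-<FSID>.target
    systemctl enable ceph-<FSID>@osd.3.service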

[ceph-users] Re: ceph orchestator managed daemons do not use authentication (was: ceph orchestrator pulls strange images from docker.io)

2023-09-15 Thread Boris Behrens
none * global advanced auth_service_required none Am Fr., 15. Sept. 2023 um 13:01 Uhr schrieb Boris Behrens : > Oh, we found the issue. A very old update was stuck in the pipeline. We > canceled it and then the correct images got

[ceph-users] Re: ceph orchestator managed daemons do not use authentication (was: ceph orchestrator pulls strange images from docker.io)

2023-09-15 Thread Boris Behrens
.0cc47a6df330@-1(probing) e0 handle_auth_bad_method hmm, they didn't like 2 result (95) Operation not supported I added the mon via: ceph orch daemon add mon FQDN:[IPv6_address] Am Fr., 15. Sept. 2023 um 09:21 Uhr schrieb Boris Behrens : > Hi Stefan, > > the cluster is running 17.6.

[ceph-users] Re: ceph orchestator pulls strange images from docker.io

2023-09-15 Thread Boris Behrens
alling the hosts, but as I have to adopt 17 clusters to the orchestrator, I rather get some learnings from the not working thing :) Am Fr., 15. Sept. 2023 um 08:26 Uhr schrieb Stefan Kooman : > On 14-09-2023 17:49, Boris Behrens wrote: > > Hi, > > I currently try to adopt our

[ceph-users] ceph orchestator pulls strange images from docker.io

2023-09-14 Thread Boris Behrens
Hi, I currently try to adopt our stage cluster, but some hosts just pull strange images. root@0cc47a6df330:/var/lib/containers/storage/overlay-images# podman ps CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES

[ceph-users] Re: [quincy] Migrating ceph cluster to new network, bind OSDs to multple public_nework

2023-08-23 Thread Boris Behrens
ility of both: old and new network, until end of migration > > k > Sent from my iPhone > > > On 22 Aug 2023, at 10:43, Boris Behrens wrote: > > > > The OSDs are still only bound to one IP address. > > -- Die Selbsthilfegruppe "UTF-

[ceph-users] Re: [quincy] Migrating ceph cluster to new network, bind OSDs to multple public_nework

2023-08-22 Thread Boris Behrens
IP, > I'm not aware of a way to have them bind to multiple public IPs like > the MONs can. You'll probably need to route the compute node traffic > towards the new network. Please correct me if I misunderstood your > response. > > Zitat von Boris Behrens : > > > The OSDs ar

[ceph-users] Re: [quincy] Migrating ceph cluster to new network, bind OSDs to multple public_nework

2023-08-22 Thread Boris Behrens
o have both old and new network in there, but I'd try on one > host first and see if it works. > > Zitat von Boris Behrens : > > > We're working on the migration to cephadm, but it requires some > > prerequisites that still needs planing. > > > > root@host:~#

[ceph-users] Re: [quincy] Migrating ceph cluster to new network, bind OSDs to multple public_nework

2023-08-21 Thread Boris Behrens
via cephadm / > > orchestrator. > > I just assumed that with Quincy it already would be managed by > cephadm. So what does the ceph.conf currently look like on an OSD host > (mask sensitive data)? > > Zitat von Boris Behrens : > > > Hey Eugen, > > I don't ha

[ceph-users] Re: [quincy] Migrating ceph cluster to new network, bind OSDs to multple public_nework

2023-08-21 Thread Boris Behrens
tps://www.spinics.net/lists/ceph-users/msg75162.html > [2] > > https://docs.ceph.com/en/quincy/cephadm/services/mon/#moving-monitors-to-a-different-network > > Zitat von Boris Behrens : > > > Hi, > > I need to migrate a storage cluster to a new network. > > > > I adde

[ceph-users] [quincy] Migrating ceph cluster to new network, bind OSDs to multple public_nework

2023-08-21 Thread Boris Behrens
Hi, I need to migrate a storage cluster to a new network. I added the new network to the ceph config via: ceph config set global public_network "old_network/64, new_network/64" I've added a set of new mon daemons with IP addresses in the new network and they are added to the quorum and seem to
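
A hedged sequence for getting OSDs onto the new network once the config contains both ranges (addresses and OSD id are placeholders; OSDs only re-bind on restart):

    ceph config set global public_network "2001:db8:1::/64,2001:db8:2::/64"
    systemctl restart ceph-osd@0        # or: ceph orch daemon restart osd.0
    ceph osd find 0                     # check which address the OSD registered with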

[ceph-users] Re: Upgrading nautilus / centos7 to octopus / ubuntu 20.04. - Suggestions and hints?

2023-08-01 Thread Boris Behrens
Hi Goetz, I've done the same, and went to Octopus and to Ubuntu. It worked like a charm and with pip, you can get the pecan library working. I think I did it with this: yum -y install python36-six.noarch python36-PyYAML.x86_64 pip3 install pecan werkzeug cherrypy Worked very well, until we got

[ceph-users] Re: radosgw new zonegroup hammers master with metadata sync

2023-07-04 Thread Boris Behrens
Are there any ideas how to work with this? We disabled the logging so we do not run out of disk space, but the rgw daemon still requires A LOT of cpu because of this. Am Mi., 21. Juni 2023 um 10:45 Uhr schrieb Boris Behrens : > I've updated the dc3 site from octopus to pacific and the prob

[ceph-users] Re: list of rgw instances in ceph status

2023-07-03 Thread Boris Behrens
gt; > The following command extract all their ids > > ceph service dump -f json-pretty | jq '.services.rgw.daemons' | egrep -e >> 'gid' -e '\"id\"' >> > > Best Regards, > Mahnoosh > > On Mon, Jul 3, 2023 at 3:00 PM Boris Behrens wrote: > >

[ceph-users] list of rgw instances in ceph status

2023-07-03 Thread Boris Behrens
Hi, might be a dumb question, but is there a way to list the rgw instances that are running in a ceph cluster? Before pacific it showed up in `ceph status` but now it only tells me how many daemons are active, not which daemons are active. ceph orch ls tells me that I need to configure a backend

[ceph-users] Re: device class for nvme disk is ssd

2023-06-29 Thread Boris Behrens
So basically it does not matter unless I want to have that split up. Thanks for all the answers. I am still lobbying to phase out SATA SSDs and replace them with NVME disks. :) Am Mi., 28. Juni 2023 um 18:14 Uhr schrieb Anthony D'Atri < a...@dreamsnake.net>: > Even when you factor in density,

[ceph-users] device class for nvme disk is ssd

2023-06-28 Thread Boris Behrens
Hi, is it a problem that the device class for all my disks is SSD even all of these disks are NVME disks? If it is just a classification for ceph, so I can have pools on SSDs and NVMEs separated I don't care. But maybe ceph handles NVME disks differently internally? I've added them via

[ceph-users] Re: radosgw new zonegroup hammers master with metadata sync

2023-06-21 Thread Boris Behrens
I've updated the dc3 site from octopus to pacific and the problem is still there. I find it very weird that it only happens from one single zonegroup to the master and not from the other two. Am Mi., 21. Juni 2023 um 01:59 Uhr schrieb Boris Behrens : > I recreated the site and the problem st

[ceph-users] Re: radosgw new zonegroup hammers master with metadata sync

2023-06-20 Thread Boris Behrens
I currently think I made a > mistake in the process. > > Mit freundlichen Grüßen > - Boris Behrens > > > Am 20.06.2023 um 18:30 schrieb Casey Bodley : > > > > hi Boris, > > > > we've been investigating reports of excessive polling from metadata &

[ceph-users] radosgw new zonegroup hammers master with metadata sync

2023-06-20 Thread Boris Behrens
Hi, yesterday I added a new zonegroup and it seems to cycle over the same requests over and over again. In the log of the main zone I see these requests: 2023-06-20T09:48:37.979+ 7f8941fb3700 1 beast: 0x7f8a602f3700: fd00:2380:0:24::136 - - [2023-06-20T09:48:37.979941+]

[ceph-users] Re: Bucket empty after resharding on multisite environment

2023-04-27 Thread Boris Behrens
E:OLD_BUCKET_ID < bucket.instance:BUCKET_NAME:NEW_BUCKET_ID.json Am Do., 27. Apr. 2023 um 13:32 Uhr schrieb Boris Behrens : > To clarify a bit: > The bucket data is not in the main zonegroup. > I wanted to start the reshard in the zonegroup where the bucket and the > data is located, but rgw told me to

[ceph-users] Re: Bucket empty after resharding on multisite environment

2023-04-27 Thread Boris Behrens
les Am Do., 27. Apr. 2023 um 13:08 Uhr schrieb Boris Behrens : > Hi, > I just resharded a bucket on an octopus multisite environment from 11 to > 101. > > I did it on the master zone and it went through very fast. > But now the index is empty. > > The files are still there

[ceph-users] Bucket empty after resharding on multisite environment

2023-04-27 Thread Boris Behrens
Hi, I just resharded a bucket on an octopus multisite environment from 11 to 101. I did it on the master zone and it went through very fast. But now the index is empty. The files are still there when doing a radosgw-admin bucket radoslist --bucket-id Do I just need to wait or do I need to

[ceph-users] Re: How to find the bucket name from Radosgw log?

2023-04-27 Thread Boris Behrens
Cheers Dan, would it be an option to enable the ops log? I still didn't figure out how it is actually working. But I am also thinking to move to the logparsing in HAproxy and disable the access log on the RGW instances. Am Mi., 26. Apr. 2023 um 18:21 Uhr schrieb Dan van der Ster <

[ceph-users] Re: Veeam backups to radosgw seem to be very slow

2023-04-27 Thread Boris Behrens
Thanks Janne, I will hand that to the customer. > Look at https://community.veeam.com/blogs-and-podcasts-57/sobr-veeam > -capacity-tier-calculations-and-considerations-in-v11-2548 > for "extra large blocks" to make them 8M at least. > We had one Veeam installation vomit millions of files onto our

[ceph-users] Veeam backups to radosgw seem to be very slow

2023-04-25 Thread Boris Behrens
We have a customer that tries to use veeam with our rgw objectstorage and it seems to be blazingly slow. What also seems strange: veeam sometimes shows "bucket does not exist" or "permission denied". I've tested in parallel and everything seems to work fine from the s3cmd/aws cli

[ceph-users] Re: radosgw-admin bucket stats doesn't show real num_objects and size

2023-04-11 Thread Boris Behrens
I don't think you can exclude that. We've built a notification in the customer panel that shows there are incomplete multipart uploads, which will be billed as used space. We also added a button to create a LC policy for these objects. Am Di., 11. Apr. 2023 um 19:07 Uhr schrieb : > The
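
A sketch of the lifecycle rule behind such a button, here pushed with the aws CLI against a placeholder endpoint and bucket:

    cat > abort-mpu.json <<'EOF'
    {
      "Rules": [
        {
          "ID": "abort-incomplete-mpu",
          "Status": "Enabled",
          "Filter": {"Prefix": ""},
          "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
        }
      ]
    }
    EOF
    aws --endpoint-url https://s3.example.com s3api put-bucket-lifecycle-configuration \
        --bucket customer-bucket --lifecycle-configuration file://abort-mpu.json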

[ceph-users] Re: RGW can't create bucket

2023-03-31 Thread Boris Behrens
x_buckets": 1000, and those users have the same access_denied issue > when creating a bucket. > > We also tried other bucket names and it is the same issue. > > On Thu, Mar 30, 2023 at 6:28 PM Boris Behrens wrote: > >> Hi Kamil, >> is this with all new buckets o

[ceph-users] Re: OSD down cause all OSD slow ops

2023-03-30 Thread Boris Behrens
Hi, you might suffer from the same bug we suffered: https://tracker.ceph.com/issues/53729 https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/KG35GRTN4ZIDWPLJZ5OQOKERUIQT5WQ6/#K45MJ63J37IN2HNAQXVOOT3J6NTXIHCA Basically there is a bug that prevents the removal of PGlog items. You need

[ceph-users] Re: Eccessive occupation of small OSDs

2023-03-30 Thread Boris Behrens
Hi Nicola, can you send the output of `ceph osd df tree` and `ceph df`? Cheers Boris Am Do., 30. März 2023 um 16:36 Uhr schrieb Nicola Mori : > Dear Ceph users, > > my cluster is made up of 10 old machines, with an uneven number of disks and > disk sizes. Essentially I have just one big data pool (6+2

[ceph-users] Re: RGW can't create bucket

2023-03-30 Thread Boris Behrens
Hi Kamil, is this with all new buckets or only the 'test' bucket? Maybe the name is already taken? Can you check with s3cmd --debug if you are connecting to the correct endpoint? Also I see that the user seems to not be allowed to create buckets ... "max_buckets": 0, ... Cheers Boris Am Do., 30.
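
If max_buckets really is the blocker, a hedged fix (uid is a placeholder):

    radosgw-admin user modify --uid=kamil --max-buckets=1000
    radosgw-admin user info --uid=kamil | grep max_buckets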

[ceph-users] Re: RGW access logs with bucket name

2023-03-30 Thread Boris Behrens
frastructure Engineer > --- > Agoda Services Co., Ltd. > e: istvan.sz...@agoda.com > --- > > On 2023. Mar 30., at 17:44, Boris Behrens wrote: > > Email received from the internet. If in doubt, d

[ceph-users] Re: RGW access logs with bucket name

2023-03-30 Thread Boris Behrens
Bringing up that topic again: is it possible to log the bucket name in the rgw client logs? Currently I only get to know the bucket name when someone accesses the bucket via https://TLD/bucket/object instead of https://bucket.TLD/object. Am Di., 3. Jan. 2023 um 10:25 Uhr schrieb Boris Behrens

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-30 Thread Boris Behrens
. After idling over night it is back up to 120 IOPS Am Do., 30. März 2023 um 09:45 Uhr schrieb Boris Behrens : > After some digging in the nautilus cluster I see that the disks with the > exceptional high IOPS performance are actually SAS attached NVME disks > (these

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-30 Thread Boris Behrens
(4h resolution) goes up again (2023-03-01 upgrade to pacific, the dip around 25th was the redeploy and now it seems to go up again) [image: image.png] Am Mo., 27. März 2023 um 17:24 Uhr schrieb Igor Fedotov < igor.fedo...@croit.io>: > > On 3/27/2023 12:19 PM, Boris Behrens wrote:

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Boris Behrens
Hey Igor, we are currently using these disks - all SATA attached (is it normal to have some OSDs without wear counter?): # ceph device ls | awk '{print $1}' | cut -f 1,2 -d _ | sort | uniq -c 18 SAMSUNG_MZ7KH3T8 (4TB) 126 SAMSUNG_MZ7KM1T9 (2TB) 24 SAMSUNG_MZ7L37T6 (8TB) 1

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-27 Thread Boris Behrens
? @marc If I interpret the linked bug correctly, you might want to have the metadata on an SSD, because the write amplification might hit very hard on HDDs. But maybe someone else from the mailing list can say more about it. Cheers Boris Am Mi., 22. März 2023 um 22:45 Uhr schrieb Boris Behrens : >

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Boris Behrens
warning - I presume this might be caused by newer RocksDB > version running on top of DB with a legacy format.. Perhaps redeployment > would fix that... > > > Thanks, > > Igor > On 3/21/2023 5:31 PM, Boris Behrens wrote: > > Hi Igor, > i've offline compacted all t

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-22 Thread Boris Behrens
Might be. Josh also pointed in that direction. I currently search for ways to mitigate it. Am Mi., 22. März 2023 um 10:30 Uhr schrieb Konstantin Shalygin < k0...@k0ste.ru>: > Hi, > > > Maybe [1] ? > > > [1] https://tracker.ceph.com/issues/58530 > k > > On

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-21 Thread Boris Behrens
>=5. Am Di., 21. März 2023 um 10:46 Uhr schrieb Igor Fedotov < igor.fedo...@croit.io>: > Hi Boris, > > additionally you might want to manually compact RocksDB for every OSD. > > > Thanks, > > Igor > On 3/21/2023 12:22 PM, Boris Behrens
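
The online and offline variants of that compaction as a sketch (OSD ids and paths are examples; the offline variant needs the OSD stopped):

    ceph tell osd.* compact
    # offline, one OSD at a time:
    systemctl stop ceph-osd@12
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12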

[ceph-users] Re: Changing os to ubuntu from centos 8

2023-03-21 Thread Boris Behrens
Hi Istvan, I am currently making the move from centos7 to ubuntu18.04 (we want to jump directly from nautilus to pacific). Once everything in the cluster is on the same version, and that version is available on the new OS, you can just reinstall the hosts with the new OS. With the mons, I remove the

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-03-21 Thread Boris Behrens
? Cheers Boris Am Di., 28. Feb. 2023 um 22:46 Uhr schrieb Boris Behrens : > Hi Josh, > thanks a lot for the breakdown and the links. > I disabled the write cache but it didn't change anything. Tomorrow I will > try to disable bluefs_buffered_io. > > It doesn't sound that I can mi

[ceph-users] Re: radosgw SSE-C is not working (InvalidRequest)

2023-03-17 Thread Boris Behrens
Ha, found the error and now I feel just a tiny bit stupid: haproxy did not add the X-Forwarded-Proto header. Am Fr., 17. März 2023 um 12:03 Uhr schrieb Boris Behrens : > Hi, > I try to evaluate SSE-C (so customer provides keys) for our object > storages. > We do not provide
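
The fix on the haproxy side is roughly this (frontend name, cert path and backend name are placeholders):

    frontend rgw_https
        bind :443 ssl crt /etc/haproxy/certs/rgw.pem
        http-request set-header X-Forwarded-Proto https if { ssl_fc }
        http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
        default_backend rgw_backend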

[ceph-users] radosgw SSE-C is not working (InvalidRequest)

2023-03-17 Thread Boris Behrens
Hi, I try to evaluate SSE-C (so customer provides keys) for our object storages. We do not provide a KMS server. I've added "Access-Control-Allow-Headers" to the haproxy frontend. rspadd Access-Control-Allow-Headers... x-amz-server-side-encryption-customer-algorithm,\

[ceph-users] Re: Concerns about swap in ceph nodes

2023-03-16 Thread Boris Behrens
Maybe worth to mention, because it caught me by surprise: Ubuntu creates a swap file (/swap.img) if you do not specify a swap partition (check /etc/fstab). Cheers Boris Am Mi., 15. März 2023 um 22:11 Uhr schrieb Anthony D'Atri < a...@dreamsnake.net>: > > With CentOS/Rocky 7-8 I’ve observed

[ceph-users] radosgw - octopus - 500 Bad file descriptor on upload

2023-03-09 Thread Boris Behrens
Hi, we've observed HTTP 500 errors when uploading files to a single bucket, but the problem went away after around 2 hours. We've checked and saw the following error message: 2023-03-08T17:55:58.778+ 7f8062f15700 0 WARNING: set_req_state_err err_no=125 resorting to 500 2023-03-08T17:55:58.778+

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Boris Behrens
mething else to consider is > > https://docs.ceph.com/en/quincy/start/hardware-recommendations/#write-caches > , > as sometimes disabling these write caches can improve the IOPS > performance of SSDs. > > Josh > > On Tue, Feb 28, 2023 at 7:19 AM Boris Behrens wrote: >

[ceph-users] Re: avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Boris Behrens
(RBD, etc.)? > > Josh > > On Tue, Feb 28, 2023 at 6:51 AM Boris Behrens wrote: > > > > Hi, > > today I did the first update from octopus to pacific, and it looks like > the > > avg apply latency went up from 1ms to 2ms. > > > > All 36 OSDs are 4TB SSDs

[ceph-users] avg apply latency went up after update from octopus to pacific

2023-02-28 Thread Boris Behrens
Hi, today I did the first update from octopus to pacific, and it looks like the avg apply latency went up from 1ms to 2ms. All 36 OSDs are 4TB SSDs and nothing else changed. Does someone know if this is an issue, or am I just missing a config value? Cheers Boris

[ceph-users] Re: growing osd_pglog_items (was: increasing PGs OOM kill SSD OSDs (octopus) - unstable OSD behavior)

2023-02-23 Thread Boris Behrens
. Is there anything I can do with an octopus cluster, or is the only way to upgrade? And why does it happen? Am Di., 21. Feb. 2023 um 18:31 Uhr schrieb Boris Behrens : > Thanks a lot Josh. That really seems like my problem. > That does not look healthy in the cluster. oof. > ~# ceph tell osd.* perf d

[ceph-users] Re: increasing PGs OOM kill SSD OSDs (octopus) - unstable OSD behavior

2023-02-21 Thread Boris Behrens
"osd_pglog_bytes": 541849048, "osd_pglog_items": 3880437, ... Am Di., 21. Feb. 2023 um 18:21 Uhr schrieb Josh Baergen < jbaer...@digitalocean.com>: > Hi Boris, > > This sounds a bit like https://tracker.ceph.com/issues/53729. > https://tracker.c

[ceph-users] increasing PGs OOM kill SSD OSDs (octopus) - unstable OSD behavior

2023-02-21 Thread Boris Behrens
Hi, today I wanted to increase the PGs from 2k -> 4k and random OSDs went offline in the cluster. After some investigation we saw, that the OSDs got OOM killed (I've seen a host that went from 90GB used memory to 190GB before OOM kills happen). We have around 24 SSD OSDs per host and

[ceph-users] Re: Very slow snaptrim operations blocking client I/O

2023-02-20 Thread Boris Behrens
Hi, we've encountered the same issue after upgrading to octopus on one of our rbd clusters, and now it reappears after the autoscaler lowered the PGs from 8k to 2k for the RBD pool. What we've done in the past: - recreate all OSDs after our 2nd incident with slow OPS in a single week after the ceph

[ceph-users] Re: [RGW - octopus] too many omapkeys on versioned bucket

2023-02-13 Thread Boris Behrens
I've tried it the other way around and let cat give out all escaped chars and then did the grep: # cat -A omapkeys_list | grep -aFn '/' 9844:/$ 9845:/^@v913^@$ 88010:M-^@1000_/^@$ 128981:M-^@1001_/$ Did anyone ever saw something like this? Am Mo., 13. Feb. 2023 um 14:31 Uhr schrieb Boris Behrens

[ceph-users] Re: [RGW - octopus] too many omapkeys on versioned bucket

2023-02-13 Thread Boris Behrens
rminal) <80>1000_//^@ Any idea what this is? Am Mo., 13. Feb. 2023 um 13:57 Uhr schrieb Boris Behrens : > Hi, > I have one bucket that showed up with a large omap warning, but the amount > of objects in the bucket, does not align with the amount of omap keys. The > buck

[ceph-users] [RGW - octopus] too many omapkeys on versioned bucket

2023-02-13 Thread Boris Behrens
Hi, I have one bucket that showed up with a large omap warning, but the amount of objects in the bucket, does not align with the amount of omap keys. The bucket is sharded to get rid of the "large omapkeys" warning. I've counted all the omapkeys of one bucket and it came up with 33.383.622 (rados

[ceph-users] Re: Migrate a bucket from replicated pool to ec pool

2023-02-13 Thread Boris Behrens
Hi Casey, changes to the user's default placement target/storage class don't > apply to existing buckets, only newly-created ones. a bucket's default > placement target/storage class can't be changed after creation > so I can easily update the placement rules for this user and can migrate

[ceph-users] Migrate a bucket from replicated pool to ec pool

2023-02-11 Thread Boris Behrens
Hi, we use rgw as our backup storage, and it basically holds only compressed rbd snapshots. I would love to move these out of the replicated pool into an ec pool. I've read that I can set a default placement target for a user ( https://docs.ceph.com/en/octopus/radosgw/placement/). What happens to

[ceph-users] Re: PG_BACKFILL_FULL

2023-01-16 Thread Boris Behrens
Hmm.. I ran into a similar issue. IMHO there are two ways to work around the problem until the new disk is in place: 1. change the backfill full threshold (I use these commands: https://www.suse.com/support/kb/doc/?id=19724) 2. reweight the backfill full OSDs just a little bit, so they move
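
The same two workarounds as commands, hedged (ratio and OSD id are examples only):

    ceph osd set-backfillfull-ratio 0.91     # default backfillfull ratio is 0.90
    ceph osd reweight 17 0.95                # nudge some PGs off the too-full OSD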

[ceph-users] RGW - large omaps even when buckets are sharded

2023-01-16 Thread Boris Behrens
Hi, since last week the scrubbing results in large omap warning. After some digging I've got these results: # searching for indexes with large omaps: $ for i in `rados -p eu-central-1.rgw.buckets.index ls`; do rados -p eu-central-1.rgw.buckets.index listomapkeys $i | wc -l | tr -d '\n' >>

[ceph-users] radosgw ceph.conf question

2023-01-13 Thread Boris Behrens
Hi, I am just reading through this document ( https://docs.ceph.com/en/octopus/radosgw/config-ref/) and at the top it states: The following settings may be added to the Ceph configuration file (i.e., > usually ceph.conf) under the [client.radosgw.{instance-name}] section. > And my ceph.conf looks

[ceph-users] Octopus RGW large omaps in usage

2023-01-10 Thread Boris Behrens
Hi, I am currently trying to figure out how to resolve the "large objects found in pool 'rgw.usage'" error. In the past I trimmed the usage log, but now I am at the point that I need to trim it down to two weeks. I checked the amount of omapkeys and the distribution is quite off: # for OBJECT
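
Trimming the usage log to a window can be done globally or per user; a hedged example (dates and uid are placeholders):

    radosgw-admin usage trim --start-date=2022-01-01 --end-date=2022-12-27
    radosgw-admin usage trim --uid=customer1 --end-date=2022-12-27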

[ceph-users] Re: docs.ceph.com -- Do you use the header navigation bar? (RESPONSES REQUESTED)

2023-01-09 Thread Boris Behrens
I actually do not mind if i need to scroll up a line, but I also think it is a good idea to remove it. Am Mo., 9. Jan. 2023 um 11:06 Uhr schrieb Frank Schilder : > > Hi John, > > firstly, image attachments are filtered out by the list. How about you upload > the image somewhere like

[ceph-users] Re: rgw - unable to remove some orphans

2023-01-03 Thread Boris Behrens
Hi Andrei, happy new year to you too. The file might be already removed. You can check if the radosobject is there with `rados -p ls ...` You can also check if the file is is still in the bucket with `radosgw-admin bucket radoslist --bucket BUCKET` Cheers Boris Am Di., 3. Jan. 2023 um 13:47

[ceph-users] RGW access logs with bucket name

2023-01-03 Thread Boris Behrens
Hi, I am looking to move our logs from /var/log/ceph/ceph-client...log to our log aggregator. Is there a way to have the bucket name in the log file? Or can I write the rgw_enable_ops_log into a file? Maybe I could work with this. Cheers and happy new year Boris
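
A hedged way to enable the ops log and send it to a socket instead of the log pool (these are the long-standing rgw ops-log options; path and values are examples):

    ceph config set client.rgw rgw_enable_ops_log true
    ceph config set client.rgw rgw_ops_log_rados false
    ceph config set client.rgw rgw_ops_log_socket_path /var/run/ceph/rgw-ops-log.sock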

[ceph-users] Re: octopus (15.2.16) OSDs crash or don't answer heathbeats (and get marked as down)

2022-12-13 Thread Boris Behrens
gt; On Wed, Dec 7, 2022 at 6:10 PM Boris wrote: >> >> Hi Jakub, >> >> the problem is in our case that we hit this bug >> (https://tracker.ceph.com/issues/53585) and the GC leads to this problem. >> >> We worked around this, by moving the GC to separate d

[ceph-users] nautilus mgr die when the balancer runs

2022-12-13 Thread Boris Behrens
Hi, we had an issue with an old cluster, where we put disks from one host to another. We destroyed the disks and added them as new OSDs, but since then the mgr daemons were restarting in 120s intervals. I tried to debug it a bit, and it looks like the balancer is the problem. I tried to disable it

[ceph-users] Re: radosgw - limit maximum file size

2022-12-09 Thread Boris Behrens
g help *rgw_multipart_part_upload_limit* > rgw_multipart_part_upload_limit - Max number of parts in multipart upload > (int, advanced) > Default: 1 > Can update at runtime: true > Services: [rgw] > > *rgw_max_put_size* is set in bytes. > > Regards, > Eric. > > On Fr

[ceph-users] radosgw - limit maximum file size

2022-12-09 Thread Boris Behrens
Hi, is it possible to somehow limit the maximum file/object size? I've read that I can limit the size of multipart objects and the amount of multipart objects, but I would like to limit the size of each object in the index to 100GB. I haven't found a config or quota value that would fit.
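
There is no single per-object cap, but the two knobs mentioned roughly bound the total: a multipart object can be at most rgw_max_put_size per part times rgw_multipart_part_upload_limit parts. A hedged example aiming at about 100 GiB (values are examples; clients uploading smaller parts end up with a lower effective cap):

    ceph config set client.rgw rgw_max_put_size 1073741824            # 1 GiB per PUT / per part
    ceph config set client.rgw rgw_multipart_part_upload_limit 100    # max 100 parts -> ~100 GiB per object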

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-09 Thread Boris Behrens
Hello together, @Alex: I am not sure what to look for in /sys/block//device. There are a lot of files. Is there anything I should check in particular? You have sysfs access in /sys/block//device - this will show a lot > of settings. You can go to this directory on CentOS vs. Ubuntu, and see if >

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-06 Thread Boris Behrens
< icepic...@gmail.com>: > Perhaps run "iostat -xtcy 5" on the OSD hosts to > see if any of the drives have weirdly high utilization despite low > iops/requests? > > > Den tis 6 dec. 2022 kl 10:02 skrev Boris Behrens : > > > > Hi Sven, > > I am

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-06 Thread Boris Behrens
schrieb Sven Kieske : > On Sa, 2022-12-03 at 01:54 +0100, Boris Behrens wrote: > > hi, > > maybe someone here can help me to debug an issue we faced today. > > > > Today one of our clusters came to a grinding halt with 2/3 of our OSDs > > reporting slow ops. > &

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-04 Thread Boris Behrens
Something has got to be there, >> which makes the problem go away. >> -- >> Alex Gorbachev >> https://alextelescope.blogspot.com >> >> >> >> On Sun, Dec 4, 2022 at 6:08 AM Boris Behrens wrote: >> >> > Hi Alex, >> > I am searching for a log line tha

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-04 Thread Boris Behrens
@Alex: the issue is done for now, but I fear it might come back sometime. The cluster was running fine for months. I check if we can restart the switches easily. Host reboots should also be no problem. There is no "implicated OSD" message in the logs. All OSDs were recreated 3 months ago. (sync

[ceph-users] Dilemma with PG distribution

2022-12-04 Thread Boris Behrens
Hi, I am just evaluating our cluster configuration again, because we had a very bad incident with laggy OSDs that shut down the entire cluster. We use datacenter SSDs in different sizes (2, 4, 8TB) and someone said that I should not go beyond a specific number of PGs on certain device classes.

[ceph-users] Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-04 Thread Boris Behrens
Hi Alex, I am searching for a log line that points me in the right direction. From what I've seen, I could find a specific Host, OSD, PG that was leading to this problem. But maybe I am looking at the wrong logs. I have around 150k lines that look like this:

[ceph-users] octopus rbd cluster just stopped out of nowhere (>20k slow ops)

2022-12-02 Thread Boris Behrens
Hi, maybe someone here can help me debug an issue we faced today. Today one of our clusters came to a grinding halt with 2/3 of our OSDs reporting slow ops. The only option to get it back to work fast was to restart all OSD daemons. The cluster is an octopus cluster with 150 enterprise SSD

[ceph-users] Re: radosgw octopus - how to cleanup orphan multipart uploads

2022-12-02 Thread Boris Behrens
ozuCYXDKYvhkW5RiZUxuaNfu48C.365_1-- --ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2339856956.63__multipart_8cfd0bdb-05f9-40cd-a50d-83295b416ea9.lz4.CwlAWozuCYXDKYvhkW5RiZUxuaNfu48C.365-- Am Fr., 2. Dez. 2022 um 12:17 Uhr schrieb Boris Behrens : > Hi, > we are currently encountering a lot of broken

[ceph-users] Re: radosgw-octopus latest - NoSuchKey Error - some buckets lose their rados objects, but not the bucket index

2022-12-02 Thread Boris Behrens
like > "c44a7aab-e086-43df-befe-ed8151b3a209.4147.1_obj1”. > > 3. grep through the logs for the head object and see if you find anything. > > Eric > (he/him) > > On Nov 22, 2022, at 10:36 AM, Boris Behrens wrote: > > Does someone have an idea what I can ch

[ceph-users] radosgw octopus - how to cleanup orphan multipart uploads

2022-12-02 Thread Boris Behrens
Hi, we are currently encountering a lot of broken / orphan multipart uploads. When I try to fetch the multipart uploads via s3cmd, it just never finishes. Debug output looks like this and it basically never changes. DEBUG: signature-v4 headers: {'x-amz-date': '20221202T105838Z', 'Authorization':

[ceph-users] Re: radosgw-admin bucket check --fix returns a lot of errors (unable to find head object data)

2022-11-23 Thread Boris Behrens
ere, but now I don't care. (I also have this for a healthy bucket, where I test stuff like this prior, which gets recreated periodically) Am Mi., 23. Nov. 2022 um 12:22 Uhr schrieb Boris Behrens : > Hi, > we have a customer that got some _multipart_ files in his bucket, but the > bucket g

[ceph-users] radosgw-admin bucket check --fix returns a lot of errors (unable to find head object data)

2022-11-23 Thread Boris Behrens
Hi, we have a customer that got some _multipart_ files in his bucket, but the bucket got no unfinished multipart objects. So I tried to remove them via $ radosgw-admin object rm --bucket BUCKET --object=_multipart_OBJECT.qjqyT8bXiWW5jdbxpVqHxXnLWOG3koUi.1 ERROR: object remove returned: (2) No

[ceph-users] radosgw-octopus latest - NoSuchKey Error - some buckets lose their rados objects, but not the bucket index

2022-11-21 Thread Boris Behrens
Good day people, we have a very strange problem with some bucket. Customer informed us, that they had issues with objects. They are listed, but on a GET they receive "NoSuchKey" error. They did not delete anything from the bucket. We checked and `radosgw-admin bucket radoslist --bucket $BUCKET`

[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-25 Thread Boris Behrens
Opened a bug on the tracker for it: https://tracker.ceph.com/issues/57919 Am Fr., 7. Okt. 2022 um 11:30 Uhr schrieb Boris Behrens : > Hi, > I just wanted to reshard a bucket but mistyped the amount of shards. In a > reflex I hit ctrl-c and waited. It looked like the resharding did not

[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-24 Thread Boris Behrens
Cheers again. I am still stuck at this. Someone got an idea how to fix it? Am Fr., 7. Okt. 2022 um 11:30 Uhr schrieb Boris Behrens : > Hi, > I just wanted to reshard a bucket but mistyped the amount of shards. In a > reflex I hit ctrl-c and waited. It looked like the resharding did not

[ceph-users] Re: octopus 15.2.17 RGW daemons begin to crash regularly

2022-10-07 Thread Boris Behrens
d a socket's remote_endpoint(). > i didn't think that local_endpoint() could fail the same way, but i've > opened https://tracker.ceph.com/issues/57784 to track this and the fix > should look the same > > On Thu, Oct 6, 2022 at 12:12 PM Boris Behrens wrote: > > > > Any ideas o

[ceph-users] rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-07 Thread Boris Behrens
Hi, I just wanted to reshard a bucket but mistyped the number of shards. In a reflex I hit ctrl-c and waited. It looked like the resharding did not finish so I canceled it, and now the bucket is in this state. How can I fix it? It does not show up in the stale-instances list. It's also a multisite
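
A hedged set of commands for inspecting and cleaning up a half-cancelled reshard (bucket name is a placeholder; on multisite, double-check before cancelling anything):

    radosgw-admin reshard list
    radosgw-admin reshard status --bucket mybucket
    radosgw-admin reshard cancel --bucket mybucket
    radosgw-admin reshard stale-instances list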
