[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-26 Thread Janne Johansson
On Wed, 26 Apr 2023 at 21:20, Niklas Hambüchen wrote: > > 100MB/s is sequential, your scrubbing is random. afaik everything is random. > > Is there any docs that explain this, any code, or other definitive answer? > Also wouldn't it make sense that for scrubbing to be able to read the disk line

[ceph-users] Re: Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-26 Thread Zakhar Kirpichenko
As suggested by someone, I tried `dump_historic_slow_ops`. There aren't many, and they're somewhat difficult to interpret: "description": "osd_op(client.250533532.0:56821 13.16f 13:f6c9079e:::rbd_data.eed629ecc1f946.001c:head [stat,write 3518464~8192] snapc 0=[] ondisk+writ
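
For anyone wanting to pull the same data, a minimal sketch via the admin socket (osd.14 is just an example id):

    # dump recently completed slow ops from one OSD's admin socket
    ceph daemon osd.14 dump_historic_slow_ops
    # the aggregate latency counters help put individual slow ops in context
    ceph daemon osd.14 perf dump | grep -A 3 '"op_w_latency"'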

[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-26 Thread Niklas Hambüchen
The question you should ask yourself is why you want to change/investigate this. Because if scrubbing takes 10x longer due to seek thrashing, my scrubs never finish in time (the default is 1 week). I end up with e.g. "267 pgs not deep-scrubbed in time". On a 38 TB cluster, if you scrub 8 MB/s on 10 disks
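
A rough sketch of the knobs involved (values are only examples, not a recommendation):

    # list the PGs that are overdue
    ceph health detail | grep 'not deep-scrubbed since'
    # allow more time between deep scrubs (default 604800 s = 1 week)
    ceph config set osd osd_deep_scrub_interval 1209600
    # and/or allow more than one concurrent scrub per OSD
    ceph config set osd osd_max_scrubs 2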

[ceph-users] Re: Rados gateway data-pool replacement.

2023-04-26 Thread Gaël THEROND
Hi Richard! Thanks a lot for your answer! Indeed I'm so dumb… I just performed a K/M migration on another service that involved RBD last week, and so I completely overlooked the fact that you can, of course, change the crush_rule of a pool without having to copy anything!! OMFG… I can be so sil
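
For reference, pointing an existing pool at a different CRUSH rule is a single command and Ceph rebalances the data in place (pool and rule names below are placeholders; the k/m profile of an EC pool still cannot be changed this way):

    ceph osd crush rule ls
    ceph osd pool set default.rgw.buckets.data crush_rule replicated-ssd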

[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-26 Thread Marc
Hi Niklas, > > > 100MB/s is sequential, your scrubbing is random. afaik everything is random. > > Is there any docs that explain this, any code, or other definitive answer? do a fio[1] test on a disk to see how it performs under certain conditions. Or look at atop during scrubbing, it will
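
A minimal fio sketch of that kind of comparison (device path and parameters are examples; it reads raw blocks, so point it at the correct, non-production disk):

    # sequential read baseline
    fio --name=seqread --readonly --filename=/dev/sdX --rw=read --bs=4M \
        --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based
    # random reads, closer to scrubbing a fragmented OSD
    fio --name=randread --readonly --filename=/dev/sdX --rw=randread --bs=64k \
        --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based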

[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-26 Thread Niklas Hambüchen
Hi Marc, thanks for your reply. 100MB/s is sequential, your scrubbing is random. afaik everything is random. Is there any docs that explain this, any code, or other definitive answer? Also wouldn't it make sense that for scrubbing to be able to read the disk linearly, at least to some signi

[ceph-users] Ceph 16.2.12, bluestore cache doesn't seem to be used much

2023-04-26 Thread Zakhar Kirpichenko
Hi, I have a Ceph 16.2.12 cluster with hybrid OSDs (HDD block storage, DB/WAL on NVME). All OSD settings are default except the cache-related settings, which are as follows: osd.14  dev  bluestore_cache_autotune  true   osd.14  dev  bluestore_cache_size_hdd  4294967
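
To see whether the cache is actually being used, a sketch of what can be checked on one OSD (osd.14 as in the config dump above):

    # memory currently assigned to the various bluestore caches
    ceph daemon osd.14 dump_mempools
    # effective cache-related settings for that OSD
    ceph config show osd.14 | grep -E 'bluestore_cache|osd_memory_target'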

[ceph-users] Ceph 16.2.12, particular OSD shows higher latency than others

2023-04-26 Thread Zakhar Kirpichenko
Hi, I have a Ceph 16.2.12 cluster with uniform hardware, same drive make/model, etc. A particular OSD is showing higher latency than usual in `ceph osd perf`, usually mid to high tens of milliseconds while other OSDs show low single digits, although its drive's I/O stats don't look different from
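
A quick way to keep an eye on it while comparing OSDs (osd.14 stands in for the slow one):

    # commit/apply latency per OSD, worst offenders last
    ceph osd perf | sort -nk 3 | tail
    # per-op latency counters from the suspect OSD itself
    ceph daemon osd.14 perf dump | grep -A 3 -E '"op_(r|w)_latency"'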

[ceph-users] Re: How to find the bucket name from Radosgw log?

2023-04-26 Thread Dan van der Ster
Hi, Your cluster probably has dns-style buckets enabled. In that case the path does not include the bucket name, and neither does the rgw log. Do you have a frontend lb like haproxy? You'll find the bucket names there. -- Dan __ Clyso GmbH | https://www.clyso.com
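
If haproxy is the frontend, a sketch of how to get the Host header (and thus the bucket) into its logs; the section and names are illustrative:

    frontend rgw_https
        bind *:443 ssl crt /etc/haproxy/rgw.pem
        option httplog
        # with dns-style buckets the bucket is the first label of the Host header
        capture request header Host len 64
        default_backend rgw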

[ceph-users] Re: How to control omap capacity?

2023-04-26 Thread Dan van der Ster
Hi, Simplest solution would be to add a few OSDs. -- dan __ Clyso GmbH | https://www.clyso.com On Tue, Apr 25, 2023 at 2:58 PM WeiGuo Ren wrote: > > I have two OSDs. These OSDs are used for the RGW index pool. After a lot of stress tests, these two OSDs were written t
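
With cephadm that can be as simple as (host and device names are examples):

    # check what the orchestrator sees as available
    ceph orch device ls rgw-host1
    # create an OSD on a specific free device
    ceph orch daemon add osd rgw-host1:/dev/sdc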

[ceph-users] Re: For suggestions and best practices on expanding Ceph cluster and removing old nodes

2023-04-26 Thread Dan van der Ster
Thanks Tom, this is a very useful post! I've added our docs guy Zac in cc: IMHO this would be useful in a "Tips & Tricks" section of the docs. -- dan __ Clyso GmbH | https://www.clyso.com On Wed, Apr 26, 2023 at 7:46 AM Thomas Bennett wrote: > > I would second Joa

[ceph-users] Re: Massive OMAP remediation

2023-04-26 Thread Dan van der Ster
Hi Ben, Are you compacting the relevant OSDs periodically? ceph tell osd.x compact (for the three OSDs holding the bilog) would help reshape the RocksDB levels so they at least perform better for a little while, until the next round of bilog trims. Otherwise, I have experience deleting ~50M object indices
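
A sketch of that periodic compaction, assuming the bucket index lives on osd.1, osd.2 and osd.3 (the ids are made up):

    # compact the RocksDB of the OSDs backing the bucket index pool
    for id in 1 2 3; do
        ceph tell osd.$id compact
    done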

[ceph-users] Re: Ceph Mgr/Dashboard Python depedencies: a new approach

2023-04-26 Thread Casey Bodley
are there any volunteers willing to help make these python packages available upstream? On Tue, Mar 28, 2023 at 5:34 AM Ernesto Puerta wrote: > > Hey Ken, > > This change doesn't involve any further internet access other than what is already required for the "make dist" stage (e.g.: npm packag

[ceph-users] Re: Bug, pg_upmap_primaries.empty()

2023-04-26 Thread Gregory Farnum
Looks like you've somehow managed to enable the upmap balancer while allowing a client that's too old to understand it to mount. Radek, this is a commit from yesterday; is it a known issue? On Wed, Apr 26, 2023 at 7:49 AM Nguetchouang Ngongang Kevin wrote: > > Good morning, I found a bug on cep
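
A sketch for checking what the cluster requires versus what the clients speak (reef is, as far as I understand, what pg_upmap_primaries needs):

    # what the cluster currently insists on
    ceph osd dump | grep require_min_compat_client
    # feature level of the currently connected clients
    ceph features
    # raise the requirement only once no pre-reef clients remain
    ceph osd set-require-min-compat-client reef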

[ceph-users] Ceph Leadership Team meeting minutes - 2023 April 26

2023-04-26 Thread Casey Bodley
# ceph windows tests PR check will be made required once regressions are fixed windows build currently depends on gcc11 which limits use of c++20 features. investigating newer gcc or clang toolchain # 16.2.13 release final testing in progress # prometheus metric regressions https://tracker.ceph.c

[ceph-users] Bug, pg_upmap_primaries.empty()

2023-04-26 Thread Nguetchouang Ngongang Kevin
Good morning, i found a bug on ceph reef After installing ceph and deploying 9 osds with a cephfs layer. I got this error after many writing and reading operations on the ceph fs i deployed. ```{ "assert_condition": "pg_upmap_primaries.empty()", "assert_file": "/home/jenkins-build/build

[ceph-users] Re: For suggestions and best practices on expanding Ceph cluster and removing old nodes

2023-04-26 Thread Thomas Bennett
I would second Joachim's suggestion - this is exactly what we're in the process of doing for a client, i.e. migrating from Luminous to Quincy. However, the below would also work if you're moving to Nautilus. The only catch with this plan would be if you plan to reuse any hardware - i.e. the hosts running

[ceph-users] Re: Move ceph to new addresses and hostnames

2023-04-26 Thread Eugen Block
Hi, can you paste the following output: ceph orch ls osd --export Maybe you have the "all-available-devices" service set to managed? You can disable that with [1]: ceph orch apply osd --all-available-devices --unmanaged=true Please also add your osd yaml configuration, you can test that wi
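
For comparison, a minimal OSD service spec of the kind being asked for (ids and filters are placeholders); unmanaged: true stops cephadm from automatically consuming new disks:

    service_type: osd
    service_id: hdd_osds
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        rotational: 1
      db_devices:
        rotational: 0
    unmanaged: true

    # preview what it would do before applying it
    ceph orch apply -i osd_spec.yaml --dry-run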

[ceph-users] Re: Move ceph to new addresses and hostnames

2023-04-26 Thread Jan Marek
Hello all, today I moved ceph to the HEALTH_OK state :-) 1) I had to restart the MGR node, then my old c-osdx hostnames finally went away and all of the OSDs from the old machines are now orchestrated by the 'ceph orch' command. 2) I've updated the ceph* packages on the osd2 node to version 17.2.6, then I tried 'cep

[ceph-users] Re: Veeam backups to radosgw seem to be very slow

2023-04-26 Thread Joachim Kraftmayer - ceph ambassador
"bucket does not exist" or "permission denied". Had received similar error messages with another client program. The default region did not match the region of the cluster. ___ ceph ambassador DACH ceph consultant since 2012 Clyso GmbH - Premier Ceph Foundation M

[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Thomas Hukkelberg
Hi! There are no kernel log messages that indicate read errors on the disk, and the error is not tied to one specific OSD. The errors so far have been on 7 different OSDs, and when we restart the OSD with errors, the error appears on one of the other OSDs in the same PG, as you can see when res

[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Joachim Kraftmayer - ceph ambassador
Hello Thomas, I would strongly recommend that you read the messages on the mailing list regarding Ceph versions 16.2.11, 16.2.12 and 16.2.13. Joachim ___ ceph ambassador DACH ceph consultant since 2012 Clyso GmbH - Premier Ceph Foundation Member https://www.clyso.

[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Robert Sander
On 26.04.23 13:24, Thomas Hukkelberg wrote: [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs osd.34 had 9936 reads repaired Are there any messages in the kernel log that indicate this device has read errors? Have you considered replacing the disk? Regards -- Robert Sander
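
A sketch of the checks implied there (device name is a placeholder):

    # kernel-level I/O errors for the device behind osd.34
    dmesg -T | grep -iE 'sdX|i/o error|medium error'
    # SMART health plus reallocated/pending sector counters
    smartctl -a /dev/sdX | grep -iE 'health|realloc|pending|uncorrect'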

[ceph-users] OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Thomas Hukkelberg
Hi all, Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors that we struggle to handle in a non-intrusive manner. Restarting MDS + hypervisor that accessed the object in question seems to be the only way we can clear the error so we can repair the PG and recover access
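
For what it's worth, once the inconsistent PG has been repaired the counter behind the warning can be reset; a sketch (it clears the symptom, not the root cause):

    # repair the affected PG first
    ceph pg repair <pgid>
    # then reset the 'reads repaired' counter that triggers OSD_TOO_MANY_REPAIRS
    ceph tell osd.34 clear_shards_repaired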

[ceph-users] Re: Increase timeout for marking osd down

2023-04-26 Thread Eugen Block
Hi, I don't think increasing the mon_osd_down_out_interval timeout alone will really help you in this situation; I remember an older thread about that but couldn't find it. What you could test is setting the nodown flag (ceph osd set nodown) to prevent flapping OSDs, but that's not a real
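
For reference, a sketch of the two knobs mentioned (the 900 s value is only an example; the default is 600 s):

    # keep flapping OSDs from being marked down while debugging
    ceph osd set nodown
    # give OSDs more time before they are marked out automatically
    ceph config set mon mon_osd_down_out_interval 900
    # remember to clear the flag afterwards
    ceph osd unset nodown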

[ceph-users] Re: ceph pg stuck - missing on 1 osd how to proceed

2023-04-26 Thread Eugen Block
We know very little about the whole cluster, can you add the usual information like 'ceph -s' and 'ceph osd df tree'? Scrubbing has nothing to do with the undersized PGs. Is the balancer and/or autoscaler on? Please also add 'ceph balancer status' and 'ceph osd pool autoscale-status'. Than

[ceph-users] Re: PVE CEPH OSD heartbeat slow

2023-04-26 Thread Frank Schilder
Hi Peter, 2% packet loss is a lot, especially on such expensive hardware. We observed the problems you describe with defective networking hardware, with NIC/switch ports in active-active LACP bonding mode. We had periodically failing transceivers, and these failures are not immediately detected by
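
A sketch of where such silent link problems tend to show up (interface names are examples):

    # per-NIC error/drop counters
    ethtool -S enp65s0f0 | grep -iE 'err|drop|crc'
    # state of the LACP bond and its members
    cat /proc/net/bonding/bond0
    # kernel-level packet statistics including RX/TX errors
    ip -s link show bond0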

[ceph-users] Re: Deep-scrub much slower than HDD speed

2023-04-26 Thread Marc
> > I observed that on an otherwise idle cluster, scrubbing cannot fully utilise the speed of my HDDs. Maybe the configured limit is set like this because, once (a part of) the scrubbing process has started, it is not possible/easy to automatically scale its performance back down to benefit
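
The throttles being alluded to are ordinary config options; a sketch of where to look before changing anything:

    # how many scrub operations may run per OSD at once
    ceph config get osd osd_max_scrubs
    # sleep injected between scrub chunks, in seconds
    ceph config get osd osd_scrub_sleep
    # scrubs are skipped while the load average is above this
    ceph config get osd osd_scrub_load_threshold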

[ceph-users] Re: Dead node (watcher) won't timeout on RBD

2023-04-26 Thread Eugen Block
Hi, can you share the exact command with which you blocked the watcher? To get the lock list run: rbd lock list <pool>/<image> There is 1 exclusive lock on this image. Locker ID Address client.1211875 auto 139643345791728 192.168.3.12:0/2259335316 To blacklist the client run: ceph o
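
Spelled out with placeholders, the sequence looks roughly like this (address and lock id come from the lock list output; older releases spell the command 'blacklist' rather than 'blocklist'):

    rbd lock list <pool>/<image>
    # evict the dead client by the address shown in the lock list
    ceph osd blocklist add 192.168.3.12:0/2259335316
    # then remove the stale lock itself
    rbd lock remove <pool>/<image> "auto 139643345791728" client.1211875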

[ceph-users] Re: PVE CEPH OSD heartbeat slow

2023-04-26 Thread Fabian Grünbichler
On April 25, 2023 9:03 pm, Peter wrote: > Dear all, > > We have been experiencing issues with Ceph after deploying it via PVE, with the network backed by a 10G Cisco switch with the VPC feature on. We are encountering a slow OSD heartbeat and have not been able to identify any network traffic issues. > > Upon

[ceph-users] ERROR: Distro uos version 20 not supported

2023-04-26 Thread Ben
Hi, This check seems not very relevant, since all Ceph components run in containers. Any ideas how to get past this issue? Any other ideas or suggestions for this kind of deployment? sudo ./cephadm --image 10.21.22.1:5000/ceph:v17.2.5-20230316 --docker bootstrap --mon-ip 10.21.22.1 --skip-monitoring-s