[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Igor and Stefan, thanks a lot for your help! Our cluster is almost finished with recovery and I would like to switch to off-line conversion of the SSD OSDs. In one of Stefan's mails I could find the command for manual compaction: ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" com
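For reference, the compaction command being quoted most likely completes as follows; the OSD must be stopped first, and OSD_ID is a placeholder:
# systemctl stop ceph-osd@${OSD_ID}
# ceph-kvstore-tool bluestore-kv "/var/lib/ceph/osd/ceph-${OSD_ID}" compact
# systemctl start ceph-osd@${OSD_ID}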

[ceph-users] Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Nicola Mori
Dear Ceph users, my cluster has been stuck for several days with some PGs backfilling. The number of misplaced objects slowly decreases down to 5%, and at that point jumps up again to about 7%, and so on. I found several possible reasons for this behavior. One is related to the balancer, which anyw

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi Stefan, super thanks! I found a quick-fix command in the help output: # ceph-bluestore-tool -h [...] Positional options: --command arg fsck, repair, quick-fix, bluefs-export, bluefs-bdev-sizes, bluefs-bdev-expand, bluefs-bdev-new-db
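Going by that help output, an off-line quick-fix of a stopped OSD would presumably be invoked like this (OSD_ID is a placeholder):
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD_ID} --command quick-fix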

[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Unfortunately, that isn't the case: the drive is perfectly healthy and, according to all measurements I did on the host itself, it isn't any different from any other drive on that host size-, health- or performance-wise. The only difference I noticed is that this drive sporadically does more I/O t

[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Jan Marek
Hello, my cluster is now healthy. I've studied the OSDMonitor.cc file and found that there is some problematic logic. Assumptions: 1) require_osd_release can only be raised. 2) ceph-mon in version 17.2.3 can set require_osd_release to the minimal value 'octopus'. I have two variants: 1) If I can

[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Dan van der Ster
Hi Jan, It looks like you got into this situation by not setting require-osd-release to pacific while you were running 16.2.7. The code has that expectation, and unluckily for you if you had upgraded to 16.2.8 you would have had a HEALTH_WARN that pointed out the mismatch between require_osd_relea
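For readers following along: the flag in question is raised with a single mon command, and the current value can be checked in the OSD map; a sketch, assuming a fully upgraded Pacific cluster:
# ceph osd dump | grep require_osd_release
# ceph osd require-osd-release pacific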

[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Dan van der Ster
Hi Zakhar, I can back up what Konstantin has reported -- we occasionally have HDDs performing very slowly even though all smart tests come back clean. Besides ceph osd perf showing a high latency, you could see high ioutil% with iostat. We normally replace those HDDs -- usually by draining and ze
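A quick way to spot such a drive is to compare its utilization and latency with its neighbours while the cluster is busy, e.g. by watching the %util and await columns of iostat next to the OSD latencies (sdX is a placeholder device):
# iostat -x sdX 5
# ceph osd perf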

[ceph-users] rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-07 Thread Boris Behrens
Hi, I just wanted to reshard a bucket but mistyped the number of shards. In a reflex I hit ctrl-c and waited. It looked like the resharding did not finish, so I canceled it, and now the bucket is in this state. How can I fix it? It does not show up in the stale-instances list. It's also a multisite en
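For anyone hitting the same situation, the reshard state can usually be inspected with radosgw-admin before attempting any cleanup; a sketch, with the bucket name as a placeholder (on multisite the stale-instances handling has extra caveats):
# radosgw-admin reshard list
# radosgw-admin reshard status --bucket=<bucket>
# radosgw-admin reshard cancel --bucket=<bucket>
# radosgw-admin reshard stale-instances list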

[ceph-users] Re: octopus 15.2.17 RGW daemons begin to crash regularly

2022-10-07 Thread Boris Behrens
Hi Casey, thanks a lot. I added the full stack trace from our ceph-client log. Cheers Boris Am Do., 6. Okt. 2022 um 19:21 Uhr schrieb Casey Bodley : > hey Boris, > > that looks a lot like https://tracker.ceph.com/issues/40018 where an > exception was thrown when trying to read a socket's remote

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
Hi Frank, one more thing I realized during the night :) When performing the conversion, the DB gets a significant bunch of new data (approx. on par with the original OMAP volume) without the old data being immediately removed. Hence one should expect the DB size to grow dramatically at this point. Which should go
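To keep an eye on DB growth during the conversion, the bluefs/DB sizes can be checked either off-line with ceph-bluestore-tool (using the bluefs-bdev-sizes command from the help output quoted above) or on a running OSD via its admin socket; OSD_ID is a placeholder:
# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${OSD_ID} --command bluefs-bdev-sizes
# ceph daemon osd.${OSD_ID} perf dump bluefs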

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
For format updates one can use the quick-fix command instead of repair; it might work a bit faster. On 10/7/2022 10:07 AM, Stefan Kooman wrote: On 10/7/22 09:03, Frank Schilder wrote: Hi Igor and Stefan, thanks a lot for your help! Our cluster is almost finished with recovery and I would like t

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
Just FYI: standalone ceph-bluestore-tool's quick-fix behaves pretty similarly to the action performed on start-up with bluestore_fsck_quick_fix_on_mount = true On 10/7/2022 10:18 AM, Frank Schilder wrote: Hi Stefan, super thanks! I found a quick-fix command in the help output: # ceph-blues
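For completeness, that option can be set cluster-wide so every OSD performs the quick-fix on its next start; a sketch:
# ceph config set osd bluestore_fsck_quick_fix_on_mount true
# ceph config get osd bluestore_fsck_quick_fix_on_mount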

[ceph-users] Re: Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Nicola Mori
The situation resolved itself, since there probably was no error. I manually increased the number of PGs and PGPs to 128 some days ago, and the PGP count was being updated step by step. Actually, after a bump from 5% to 7% in the count of misplaced objects I noticed that the number of PGPs w

[ceph-users] Re: Stuck in upgrade

2022-10-07 Thread Jan Marek
Hi Dan, thanks for this point, it's at least the minimum that can be done. But can you imagine what I would have to do if I did not have the ability to change OSDMonitor.cc, recompile and raise require-osd-release? Or had require-osd-release lower than nautilus? The parameter min_mon_release raised automati

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Igor Fedotov
Hi Frank, there are no tools to defragment an OSD at the moment. The only way to defragment an OSD is to redeploy it... Thanks, Igor On 10/7/2022 3:04 AM, Frank Schilder wrote: Hi Igor, sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to de-fragment the OSD. It doesn't look like the fs

[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Thanks for this! The drive doesn't show increased utilization on average, but it does sporadically get more I/O than other drives, usually in short bursts. I am now trying to find a way to trace this to a specific PG, pool and object(s) – not sure if that is possible. /Z On Fri, 7 Oct 2022, 12:

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Frank Schilder
Hi all, trying to respond to 4 past emails :) We started using manual conversion and, if the conversion fails, it fails in the last step. So far, we have a failure on 1 out of 8 OSDs. The OSD can be repaired by running a compaction + another repair, which will complete the last step. Looks lik

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-07 Thread Szabo, Istvan (Agoda)
Finally, how is your PG distribution? How many PGs per disk? Istvan Szabo Senior Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com --- -Original Message- From: Frank Schi
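The PG-per-disk distribution being asked about can be read from the PGS column of the OSD utilization report:
# ceph osd df tree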

[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Eugen Block
Hi, I’d look for deep-scrubs on that OSD; those are logged, and maybe those timestamps match your observations. Zitat von Zakhar Kirpichenko : Thanks for this! The drive doesn't show increased utilization on average, but it does sporadically get more I/O than other drives, usually in short bur
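Deep-scrubs show up in the cluster log and in the per-PG scrub timestamps, so correlating them with the latency spikes could look roughly like this (OSD_ID and log path are placeholders):
# grep deep-scrub /var/log/ceph/ceph.log
# ceph pg ls-by-osd ${OSD_ID}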

[ceph-users] Re: iscsi deprecation

2022-10-07 Thread Maged Mokhtar
You can try PetaSAN (www.petasan.org). We are an open source solution on top of Ceph. We provide scalable active/active iSCSI which supports VMware VAAI and Microsoft clustered shared volumes for Hyper-V clustering. Cheers /maged On 30/09/2022 19:36, Filipe Mendes wrote: Hello! I'm consideri

[ceph-users] Slow monitor responses for rbd ls etc.

2022-10-07 Thread Sven Barczyk
Hello, we are encountering strange behavior on our Ceph cluster. (All Ubuntu 20 / all mons Quincy 17.2.4 / oldest OSD Quincy 17.2.0.) Administrative commands like rbd ls or create are so slow that libvirtd is running into timeouts and creating new VMs on our Cloudstack, on behalf of creating new vol
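When monitor-side commands stall like this, a first step is to check overall health, the daemon version mix, and what the monitors are chewing on; a sketch, assuming it is run on a mon host:
# ceph -s
# ceph versions
# ceph daemon mon.$(hostname -s) ops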

[ceph-users] Inherited CEPH nightmare

2022-10-07 Thread Tino Todino
Hi folks, The company I recently joined has a Proxmox cluster of 4 hosts with a CEPH implementation that was set up using the Proxmox GUI. It is running terribly, and as a CEPH newbie I'm trying to figure out if the configuration is at fault. I'd really appreciate some help and guidance on th

[ceph-users] Re: Inherited CEPH nightmare

2022-10-07 Thread Stefan Kooman
On 10/7/22 16:56, Tino Todino wrote: Hi folks, The company I recently joined has a Proxmox cluster of 4 hosts with a CEPH implementation that was set-up using the Proxmox GUI. It is running terribly, and as a CEPH newbie I'm trying to figure out if the configuration is at fault. I'd really

[ceph-users] Re: Inherited CEPH nightmare

2022-10-07 Thread Robert Sander
Hi Tino, On 07.10.22 at 16:56, Tino Todino wrote: I know some of these are consumer class, but I'm working on replacing these. This would be your biggest issue. SSD performance can vary drastically. Ceph needs "multi-use" enterprise SSDs, not read-optimized consumer ones. All 4 hosts are se

[ceph-users] Re: Inherited CEPH nightmare

2022-10-07 Thread Josef Johansson
Hi, you also want to check disk_io_weighted via some kind of metric system. That will detect which SSDs are hogging the system, if there are any specific ones. Also check their error levels and endurance. On Fri, 7 Oct 2022 at 17:05, Stefan Kooman wrote: > On 10/7/22 16:56, Tino Todino wr
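Error levels and endurance can be checked per drive with smartctl; a sketch, with sdX as a placeholder (attribute names differ between vendors):
# smartctl -a /dev/sdX | grep -Ei 'error|wear|percentage used'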

[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Konstantin Shalygin
Zakhar, try to look at the top slow ops in the daemon socket for this OSD; you may find 'snapc' operations, for example. By the rbd head you can find the rbd image, and then try to look at how many snapshots are in the chain for this image. More than 10 snaps for one image can increase client ops latency to tens of millis
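The slow and historic ops mentioned here can be pulled from the OSD admin socket, and the snapshot chain length of a suspect image checked with rbd; a sketch with placeholders for the OSD id, pool and image:
# ceph daemon osd.${OSD_ID} dump_ops_in_flight
# ceph daemon osd.${OSD_ID} dump_historic_ops | grep -c snapc
# rbd snap ls <pool>/<image> | wc -l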

[ceph-users] Re: Infinite backfill loop + number of pgp groups stuck at wrong value

2022-10-07 Thread Josh Baergen
As of Nautilus+, when you set pg_num, it actually internally sets pg(p)_num_target, and then slowly increases (or decreases, if you're merging) pg_num and then pgp_num until it reaches the target. The amount of backfill scheduled into the system is controlled by target_max_misplaced_ratio. Josh O
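The intermediate state is visible on the pool itself, and the throttle Josh mentions is an mgr option (default 0.05, i.e. roughly 5% misplaced at a time, which lines up with the steps observed in this thread); the pool name is a placeholder:
# ceph osd pool ls detail | grep <pool>
# ceph config get mgr target_max_misplaced_ratio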

[ceph-users] every rgw stuck on "RGWReshardLock::lock found lock"

2022-10-07 Thread Haas, Josh
I've observed this occur on v14.2.22 and v15.2.12. Wasn't able to find anything obviously relevant in changelogs, bug tickets, or existing mailing list threads. In both cases, every RGW in the cluster starts spamming logs with lines that look like the following: 2022-09-04 14:20:45.231 7fc7b2
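When every RGW starts spamming RGWReshardLock messages, a reasonable first look is at the reshard queue and the affected bucket's current instance/shard count; the bucket name is a placeholder:
# radosgw-admin reshard list
# radosgw-admin bucket stats --bucket=<bucket>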

[ceph-users] Re: 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-07 Thread Zakhar Kirpichenko
Thanks for the suggestions, I will try this. /Z On Fri, 7 Oct 2022 at 18:13, Konstantin Shalygin wrote: > Zakhar, try to look to top of slow ops in daemon socket for this osd, you > may find 'snapc' operations, for example. By rbd head you can find rbd > image, and then try to look how much sna