[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, sorry for the extra e-mail. I forgot to ask: I'm interested in a tool to de-fragment the OSD. It doesn't look like the fsck command does that. Is there any such tool? Best regards, Frank Schilder AIT Risø Campus Bygning 109, rum S14
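For reference, a minimal sketch of an offline check with ceph-bluestore-tool, assuming the OSD is stopped and using osd.16's path from this thread purely as an illustration; fsck verifies metadata consistency but is not documented to defragment anything:

  # ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-16
  # ceph-bluestore-tool free-score --path /var/lib/ceph/osd/ceph-16

The free-score command, if available in this release, only reports an allocator fragmentation score; it does not perform any defragmentation.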

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
well, I've just realized that you're apparently unable to collect these high-level stats for broken OSDs, are you? But if that's the case you shouldn't make any assumptions about faulty OSDs' utilization from healthy ones - it's definitely a very doubtful approach ;) On 10/7/2022 2:19

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
The log I inspected was for osd.16, so please share that OSD's utilization... And honestly I trust the allocator's stats more, so it's rather the CLI stats that are incorrect, if any. Anyway, a free dump should provide additional proof... And once again - do other non-starting OSDs show the same ENOSPC error? 
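For reference, a hedged sketch of how such a free dump might be produced offline with ceph-bluestore-tool (the OSD must be stopped; the path is illustrative, taken from the osd.16 example in this thread):

  # ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-16

The output lists the allocator's free extents, which can then be compared against the utilization reported by the CLI.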

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, I suspect there is something wrong with the data reported. These OSDs are only 50-60% used. For example: ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME 29 ssd 0.09099 1.0 93

[ceph-users] Re: Can't delete or unprotect snapshot with rbd

2022-10-06 Thread Wesley Dillingham
Anything in the trash? "rbd trash ls images" Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn On Thu, Oct 6, 2022 at 3:29 PM Niklas Jakobsson <niklas.jakobs...@kindredgroup.com> wrote: > Ah yes, sorry about that. I actually have the
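A minimal sketch of checking the trash and restoring an entry, assuming the pool name "images" from this thread; the image id argument is hypothetical:

  # rbd trash ls images
  # rbd trash restore images/<image-id>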

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Hi Frank, the abort message "bluefs enospc" indicates lack of free space for additional bluefs space allocations, which prevents the OSD from starting up. From the following log line one can see that bluefs needs ~1M more space while the total available is approx 622M. The problem is that bluefs

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, the problematic disk holds OSDs 16,17,18 and 19. OSD 16 is the one crashing the show. I collected its startup log here: https://pastebin.com/25D3piS6 . The line sticking out is line 603:

[ceph-users] Re: Can't delete or unprotect snapshot with rbd

2022-10-06 Thread Niklas Jakobsson
Ah yes, sorry about that. I actually have the issue on two images and I seem to have mixed them up when I was putting together the example, here is a correct one: # rbd info images/f3f4c73f-2eec-4af1-9bdf-4974a747607b rbd image 'f3f4c73f-2eec-4af1-9bdf-4974a747607b': size 8 GiB in 1024

[ceph-users] Re: Can't delete or unprotect snapshot with rbd

2022-10-06 Thread Wesley Dillingham
You are demo'ing two RBDs here: images/f3f4c73f-2eec-4af1-9bdf-4974a747607b seems to have 1 snapshot yet later when you try to interact with the snapshot you are doing so with a different rbd/image altogether: images/1fcfaa6b-eba0-4c75-b77d-d5b3ab4538a9 Respectfully, *Wes Dillingham*
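To confirm which of the two images actually carries the protected snapshot, something along these lines might help (image names taken from the thread, output omitted):

  # rbd snap ls images/f3f4c73f-2eec-4af1-9bdf-4974a747607b
  # rbd snap ls images/1fcfaa6b-eba0-4c75-b77d-d5b3ab4538a9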

[ceph-users] Re: octopus 15.2.17 RGW daemons begin to crash regularly

2022-10-06 Thread Casey Bodley
hey Boris, that looks a lot like https://tracker.ceph.com/issues/40018 where an exception was thrown when trying to read a socket's remote_endpoint(). i didn't think that local_endpoint() could fail the same way, but i've opened https://tracker.ceph.com/issues/57784 to track this and the fix

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan and anyone else reading this, we are probably misunderstanding each other here: > There is a strict MDS maintenance dance you have to perform [1]. > ... > [1]: https://docs.ceph.com/en/octopus/cephfs/upgrading/ Our ceph fs shut-down was *after* completing the upgrade to octopus, *not

[ceph-users] Re: octopus 15.2.17 RGW daemons begin to crash regularly

2022-10-06 Thread Boris Behrens
Any ideas on this? On Sun, 2 Oct 2022 at 00:44, Boris Behrens wrote: > Hi, > we are experiencing that the rgw daemons crash and I don't understand why. > Maybe someone here can lead me to a point where I can dig further. > > { > "backtrace": [ > "(()+0x43090)

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan, to answer your question as well: > ... conversion from octopus to > pacific, and the resharding as well). We would save half the time by > compacting them before hand. It would take, in our case, many hours to > do a conversion, so it would pay off immensely. ... With experiments on
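As an illustration of the pre-conversion compaction being discussed, an offline RocksDB compaction can typically be run per OSD while it is stopped (OSD path illustrative):

  # ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact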

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor. > But could you please share full OSD startup log for any one which is > unable to restart after host reboot? Will do. I also would like to know what happened here and if it is possible to recover these OSDs. The rebuild takes ages with the current throttled recovery settings. >
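For illustration only, these are the kind of recovery/backfill throttle settings being referred to; the values shown are generic examples, not the poster's actual configuration:

  # ceph config set osd osd_max_backfills 1
  # ceph config set osd osd_recovery_max_active 1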

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Sorry - no clue about CephFS related questions... But could you please share full OSD startup log for any one which is unable to restart after host reboot? On 10/6/2022 5:12 PM, Frank Schilder wrote: Hi Igor and Stefan. Not sure why you're talking about replicated(!) 4(2) pool. Its

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman
On 10/6/22 16:12, Frank Schilder wrote: Hi Igor and Stefan. Not sure why you're talking about replicated(!) 4(2) pool. It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools. I have to

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor and Stefan. > > Not sure why you're talking about replicated(!) 4(2) pool. > It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools. I have to take this one back. It is indeed

[ceph-users] Can't delete or unprotect snapshot with rbd

2022-10-06 Thread Niklas Jakobsson
Hi, I have an issue with an rbd image that I can't delete. I have tried this: # rbd info images/f3f4c73f-2eec-4af1-9bdf-4974a747607b@snap rbd image 'f3f4c73f-2eec-4af1-9bdf-4974a747607b': size 8 GiB in 1024 objects order 23 (8 MiB objects) snapshot_count: 1 id:
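For reference, the generic deletion sequence for a protected snapshot looks roughly like this, using the image and snapshot names from this message; unprotect will refuse if clones still depend on the snapshot, which "rbd children" can reveal:

  # rbd children images/f3f4c73f-2eec-4af1-9bdf-4974a747607b@snap
  # rbd snap unprotect images/f3f4c73f-2eec-4af1-9bdf-4974a747607b@snap
  # rbd snap rm images/f3f4c73f-2eec-4af1-9bdf-4974a747607b@snap
  # rbd rm images/f3f4c73f-2eec-4af1-9bdf-4974a747607b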

[ceph-users] Re: How does client get the new active ceph-mgr endpoint when failover happens?

2022-10-06 Thread Burkhard Linke
Hi, are clients (as in ceph clients like rbd/cephfs/rgw) connected to the mgr at all? IMHO the clients only need to be able to connect to the mon (host list in ceph.conf / DNS SRV entries) and osd (osd map in mons). If certain clients need to retrieve data from the mgr (e.g. cephfs-top,
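A minimal sketch of the client-side bootstrap configuration being described, with hypothetical addresses; clients that do need the active mgr learn it from the mons rather than from a fixed mgr endpoint:

  [global]
  mon_host = 10.0.0.1,10.0.0.2,10.0.0.3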

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
On 10/6/2022 3:16 PM, Stefan Kooman wrote: On 10/6/22 13:41, Frank Schilder wrote: Hi Stefan, thanks for looking at this. The conversion has happened on 1 host only. Status is: - all daemons on all hosts upgraded - all OSDs on 1 OSD-host were restarted with

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
Are crashing OSDs still bound to two hosts? If not - does any dead OSD unconditionally mean its underlying disk is unavailable any more? On 10/6/2022 3:35 PM, Frank Schilder wrote: Hi Igor. Not sure why you're talking about replicated(!) 4(2) pool. It's because in the production cluster

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor. > Not sure why you're talking about replicated(!) 4(2) pool. It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools. I just lost another disk, we have PGs down now. I really hope the

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
On 10/6/2022 2:55 PM, Frank Schilder wrote: Hi Igor, it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly. Got it. By the way, there are 2 out of 4 OSDs

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman
On 10/6/22 13:41, Frank Schilder wrote: Hi Stefan, thanks for looking at this. The conversion has happened on 1 host only. Status is: - all daemons on all hosts upgraded - all OSDs on 1 OSD-host were restarted with bluestore_fsck_quick_fix_on_mount = true in its local ceph.conf, these OSDs

[ceph-users] 16.2.10: ceph osd perf always shows high latency for a specific OSD

2022-10-06 Thread Zakhar Kirpichenko
Hi, I'm having a peculiar "issue" in my cluster, which I'm not sure whether it's real: a particular OSD always shows significant latency in `ceph osd perf` report, an order of magnitude higher than any other OSD. I traced this OSD to a particular drive in a particular host. OSD logs don't look
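A hedged sketch of commands that may help distinguish a reporting artifact from a genuinely slow device; the OSD id is illustrative, and dump_historic_ops has to be run on the host where that OSD runs:

  # ceph osd perf
  # ceph daemon osd.12 dump_historic_ops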

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly. By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even
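Whether a PG can stay active with 2 of 4 replicas up generally depends on the pool's min_size; a quick check might look like this (pool name hypothetical):

  # ceph osd pool get <pool-name> size
  # ceph osd pool get <pool-name> min_size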

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
From your response to Stefan I'm getting that one of the two damaged hosts has all OSDs down and unable to start. Is that correct? If so you can reboot it with no problem and proceed with manual compaction [and other experiments] quite "safely" for the rest of the cluster. On 10/6/2022 2:35 PM,

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Stefan, thanks for looking at this. The conversion has happened on 1 host only. Status is: - all daemons on all hosts upgraded - all OSDs on 1 OSD-host were restarted with bluestore_fsck_quick_fix_on_mount = true in its local ceph.conf, these OSDs completed conversion and rebooted, I would
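For clarity, the per-host setting described above would look roughly like this in that host's local ceph.conf:

  [osd]
  bluestore_fsck_quick_fix_on_mount = true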

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, I can't access these drives. They have an OSD- or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well. I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly. After we have full redundancy

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
IIUC the OSDs that expose "had timed out after 15" are failing to start up. Is that correct, or did I miss something? I meant trying compaction for them... On 10/6/2022 2:27 PM, Frank Schilder wrote: Hi Igor, thanks for your response. And what's the target Octopus release? ceph version

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Stefan Kooman
On 10/6/22 13:06, Frank Schilder wrote: Hi all, we are stuck with a really unpleasant situation and we would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi Igor, thanks for your response. > And what's the target Octopus release? ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an on-line compaction. I would really appreciate a

[ceph-users] Re: How does client get the new active ceph-mgr endpoint when failover happens?

2022-10-06 Thread Janne Johansson
> Thanks for the quick response! > > What if the node is down? The client cannot even connect to the mgr. Then this mgr would not be in the list of possible mgrs to connect to at all. -- May the most significant bit of your life be positive.

[ceph-users] Re: OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Igor Fedotov
And what's the target Octopus release? On 10/6/2022 2:06 PM, Frank Schilder wrote: Hi all, we are stuck with a really unpleasant situation and we would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with

[ceph-users] OSD crashes during upgrade mimic->octopus

2022-10-06 Thread Frank Schilder
Hi all, we are stuck with a really unpleasant situation and we would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP conversion today in the morning. Everything went

[ceph-users] Re: ceph on kubernetes

2022-10-06 Thread Clyso GmbH - Ceph Foundation Member
Hello Oğuz, we have been supporting several rook/ceph clusters in the hyperscalers for years, including Azure. A few quick notes: * you can be prepared to run into some issues with the default config of the osds. * in Azure, there is the issue with the quality of the network in some

[ceph-users] Re: How does client get the new active ceph-mgr endpoint when failover happens?

2022-10-06 Thread Janne Johansson
On Thu, 6 Oct 2022 at 10:40, Zhongzhou Cai wrote: > Hi folks, > I have ceph-mgr bootstrapped on three nodes, and they are running in HA. > When the active mgr node goes down, it will fail over to one of the > standbys. I'm wondering if there is a way for the client to be aware of the > leadership

[ceph-users] How does client get the new active ceph-mgr endpoint when failover happens?

2022-10-06 Thread Zhongzhou Cai
Hi folks, I have ceph-mgr bootstrapped on three nodes, and they are running in HA. When the active mgr node goes down, it will fail over to one of the standbys. I'm wondering if there is a way for the client to be aware of the leadership change and connect to the new active mgr? Do I need to set
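For reference, a sketch of querying the mons for the currently active mgr; the exact output format may vary by release:

  # ceph mgr stat
  # ceph mgr dump | grep active_name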

[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-06 Thread Janne Johansson
> Hello > > As previously described here, we have a full-flash NVME ceph cluster (16.2.6) > with currently only the cephfs service configured. [...] > We noticed that cephfs_metadata pool had only 16 PG, we have set > autoscale_mode to off and increased the number of PG to 256 and with this > change,

[ceph-users] MDS Performance and PG/PGP value

2022-10-06 Thread Yoann Moulin
Hello As previously described here, we have a full-flash NVME ceph cluster (16.2.6) with currently only the cephfs service configured. The current setup is 54 nodes with 1 NVME each, 2 partitions for each NVME. 8 MDSs (7 active, 1 standby) MDS cache memory limit to 128GB. It's a hyperconverged
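For reference, the change described might have been applied roughly like this (pool name taken from the message); on recent releases pgp_num is adjusted to follow pg_num automatically:

  # ceph osd pool set cephfs_metadata pg_autoscale_mode off
  # ceph osd pool set cephfs_metadata pg_num 256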