[ceph-users] Re: pg deep-scrub issue

2023-05-04 Thread Janne Johansson
> undergo deep scrub and regular scrub cannot be completed in a timely manner. I have noticed that these PGs appear to be concentrated on a single OSD. I am seeking your guidance on how to address this issue and would appreciate any insights or suggestions you may have. > The usual "see if

[ceph-users] Re: Unable to restart mds - mds crashes almost immediately after finishing recovery

2023-05-04 Thread Xiubo Li
Hi Emmanuel, This should be a known issue, see https://tracker.ceph.com/issues/58392, and there is a fix in https://github.com/ceph/ceph/pull/49652. Could you stop all the clients first, then set 'max_mds' to 1, and then restart the MDS daemons? Thanks On 5/3/23 16:01,
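
As a rough sketch of that sequence (assuming a file system named "cephfs" and a cephadm-managed cluster; both are placeholders, not taken from the thread):

    # stop or unmount all CephFS clients first, then reduce to a single active MDS
    ceph fs set cephfs max_mds 1
    # wait until only rank 0 is active, then restart the MDS daemons
    ceph fs status cephfs
    ceph orch restart mds.cephfs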

[ceph-users] Re: client isn't responding to mclientcaps(revoke), pending pAsLsXsFsc issued pAsLsXsFsc

2023-05-04 Thread Xiubo Li
On 5/1/23 17:35, Frank Schilder wrote: Hi all, I think we might be hitting a known problem (https://tracker.ceph.com/issues/57244). I don't want to fail the mds yet, because we have troubles with older kclients that miss the mds restart and hold on to cache entries referring to the killed
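
For reference, a hedged sketch of how the client holding the un-released caps is usually identified (the mds target "cephfs:0" is an assumed placeholder):

    # the health warning names the client id; session ls shows its mount point and client version
    ceph health detail
    ceph tell mds.cephfs:0 session ls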

[ceph-users] Change in DMARC handling for the list

2023-05-04 Thread Dan Mick
Several users have complained for some time that our DMARC/DKIM handling is not correct. I've recently had time to go study DMARC, DKIM, SPF, SRS, and other tasty morsels of initialisms, and have thus made a change to how Mailman handles DKIM signatures for the list: If a domain advertises

[ceph-users] pg deep-scrub issue

2023-05-04 Thread Peter
Dear all, I am writing to seek your assistance in resolving an issue with my Ceph cluster. Currently, the cluster is experiencing a problem where a number of Placement Groups (PGs) that need to undergo deep scrub and regular scrub cannot be scrubbed in a timely manner. I have noticed that
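
A few commands commonly used to size up such a scrub backlog (a sketch, not from this thread; option names assume a recent release):

    # PGs reported as behind on (deep) scrubbing
    ceph health detail | grep -i scrubbed
    # concurrent scrub limit per OSD and the allowed scrub time window
    ceph config get osd osd_max_scrubs
    ceph config get osd osd_scrub_begin_hour
    ceph config get osd osd_scrub_end_hour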

[ceph-users] Re: CephFS Scrub Questions

2023-05-04 Thread Patrick Donnelly
On Thu, May 4, 2023 at 11:35 AM Chris Palmer wrote: > > Hi > > Grateful if someone could clarify some things about CephFS Scrubs: > > 1) Am I right that a command such as "ceph tell mds.cephfs:0 scrub start > / recursive" only triggers a forward scrub (not a backward scrub)? The naming here that
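
For context, a minimal sketch of the forward-scrub commands under discussion (file system name taken from the quoted command):

    # start a recursive forward scrub from the root of the file system
    ceph tell mds.cephfs:0 scrub start / recursive
    # check progress and results
    ceph tell mds.cephfs:0 scrub status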

[ceph-users] Re: 16.2.13 pacific QE validation status

2023-05-04 Thread Radoslaw Zarzynski
If we get some time, I would like to include: https://github.com/ceph/ceph/pull/50894. Regards, Radek On Thu, May 4, 2023 at 5:56 PM Venky Shankar wrote: > > Hi Yuri, > > On Wed, May 3, 2023 at 7:10 PM Venky Shankar wrote: > > > > On Tue, May 2, 2023 at 8:25 PM Yuri Weinstein wrote: > > >

[ceph-users] Re: 16.2.13 pacific QE validation status

2023-05-04 Thread Yuri Weinstein
In summary: Release Notes: https://github.com/ceph/ceph/pull/51301. We plan to finish this release next week and have the following PRs planned to be added:
https://github.com/ceph/ceph/pull/51232 -- Venky approved
https://github.com/ceph/ceph/pull/51344 -- Venky in progress

[ceph-users] Re: Radosgw: ssl_private_key could not find the file even if it existed

2023-05-04 Thread Janne Johansson
On Thu, May 4, 2023 at 17:07, wrote: > The radosgw has been configured like this: [client.rgw.ceph1] host = ceph1 rgw_frontends = beast port=8080 ssl_port=443 ssl_certificate=/root/ssl/ca.crt ssl_private_key=/root/ssl/ca.key #rgw_frontends = beast port=8080 ssl_port=443

[ceph-users] Re: 16.2.13 pacific QE validation status

2023-05-04 Thread Venky Shankar
Hi Yuri, On Wed, May 3, 2023 at 7:10 PM Venky Shankar wrote: > > On Tue, May 2, 2023 at 8:25 PM Yuri Weinstein wrote: > > > > Venky, I did plan to cherry-pick this PR if you approve this (this PR > > was used for a rerun) > > OK. The fs suite failure is being looked into >

[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2023-05-04 Thread Adam King
For setting the user, the `ceph cephadm set-user` command should do it. I'm a bit surprised by the second part of that, though. With passwordless sudo access I would have expected that to start working. On Thu, May 4, 2023 at 11:27 AM Reza Bakhshayeshi wrote: > Thank you. > I don't see any more errors
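
A short sketch of the command referred to above (user and host names are placeholders):

    # tell cephadm which SSH user to use when managing hosts
    ceph cephadm set-user deployuser
    # re-check connectivity to the affected host afterwards
    ceph cephadm check-host host1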

[ceph-users] CephFS Scrub Questions

2023-05-04 Thread Chris Palmer
Hi Grateful if someone could clarify some things about CephFS Scrubs: 1) Am I right that a command such as "ceph tell mds.cephfs:0 scrub start / recursive" only triggers a forward scrub (not a backward scrub)? 2) I couldn't find any reference to forward scrubs being done automatically and

[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2023-05-04 Thread Reza Bakhshayeshi
Thank you. I don't see any more errors other than:
2023-05-04T15:07:38.003+ 7ff96cbe0700 0 log_channel(cephadm) log [DBG] : Running command: sudo which python3
2023-05-04T15:07:38.025+ 7ff96cbe0700 0 log_channel(cephadm) log [DBG] : Connection to host1 failed. Process exited with

[ceph-users] Radosgw: ssl_private_key could not find the file even if it existed

2023-05-04 Thread viplanghe6
The radosgw has been configured like this:
[client.rgw.ceph1]
host = ceph1
rgw_frontends = beast port=8080 ssl_port=443 ssl_certificate=/root/ssl/ca.crt ssl_private_key=/root/ssl/ca.key
#rgw_frontends = beast port=8080 ssl_port=443 ssl_certificate=/root/ssl/ca.crt
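
For comparison, a hedged sketch of a beast SSL frontend configuration. The certificate paths below are placeholders; they must be readable by the radosgw process (and, on containerized deployments, present inside the container), which is a common cause of "file not found" errors for keys kept under /root:

    [client.rgw.ceph1]
    host = ceph1
    # cert/key paths are placeholders; they must exist in the daemon's own namespace
    rgw_frontends = beast port=8080 ssl_port=443 ssl_certificate=/etc/ceph/rgw.crt ssl_private_key=/etc/ceph/rgw.key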

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
I uploaded the output there: https://nextcloud.widhalm.or.at/nextcloud/s/FCqPM8zRsix3gss IP 192.168.23.62 is one of my OSDs that were still booting when the reconnect tries happened. What makes me wonder is that it's the only one listed when there are a few similar ones in the cluster. On

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Adam King
What specifically does `ceph log last 200 debug cephadm` spit out? The log lines you've posted so far I don't think are generated by the orchestrator, so I'm curious what the last actions it took were (and how long ago). On Thu, May 4, 2023 at 10:35 AM Thomas Widhalm wrote: > To completely rule out

[ceph-users] Re: Frequent calling monitor election

2023-05-04 Thread Frank Schilder
Hi all, there was another election after about 2 hours. Trying the stop+reboot procedure on another mon now. Just for the record, I observe that when I stop one mon, another goes down as a consequence:
[root@ceph-02 ~]# docker stop ceph-mon
ceph-mon
[root@ceph-02 ~]# ceph status
  cluster:

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Frank Schilder
Yep, reading but not using LRC. Please keep it on the ceph user list for future reference -- thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Thursday, May 4, 2023 3:07 PM To: ceph-users@ceph.io

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
To completely rule out hung processes, I managed to get another short shutdown. Now I'm seeing lots of: mgr.server handle_open ignoring open from mds.mds01.ceph01.usujbi v2:192.168.23.61:6800/2922006253; not ready for session (expect reconnect) mgr finish mon failed to return metadata for

[ceph-users] Re: pg upmap primary

2023-05-04 Thread Dan van der Ster
Hello, After you delete the OSD, the now "invalid" upmap rule will be automatically removed. Cheers, Dan __ Clyso GmbH | https://www.clyso.com On Wed, May 3, 2023 at 10:13 PM Nguetchouang Ngongang Kevin wrote: > > Hello, I have a question: what happens when I
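
To verify, the current upmap entries can be listed before and after removing the OSD (a sketch; the rm command applies only to releases that support primary upmaps):

    # show pg_upmap_items and pg_upmap_primary entries in the osdmap
    ceph osd dump | grep -i upmap
    # remove a primary-upmap entry by hand if ever needed (Reef or later)
    ceph osd rm-pg-upmap-primary <pgid>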

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Hi, What I'm seeing a lot is this: "[stats WARNING root] cmdtag not found in client metadata" Can't make anything of it but I guess it's not showing the initial issue. Now that I think of it - I started the cluster with 3 nodes which are now only used as OSD. Could it be there's something

[ceph-users] Re: Best practice for expanding Ceph cluster

2023-05-04 Thread huxia...@horebdata.cn
Dear Josh, Thanks a lot. Your clarification really gives me much courage on using pgmap tool set for re-balancing. best regards, Samuel huxia...@horebdata.cn From: Josh Baergen Date: 2023-05-04 15:46 To: huxia...@horebdata.cn CC: Janne Johansson; ceph-users Subject: Re: [ceph-users] Re:

[ceph-users] Re: Best practice for expanding Ceph cluster

2023-05-04 Thread Josh Baergen
Hi Samuel, Both pgremapper and the CERN scripts were developed against Luminous, and in my experience 12.2.13 has all of the upmap patches needed for the scheme that Janne outlined to work. However, if you have a complex CRUSH map sometimes the upmap balancer can struggle, and I think that's true
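
For reference, a minimal sketch of enabling the upmap balancer once all clients speak at least Luminous (general commands, not specific to this thread):

    # upmap requires luminous-capable clients
    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status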

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Thanks. I set the log level to debug, try a few steps and then come back. On 04.05.23 14:48, Eugen Block wrote: Hi, try setting debug logs for the mgr: ceph config set mgr mgr/cephadm/log_level debug This should provide more details what the mgr is trying and where it's failing, hopefully.

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Thanks for the reply. "Refreshed" is "3 weeks ago" on most lines. The running mds and osd.cost_capacity are both "-" in this column. I'm already done with "mgr fail", that didn't do anything. And I even tried a complete shutdown during a maintenance window that was not 3 weeks ago but

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Eugen Block
Hi, I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread or nobody reading this uses the LRC plugin. ;-) Thanks, Eugen Zitat von Michel Jouvin : Hi, I had to restart one of my OSD server today and the problem showed up

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Adam King
First thing I always check when it seems like orchestrator commands aren't doing anything is "ceph orch ps" and "ceph orch device ls" and check the REFRESHED column. If it's well above 10 minutes for orch ps or 30 minutes for orch device ls, then it means the orchestrator is most likely hanging on
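
A sketch of those checks:

    # REFRESHED far above ~10 min (ps) / ~30 min (device ls) suggests a hung refresh
    ceph orch ps
    ceph orch device ls
    # recent orchestrator activity
    ceph log last 200 debug cephadm
    # failing over the mgr often clears a hung refresh
    ceph mgr fail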

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Eugen Block
Hi, try setting debug logs for the mgr: ceph config set mgr mgr/cephadm/log_level debug This should provide more details on what the mgr is trying and where it's failing, hopefully. Last week this helped me identify an issue on a lower pacific release. Do you see anything in the
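
A small follow-up sketch (not from the thread): read the cephadm log channel after raising the level, and reset it when done:

    ceph config set mgr mgr/cephadm/log_level debug
    ceph log last 100 debug cephadm
    # revert once the problem is understood
    ceph config set mgr mgr/cephadm/log_level info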

[ceph-users] Orchestration seems not to work

2023-05-04 Thread Thomas Widhalm
Hi, I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6 but the following problem existed when I was still everywhere on 17.2.5 . I had a major issue in my cluster which could be solved with a lot of your help and even more trial and error. Right now it seems that most is

[ceph-users] Re: Frequent calling monitor election

2023-05-04 Thread Frank Schilder
Hi all, I think I can reduce the defcon level a bit. Since I couldn't see anything in the mon log, I started testing whether it's a specific mon that causes trouble by shutting them down one by one for a while. I got lucky at the first try. Shutting down the leader stopped the voting from happening. I
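
A hedged sketch of how to see which mon currently leads and how many elections have happened (field names per the standard quorum_status output):

    ceph mon stat
    # quorum_leader_name and election_epoch appear in the JSON output
    ceph quorum_status -f json-pretty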

[ceph-users] Re: Best practice for expanding Ceph cluster

2023-05-04 Thread huxia...@horebdata.cn
Janne, thanks a lot for the detailed scheme. I totally agree that the upmap approach would be one of the best methods; however, my current cluster is running Luminous 12.2.13 and upmap seems not to work reliably on Luminous. samuel huxia...@horebdata.cn From: Janne Johansson Date:

[ceph-users] Re: rbd map: corrupt full osdmap (-22) when

2023-05-04 Thread Ilya Dryomov
On Thu, May 4, 2023 at 11:27 AM Kamil Madac wrote: > > Thanks for the info. > > As a solution we used rbd-nbd which works fine without any issues. If we will > have time we will also try to disable ipv4 on the cluster and will try kernel > rbd mapping again. Are there any disadvantages when

[ceph-users] Re: Frequent calling monitor election

2023-05-04 Thread Frank Schilder
Hi all, I have to get back to this case. On Monday I had to restart an MDS to get rid of a stuck client caps recall. Right after that fail-over, the MONs went into a voting frenzy again. I already restarted all of them like last time, but this time this doesn't help. I might be in a different

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-05-04 Thread Michel Jouvin
Hi, I had to restart one of my OSD server today and the problem showed up again. This time I managed to capture "ceph health detail" output showing the problem with the 2 PGs: [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down     pg 56.1 is down, acting
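
To dig further into why a specific PG stays down, the usual starting points are (PG id taken from the output above):

    # full peering/recovery state of the affected PG
    ceph pg 56.1 query
    # which OSDs the PG maps to
    ceph pg map 56.1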

[ceph-users] Re: Best practice for expanding Ceph cluster

2023-05-04 Thread Janne Johansson
On Thu, May 4, 2023 at 10:39, huxia...@horebdata.cn wrote: > Dear Ceph folks, I am writing to ask for advice on best practices for expanding a Ceph cluster. We are running an 8-node Ceph cluster and RGW, and would like to add another 10 nodes, each of which has 10x 12TB HDDs. The current

[ceph-users] Re: rbd map: corrupt full osdmap (-22) when

2023-05-04 Thread Kamil Madac
Thanks for the info. As a solution we used rbd-nbd which works fine without any issues. If we will have time we will also try to disable ipv4 on the cluster and will try kernel rbd mapping again. Are there any disadvantages when using NBD instead of kernel driver? Thanks On Wed, May 3, 2023 at
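
For reference, a sketch of both mapping variants (pool/image names are placeholders). The trade-offs usually cited: rbd-nbd runs in userspace, so it supports the newest image features but adds a daemon and some overhead, while krbd is in-kernel and faster but its feature support depends on the kernel version:

    # userspace mapping via the NBD driver (requires rbd-nbd installed)
    rbd device map --device-type nbd mypool/myimage
    # in-kernel mapping (the default device type)
    rbd device map mypool/myimage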

[ceph-users] Re: 16.2.13 pacific QE validation status

2023-05-04 Thread Guillaume Abrioux
ceph-volume approved https://jenkins.ceph.com/job/ceph-volume-test/553/ On Wed, 3 May 2023 at 22:43, Guillaume Abrioux wrote: > The failure seen in ceph-volume tests isn't related. > That being said, it needs to be fixed to have a better view of the current > status. > > On Wed, 3 May 2023 at

[ceph-users] Best practice for expanding Ceph cluster

2023-05-04 Thread huxia...@horebdata.cn
Dear Ceph folks, I am writing to ask for advice on best practices for expanding a Ceph cluster. We are running an 8-node Ceph cluster and RGW, and would like to add another 10 nodes, each of which has 10x 12TB HDDs. The current 8 nodes hold ca. 400TB of user data. I am wondering whether to add 10 nodes

[ceph-users] Re: MDS "newly corrupt dentry" after patch version upgrade

2023-05-04 Thread Janek Bevendorff
After running the tool for 11 hours straight, it exited with the following exception:
Traceback (most recent call last):
  File "/home/webis/first-damage.py", line 156, in <module>
    traverse(f, ioctx)
  File "/home/webis/first-damage.py", line 84, in traverse
    for (dnk, val) in it:
  File

[ceph-users] Re: MDS crash on FAILED ceph_assert(cur->is_auth())

2023-05-04 Thread Peter van Heusden
Hi Emmaneul It was a while ago, but as I recall I evicted all clients and that allowed me to restart the MDS servers. There was something clearly "broken" in how at least one of the clients was interacting with the system. Peter On Thu, 4 May 2023 at 07:18, Emmanuel Jaep wrote: > Hi, > > did