[ceph-users] Re: ceph fs service outage: currently failed to authpin, subtree is being exported

2021-09-19 Thread Dan van der Ster
Hi Frank, The only time I've seen something like this was when we accidentally changed a subtree pin to a different rank, causing a huge amount of MDS export work to be queued up between MDSs. In that incident, we just waited until it completed... It took around 30 minutes, after which all the log s
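
For context, a rough sketch of how subtree pins and export activity can be inspected with standard CephFS tooling (the MDS name, mount path and rank below are placeholders, not taken from this thread):

  # list the subtrees an active MDS currently holds, including exports in progress
  ceph daemon mds.<name> get subtrees
  # show the explicit pin on a directory, if one was ever set
  getfattr -n ceph.dir.pin /mnt/cephfs/some/dir
  # pin a directory tree to a rank; -v -1 removes the pin again
  setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/some/dir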

[ceph-users] ceph fs service outage: currently failed to authpin, subtree is being exported

2021-09-19 Thread Frank Schilder
Good day. Our file system is out of operation (mimic 13.2.10). Our MDSes are choking on an operation: 2021-09-19 02:23:36.432664 mon.ceph-01 mon.0 192.168.32.65:6789/0 185676 : cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST) [...] 2021-09-19 02:23:34.909269 m
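
A minimal sketch of the usual first checks for MDS_SLOW_REQUEST, assuming admin-socket access to the affected MDS (the daemon name is a placeholder):

  # which MDS reports the slow requests, and details of the health warning
  ceph health detail
  # dump the operations currently stuck on that MDS, including their flag points
  ceph daemon mds.<name> dump_ops_in_flight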

[ceph-users] Re: ceph fs service outage: currently failed to authpin, subtree is being exported

2021-09-19 Thread Frank Schilder
It looks similar to these: https://tracker.ceph.com/issues/39987 [The user reporting this was me as well] https://tracker.ceph.com/issues/42338 Issue 39987 was fixed a long time ago by Zheng Yan. A search for "currently failed to authpin, subtree is being exported" only returns hits regarding th

[ceph-users] rocksdb corruption with 16.2.6

2021-09-19 Thread Andrej Filipcic
Hi, after upgrading the cluster from 16.2.5 to 16.2.6, several OSDs crashed and refuse to start due to RocksDB corruption, e.g.: 2021-09-19T15:47:10.611+0200 7f8bc1f0e700  4 rocksdb: [compaction/compaction_job.cc:1680] [default] Compaction start summary: Base version 6 Base level 0,
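
A sketch of the usual offline diagnostics for an OSD that will not start, assuming the default data path and that the OSD daemon is stopped (take a backup first; the repair step is a last resort):

  # consistency check of the BlueStore OSD, including its embedded RocksDB
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>
  # attempt a RocksDB-level repair on the same OSD
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> repair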

[ceph-users] Re: ceph fs service outage: currently failed to authpin, subtree is being exported

2021-09-19 Thread Frank Schilder
Hi Dan, thanks for looking at this. I had to take action and restarted 2 of our 4 active MDSes, which flushed out the stuck operations. I'm pretty sure it was a real deadlock; most clients were blocked already. There was almost no fs activity and MDS CPU usage was below 2%. We are running a ve
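
For reference, a sketch of the two common ways to bounce an active MDS so a standby takes over (rank and hostname are placeholders):

  # ask the monitors to fail a specific rank; a standby will replace it
  ceph mds fail <rank>
  # or restart the daemon directly on its host
  systemctl restart ceph-mds@<hostname>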

[ceph-users] Re: debugging radosgw sync errors

2021-09-19 Thread Boris Behrens
I just deleted the rados object from .rgw.data.root and this removed the bucket.instance, but this did not solve the problem. It looks like there is some access error when I try to run radosgw-admin metadata sync init: I get a 403 HTTP response code on the POST to the /admin/realm/period endpoint. I chec
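
A minimal sketch of the sync diagnostics one would typically run here, assuming a standard multisite setup (endpoint and keys are placeholders; a 403 on the period endpoint usually points at the system user's credentials configured for the zone):

  # overall multisite sync state as seen from this zone
  radosgw-admin sync status
  # metadata sync state in more detail
  radosgw-admin metadata sync status
  # re-pull the current period from the master zone with explicit credentials
  radosgw-admin period pull --url=<master-zone-endpoint> --access-key=<key> --secret=<secret>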

[ceph-users] Re: Adding cache tier to an existing objectstore cluster possible?

2021-09-19 Thread Zakhar Kirpichenko
Hi, You can add or remove the cache tier at any time; there's no problem with that. The problem is that the cache tier doesn't work well. I tried it in front of replicated and EC pools with very mixed results: when it worked, there wasn't as much of a speed/latency benefit as one would expect from NVMe
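
For reference, a sketch of how a cache tier is attached and later removed again (pool names are placeholders; the flush/evict step can take a long time on a full cache):

  # attach a cache pool in front of a base pool
  ceph osd tier add <basepool> <cachepool>
  ceph osd tier cache-mode <cachepool> writeback
  ceph osd tier set-overlay <basepool> <cachepool>
  # removal: stop caching new writes, flush everything, then detach
  ceph osd tier cache-mode <cachepool> proxy
  rados -p <cachepool> cache-flush-evict-all
  ceph osd tier remove-overlay <basepool>
  ceph osd tier remove <basepool> <cachepool>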

[ceph-users] PGs stuck in unkown state

2021-09-19 Thread Mr. Gecko
Hello, I'll start by explaining what I have done. I was adding some new storage in an attempt to set up a cache pool according to https://docs.ceph.com/en/latest/dev/cache-pool/ by doing the following. 1. I upgraded all servers in the cluster to Ceph 15.2.14, which put the system into recovery for ou
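
A sketch of the usual checks for PGs stuck in an unknown or inactive state after adding pools and devices (pool name and PG id are placeholders):

  # list PGs that have been stuck for a while
  ceph pg dump_stuck inactive
  # see which OSDs a stuck PG maps to; an empty up/acting set points at CRUSH
  ceph pg map <pgid>
  # check that the pool's CRUSH rule can actually place replicas on the new devices
  ceph osd pool get <pool> crush_rule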

[ceph-users] Re: Adding cache tier to an existing objectstore cluster possible?

2021-09-19 Thread Eugen Block
And we are quite happy with our cache tier. When we got new HDD OSDs we tested whether things would improve without the tier, but we had to stick with it; otherwise working with our VMs was almost impossible. But this is an RBD cache, so I can't tell how the other protocols perform with a cache tier

[ceph-users] Re: HEALTH_WARN: failed to probe daemons or devices after upgrade to 16.2.6

2021-09-19 Thread Eugen Block
Hi, Yes! I did play with another cluster before and forgot to completely clear that node, and the fsid "46e2b13c-dab7-11eb-810b-a5ea707f1ea1" is from that cluster. But then there is an error in Ceph, because the mon the existing cluster complained about (with fsid "1ef45b26-dbac-11eb-a357-61
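
A sketch of how the leftover cluster id on that node could be confirmed and cleaned up with cephadm (the rm-cluster step removes everything belonging to that fsid on the node, so only run it if those leftovers are truly unused):

  # fsid of the cluster this host's ceph.conf actually points at
  ceph fsid
  # daemons cephadm still knows about on this node, including ones from the old cluster
  cephadm ls
  # remove the old cluster's remnants from this node
  cephadm rm-cluster --fsid 46e2b13c-dab7-11eb-810b-a5ea707f1ea1 --force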