[ceph-users] It takes a long time for a newly added OSD to boot to the up state due to heavy RocksDB activity

2020-08-11 Thread Jerry Pu
Hi All, We had a cluster (v13.2.4) with 32 OSDs in total. At first, an OSD (osd.18) in the cluster was down, so we tried to remove it and add a new one (osd.32) with a new ID. We unplugged the disk (osd.18), plugged a new disk into the same slot, and added osd.32 to the cluster. Then osd.32 was
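
A typical replacement sequence for this scenario, sketched with assumed values (failed OSD id 18, new device /dev/sdX); the exact commands used in the thread are not shown:

    ceph osd out osd.18                        # stop data placement to the failed OSD
    systemctl stop ceph-osd@18                 # on the OSD host, if the daemon is still running
    ceph osd purge 18 --yes-i-really-mean-it   # remove it from the CRUSH map, OSD map and auth
    ceph-volume lvm create --data /dev/sdX     # provision the replacement disk; it gets a new id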

[ceph-users] Re: Remapped PGs

2020-08-11 Thread ceph
Hi, I am not sure, but perhaps this could be an effect of the "balancer" module - if you use it!? HTH, Mehmet. On 10 August 2020 17:28:27 MESZ, David Orman wrote: >We've gotten a bit further: after evaluating how this remapped count was determined (pg_temp), we've found the PGs counted as being
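
A quick way to check whether the balancer is involved and to inspect the pg_temp mappings behind the "remapped" count (a sketch; output will differ per cluster):

    ceph balancer status             # shows whether the balancer is on and in which mode
    ceph osd dump | grep pg_temp     # lists the temporary PG mappings that count as remapped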

[ceph-users] v14.2.11 Nautilus released

2020-08-11 Thread Abhishek Lekshmanan
We're happy to announce the availability of the eleventh release in the Nautilus series. This release brings a number of bugfixes across all major components of Ceph. We recommend that all Nautilus users upgrade to this release. Notable Changes --- * RGW: The `radosgw-admin`

[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-11 Thread Kevin Myers
Replica count of 2 is a sure-fire way to a crisis! Sent from my iPad > On 11 Aug 2020, at 18:45, Martin Palma wrote: > Hello, after an unexpected power outage our production cluster has 5 PGs inactive and incomplete. The OSDs on which these 5 PGs are located all show "stuck requests
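
For reference, raising a pool's replication from 2 to 3 is a per-pool setting (a sketch; "mypool" is a placeholder name):

    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2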

[ceph-users] 5 pgs inactive, 5 pgs incomplete

2020-08-11 Thread Martin Palma
Hello, after an unexpected power outage our production cluster has 5 PGs inactive and incomplete. The OSDs on which these 5 PGs are located all show "stuck requests are blocked": Reduced data availability: 5 pgs inactive, 5 pgs incomplete; 98 stuck requests are blocked > 4096 sec. Implicated
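
Commands commonly used to narrow down which PGs and OSDs are blocking (a sketch; the PG id is a placeholder):

    ceph health detail               # lists the inactive/incomplete PGs and implicated OSDs
    ceph pg dump_stuck inactive      # stuck PGs with their acting sets
    ceph pg 2.1a query               # per-PG peering detail, e.g. what it is waiting for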

[ceph-users] Re: OSD memory leak?

2020-08-11 Thread Frank Schilder
Hi Mark, here is a first collection of heap profiling data (valid 30 days): https://files.dtu.dk/u/53HHic_xx5P1cceJ/heap_profiling-2020-08-03.tgz?l This was collected with the following config settings: osd dev osd_memory_cache_min 805306368 osd
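
Heap data like this is normally gathered with the tcmalloc profiler built into the OSDs; a sketch of the usual commands, using osd.0 as an example:

    ceph tell osd.0 heap start_profiler   # begin collecting heap profiles
    ceph tell osd.0 heap dump             # write a heap dump next to the OSD logs
    ceph tell osd.0 heap stats            # print tcmalloc heap statistics
    ceph tell osd.0 heap stop_profiler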

[ceph-users] Re: ceph orch host rm seems to just move daemons out of cephadm, not remove them

2020-08-11 Thread pixel fairy
Tried removing the daemon first, and that kinda blew up: ceph orch daemon rm --force mon.tempmon; ceph orch host rm tempmon. Now there are two problems. 1. Ceph is still looking for it: services: mon: 4 daemons, quorum ceph1,ceph2,ceph3 (age 3s), out of quorum: tempmon mgr:
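
If the removed monitor is still listed as out of quorum, the usual follow-up is to drop it from the monmap as well (a sketch, assuming the mon is named tempmon):

    ceph mon remove tempmon     # remove the stale monitor from the monmap
    ceph -s                     # quorum should now list only the remaining mons
    ceph orch ps                # confirm cephadm no longer tracks the daemon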

[ceph-users] Re: Speeding up reconnection

2020-08-11 Thread William Edwards
> Hi, you can change the MDS setting to be less strict [1]. According to [1] the default is 300 seconds until eviction. Maybe give the less strict option a try? Thanks for your reply. I already set mds_session_blacklist_on_timeout to false. This seems to have helped somewhat, but
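
A quick way to confirm what the MDS is actually using for this option (a sketch; requires the centralized config database, Mimic or later):

    ceph config get mds mds_session_blacklist_on_timeout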

[ceph-users] Single node all-in-one install for testing

2020-08-11 Thread Richard W.M. Jones
I have one spare machine with a single 1TB disk on it, and I'd like to test a local Ceph install. This is just for testing; I don't care that it won't have redundancy, failover, etc. Is there any canonical documentation for this case? - - - The longer story is that this morning I found this
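
One possible route for a throwaway single-node cluster (a sketch using cephadm; the IP address and the size-1 pool settings are assumptions for a test-only setup):

    cephadm bootstrap --mon-ip 192.168.1.10
    ceph config set global osd_pool_default_size 1        # no redundancy, test only
    ceph config set global osd_pool_default_min_size 1
    ceph config set global mon_allow_pool_size_one true   # may be required on recent releases
    ceph orch apply osd --all-available-devices           # turn the spare disk into an OSD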

[ceph-users] Announcing go-ceph v0.5.0

2020-08-11 Thread John Mulligan
I'm happy to announce another release of the go-ceph API bindings. This is a regular release following our every-two-months release cadence. https://github.com/ceph/go-ceph/releases/tag/v0.5.0 The bindings aim to play a similar role to the "pybind" Python bindings in the Ceph tree, but for
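
The bindings are consumed like any Go module (a sketch; building them needs cgo and the Ceph development headers, e.g. librados-dev/librbd-dev):

    go get github.com/ceph/go-ceph@v0.5.0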

[ceph-users] Speeding up reconnection

2020-08-11 Thread William Edwards
Hello, when the connection between the kernel client and the MDS is lost, a few things happen: 1. Caps become stale: Aug 11 11:08:14 admin-cap kernel: [308405.227718] ceph: mds0 caps stale 2. The MDS evicts the client for being unresponsive: MDS log: 2020-08-11 11:12:08.923 7fd1f45ae700  0 log_channel(cluster) log
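
To watch the MDS's view of the affected session while this happens, something like the following can be used (a sketch; the MDS rank is a placeholder):

    ceph tell mds.0 session ls    # client sessions and their state as seen by the MDS
    ceph osd blacklist ls         # shows whether the evicted client is currently blacklisted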

[ceph-users] Re: pg stuck in unknown state

2020-08-11 Thread Michael Thomas
On 8/11/20 2:52 AM, Wido den Hollander wrote: On 11/08/2020 00:40, Michael Thomas wrote: On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state.  It appears to belong to the device_health_metrics pool, which was created automatically by the
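
For a PG stuck in "unknown", a useful first step is to check which OSDs it maps to and whether any of them answers for it (a sketch; the PG id is a placeholder):

    ceph pg map 1.0      # up/acting set for the PG
    ceph pg 1.0 query    # hangs or errors if no OSD currently reports the PG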

[ceph-users] Ceph not warning about clock skew on an OSD-only host?

2020-08-11 Thread Matthew Vernon
Hi, our production cluster runs Luminous. Yesterday, one of our OSD-only hosts came up with its clock about 8 hours wrong(!), having been out of the cluster for a week or so. Initially, Ceph seemed entirely happy, and then after an hour or so it all went south (OSDs started logging about bad
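
The mon-side skew check can be inspected with the following (a sketch); note it only covers the monitors, which matches the behaviour described here for an OSD-only host:

    ceph time-sync-status    # per-mon clock skew as seen by the monitor cluster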

[ceph-users] Re: Speeding up reconnection

2020-08-11 Thread Eugen Block
Hi, you can change the MDS setting to be less strict [1]: It is possible to respond to slow clients by simply dropping their MDS sessions, but permit them to re-open sessions and permit them to continue talking to OSDs. To enable this mode, set mds_session_blacklist_on_timeout to false on
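
On releases with the centralized config database the setting can be applied cluster-wide without editing ceph.conf (a sketch):

    ceph config set mds mds_session_blacklist_on_timeout false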

[ceph-users] Re: pgs not deep scrubbed in time - false warning?

2020-08-11 Thread Dirk Sarpe
Of course I found the cause shortly after sending the message … The scrubbing parameters need to move from the [osd] section to the [global] section, see https://www.suse.com/support/kb/doc/?id=19621 Health is back to OK after restarting the OSDs, MONs and MGRs. Cheers, Dirk. On Tuesday, 11.
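
The fix described amounts to putting the scrub tuning where the mons/mgr read it too, e.g. a ceph.conf fragment like the following (the interval value is an example, not taken from the thread):

    [global]
    osd_deep_scrub_interval = 1209600    # 14 days; example value only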

[ceph-users] Re: pg stuck in unknown state

2020-08-11 Thread Wido den Hollander
On 11/08/2020 00:40, Michael Thomas wrote: On my relatively new Octopus cluster, I have one PG that has been perpetually stuck in the 'unknown' state.  It appears to belong to the device_health_metrics pool, which was created automatically by the mgr daemon(?). The OSDs that the PG maps to
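
When no OSD reports the PG, nudging the recorded primary to re-peer is a common next step (a sketch; the PG and OSD ids are placeholders):

    ceph pg map 1.0       # note the acting primary
    ceph osd down 12      # mark that OSD down; it reasserts itself and re-peers its PGs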

[ceph-users] pgs not deep scrubbed in time - false warning?

2020-08-11 Thread Dirk Sarpe
Hi, for some time (I think since the upgrade to Nautilus) we have been getting "X pgs not deep scrubbed in time". I deep-scrubbed the PGs when the warning occurred and expected the cluster to recover over time, but no such luck. The warning comes up again and again. In our spinning-rust cluster we allow deep scrubbing
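
To see which PGs are behind and kick one off manually while investigating (a sketch; the PG id is a placeholder):

    ceph health detail | grep 'not deep-scrubbed'   # list the overdue PGs
    ceph pg deep-scrub 2.1f                         # manually deep-scrub one of them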