[ceph-users] cluster recovery stuck

2019-10-21 Thread Philipp Schwaha
Hi, I have a problem with a cluster being stuck in recovery after an OSD failure. At first recovery was doing quite well, but now it just sits there without any progress. It currently looks like this: health HEALTH_ERR 36 pgs are stuck inactive for more than 300 seconds
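A hedged sketch of first diagnostic steps for a recovery that has stalled (the PG id below is illustrative):

    # Which PGs are stuck, and why
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    # Query the recovery state of one stuck PG, e.g. 1.2f
    ceph pg 1.2f query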

[ceph-users] Decreasing the impact of reweighting osds

2019-10-21 Thread Mark Kirkwood
We recently needed to reweight a couple of OSDs on one of our clusters (Luminous on Ubuntu, 8 hosts, 8 OSDs/host). I think we reweighted by approx 0.2. This was perhaps too much, as IO latency on RBD drives spiked to several seconds at times. We'd like to lessen this effect as much as we
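One common way to soften the client-latency impact of a reweight is to throttle backfill/recovery and to change weights in smaller steps; a hedged sketch with illustrative values (injectargs only changes the running daemons):

    # Limit concurrent backfills and recovery ops per OSD
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    # De-prioritise recovery relative to client I/O
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
    # Reweight in smaller increments, e.g. 0.05 at a time for osd.12
    ceph osd reweight 12 0.95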

[ceph-users] Fwd: large concurrent rbd operations block for over 15 mins!

2019-10-21 Thread Void Star Nill
Apparently the graph is too big, so my last post is stuck. Resending without the graph. Thanks -- Forwarded message - From: Void Star Nill Date: Mon, Oct 21, 2019 at 4:41 PM Subject: large concurrent rbd operations block for over 15 mins! To: ceph-users Hello, I have been

Re: [ceph-users] Crashed MDS (segfault)

2019-10-21 Thread Gustavo Tonini
Is there a possibility to lose data if I use "cephfs-data-scan init --force-init"? On Mon, Oct 21, 2019 at 4:36 AM Yan, Zheng wrote: > On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini > wrote: > > > > Hi Zheng, > > the cluster is running ceph mimic. This warning about network only > appears when
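Before any forced reset of CephFS metadata it is generally prudent to keep a copy of the MDS journal; a hedged sketch using the documented journal tool (file system name and rank are illustrative, and flag syntax differs slightly between releases):

    # Export the rank-0 MDS journal before any destructive recovery step
    cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
    # Check journal integrity
    cephfs-journal-tool --rank=cephfs:0 journal inspect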

[ceph-users] Nautilus - inconsistent PGs - stat mismatch

2019-10-21 Thread Andras Pataki
We have a new ceph Nautilus setup (Nautilus from scratch - not upgraded): # ceph versions {     "mon": {     "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3     },     "mgr": {     "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
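For stat-mismatch scrub errors, the usual first step is to see which objects the replicas disagree on before repairing; a hedged sketch (pool name and PG id are illustrative):

    # List inconsistent PGs in a pool and inspect one of them
    rados list-inconsistent-pg mypool
    rados list-inconsistent-obj 2.1a --format=json-pretty
    # Repair once the bad copy is understood
    ceph pg repair 2.1a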

Re: [ceph-users] Can't create erasure coded pools with k+m greater than hosts?

2019-10-21 Thread Salsa
Just to clarify my situation, we have 2 datacenters with 3 hosts each and 12 4 TB disks per host (2 are in a RAID with the OS installed and the remaining 10 are used for Ceph). Right now I'm trying a single-DC installation and intend to migrate to multi-site, mirroring DC1 to DC2, so if we lose DC1 we can
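If k+m has to exceed the number of hosts, one option is to drop the CRUSH failure domain to the OSD level, at the cost of host-level fault tolerance; a hedged sketch with illustrative profile and pool names:

    # EC profile with the failure domain at osd instead of host
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=osd
    ceph osd pool create ecpool 128 128 erasure ec42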

[ceph-users] Getting rid of prometheus messages in /var/log/messages

2019-10-21 Thread Vladimir Brik
Hello, /var/log/messages on machines in our ceph cluster are inundated with entries from Prometheus scraping ("GET /metrics HTTP/1.1" 200 - "" "Prometheus/2.11.1"). Is it possible to configure ceph to not send those to syslog? If not, can I configure something so that none of ceph-mgr
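Whether this covers the Prometheus access-log lines depends on how the mgr forwards its log, but disabling syslog forwarding for ceph-mgr is one thing to try; a sketch of the ceph.conf form (an assumption, not a confirmed fix for these particular messages):

    # ceph.conf: stop ceph-mgr from forwarding its log and errors to syslog
    [mgr]
        log_to_syslog = false
        err_to_syslog = false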

[ceph-users] Ceph Science User Group Call October

2019-10-21 Thread Kevin Hrpcek
Hello, This Wednesday we'll have a ceph science user group call. This is an informal conversation focused on using ceph in htc/hpc and scientific research environments. Call details copied from the event: Wednesday October 23rd 14:00 UTC 4:00PM Central European 10:00AM Eastern American Main

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Mon, Oct 21, 2019 at 7:58 PM Stefan Kooman wrote: > > Quoting Yan, Zheng (uker...@gmail.com): > > > I double checked the code, but didn't find any clue. Can you compile > > mds with a debug patch? > > Sure, I'll try to do my best to get a properly packaged Ceph Mimic > 13.2.6 with the debug

Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Lei Liu
Hello Ilya and Paul, thanks for your reply. Yes, you are right, 0x7fddff8ee8cbffb comes from the kernel upgrade; it's reported by a Docker container (digitalocean/ceph_exporter) used for Ceph monitoring. Now upmap mode is enabled, client features: "client": { "group": {
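To confirm that only upmap-capable (luminous-or-newer) clients remain connected before relying on upmap, the feature summary and the required minimum can be checked; a hedged sketch:

    # Show the feature bits / releases of connected clients
    ceph features
    # Require luminous-or-newer clients so upmap mappings are safe to use
    ceph osd set-require-min-compat-client luminous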

[ceph-users] ceph balancer do not start

2019-10-21 Thread Jan Peters
Hello, I use ceph 12.2.12 and would like to activate the ceph balancer. unfortunately no redistribution of the PGs is started: ceph balancer status { "active": true, "plans": [], "mode": "crush-compat" } ceph balancer eval current cluster score 0.023776 (lower is better) ceph
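With the balancer active but no plans listed, it can also be driven by hand to see whether it would move anything at all; a hedged sketch with an illustrative plan name:

    # Build a plan, score it, and execute it if the score improves
    ceph balancer optimize myplan
    ceph balancer eval myplan
    ceph balancer show myplan
    ceph balancer execute myplan
    ceph balancer status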

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com): > I double checked the code, but didn't find any clue. Can you compile > mds with a debug patch? Sure, I'll try to do my best to get a properly packaged Ceph Mimic 13.2.6 with the debug patch in it (and / or get help to get it build). Do you already have

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Mon, Oct 21, 2019 at 4:33 PM Stefan Kooman wrote: > > Quoting Yan, Zheng (uker...@gmail.com): > > > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank > > of the crashed mds) > > OK, MDS crashed again, restarted. I stopped it, deleted the object and > restarted the MDS. It

Re: [ceph-users] krbd / kcephfs - jewel client features question

2019-10-21 Thread Ilya Dryomov
On Sat, Oct 19, 2019 at 2:00 PM Lei Liu wrote: > > Hello Ilya, > > After updating the client kernel version to 3.10.0-862, ceph features shows: > > "client": { > "group": { > "features": "0x7010fb86aa42ada", > "release": "jewel", > "num": 5 > }, >

Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-10-21 Thread Marc Roos
I think I am having this issue also (at least I had it with Luminous). I had to remove the hidden temp files rsync had left when the cephfs mount 'stalled', otherwise I would never be able to complete the rsync. -Original Message- Cc: ceph-users Subject: Re: [ceph-users] hanging slow

Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-10-21 Thread Kenneth Waegeman
I've made a ticket for this issue: https://tracker.ceph.com/issues/42338 Thanks again! K On 15/10/2019 18:00, Kenneth Waegeman wrote: Hi Robert, all, On 23/09/2019 17:37, Robert LeBlanc wrote: On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman wrote: Hi all, When syncing data with rsync,

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
The 'xx-.conf' files are mine, custom, so I do not have to merge changes with newer /etc/collectd.conf rpm updates. I would suggest getting a small configuration that works, setting debug logging[0], and growing the configuration in small steps until it fails. Load plugin ceph empty,

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
Is there any instruction on how to install the plugin configuration? Attached is my RHEL collectd configuration file from the /etc/ directory. On RHEL: [rdma@rdmarhel0 collectd.d]$ pwd /etc/collectd.d [rdma@rdmarhel0 collectd.d]$ tree . . 0 directories, 0 files [rdma@rdmarhel0

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
Your collectd starts without the ceph plugin ok? I also have your error "didn't register a configuration callback", because I configured debug logging but did not enable it by loading the plugin 'logfile'. Maybe it is the order in which your configuration files are read (I think this used
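Loading the logfile plugin to get collectd's own debug output, as mentioned above, looks roughly like this; a sketch (log path illustrative):

    LoadPlugin logfile
    <Plugin logfile>
        LogLevel debug
        File "/var/log/collectd.log"
        Timestamp true
    </Plugin>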

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com): > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank > of the crashed mds) OK, MDS crashed again, restarted. I stopped it, deleted the object and restarted the MDS. It became active right away. Any idea on why the openfiles list
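For reference, the object removal suggested in this thread amounts to something like the following, assuming rank 0 and the usual metadata pool name (and only with the MDS for that rank stopped, as discussed):

    # Remove the rank-0 open file table object from the CephFS metadata pool
    rados -p cephfs_metadata rm mds0_openfiles.0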

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
On 10:16 Mon 21 Oct, Marc Roos wrote: > I have the same. I do not think ConvertSpecialMetricTypes is necessary. > > > Globals true > > > > LongRunAvgLatency false > ConvertSpecialMetricTypes true > > SocketPath "/var/run/ceph/ceph-osd.1.asok" > > Same configuration, but

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
I have the same. I do not think ConvertSpecialMetricTypes is necessary. Globals true LongRunAvgLatency false ConvertSpecialMetricTypes true SocketPath "/var/run/ceph/ceph-osd.1.asok" -Original Message- Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users]
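The flattened snippet above appears to have lost its surrounding tags in the archive; the documented collectd ceph-plugin form it corresponds to is roughly the following (daemon name and socket path taken from the quoted config, the rest an assumption):

    LoadPlugin ceph
    <Plugin ceph>
        LongRunAvgLatency false
        ConvertSpecialMetricTypes true
        <Daemon "osd.1">
            SocketPath "/var/run/ceph/ceph-osd.1.asok"
        </Daemon>
    </Plugin>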

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
On 09:50 Mon 21 Oct, Marc Roos wrote: > > I am, collectd with luminous, and upgraded to nautilus and collectd > 5.8.1-1.el7 this weekend. Maybe increase logging or so. > I had to wait a long time before collectd was supporting the luminous > release, maybe it is the same with octopus (=15?) >

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com): > delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank > of the crashed mds) Just to make sure I understand correctly. Current status is that the MDS is active (no standby for now) and not in a "crashed" state (although it has been

Re: [ceph-users] collectd Ceph metric

2019-10-21 Thread Marc Roos
I am, collectd with Luminous, and upgraded to Nautilus and collectd 5.8.1-1.el7 this weekend. Maybe increase logging or so. I had to wait a long time before collectd supported the Luminous release; maybe it is the same with Octopus (=15?) -Original Message- From: Liu,

Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Yan, Zheng
On Sun, Oct 20, 2019 at 1:53 PM Stefan Kooman wrote: > > Dear list, > > Quoting Stefan Kooman (ste...@bit.nl): > > > I wonder if this situation is more likely to be hit on Mimic 13.2.6 than > > on any other system. > > > > Any hints / help to prevent this from happening? > > We have had this

[ceph-users] collectd Ceph metric

2019-10-21 Thread Liu, Changcheng
Hi all, has anyone succeeded in using the collectd ceph plugin to collect ceph cluster data? I'm using collectd (5.8.1) and Ceph 15.0.0. collectd failed to get cluster data with the error below: "collectd.service holdoff time over, scheduling restart" Regards, Changcheng

Re: [ceph-users] Crashed MDS (segfault)

2019-10-21 Thread Yan, Zheng
On Fri, Oct 18, 2019 at 9:10 AM Gustavo Tonini wrote: > > Hi Zheng, > the cluster is running ceph mimic. This warning about network only appears > when using nautilus' cephfs-journal-tool. > > "cephfs-data-scan scan_links" does not report any issue. > > How could variable "newparent" be NULL at