[ceph-users] upmap balancer and consequences of osds briefly marked out

2020-05-01 Thread Dylan McCulloch
Hi all, We're using upmap balancer which has made a huge improvement in evenly distributing data on our osds and has provided a substantial increase in usable capacity. Currently on ceph version: 12.2.13 luminous We ran into a firewall issue recently which led to a large number of osds being
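
For reference, a minimal sketch of how the upmap balancer is typically enabled on a Luminous cluster (the commands below are the standard ones, but verify the min-compat-client implications for your own clients before setting it):

    ceph osd set-require-min-compat-client luminous   # upmap requires Luminous-or-newer clients
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status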

[ceph-users] Re: upmap balancer and consequences of osds briefly marked out

2020-05-01 Thread Dan van der Ster
Hi, You're correct that all the relevant upmap entries are removed when an OSD is marked out. You can try to use this script which will recreate them and get the cluster back to HEALTH_OK quickly: https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py Cheers, Dan On F
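
A rough sketch of how a script like this is usually run: it prints ceph CLI commands (pg-upmap-items entries) to stdout, which can be reviewed first and then piped to a shell. The exact invocation is an assumption here; check the script's own usage notes before running it:

    ./upmap-remapped.py        # review the proposed upmap commands
    ./upmap-remapped.py | sh   # then apply them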

[ceph-users] dashboard module missing dependencies in 15.2.1 Octopus

2020-05-01 Thread Duncan Bellamy
Hi, I have installed ceph on Ubuntu Focal Fossa using the ubuntu repo instead of ceph-deploy (as ceph-deploy install does not work for Focal Fossa yet). To install I used: sudo apt-get install -y ceph ceph-mds radosgw ceph-mgr-dashboard. The rest of the setup was the same as the quickstart on ceph.
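
After the packages are installed, the dashboard still has to be enabled in the manager; a minimal sketch (SSL and user-creation steps omitted):

    sudo ceph mgr module enable dashboard
    ceph mgr services        # should list a "dashboard" endpoint once the module loads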

[ceph-users] Re: 4.14 kernel or greater recommendation for multiple active MDS

2020-05-01 Thread Paul Emmerich
I've seen issues with client reconnects on older kernels, yeah. They sometimes get stuck after a network failure. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Th
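
For reference, multiple active MDS daemons are enabled per filesystem via max_mds; a minimal sketch (the filesystem name is illustrative):

    ceph fs set cephfs max_mds 2   # allow a second active MDS rank
    ceph fs status                 # shows which ranks are active vs standby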

[ceph-users] Re: upmap balancer and consequences of osds briefly marked out

2020-05-01 Thread Dylan McCulloch
Thanks Dan, that looks like a really neat method & script for a few use-cases. We've actually used several of the scripts in that repo over the years, so, many thanks for sharing. That method will definitely help in the scenario in which a set of unnecessary pg remaps have been triggered and ca
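
As an aside, the usual way to avoid the remap storm in the first place during brief outages is to stop OSDs from being marked out automatically; a sketch:

    ceph osd set noout     # before brief maintenance / expected interruptions
    ceph osd unset noout   # afterwards
    # related option: mon_osd_down_out_interval (seconds before a down OSD is marked out)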

[ceph-users] Re: Re: OSDs continuously restarting under load

2020-05-01 Thread David Turner
badblocks has found over 50 bad sectors so far and is still running. xfs_repair stopped running twice with a "Killed" message, likely indicating that it hit a bus error similar to the one ceph-osd is running into. This seems like a fairly simple case of failing disks. I just hope I can get through it without
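
For anyone following along, a sketch of the kind of checks being described (device names are placeholders; badblocks in read-only mode and xfs_repair in no-modify mode first):

    badblocks -sv /dev/sdX    # read-only surface scan with progress output
    xfs_repair -n /dev/sdX1   # dry run: report problems without modifying the filesystem
    smartctl -a /dev/sdX      # SMART status to confirm a failing drive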

[ceph-users] Re: dashboard module missing dependencies in 15.2.1 Octopus

2020-05-01 Thread James Page
Hi Duncan Try python3-yaml - this might just be a missing dependency. Cheers James On Fri, May 1, 2020 at 7:32 AM Duncan Bellamy wrote: > Hi, > I have installed ceph on Ubuntu Focal Fossa using the ubuntu repo, instead > of ceph-deploy (as ceph-deploy install does not work for Focal Fossa yet
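
A sketch of the suggested fix (package name as given above; restarting the mgr afterwards is an assumption and may not be strictly required):

    sudo apt-get install -y python3-yaml
    sudo systemctl restart ceph-mgr.target   # so the mgr can import the new python module
    ceph mgr module enable dashboard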

[ceph-users] ceph-mgr high CPU utilization

2020-05-01 Thread Andras Pataki
I'm wondering if anyone still sees issues with ceph-mgr using CPU and being unresponsive even in recent Nautilus releases.  We upgraded our largest cluster from Mimic to Nautilus (14.2.8) recently - it has about 3500 OSDs.  Now ceph-mgr is constantly at 100-200% CPU (1-2 cores), and becomes unr
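
When chasing this kind of mgr CPU issue, the usual first steps are to see which modules are enabled and what the active daemon is doing; a sketch (the mgr id is a placeholder):

    ceph mgr module ls                 # which modules are enabled vs merely available
    ceph daemon mgr.<id> perf dump     # admin-socket counters, run on the active mgr host
    ceph mgr module disable <module>   # temporarily disable a suspect module to isolate the load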

[ceph-users] Re: ceph-mgr high CPU utilization

2020-05-01 Thread Andras Pataki
Also just a follow-up on the misbehavior of ceph-mgr.  It looks like the upmap balancer is not acting reasonably either.  It is trying to create upmap entries every minute or so - and claims to be successful, but they never show up in the OSD map.  Setting the logging to 'debug', I see upmap en
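
Two things worth checking in a situation like this, sketched below: whether the entries actually land in the OSD map, and what the balancer logs at higher verbosity (the debug level shown is just an example):

    ceph osd dump | grep pg_upmap_items   # upmap entries currently in the OSD map
    ceph balancer status
    ceph config set mgr debug_mgr 10      # then watch the active mgr log; reset afterwards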

[ceph-users] 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Hello, Hoping you can help me. Ceph had been largely problem free for us for the better part of a year. We have a high file count in a single CephFS filesystem, and are seeing this error in the logs: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/
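
When an MDS hits an assert like this, the immediate state can be checked with (names depend on the filesystem):

    ceph fs status    # which ranks are up, which daemons are standby or damaged
    ceph crash ls     # recent daemon crashes recorded by the crash module
    ceph versions     # confirm all daemons run the same release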

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Ashley Merrick
Quickly checking the code that calls that assert if (version > omap_version) { omap_version = version; omap_num_objs = num_objs; omap_num_items.resize(omap_num_objs); journal_state = jstate; } else if (version == omap_version) { ceph_assert(omap_num_objs == num_objs); if (jstate > journa

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Hi Ashley, Thanks for your response. Nothing that I can think of would have happened. We are using max_mds = 1. We do have 4 MDS daemons, so we used to have 3 standby. Within minutes they all crash. On Fri, May 1, 2020 at 2:21 PM Ashley Merrick wrote: > Quickly checking the code that calls that assert > > >

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Also seeing errors such as this: [2020-05-01 13:15:20,970][systemd][WARNING] command returned non-zero exit status: 1 [2020-05-01 13:15:20,970][systemd][WARNING] failed activating OSD, retries left: 11 [2020-05-01 13:15:20,974][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find
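
Those messages come from ceph-volume retrying OSD activation; a sketch of commands often used to see what it knows about the local devices and to retry activation:

    ceph-volume lvm list             # LVs/devices ceph-volume recognises and their OSD metadata
    ceph-volume lvm activate --all   # retry activating all detected OSDs on this host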

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Paul Emmerich
The OpenFileTable objects are safe to delete while the MDS is offline anyways, the RADOS object names are mds*_openfiles* Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585
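
A sketch of what removing those objects looks like, assuming the default metadata pool name cephfs_metadata (pool and object names are illustrative; per the caution later in this thread, only do this with the MDS offline and only if you understand the implications):

    rados -p cephfs_metadata ls | grep openfiles   # e.g. mds0_openfiles.0
    rados -p cephfs_metadata rm mds0_openfiles.0   # repeat for each mds*_openfiles* object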

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Hi Paul, I appreciate the response but as I'm fairly new to Ceph, I am not sure that I'm understanding. Are you saying that you believe the issue to be due to the number of open files? If so, what are you suggesting as the solution? Thanks. On Fri, May 1, 2020 at 3:27 PM Paul Emmerich wrote

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Paul Emmerich
On Fri, May 1, 2020 at 9:27 PM Paul Emmerich wrote: > The OpenFileTable objects are safe to delete while the MDS is offline > anyways, the RADOS object names are mds*_openfiles* > I should clarify this a little bit: you shouldn't touch the CephFS internal state or data structures unless you know

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Understood Paul, thanks. In case this helps to shed any further light...Digging through logs I'm also seeing this: 2020-05-01 10:06:55.984 7eff10cc3700 1 mds.prdceph01 Updating MDS map to version 1487236 from mon.2 2020-05-01 10:06:56.398 7eff0e4be700 0 log_channel(cluster) log [WRN] : 17 slow
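
The slow-request warnings can be inspected on the MDS itself; a sketch (the daemon id is a placeholder):

    ceph daemon mds.<id> ops         # operations currently in flight / stuck
    ceph daemon mds.<id> perf dump   # MDS counters, including request latencies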

[ceph-users] Re: 14.2.9 MDS Failing

2020-05-01 Thread Marco Pizzolo
Thanks everyone, I was able to address the issue at least temporarily. The filesystem and MDSes are for the time being staying online and the pgs are being remapped. What I'm not sure about is the best tuning for MDS given our use case, nor am I sure of exactly what caused the OSDs to flap as they did,
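
On the tuning question, the knob most often adjusted for high-file-count CephFS workloads is the MDS cache size; a sketch with an example value (the size is illustrative, not a recommendation):

    ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB MDS cache, example value only
    ceph daemon mds.<id> cache status                        # current cache usage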

[ceph-users] repairing osd rocksdb

2020-05-01 Thread Francois Legrand
Hi, We had a major crash which ended with ~1/3 of our osds down. Trying to fix it we reinstalled a few of the down osds (that was a mistake, I agree) and destroyed the data on them. Finally, we could fix the problem (thanks to Igor Fedotov) and restart almost all of our osds except one for which the rocksd
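
For reference, the read-only consistency check mentioned later in the thread is done with ceph-bluestore-tool; a sketch (the OSD path is a placeholder, and the OSD must be stopped first):

    systemctl stop ceph-osd@<id>
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-<id>   # read-only check of the OSD's BlueStore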

[ceph-users] Re: repairing osd rocksdb

2020-05-01 Thread Igor Fedotov
Francois, I have never tried that myself but I recall it's possible to export/import PG using ceph-objectstore-tool. Probably there are some examples in this mailing list... Your broken OSD passes fsck, i.e. works fine in read-only mode. Unfortunately AFAIK export does a regular mount (i.e.
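
A sketch of the export/import flow Igor is referring to (paths, PG id and filename are placeholders; the OSDs involved must be stopped while the tool runs):

    # on the broken-but-readable OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<src> --pgid <pgid> --op export --file /tmp/<pgid>.export
    # on a healthy OSD that should receive the PG
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dst> --op import --file /tmp/<pgid>.export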

[ceph-users] Re: 4.14 kernel or greater recommendation for multiple active MDS

2020-05-01 Thread Robert LeBlanc
Thanks guys. We are so close to the edge that we may just take that chance; usually the only reason an active client has to reconnect is that we have to bounce the MDS because it's overwhelmed. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 O