I know this may sound simple, but have you tried raising the PG-per-OSD limit? I'm sure I have seen people in the past with the same kind of issue as you, and it turned out to be I/O being blocked due to a limit that was not actively logged.
Set mon_max_pg_per_osd = 400 in ceph.conf and then restart all the services, or inject the config into the running mons.

On Mon, Feb 18, 2019 at 10:55 PM Hennen, Christian <christian.hen...@uni-trier.de> wrote:

> Dear Community,
>
> we are running a Ceph Luminous cluster with CephFS (Bluestore OSDs).
> During setup, we made the mistake of configuring the OSDs on RAID volumes.
> Initially our cluster consisted of 3 nodes, each housing 1 OSD. Currently,
> we are in the process of remediating this. After a loss of metadata
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025612.html)
> due to resetting the journal (journal entries were not being flushed fast
> enough), we managed to bring the cluster back up and started adding 2
> additional nodes
> (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027563.html).
>
> After adding the two additional nodes, we increased the number of
> placement groups, not only to accommodate the new nodes but also to
> prepare for reinstallation of the misconfigured nodes. Since then, the
> number of placement groups per OSD has of course been too high. Despite
> this, cluster health remained fine over the last few months.
>
> However, we are currently observing massive problems: whenever we try to
> access any folder via CephFS, e.g. by listing its contents, there is no
> response. Clients are getting blacklisted, but there is no warning.
> "ceph -s" shows everything is OK, except for the number of PGs being too
> high. If I grep for "assert" or "error" in any of the logs, nothing comes
> up. Also, it is not possible to reduce the number of active MDS daemons
> to 1: after issuing "ceph fs set fs_data max_mds 1", nothing happens.
> Cluster details are available here:
> https://gitlab.uni-trier.de/snippets/77
>
> The MDS log
> (https://gitlab.uni-trier.de/snippets/79?expanded=true&viewer=simple)
> contains no "nicely exporting to" messages as usual, but instead these:
>
> 2019-02-15 08:44:52.464926 7fdb13474700 7 mds.0.server
> try_open_auth_dirfrag: not auth for [dir 0x100011ce7c6 /home/r-admin/
> [2,head] rep@1.1 dir_auth=1 state=0 f(v4 m2019-02-14 13:19:41.300993
> 80=48+32) n(v11339 rc2019-02-14 13:19:41.300993 b10116465260
> 10869=10202+667) hs=7+0,ss=0+0 | dnwaiter=0 child=1 frozen=0 subtree=1
> replicated=0 dirty=0 waiter=0 authpin=0 tempexporting=0 0x564343eed100], fw
> to mds.1
>
> The updates from 12.2.8 to 12.2.11 that I ran last week didn't help.
>
> Does anybody have an idea or a hint where I could look next? Any help
> would be greatly appreciated!
>
> Kind regards
> Christian Hennen
>
> Project Manager Infrastructural Services
> ZIMK University of Trier
> Germany
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
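For reference, the limit change suggested above could be applied roughly as follows. This is a sketch, not verified against this cluster: it assumes you run the commands on a monitor host, that the mon IDs match the short hostnames, and that the value 400 is appropriate for your PG-per-OSD ratio.

```shell
# 1) Check the current value on a running monitor (run on a mon host;
#    assumes the mon ID is the short hostname):
ceph daemon mon.$(hostname -s) config get mon_max_pg_per_osd

# 2) Inject the new value into the running mons without a restart
#    (Ceph may warn that the change is unsafe to apply live; if so,
#    the restart in step 3 makes it take effect):
ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'

# 3) Persist it across restarts: add the line
#       mon_max_pg_per_osd = 400
#    under [global] in /etc/ceph/ceph.conf on every mon host, then
#    restart the mon services one host at a time:
systemctl restart ceph-mon@$(hostname -s)
```

Afterwards, "ceph -s" should no longer report the too-many-PGs warning, and any I/O that was being throttled by the limit would be allowed to proceed.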