[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-12 Thread Janek Bevendorff
I've been copying happily for days now (not very fast, but the MDS were stable), but eventually the MDSs started flapping again due to large cache sizes (they are being killed after 11M inodes). I could solve the problem by temporarily increasing the cache size in order to allow them to rejoin,
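
A sketch of how such a temporary cache bump can be applied and reverted; the 32 GiB figure is an example, not the value used in the message:

    # Temporarily raise the MDS cache target so the rejoining daemons are not
    # killed for exceeding it (value illustrative only).
    ceph config set mds mds_cache_memory_limit 34359738368   # 32 GiB
    # Once all ranks are active again, remove the temporary override.
    ceph config rm mds mds_cache_memory_limit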

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-29 Thread Patrick Donnelly
Hi Janek, On Tue, Aug 6, 2019 at 11:25 AM Janek Bevendorff wrote: > > Here are tracker tickets to resolve the issues you encountered: > > > > https://tracker.ceph.com/issues/41140 > > https://tracker.ceph.com/issues/41141 The fix has been merged into master and will be backported soon. I've also

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-08-30 Thread Janek Bevendorff
The fix has been merged into master and will be backported soon. Amazing, thanks! I've also done testing in a large cluster to confirm the issue you found. Using multiple processes to create files as fast as possible in a single client reliably reproduced the issue. The MDS cannot recall c
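
A rough sketch of that kind of reproducer (mount point, worker count, and file counts are placeholders, not the values used in the test):

    #!/bin/bash
    # Create files as fast as possible from several parallel workers inside a
    # CephFS mount to drive up the MDS inode and cap counts.
    MNT=/mnt/cephfs/stress          # assumed CephFS mount point
    for w in $(seq 1 8); do
        (
            mkdir -p "$MNT/worker-$w"
            for i in $(seq 1 100000); do
                touch "$MNT/worker-$w/file-$i"
            done
        ) &
    done
    wait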

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-05 Thread Janek Bevendorff
I had similar issues again today. Some users were trying to train a neural network on several million files resulting in enormous cache sizes. Due to my custom cap recall and decay rate settings, the MDSs were able to withstand the load for quite some time, but at some point the active rank crashed
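
The message does not list the exact cap recall and decay settings used; as a sketch, this kind of tuning usually involves MDS options along these lines (values purely illustrative):

    # Recall more caps per cycle from greedy clients and adjust how quickly
    # the recall throttle decays (illustrative values only).
    ceph config set mds mds_recall_max_caps 10000
    ceph config set mds mds_recall_max_decay_rate 1.5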

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-05 Thread Patrick Donnelly
On Thu, Dec 5, 2019 at 10:31 AM Janek Bevendorff wrote: > > I had similar issues again today. Some users were trying to train a > neural network on several million files resulting in enormous cache > sizes. Due to my custom cap recall and decay rate settings, the MDSs > were able to withstand the

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-05 Thread Janek Bevendorff
> You set mds_beacon_grace ? Yes, as I said. It seemed to have no effect or at least none that I could see. The kick timeout seemed random after all. I even set it to something ridiculous like 1800 and the MDS were still timed out. Sometimes they got to 20M inodes, sometimes only to a few 100k.
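
For reference, the grace period discussed here would be raised roughly as follows; 1800 is the value from the message, while setting it in the global section is an assumption:

    # Give MDS daemons far longer before the monitors consider them laggy and
    # replace them (the default grace is in the tens of seconds).
    ceph config set global mds_beacon_grace 1800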

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-17 Thread Janek Bevendorff
Hey Patrick, I just wanted to give you some feedback about how 14.2.5 is working for me. I've had the chance to test it for a day now and overall, the experience is much better, although not perfect (perhaps far from it). I have two active MDS (I figured that'd spread the metadata load a li
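
Going from one to two active ranks as described would look roughly like this, assuming the filesystem is simply named "cephfs":

    # Allow a second active MDS rank so metadata load is spread across two daemons.
    ceph fs set cephfs max_mds 2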

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-17 Thread Stefan Kooman
Hi Janek, Quoting Janek Bevendorff (janek.bevendo...@uni-weimar.de): > Hey Patrick, > > I just wanted to give you some feedback about how 14.2.5 is working for me. > I've had the chance to test it for a day now and overall, the experience is > much better, although not perfect (perhaps far from i

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2019-12-17 Thread Janek Bevendorff
Have you already tried to adjust the "mds_cache_memory_limit" and/or "ceph tell mds.* cache drop"? I really wonder how the MDS copes with that with millions of caps. I played with the cache size, yeah. I kind of need a large cache, otherwise everything is just slow and I'm constantly getting cac
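
The two knobs Stefan mentions, shown as a sketch (the cache value and drop timeout are examples, not figures from the thread):

    # Adjust the MDS cache memory target (example value).
    ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB
    # Ask every MDS to trim its cache, waiting up to 600 seconds for clients
    # to release their caps.
    ceph tell mds.* cache drop 600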

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-06 Thread Janek Bevendorff
Hi, my MDS failed again, but this time I cannot recover it by deleting the mds*_openfiles.0 object. The startup behaviour is also different. Both inode count and cache size stay at zero while the MDS is replaying. When I set the MDS log level to 7, I get tons of these messages: 2020-01-06 11:59:
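
The workaround being referred to, deleting a rank's open file table object, looks roughly like this; the metadata pool name and rank number are assumptions and must match the actual cluster:

    # With the affected MDS stopped, remove its open file table object so it
    # does not try to reload an oversized list of open files on startup.
    rados -p cephfs_metadata rm mds0_openfiles.0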

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-06 Thread Janek Bevendorff
Update: turns out I just had to wait for an hour. The MDSs were sending Beacons regularly, so the MONs didn't try to kill them and instead let them finish doing whatever they were doing. Unlike the other bug where the number of open files outgrows what the MDS can handle, this incident allowed "se
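
A couple of ways to watch such a long replay/rejoin without restarting anything (the daemon name is a placeholder):

    # Per-rank states (replay, rejoin, active) and dentry/inode counts.
    ceph fs status
    # Detailed state of a single daemon, run on the host it lives on.
    ceph daemon mds.<name> status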

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-07 Thread Stefan Kooman
Quoting Janek Bevendorff (janek.bevendo...@uni-weimar.de): > Update: turns out I just had to wait for an hour. The MDSs were sending > Beacons regularly, so the MONs didn't try to kill them and instead let > them finish doing whatever they were doing. > > Unlike the other bug where the number of o

[ceph-users] Re: [Ceph-users] Re: MDS failing under load with large cache sizes

2020-01-07 Thread janek.bevendorff
I had two MDS nodes. One was still active, but the other was stuck rejoining, which already caused the FS to hang (i.e. it was down, yes). Since at first I thought this was the old cache size bug, I deleted the open files objects and when that didn't seem to have an effect, I tried restarting the