> On 7 Dec 2022, at 11:59, Stefan Kooman <ste...@bit.nl> wrote the following:
>
> On 5/13/22 09:38, Xiubo Li wrote:
>> On 5/12/22 12:06 AM, Stefan Kooman wrote:
>>> Hi List,
>>>
>>> We have quite a few Linux kernel clients for CephFS. One of our customers
>>> has been running mainline kernels (CentOS 7 ELRepo) for the past two years.
>>> They started out with 3.x kernels (default CentOS 7), but upgraded to
>>> mainline when those kernels would frequently generate MDS warnings like
>>> "failing to respond to capability release". That worked fine until the 5.14
>>> kernel. 5.14 and up would use a lot of CPU and *way* more bandwidth on
>>> CephFS than older kernels (an order of magnitude more). After the MDS was
>>> upgraded from Nautilus to Octopus that behavior is gone (CPU / bandwidth
>>> usage comparable to older kernels). However, the newer kernels are now the
>>> ones that give "failing to respond to capability release", and worse,
>>> clients get evicted (unresponsive as far as the MDS is concerned). Even the
>>> latest 5.17 kernels have that. No difference is observed between using
>>> messenger v1 or v2. MDS version is 15.2.16.
>>>
>>> Surprisingly, the latest stable kernels from CentOS 7 work flawlessly now.
>>> Although that is good news, newer operating systems come with newer kernels.
>>>
>>> Does anyone else observe the same behavior with newish kernel clients?
>>
>> There are some known bugs, which have been fixed or are being fixed
>> recently, even in mainline, and I am not sure whether they are related,
>> such as [1][2][3][4]. For more detail please see the ceph-client repo
>> testing branch [5].
>
> None of the issues you mentioned were related. We gained some more
> experience with newer kernel clients, specifically on Ubuntu Focal / Jammy
> (5.15). Performance issues seem to arise in certain workloads, specifically
> load-balanced Apache shared web hosting clusters with CephFS. We have tested
> Linux kernel clients from 5.8 up to and including 6.0 with a production
> workload and the short summary is:
>
> < 5.13: everything works fine
> 5.13 and up: giving issues

I see this issue on 6.0.0 as well.
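In case it helps to compare notes: below is a small sketch (Python, wrapping the standard ceph CLI) of roughly how we check which client sessions hold lots of caps and which kernel they run, when the MDS complains about "failing to respond to capability release". It assumes rank 0 is the active MDS and that `ceph tell mds.0 session ls` prints JSON with fields like num_caps and client_metadata.kernel_version; those details may vary between Ceph releases, so treat it as a starting point rather than a recipe.

    #!/usr/bin/env python3
    # Sketch: list CephFS client sessions with kernel version and cap count,
    # to spot clients that are slow to release caps. Assumes the `ceph` CLI is
    # installed and rank 0 is the active MDS; the field names below match what
    # we see in Octopus-era `session ls` output and may differ elsewhere.
    import json
    import subprocess

    out = subprocess.run(
        ["ceph", "tell", "mds.0", "session", "ls"],
        check=True, capture_output=True, text=True,
    ).stdout

    sessions = json.loads(out)
    # Sort by cap count so the heaviest clients show up first.
    for s in sorted(sessions, key=lambda s: s.get("num_caps", 0), reverse=True):
        meta = s.get("client_metadata", {})
        print(
            s.get("id"),
            meta.get("hostname", "?"),
            meta.get("kernel_version", "n/a"),  # only kernel clients report this
            "caps:", s.get("num_caps"),
        )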
> We tested the 5.13-rc1 as well, and already that kernel is giving issues.
> So something has changed in 5.13 that results in a performance regression
> in certain workloads. And I wonder if it has something to do with the
> fscache-related changes that have happened, and are still happening, in the
> kernel. These web servers might access the same directories / files
> concurrently.
>
> Note: we have quite a few 5.15 kernel clients not doing any (load-balanced)
> web based workload (container clusters on CephFS) that don't have any
> performance issues running these kernels.
>
> Issue: poor CephFS performance.
> Symptom / result: excessive CephFS network usage (an order of magnitude
> higher than for older kernels not having this issue); within a minute there
> are a bunch of slow web service processes, claiming loads of virtual memory,
> that result in heavy swap usage and basically render the node unusably slow.
>
> Other users that replied to this thread experienced similar symptoms. It is
> reproducible on CentOS (ELRepo mainline kernels) as well as on Ubuntu (HWE
> as well as the default release kernel).
>
> MDS version used: 15.2.16 (with a backported patch from 15.2.17), single
> active / standby-replay.
>
> Does this ring a bell?
>
> Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io