> On 7 Dec 2022, at 11:59, Stefan Kooman <ste...@bit.nl> wrote:
> 
> On 5/13/22 09:38, Xiubo Li wrote:
>> On 5/12/22 12:06 AM, Stefan Kooman wrote:
>>> Hi List,
>>> 
>>> We have quite a few linux kernel clients for CephFS. One of our customers 
>>> has been running mainline kernels (CentOS 7 elrepo) for the past two years. 
>>> They started out with 3.x kernels (default CentOS 7), but upgraded to 
>>> mainline when those kernels would frequently generate MDS warnings like 
>>> "failing to respond to capability release". That worked fine until the 
>>> 5.14 kernel: 5.14 and up would use a lot of CPU and *way* more bandwidth 
>>> on CephFS than older kernels (an order of magnitude more). After the MDS 
>>> was upgraded from Nautilus to Octopus that behavior was gone (CPU / 
>>> bandwidth usage comparable to older kernels). However, the newer kernels 
>>> are now the ones that give "failing to respond to capability release", 
>>> and worse, clients get evicted (unresponsive as far as the MDS is 
>>> concerned). Even the latest 5.17 kernels show this. No difference is 
>>> observed between messenger v1 and v2. MDS version is 15.2.16.
>>> Surprisingly, the latest stable kernels from CentOS 7 now work flawlessly. 
>>> Although that is good news, newer operating systems come with newer kernels.
>>> 
>>> Does anyone else observe the same behavior with newish kernel clients?
>> There are some known bugs, which have recently been fixed or are still 
>> being fixed, even in mainline, and I am not sure whether they are related, 
>> such as [1][2][3][4]. For more detail, please see the ceph-client repo 
>> testing branch [5].
> 
> None of the issues you mentioned were related. We gained some more experience 
> with newer kernel clients, specifically on Ubuntu Focal / Jammy (5.15). 
> Performance issues seem to arise in certain workloads, in particular 
> load-balanced Apache shared web hosting clusters on CephFS. We have tested 
> Linux kernel clients from 5.8 up to and including 6.0 with a production 
> workload, and the short summary is:
> 
> < 5.13: everything works fine
> 5.13 and up: giving issues

I see this issue on 6.0.0 as well.
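
For what it's worth, when chasing the "failing to respond to capability 
release" warnings, it helps to see which kernel clients are holding the most 
caps. Below is a minimal sketch, not part of any official tooling, assuming a 
working `ceph` CLI with an admin keyring; "mds.0" is a placeholder for the 
name of the active MDS:

    #!/usr/bin/env python3
    # Sketch: list CephFS client sessions sorted by held caps, so the
    # clients that fail to release caps stand out. "mds.0" is a placeholder.
    import json
    import subprocess

    out = subprocess.run(
        ["ceph", "tell", "mds.0", "session", "ls", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout

    for s in sorted(json.loads(out), key=lambda s: s.get("num_caps", 0),
                    reverse=True):
        meta = s.get("client_metadata", {})
        print(s.get("num_caps", 0), meta.get("hostname", "?"),
              meta.get("kernel_version", "?"))

In our case that at least makes it easy to correlate the warnings with the 
kernel versions of the offending clients.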

> 
> We tested 5.13-rc1 as well, and that kernel already shows the issues. So 
> something changed in 5.13 that results in a performance regression for 
> certain workloads, and I wonder if it has something to do with the 
> fscache-related changes that have been, and still are, happening in the 
> kernel. These web servers might access the same directories / files 
> concurrently.
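
One quick sanity check on the fscache angle: the kernel client only uses 
FS-Cache when the mount has the "fsc" option set, so it is worth confirming 
whether that option is even in play on the affected clients. A minimal sketch 
that just parses /proc/mounts:

    #!/usr/bin/env python3
    # Sketch: list CephFS kernel mounts and report whether the "fsc"
    # (FS-Cache) mount option is set. Read-only; only inspects /proc/mounts.
    with open("/proc/mounts") as f:
        for line in f:
            dev, mnt, fstype, opts, *_ = line.split()
            if fstype == "ceph":
                has_fsc = any(o == "fsc" or o.startswith("fsc=")
                              for o in opts.split(","))
                print(f"{mnt}: fsc={'yes' if has_fsc else 'no'} ({opts})")

If no mount uses fsc, that would at least rule out the caching side of the 
rework, though I understand parts of it affect the read path regardless.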
> 
> Note: we have quite a few 5.15 kernel clients that do not run any 
> (load-balanced) web-based workload (container clusters on CephFS) and do 
> not show any performance issues on these kernels.
> 
> Issue: poor CephFS performance
> Symptom / result: excessive CephFS network usage (an order of magnitude 
> higher than for older kernels that do not have this issue); within a minute 
> there are a number of slow web server processes claiming large amounts of 
> virtual memory, which leads to heavy swap usage and renders the node 
> unusably slow.
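
When a node gets into that state, it can be informative to look at what the 
kernel client itself reports. The CephFS kernel client exposes its in-flight 
OSD/MDS requests and cap counts under debugfs; a rough sketch to dump them 
(needs root and a mounted debugfs, and is only meant as a starting point):

    #!/usr/bin/env python3
    # Sketch: dump the CephFS kernel client debugfs files (in-flight OSD
    # requests, in-flight MDS requests, cap counts) for every mounted
    # client instance. Requires root and debugfs mounted at the usual path.
    import glob
    import pathlib

    for d in glob.glob("/sys/kernel/debug/ceph/*"):
        print(f"== {d} ==")
        for name in ("osdc", "mdsc", "caps"):
            p = pathlib.Path(d, name)
            if p.exists():
                print(f"-- {name} --")
                print(p.read_text(), end="")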
> 
> Other users who replied to this thread experienced similar symptoms. It is 
> reproducible both on CentOS (ELRepo mainline kernels) and on Ubuntu (HWE as 
> well as the default release kernel).
> 
> MDS version used: 15.2.16 (with a backported patch from 15.2.17) (single 
> active / standby-replay)
> 
> Does this ring a bell?
> 
> Gr. Stefan
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
