[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-29 Thread Eugen Block
I'm not sure if I understand correctly: "I decided to distribute subvolumes across multiple pools instead of multi-active MDS. With this method I will have multiple MDS and [1x cephfs client for each pool / host]." Those two statements contradict each other: either you have multi-active MDS
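To illustrate the distinction being made here (this is a sketch, not something stated in the thread): placing subvolumes on different data pools within one filesystem does not by itself add MDS daemons; only raising max_mds does. The extra pool and subvolume names below are hypothetical, the filesystem name ud-data comes from later in the thread:

  # data placement: extra data pool + subvolume created on it
  ceph osd pool create cephfs_data_ssd2
  ceph fs add_data_pool ud-data cephfs_data_ssd2
  ceph fs subvolume create ud-data user_x --pool_layout cephfs_data_ssd2

  # MDS scaling: independent of the pool layout
  ceph fs set ud-data max_mds 2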

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-27 Thread Özkan Göksu
Thank you Frank. My focus is actually performance tuning. After your mail, I started to investigate the client side. I think the kernel tunings work great now; after the tunings I didn't get any warnings again. Now I will continue with performance tuning. I decided to distribute subvolumes across

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-27 Thread Frank Schilder
Hi Özkan, > ... The client is actually at idle mode and there is no reason to fail at > all. ... if you re-read my message, you will notice that I wrote that it's not the client failing, it's a false-positive error flag that is not cleared for idle clients. You seem to encounter exactly

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
d 76 >> thp_file_alloc 0 >> thp_file_fallback 0 >> thp_file_fallback_charge 0 >> thp_file_mapped 0 >> thp_split_page 2 >> thp_split_page_failed 0 >> thp_deferred_split_page 66 >> thp_split_pmd 22451 >> thp_split_pud 0 >> thp_zero_page_alloc 1 >> thp_zero_page_alloc_faile

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
>> in which thread/PR/tracker I read it, but the story was something like that: >>> >>> If an MDS gets under memory pressure it will request dentry items back >>> from *all* clients, not just the active ones or the ones holding many of >>> them. If you have

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Frank Schilder
, it has no performance or otherwise negative impact. Best regards, Frank Schilder, AIT Risø Campus, Bygning 109, rum S14

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Eugen Block
Performance for small files is more about IOPS than throughput, and the IOPS in your fio tests look okay to me. What you could try is to split the PGs to get to around 150 or 200 PGs per OSD. You're currently at around 60, according to the ceph osd df output. Before you do that, can you
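For reference, a rough sketch of what splitting the PGs could look like; the pool name cephfs_data and the target pg_num of 2048 are assumptions for illustration, not values from the thread, and with the autoscaler enabled you would set pg_num and let pgp_num follow:

  ceph osd df                              # check current PGs per OSD
  ceph osd pool get cephfs_data pg_num
  ceph osd pool set cephfs_data pg_num 2048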

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
These are client-side metrics from a client flagged with the "failing to respond to cache pressure" warning.
root@datagen-27:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1282187# cat bdi/stats
BdiWriteback:            0 kB
BdiReclaimable:          0 kB
BdiDirtyThresh:          0 kB
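Besides bdi/stats, the same debugfs directory exposes a few other files that are useful when chasing cache-pressure warnings; the exact set varies by kernel version, so treat this as a sketch rather than a guaranteed layout:

  cat /sys/kernel/debug/ceph/<fsid>.client<id>/caps     # caps held by this client
  cat /sys/kernel/debug/ceph/<fsid>.client<id>/mdsc     # in-flight MDS requests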

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
Every user has one subvolume and I only have one pool. At the beginning we were using each subvolume for the LDAP home directory + user data. When a user logs in to any Docker container on any host, it uses the cluster for the home directory, and for the user-related data we had a second directory in the same subvolume.

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Eugen Block
I understand that your MDS shows a high CPU usage, but other than that what is your performance issue? Do users complain? Do some operations take longer than expected? Are OSDs saturated during those phases? Because the cache pressure messages don’t necessarily mean that users will notice.

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
I will try my best to explain my situation. I don't have a separate MDS server. I have 5 identical nodes, 3 of them are mons, and I use the other 2 as active and standby MDS. (Currently I have leftovers from max_mds 4.) root@ud-01:~# ceph -s cluster: id:

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Eugen Block
There is no definitive answer wrt MDS tuning. As is mentioned everywhere, it's about finding the right setup for your specific workload. If you can synthesize your workload (maybe scaled down a bit), try optimizing it in a test cluster without interrupting your developers too much. But

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
Hello Eugen. I read all of your MDS-related topics, and thank you so much for your effort on this. There is not much information and I couldn't find an MDS tuning guide at all. It seems that you are the right person to discuss MDS debugging and tuning with. Do you have any documents, or may I learn

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-17 Thread Xiubo Li
On 1/13/24 07:02, Özkan Göksu wrote: Hello. I have a 5-node Ceph cluster and I'm constantly getting the "clients failing to respond to cache pressure" warning. I have 84 CephFS kernel clients (servers) and my users access their personal subvolumes located on one pool. My users are software

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-17 Thread Xiubo Li
On 1/17/24 15:57, Eugen Block wrote: Hi, this is not an easy topic and there is no formula that can be applied to all clusters. From my experience, it is exactly how the discussion went in the thread you mentioned: trial & error. Looking at your session ls output, this reminds me of a debug

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Eugen Block
Hi, this is not an easy topic and there is no formula that can be applied to all clusters. From my experience, it is exactly how the discussion went in the thread you mentioned: trial & error. Looking at your session ls output, this reminds me of a debug session we had a few years ago:

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
This is my active MDS perf dump output:
root@ud-01:~# ceph tell mds.ud-data.ud-02.xcoojt perf dump
{
    "AsyncMessenger::Worker-0": {
        "msgr_recv_messages": 17179307,
        "msgr_send_messages": 15867134,
        "msgr_recv_bytes": 445239812294,
        "msgr_send_bytes": 42003529245,
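Since most of that dump is messenger noise, the cache- and caps-related sections are usually the interesting parts for cache-pressure debugging. A small sketch, assuming jq is available on the admin node; mds_mem and mds_cache are the usual section names in an MDS perf dump, but double-check against your own output:

  ceph tell mds.ud-data.ud-02.xcoojt perf dump | jq '{mds_mem: .mds_mem, mds_cache: .mds_cache}'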

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
All of my clients are servers located 2 hops away on a 10 Gbit network, with 2x Xeon CPUs (16+ cores) and a minimum of 64 GB RAM, with an SSD OS drive + 8 GB spare. I use the Ceph kernel mount only, and this is the command: - mount.ceph admin@$fsid.ud-data=/volumes/subvolumegroup ${MOUNT_DIR} -o
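For completeness, a sketch of what the full mount line typically looks like with this device-string syntax; the options after -o were cut off in the archive, so the ones shown here are illustrative assumptions, not the actual ones used:

  mount.ceph admin@$fsid.ud-data=/volumes/subvolumegroup ${MOUNT_DIR} \
      -o secretfile=/etc/ceph/admin.secret,recover_session=clean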

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
Let me share some outputs about my cluster.
root@ud-01:~# ceph fs status
ud-data - 84 clients
=======
RANK  STATE    MDS                    ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active   ud-data.ud-02.xcoojt   Reqs: 31 /s   3022k  3021k  52.6k  385k
POOL  TYPE  USED  AVAIL

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
Hello Eugen. Thank you for the answer. Based on the knowledge and test results in this issue: https://github.com/ceph/ceph/pull/38574 I tried their advice and applied the following changes: max_mds = 4, standby_mds = 1, mds_cache_memory_limit = 16GB, mds_recall_max_caps = 4 When I set
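A rough sketch of how settings like these are usually applied on a Quincy cluster, assuming the filesystem name ud-data from elsewhere in the thread; note that "standby_mds" is not a literal config key (the closest knob is standby_count_wanted), and the recall value above may be truncated in the archive:

  ceph fs set ud-data max_mds 4
  ceph fs set ud-data standby_count_wanted 1
  ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB, in bytes
  ceph config set mds mds_recall_max_caps 4                # value as quoted above, possibly truncated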

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Eugen Block
Hi, I have dealt with this topic multiple times; the SUSE team helped me understand what's going on under the hood. The summary can be found in this thread [1]. What helped in our case was to reduce mds_recall_max_caps from 30k (default) to 3k. We tried it in steps of 1k, IIRC. So I
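A minimal sketch of that change, assuming it is applied cluster-wide via the config database rather than per daemon (the 3k end point is from the description above; how you step down to it is up to you):

  ceph config set mds mds_recall_max_caps 3000
  ceph config get mds mds_recall_max_caps    # verify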