[ceph-users] Re: High OSD commit_latency after kernel upgrade
After I set these 2 udev rules:

root@sd-02:~# cat /etc/udev/rules.d/98-ceph-provisioning-mode.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}:="unmap"
root@sd-02:~# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

only the raw drives changed to "DISC-GRAN=4K", "DISC-MAX=4G". This is the status:

root@sd-02:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                  0      512B       2G         0
├─sda1               0      512B       2G         0
├─sda2               0      512B       2G         0
└─sda3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0
sdb                  0      512B       2G         0
├─sdb1               0      512B       2G         0
├─sdb2               0      512B       2G         0
└─sdb3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0
sdc                  0        4K       4G         0
├─ceph--35de126c--326d--45f0--85e6--ef651dd25506-osd--block--65a12345--788d--406c--b4aa--79c691662f3e
│                    0        0B       0B         0
└─ceph--35de126c--326d--45f0--85e6--ef651dd25506-osd--block--0fc29fdb--1345--465c--b830--8a217dd9034f
                     0        0B       0B         0

But in my other cluster, as you can see, the ceph LVM volumes are also 4K + 2G:

root@ud-01:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                  0      512B       2G         0
├─sda1               0      512B       2G         0
└─sda2               0      512B       2G         0
  └─md0              0      512B       2G         0
    ├─md0p1          0      512B       2G         0
    └─md0p2          0      512B       2G         0
sdb                  0      512B       2G         0
├─sdb1               0      512B       2G         0
└─sdb2               0      512B       2G         0
  └─md0              0      512B       2G         0
    ├─md0p1          0      512B       2G         0
    └─md0p2          0      512B       2G         0
sdc                  0        4K       2G         0
├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003
│                    0        4K       2G         0
└─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1
                     0        4K       2G         0

I think I also need to write a udev rule for the LVM OSD volumes, right? (See the sketch after the quote below.)

Anthony D'Atri wrote on Fri, 22 Mar 2024 at 18:11:

> Maybe because the Crucial units are detected as client drives? But also
> look at the device paths and the output of whatever "disklist" is. Your
> boot drives are SATA and the others are SAS, which seems even more likely
> to be a factor.
>
> On Mar 22, 2024, at 10:42, Özkan Göksu wrote:
>
> [...]
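A hedged sketch for the LVM question above: as far as I know, a udev rule on the LVM devices themselves is not the right tool here, because device-mapper stacks its discard limits from the underlying disk at the moment the LV is activated. Once the scsi_disk rule is in effect, re-triggering it and then reactivating the VG (or simply rebooting) should be enough. The VG name below is derived from the lsblk output above; adjust to your own:

# re-apply the rules to the SCSI disks without rebooting
udevadm control --reload-rules
udevadm trigger --subsystem-match=scsi_disk --action=add

# verify the raw disk now advertises discard
cat /sys/block/sdc/queue/discard_granularity
cat /sys/block/sdc/queue/discard_max_bytes

# stop the OSDs on this VG first, then reactivate it so the LVs
# re-read the stacked queue limits from the disk
vgchange -an ceph-35de126c-326d-45f0-85e6-ef651dd25506
vgchange -ay ceph-35de126c-326d-45f0-85e6-ef651dd25506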
[ceph-users] Re: High OSD commit_latency after kernel upgrade
Hello again.

In the Ceph hardware recommendations I found this:
https://docs.ceph.com/en/quincy/start/hardware-recommendations/

WRITE CACHES
Enterprise SSDs and HDDs normally include power loss protection features which
ensure data durability when power is lost while operating, and use multi-level
caches to speed up direct or synchronous writes. These devices can be toggled
between two caching modes -- a volatile cache flushed to persistent media with
fsync, or a non-volatile cache written synchronously. These two modes are
selected by either "enabling" or "disabling" the write (volatile) cache. When
the volatile cache is enabled, Linux uses a device in "write back" mode, and
when disabled, it uses "write through". The default configuration (usually:
caching is enabled) may not be optimal, and OSD performance may be dramatically
increased in terms of increased IOPS and decreased commit latency by disabling
this write cache. Users are therefore encouraged to benchmark their devices
with fio as described earlier and persist the optimal cache configuration for
their devices.

root@sd-02:~# cat /sys/class/scsi_disk/*/cache*
write back
write back
write back
write back
write back
write back
write back
write back
write back
write back

What do you think about these new udev rules?

root@sd-02:~# cat /etc/udev/rules.d/98-ceph-provisioning-mode.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}:="unmap"
root@sd-02:~# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

Özkan Göksu wrote on Fri, 22 Mar 2024 at 17:42:

> Hello Anthony, thank you for the answer.
>
> [...]
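For reference, the fio test that the quoted hardware-recommendations page points to is a single-job 4K synchronous-write latency test. A minimal sketch, assuming an unused test device (/dev/sdX is a placeholder, and this writes to the raw device, destroying data on it):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=cache-test

Running it once with cache_type "write back" and once with "write through", then comparing IOPS and completion latency, should show which mode is worth persisting with the udev rule.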
[ceph-users] Re: High OSD commit_latency after kernel upgrade
Hello Anthony, thank you for the answer.

While researching I also found this type of issue, but what I don't understand
is that in the same server the OS drives ("SAMSUNG MZ7WD480") are all fine.

root@sd-01:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                  0      512B       2G         0
├─sda1               0      512B       2G         0
├─sda2               0      512B       2G         0
└─sda3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0
sdb                  0      512B       2G         0
├─sdb1               0      512B       2G         0
├─sdb2               0      512B       2G         0
└─sdb3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0

root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
/sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full

root@sd-01:~# disklist
HCTL     NAME      SIZE    REV   TRAN  WWN                 SERIAL       MODEL
1:0:0:0  /dev/sda  447.1G  203Q  sata  0x5002538500231d05  S1G1NYAF923  SAMSUNG MZ7WD4
2:0:0:0  /dev/sdb  447.1G  203Q  sata  0x5002538500231a41  S1G1NYAF922  SAMSUNG MZ7WD4
0:0:0:0  /dev/sdc  3.6T    046   sas   0x500a0751e6bd969b  2312E6BD969  CT4000MX500SSD
0:0:1:0  /dev/sdd  3.6T    046   sas   0x500a0751e6bd97ee  2312E6BD97E  CT4000MX500SSD
0:0:2:0  /dev/sde  3.6T    046   sas   0x500a0751e6bd9805  2312E6BD980  CT4000MX500SSD
0:0:3:0  /dev/sdf  3.6T    046   sas   0x500a0751e6bd9681  2312E6BD968  CT4000MX500SSD
0:0:4:0  /dev/sdg  3.6T    045   sas   0x500a0751e6b5d30a  2309E6B5D30  CT4000MX500SSD
0:0:5:0  /dev/sdh  3.6T    046   sas   0x500a0751e6bd967e  2312E6BD967  CT4000MX500SSD
0:0:6:0  /dev/sdi  3.6T    046   sas   0x500a0751e6bd97e4  2312E6BD97E  CT4000MX500SSD
0:0:7:0  /dev/sdj  3.6T    046   sas   0x500a0751e6bd96a0  2312E6BD96A  CT4000MX500SSD

So my question is: why does it happen only to the CT4000MX500SSD drives, why
did it just start now, and why don't I have it on other servers? Maybe it is
related to the firmware version (M3CR046 vs M3CR045). I checked the Crucial
website and "M3CR046" does not actually exist:
https://www.crucial.com/support/ssd-support/mx500-support

In this forum people recommend upgrading to "M3CR046":
https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/

But in my ud cluster all the drives are "M3CR045" and have lower latency.
I'm really confused.

Instead of writing udev rules only for the CT4000MX500SSD, is there any
recommended udev rule for Ceph and all types of SATA drives? (See the sketch
at the end of this message.)

Anthony D'Atri wrote on Fri, 22 Mar 2024 at 17:00:

> How to stop sys from changing USB SSD provisioning_mode from unmap to full
> in Ubuntu 22.04?
> https://askubuntu.com/questions/1454997/how-to-stop-sys-from-changing-usb-ssd-provisioning-mode-from-unmap-to-full-in-ub
>
> On Mar 22, 2024, at 09:36, Özkan Göksu wrote:
>
> [...]
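If you prefer to scope the rule to the affected model only rather than all SATA drives, udev can match on the parent SCSI device's model attribute. A hedged sketch: the kernel pads/truncates SCSI model strings, so verify the exact value with `udevadm info -a /dev/sdc` before relying on this match:

# /etc/udev/rules.d/98-crucial-provisioning-mode.rules (sketch)
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTRS{model}=="CT4000MX500SSD*", ATTR{provisioning_mode}:="unmap"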
[ceph-users] High OSD commit_latency after kernel upgrade
Hello!

After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
LTS), commit latency started acting weird with "CT4000MX500SSD" drives.

osd  commit_latency(ms)  apply_latency(ms)
 36                 867               867
 37                3045              3045
 38                  15                15
 39                  18                18
 42                1409              1409
 43                1224              1224

I downgraded the kernel but the result did not change. I have a similar build
that didn't get upgraded, and it is just fine. While I was digging I noticed a
difference. This is the high-latency cluster, and as you can see
"DISC-GRAN=0B", "DISC-MAX=0B":

root@sd-01:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdc                  0        0B       0B         0
├─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--201d5050--db0c--41b4--85c4--6416ee989d6c
│                    0        0B       0B         0
└─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--5a376133--47de--4e29--9b75--2314665c2862

root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full

--

This is the low-latency cluster, and as you can see "DISC-GRAN=4K",
"DISC-MAX=2G":

root@ud-01:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdc                  0        4K       2G         0
├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003
│                    0        4K       2G         0
└─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1

root@ud-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:00/:00:11.4/ata3/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16

I think the problem is related to provisioning_mode, but I really don't
understand the reason. I booted with a live ISO and the drive was still
"provisioning_mode:full", so this is not related to my OS at all. Something
changed with the upgrade, and I think during the boot sequence the negotiation
between the LSI controller, the drives and the kernel started to assign
"provisioning_mode:full", but I'm not sure.

What should I do?

Best regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
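One hedged way to test the theory before writing any udev rules: the mode can be flipped at runtime through sysfs, which should make the sd driver reconfigure the discard limits immediately. The host/device names below are the ones from the outputs above; adjust to your own:

# current state
find /sys/ -name provisioning_mode -exec grep -H . {} + | sort

# switch one disk to unmap at runtime (not persistent across reboots)
echo unmap > /sys/class/scsi_disk/0:0:0:0/provisioning_mode

# re-check what the block layer now advertises
lsblk -D /dev/sdc

If DISC-GRAN/DISC-MAX come back after this, the udev rule is the right permanent fix.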
[ceph-users] Sata SSD trim latency with (WAL+DB on NVME + Sata OSD)
Hello.

With SSD drives that don't have tantalum capacitors, Ceph faces trim latency on
every write. I wonder if the behavior is the same if we locate the WAL+DB on
NVMe drives that do have tantalum capacitors? Do I need to use NVMe + SAS SSD
to avoid this latency issue?

Best regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
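For context, a minimal sketch of the layout in question, built with ceph-volume (device paths are placeholders): each OSD keeps its data on the SATA/SAS SSD while block.db (which also holds the WAL unless placed separately) lives on the NVMe:

ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1

With the DB/WAL on a power-loss-protected NVMe, small synchronous writes should be acknowledged from the NVMe first, so the data drive's cache/flush behavior should matter much less for commit latency; this is an assumption worth verifying with fio before committing to hardware.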
[ceph-users] Re: Separate metadata pool in 3x MDS node
Hello Anthony,

The hardware is a second-hand build and does not have U.2 slots. U.2 servers
cost 3x-4x more. I mean the PCIe "MZ-PLK3T20". I have to buy SFP cards anyway,
and 25G is only +30$ more than 10G, so why not.
Yes, I'm thinking of pinning as (clients > rack MDS). I don't have problems
with building, and I don't use the PG autoscaler.

Hello David,

My system is all internal and I only use one /20 subnet at layer 2.
Yes, I'm thinking of distributing the metadata pool over racks 1,2,4,5,
because my clients use search a lot and I just want to shorten the metadata
path. I have redundant rack PDUs, so I don't have any problem with power, and
I only have a VPC (2x N9K switches) in the main rack 3. That's why I keep data
and everything management-related on rack 3 as usual.

Normally I always use WAL+DB on NVMe with SATA OSDs. The only thing I wonder
is whether a separate metadata pool on NVMe, located on the client racks, is
going to give some benefit or not.

Regards.

David C. wrote on Sun, 25 Feb 2024 at 00:07:

> Hello,
>
> Does each rack work on a different tree, or is everything parallelized?
> Would the meta pools be distributed over racks 1,2,4,5?
> If it is distributed, even if the addressed MDS is on the same switch as
> the client, you will always have this MDS consulting/writing the (NVMe)
> OSDs on the other racks (among 1,2,4,5).
>
> In any case, the exercise is interesting.
>
> On Sat, 24 Feb 2024 at 19:56, Özkan Göksu wrote:
>
>> Hello folks!
>>
>> [...]
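If it helps the discussion: steering a metadata pool onto the NVMe OSDs is usually done with a device-class CRUSH rule rather than by hardware placement alone. A hedged sketch; the rule and pool names are examples, and in this rack-based design the failure domain may need to be rack instead of host:

ceph osd crush rule create-replicated meta-nvme default host nvme
ceph osd pool set cephfs.ud-data.meta crush_rule meta-nvme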
[ceph-users] Separate metadata pool in 3x MDS node
Hello folks!

I'm designing a new Ceph storage from scratch, and I want to increase CephFS
speed and decrease latency.
Usually I build (WAL+DB on NVMe with SAS/SATA SSDs) and deploy the MDS and
MONs on the same servers. This time a weird idea came to my mind, and I think
it has great potential and will perform better on paper, with my limited
knowledge.

I have 5 racks, and the 3rd "middle" rack is my storage and management rack.

- At RACK-3 I'm going to locate 8x 1U OSD servers (spec: 2x E5-2690V4, 256GB,
  4x 25G, 2x 1.6TB PCIe NVMe "MZ-PLK3T20", 8x 4TB SATA SSD).

- My CephFS kernel clients are 40x GPU nodes located at RACK-1,2,4,5.

With my current workflow, all the clients:
1- visit the rack data switch,
2- jump to the main VPC switch via 2x100G,
3- talk to the MDS servers,
4- go back to the client with the answer,
5- to access data, follow the same hops and visit the OSDs every time.

If I deploy a separate metadata pool using 4x MDS servers at the top of
RACK-1,2,4,5 (spec: 2x E5-2690V4, 128GB, 2x 10G (public), 2x 25G (cluster),
2x 960GB U.2 NVMe "MZ-PLK3T20"), then all the clients will make requests
directly to an in-rack MDS one hop away, and if the request is metadata-only,
the MDS node doesn't need to redirect the request to the OSD nodes.
Also, locating MDS servers with a separate metadata pool across all the racks
will reduce the high load on the main VPC switch at RACK-3.

If I'm not missing anything, only the recovery workload will suffer with this
topology.

What do you think?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
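For the per-rack MDS idea, pinning a directory subtree to a specific MDS rank is done from a client with an extended attribute. A short sketch; the mount path and rank numbers are examples:

# pin everything under this directory to MDS rank 0
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/rack1

# a value of -1 removes the pin again
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/rack1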
[ceph-users] Re: Performance improvement suggestion
Hello.

I didn't test it personally, but what about a rep-1 write cache pool on NVMe,
backed by another rep-2 pool? In theory it has the potential to do exactly
what you are looking for.

On Thu, 1 Feb 2024 at 20:54, quag...@bol.com.br wrote:

> Ok Anthony,
>
> I understood what you said. I also believe in all the professional history
> and experience you have.
>
> Anyway, could there be a configuration flag to make this happen?
> As well as those that already exist: "--yes-i-really-mean-it".
>
> This way, the storage pattern would remain as it is. However, it would
> allow situations like the one I mentioned to be possible.
> This situation will permit some rules to be relaxed (even if they are not
> ok at first).
> Likewise, there are already situations like lazyio that make some
> exceptions to standard procedures.
>
> Remembering: it's just a suggestion.
> If this type of functionality is not interesting, it is ok.
>
> Rafael.
>
> --
>
> From: "Anthony D'Atri"
> Sent: 2024/02/01 12:10:30
> To: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Performance improvement suggestion
>
>> I didn't say I would accept the risk of losing data.
>
> That's implicit in what you suggest, though.
>
>> I just said that it would be interesting if the objects were first
>> recorded only in the primary OSD.
>
> What happens when that host / drive smokes before it can replicate? What
> happens if a secondary OSD gets a read op before the primary updates it?
> Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
>
> This is similar to why RoC HBAs (which are a badly outdated thing to begin
> with) will only enter writeback mode if they have a BBU / supercap -- and
> of course if their firmware and hardware isn't pervasively buggy. Guess how
> I know this?
>
>> This way it would greatly increase performance (both for iops and
>> throughput).
>
> It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
>
> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
> the network resources between the client and the servers.
>
>> Later (in the background), record the replicas. This situation would
>> avoid leaving users/software waiting for the recording response from all
>> replicas when the storage is overloaded.
>
> If one makes the mistake of using HDDs, they're going to be overloaded no
> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
> stone. Throughput is going to be limited by the SATA interface and seeking
> no matter what.
>
>> Where I work, performance is very important and we don't have money to
>> make an entire cluster only with NVMe.
>
> If there isn't money, then it isn't very important. But as I've written
> before, NVMe clusters *do not cost appreciably more than spinners* unless
> your procurement processes are bad. In fact they can cost significantly
> less. This is especially true with object storage and archival where one
> can leverage QLC.
>
> * Buy generic drives from a VAR, not channel drives through a chassis
> brand. Far less markup, and moreover you get the full 5 year warranty, not
> just 3 years. And you can painlessly RMA drives yourself - you don't have
> to spend hours going back and forth with $chassisvendor's TAC arguing about
> every single RMA. I've found that this is so bad that it is more economical
> to just throw away a failed component worth < USD 500 than to RMA it. Do
> you pay for extended warranty / support? That's expensive too.
>
> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
> extreme markups. List prices as high as USD 2000. Per server, eschewing
> those abominations makes up for a lot of the drive-only unit economics.
>
> * But this is the part that lots of people don't get: You don't just stack
> up the drives on a desk and use them. They go into *servers* that cost
> money and *racks* that cost money. They take *power* that costs money.
>
> * $ / IOPS are FAR better for ANY SSD than for HDDs
>
> * RUs cost money, so do chassis and switches
>
> * Drive failures cost money
>
> * So does having your people and applications twiddle their thumbs waiting
> for stuff to happen. I worked for a supercomputer company who put
> low-memory low-end diskless workstations on engineers' desks. They spent
> lots of time doing nothing waiting for their applications to respond. This
> company no longer exists.
>
> * So does the risk of taking *weeks* to heal from a drive failure
>
> Punch honest numbers into
> https://www.snia.org/forums/cmsi/programs/TCOcalc
>
> I walked through this with a certain global company. QLC SSDs were
> demonstrated to have like 30% lower TCO than spinners. Part of the equation
> is that they were accustomed to limiting HDD size to 8 TB because of the
> bottlenecks, and thus
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Thank you Frank.

My focus is actually performance tuning. After your mail, I started to
investigate the client side. I think the kernel tunings work great now; after
the tunings I didn't get any warning again. Now I will continue with the
performance tunings.

I decided to distribute subvolumes across multiple pools instead of
multi-active MDS. With this method I will have multiple MDSes and [1x CephFS
client for each pool / host].

To hide subvolume UUIDs, I'm using bind mounts ("mount --bind"), and I wonder
whether they can create performance issues on the CephFS clients. (See the
sketch after the quote below.)

Best regards.

Frank Schilder wrote on Sat, 27 Jan 2024 at 12:34:

> Hi Özkan,
>
>> ... The client is actually at idle mode and there is no reason to fail
>> at all. ...
>
> If you re-read my message, you will notice that I wrote that
>
> - it's not the client failing, it's a false positive error flag that
> - is not cleared for idle clients.
>
> You seem to encounter exactly this situation, and a simple
>
> echo 3 > /proc/sys/vm/drop_caches
>
> would probably have cleared the warning. There is nothing wrong with your
> client, it's an issue with the client-MDS communication protocol that is
> probably still under review. You will encounter these warnings every now
> and then until it's fixed.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
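On the bind-mount question above: a bind mount is just a second name for the same dentry tree in the VFS, so by itself it should add no measurable I/O overhead on the CephFS client. A sketch of the pattern, with hypothetical paths:

# expose a subvolume under a stable path without showing the UUID
mount --bind /mnt/cephfs/volumes/_nogroup/user1/<subvolume-uuid> /home/user1

# or persistently via /etc/fstab:
# /mnt/cephfs/volumes/_nogroup/user1/<subvolume-uuid>  /home/user1  none  bind  0  0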
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
, BW=681MiB/s (714MB/s)(3072MiB/4511msec); 0 zone resets
BS=32K  write: IOPS=12.1k, BW=378MiB/s (396MB/s)(3072MiB/8129msec); 0 zone resets
BS=16K  write: IOPS=12.7k, BW=198MiB/s (208MB/s)(3072MiB/15487msec); 0 zone resets
BS=4K   write: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0 zone resets

Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
BS=1M   read: IOPS=1113, BW=1114MiB/s (1168MB/s)(3072MiB/2758msec)
BS=128K read: IOPS=8953, BW=1119MiB/s (1173MB/s)(3072MiB/2745msec)
BS=64K  read: IOPS=17.9k, BW=1116MiB/s (1170MB/s)(3072MiB/2753msec)
BS=32K  read: IOPS=35.1k, BW=1096MiB/s (1150MB/s)(3072MiB/2802msec)
BS=16K  read: IOPS=69.4k, BW=1085MiB/s (1138MB/s)(3072MiB/2831msec)
BS=4K   read: IOPS=112k, BW=438MiB/s (459MB/s)(3072MiB/7015msec)

*Everything looks good except the 4K speeds:*

Seq Write  - BS=4K write: IOPS=8661, BW=33.8MiB/s (35.5MB/s)(3072MiB/90801msec); 0 zone resets
Rand Write - BS=4K write: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0 zone resets

What do you think?

Özkan Göksu wrote on Sat, 27 Jan 2024 at 04:08:

> Wow, I noticed something!
>
> To prevent RAM overflow with GPU training allocations, I'm using a 2TB
> Samsung 870 EVO for swap.
>
> As you can see below, swap usage was 18Gi while the server was idle, which
> means the ceph client maybe hits latency because of the swap usage.
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# free -h
>        total  used  free  shared  buff/cache  available
> Mem:    62Gi  34Gi  27Gi   0.0Ki       639Mi        27Gi
> Swap:  1.8Ti  18Gi 1.8Ti
>
> I decided to play around with kernel parameters to prevent swap usage:
>
>> kernel.shmmax = 60654764851  # Maximum shared segment size in bytes
>> kernel.shmall = 16453658     # Maximum number of shared memory segments in pages
>> vm.nr_hugepages = 4096       # Increase Transparent Huge Pages (THP) Defrag:
>> vm.swappiness = 0            # Set vm.swappiness to 0 to minimize swapping
>> vm.min_free_kbytes = 1048576 # required free memory (set to 1% of physical ram)
>
> I rebooted the server, and after the reboot swap usage is 0 as expected.
>
> To give it a try I started iobench.sh
> (https://github.com/ozkangoksu/benchmark/blob/main/iobench.sh).
> This client has a 1G NIC only. As you can see below, other than the 4K
> block size, the ceph client can saturate the NIC.
>
> root@bmw-m4:~# nicstat -MUz 1
>     Time     Int  rMbps  wMbps    rPk/s    wPk/s    rAvs   wAvs  %rUtil  %wUtil
> 01:04:48  ens1f0  936.9  92.90  91196.8  60126.3  1346.6  202.5    98.2    9.74
>
> root@bmw-m4:/mounts/ud-data/benchuser1/96f13211-c37f-42db-8d05-f3255a05129e/testdir# bash iobench.sh
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M   write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27395msec); 0 zone resets
> BS=128K write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27462msec); 0 zone resets
> BS=64K  write: IOPS=1758, BW=110MiB/s (115MB/s)(3072MiB/27948msec); 0 zone resets
> BS=32K  write: IOPS=3542, BW=111MiB/s (116MB/s)(3072MiB/27748msec); 0 zone resets
> BS=16K  write: IOPS=6839, BW=107MiB/s (112MB/s)(3072MiB/28747msec); 0 zone resets
> BS=4K   write: IOPS=8473, BW=33.1MiB/s (34.7MB/s)(3072MiB/92813msec); 0 zone resets
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M   read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27386msec)
> BS=128K read: IOPS=895, BW=112MiB/s (117MB/s)(3072MiB/27431msec)
> BS=64K  read: IOPS=1788, BW=112MiB/s (117MB/s)(3072MiB/27486msec)
> BS=32K  read: IOPS=3561, BW=111MiB/s (117MB/s)(3072MiB/27603msec)
> BS=16K  read: IOPS=6924, BW=108MiB/s (113MB/s)(3072MiB/28392msec)
> BS=4K   read: IOPS=21.3k, BW=83.3MiB/s (87.3MB/s)(3072MiB/36894msec)
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M   write: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27406msec); 0 zone resets
> BS=128K write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27466msec); 0 zone resets
> BS=64K  write: IOPS=1781, BW=111MiB/s (117MB/s)(3072MiB/27591msec); 0 zone resets
> BS=32K  write: IOPS=3545, BW=111MiB/s (116MB/s)(3072MiB/27729msec); 0 zone resets
> BS=16K  write: IOPS=6823, BW=107MiB/s (112MB/s)(3072MiB/28814msec); 0 zone resets
> BS=4K   write: IOPS=12.7k, BW=49.8MiB/s (52.2MB/s)(3072MiB/61694msec); 0 zone resets
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> BS=1M   read: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27388msec)
> BS=128K read: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27479msec)
> BS=64K  read: IOPS=1784, BW=112MiB/s (117MB/s)(3072MiB/27547msec)
> BS=32K  read: IOPS=3559, BW=111MiB/s (117MB/s)(3072
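To keep the quoted tunings across reboots instead of setting them by hand (the values themselves are taken from the message above, not a recommendation), they can go into a sysctl drop-in. A minimal sketch:

# /etc/sysctl.d/99-ceph-client-tuning.conf
kernel.shmmax = 60654764851
kernel.shmall = 16453658
vm.nr_hugepages = 4096
vm.swappiness = 0
vm.min_free_kbytes = 1048576

# apply without a reboot
sysctl --system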
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
       total  used  free  shared  buff/cache  available
Mem:    62Gi  11Gi  50Gi   3.0Mi       1.0Gi       49Gi
Swap:  1.8Ti    0B 1.8Ti

I started to feel we are getting closer :)

Özkan Göksu wrote on Sat, 27 Jan 2024 at 02:58:

> I started to investigate my clients.
>
> For example:
>
> root@ud-01:~# ceph health detail
> HEALTH_WARN 1 clients failing to respond to cache pressure
> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>     mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to
> cache pressure client_id: 1275577
>
> root@ud-01:~# ceph fs status
> ud-data - 86 clients
> ===
> RANK  STATE   MDS                   ACTIVITY     DNS    INOS   DIRS  CAPS
>  0    active  ud-data.ud-02.xcoojt  Reqs: 34 /s  2926k  2827k  155k  1157k
>
> ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid:
> \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases),
> request_load_avg: \(.request_load_avg), num_completed_requests:
> \(.num_completed_requests), num_completed_flushes:
> \(.num_completed_flushes)"' | sort -n -t: -k3
>
> clientid: *1275577*= num_caps: 12312, num_leases: 0, request_load_avg: 0, num_completed_requests: 0, num_completed_flushes: 1
> clientid: 1275571= num_caps: 16307, num_leases: 1, request_load_avg: 2101, num_completed_requests: 0, num_completed_flushes: 3
> clientid: 1282130= num_caps: 26337, num_leases: 3, request_load_avg: 116, num_completed_requests: 0, num_completed_flushes: 1
> clientid: 1191789= num_caps: 32784, num_leases: 0, request_load_avg: 1846, num_completed_requests: 0, num_completed_flushes: 0
> clientid: 1275535= num_caps: 79825, num_leases: 2, request_load_avg: 133, num_completed_requests: 8, num_completed_flushes: 8
> clientid: 1282142= num_caps: 80581, num_leases: 6, request_load_avg: 125, num_completed_requests: 2, num_completed_flushes: 6
> clientid: 1275532= num_caps: 87836, num_leases: 3, request_load_avg: 190, num_completed_requests: 2, num_completed_flushes: 6
> clientid: 1275547= num_caps: 94129, num_leases: 4, request_load_avg: 149, num_completed_requests: 2, num_completed_flushes: 4
> clientid: 1275553= num_caps: 96460, num_leases: 4, request_load_avg: 155, num_completed_requests: 2, num_completed_flushes: 8
> clientid: 1282139= num_caps: 108882, num_leases: 25, request_load_avg: 99, num_completed_requests: 2, num_completed_flushes: 4
> clientid: 1275538= num_caps: 437162, num_leases: 0, request_load_avg: 101, num_completed_requests: 2, num_completed_flushes: 0
>
> --
>
> *MY CLIENT:*
>
> The client is actually at idle mode and there is no reason to fail at all.
>
> root@bmw-m4:~# apt list --installed | grep ceph
> ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed]
> libcephfs2/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed,automatic]
> python3-ceph-argparse/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed,automatic]
> python3-ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 all [installed,automatic]
> python3-cephfs/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed,automatic]
>
> Let's check metrics and stats:
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# cat metrics
> item                          total
> ------------------------------------------
> opened files  / total inodes  2 / 12312
> pinned i_caps / total inodes  12312 / 12312
> opened inodes / total inodes  1 / 12312
>
> item      total   avg_lat(us)  min_lat(us)  max_lat(us)  stdev(us)
> ------------------------------------------------------------------
> read      22283   44409        430          1804853      15619
> write     112702  419725       3658         8879541      6008
> metadata  353322  5712         15           4917903      5357
>
> item   total   avg_sz(bytes)  min_sz(bytes)  max_sz(bytes)  total_sz(bytes)
> ---------------------------------------------------------------------------
> read   22283   1701940        1              4194304        37924318602
> write  112702  246211         1              4194304        27748469309
>
> item     total  miss   hit
> --------------------------------
> d_lease  62     63627  28564698
> caps     12312  36658  44568261
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577# cat bd
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
_split_pmd 22451
thp_split_pud 0
thp_zero_page_alloc 1
thp_zero_page_alloc_failed 0
thp_swpout 22332
thp_swpout_fallback 0
balloon_inflate 0
balloon_deflate 0
balloon_migrate 0
swap_ra 25777929
swap_ra_hit 25658825
direct_map_level2_splits 1249
direct_map_level3_splits 49
nr_unstable 0

Özkan Göksu wrote on Sat, 27 Jan 2024 at 02:36:

> Hello Frank.
>
> [...]
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hello Frank.

I have 84 clients (high-end servers) with Ubuntu 20.04.5 LTS - Kernel: Linux
5.4.0-125-generic.

My cluster is 17.2.6 quincy.
I have some client nodes with "ceph-common/stable,now 17.2.7-1focal", and I
wonder whether using newer-version clients is the main problem.
Maybe I have a communication error. For example, I hit this problem and I
cannot collect client stats: "https://github.com/ceph/ceph/pull/52127/files"

Best regards.

Frank Schilder wrote on Fri, 26 Jan 2024 at 14:53:

> Hi, this message is one of those that are often spurious. I don't recall
> in which thread/PR/tracker I read it, but the story was something like
> that:
>
> If an MDS gets under memory pressure it will request dentry items back
> from *all* clients, not just the active ones or the ones holding many of
> them. If you have a client that's below the min-threshold for dentries
> (it's one of the client/mds tuning options), it will not respond. This
> client will be flagged as not responding, which is a false positive.
>
> I believe the devs are working on a fix to get rid of these spurious
> warnings. There is a "bug/feature" in the MDS that does not clear this
> warning flag for inactive clients. Hence, the message hangs and never
> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches"
> on the client. However, except for being annoying in the dashboard, it has
> no performance or otherwise negative impact.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________
> From: Eugen Block
> Sent: Friday, January 26, 2024 10:05 AM
> To: Özkan Göksu
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure
> (quincy:17.2.6)
>
> Performance for small files is more about IOPS rather than throughput,
> and the IOPS in your fio tests look okay to me. What you could try is
> to split the PGs to get around 150 or 200 PGs per OSD. You're
> currently at around 60 according to the ceph osd df output. Before you
> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data | head'?
> I don't need the whole output, just to see how many objects each PG
> has. We had a case once where that helped, but it was an older cluster
> and the pool was backed by HDDs and separate rocksDB on SSDs. So this
> might not be the solution here, but it could improve things as well.
>
> Quoting Özkan Göksu:
>
>> Every user has a 1x subvolume and I only have 1 pool.
>>
>> [...]
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
This is client-side metrics from a client warned with "failing to respond to
cache pressure".

root@datagen-27:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1282187# cat bdi/stats
BdiWriteback:             0 kB
BdiReclaimable:           0 kB
BdiDirtyThresh:           0 kB
DirtyThresh:       35979376 kB
BackgroundThresh:  17967720 kB
BdiDirtied:         3071616 kB
BdiWritten:         3036864 kB
BdiWriteBandwidth:       20 kBps
b_dirty:                  0
b_io:                     0
b_more_io:                0
b_dirty_time:             0
bdi_list:                 1
state:                    1

root@d27:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1282187# cat metrics
item                          total
------------------------------------------
opened files  / total inodes  4 / 14129
pinned i_caps / total inodes  14129 / 14129
opened inodes / total inodes  2 / 14129

item      total    avg_lat(us)  min_lat(us)  max_lat(us)  stdev(us)
-------------------------------------------------------------------
read      1218753  3116         208          8741271      2154
write     34945    24003        3017         2191493      16156
metadata  1703642  8395         127          17936115     1497

item   total    avg_sz(bytes)  min_sz(bytes)  max_sz(bytes)  total_sz(bytes)
-----------------------------------------------------------------------------
read   1218753  227009         1              4194304        276668475618
write  34945    85860          1              4194304        3000382055

item     total  miss    hit
----------------------------------
d_lease  306    19110   3317071969
caps     14129  145404  3761682333

Özkan Göksu wrote on Thu, 25 Jan 2024 at 20:25:

> Every user has a 1x subvolume and I only have 1 pool.
>
> [...]
>
> Eugen Block wrote on Thu, 25 Jan 2024 at 19:06:
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Every user has a 1x subvolume and I only have 1 pool.
At the beginning we were using each subvolume for the LDAP home directory +
user data. When a user logged in to any docker on any host, it was using the
cluster for home, and for user-related data we had a second directory in the
same subvolume. From time to time users were feeling a very slow home
environment, and after a month it became almost impossible to use home. VNC
sessions became unresponsive and slow, etc.

2 weeks ago I had to migrate home to a ZFS storage, and now the overall
performance is better with only user_data, without home.
But the performance is still not as good as I expected because of the
problems related to the MDS. The usage is low but allocation is high, and CPU
usage is high. You saw the IO op/s; it's nothing, but allocation is high.

I developed a fio benchmark script and ran it on 4x test servers at the same
time; the results are below.
Script:
https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh

https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt

While running the benchmark, I took sample values for each type of iobench run.

Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
  client: 70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
  client: 60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
  client: 13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr

Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
  client: 1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
  client: 370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr

Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
  client: 63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
  client: 14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
  client: 6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr

Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
  client: 317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
  client: 2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
  client: 4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
  client: 2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr

It seems I only have problems with the 4K, 8K and 16K sector sizes.

Eugen Block wrote on Thu, 25 Jan 2024 at 19:06:

> I understand that your MDS shows a high CPU usage, but other than that
> what is your performance issue? Do users complain? Do some operations
> take longer than expected? Are OSDs saturated during those phases?
> Because the cache pressure messages don't necessarily mean that users
> will notice.
> MDS daemons are single-threaded so that might be a bottleneck. In that
> case multi-active mds might help, which you already tried and
> experienced OOM killers. But you might have to disable the mds
> balancer as someone else mentioned. And then you could think about
> pinning, is it possible to split the CephFS into multiple
> subdirectories and pin them to different ranks?
> But first I'd still like to know what the performance issue really is.
>
> Quoting Özkan Göksu:
>
>> I will try my best to explain my situation.
>>
>> I don't have a separate MDS server. I have 5 identical nodes; 3 of them
>> are MONs, and I use the other 2 as active and standby MDS. (Currently I
>> have leftovers from max_mds 4.)
>>
>> root@ud-01:~# ceph -s
>>   cluster:
>>     id: e42fd4b0-313b-11ee-9a00-31da71873773
>>     health: HEALTH_WARN
>>             1 clients failing to respond to cache pressure
>>
>>   services:
>>     mon: 3 daemons, quorum ud-01,ud-02,ud-03 (age 9d)
>>     mgr: ud-01.qycnol(active, since 8d), standbys: ud-02.tfhqfd
>>     mds: 1/1 daemons up, 4 standby
>>     osd: 80 osds: 80 up (since 9d), 80 in (since 5M)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools: 3 pools, 2305 pgs
>>     objects: 106.58M objects, 25 TiB
>>     usage: 45 TiB used, 101 TiB / 146 TiB avail
>>     pgs: 2303 active+clean
>>          2 active+clean+scrubbing+deep
>>
>>   io:
>>     client: 16 MiB/s rd, 3.4 MiB/s wr, 77 op/s rd, 23 op/s wr
>>
>> --
>>
>> root@ud-01:~# ceph fs status
>> ud-data - 84 clients
>> ===
>> RANK  STATE   MDS                   ACTIVITY     DNS    INOS   DIRS  CAPS
>>  0    active  ud-data.ud-02.xcoojt  Reqs: 40 /s  2579k
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
                                                                       66  up  osd.53
54  ssd  1.81929  1.0  1.8 TiB  550 GiB  544 GiB  1.5 GiB  4.3 GiB  1.3 TiB  29.54  0.96  55  up  osd.54
55  ssd  1.81929  1.0  1.8 TiB  527 GiB  522 GiB  1.3 GiB  4.0 GiB  1.3 TiB  28.29  0.92  52  up  osd.55
56  ssd  1.81929  1.0  1.8 TiB  525 GiB  519 GiB  1.2 GiB  4.1 GiB  1.3 TiB  28.16  0.91  52  up  osd.56
57  ssd  1.81929  1.0  1.8 TiB  615 GiB  609 GiB  2.3 GiB  4.2 GiB  1.2 TiB  33.03  1.07  65  up  osd.57
58  ssd  1.81929  1.0  1.8 TiB  527 GiB  522 GiB  1.6 GiB  3.7 GiB  1.3 TiB  28.31  0.92  55  up  osd.58
59  ssd  1.81929  1.0  1.8 TiB  615 GiB  609 GiB  1.2 GiB  4.6 GiB  1.2 TiB  33.01  1.07  60  up  osd.59
60  ssd  1.81929  1.0  1.8 TiB  594 GiB  588 GiB  1.2 GiB  4.4 GiB  1.2 TiB  31.88  1.03  59  up  osd.60
61  ssd  1.81929  1.0  1.8 TiB  616 GiB  610 GiB  1.9 GiB  4.1 GiB  1.2 TiB  33.04  1.07  64  up  osd.61
62  ssd  1.81929  1.0  1.8 TiB  620 GiB  614 GiB  1.9 GiB  4.4 GiB  1.2 TiB  33.27  1.08  63  up  osd.62
63  ssd  1.81929  1.0  1.8 TiB  527 GiB  522 GiB  1.5 GiB  4.0 GiB  1.3 TiB  28.30  0.92  53  up  osd.63
-11       29.10864    -   29 TiB  9.0 TiB  8.9 TiB   23 GiB   65 GiB   20 TiB  30.91  1.00   -      host ud-05
64  ssd  1.81929  1.0  1.8 TiB  608 GiB  601 GiB  2.3 GiB  4.5 GiB  1.2 TiB  32.62  1.06  65  up  osd.64
65  ssd  1.81929  1.0  1.8 TiB  606 GiB  601 GiB  628 MiB  4.2 GiB  1.2 TiB  32.53  1.06  57  up  osd.65
66  ssd  1.81929  1.0  1.8 TiB  583 GiB  578 GiB  1.3 GiB  4.3 GiB  1.2 TiB  31.31  1.02  57  up  osd.66
67  ssd  1.81929  1.0  1.8 TiB  537 GiB  533 GiB  436 MiB  3.6 GiB  1.3 TiB  28.82  0.94  50  up  osd.67
68  ssd  1.81929  1.0  1.8 TiB  541 GiB  535 GiB  2.5 GiB  3.8 GiB  1.3 TiB  29.04  0.94  59  up  osd.68
69  ssd  1.81929  1.0  1.8 TiB  606 GiB  601 GiB  1.1 GiB  4.4 GiB  1.2 TiB  32.55  1.06  59  up  osd.69
70  ssd  1.81929  1.0  1.8 TiB  604 GiB  598 GiB  1.8 GiB  4.1 GiB  1.2 TiB  32.44  1.05  63  up  osd.70
71  ssd  1.81929  1.0  1.8 TiB  606 GiB  600 GiB  1.9 GiB  4.5 GiB  1.2 TiB  32.53  1.06  62  up  osd.71
72  ssd  1.81929  1.0  1.8 TiB  602 GiB  598 GiB  612 MiB  4.1 GiB  1.2 TiB  32.33  1.05  57  up  osd.72
73  ssd  1.81929  1.0  1.8 TiB  571 GiB  565 GiB  1.8 GiB  4.5 GiB  1.3 TiB  30.65  0.99  58  up  osd.73
74  ssd  1.81929  1.0  1.8 TiB  608 GiB  602 GiB  1.8 GiB  4.2 GiB  1.2 TiB  32.62  1.06  61  up  osd.74
75  ssd  1.81929  1.0  1.8 TiB  536 GiB  531 GiB  1.9 GiB  3.5 GiB  1.3 TiB  28.80  0.93  57  up  osd.75
76  ssd  1.81929  1.0  1.8 TiB  605 GiB  599 GiB  1.4 GiB  4.5 GiB  1.2 TiB  32.48  1.05  60  up  osd.76
77  ssd  1.81929  1.0  1.8 TiB  537 GiB  532 GiB  1.2 GiB  3.9 GiB  1.3 TiB  28.84  0.94  52  up  osd.77
78  ssd  1.81929  1.0  1.8 TiB  525 GiB  520 GiB  1.3 GiB  3.8 GiB  1.3 TiB  28.20  0.92  52  up  osd.78
79  ssd  1.81929  1.0  1.8 TiB  536 GiB  531 GiB  1.1 GiB  3.3 GiB  1.3 TiB  28.76  0.93  53  up  osd.79
               TOTAL  146 TiB   45 TiB   44 TiB  119 GiB  333 GiB  101 TiB  30.81
MIN/MAX VAR: 0.91/1.08  STDDEV: 1.90

Eugen Block wrote on Thu, 25 Jan 2024 at 16:52:

> There is no definitive answer wrt mds tuning. As it is everywhere
> mentioned, it's about finding the right setup for your specific workload.
> If you can synthesize your workload (maybe scale down a bit), try
> optimizing it in a test cluster without interrupting your developers too
> much.
> But what you haven't explained yet is: what are you experiencing as a
> performance issue? Do you have numbers or a detailed description?
> From the fs status output you didn't seem to have too much activity
> going on (around 140 requests per second), but that's probably not the
> usual traffic? What does ceph report in its client IO output?
> Can you paste the 'ceph osd df' output as well?
> Do you have dedicated MDS servers or are they colocated with other
> services?
>
> Quoting Özkan Göksu:
>
>> Hello Eugen.
>>
>> I read all of your MDS related topics and thank you so much for your
>> effort on this.
>>
>> [...]
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hello Eugen.

I read all of your MDS-related topics, and thank you so much for your effort
on this. There is not much information, and I couldn't find an MDS tuning
guide at all. It seems that you are the right person to discuss MDS debugging
and tuning with.

Do you have any documents, or may I learn what is the proper way to debug the
MDS and the clients? Which debug logs will guide me to understand the
limitations and help me tune according to the data flow?

While searching, I found this:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/
Quote: "A user running VSCodium, keeping 15k caps open.. the opportunistic
caps recall eventually starts recalling those but the (el7 kernel) client
won't release them. Stopping Codium seems to be the only way to release."

Because of this I think I also need to play around with the client side.
My main goal is increasing speed and reducing latency, and I wonder whether
these ideas are correct or not:

- Maybe I need to increase the client-side cache size, because via each
  client multiple users request a lot of objects, and clearly the
  client_cache_size=16 default is not enough.
- Maybe I need to increase the client-side maximum cache limits for objects
  ("client_oc_max_objects=1000 to 1") and data ("client_oc_size=200mi to
  400mi").
- The client cache cleaning threshold is not aggressive enough to keep the
  free cache size in the desired range. I need to make it aggressive, but
  this should not reduce speed or increase latency.

mds_cache_memory_limit=4gi to 16gi
client_oc_max_objects=1000 to 1
client_oc_size=200mi to 400mi
client_permissions=false  # to reduce latency
client_cache_size=16 to 128

What do you think?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
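One caveat on the list above, as far as I know: the client_* options only apply to ceph-fuse/libcephfs clients; the kernel client has its own page cache and cap handling and ignores them. The MDS-side option can be set cluster-wide through the config database. A hedged sketch (the byte value is just the 16 GiB upper end mentioned above):

# MDS side, value in bytes (16 GiB)
ceph config set mds mds_cache_memory_limit 17179869184

# ceph-fuse / libcephfs clients read these from ceph.conf, e.g.:
# [client]
#     client_oc_size = 419430400
#     client_permissions = false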
[ceph-users] Re: cephfs-top causes 16 mgr modules have recently crashed
Hello Jos.

I checked the diff and noticed the difference:
https://github.com/ceph/ceph/pull/52127/files

Thank you for the guide link and for the fix. Have a great day.

Regards.

On Tue, 23 Jan 2024 at 11:07, Jos Collin wrote:

> This fix is in the mds.
> I think you need to read
> https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade.
>
> On 23/01/24 12:19, Özkan Göksu wrote:
>
>> Hello Jos.
>> Thank you for the reply.
>>
>> [...]

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs-top causes 16 mgr modules have recently crashed
Hello Jos.
Thank you for the reply.

I can upgrade to 17.2.7, but I wonder whether I can upgrade only MON+MGR for this issue, or whether I need to upgrade all the parts. Otherwise I need to wait a few weeks; I don't want to request maintenance during delivery time.

root@ud-01:~# ceph orch upgrade ls
{
    "image": "quay.io/ceph/ceph",
    "registry": "quay.io",
    "bare_image": "ceph/ceph",
    "versions": [
        "18.2.1",
        "18.2.0",
        "18.1.3",
        "18.1.2",
        "18.1.1",
        "18.1.0",
        "17.2.7",
        "17.2.6",
        "17.2.5",
        "17.2.4",
        "17.2.3",
        "17.2.2",
        "17.2.1",
        "17.2.0"
    ]
}

Best regards

On Tue, Jan 23, 2024 at 07:42, Jos Collin wrote:

> Please have this fix: https://tracker.ceph.com/issues/59551. It's
> backported to quincy.
>
> On 23/01/24 03:11, Özkan Göksu wrote:
> > Hello
> >
> > When I run cephfs-top it causes an mgr module crash. Can you please tell me
> > the reason?
> >
> > My environment:
> > Ceph version: 17.2.6
> > Operating System: Ubuntu 22.04.2 LTS
> > Kernel: Linux 5.15.0-84-generic
> >
> > I created the cephfs-top user with the following command:
> > ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r' > /etc/ceph/ceph.client.fstop.keyring
> >
> > This is the crash report:
> >
> > root@ud-01:~# ceph crash info 2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801
> > {
> >     "backtrace": [
> >         "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in notify\n    self.fs_perf_stats.notify_cmd(notify_id)",
> >         "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line 177, in notify_cmd\n    metric_features = int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"], 16)",
> >         "ValueError: invalid literal for int() with base 16: '0x'"
> >     ],
> >     "ceph_version": "17.2.6",
> >     "crash_id": "2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801",
> >     "entity_name": "mgr.ud-01.qycnol",
> >     "mgr_module": "stats",
> >     "mgr_module_caller": "ActivePyModule::notify",
> >     "mgr_python_exception": "ValueError",
> >     "os_id": "centos",
> >     "os_name": "CentOS Stream",
> >     "os_version": "8",
> >     "os_version_id": "8",
> >     "process_name": "ceph-mgr",
> >     "stack_sig": "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
> >     "timestamp": "2024-01-22T21:25:59.313305Z",
> >     "utsname_hostname": "ud-01",
> >     "utsname_machine": "x86_64",
> >     "utsname_release": "5.15.0-84-generic",
> >     "utsname_sysname": "Linux",
> >     "utsname_version": "#93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023"
> > }
> >
> > Best regards.
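Since the fix lives in the MDS (per the reply above) while the crash surfaces in the mgr stats module, a staggered upgrade along the lines of the linked cephadm documentation might look like the sketch below; it assumes the v17.2.7 image tag, and cephadm expects the mgr daemons to be upgraded first, with the remaining daemon types following in a later maintenance window:

ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.7 --daemon-types mgr
ceph orch upgrade status
# later, during a maintenance window, bring everything else to the same release:
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.7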
[ceph-users] cephfs-top causes 16 mgr modules have recently crashed
Hello

When I run cephfs-top it causes an mgr module crash. Can you please tell me the reason?

My environment:
Ceph version: 17.2.6
Operating System: Ubuntu 22.04.2 LTS
Kernel: Linux 5.15.0-84-generic

I created the cephfs-top user with the following command:
ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r' > /etc/ceph/ceph.client.fstop.keyring

This is the crash report:

root@ud-01:~# ceph crash info 2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in notify\n    self.fs_perf_stats.notify_cmd(notify_id)",
        "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line 177, in notify_cmd\n    metric_features = int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"], 16)",
        "ValueError: invalid literal for int() with base 16: '0x'"
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801",
    "entity_name": "mgr.ud-01.qycnol",
    "mgr_module": "stats",
    "mgr_module_caller": "ActivePyModule::notify",
    "mgr_python_exception": "ValueError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
    "timestamp": "2024-01-22T21:25:59.313305Z",
    "utsname_hostname": "ud-01",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-84-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023"
}

Best regards.
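Until the upgrade lands, one possible mitigation is to stop exercising the module that crashes; a sketch, assuming nothing else on the cluster depends on the stats module, and it can be re-enabled after upgrading:

ceph mgr module disable stats    # cephfs-top requires this module, so cephfs-top must not be run meanwhile
ceph crash archive-all           # clears the "mgr modules have recently crashed" health warning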
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
: 0,
    "bluestore_Buffer_items": 0,
    "bluestore_Extent_bytes": 0,
    "bluestore_Extent_items": 0,
    "bluestore_Blob_bytes": 0,
    "bluestore_Blob_items": 0,
    "bluestore_SharedBlob_bytes": 0,
    "bluestore_SharedBlob_items": 0,
    "bluestore_inline_bl_bytes": 0,
    "bluestore_inline_bl_items": 0,
    "bluestore_fsck_bytes": 0,
    "bluestore_fsck_items": 0,
    "bluestore_txc_bytes": 0,
    "bluestore_txc_items": 0,
    "bluestore_writing_deferred_bytes": 0,
    "bluestore_writing_deferred_items": 0,
    "bluestore_writing_bytes": 0,
    "bluestore_writing_items": 0,
    "bluefs_bytes": 0,
    "bluefs_items": 0,
    "bluefs_file_reader_bytes": 0,
    "bluefs_file_reader_items": 0,
    "bluefs_file_writer_bytes": 0,
    "bluefs_file_writer_items": 0,
    "buffer_anon_bytes": 2114708,
    "buffer_anon_items": 825,
    "buffer_meta_bytes": 88,
    "buffer_meta_items": 1,
    "osd_bytes": 0,
    "osd_items": 0,
    "osd_mapbl_bytes": 0,
    "osd_mapbl_items": 0,
    "osd_pglog_bytes": 0,
    "osd_pglog_items": 0,
    "osdmap_bytes": 25728,
    "osdmap_items": 946,
    "osdmap_mapping_bytes": 0,
    "osdmap_mapping_items": 0,
    "pgmap_bytes": 0,
    "pgmap_items": 0,
    "mds_co_bytes": 8173443932,
    "mds_co_items": 109004579,
    "unittest_1_bytes": 0,
    "unittest_1_items": 0,
    "unittest_2_bytes": 0,
    "unittest_2_items": 0
},
"objecter": {
    "op_active": 0,
    "op_laggy": 0,
    "op_send": 13563810,
    "op_send_bytes": 21613887606,
    "op_resend": 1,
    "op_reply": 13563809,
    "oplen_avg": {
        "avgcount": 13563809,
        "sum": 31377549
    },
    "op": 13563809,
    "op_r": 10213362,
    "op_w": 3350447,
    "op_rmw": 0,
    "op_pg": 0,
    "osdop_stat": 75945,
    "osdop_create": 1139381,
    "osdop_read": 15848,
    "osdop_write": 713549,
    "osdop_writefull": 20267,
    "osdop_writesame": 0,
    "osdop_append": 0,
    "osdop_zero": 2,
    "osdop_truncate": 0,
    "osdop_delete": 1226688,
    "osdop_mapext": 0,
    "osdop_sparse_read": 0,
    "osdop_clonerange": 0,
    "osdop_getxattr": 7321546,
    "osdop_setxattr": 2283499,
    "osdop_cmpxattr": 0,
    "osdop_rmxattr": 0,
    "osdop_resetxattrs": 0,
    "osdop_call": 0,
    "osdop_watch": 0,
    "osdop_notify": 0,
    "osdop_src_cmpxattr": 0,
    "osdop_pgls": 0,
    "osdop_pgls_filter": 0,
    "osdop_other": 49342,
    "linger_active": 0,
    "linger_send": 0,
    "linger_resend": 0,
    "linger_ping": 0,
    "poolop_active": 0,
    "poolop_send": 0,
    "poolop_resend": 0,
    "poolstat_active": 0,
    "poolstat_send": 0,
    "poolstat_resend": 0,
    "statfs_active": 0,
    "statfs_send": 0,
    "statfs_resend": 0,
    "command_active": 0,
    "command_send": 0,
    "command_resend": 0,
    "map_epoch": 13646,
    "map_full": 0,
    "map_inc": 97,
    "osd_sessions": 80,
    "osd_session_open": 176,
    "osd_session_close": 96,
    "osd_laggy": 0,
    "omap_wr": 354624,
    "omap_rd": 18128035,
    "omap_del": 48823
},
"oft": {
    "omap_total_objs": 3,
    "omap_total_kv_pairs": 31549,
    "omap_total_updates": 9972364,
    "omap_total_removes": 7080093
},
"purge_queue": {
    "pq_executing_ops": 0,
    "pq_executing_ops_high_water": 1126,
    "pq_executing": 0,
    "pq_executing_high_water": 64,
    "pq_executed": 1129
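For context, a counter dump like the fragment above can presumably be pulled from a running MDS at any time; a sketch, reusing the daemon name from this cluster, and the optional section argument limits the output:

ceph tell mds.ud-data.ud-02.xcoojt perf dump
ceph tell mds.ud-data.ud-02.xcoojt perf dump objecter    # only the objecter section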
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
All of my clients are servers located 2 hops away, with a 10 Gbit network, 2x Xeon CPUs (16++ cores), a minimum of 64 GB RAM, and an SSD OS drive + 8GB spare. I use the ceph kernel mount only, and this is the command:

- mount.ceph admin@$fsid.ud-data=/volumes/subvolumegroup ${MOUNT_DIR} -o name=admin,secret=XXX==,mon_addr=XXX

I think all of my clients have enough resources to answer MDS requests very fast. The only way any of my clients could fail to respond to cache pressure is through the default settings of the cephfs client or the MDS server.

I have some trouble understanding how the cephfs client works and why it needs to communicate with the MDS server to manage its local cache. At first I didn't even understand why the MDS server needs direct control over clients and tells them what to do; the concept and its logic did not make sense to me. To me, clients should be independent and manage their data flow without any server-side control: the client sends read and write requests to the remote server and returns the answer to the kernel. A client can have a read-cache management feature, but that should not require communicating with the remote server; when a client detects multiple reads of the same object, it should cache it under a set of protocols and release it when needed. I don't understand why the MDS needs to tell clients to release their allocations, and why clients need to report the release status back... The logical answer is that I'm looking at this from the wrong angle, and this is not the kind of cache I know from block filesystems.

With my use case, clients read 50-100 GB of data (10,000++ objects) only once or twice per runtime, within a few hours.

While researching, I saw that some users recommend decreasing "mds_max_caps_per_client" from 1M to 64K:

# ceph config set mds mds_max_caps_per_client 65536

But if you check the reported session ls in the previous mail, you will see "num_caps": 52092 for a client failing to respond to cache pressure. So it is already under 64K, and I'm not sure whether changing this value can help.

I want to repeat my main goal: I'm not just trying to silence the cache pressure warning. Ceph random read and write performance is not good, and a lot of reads from 80+ clients create latency. I'm trying to increase speed, and decrease latency, by creating multiple active MDS daemons and maybe even binding subvolumes to specific MDS servers.

Also, when I check MDS CPU usage I see 120%++ usage from time to time. But when I check the CPU load on the server hosting the MDS, the MDS only uses 2-4 cores and the other cores are almost idle. I think the MDS has a CPU core limitation, and I need to increase that value to decrease latency. How can I do that?

On Wed, Jan 17, 2024 at 07:44, Özkan Göksu wrote:

> Let me share some outputs about my cluster.
>
> root@ud-01:~# ceph fs status
> ud-data - 84 clients
> ======
> RANK  STATE           MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active  ud-data.ud-02.xcoojt  Reqs:   31 /s  3022k  3021k  52.6k   385k
>         POOL           TYPE     USED  AVAIL
> cephfs.ud-data.meta  metadata   136G  44.4T
> cephfs.ud-data.data    data    45.2T  44.4T
> STANDBY MDS
> ud-data.ud-03.lhwkml
> ud-data.ud-05.rnhcfe
> ud-data.ud-01.uatjle
> ud-data.ud-04.seggyv
>
> --
> This is the "ceph tell mds.ud-data.ud-02.xcoojt session ls" output for the
> client reported in the cache pressure warning.
>
> {
>     "id": 1282205,
>     "entity": {
>         "name": {
>             "type": "client",
>             "num": 1282205
>         },
>         "addr": {
>             "type": "v1",
>             "addr": "172.16.3.48:0",
>             "nonce": 2169935642
>         }
>     },
>     "state": "open",
>     "num_leases": 0,
>     "num_caps": 52092,
>     "request_load_avg": 1,
>     "uptime": 75754.745608647994,
>     "requests_in_flight": 0,
>     "num_completed_requests": 0,
>     "num_completed_flushes": 1,
>     "reconnecting": false,
>     "recall_caps": {
>         "value": 2577232.0049106553,
>         "halflife": 60
>     },
>     "release_caps": {
>         "value": 1.4093491463510395,
>         "halflife": 60
>     },
>     "recall_caps_throttle": {
>         "value": 63733.985544098425,
>         "halflife": 1.5
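To check how close each session actually gets to a cap limit such as mds_max_caps_per_client, the interesting fields can be filtered out of "session ls"; a sketch, assuming jq is available on the admin node:

ceph tell mds.ud-data.ud-02.xcoojt session ls | jq '.[] | {id, num_caps, liveness: .session_cache_liveness.value}'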
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Let me share some outputs about my cluster.

root@ud-01:~# ceph fs status
ud-data - 84 clients
======
RANK  STATE           MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  ud-data.ud-02.xcoojt  Reqs:   31 /s  3022k  3021k  52.6k   385k
        POOL           TYPE     USED  AVAIL
cephfs.ud-data.meta  metadata   136G  44.4T
cephfs.ud-data.data    data    45.2T  44.4T
STANDBY MDS
ud-data.ud-03.lhwkml
ud-data.ud-05.rnhcfe
ud-data.ud-01.uatjle
ud-data.ud-04.seggyv

--
This is the "ceph tell mds.ud-data.ud-02.xcoojt session ls" output for the client reported in the cache pressure warning.

{
    "id": 1282205,
    "entity": {
        "name": {
            "type": "client",
            "num": 1282205
        },
        "addr": {
            "type": "v1",
            "addr": "172.16.3.48:0",
            "nonce": 2169935642
        }
    },
    "state": "open",
    "num_leases": 0,
    "num_caps": 52092,
    "request_load_avg": 1,
    "uptime": 75754.745608647994,
    "requests_in_flight": 0,
    "num_completed_requests": 0,
    "num_completed_flushes": 1,
    "reconnecting": false,
    "recall_caps": {
        "value": 2577232.0049106553,
        "halflife": 60
    },
    "release_caps": {
        "value": 1.4093491463510395,
        "halflife": 60
    },
    "recall_caps_throttle": {
        "value": 63733.985544098425,
        "halflife": 1.5
    },
    "recall_caps_throttle2o": {
        "value": 19452.428409271757,
        "halflife": 0.5
    },
    "session_cache_liveness": {
        "value": 14.100272208890081,
        "halflife": 300
    },
    "cap_acquisition": {
        "value": 0,
        "halflife": 10
    },
    "delegated_inos": [
        {
            "start": "0x10004a1c031",
            "length": 282
        },
        {
            "start": "0x10004a1c33f",
            "length": 207
        },
        {
            "start": "0x10004a1cdda",
            "length": 6
        },
        {
            "start": "0x10004a3c12e",
            "length": 3
        },
        {
            "start": "0x1000f9831fe",
            "length": 2
        }
    ],
    "inst": "client.1282205 v1:172.16.3.48:0/2169935642",
    "completed_requests": [],
    "prealloc_inos": [
        {
            "start": "0x10004a1c031",
            "length": 282
        },
        {
            "start": "0x10004a1c33f",
            "length": 207
        },
        {
            "start": "0x10004a1cdda",
            "length": 6
        },
        {
            "start": "0x10004a3c12e",
            "length": 3
        },
        {
            "start": "0x1000f9831fe",
            "length": 2
        },
        {
            "start": "0x1000fa86e5f",
            "length": 54
        },
        {
            "start": "0x1000faa069c",
            "length": 501
        }
    ],
    "client_metadata": {
        "client_features": {
            "feature_bits": "0x7bff"
        },
        "metric_spec": {
            "metric_flags": {
                "feature_bits": "0x03ff"
            }
        },
        "entity_id": "admin",
        "hostname": "bennevis-2",
        "kernel_version": "5.15.0-91-generic",
        "root": "/volumes/babblians"
    }
}

On Wed, Jan 17, 2024 at 07:22, Özkan Göksu wrote:

> Hello Eugen.
>
> Thank you for the answer.
> According to the knowledge and test results in this issue:
> https://github.com/ceph/ceph/pull/38574
> I've tried their advice and I've appli
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hello Eugen.

Thank you for the answer.

According to the knowledge and test results in this issue: https://github.com/ceph/ceph/pull/38574
I tried their advice and applied the following changes:

max_mds = 4
standby_mds = 1
mds_cache_memory_limit = 16GB
mds_recall_max_caps = 4

When I set these parameters, one day later I saw this log:

[8531248.982954] Out of memory: Killed process 1580586 (ceph-mds) total-vm:70577592kB, anon-rss:70244236kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:137832kB oom_score_adj:0

All the MDS services leaked memory and were killed by the kernel. Because of this I changed the settings as below; it is stable now, but performance is very poor and I still get cache pressure alerts:

max_mds = 1
standby_mds = 5
mds_cache_memory_limit = 8GB
mds_recall_max_caps = 3

I'm very surprised that you advise decreasing "mds_recall_max_caps", because it is the opposite of what the developers advised in the issue I sent. It is very hard to play around with MDS parameters without an expert-level understanding of what these parameters stand for and how they affect behavior. Because of this I'm trying to understand the MDS code flow, and I'm very interested in learning more and tuning my system by debugging and understanding my own data flow and MDS usage.

I have a very unique data flow and I think I need to configure the system for this case. I have 80+ clients, and through all of them my users request a range of objects, compare them on the GPU, generate new data, and write the new data back to the cluster. This means my clients usually read an object only once and do not read the same object again. Sometimes the same user runs multiple services on multiple clients, and these services can read the same data from different clients. So having a large cache is useless for my use case, and I need to set up the MDS and the cephfs client for this data flow.

When I debug the MDS RAM usage, I see high allocation all the time, and I wonder why. If none of my clients is reading a given object, why doesn't the MDS remove that data from its RAM? I need to configure the MDS to read data and drop it very quickly unless it is constantly requested by clients; in that case, of course, I want a RAM cache tier.

I'm a little confused, and I need to learn more about how the MDS works and how I should make multiple active MDS daemons faster for my subvolumes and client data flow.

Best regards.

On Tue, Jan 16, 2024 at 11:36, Eugen Block wrote:

> Hi,
>
> I have dealt with this topic multiple times; the SUSE team helped me
> understand what's going on under the hood. The summary can be found
> in this thread [1].
>
> What helped in our case was to reduce mds_recall_max_caps from 30k
> (default) to 3k. We tried it in steps of 1k IIRC. So I suggest
> reducing that value step by step (maybe start with 20k or something) to
> find the optimal value.
>
> Regards,
> Eugen
>
> [1] https://www.spinics.net/lists/ceph-users/msg73188.html
>
> Quoting Özkan Göksu:
>
> > Hello.
> >
> > I have a 5 node ceph cluster and I'm constantly getting the "clients failing to
> > respond to cache pressure" warning.
> >
> > I have 84 cephfs kernel clients (servers) and my users are accessing their
> > personal subvolumes located on one pool.
> >
> > My users are software developers and the data is home and user data.
> > (Git, python projects, sample data and generated new data)
> >
> > ----
> > --- RAW STORAGE ---
> > CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
> > ssd    146 TiB  101 TiB  45 TiB    45 TiB      30.71
> > TOTAL  146 TiB  101 TiB  45 TiB    45 TiB      30.71
> >
> > --- POOLS ---
> > POOL                 ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> > .mgr                  1     1  356 MiB       90  1.0 GiB      0     30 TiB
> > cephfs.ud-data.meta   9   256   69 GiB    3.09M  137 GiB   0.15     45 TiB
> > cephfs.ud-data.data  10  2048   26 TiB  100.83M   44 TiB  32.97     45 TiB
> > ----
> > root@ud-01:~# ceph fs status
> > ud-data - 84 clients
> > ======
> > RANK  STATE           MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
> >  0    active  ud-data.ud-04.seggyv  Reqs:  142 /s  2844k  2798k   303k   720k
> >         POOL           TYPE     USED  AVAIL
> > cephfs.ud-data.meta  metadata   137G  44.9T
> > cephfs.ud-data.data    data    44.2T  44.9T
> > STANDBY MDS
> > ud-data.ud-02.xcoojt
> > ud-data.ud-05.rnhcfe
> > ud-data.ud-03.lhwkml
> > ud-data.ud-01.uatjle
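Eugen's stepwise approach quoted above could look roughly like the sketch below; it assumes the value starts at the 30k default and that each step is left to settle while the health status is watched:

ceph config set mds mds_recall_max_caps 20000
ceph health detail    # wait and watch whether "failing to respond to cache pressure" persists
ceph config set mds mds_recall_max_caps 10000
# ...continue stepping down (Eugen used 1k steps) until recall behaviour stabilizes
ceph config get mds mds_recall_max_caps    # confirm the active value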
[ceph-users] 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hello.

I have a 5 node ceph cluster and I'm constantly getting the "clients failing to respond to cache pressure" warning.

I have 84 cephfs kernel clients (servers) and my users are accessing their personal subvolumes located on one pool.

My users are software developers and the data is home and user data. (Git, python projects, sample data and generated new data)

----
--- RAW STORAGE ---
CLASS     SIZE    AVAIL    USED  RAW USED  %RAW USED
ssd    146 TiB  101 TiB  45 TiB    45 TiB      30.71
TOTAL  146 TiB  101 TiB  45 TiB    45 TiB      30.71

--- POOLS ---
POOL                 ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                  1     1  356 MiB       90  1.0 GiB      0     30 TiB
cephfs.ud-data.meta   9   256   69 GiB    3.09M  137 GiB   0.15     45 TiB
cephfs.ud-data.data  10  2048   26 TiB  100.83M   44 TiB  32.97     45 TiB
----
root@ud-01:~# ceph fs status
ud-data - 84 clients
======
RANK  STATE           MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  ud-data.ud-04.seggyv  Reqs:  142 /s  2844k  2798k   303k   720k
        POOL           TYPE     USED  AVAIL
cephfs.ud-data.meta  metadata   137G  44.9T
cephfs.ud-data.data    data    44.2T  44.9T
STANDBY MDS
ud-data.ud-02.xcoojt
ud-data.ud-05.rnhcfe
ud-data.ud-03.lhwkml
ud-data.ud-01.uatjle

MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
---
My MDS settings are below:

mds_cache_memory_limit                 | 8589934592
mds_cache_trim_threshold               | 524288
mds_recall_global_max_decay_threshold  | 131072
mds_recall_max_caps                    | 3
mds_recall_max_decay_rate              | 1.50
mds_recall_max_decay_threshold         | 131072
mds_recall_warning_threshold           | 262144

I have 2 questions:
1- What should I do to prevent the cache pressure warning?
2- What can I do to increase speed?

Thanks
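For reference, values like the ones in the table above can be read back from the cluster; a sketch, assuming the options live in the MON config store rather than in a local ceph.conf:

ceph config get mds mds_recall_max_caps
ceph config show mds.ud-data.ud-04.seggyv | grep -E 'mds_(cache|recall)'   # what the running daemon actually sees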