[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
After I set these 2 udev rules:

root@sd-02:~# cat /etc/udev/rules.d/98-ceph-provisioning-mode.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}:="unmap"

root@sd-02:~# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"

Only the raw drives changed to "DISC-GRAN=4K", "DISC-MAX=4G"; the Ceph LVM volumes on top of them still show 0B.

This is the status:

root@sd-02:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                  0      512B       2G         0
├─sda1               0      512B       2G         0
├─sda2               0      512B       2G         0
└─sda3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0
sdb                  0      512B       2G         0
├─sdb1               0      512B       2G         0
├─sdb2               0      512B       2G         0
└─sdb3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0
sdc                  0        4K       4G         0
├─ceph--35de126c--326d--45f0--85e6--ef651dd25506-osd--block--65a12345--788d--406c--b4aa--79c691662f3e   0   0B   0B   0
└─ceph--35de126c--326d--45f0--85e6--ef651dd25506-osd--block--0fc29fdb--1345--465c--b830--8a217dd9034f   0   0B   0B   0

But in my other cluster, as you can see, the Ceph LVM partitions are also 4K + 2G:

root@ud-01:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                  0      512B       2G         0
├─sda1               0      512B       2G         0
└─sda2               0      512B       2G         0
  └─md0              0      512B       2G         0
    ├─md0p1          0      512B       2G         0
    └─md0p2          0      512B       2G         0
sdb                  0      512B       2G         0
├─sdb1               0      512B       2G         0
└─sdb2               0      512B       2G         0
  └─md0              0      512B       2G         0
    ├─md0p1          0      512B       2G         0
    └─md0p2          0      512B       2G         0
sdc                  0        4K       2G         0
├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003   0   4K   2G   0
└─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1   0   4K   2G   0

I think I also need to write a udev rule for the LVM OSD volumes as well, right?
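Before adding another rule for the LVM layer, I will probably just check whether the discard settings propagate to the device-mapper devices after a reboot; a quick check I have in mind (untested, dm-0 is just an example and the right dm-X has to be matched to the Ceph LVs with lsblk):

root@sd-02:~# lsblk -o NAME,KNAME,DISC-GRAN,DISC-MAX
root@sd-02:~# cat /sys/block/dm-0/queue/discard_granularity
root@sd-02:~# cat /sys/block/dm-0/queue/discard_max_bytes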

Anthony D'Atri wrote on Fri, 22 Mar 2024 at 18:11:

> Maybe because the Crucial units are detected as client drives?  But also
> look at the device paths and the output of whatever "disklist" is.  Your
> boot drives are SATA and the others are SAS which seems even more likely to
> be a factor.
>
> On Mar 22, 2024, at 10:42, Özkan Göksu  wrote:
>
> Hello Anthony, thank you for the answer.
>
> While researching I also found out this type of issues but the thing I did
> not understand is in the same server the OS drives "SAMSUNG MZ7WD480" is
> all good.
>
> root@sd-01:~# lsblk -D
> NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> sda                  0      512B       2G         0
> ├─sda1               0      512B       2G         0
> ├─sda2               0      512B       2G         0
> └─sda3               0      512B       2G         0
>   └─md0              0      512B       2G         0
>     └─md0p1          0      512B       2G         0
> sdb                  0      512B       2G         0
> ├─sdb1               0      512B       2G         0
> ├─sdb2  

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
Hello again.

In the Ceph hardware recommendations I found this:

https://docs.ceph.com/en/quincy/start/hardware-recommendations/

WRITE CACHES
Enterprise SSDs and HDDs normally include power loss protection features
which ensure data durability when power is lost while operating, and use
multi-level caches to speed up direct or synchronous writes. These devices
can be toggled between two caching modes – a volatile cache flushed to
persistent media with fsync, or a non-volatile cache written synchronously.
These two modes are selected by either “enabling” or “disabling” the write
(volatile) cache. When the volatile cache is enabled, Linux uses a device
in “write back” mode, and when disabled, it uses “write through”.
The default configuration (usually: caching is enabled) may not be optimal,
and OSD performance may be dramatically increased in terms of increased
IOPS and decreased commit latency by disabling this write cache.
Users are therefore encouraged to benchmark their devices with fio as
described earlier and persist the optimal cache configuration for their
devices.
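
For reference, this is roughly the fio test I plan to use to compare the two cache modes, based on the docs above (a sketch only; /dev/sdX is a placeholder and this writes directly to the raw device, so it must only be run on an empty/spare drive):

fio --name=write-cache-test --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based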


root@sd-02:~# cat /sys/class/scsi_disk/*/cache*
write back
write back
write back
write back
write back
write back
write back
write back
write back
write back

What do you think about these new udev rules?

root@sd-02:~# cat /etc/udev/rules.d/98-ceph-provisioning-mode.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}:="unmap"

root@sd-02:~# cat /etc/udev/rules.d/99-ceph-write-through.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", ATTR{cache_type}:="write through"


Özkan Göksu wrote on Fri, 22 Mar 2024 at 17:42:

> Hello Anthony, thank you for the answer.
>
> While researching I also found out this type of issues but the thing I did
> not understand is in the same server the OS drives "SAMSUNG MZ7WD480" is
> all good.
>
> root@sd-01:~# lsblk -D
> NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> sda                  0      512B       2G         0
> ├─sda1               0      512B       2G         0
> ├─sda2               0      512B       2G         0
> └─sda3               0      512B       2G         0
>   └─md0              0      512B       2G         0
>     └─md0p1          0      512B       2G         0
> sdb                  0      512B       2G         0
> ├─sdb1               0      512B       2G         0
> ├─sdb2               0      512B       2G         0
> └─sdb3               0      512B       2G         0
>   └─md0              0      512B       2G         0
>     └─md0p1          0      512B       2G         0
>
> root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + |
> sort
>
> /sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
>
> /sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
>
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
>
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full
>
> root@sd-01:~# disklist
> HCTL   NAME   SIZE  REV TRAN   WWNSERIAL  MODEL
> 1:0:0:0/dev/sda 447.1G 203Q sata   0x5002538500231d05 S1G1NYAF923
> SAMSUNG MZ7WD4
> 2:0:0:0/dev/sdb 447.1G 203Q sata   0x5002538500231a41 S1G1NYAF922
> SAMSUNG MZ7WD4
> 0:0:0:0/dev/sdc   3.6T 046  sas0x500a0751e6bd969b 2312E6BD969
> CT4000MX500SSD
> 0:0:1:0/dev/sdd   3.6T 046  sas0x500a0751e6bd97ee 2312E6BD97E
> CT4000MX500SSD
> 0:0:2:0/dev/sde   3.6T 046  sas0x500a0751e6bd9805 2312E6BD980
> CT4000MX500SSD
> 0:0:3:0/dev/sdf   3.6T 046  sas0x500a0751e6bd9681 2312E6BD968
> CT4000MX500SSD
> 0:0:4:0/dev/sdg   3.6T 045  sas0x500a0751e6b5d30a 2309E6B5D30
> CT4000MX500SSD
> 0:0:5:0/dev/sdh   3.6T 046  sas0x500a0751e6bd967e 2312E6BD967
> CT4000MX500SSD
> 0:0:6:0/dev/sdi   3.6T 046  sas0x500a0751e6bd97e4 2312E6BD97E
> CT4000MX500SSD
> 0:0:7:0/dev/sdj   3.6T 046  sas0x500a0751e6bd96a0 2312E6BD96A
> CT4000MX500SSD
>
> So my question is why it only happens to CT4000MX500SSD drives and why it
> just

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
Hello Anthony, thank you for the answer.

While researching I also found this type of issue, but the thing I do not
understand is that in the same server the OS drives ("SAMSUNG MZ7WD480") are
all fine.

root@sd-01:~# lsblk -D
NAME          DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                  0      512B       2G         0
├─sda1               0      512B       2G         0
├─sda2               0      512B       2G         0
└─sda3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0
sdb                  0      512B       2G         0
├─sdb1               0      512B       2G         0
├─sdb2               0      512B       2G         0
└─sdb3               0      512B       2G         0
  └─md0              0      512B       2G         0
    └─md0p1          0      512B       2G         0

root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
/sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full

root@sd-01:~# disklist
HCTL   NAME   SIZE  REV TRAN   WWNSERIAL  MODEL
1:0:0:0/dev/sda 447.1G 203Q sata   0x5002538500231d05 S1G1NYAF923
SAMSUNG MZ7WD4
2:0:0:0/dev/sdb 447.1G 203Q sata   0x5002538500231a41 S1G1NYAF922
SAMSUNG MZ7WD4
0:0:0:0/dev/sdc   3.6T 046  sas0x500a0751e6bd969b 2312E6BD969
CT4000MX500SSD
0:0:1:0/dev/sdd   3.6T 046  sas0x500a0751e6bd97ee 2312E6BD97E
CT4000MX500SSD
0:0:2:0/dev/sde   3.6T 046  sas0x500a0751e6bd9805 2312E6BD980
CT4000MX500SSD
0:0:3:0/dev/sdf   3.6T 046  sas0x500a0751e6bd9681 2312E6BD968
CT4000MX500SSD
0:0:4:0/dev/sdg   3.6T 045  sas0x500a0751e6b5d30a 2309E6B5D30
CT4000MX500SSD
0:0:5:0/dev/sdh   3.6T 046  sas0x500a0751e6bd967e 2312E6BD967
CT4000MX500SSD
0:0:6:0/dev/sdi   3.6T 046  sas0x500a0751e6bd97e4 2312E6BD97E
CT4000MX500SSD
0:0:7:0/dev/sdj   3.6T 046  sas0x500a0751e6bd96a0 2312E6BD96A
CT4000MX500SSD

So my question is: why does it only happen to the CT4000MX500SSD drives, why
did it just start now, and why don't I see it on other servers?
Maybe it is related to the firmware version ("M3CR046" vs "M3CR045").
I checked the Crucial website and "M3CR046" does not actually exist there:
https://www.crucial.com/support/ssd-support/mx500-support
In this forum people recommend upgrading to "M3CR046":
https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/
But in my ud cluster all the drives are "M3CR045" and have lower latency.
I'm really confused.


Instead of writing udev rules only for the CT4000MX500SSD, is there any
recommended udev rule for Ceph that covers all types of SATA drives?
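
Something like this is what I had in mind as a more generic rule, matching by model instead of targeting every SCSI disk (an untested sketch; the match pattern is just an example and the model string may be padded, hence the wildcard):

ACTION=="add", SUBSYSTEM=="scsi_disk", ATTRS{model}=="CT4000MX500*", ATTR{provisioning_mode}:="unmap", ATTR{cache_type}:="write through"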



Anthony D'Atri wrote on Fri, 22 Mar 2024 at 17:00:

> How to stop sys from changing USB SSD provisioning_mode from unmap to full
> in Ubuntu 22.04?
> <https://askubuntu.com/questions/1454997/how-to-stop-sys-from-changing-usb-ssd-provisioning-mode-from-unmap-to-full-in-ub>
>
>
> On Mar 22, 2024, at 09:36, Özkan Göksu  wrote:
>
> Hello!
>
> After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
> LTS) , commit latency started acting weird with "CT4000MX500SSD" drives.
>
> osd  commit_latency(ms)  apply_latency(ms)
> 36 867867
> 373045   3045
> 38  15 15
> 39  18 18
> 421409   1409
> 431224   1224
>
> I downgraded the kernel but the result did not change.
> I have a similar build and it didn't get upgraded an

[ceph-users] High OSD commit_latency after kernel upgrade

2024-03-22 Thread Özkan Göksu
Hello!

After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
LTS) , commit latency started acting weird with "CT4000MX500SSD" drives.

osd  commit_latency(ms)  apply_latency(ms)
 36 867867
 373045   3045
 38  15 15
 39  18 18
 421409   1409
 431224   1224

I downgraded the kernel but the result did not change.
I have a similar build and it didn't get upgraded and it is just fine.
While I was digging I realised a difference.

This is the high-latency cluster and, as you can see, "DISC-GRAN=0B",
"DISC-MAX=0B":
root@sd-01:~# lsblk -D
NAME   DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdc           0        0B       0B         0
├─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--201d5050--db0c--41b4--85c4--6416ee989d6c   0   0B   0B   0
└─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--5a376133--47de--4e29--9b75--2314665c2862

root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full

--

This is the low-latency cluster and, as you can see, "DISC-GRAN=4K",
"DISC-MAX=2G":
root@ud-01:~# lsblk -D
NAME   DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdc           0        4K       2G         0
├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003   0   4K   2G   0
└─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1

root@ud-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
/sys/devices/pci:00/:00:11.4/ata3/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16

I think the problem is related to provisioning_mode, but I really do not
understand the reason.
I booted with a live ISO and the drive was still "provisioning_mode:full",
so this is not related to my OS at all.

Something changed with the upgrade; I think that during the boot sequence
the negotiation between the LSI controller, the drives and the kernel
started to assign "provisioning_mode:full", but I'm not sure.

What should I do ?

Best regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Sata SSD trim latency with (WAL+DB on NVME + Sata OSD)

2024-02-26 Thread Özkan Göksu
Hello.

With SSD drives that do not have tantalum capacitors (power loss
protection), Ceph faces trim latency on every write.
I wonder if the behavior is the same if we locate the WAL+DB on NVMe drives
with tantalum capacitors?

Do I need to use NVMe + SAS SSD to avoid this latency issue?

Best regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seperate metadata pool in 3x MDS node

2024-02-26 Thread Özkan Göksu
 Hello Anthony,

The hardware is a second-hand build and does not have U.2 slots; U.2 servers
cost 3x-4x more. I mean the PCIe "MZ-PLK3T20".
I have to buy SFP cards, and 25G is only $30 more than 10G, so why not.
Yes, I'm thinking of pinning as (clients > rack MDS).
I don't have problems with building, and I don't use the PG autoscaler.

Hello David.

My system is all internal and I only use one /20 subnet at layer 2.
Yes, I'm thinking of distributing the metadata pool across racks 1,2,4,5,
because my clients search a lot and I just want to shorten the metadata path.
I have redundant rack PDUs so I don't have any problem with power, and I
only have a vPC (2x N9K switches) in the main rack 3. That's why I keep data
and everything management-related in rack 3, as usual.
Normally I always use WAL+DB on NVMe with SATA OSDs. The only thing I wonder
is whether having a separate metadata pool on NVMe located in the client
racks is going to give some benefit or not.
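
For the (clients > rack MDS) pinning I mentioned above, what I have in mind is roughly static directory pinning per rack, something like this (just a sketch, the paths and ranks are made up):

setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/rack1_projects
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/rack2_projects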

Regards.

David C. wrote on Sun, 25 Feb 2024 at 00:07:

> Hello,
>
> Each rack works on different trees or is everything parallelized ?
> The meta pools would be distributed over racks 1,2,4,5 ?
> If it is distributed, even if the addressed MDS is on the same switch as
> the client, you will always have this MDS which will consult/write (nvme)
> OSDs on the other racks (among 1,2,4,5).
>
> In any case, the exercise is interesting.
>
>
>
> On Sat, 24 Feb 2024 at 19:56, Özkan Göksu wrote:
>
>> Hello folks!
>>
>> I'm designing a new Ceph storage from scratch and I want to increase
>> CephFS
>> speed and decrease latency.
>> Usually I always build (WAL+DB on NVME with Sas-Sata SSD's) and I deploy
>> MDS and MON's on the same servers.
>> This time a weird idea came to my mind and I think it has great potential
>> and will perform better on paper with my limited knowledge.
>>
>> I have 5 racks and the 3nd "middle" rack is my storage and management
>> rack.
>>
>> - At RACK-3 I'm gonna locate 8x 1u OSD server (Spec: 2x E5-2690V4, 256GB,
>> 4x 25G, 2x 1.6TB PCI-E NVME "MZ-PLK3T20", 8x 4TB SATA SSD)
>>
>> - My Cephfs kernel clients are 40x GPU nodes located at RACK1,2,4,5
>>
>> With my current workflow, all the clients;
>> 1- visit the rack data switch
>> 2- jump to main VPC switch via 2x100G,
>> 3- talk with MDS servers,
>> 4- Go back to the client with the answer,
>> 5- To access data follow the same HOP's and visit the OSD's everytime.
>>
>> If I deploy separate metadata pool by using 4x MDS server at top of
>> RACK-1,2,4,5 (Spec: 2x E5-2690V4, 128GB, 2x 10G(Public), 2x 25G (cluster),
>> 2x 960GB U.2 NVME "MZ-PLK3T20")
>> Then all the clients will make the request directly in-rack 1 HOP away MDS
>> servers and if the request is only metadata, then the MDS node doesn't
>> need
>> to redirect the request to OSD nodes.
>> Also by locating MDS servers with seperated metadata pool across all the
>> racks will reduce the high load on main VPC switch at RACK-3
>>
>> If I'm not missing anything then only Recovery workload will suffer with
>> this topology.
>>
>> What do you think?
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Seperate metadata pool in 3x MDS node

2024-02-24 Thread Özkan Göksu
Hello folks!

I'm designing a new Ceph storage from scratch and I want to increase CephFS
speed and decrease latency.
Usually I always build (WAL+DB on NVME with Sas-Sata SSD's) and I deploy
MDS and MON's on the same servers.
This time a weird idea came to my mind and I think it has great potential
and will perform better on paper with my limited knowledge.

I have 5 racks and the 3rd "middle" rack is my storage and management rack.

- At RACK-3 I'm gonna locate 8x 1u OSD server (Spec: 2x E5-2690V4, 256GB,
4x 25G, 2x 1.6TB PCI-E NVME "MZ-PLK3T20", 8x 4TB SATA SSD)

- My Cephfs kernel clients are 40x GPU nodes located at RACK1,2,4,5

With my current workflow, all the clients:
1- visit the rack data switch,
2- jump to the main vPC switch via 2x100G,
3- talk to the MDS servers,
4- go back to the client with the answer,
5- to access data, follow the same hops and visit the OSDs every time.

If I deploy a separate metadata pool using 4x MDS servers at the top of
RACK-1,2,4,5 (Spec: 2x E5-2690V4, 128GB, 2x 10G (public), 2x 25G (cluster),
2x 960GB U.2 NVME "MZ-PLK3T20"),
then all the clients will make requests directly to an in-rack MDS server
1 hop away, and if a request is metadata-only, the MDS node doesn't need
to redirect it to the OSD nodes.
Also, locating MDS servers with a separate metadata pool across all the
racks will reduce the high load on the main vPC switch in RACK-3.
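
The metadata pool placement itself would just be a device-class based CRUSH rule, roughly like this (a sketch; the rule, pool and device-class names are assumptions):

ceph osd crush rule create-replicated meta-nvme default host nvme
ceph osd pool set cephfs.ud-data.meta crush_rule meta-nvme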

If I'm not missing anything, then only the recovery workload will suffer
with this topology.

What do you think?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance improvement suggestion

2024-02-20 Thread Özkan Göksu
Hello.

I didn't test it personally, but what about a replica-1 write cache pool on
NVMe backed by another replica-2 pool?

In theory it has the potential to do exactly what you are looking for.


quag...@bol.com.br wrote on Thu, 1 Feb 2024 at 20:54:

>
>
> Ok Anthony,
>
> I understood what you said. I also believe in all the professional history
> and experience you have.
>
> Anyway, could there be a configuration flag to make this happen?
>
> As well as those that already exist: "--yes-i-really-mean-it".
>
> This way, the storage pattern would remain as it is. However, it would
> allow situations like the one I mentioned to be possible.
>
> This situation will permit some rules to be relaxed (even if they are not
> ok at first).
> Likewise, there are already situations like lazyio that make some
> exceptions to standard procedures.
>
>
> Remembering: it's just a suggestion.
> If this type of functionality is not interesting, it is ok.
>
>
> Rafael.
>
> --
>
> *From:* "Anthony D'Atri" 
> *Sent:* 2024/02/01 12:10:30
> *To:* quag...@bol.com.br
> *Cc:* ceph-users@ceph.io
> *Subject:* [ceph-users] Re: Performance improvement suggestion
>
>
>
> > I didn't say I would accept the risk of losing data.
>
> That's implicit in what you suggest, though.
>
> > I just said that it would be interesting if the objects were first
> recorded only in the primary OSD.
>
> What happens when that host / drive smokes before it can replicate? What
> happens if a secondary OSD gets a read op before the primary updates it?
> Swift object storage users have to code around this potential. It's a
> non-starter for block storage.
>
> This is similar to why RoC HBAs (which are a badly outdated thing to begin
> with) will only enter writeback mode if they have a BBU / supercap -- and
> of course if their firmware and hardware isn't pervasively buggy. Guess how
> I know this?
>
> > This way it would greatly increase performance (both for iops and
> throuput).
>
> It might increase low-QD IOPS for a single client on slow media with
> certain networking. Depending on media, it wouldn't increase throughput.
>
> Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x
> the network resources between the client and the servers.
>
> > Later (in the background), record the replicas. This situation would
> avoid leaving users/software waiting for the recording response from all
> replicas when the storage is overloaded.
>
> If one makes the mistake of using HDDs, they're going to be overloaded no
> matter how one slices and dices the ops. Ya just canna squeeze IOPS from a
> stone. Throughput is going to be limited by the SATA interface and seeking
> no matter what.
>
> > Where I work, performance is very important and we don't have money to
> make a entire cluster only with NVMe.
>
> If there isn't money, then it isn't very important. But as I've written
> before, NVMe clusters *do not cost appreciably more than spinners* unless
> your procurement processes are bad. In fact they can cost significantly
> less. This is especially true with object storage and archival where one
> can leverage QLC.
>
> * Buy generic drives from a VAR, not channel drives through a chassis
> brand. Far less markup, and moreover you get the full 5 year warranty, not
> just 3 years. And you can painlessly RMA drives yourself - you don't have
> to spend hours going back and forth with $chassisvendor's TAC arguing about
> every single RMA. I've found that this is so bad that it is more economical
> to just throw away a failed component worth < USD 500 than to RMA it. Do
> you pay for extended warranty / support? That's expensive too.
>
> * Certain chassis brands who shall remain nameless push RoC HBAs hard with
> extreme markups. List prices as high as USD2000. Per server, eschewing
> those abominations makes up for a lot of the drive-only unit economics
>
> * But this is the part that lots of people don't get: You don't just stack
> up the drives on a desk and use them. They go into *servers* that cost
> money and *racks* that cost money. They take *power* that costs money.
>
> * $ / IOPS are FAR better for ANY SSD than for HDDs
>
> * RUs cost money, so do chassis and switches
>
> * Drive failures cost money
>
> * So does having your people and applications twiddle their thumbs waiting
> for stuff to happen. I worked for a supercomputer company who put
> low-memory low-end diskless workstations on engineer's desks. They spent
> lots of time doing nothing waiting for their applications to respond. This
> company no longer exists.
>
> * So does the risk of taking *weeks* to heal from a drive failure
>
> Punch honest numbers into
> https://www.snia.org/forums/cmsi/programs/TCOcalc
>
> I walked through this with a certain global company. QLC SSDs were
> demonstrated to have like 30% lower TCO than spinners. Part of the equation
> is that they were accustomed to limiting HDD size to 8 TB because of the
> bottlenecks, and thus 

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-27 Thread Özkan Göksu
Thank you Frank.

My focus is actually performance tuning.
After your mail, I started to investigate client-side.

I think the kernel tunings work great now.
After the tunings I didn't get any warning again.

Now I will continue with performance tunings.
I decided to distribute subvolumes across multiple pools instead of using
multi-active MDS.
With this method I will have multiple MDS daemons and [1x CephFS client for
each pool / host].

To hide subvolume UUIDs, I'm using bind mounts ("mount --bind" kernel
links), and I wonder whether this can create performance issues on the
CephFS clients?
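What I mean by the bind mounts, roughly (paths and names are made up, not my real layout; the CephFS kernel mount is assumed at /mnt/cephfs):

mount --bind /mnt/cephfs/volumes/_nogroup/user1/<subvolume-uuid> /home/user1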

Best regards.



Frank Schilder wrote on Sat, 27 Jan 2024 at 12:34:

> Hi Özkan,
>
> > ... The client is actually at idle mode and there is no reason to fail
> at all. ...
>
> if you re-read my message, you will notice that I wrote that
>
> - its not the client failing, its a false positive error flag that
> - is not cleared for idle clients.
>
> You seem to encounter exactly this situation and a simple
>
> echo 3 > /proc/sys/vm/drop_caches
>
> would probably have cleared the warning. There is nothing wrong with your
> client, its an issue with the client-MDS communication protocol that is
> probably still under review. You will encounter these warnings every now
> and then until its fixed.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
, BW=681MiB/s (714MB/s)(3072MiB/4511msec); 0
zone resets
 BS=32K   write: IOPS=12.1k, BW=378MiB/s (396MB/s)(3072MiB/8129msec); 0
zone resets
 BS=16K   write: IOPS=12.7k, BW=198MiB/s (208MB/s)(3072MiB/15487msec); 0
zone resets
 BS=4Kwrite: IOPS=12.7k, BW=49.7MiB/s (52.1MB/s)(3072MiB/61848msec); 0
zone resets
Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
 BS=1Mread: IOPS=1113, BW=1114MiB/s (1168MB/s)(3072MiB/2758msec)
 BS=128K  read: IOPS=8953, BW=1119MiB/s (1173MB/s)(3072MiB/2745msec)
 BS=64K   read: IOPS=17.9k, BW=1116MiB/s (1170MB/s)(3072MiB/2753msec)
 BS=32K   read: IOPS=35.1k, BW=1096MiB/s (1150MB/s)(3072MiB/2802msec)
 BS=16K   read: IOPS=69.4k, BW=1085MiB/s (1138MB/s)(3072MiB/2831msec)
 BS=4Kread: IOPS=112k, BW=438MiB/s (459MB/s)(3072MiB/7015msec)

*Everything looks good except 4K speeds:*
Seq Write  -  BS=4Kwrite: IOPS=8661, BW=33.8MiB/s
(35.5MB/s)(3072MiB/90801msec); 0 zone resets
Rand Write - BS=4Kwrite: IOPS=12.7k, BW=49.7MiB/s
(52.1MB/s)(3072MiB/61848msec); 0 zone resets

What do you think?


Özkan Göksu wrote on Sat, 27 Jan 2024 at 04:08:

> Wow I noticed something!
>
> To prevent ram overflow with gpu training allocations, I'm using a 2TB
> Samsung 870 evo for swap.
>
> As you can see below, swap usage 18Gi and server was idle, that means
> maybe ceph client hits latency because of the swap usage.
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
> free -h
>totalusedfree  shared  buff/cache
> available
> Mem:62Gi34Gi27Gi   0.0Ki   639Mi
>  27Gi
> Swap:  1.8Ti18Gi   1.8Ti
>
> I decided to play around kernel parameters to prevent ceph swap usage.
>
> kernel.shmmax = 60654764851   # Maximum shared segment size in bytes
>> kernel.shmall = 16453658   # Maximum number of shared memory segments in
>> pages
>> vm.nr_hugepages = 4096   # Increase Transparent Huge Pages (THP) Defrag:
>> vm.swappiness = 0 # Set vm.swappiness to 0 to minimize swapping
>> vm.min_free_kbytes = 1048576 # required free memory (set to 1% of
>> physical ram)
>
>
> I reboot the server and after reboot swap usage is 0 as expected.
>
> To give a try I started the iobench.sh (
> https://github.com/ozkangoksu/benchmark/blob/main/iobench.sh)
> This client has 1G nic only. As you can see below, other then 4K block
> size, ceph client can saturate NIC.
>
> root@bmw-m4:~# nicstat -MUz 1
> Time  Int   rMbps   wMbps   rPk/s   wPk/srAvswAvs %rUtil
> %wUtil
> 01:04:48   ens1f0   936.9   92.90 91196.8 60126.3  1346.6   202.5   98.2
> 9.74
>
> root@bmw-m4:/mounts/ud-data/benchuser1/96f13211-c37f-42db-8d05-f3255a05129e/testdir#
> bash iobench.sh
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>  BS=1Mwrite: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27395msec); 0
> zone resets
>  BS=128K  write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27462msec); 0
> zone resets
>  BS=64K   write: IOPS=1758, BW=110MiB/s (115MB/s)(3072MiB/27948msec); 0
> zone resets
>  BS=32K   write: IOPS=3542, BW=111MiB/s (116MB/s)(3072MiB/27748msec); 0
> zone resets
>  BS=16K   write: IOPS=6839, BW=107MiB/s (112MB/s)(3072MiB/28747msec); 0
> zone resets
>  BS=4Kwrite: IOPS=8473, BW=33.1MiB/s (34.7MB/s)(3072MiB/92813msec); 0
> zone resets
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>  BS=1Mread: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27386msec)
>  BS=128K  read: IOPS=895, BW=112MiB/s (117MB/s)(3072MiB/27431msec)
>  BS=64K   read: IOPS=1788, BW=112MiB/s (117MB/s)(3072MiB/27486msec)
>  BS=32K   read: IOPS=3561, BW=111MiB/s (117MB/s)(3072MiB/27603msec)
>  BS=16K   read: IOPS=6924, BW=108MiB/s (113MB/s)(3072MiB/28392msec)
>  BS=4Kread: IOPS=21.3k, BW=83.3MiB/s (87.3MB/s)(3072MiB/36894msec)
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>  BS=1Mwrite: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27406msec); 0
> zone resets
>  BS=128K  write: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27466msec); 0
> zone resets
>  BS=64K   write: IOPS=1781, BW=111MiB/s (117MB/s)(3072MiB/27591msec); 0
> zone resets
>  BS=32K   write: IOPS=3545, BW=111MiB/s (116MB/s)(3072MiB/27729msec); 0
> zone resets
>  BS=16K   write: IOPS=6823, BW=107MiB/s (112MB/s)(3072MiB/28814msec); 0
> zone resets
>  BS=4Kwrite: IOPS=12.7k, BW=49.8MiB/s (52.2MB/s)(3072MiB/61694msec); 0
> zone resets
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
>  BS=1Mread: IOPS=112, BW=112MiB/s (118MB/s)(3072MiB/27388msec)
>  BS=128K  read: IOPS=894, BW=112MiB/s (117MB/s)(3072MiB/27479msec)
>  BS=64K   read: IOPS=1784, BW=112MiB/s (117MB/s)(3072MiB/27547msec)
>  BS=32K   read: IOPS=3559, BW=111MiB/s (117MB/s)(3072

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
 totalusedfree  shared  buff/cache
available
Mem:62Gi11Gi50Gi   3.0Mi   1.0Gi
 49Gi
Swap:  1.8Ti  0B   1.8Ti


I started to feel we are getting closer :)



Özkan Göksu wrote on Sat, 27 Jan 2024 at 02:58:

> I started to investigate my clients.
>
> for example:
>
> root@ud-01:~# ceph health detail
> HEALTH_WARN 1 clients failing to respond to cache pressure
> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
> mds.ud-data.ud-02.xcoojt(mds.0): Client bmw-m4 failing to respond to
> cache pressure client_id: 1275577
>
> root@ud-01:~# ceph fs status
> ud-data - 86 clients
> ===
> RANK  STATE   MDS  ACTIVITY DNSINOS   DIRS
> CAPS
>  0active  ud-data.ud-02.xcoojt  Reqs:   34 /s  2926k  2827k   155k
>  1157k
>
>
> ceph tell mds.ud-data.ud-02.xcoojt session ls | jq -r '.[] | "clientid:
> \(.id)= num_caps: \(.num_caps), num_leases: \(.num_leases),
> request_load_avg: \(.request_load_avg), num_completed_requests:
> \(.num_completed_requests), num_completed_flushes:
> \(.num_completed_flushes)"' | sort -n -t: -k3
>
> clientid: *1275577*= num_caps: 12312, num_leases: 0, request_load_avg: 0,
> num_completed_requests: 0, num_completed_flushes: 1
> clientid: 1275571= num_caps: 16307, num_leases: 1, request_load_avg: 2101,
> num_completed_requests: 0, num_completed_flushes: 3
> clientid: 1282130= num_caps: 26337, num_leases: 3, request_load_avg: 116,
> num_completed_requests: 0, num_completed_flushes: 1
> clientid: 1191789= num_caps: 32784, num_leases: 0, request_load_avg: 1846,
> num_completed_requests: 0, num_completed_flushes: 0
> clientid: 1275535= num_caps: 79825, num_leases: 2, request_load_avg: 133,
> num_completed_requests: 8, num_completed_flushes: 8
> clientid: 1282142= num_caps: 80581, num_leases: 6, request_load_avg: 125,
> num_completed_requests: 2, num_completed_flushes: 6
> clientid: 1275532= num_caps: 87836, num_leases: 3, request_load_avg: 190,
> num_completed_requests: 2, num_completed_flushes: 6
> clientid: 1275547= num_caps: 94129, num_leases: 4, request_load_avg: 149,
> num_completed_requests: 2, num_completed_flushes: 4
> clientid: 1275553= num_caps: 96460, num_leases: 4, request_load_avg: 155,
> num_completed_requests: 2, num_completed_flushes: 8
> clientid: 1282139= num_caps: 108882, num_leases: 25, request_load_avg: 99,
> num_completed_requests: 2, num_completed_flushes: 4
> clientid: 1275538= num_caps: 437162, num_leases: 0, request_load_avg: 101,
> num_completed_requests: 2, num_completed_flushes: 0
>
> --
>
> *MY CLIENT:*
>
> The client is actually at idle mode and there is no reason to fail at all.
>
> root@bmw-m4:~# apt list --installed |grep ceph
> ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64 [installed]
> libcephfs2/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64
> [installed,automatic]
> python3-ceph-argparse/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64
> [installed,automatic]
> python3-ceph-common/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 all
> [installed,automatic]
> python3-cephfs/jammy-updates,now 17.2.6-0ubuntu0.22.04.2 amd64
> [installed,automatic]
>
> Let's check metrics and stats:
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
> cat metrics
> item   total
> --
> opened files  / total inodes   2 / 12312
> pinned i_caps / total inodes   12312 / 12312
> opened inodes / total inodes   1 / 12312
>
> item  total   avg_lat(us) min_lat(us) max_lat(us)
> stdev(us)
>
> ---
> read  22283   44409   430 1804853
> 15619
> write 112702  419725  36588879541
> 6008
> metadata  353322  5712154 917903
>  5357
>
> item  total   avg_sz(bytes)   min_sz(bytes)   max_sz(bytes)
>  total_sz(bytes)
>
> 
> read  22283   1701940 1   4194304
> 37924318602
> write 112702  246211  1   4194304
> 27748469309
>
> item  total   misshit
> -
> d_lease   62  63627   28564698
> caps  12312   36658   44568261
>
>
> root@bmw-m4:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1275577#
> cat bd

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
_split_pmd 22451
thp_split_pud 0
thp_zero_page_alloc 1
thp_zero_page_alloc_failed 0
thp_swpout 22332
thp_swpout_fallback 0
balloon_inflate 0
balloon_deflate 0
balloon_migrate 0
swap_ra 25777929
swap_ra_hit 25658825
direct_map_level2_splits 1249
direct_map_level3_splits 49
nr_unstable 0



Özkan Göksu wrote on Sat, 27 Jan 2024 at 02:36:

> Hello Frank.
>
> I have 84 clients (high-end servers) with: Ubuntu 20.04.5 LTS - Kernel:
> Linux 5.4.0-125-generic
>
> My cluster 17.2.6 quincy.
> I have some client nodes with "ceph-common/stable,now 17.2.7-1focal" I
> wonder using new version clients is the main problem?
> Maybe I have a communication error. For example I hit this problem and I
> can not collect client stats "
> https://github.com/ceph/ceph/pull/52127/files;
>
> Best regards.
>
>
>
> Frank Schilder wrote on Fri, 26 Jan 2024 at 14:53:
>
>> Hi, this message is one of those that are often spurious. I don't recall
>> in which thread/PR/tracker I read it, but the story was something like that:
>>
>> If an MDS gets under memory pressure it will request dentry items back
>> from *all* clients, not just the active ones or the ones holding many of
>> them. If you have a client that's below the min-threshold for dentries (its
>> one of the client/mds tuning options), it will not respond. This client
>> will be flagged as not responding, which is a false positive.
>>
>> I believe the devs are working on a fix to get rid of these spurious
>> warnings. There is a "bug/feature" in the MDS that does not clear this
>> warning flag for inactive clients. Hence, the message hangs and never
>> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches"
>> on the client. However, except for being annoying in the dashboard, it has
>> no performance or otherwise negative impact.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Eugen Block 
>> Sent: Friday, January 26, 2024 10:05 AM
>> To: Özkan Göksu
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure
>> (quincy:17.2.6)
>>
>> Performance for small files is more about IOPS rather than throughput,
>> and the IOPS in your fio tests look okay to me. What you could try is
>> to split the PGs to get around 150 or 200 PGs per OSD. You're
>> currently at around 60 according to the ceph osd df output. Before you
>> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data |
>> head'? I don't need the whole output, just to see how many objects
>> each PG has. We had a case once where that helped, but it was an older
>> cluster and the pool was backed by HDDs and separate rocksDB on SSDs.
>> So this might not be the solution here, but it could improve things as
>> well.
>>
>>
>> Quoting Özkan Göksu :
>>
>> > Every user has a 1x subvolume and I only have 1 pool.
>> > At the beginning we were using each subvolume for ldap home directory +
>> > user data.
>> > When a user logins any docker on any host, it was using the cluster for
>> > home and the for user related data, we was have second directory in the
>> > same subvolume.
>> > Time to time users were feeling a very slow home environment and after a
>> > month it became almost impossible to use home. VNC sessions became
>> > unresponsive and slow etc.
>> >
>> > 2 weeks ago, I had to migrate home to a ZFS storage and now the overall
>> > performance is better for only user_data without home.
>> > But still the performance is not good enough as I expected because of
>> the
>> > problems related to MDS.
>> > The usage is low but allocation is high and Cpu usage is high. You saw
>> the
>> > IO Op/s, it's nothing but allocation is high.
>> >
>> > I develop a fio benchmark script and I run the script on 4x test server
>> at
>> > the same time, the results are below:
>> > Script:
>> >
>> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
>> >
>> >
>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
>> >
>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
>> >
>> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
>> >
>> https://github.com/ozkangoksu/benc

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-26 Thread Özkan Göksu
Hello Frank.

I have 84 clients (high-end servers) with: Ubuntu 20.04.5 LTS - Kernel:
Linux 5.4.0-125-generic

My cluster is 17.2.6 Quincy.
I have some client nodes with "ceph-common/stable,now 17.2.7-1focal"; I
wonder if using newer-version clients is the main problem?
Maybe I have a communication error. For example, I hit this problem and
cannot collect client stats: https://github.com/ceph/ceph/pull/52127/files

Best regards.



Frank Schilder wrote on Fri, 26 Jan 2024 at 14:53:

> Hi, this message is one of those that are often spurious. I don't recall
> in which thread/PR/tracker I read it, but the story was something like that:
>
> If an MDS gets under memory pressure it will request dentry items back
> from *all* clients, not just the active ones or the ones holding many of
> them. If you have a client that's below the min-threshold for dentries (its
> one of the client/mds tuning options), it will not respond. This client
> will be flagged as not responding, which is a false positive.
>
> I believe the devs are working on a fix to get rid of these spurious
> warnings. There is a "bug/feature" in the MDS that does not clear this
> warning flag for inactive clients. Hence, the message hangs and never
> disappears. I usually clear it with a "echo 3 > /proc/sys/vm/drop_caches"
> on the client. However, except for being annoying in the dashboard, it has
> no performance or otherwise negative impact.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________
> From: Eugen Block 
> Sent: Friday, January 26, 2024 10:05 AM
> To: Özkan Göksu
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure
> (quincy:17.2.6)
>
> Performance for small files is more about IOPS rather than throughput,
> and the IOPS in your fio tests look okay to me. What you could try is
> to split the PGs to get around 150 or 200 PGs per OSD. You're
> currently at around 60 according to the ceph osd df output. Before you
> do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data |
> head'? I don't need the whole output, just to see how many objects
> each PG has. We had a case once where that helped, but it was an older
> cluster and the pool was backed by HDDs and separate rocksDB on SSDs.
> So this might not be the solution here, but it could improve things as
> well.
>
>
> Quoting Özkan Göksu :
>
> > Every user has a 1x subvolume and I only have 1 pool.
> > At the beginning we were using each subvolume for ldap home directory +
> > user data.
> > When a user logins any docker on any host, it was using the cluster for
> > home and the for user related data, we was have second directory in the
> > same subvolume.
> > Time to time users were feeling a very slow home environment and after a
> > month it became almost impossible to use home. VNC sessions became
> > unresponsive and slow etc.
> >
> > 2 weeks ago, I had to migrate home to a ZFS storage and now the overall
> > performance is better for only user_data without home.
> > But still the performance is not good enough as I expected because of the
> > problems related to MDS.
> > The usage is low but allocation is high and Cpu usage is high. You saw
> the
> > IO Op/s, it's nothing but allocation is high.
> >
> > I develop a fio benchmark script and I run the script on 4x test server
> at
> > the same time, the results are below:
> > Script:
> >
> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
> >
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
> >
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt
> >
> > While running benchmark, I take sample values for each type of iobench
> run.
> >
> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> > client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
> > client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
> > client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr
> >
> > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> > client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
> > client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr
> >
> > Rand Write b

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
These are client-side metrics from a client that was warned as "failing to
respond to cache pressure".

root@datagen-27:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1282187#
cat bdi/stats
BdiWriteback:0 kB
BdiReclaimable:  0 kB
BdiDirtyThresh:  0 kB
DirtyThresh:  35979376 kB
BackgroundThresh: 17967720 kB
BdiDirtied:3071616 kB
BdiWritten:3036864 kB
BdiWriteBandwidth:  20 kBps
b_dirty: 0
b_io:0
b_more_io:   0
b_dirty_time:0
bdi_list:1
state:   1



root@d27:/sys/kernel/debug/ceph/e42fd4b0-313b-11ee-9a00-31da71873773.client1282187#
cat metrics
item   total
--
opened files  / total inodes   4 / 14129
pinned i_caps / total inodes   14129 / 14129
opened inodes / total inodes   2 / 14129

item  total   avg_lat(us) min_lat(us) max_lat(us)
stdev(us)
---
read  1218753 3116208 8741271
2154
write 34945   24003   30172191493
16156
metadata  1703642 8395127 17936115
 1497

item  total   avg_sz(bytes)   min_sz(bytes)   max_sz(bytes)
 total_sz(bytes)

read  1218753 227009  1   4194304
276668475618
write 34945   85860   1   4194304
3000382055

item  total   misshit
-
d_lease   306 19110   3317071969
caps  14129   145404  3761682333

Özkan Göksu wrote on Thu, 25 Jan 2024 at 20:25:

> Every user has a 1x subvolume and I only have 1 pool.
> At the beginning we were using each subvolume for ldap home directory +
> user data.
> When a user logins any docker on any host, it was using the cluster for
> home and the for user related data, we was have second directory in the
> same subvolume.
> Time to time users were feeling a very slow home environment and after a
> month it became almost impossible to use home. VNC sessions became
> unresponsive and slow etc.
>
> 2 weeks ago, I had to migrate home to a ZFS storage and now the overall
> performance is better for only user_data without home.
> But still the performance is not good enough as I expected because of the
> problems related to MDS.
> The usage is low but allocation is high and Cpu usage is high. You saw the
> IO Op/s, it's nothing but allocation is high.
>
> I develop a fio benchmark script and I run the script on 4x test server at
> the same time, the results are below:
> Script:
> https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh
>
>
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
>
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
>
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
>
> https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt
>
> While running benchmark, I take sample values for each type of iobench run.
>
> Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
> client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
> client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr
>
> Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
> client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr
>
> Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
> client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
> client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr
>
> Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
> client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
> client:   2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
> client:   4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
> client:   2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr
>
> It seems I only have problems with the 4K,8K,16K other sector sizes.
>
>
>
>
> Eugen Block wrote on Thu, 25 Jan 2024 at 19:06:

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
Every user has 1x subvolume and I only have 1 pool.
At the beginning we were using each subvolume for the LDAP home directory +
user data.
When a user logged in to any Docker container on any host, it used the
cluster for home, and for user-related data we had a second directory in
the same subvolume.
From time to time users felt a very slow home environment, and after a
month it became almost impossible to use home. VNC sessions became
unresponsive and slow, etc.

2 weeks ago I had to migrate home to ZFS storage, and now the overall
performance is better with only user_data and without home.
But the performance is still not as good as I expected because of the
problems related to the MDS.
The usage is low but allocation is high, and CPU usage is high. You saw the
IO op/s; it's nothing, but allocation is high.

I develop a fio benchmark script and I run the script on 4x test server at
the same time, the results are below:
Script:
https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh

https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt
https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt

While running the benchmark, I took sample values for each type of iobench run.

Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr
client:   60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr
client:   13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr

Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr
client:   370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr

Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr
client:   14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr
client:   6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr

Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32
client:   317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr
client:   2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr
client:   4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr
client:   2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr

It seems I only have problems with the 4K, 8K and 16K block sizes; the other sizes are fine.




Eugen Block wrote on Thu, 25 Jan 2024 at 19:06:

> I understand that your MDS shows a high CPU usage, but other than that
> what is your performance issue? Do users complain? Do some operations
> take longer than expected? Are OSDs saturated during those phases?
> Because the cache pressure messages don’t necessarily mean that users
> will notice.
> MDS daemons are single-threaded so that might be a bottleneck. In that
> case multi-active mds might help, which you already tried and
> experienced OOM killers. But you might have to disable the mds
> balancer as someone else mentioned. And then you could think about
> pinning, is it possible to split the CephFS into multiple
> subdirectories and pin them to different ranks?
> But first I’d still like to know what the performance issue really is.
>
> Zitat von Özkan Göksu :
>
> > I will try my best to explain my situation.
> >
> > I don't have a separate mds server. I have 5 identical nodes, 3 of them
> > mons, and I use the other 2 as active and standby mds. (currently I have
> > left overs from max_mds 4)
> >
> > root@ud-01:~# ceph -s
> >   cluster:
> > id: e42fd4b0-313b-11ee-9a00-31da71873773
> > health: HEALTH_WARN
> > 1 clients failing to respond to cache pressure
> >
> >   services:
> > mon: 3 daemons, quorum ud-01,ud-02,ud-03 (age 9d)
> > mgr: ud-01.qycnol(active, since 8d), standbys: ud-02.tfhqfd
> > mds: 1/1 daemons up, 4 standby
> > osd: 80 osds: 80 up (since 9d), 80 in (since 5M)
> >
> >   data:
> > volumes: 1/1 healthy
> > pools:   3 pools, 2305 pgs
> > objects: 106.58M objects, 25 TiB
> > usage:   45 TiB used, 101 TiB / 146 TiB avail
> > pgs: 2303 active+clean
> >  2active+clean+scrubbing+deep
> >
> >   io:
> > client:   16 MiB/s rd, 3.4 MiB/s wr, 77 op/s rd, 23 op/s wr
> >
> > --
> > root@ud-01:~# ceph fs status
> > ud-data - 84 clients
> > ===
> > RANK  STATE   MDS  ACTIVITY DNSINOS   DIRS
> > CAPS
> >  0active  ud-data.ud-02.xcoojt  Reqs:   40 /s  2579k  

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
   66      up  osd.53
 54  ssd  1.81929  1.0  1.8 TiB  550 GiB  544 GiB  1.5 GiB  4.3 GiB  1.3 TiB  29.54  0.96   55  up  osd.54
 55  ssd  1.81929  1.0  1.8 TiB  527 GiB  522 GiB  1.3 GiB  4.0 GiB  1.3 TiB  28.29  0.92   52  up  osd.55
 56  ssd  1.81929  1.0  1.8 TiB  525 GiB  519 GiB  1.2 GiB  4.1 GiB  1.3 TiB  28.16  0.91   52  up  osd.56
 57  ssd  1.81929  1.0  1.8 TiB  615 GiB  609 GiB  2.3 GiB  4.2 GiB  1.2 TiB  33.03  1.07   65  up  osd.57
 58  ssd  1.81929  1.0  1.8 TiB  527 GiB  522 GiB  1.6 GiB  3.7 GiB  1.3 TiB  28.31  0.92   55  up  osd.58
 59  ssd  1.81929  1.0  1.8 TiB  615 GiB  609 GiB  1.2 GiB  4.6 GiB  1.2 TiB  33.01  1.07   60  up  osd.59
 60  ssd  1.81929  1.0  1.8 TiB  594 GiB  588 GiB  1.2 GiB  4.4 GiB  1.2 TiB  31.88  1.03   59  up  osd.60
 61  ssd  1.81929  1.0  1.8 TiB  616 GiB  610 GiB  1.9 GiB  4.1 GiB  1.2 TiB  33.04  1.07   64  up  osd.61
 62  ssd  1.81929  1.0  1.8 TiB  620 GiB  614 GiB  1.9 GiB  4.4 GiB  1.2 TiB  33.27  1.08   63  up  osd.62
 63  ssd  1.81929  1.0  1.8 TiB  527 GiB  522 GiB  1.5 GiB  4.0 GiB  1.3 TiB  28.30  0.92   53  up  osd.63
-11       29.10864    -   29 TiB  9.0 TiB  8.9 TiB   23 GiB   65 GiB   20 TiB  30.91  1.00    -  host ud-05
 64  ssd  1.81929  1.0  1.8 TiB  608 GiB  601 GiB  2.3 GiB  4.5 GiB  1.2 TiB  32.62  1.06   65  up  osd.64
 65  ssd  1.81929  1.0  1.8 TiB  606 GiB  601 GiB  628 MiB  4.2 GiB  1.2 TiB  32.53  1.06   57  up  osd.65
 66  ssd  1.81929  1.0  1.8 TiB  583 GiB  578 GiB  1.3 GiB  4.3 GiB  1.2 TiB  31.31  1.02   57  up  osd.66
 67  ssd  1.81929  1.0  1.8 TiB  537 GiB  533 GiB  436 MiB  3.6 GiB  1.3 TiB  28.82  0.94   50  up  osd.67
 68  ssd  1.81929  1.0  1.8 TiB  541 GiB  535 GiB  2.5 GiB  3.8 GiB  1.3 TiB  29.04  0.94   59  up  osd.68
 69  ssd  1.81929  1.0  1.8 TiB  606 GiB  601 GiB  1.1 GiB  4.4 GiB  1.2 TiB  32.55  1.06   59  up  osd.69
 70  ssd  1.81929  1.0  1.8 TiB  604 GiB  598 GiB  1.8 GiB  4.1 GiB  1.2 TiB  32.44  1.05   63  up  osd.70
 71  ssd  1.81929  1.0  1.8 TiB  606 GiB  600 GiB  1.9 GiB  4.5 GiB  1.2 TiB  32.53  1.06   62  up  osd.71
 72  ssd  1.81929  1.0  1.8 TiB  602 GiB  598 GiB  612 MiB  4.1 GiB  1.2 TiB  32.33  1.05   57  up  osd.72
 73  ssd  1.81929  1.0  1.8 TiB  571 GiB  565 GiB  1.8 GiB  4.5 GiB  1.3 TiB  30.65  0.99   58  up  osd.73
 74  ssd  1.81929  1.0  1.8 TiB  608 GiB  602 GiB  1.8 GiB  4.2 GiB  1.2 TiB  32.62  1.06   61  up  osd.74
 75  ssd  1.81929  1.0  1.8 TiB  536 GiB  531 GiB  1.9 GiB  3.5 GiB  1.3 TiB  28.80  0.93   57  up  osd.75
 76  ssd  1.81929  1.0  1.8 TiB  605 GiB  599 GiB  1.4 GiB  4.5 GiB  1.2 TiB  32.48  1.05   60  up  osd.76
 77  ssd  1.81929  1.0  1.8 TiB  537 GiB  532 GiB  1.2 GiB  3.9 GiB  1.3 TiB  28.84  0.94   52  up  osd.77
 78  ssd  1.81929  1.0  1.8 TiB  525 GiB  520 GiB  1.3 GiB  3.8 GiB  1.3 TiB  28.20  0.92   52  up  osd.78
 79  ssd  1.81929  1.0  1.8 TiB  536 GiB  531 GiB  1.1 GiB  3.3 GiB  1.3 TiB  28.76  0.93   53  up  osd.79
              TOTAL  146 TiB   45 TiB   44 TiB  119 GiB  333 GiB  101 TiB  30.81
MIN/MAX VAR: 0.91/1.08  STDDEV: 1.90



Eugen Block wrote on Thu, 25 Jan 2024 at 16:52:

> There is no definitive answer wrt mds tuning. As it is everywhere
> mentioned, it's about finding the right setup for your specific
> workload. If you can synthesize your workload (maybe scale down a bit)
> try optimizing it in a test cluster without interrupting your
> developers too much.
> But what you haven't explained yet is what are you experiencing as a
> performance issue? Do you have numbers or a detailed description?
>  From the fs status output you didn't seem to have too much activity
> going on (around 140 requests per second), but that's probably not the
> usual traffic? What does ceph report in its client IO output?
> Can you paste the 'ceph osd df' output as well?
> Do you have dedicated MDS servers or are they colocated with other
> services?
>
> Quoting Özkan Göksu :
>
> > Hello  Eugen.
> >
> > I read all of your MDS related topics and thank you so much for your
> effort
> > on this.
> > There is not much information and I couldn't find a MDS tuning guide at
> > all. It  seems that you are the correct person to discuss mds debugging
> and
> > tuning.
> >
> > Do you have any documents or may I learn what is the proper way to debug
> > MDS and cli

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-25 Thread Özkan Göksu
Hello  Eugen.

I have read all of your MDS-related topics, and thank you so much for your
effort on this.
There is not much information out there and I couldn't find an MDS tuning
guide at all. It seems that you are the right person to discuss MDS
debugging and tuning with.

Do you have any documents, or could you tell me the proper way to debug the
MDS and the clients?
Which debug logs will guide me to understand the limitations and help me
tune according to the data flow?

While searching, I found this:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YO4SGL4DJQ6EKUBUIHKTFSW72ZJ3XLZS/
quote: "A user running VSCodium, keeping 15k caps open.. the opportunistic
caps recall eventually starts recalling those but the (el7 kernel) client
won't release them. Stopping Codium seems to be the only way to release."

Because of this I think I also need to play around on the client side.

My main goal is increasing speed and reducing latency, and I wonder if these
ideas are correct or not:
- Maybe I need to increase the client-side cache size, because through each
client multiple users request a lot of objects and clearly the
client_cache_size=16 default is not enough.
- Maybe I need to increase the client-side maximum cache limits for objects
"client_oc_max_objects=1000 to 1" and data "client_oc_size=200mi to 400mi".
- The client cache cleaning threshold is not aggressive enough to keep the
free cache size in the desired range. I need to make it more aggressive, but
this should not reduce speed or increase latency.

mds_cache_memory_limit=4gi to 16gi
client_oc_max_objects=1000 to 1
client_oc_size=200mi to 400mi
client_permissions=false #to reduce latency.
client_cache_size=16 to 128
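For reference, a minimal sketch of how such values could be pushed through
the config database (the byte values are my own conversions of the sizes
above; as far as I know the client_oc_* and client_cache_size options are
only read by ceph-fuse/libcephfs clients, not by kernel mounts):

# Sketch only: store the proposed values in the cluster config database.
ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB in bytes
ceph config set client client_oc_size 419430400          # 400 MiB in bytes
ceph config set client client_permissions false
# Confirm what the daemons actually pick up:
ceph config get mds mds_cache_memory_limit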


What do you think?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs-top causes 16 mgr modules have recently crashed

2024-01-25 Thread Özkan Göksu
Hello Jos.

I checked the diff and noticed the difference:
https://github.com/ceph/ceph/pull/52127/files

Thank you for the guide link and for the fix.
Have a great day.

Regards.



On Tue, 23 Jan 2024 at 11:07, Jos Collin wrote:

> This fix is in the mds.
> I think you need to read
> https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade.
>
> On 23/01/24 12:19, Özkan Göksu wrote:
>
> Hello Jos.
> Thank you for the reply.
>
> I can upgrade to 17.2.7, but I wonder whether I can upgrade only MON+MGR
> for this issue or whether I need to upgrade all the components.
> Otherwise I need to wait a few weeks; I don't want to request maintenance
> during delivery time.
>
> root@ud-01:~# ceph orch upgrade ls
> {
> "image": "quay.io/ceph/ceph",
> "registry": "quay.io",
> "bare_image": "ceph/ceph",
> "versions": [
> "18.2.1",
> "18.2.0",
> "18.1.3",
> "18.1.2",
> "18.1.1",
> "18.1.0",
> "17.2.7",
> "17.2.6",
> "17.2.5",
> "17.2.4",
> "17.2.3",
>     "17.2.2",
> "17.2.1",
> "17.2.0"
> ]
> }
>
> Best regards
>
Jos Collin, on Tue, 23 Jan 2024 at 07:42, wrote:
>
>> Please have this fix: https://tracker.ceph.com/issues/59551. It's
>> backported to quincy.
>>
>> On 23/01/24 03:11, Özkan Göksu wrote:
>> > Hello
>> >
>> > When I run cephfs-top, it causes an mgr module crash. Can you please
>> > tell me the reason?
>> >
>> > My environment:
>> > My ceph version 17.2.6
>> > Operating System: Ubuntu 22.04.2 LTS
>> > Kernel: Linux 5.15.0-84-generic
>> >
>> > I created the cephfs-top user with the following command:
>> > ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd
>> 'allow
>> > r' mgr 'allow r' > /etc/ceph/ceph.client.fstop.keyring
>> >
>> > This is the crash report:
>> >
>> > root@ud-01:~# ceph crash info
>> > 2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801
>> > {
>> >  "backtrace": [
>> >  "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in
>> > notify\nself.fs_perf_stats.notify_cmd(notify_id)",
>> >  "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line
>> 177,
>> > in notify_cmd\nmetric_features =
>> >
>> int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"],
>> > 16)",
>> >  "ValueError: invalid literal for int() with base 16: '0x'"
>> >  ],
>> >  "ceph_version": "17.2.6",
>> >  "crash_id":
>> > "2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801",
>> >  "entity_name": "mgr.ud-01.qycnol",
>> >  "mgr_module": "stats",
>> >  "mgr_module_caller": "ActivePyModule::notify",
>> >  "mgr_python_exception": "ValueError",
>> >  "os_id": "centos",
>> >  "os_name": "CentOS Stream",
>> >  "os_version": "8",
>> >  "os_version_id": "8",
>> >  "process_name": "ceph-mgr",
>> >  "stack_sig":
>> > "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
>> >  "timestamp": "2024-01-22T21:25:59.313305Z",
>> >  "utsname_hostname": "ud-01",
>> >  "utsname_machine": "x86_64",
>> >  "utsname_release": "5.15.0-84-generic",
>> >  "utsname_sysname": "Linux",
>> >  "utsname_version": "#93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023"
>> > }
>> >
>> >
>> > Best regards.
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs-top causes 16 mgr modules have recently crashed

2024-01-22 Thread Özkan Göksu
Hello Jos.
Thank you for the reply.

I can upgrade to 17.2.7, but I wonder whether I can upgrade only MON+MGR
for this issue or whether I need to upgrade all the components.
Otherwise I need to wait a few weeks; I don't want to request maintenance
during delivery time.

root@ud-01:~# ceph orch upgrade ls
{
"image": "quay.io/ceph/ceph",
"registry": "quay.io",
"bare_image": "ceph/ceph",
"versions": [
"18.2.1",
"18.2.0",
"18.1.3",
"18.1.2",
"18.1.1",
"18.1.0",
"17.2.7",
"17.2.6",
"17.2.5",
"17.2.4",
"17.2.3",
"17.2.2",
"17.2.1",
"17.2.0"
]
}
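If a staggered upgrade is possible here, my understanding (a sketch only,
not something I have run yet) is that it would look roughly like this, with
the MDS daemons still needing to reach 17.2.7 at some point since the fix
lives there:

# Upgrade only selected daemon types first, then check progress.
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.7 --daemon-types mgr,mon
ceph orch upgrade status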

Best regards

Jos Collin, on Tue, 23 Jan 2024 at 07:42, wrote:

> Please have this fix: https://tracker.ceph.com/issues/59551. It's
> backported to quincy.
>
> On 23/01/24 03:11, Özkan Göksu wrote:
> > Hello
> >
> > When I run cephfs-top, it causes an mgr module crash. Can you please
> > tell me the reason?
> >
> > My environment:
> > My ceph version 17.2.6
> > Operating System: Ubuntu 22.04.2 LTS
> > Kernel: Linux 5.15.0-84-generic
> >
> > I created the cephfs-top user with the following command:
> > ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd
> 'allow
> > r' mgr 'allow r' > /etc/ceph/ceph.client.fstop.keyring
> >
> > This is the crash report:
> >
> > root@ud-01:~# ceph crash info
> > 2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801
> > {
> >  "backtrace": [
> >  "  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in
> > notify\nself.fs_perf_stats.notify_cmd(notify_id)",
> >  "  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line
> 177,
> > in notify_cmd\nmetric_features =
> >
> int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"],
> > 16)",
> >  "ValueError: invalid literal for int() with base 16: '0x'"
> >  ],
> >  "ceph_version": "17.2.6",
> >  "crash_id":
> > "2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801",
> >  "entity_name": "mgr.ud-01.qycnol",
> >  "mgr_module": "stats",
> >  "mgr_module_caller": "ActivePyModule::notify",
> >  "mgr_python_exception": "ValueError",
> >  "os_id": "centos",
> >  "os_name": "CentOS Stream",
> >  "os_version": "8",
> >  "os_version_id": "8",
> >  "process_name": "ceph-mgr",
> >  "stack_sig":
> > "971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
> >  "timestamp": "2024-01-22T21:25:59.313305Z",
> >  "utsname_hostname": "ud-01",
> >  "utsname_machine": "x86_64",
> >  "utsname_release": "5.15.0-84-generic",
> >  "utsname_sysname": "Linux",
> >  "utsname_version": "#93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023"
> > }
> >
> >
> > Best regards.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephfs-top causes 16 mgr modules have recently crashed

2024-01-22 Thread Özkan Göksu
Hello

When I run cephfs-top, it causes an mgr module crash. Can you please tell
me the reason?

My environment:
My ceph version 17.2.6
Operating System: Ubuntu 22.04.2 LTS
Kernel: Linux 5.15.0-84-generic

I created the cephfs-top user with the following command:
ceph auth get-or-create client.fstop mon 'allow r' mds 'allow r' osd 'allow
r' mgr 'allow r' > /etc/ceph/ceph.client.fstop.keyring
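For completeness, this is roughly how the tool is invoked (a sketch; the
stats mgr module is enabled and 'fstop' is the default client name
cephfs-top uses):

ceph mgr module enable stats
cephfs-top --id fstop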

This is the crash report:

root@ud-01:~# ceph crash info
2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801
{
"backtrace": [
"  File \"/usr/share/ceph/mgr/stats/module.py\", line 32, in
notify\nself.fs_perf_stats.notify_cmd(notify_id)",
"  File \"/usr/share/ceph/mgr/stats/fs/perf_stats.py\", line 177,
in notify_cmd\nmetric_features =
int(metadata[CLIENT_METADATA_KEY][\"metric_spec\"][\"metric_flags\"][\"feature_bits\"],
16)",
"ValueError: invalid literal for int() with base 16: '0x'"
],
"ceph_version": "17.2.6",
"crash_id":
"2024-01-22T21:25:59.313305Z_526253e3-e8cc-4d2c-adcb-69a7c9986801",
"entity_name": "mgr.ud-01.qycnol",
"mgr_module": "stats",
"mgr_module_caller": "ActivePyModule::notify",
"mgr_python_exception": "ValueError",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mgr",
"stack_sig":
"971ae170f1fff7f7bc0b7ae86d164b2b0136a8bd5ca7956166ea5161e51ad42c",
"timestamp": "2024-01-22T21:25:59.313305Z",
"utsname_hostname": "ud-01",
"utsname_machine": "x86_64",
"utsname_release": "5.15.0-84-generic",
"utsname_sysname": "Linux",
"utsname_version": "#93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023"
}
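For what it's worth, the ValueError itself is easy to reproduce outside the
mgr (illustration only; the '0x' string is taken from the backtrace above):

python3 -c 'int("0x", 16)'   # raises: ValueError: invalid literal for int() with base 16: '0x'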


Best regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
: 0,
"bluestore_Buffer_items": 0,
"bluestore_Extent_bytes": 0,
"bluestore_Extent_items": 0,
"bluestore_Blob_bytes": 0,
"bluestore_Blob_items": 0,
"bluestore_SharedBlob_bytes": 0,
"bluestore_SharedBlob_items": 0,
"bluestore_inline_bl_bytes": 0,
"bluestore_inline_bl_items": 0,
"bluestore_fsck_bytes": 0,
"bluestore_fsck_items": 0,
"bluestore_txc_bytes": 0,
"bluestore_txc_items": 0,
"bluestore_writing_deferred_bytes": 0,
"bluestore_writing_deferred_items": 0,
"bluestore_writing_bytes": 0,
"bluestore_writing_items": 0,
"bluefs_bytes": 0,
"bluefs_items": 0,
"bluefs_file_reader_bytes": 0,
"bluefs_file_reader_items": 0,
"bluefs_file_writer_bytes": 0,
"bluefs_file_writer_items": 0,
"buffer_anon_bytes": 2114708,
"buffer_anon_items": 825,
"buffer_meta_bytes": 88,
"buffer_meta_items": 1,
"osd_bytes": 0,
"osd_items": 0,
"osd_mapbl_bytes": 0,
"osd_mapbl_items": 0,
"osd_pglog_bytes": 0,
"osd_pglog_items": 0,
"osdmap_bytes": 25728,
"osdmap_items": 946,
"osdmap_mapping_bytes": 0,
"osdmap_mapping_items": 0,
"pgmap_bytes": 0,
"pgmap_items": 0,
"mds_co_bytes": 8173443932,
"mds_co_items": 109004579,
"unittest_1_bytes": 0,
"unittest_1_items": 0,
"unittest_2_bytes": 0,
"unittest_2_items": 0
},
"objecter": {
"op_active": 0,
"op_laggy": 0,
"op_send": 13563810,
"op_send_bytes": 21613887606,
"op_resend": 1,
"op_reply": 13563809,
"oplen_avg": {
"avgcount": 13563809,
"sum": 31377549
},
"op": 13563809,
"op_r": 10213362,
"op_w": 3350447,
"op_rmw": 0,
"op_pg": 0,
"osdop_stat": 75945,
"osdop_create": 1139381,
"osdop_read": 15848,
"osdop_write": 713549,
"osdop_writefull": 20267,
"osdop_writesame": 0,
    "osdop_append": 0,
"osdop_zero": 2,
"osdop_truncate": 0,
"osdop_delete": 1226688,
"osdop_mapext": 0,
"osdop_sparse_read": 0,
"osdop_clonerange": 0,
"osdop_getxattr": 7321546,
"osdop_setxattr": 2283499,
"osdop_cmpxattr": 0,
"osdop_rmxattr": 0,
"osdop_resetxattrs": 0,
"osdop_call": 0,
"osdop_watch": 0,
"osdop_notify": 0,
"osdop_src_cmpxattr": 0,
"osdop_pgls": 0,
"osdop_pgls_filter": 0,
"osdop_other": 49342,
"linger_active": 0,
"linger_send": 0,
"linger_resend": 0,
"linger_ping": 0,
"poolop_active": 0,
"poolop_send": 0,
"poolop_resend": 0,
"poolstat_active": 0,
"poolstat_send": 0,
"poolstat_resend": 0,
"statfs_active": 0,
"statfs_send": 0,
"statfs_resend": 0,
"command_active": 0,
"command_send": 0,
"command_resend": 0,
"map_epoch": 13646,
"map_full": 0,
"map_inc": 97,
"osd_sessions": 80,
"osd_session_open": 176,
"osd_session_close": 96,
"osd_laggy": 0,
"omap_wr": 354624,
"omap_rd": 18128035,
"omap_del": 48823
},
"oft": {
"omap_total_objs": 3,
"omap_total_kv_pairs": 31549,
"omap_total_updates": 9972364,
"omap_total_removes": 7080093
},
"purge_queue": {
"pq_executing_ops": 0,
"pq_executing_ops_high_water": 1126,
"pq_executing": 0,
"pq_executing_high_water": 64,
"pq_executed": 1129

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
All of my clients are servers located 2 hops away on a 10 Gbit network, with
2x Xeon CPUs (16+ cores), at least 64 GB of RAM, and an SSD OS drive + 8 GB spare.
I use ceph kernel mount only and this is the command:
- mount.ceph admin@$fsid.ud-data=/volumes/subvolumegroup ${MOUNT_DIR} -o
name=admin,secret=XXX==,mon_addr=XXX

I think all of my clients have enough resources to answer MDS requests very
fast. The only way any of my clients could be failing to respond to cache
pressure is the default settings on the CephFS client or the MDS server.

I have some trouble understanding how the CephFS client works and why it
needs to communicate with the MDS server to manage its local cache.
Even at the beginning I didn't understand why the MDS server needs direct
control over clients and has to tell them what to do; the concept and its
logic don't make sense to me.
To my mind, clients should be independent and manage their data flow
without any server-side control. The client should send read and write
requests to the remote server and return the answer to the kernel.
A client can have a read-cache management feature, but it should not need
to communicate with the remote server for that. When a client detects
multiple reads of the same object, it should cache it according to a set of
rules and release it when needed.
I don't understand why the MDS needs to tell clients to release the
allocation, and why the client needs to report the release status back...

The logical answer is that I'm probably looking at this from the wrong
angle, and this is not the kind of cache I know from block filesystems.

In my use case, clients read 50-100 GB of data (10,000+ objects) only once
or twice per run, over a few hours.


While I was researching, I saw that some users recommend decreasing
"mds_max_caps_per_client" from 1M to 64K:
# ceph config set mds mds_max_caps_per_client 65536

But if you check the session ls output reported in my previous mail, you
will see "num_caps": 52092 for the client failing to respond to cache
pressure.
So it is already under 64K, and I'm not sure whether changing this value
would help.

I want to repeat my main goal.
I'm not trying to solve the cache pressure warning as such.
Ceph's random read and write performance is not good, and a lot of reads
from 80+ clients create latency.
I'm trying to increase speed and decrease latency by running multiple
active MDS daemons, maybe even binding subvolumes to specific MDS ranks, as
sketched below.
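(Subtree pinning is normally done with an extended attribute on the
directory; a rough sketch, assuming max_mds > 1 and a kernel mount of the
volume root -- the path and rank here are placeholders:)

# Pin one subvolume's tree to MDS rank 1; -v -1 removes the pin again.
setfattr -n ceph.dir.pin -v 1 /mnt/ud-data/volumes/babblians/<subvolume>
getfattr -n ceph.dir.pin /mnt/ud-data/volumes/babblians/<subvolume>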

Also, when I check MDS CPU usage I see 120%+ usage from time to time. But
when I check the CPU load on the server hosting the MDS, I see the MDS only
uses 2-4 cores and the other cores are almost idle.
I think the MDS has a CPU core limitation, and I would need to raise that
limit to decrease latency. How can I do that?




Özkan Göksu, on Wed, 17 Jan 2024 at 07:44, wrote:

> Let me share some outputs about my cluster.
>
> root@ud-01:~# ceph fs status
> ud-data - 84 clients
> ===
> RANK  STATE          MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active  ud-data.ud-02.xcoojt  Reqs:   31 /s  3022k  3021k  52.6k  385k
>         POOL          TYPE     USED  AVAIL
> cephfs.ud-data.meta  metadata   136G  44.4T
> cephfs.ud-data.data    data    45.2T  44.4T
> STANDBY MDS
> ud-data.ud-03.lhwkml
> ud-data.ud-05.rnhcfe
> ud-data.ud-01.uatjle
> ud-data.ud-04.seggyv
>
> --
> This is "ceph tell mds.ud-data.ud-02.xcoojt session ls" output for the
> reported client for cache pressure warning.
>
> {
> "id": 1282205,
> "entity": {
> "name": {
> "type": "client",
> "num": 1282205
> },
> "addr": {
> "type": "v1",
> "addr": "172.16.3.48:0",
> "nonce": 2169935642
> }
> },
> "state": "open",
> "num_leases": 0,
> "num_caps": 52092,
> "request_load_avg": 1,
> "uptime": 75754.745608647994,
> "requests_in_flight": 0,
> "num_completed_requests": 0,
> "num_completed_flushes": 1,
> "reconnecting": false,
> "recall_caps": {
> "value": 2577232.0049106553,
> "halflife": 60
> },
> "release_caps": {
> "value": 1.4093491463510395,
> "halflife": 60
> },
> "recall_caps_throttle": {
> "value": 63733.985544098425,
> &q

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
Let me share some outputs about my cluster.

root@ud-01:~# ceph fs status
ud-data - 84 clients
===
RANK  STATE          MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  ud-data.ud-02.xcoojt  Reqs:   31 /s  3022k  3021k  52.6k  385k
        POOL          TYPE     USED  AVAIL
cephfs.ud-data.meta  metadata   136G  44.4T
cephfs.ud-data.data    data    45.2T  44.4T
STANDBY MDS
ud-data.ud-03.lhwkml
ud-data.ud-05.rnhcfe
ud-data.ud-01.uatjle
ud-data.ud-04.seggyv

--
This is "ceph tell mds.ud-data.ud-02.xcoojt session ls" output for the
reported client for cache pressure warning.

{
"id": 1282205,
"entity": {
"name": {
"type": "client",
"num": 1282205
},
"addr": {
"type": "v1",
"addr": "172.16.3.48:0",
"nonce": 2169935642
}
},
"state": "open",
"num_leases": 0,
"num_caps": 52092,
"request_load_avg": 1,
"uptime": 75754.745608647994,
"requests_in_flight": 0,
"num_completed_requests": 0,
"num_completed_flushes": 1,
"reconnecting": false,
"recall_caps": {
"value": 2577232.0049106553,
"halflife": 60
},
"release_caps": {
"value": 1.4093491463510395,
"halflife": 60
},
"recall_caps_throttle": {
"value": 63733.985544098425,
"halflife": 1.5
},
"recall_caps_throttle2o": {
"value": 19452.428409271757,
"halflife": 0.5
},
"session_cache_liveness": {
"value": 14.100272208890081,
"halflife": 300
},
"cap_acquisition": {
"value": 0,
"halflife": 10
},
"delegated_inos": [
{
"start": "0x10004a1c031",
"length": 282
},
{
"start": "0x10004a1c33f",
"length": 207
},
{
"start": "0x10004a1cdda",
"length": 6
},
{
"start": "0x10004a3c12e",
"length": 3
},
{
"start": "0x1000f9831fe",
"length": 2
}
],
"inst": "client.1282205 v1:172.16.3.48:0/2169935642",
"completed_requests": [],
"prealloc_inos": [
{
"start": "0x10004a1c031",
"length": 282
},
{
"start": "0x10004a1c33f",
"length": 207
},
{
"start": "0x10004a1cdda",
"length": 6
},
{
"start": "0x10004a3c12e",
"length": 3
},
{
"start": "0x1000f9831fe",
"length": 2
},
{
"start": "0x1000fa86e5f",
"length": 54
},
{
"start": "0x1000faa069c",
"length": 501
}
],
"client_metadata": {
"client_features": {
"feature_bits": "0x7bff"
},
"metric_spec": {
"metric_flags": {
"feature_bits": "0x03ff"
}
},
"entity_id": "admin",
"hostname": "bennevis-2",
"kernel_version": "5.15.0-91-generic",
"root": "/volumes/babblians"
}
}

Özkan Göksu , 17 Oca 2024 Çar, 07:22 tarihinde şunu
yazdı:

> Hello Eugen.
>
> Thank you for the answer.
> Based on the findings and test results in this issue:
> https://github.com/ceph/ceph/pull/38574
> I tried the advice there and appli

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
Hello Eugen.

Thank you for the answer.
Based on the findings and test results in this issue:
https://github.com/ceph/ceph/pull/38574
I tried the advice there and applied the following changes.

max_mds = 4
standby_mds = 1
mds_cache_memory_limit = 16GB
mds_recall_max_caps = 4
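(Roughly applied like this -- a sketch from memory; I'm treating
"standby_mds" above as the standby_count_wanted file system setting, and the
mds_recall_max_caps value is left as a placeholder:)

ceph fs set ud-data max_mds 4
ceph fs set ud-data standby_count_wanted 1
ceph config set mds mds_cache_memory_limit 17179869184   # 16 GB in bytes
ceph config set mds mds_recall_max_caps <value above>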

About one day after I set these parameters, I saw this log:
[8531248.982954] Out of memory: Killed process 1580586 (ceph-mds)
total-vm:70577592kB, anon-rss:70244236kB, file-rss:0kB, shmem-rss:0kB,
UID:167 pgtables:137832kB oom_score_adj:0

All of the MDS daemons leaked memory and were killed by the kernel.
Because of this I changed the settings as below; it is stable now, but
performance is very poor and I still get cache pressure alerts.

max_mds = 1
standby_mds = 5
mds_cache_memory_limit = 8GB
mds_recall_max_caps = 3

I'm very surprised that you are advising to decrease "mds_recall_max_caps",
because it is the opposite of what the developers advised in the issue I
sent.
It is very hard to play around with MDS parameters without an expert-level
understanding of what these parameters stand for and how they will affect
the behavior.
Because of this I'm trying to understand the MDS code flow, and I'm very
interested in learning more and tuning my system by debugging and
understanding my own data flow and MDS usage.

I have a rather unique data flow and I think I need to configure the system
for this case.
I have 80+ clients, and through all of these clients my users read a range
of objects, compare them on the GPU, generate new data, and write the new
data back into the cluster.
So my clients usually read objects only once and do not read the same
object again. Sometimes the same user runs multiple services on multiple
clients, and these services can read the same data from different clients.

So having a large cache is useless for my use case; I need to set up the
MDS and the CephFS client for this data flow.
When I debug the MDS RAM usage, I see high allocation all the time and I
wonder why. If none of my clients is reading an object any more, why does
the MDS not drop that data from its RAM?
I need to configure the MDS to read the data and drop it again very fast;
only if the data is constantly requested by clients do I want a RAM cache
tier.

I'm a little confused and need to learn more about how the MDS works and
how I should make multiple active MDS ranks faster for my subvolumes and
client data flow.

Best regards.



Eugen Block, on Tue, 16 Jan 2024 at 11:36, wrote:

> Hi,
>
> I have dealt with this topic multiple times; the SUSE team helped me
> understand what's going on under the hood. The summary can be found
> in this thread [1].
>
> What helped in our case was to reduce the mds_recall_max_caps from 30k
> (default) to 3k. We tried it in steps of 1k IIRC. So I suggest to
> reduce that value step by step (maybe start with 20k or something) to
> find the optimal value.
>
> Regards,
> Eugen
>
> [1] https://www.spinics.net/lists/ceph-users/msg73188.html
>
> Quoting Özkan Göksu:
>
> > Hello.
> >
> > I have a 5-node Ceph cluster and I'm constantly getting the "clients
> > failing to respond to cache pressure" warning.
> >
> > I have 84 CephFS kernel clients (servers), and my users access their
> > personal subvolumes, which are located in one pool.
> >
> > My users are software developers and the data is home/user data (Git
> > repos, Python projects, sample data and newly generated data).
> >
> >
> -
> > --- RAW STORAGE ---
> > CLASS   SIZE     AVAIL    USED   RAW USED  %RAW USED
> > ssd    146 TiB  101 TiB  45 TiB    45 TiB      30.71
> > TOTAL  146 TiB  101 TiB  45 TiB    45 TiB      30.71
> >
> > --- POOLS ---
> > POOL                 ID   PGS   STORED   OBJECTS   USED     %USED  MAX AVAIL
> > .mgr                  1     1  356 MiB       90   1.0 GiB      0      30 TiB
> > cephfs.ud-data.meta   9   256   69 GiB     3.09M  137 GiB   0.15      45 TiB
> > cephfs.ud-data.data  10  2048   26 TiB   100.83M   44 TiB  32.97      45 TiB
> >
> -
> > root@ud-01:~# ceph fs status
> > ud-data - 84 clients
> > ===
> > RANK  STATE          MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
> >  0    active  ud-data.ud-04.seggyv  Reqs:  142 /s  2844k  2798k   303k  720k
> >         POOL          TYPE     USED  AVAIL
> > cephfs.ud-data.meta  metadata   137G  44.9T
> > cephfs.ud-data.data    data    44.2T  44.9T
> > STANDBY MDS
> > ud-data.ud-02.xcoojt
> > ud-data.ud-05.rnhcfe
> > ud-data.ud-03.lhwkml
> > ud-data.ud-01.uatjle
> > 

[ceph-users] 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-12 Thread Özkan Göksu
Hello.

I have a 5-node Ceph cluster and I'm constantly getting the "clients failing
to respond to cache pressure" warning.

I have 84 CephFS kernel clients (servers), and my users access their
personal subvolumes, which are located in one pool.

My users are software developers and the data is home/user data (Git repos,
Python projects, sample data and newly generated data).

-
--- RAW STORAGE ---
CLASS   SIZE     AVAIL    USED   RAW USED  %RAW USED
ssd    146 TiB  101 TiB  45 TiB    45 TiB      30.71
TOTAL  146 TiB  101 TiB  45 TiB    45 TiB      30.71

--- POOLS ---
POOL                 ID   PGS   STORED   OBJECTS   USED     %USED  MAX AVAIL
.mgr                  1     1  356 MiB       90   1.0 GiB      0      30 TiB
cephfs.ud-data.meta   9   256   69 GiB     3.09M  137 GiB   0.15      45 TiB
cephfs.ud-data.data  10  2048   26 TiB   100.83M   44 TiB  32.97      45 TiB
-
root@ud-01:~# ceph fs status
ud-data - 84 clients
===
RANK  STATE          MDS             ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  ud-data.ud-04.seggyv  Reqs:  142 /s  2844k  2798k   303k  720k
        POOL          TYPE     USED  AVAIL
cephfs.ud-data.meta  metadata   137G  44.9T
cephfs.ud-data.data    data    44.2T  44.9T
STANDBY MDS
ud-data.ud-02.xcoojt
ud-data.ud-05.rnhcfe
ud-data.ud-03.lhwkml
ud-data.ud-01.uatjle
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable)

---
My MDS settings are below:

mds_cache_memory_limit| 8589934592
mds_cache_trim_threshold  | 524288
mds_recall_global_max_decay_threshold | 131072
mds_recall_max_caps   | 3
mds_recall_max_decay_rate | 1.50
mds_recall_max_decay_threshold| 131072
mds_recall_warning_threshold  | 262144


I have 2 questions:
1- What should I do to prevent the cache pressure warning?
2- What can I do to increase speed?

- Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io