Hi Yoann,

I'm not using Pacific yet, but this looks very strange to me:

  cephfs_data      data     243T  19.7T
    usage:   245 TiB used, 89 TiB / 334 TiB avail

I'm not sure if raw vs. stored values are getting mixed up here. Assuming the 
cephfs_data allocation is right, I'm wondering what your OSD nearfull/full 
ratios are. The PG counts look very good. The slow ops can have two causes: a 
bad disk or full OSDs. Looking at 19.7/(243+19.7)=7.5% free for the pool, I 
wonder why there are no OSD nearfull/full warnings all over the place. Even if 
it's still 20% free, performance can degrade dramatically according to 
benchmarks we ran on Octopus.
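
To rule that out quickly, something like this should show the configured 
ratios and the per-OSD fill levels (just a sketch, adjust to your setup):

  ceph osd dump | grep -i ratio   # full_ratio / backfillfull_ratio / nearfull_ratio
  ceph osd df                     # per-OSD %USE plus MIN/MAX VAR at the bottom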

I think you need to provide a lot more details here. Of interest are:

ceph df detail
ceph osd df tree

and possibly a few others. I don't think the multi-MDS mode is what's hurting 
you, but you should check. We have seen degraded performance on Mimic caused by 
excessive export_dir operations between the MDSes. However, I can't see such 
operations reported as stuck here. You might want to check on your MDSes with 
'ceph daemon mds.xyz ops | grep -e dirfrag -e export' and/or similar commands. 
You should also report what kind of operations tend to be stuck the longest.
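
For example, something along these lines on each active MDS (the daemon name 
is a placeholder, adjust to yours):

  # ops currently in flight, filtered for subtree export / dirfrag activity
  ceph daemon mds.<name> ops | grep -e dirfrag -e export
  # full dump, including how long each op has been pending
  ceph daemon mds.<name> dump_ops_in_flight
  # export/import counters since daemon start (counter names from memory)
  ceph daemon mds.<name> perf dump | grep -e exported -e imported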

I also remember that there used to be problems with having a kernel-client 
(kclient) ceph fs mount on OSD nodes. Not sure if this could play a role here.
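
A quick check for kernel CephFS mounts on the OSD hosts could look like this 
(a sketch):

  # run on each OSD host; any output means a kclient mount is present
  mount -t ceph
  grep ' ceph ' /proc/mounts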

You have basically zero IO going on:

    client:   6.2 MiB/s rd, 12 MiB/s wr, 10 op/s rd, 366 op/s wr

yet, PGs are laggy. The problem could sit in a non-Ceph component.

With the hardware you have, there is something very weird going on. You might 
also want to check that you have the correct MTU on all devices on every single 
host and that the negotiated link speed is the same everywhere. I have seen 
problems like these caused by a single host with a wrong MTU and by LACP bonds 
with a broken transceiver.
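
Something like this on every host should catch the obvious cases (interface 
and bond names below are just examples):

  ip -o link show | awk '{print $2, $4, $5}'   # MTU per interface
  ethtool eth0 | grep -e Speed -e Duplex       # negotiated speed/duplex
  cat /proc/net/bonding/bond0                  # LACP bond and slave state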

Something else to check is flaky controller/PCIe connections. We had a case 
where a controller was behaving oddly and we saw a huge number of device resets 
in the logs. On the host with the broken controller, IO wait was way above 
average (as shown by top). Something similar might happen with NVMes. A painful 
procedure to locate a bad host could be to out the OSDs manually on a single 
host and wait for PGs to peer and become active. If you have a bad host, at 
that moment IO should recover to good levels. Do this host by host. I know, it 
will take a day or two, but, well, it might locate something.
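
A rough sketch of how that could look per host (the host name is made up; 
setting the norebalance/nobackfill flags first is my own addition to avoid 
data movement while testing):

  dmesg -T | grep -i -e reset -e 'i/o error'   # controller/device trouble?
  ceph osd set norebalance
  ceph osd set nobackfill
  ceph osd out $(ceph osd ls-tree host1)       # out all OSDs of one host
  ceph -s                                      # wait for PGs to peer/activate, then test client IO
  ceph osd in $(ceph osd ls-tree host1)        # bring the host back before the next one
  ceph osd unset nobackfill
  ceph osd unset norebalance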

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <ste...@bit.nl>
Sent: 13 October 2022 13:56:45
To: Yoann Moulin; Patrick Donnelly
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS Performance and PG/PGP value

On 10/13/22 13:47, Yoann Moulin wrote:
>> Also, you mentioned you're using 7 active MDS. How's that working out
>> for you? Do you use pinning?
>
> I don't really know how to do that. I have 55 worker nodes in my K8s
> cluster, each one can run pods that have access to a cephfs PVC. We have
> 28 cephfs persistent volumes. The pods are ML/DL/AI workloads, each can be
> started and stopped whenever our researchers need it. The workloads are
> unpredictable.

See [1] and [2].

Gr. Stefan

[1]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
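
For reference, manual pinning from [1] boils down to setting an extended 
attribute (paths and ranks below are just examples):

  # pin a directory subtree to MDS rank 2
  setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/some/dir
  # or distribute its immediate children across ranks, see [2]
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes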

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io