[ceph-users] Re: MDS Performance and PG/PGP value
On 10/7/22 16:50, Yoann Moulin wrote:
> By the way, since I have set PG=256, I have much fewer SLOW requests than
> before. Even if I still have some, the impact on my users has been reduced
> a lot.
>
> # zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
> floki.log.4.gz:6883
> floki.log.3.gz:11794
> floki.log.2.gz:3391
> floki.log.1.gz:1180
> floki.log:122
>
> If I have the opportunity, I will try to run some benchmarks with multiple
> values of pg_num on the cephfs_metadata pool.

Two more things I want to add:

- After PG splitting / rebalancing: do an "OSD compaction" of all your OSDs
  to optimize your RocksDB. Really important: run

    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd-id compact

  while the OSD is not running.

- How is the distribution of your CephFS primary PGs? You can check with
  this AWK magic (not mine btw, but it's in our Ceph cheatsheet):

ceph pg dump | awk '
 BEGIN { IGNORECASE = 1 }
 /^PG_STAT/ { col=1; while($col!="UP") {col++}; col++ }
 /^[0-9a-f]+\.[0-9a-f]+/ {
   match($0,/^[0-9a-f]+/); pool=substr($0, RSTART, RLENGTH); poollist[pool]=0;
   up=$col; i=0; RSTART=0; RLENGTH=0; delete osds;
   while(match(up,/[0-9]+/)>0) { osds[++i]=substr(up,RSTART,RLENGTH); up = substr(up, RSTART+RLENGTH) }
   for(i in osds) { array[osds[i],pool]++; osdlist[osds[i]]; }
 }
 END {
   printf("\n");
   printf("pool :\t"); for (i in poollist) printf("%s\t",i); printf("| SUM \n");
   for (i in poollist) printf(""); printf("\n");
   for (i in osdlist) {
     printf("osd.%i\t", i); sum=0;
     for (j in poollist) { printf("%i\t", array[i,j]); sum+=array[i,j]; sumpool[j]+=array[i,j] };
     printf("| %i\n",sum)
   }
   for (i in poollist) printf(""); printf("\n");
   printf("SUM :\t"); for (i in poollist) printf("%s\t",sumpool[i]); printf("|\n");
 }'

If some OSDs are more loaded with primaries than others, that might be a
bottleneck sometimes.

Gr. Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
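The per-OSD compaction step above can be sketched as a small helper. This is a hedged sketch, not an authoritative procedure: it only prints the commands (stop OSD, compact, start OSD) so you can review them first, and it assumes systemd-managed OSDs with units named ceph-osd@<id> and the default data path; adapt both to your deployment.

```shell
#!/bin/sh
# Print (do not run) the offline-compaction steps for one OSD.
# The ceph-osd@<id> unit name and the OSD data path are assumptions;
# adjust them to match your own deployment.
compact_cmds() {
    osd_id="$1"
    echo "systemctl stop ceph-osd@${osd_id}"
    echo "ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-${osd_id} compact"
    echo "systemctl start ceph-osd@${osd_id}"
}

# Review the generated commands, then run them one OSD at a time.
compact_cmds 3
```

Doing one OSD at a time keeps the cluster's redundancy intact while each OSD is briefly down.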
[ceph-users] Re: MDS Performance and PG/PGP value
Hi Yoann,

I'm not using Pacific yet, but this here looks very strange to me:

  cephfs_data data 243T 19.7T
  usage: 245 TiB used, 89 TiB / 334 TiB avail

I'm not sure if there is a mix of raw vs. stored here. Assuming the
cephfs_data allocation is right, I'm wondering what your osd [near] full
ratios are.

The PG counts look very good. The slow ops can have 2 reasons: a bad disk
or full OSDs. Looking at 19.7/(243+19.7) ≈ 7.5% free, I wonder why there
are no osd [near] full warnings all over the place. Even if it's still 20%
free, performance can degrade dramatically, according to benchmarks we made
on Octopus. I think you need to provide a lot more details here. Of
interest are:

  ceph df detail
  ceph osd df tree

and possibly a few others.

I don't think the multi-MDS mode is bugging you, but you should check. We
have seen degraded performance on Mimic caused by excessive export_dir
operations between the MDSes. However, I can't see such operations reported
as stuck. You might want to check on your MDSes with

  ceph daemon mds.xyz ops | grep -e dirfrag -e export

and/or similar commands. You should report a bit what kind of operations
tend to be stuck longest.

I also remember that there used to be problems having a kclient ceph fs
mount on OSD nodes. Not sure if this could play a role here.

You have basically zero IO going on:

  client: 6.2 MiB/s rd, 12 MiB/s wr, 10 op/s rd, 366 op/s wr

yet PGs are laggy. The problem could sit on a non-ceph component. With the
hardware you have, there is something very weird going on. You might also
want to check that you have the correct MTU on all devices on every single
host and that the negotiated speed is the same. I have seen problems like
these with a single host having a wrong MTU, and with LACP bonds with a
broken transceiver.

Something else to check is flaky controller/PCIe connections. We had a case
where a controller was behaving oddly and we had a huge amount of device
resets in the logs.
On the host with the broken controller, IO wait was way above average
(shown by top). Something similar might happen with NVMes.

A painful procedure to locate a bad host could be to out OSDs manually on a
single host and wait for PGs to peer and become active. If you have a bad
host, in this moment IO should recover to good levels. Do this host by
host. I know, it will be a day or two but, well, it might locate something.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman
Sent: 13 October 2022 13:56:45
To: Yoann Moulin; Patrick Donnelly
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS Performance and PG/PGP value

On 10/13/22 13:47, Yoann Moulin wrote:
>> Also, you mentioned you're using 7 active MDS. How's that working out
>> for you? Do you use pinning?
>
> I don't really know how to do that. I have 55 worker nodes in my K8s
> cluster, each one can run pods that have access to a cephfs PVC. We have
> 28 cephfs persistent volumes. Pods are ML/DL/AI workloads; each can be
> started and stopped whenever our researchers need it. The workloads are
> unpredictable.

See [1] and [2].

Gr. Stefan

[1]: https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]: https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
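Frank's host-by-host isolation procedure can be sketched as a helper that prints the commands for one host. A hedged sketch only: it does not talk to a cluster, the host name and OSD ids shown are illustrative (the real mapping comes from `ceph osd df tree`), and you would mark the OSDs back in with `ceph osd in` before moving to the next host.

```shell
#!/bin/sh
# Print (do not run) the "out all OSDs on one host" step of the
# host-by-host isolation test. node01 and the OSD ids 12 13 14 are
# placeholders; take the real mapping from `ceph osd df tree`.
out_host_cmds() {
    host="$1"; shift
    for osd_id in "$@"; do
        echo "ceph osd out ${osd_id}"
    done
    echo "# wait for PGs to peer and become active, then watch client IO with ${host} out"
}

out_host_cmds node01 12 13 14
```

If client IO recovers while a particular host is out, that host is the prime suspect.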
[ceph-users] Re: MDS Performance and PG/PGP value
On 10/13/22 13:47, Yoann Moulin wrote:
>> Also, you mentioned you're using 7 active MDS. How's that working out
>> for you? Do you use pinning?
>
> I don't really know how to do that. I have 55 worker nodes in my K8s
> cluster, each one can run pods that have access to a cephfs PVC. We have
> 28 cephfs persistent volumes. Pods are ML/DL/AI workloads; each can be
> started and stopped whenever our researchers need it. The workloads are
> unpredictable.

See [1] and [2].

Gr. Stefan

[1]: https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]: https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
[ceph-users] Re: MDS Performance and PG/PGP value
Hello Patrick,

Unfortunately, increasing the number of PGs did not help a lot in the end;
my cluster is still in trouble... Here is the current state of my cluster:
https://pastebin.com/Avw5ybgd

>> Is 256 a good value in our case? We have 80 TB of data with more than
>> 300M files.
>
> You want at least as many PGs that each of the OSDs host a portion of the
> OMAP data. You want to spread out OMAP to as many _fast_ OSDs as possible.
>
> I have tried to find an answer to your question: are more metadata PGs
> better? I haven't found a definitive answer. This would ideally be tested
> in a non-prod / pre-prod environment and tuned to individual requirements
> (type of workload). For now, I would not blindly trust the PG autoscaler.
> I have seen it advise settings that would definitely not be OK. You can
> skew things in the autoscaler with the "bias" parameter to compensate for
> this. But as far as I know, the current heuristics to determine a good
> value do not take into account the importance of OMAP (RocksDB) spread
> across OSDs. See a blog post about autoscaler tuning [1].
>
> It would be great if tuning metadata PGs for CephFS / RGW could be
> performed during the "large scale tests" the devs are planning to perform
> in the future. With use cases that take into consideration "a lot of small
> files / objects" versus "loads of large files / objects", to get a feeling
> for how tuning this impacts performance for different workloads.
>
> Gr. Stefan
>
> [1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/

Thanks for the information. I agree that the autoscaler seems not to be
able to handle my use case. (Thanks to icepic...@gmail.com too.)

By the way, since I have set PG=256, I have much fewer SLOW requests than
before. Even if I still have some, the impact on my users has been reduced
a lot.

# zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
floki.log.4.gz:6883
floki.log.3.gz:11794
floki.log.2.gz:3391
floki.log.1.gz:1180
floki.log:122

If I have the opportunity, I will try to run some benchmarks with multiple
values of pg_num on the cephfs_metadata pool.

> 256 sounds like a good number to me. Maybe even 128. If you do some
> experiments, please do share the results.

Yes, of course.

> Also, you mentioned you're using 7 active MDS. How's that working out for
> you? Do you use pinning?

I don't really know how to do that. I have 55 worker nodes in my K8s
cluster, each one can run pods that have access to a cephfs PVC. We have
28 cephfs persistent volumes. Pods are ML/DL/AI workloads; each can be
started and stopped whenever our researchers need it. The workloads are
unpredictable.

Thanks for your help.

Best regards,

--
Yoann Moulin
EPFL IC-IT
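The change discussed in this thread (autoscaler off, pg_num raised to 256 on the metadata pool) can be sketched as a helper that prints the ceph commands. A hedged sketch: it only echoes the commands for review, and on recent Ceph releases pgp_num normally follows pg_num automatically, so the explicit pgp_num step may be redundant on your version.

```shell
#!/bin/sh
# Print (do not run) the commands to pin a pool's PG count by hand,
# as discussed in the thread for cephfs_metadata and pg_num=256.
pg_bump_cmds() {
    pool="$1"; target="$2"
    echo "ceph osd pool set ${pool} pg_autoscale_mode off"
    echo "ceph osd pool set ${pool} pg_num ${target}"
    echo "ceph osd pool set ${pool} pgp_num ${target}"
}

pg_bump_cmds cephfs_metadata 256
```

Raising pg_num triggers PG splitting and data movement, so schedule it for a quiet period and compact OSDs afterwards, as Stefan suggests elsewhere in the thread.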
[ceph-users] Re: MDS Performance and PG/PGP value
Hello Yoann,

On Fri, Oct 7, 2022 at 10:51 AM Yoann Moulin wrote:
>
> Hello,
>
> >> Is 256 a good value in our case? We have 80TB of data with more than
> >> 300M files.
> >
> > You want at least as many PGs that each of the OSDs host a portion of the
> > OMAP data. You want to spread out OMAP to as many _fast_ OSDs as possible.
> >
> > I have tried to find an answer to your question: are more metadata PGs
> > better? I haven't found a definitive answer. This would ideally be tested
> > in a non-prod / pre-prod environment and tuned to individual requirements
> > (type of workload). For now, I would not blindly trust the PG autoscaler.
> > I have seen it advise settings that would definitely not be OK. You can
> > skew things in the autoscaler with the "bias" parameter, to compensate
> > for this. But as far as I know the current heuristics to determine a good
> > value do not take into account the importance of OMAP (RocksDB) spread
> > across OSDs. See a blog post about autoscaler tuning [1].
> >
> > It would be great if tuning metadata PGs for CephFS / RGW could be
> > performed during the "large scale tests" the devs are planning to perform
> > in the future. With use cases that take into consideration "a lot of
> > small files / objects" versus "loads of large files / objects" to get a
> > feeling how tuning this impacts performance for different workloads.
> >
> > Gr. Stefan
> >
> > [1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/
>
> Thanks for the information, I agree that the autoscaler seems to not be
> able to handle my use case.
> (thanks to icepic...@gmail.com too)
>
> By the way, since I have set PG=256, I have much fewer SLOW requests than
> before. Even if I still have some, the impact on my users has been reduced
> a lot.
>
> > # zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)'
> > floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
> > floki.log.4.gz:6883
> > floki.log.3.gz:11794
> > floki.log.2.gz:3391
> > floki.log.1.gz:1180
> > floki.log:122
>
> If I have the opportunity, I will try to run some benchmarks with multiple
> values of pg_num on the cephfs_metadata pool.

256 sounds like a good number to me. Maybe even 128. If you do some
experiments, please do share the results.

Also, you mentioned you're using 7 active MDS. How's that working out for
you? Do you use pinning?

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
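Patrick's pinning question maps onto the extended attributes that CephFS exposes for multi-MDS subtree partitioning (see the multimds docs linked elsewhere in the thread). A hedged sketch that only prints the setfattr commands; /mnt/cephfs/project-a and /mnt/cephfs/volumes are placeholder paths, and the right rank or policy depends entirely on the workload.

```shell
#!/bin/sh
# Print (do not run) example CephFS pinning commands for a mounted
# filesystem. The paths are placeholders; pick directories matching
# your persistent-volume layout.
pin_cmds() {
    # Pin one directory tree to a fixed MDS rank (here rank 0).
    echo "setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/project-a"
    # Or let Ceph spread the immediate children of a parent directory
    # across ranks (ephemeral distributed pinning).
    echo "setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes"
}

pin_cmds
```

For the K8s case in this thread, distributed pinning on the directory holding the 28 PV subdirectories would be the natural first experiment, since the per-PV load is unpredictable.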
[ceph-users] Re: MDS Performance and PG/PGP value
> Hello
>
> As previously described here, we have a full-flash NVMe Ceph cluster
> (16.2.6) with currently only the CephFS service configured. [...]
>
> We noticed that the cephfs_metadata pool had only 16 PGs. We set
> autoscale_mode to off and increased the number of PGs to 256, and with
> this change the number of SLOW messages has decreased drastically.
>
> Is there any mechanism to increase the number of PGs automatically in
> such a situation? Or is this something to do manually?

https://ceph.io/en/news/blog/2022/autoscaler_tuning/

--
May the most significant bit of your life be positive.
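If you prefer to keep the autoscaler on but weight the metadata pool more heavily, the per-pool "bias" knob mentioned earlier in the thread (and in the linked blog post) can be raised instead of setting pg_num by hand. A hedged sketch that only prints the command; the bias value 8 is purely illustrative, not a recommendation.

```shell
#!/bin/sh
# Print (do not run) a per-pool autoscaler bias change. The pool name
# matches this thread; the bias value 8 is an illustrative guess.
bias_cmd() {
    pool="$1"; bias="$2"
    echo "ceph osd pool set ${pool} pg_autoscale_bias ${bias}"
}

bias_cmd cephfs_metadata 8
```

A higher bias makes the autoscaler target proportionally more PGs for that pool than its raw capacity share would suggest, which is the knob intended for OMAP-heavy metadata pools.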