[ceph-users] Re: MDS Performance and PG/PGP value

2022-11-08 Thread Stefan Kooman

On 10/7/22 16:50, Yoann Moulin wrote:



By the way, since I have set PG=256, I have far fewer SLOW requests than 
before; even though I still get some, the impact on my users has been reduced a lot.


# zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
floki.log.4.gz:6883
floki.log.3.gz:11794
floki.log.2.gz:3391
floki.log.1.gz:1180
floki.log:122


If I have the opportunity, I will try to run some benchmarks with 
multiple values of pg_num on the cephfs_metadata pool.


Two more things I want to add:

- After PG splitting / rebalancing: do an "OSD compaction" of all your 
OSDs to optimize their RocksDB (really important: run "ceph-kvstore-tool 
bluestore-kv /var/lib/ceph/osd/ceph-$osd-id compact" while the OSD is not 
running). A minimal sketch is below.
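
Something like this per OSD (a sketch only, assuming a package-based / systemd
deployment; the OSD id is a placeholder and the "noout" flag is optional):

OSD_ID=12                          # placeholder OSD id
ceph osd set noout                 # optional: avoid rebalancing while the OSD is down
systemctl stop ceph-osd@${OSD_ID}
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-${OSD_ID} compact
systemctl start ceph-osd@${OSD_ID}
ceph osd unset noout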


- How is the distribution of your CephFS primary PGs? You can check with 
this AWK magic (not mine btw, but it's in our Ceph cheatsheet):



ceph pg dump | awk '
BEGIN { IGNORECASE = 1 }
 /^PG_STAT/ { col=1; while($col!="UP") {col++}; col++ }
 /^[0-9a-f]+\.[0-9a-f]+/ { match($0,/^[0-9a-f]+/); pool=substr($0, RSTART, RLENGTH); poollist[pool]=0;
 up=$col; i=0; RSTART=0; RLENGTH=0; delete osds; while(match(up,/[0-9]+/)>0) { osds[++i]=substr(up,RSTART,RLENGTH); up = substr(up, RSTART+RLENGTH) }
 for(i in osds) {array[osds[i],pool]++; osdlist[osds[i]];}
}
END {
 printf("\n");
 printf("pool :\t"); for (i in poollist) printf("%s\t",i); printf("| SUM \n");
 for (i in poollist) printf("--------"); printf("\n");
 for (i in osdlist) { printf("osd.%i\t", i); sum=0;
   for (j in poollist) { printf("%i\t", array[i,j]); sum+=array[i,j]; sumpool[j]+=array[i,j] }; printf("| %i\n",sum) }
 for (i in poollist) printf("--------"); printf("\n");
 printf("SUM :\t"); for (i in poollist) printf("%s\t",sumpool[i]); printf("|\n");
}'

If some OSDs are more loaded with primaries than others, that might be a 
bottleneck sometimes.
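
If you only care about one pool, a rough way to count acting primaries per OSD
(a sketch, assuming jq is installed and the JSON field names of recent releases):

ceph pg ls-by-pool cephfs_metadata -f json \
  | jq -r '.pg_stats[].acting_primary' | sort -n | uniq -c | sort -rn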


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-13 Thread Frank Schilder
Hi Yoann,

I'm not using Pacific yet, but this looks very strange to me:

  cephfs_data  data 243T  19.7T
usage:   245 TiB used, 89 TiB / 334 TiB avail

I'm not sure if there is a mix of raw vs. stored here. Assuming the cephfs_data 
allocation is right, I'm wondering what your osd [near] full ratios are. The PG 
counts look very good. The slow ops can have two reasons: a bad disk or full 
OSDs. Looking at 19.7/(243+19.7), that is only about 7.5% free, I wonder why there 
are no osd [near] full warnings all over the place. Even if it's still 20% free, 
performance can degrade dramatically according to benchmarks we made on octopus.
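
To check the configured ratios and the per-OSD fill level (plain status
commands, a sketch only):

ceph osd dump | grep -i ratio    # full_ratio / backfillfull_ratio / nearfull_ratio
ceph osd df                      # per-OSD utilisation (%USE column)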

I think you need to provide a lot more details here. Of interest are:

ceph df detail
ceph osd df tree

and possibly a few others. I don't think the multi-MDS mode is bugging you, but 
you should check. We have seen degraded performance on mimic caused by 
excessive export_dir operations between the MDSes. However, I can't see such 
operations reported as stuck. You might want to check on your MDSes with "ceph 
daemon mds.xyz ops | grep -e dirfrag -e export" and/or similar commands. You 
should also report which kinds of operations tend to be stuck the longest.
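
A rough way to see which operations have been stuck the longest on one MDS (a
sketch, assuming jq is available on the MDS host and the usual admin socket
JSON layout):

MDS_NAME=xyz                     # placeholder: your MDS daemon name, see "ceph fs status"
ceph daemon mds.${MDS_NAME} ops \
  | jq -r '.ops | sort_by(.age) | reverse | .[:10][] | "\(.age)s  \(.description)"'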

I also remember that there used to be problems having a kclient ceph fs mount 
on OSD nodes. Not sure if this could play a role here.

You have basically zero IO going on:

client:   6.2 MiB/s rd, 12 MiB/s wr, 10 op/s rd, 366 op/s wr

yet PGs are laggy. The problem could also lie in a non-Ceph component.

With the hardware you have, there is something very weird going on. You might 
also want to check that you have the correct MTU on all devices on every single 
host and that the negotiated link speed is the same everywhere. I have seen 
problems like these caused by a single host with a wrong MTU and by LACP bonds 
with a broken transceiver.
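
A quick sketch for checking this on every host (host names and the interface
name are placeholders; ethtool output may differ per driver):

for h in node01 node02 node03; do
  echo "== $h"
  ssh "$h" "ip -o link show | awk '{print \$2, \$4, \$5}'"   # interface, "mtu", value
  ssh "$h" "ethtool bond0 | grep -E 'Speed|Duplex'"
done
# jumbo-frame sanity check between two hosts (9000 MTU -> 8972 byte payload):
ping -M do -s 8972 -c 3 node02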

Something else to check is flaky controller/PCIe connections. We had a case 
where a controller was behaving oddly and we had a huge number of device resets 
in the logs. On the host with the broken controller, IO wait was way above 
average (shown by top). Something similar might happen with NVMes. A painful 
procedure to locate a bad host could be to out the OSDs manually on a single host 
and wait for PGs to peer and become active. If you have a bad host, at that 
moment IO should recover to good levels. Do this host by host. I know, it will 
take a day or two but, well, it might locate something.
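
A sketch of that per-host procedure (host name is a placeholder; this only marks
the OSDs out, it does not stop them):

HOST=node01
ceph osd out $(ceph osd ls-tree "$HOST")
# wait for peering/backfill, watch whether client IO recovers, then revert:
ceph osd in $(ceph osd ls-tree "$HOST")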

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: 13 October 2022 13:56:45
To: Yoann Moulin; Patrick Donnelly
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: MDS Performance and PG/PGP value

On 10/13/22 13:47, Yoann Moulin wrote:
>> Also, you mentioned you're using 7 active MDS. How's that working out
>> for you? Do you use pinning?
>
> I don't really know how to do that. I have 55 worker nodes in my K8s
> cluster, each one can run pods that have access to a CephFS PVC. We have
> 28 CephFS persistent volumes. Pods are ML/DL/AI workloads; each can be
> started and stopped whenever our researchers need it. The workloads are
> unpredictable.

See [1] and [2].

Gr. Stefan

[1]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-13 Thread Stefan Kooman

On 10/13/22 13:47, Yoann Moulin wrote:

Also, you mentioned you're using 7 active MDS. How's that working out
for you? Do you use pinning?


I don't really know how to do that. I have 55 worker nodes in my K8s 
cluster, each one can run pods that have access to a CephFS PVC. We have 
28 CephFS persistent volumes. Pods are ML/DL/AI workloads; each can be 
started and stopped whenever our researchers need it. The workloads are 
unpredictable.


See [1] and [2].

Gr. Stefan

[1]: 
https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]: 
https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
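
For example, pinning is done with extended attributes on directories of a
mounted file system (a minimal sketch following [1] and [2]; the mount point and
directory names are placeholders):

# pin a directory tree to MDS rank 2  [1]
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/volumes/group-a
# or distribute the immediate children of a directory across all ranks  [2]
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes
# undo again
setfattr -n ceph.dir.pin.distributed -v 0 /mnt/cephfs/volumes
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/volumes/group-a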


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-13 Thread Yoann Moulin

Hello Patrick,

Unfortunately, increasing the number of PGs did not help much in the end; my 
cluster is still in trouble...

Here is the current state of my cluster: https://pastebin.com/Avw5ybgd


Is 256 a good value in our case? We have 80 TB of data with more than 300M files.


You want at least as many PGs so that each of the OSDs hosts a portion of the OMAP 
data. You want to spread out OMAP to as many _fast_ OSDs as possible.

I have tried to find an answer to your question: are more metadata PGs better? 
I haven't found a definitive answer. This would ideally be tested in a non-prod 
/ pre-prod environment and tuned to individual requirements (type of workload). 
For now, I would not blindly trust the PG autoscaler. I have seen it advise 
settings that would definitely not be OK. You can skew things in the autoscaler 
with the "bias" parameter to compensate for this. But as far as I know, the 
current heuristics to determine a good value do not take into account the 
importance of OMAP (RocksDB) spread across OSDs. See a blog post about 
autoscaler tuning [1].
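
For reference, the bias mentioned above is a per-pool setting (a sketch; the
pool name is the one from this thread and the value is only illustrative):

ceph osd pool set cephfs_metadata pg_autoscale_bias 4
ceph osd pool get cephfs_metadata pg_autoscale_bias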

It would be great if tuning metadata PGs for CephFS / RGW could be covered during 
the "large scale tests" the devs are planning to perform in the future, with use 
cases that take into consideration "a lot of small files / objects" versus "loads 
of large files / objects", to get a feeling for how this tuning impacts 
performance for different workloads.

Gr. Stefan

[1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/


Thanks for the information. I agree that the autoscaler does not seem to be able 
to handle my use case.
(thanks to icepic...@gmail.com too)

By the way, since I have set PG=256, I have far fewer SLOW requests than 
before; even though I still get some, the impact on my users has been reduced a lot.


# zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' 
floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
floki.log.4.gz:6883
floki.log.3.gz:11794
floki.log.2.gz:3391
floki.log.1.gz:1180
floki.log:122


If I have the opportunity, I will try to run some benchmarks with multiple values 
of pg_num on the cephfs_metadata pool.


256 sounds like a good number to me. Maybe even 128. If you do some
experiments, please do share the results.


Yes, of course.


Also, you mentioned you're using 7 active MDS. How's that working out
for you? Do you use pinning?


I don't really know how to do that. I have 55 worker nodes in my K8s cluster, each one can run pods that have access to a CephFS PVC. We have 28 CephFS persistent volumes. Pods are ML/DL/AI 
workloads; each can be started and stopped whenever our researchers need it. The workloads are unpredictable.


Thanks for your help.

Best regards,

--
Yoann Moulin
EPFL IC-IT

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-10 Thread Patrick Donnelly
Hello Yoann,

On Fri, Oct 7, 2022 at 10:51 AM Yoann Moulin  wrote:
>
> Hello,
>
> >> Is 256 a good value in our case ? We have 80 TB of data with more than 300M 
> >> files.
> >
> > You want at least as many PGs so that each of the OSDs hosts a portion of the 
> > OMAP data. You want to spread out OMAP to as many _fast_ OSDs as possible.
> >
> > I have tried to find an answer to your question: are more metadata PGs 
> > better? I haven't found a definitive answer. This would ideally be tested 
> > in a non-prod / pre-prod environment and tuned to individual requirements 
> > (type of workload). For now, I would not blindly trust the PG autoscaler. 
> > I have seen it advise settings that would definitely not be OK. You can skew 
> > things in the autoscaler with the "bias" parameter to compensate for this. 
> > But as far as I know, the current heuristics to determine a good value do not 
> > take into account the importance of OMAP (RocksDB) spread across OSDs. 
> > See a blog post about autoscaler tuning [1].
> >
> > It would be great if tuning metadata PGs for CephFS / RGW could be covered 
> > during the "large scale tests" the devs are planning to perform in the 
> > future, with use cases that take into consideration "a lot of small files / 
> > objects" versus "loads of large files / objects", to get a feeling for how 
> > this tuning impacts performance for different workloads.
> >
> > Gr. Stefan
> >
> > [1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/
>
> Thanks for the information. I agree that the autoscaler does not seem to be 
> able to handle my use case.
> (thanks to icepic...@gmail.com too)
>
> By the way, since I have set PG=256, I have far fewer SLOW requests than 
> before; even though I still get some, the impact on my users has been reduced a lot.
>
> > # zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' 
> > floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
> > floki.log.4.gz:6883
> > floki.log.3.gz:11794
> > floki.log.2.gz:3391
> > floki.log.1.gz:1180
> > floki.log:122
>
> If I have the opportunity, I will try to run some benchmarks with multiple 
> values of pg_num on the cephfs_metadata pool.

256 sounds like a good number to me. Maybe even 128. If you do some
experiments, please do share the results.

Also, you mentioned you're using 7 active MDS. How's that working out
for you? Do you use pinning?


--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Performance and PG/PGP value

2022-10-06 Thread Janne Johansson
> Hello
>
> As previously described here, we have a full-flash NVMe Ceph cluster (16.2.6) 
> with currently only the cephfs service configured.
[...]
> We noticed that the cephfs_metadata pool had only 16 PGs. We set 
> autoscale_mode to off and increased the number of PGs to 256, and with this 
> change the number of SLOW messages has decreased drastically.
>
> Is there any mechanism to increase the number of PGs automatically in such a 
> situation? Or is this something to do manually?
>

https://ceph.io/en/news/blog/2022/autoscaler_tuning/
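
For completeness, the manual route described in the quoted message looks roughly
like this (a sketch; pool name and target value taken from the thread):

ceph osd pool set cephfs_metadata pg_autoscale_mode off
ceph osd pool set cephfs_metadata pg_num 256
# on Nautilus and newer, pgp_num follows pg_num automatically;
# on older releases set it explicitly:
# ceph osd pool set cephfs_metadata pgp_num 256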


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io