On 20/11/2023 at 09:24:41+0000, Frank Schilder wrote:
Hi, 

Thanks, everyone, for your answers.

> 
> we are using something similar for ceph-fs. For a backup system your setup 
> can work, depending on how you back up. While HDD pools have poor IOPS
> performance, they are very good for streaming workloads. If you are using 
> something like Borg backup that writes huge files sequentially, a HDD 
> back-end should be OK.
> 

OK, good to know.

> Here some things to consider and try out:
> 
> 1. You really need to get a bunch of enterprise SSDs with power loss 
> protection for the FS meta-data pool (disable write cache if enabled; this
> will disable the volatile write cache and switch to protected caching). We are
> using (formerly Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to 
> raise performance. Place the meta-data pool and the primary data pool on 
> these disks. Create a secondary data pool on the HDDs and assign it to the 
> root *before* creating anything on the FS (see the recommended 3-pool layout 
> for ceph file systems in the docs). I would not even consider running this 
> without SSDs. 1 such SSD per host is the minimum, 2 is better. If Borg or 
> whatever can make use of a small fast storage directory, assign a sub-dir of 
> the root to the primary data pool.

OK. I will see what I can do. 
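
If I understand the recommended 3-pool layout correctly, it would be
something like the sketch below (pool, FS and mount-point names are
placeholders I made up, and the PG counts are just examples):

    # CRUSH rule / EC profile pinning pools to device classes
    ceph osd crush rule create-replicated rule_ssd default host ssd
    ceph osd erasure-code-profile set ec42 k=4 m=2 \
        crush-failure-domain=host crush-device-class=hdd

    # Meta-data pool and primary data pool on the SSDs, secondary on HDD
    ceph osd pool create cephfs_meta 128 replicated rule_ssd
    ceph osd pool create cephfs_data_ssd 128 replicated rule_ssd
    ceph osd pool create cephfs_data_hdd 1024 erasure ec42
    ceph osd pool set cephfs_data_hdd allow_ec_overwrites true

    ceph fs new backupfs cephfs_meta cephfs_data_ssd
    ceph fs add_data_pool backupfs cephfs_data_hdd

    # After mounting: assign the HDD pool to the root *before* writing any
    # data, and optionally keep a fast sub-directory on the SSD pool
    setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mnt/backupfs
    mkdir /mnt/backupfs/fast
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/backupfs/fast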

> 
> 2. Plan for sufficient extra disk space. As long as utilization stays below
> 60-70%, bluestore will try to make large object writes sequential, which is
> really important for HDDs. On our cluster we currently have 40% utilization
> and I get full HDD bandwidth for large sequential reads/writes. Make sure
> your backup application makes large sequential IO requests.
> 
> 3. As Anthony said, add RAM. You should go for 512G on your 50-HDD nodes. You
> can run the MDS daemons on the OSD nodes. Set a reasonable cache limit and
> use ephemeral pinning. Depending on the CPUs you are using, 48 cores can be
> plenty. The latest generation of Intel Xeon Scalable processors is so
> efficient with ceph that 1 HT per HDD is more than enough.

Yes, I'll have 512G on each node and 64 cores on each server.
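
For my own notes, I think the memory settings from points 2 and 3 translate
to something like the following (the values are assumptions for a 512G node
with ~50 OSDs, not recommendations):

    # Assumed per-daemon memory budgets for a 512G node with ~50 HDD OSDs
    ceph config set osd osd_memory_target 8589934592          # ~8G per OSD
    ceph config set mds mds_cache_memory_limit 17179869184    # 16G MDS cache

    # Distributed ephemeral pinning spreads top-level dirs over the active
    # MDS daemons (needs max_mds > 1; backupfs and the mount path are the
    # placeholders from the sketch above)
    ceph fs set backupfs max_mds 2
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/backupfs

    # Watch utilization to stay below the 60-70% mark from point 2
    ceph df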

> 
> 4. 3 MON+MGR nodes are sufficient. You can do something else with the 
> remaining 2 nodes. Of course, you can use them as additional MON+MGR nodes. 
> We also use 5 and it improves maintainability a lot.
> 

OK, thanks.
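
Just so I have the commands handy: assuming a cephadm-managed cluster, going
from 3 to 5 MON+MGR daemons would be something like:

    ceph orch apply mon 5
    ceph orch apply mgr 5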

> Something more exotic if you have time:
> 
> 5. To improve sequential performance further, you can experiment with larger
> min_alloc_sizes for OSDs (set at creation time; you will need to scrap and
> re-deploy the cluster to test different values). Every HDD has a preferred
> IO size for which random IO achieves nearly the same bandwidth as sequential
> writes. (But see 7.)
> 
> 6. On your setup you will probably go for a 4+2 EC data pool on HDD. With
> object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is 
> the preferred IO size (usually between 256K-1M). (But see 7.)
> 
> 7. Important: large min_alloc_sizes are only good if your workload *never*
> modifies files, but only replaces them. A bit like a pool without EC
> overwrites enabled. The implementation of EC overwrites has a "feature" that
> can lead to massive allocation amplification. If your backup workload
> modifies files instead of adding new + deleting old, do *not* experiment
> with points 5-7. Instead, use the defaults and make sure you have sufficient
> unused capacity to increase the chances of large bluestore writes (keep
> utilization below 60-70% and just buy extra disks). A workload with large
> min_alloc_sizes has to be S3-like: only upload, download and delete are
> allowed.

Thanks a lot for those tips.
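
If I read points 5 and 6 correctly, the knob to experiment with is the
bluestore allocation size, set before the HDD OSDs are created (the 4+2 EC
profile is in the sketch further up; 1048576 is just the 1M example from
point 6):

    # Only takes effect for OSDs created afterwards; testing other values
    # means scrapping and re-deploying the OSDs
    ceph config set osd bluestore_min_alloc_size_hdd 1048576   # 1M
    ceph config get osd bluestore_min_alloc_size_hdd           # verify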

I'm a newbie with Ceph, so it's going to take some time before I understand
everything you say.


Best regards

-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
Thu 23 Nov 2023 08:32:20 CET
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
