[ceph-users] Re: How to use hardware

2023-11-22 Thread Albert Shih
On 20/11/2023 at 09:24:41+, Frank Schilder wrote:
Hi, 

Thanks, everyone, for your answers. 

> 
> we are using something similar for ceph-fs. For a backup system your setup 
> can work, depending on how you back up. While HDD pools have poor IOP/s 
> performance, they are very good for streaming workloads. If you are using 
> something like Borg backup that writes huge files sequentially, a HDD 
> back-end should be OK.
> 

Ok. Good to know

> Here are some things to consider and try out:
> 
> 1. You really need to get a bunch of enterprise SSDs with power loss 
> protection for the FS meta data pool (disable write cache if enabled, this 
> will disable volatile write cache and switch to protected caching). We are 
> using (formerly Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to 
> raise performance. Place the meta-data pool and the primary data pool on 
> these disks. Create a secondary data pool on the HDDs and assign it to the 
> root *before* creating anything on the FS (see the recommended 3-pool layout 
> for ceph file systems in the docs). I would not even consider running this 
> without SSDs. 1 such SSD per host is the minimum, 2 is better. If Borg or 
> whatever can make use of a small fast storage directory, assign a sub-dir of 
> the root to the primary data pool.

OK. I will see what I can do. 

> 
> 2. Calculate with sufficient extra disk space. As long as utilization stays 
> below 60-70% bluestore will try to make large object writes sequential, which 
> is really important for HDDs. On our cluster we currently have 40% 
> utilization and I get full HDD bandwidth out for large sequential 
> reads/writes. Make sure your backup application makes large sequential IO 
> requests.
> 
> 3. As Anthony said, add RAM. You should go for 512G on 50 HDD-nodes. You can 
> run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use 
> ephemeral pinning. Depending on the CPUs you are using, 48 cores can be 
> plenty. The latest generation of Intel Xeon Scalable Processors is so efficient 
> with ceph that 1HT per HDD is more than enough.

Yes, I will get 512 GB on each node and 64 cores on each server.

> 
> 4. 3 MON+MGR nodes are sufficient. You can do something else with the 
> remaining 2 nodes. Of course, you can use them as additional MON+MGR nodes. 
> We also use 5 and it improves maintainability a lot.
> 

Ok thanks. 

> Something more exotic if you have time:
> 
> 5. To improve sequential performance further, you can experiment with larger 
> min_alloc_sizes for OSDs (at creation time, you will need to scrap and 
> re-deploy the cluster to test different values). Every HDD has a preferred 
> IO-size for which random IO achieves nearly the same bandwidth as sequential 
> writes. (But see 7.)
> 
> 6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With 
> object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is 
> the preferred IO size (usually between 256K-1M). (But see 7.)
> 
> 7. Important: large min_alloc_sizes are only good if your workload *never* 
> modifies files, but only replaces them. A bit like a pool without EC 
> overwrite enabled. The implementation of EC overwrites has a "feature" that 
> can lead to massive allocation amplification. If your backup workload does 
> modifications to files instead of adding new+deleting old, do *not* 
> experiment with options 5.-7. Instead, use the default and make sure you have 
> sufficient unused capacity to increase the chances for large bluestore writes 
> (keep utilization below 60-70% and just buy extra disks). A workload with 
> large min_alloc_sizes has to be S3-like, only upload, download and delete are 
> allowed.

Thanks a lot for those tips. 

I'm a newbie with ceph, so it's going to take some time before I understand
everything you say. 


Best regards

-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
jeu. 23 nov. 2023 08:32:20 CET


[ceph-users] Re: How to use hardware

2023-11-20 Thread Frank Schilder
Hi Simon,

we are using something similar for ceph-fs. For a backup system your setup can 
work, depending on how you back up. While HDD pools have poor IOP/s 
performance, they are very good for streaming workloads. If you are using 
something like Borg backup that writes huge files sequentially, a HDD back-end 
should be OK.

Here are some things to consider and try out:

1. You really need to get a bunch of enterprise SSDs with power loss protection 
for the FS meta data pool (disable write cache if enabled, this will disable 
volatile write cache and switch to protected caching). We are using (formerly 
Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to raise 
performance. Place the meta-data pool and the primary data pool on these disks. 
Create a secondary data pool on the HDDs and assign it to the root *before* 
creating anything on the FS (see the recommended 3-pool layout for ceph file 
systems in the docs). I would not even consider running this without SSDs. 1 
such SSD per host is the minimum, 2 is better. If Borg or whatever can make use 
of a small fast storage directory, assign a sub-dir of the root to the primary 
data pool.
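
As an illustration only, a minimal sketch of that 3-pool layout (the pool names, PG 
counts, FS name and device path below are placeholders, not recommendations):

    # split each SSD into 4 OSDs
    ceph-volume lvm batch --osds-per-device 4 /dev/sdX

    # replicated SSD pools for metadata and the primary data pool
    ceph osd crush rule create-replicated rule-ssd default host ssd
    ceph osd pool create cephfs.meta 64 64 replicated rule-ssd
    ceph osd pool create cephfs.data.ssd 64 64 replicated rule-ssd
    ceph fs new backupfs cephfs.meta cephfs.data.ssd

    # bulk data pool on HDD (its creation is sketched under point 6), attached to
    # the FS and assigned to the root before anything is written
    ceph fs add_data_pool backupfs cephfs.data.hdd
    setfattr -n ceph.dir.layout.pool -v cephfs.data.hdd /mnt/backupfs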

2. Calculate with sufficient extra disk space. As long as utilization stays 
below 60-70% bluestore will try to make large object writes sequential, which 
is really important for HDDs. On our cluster we currently have 40% utilization 
and I get full HDD bandwidth out for large sequential reads/writes. Make sure 
your backup application makes large sequential IO requests.
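
A couple of commands that can help keep an eye on this (the 0.70 below is only an 
example chosen to match the 60-70% guidance, not a general recommendation):

    ceph df detail                     # per-pool and raw utilization
    ceph osd df tree                   # per-OSD fill level and variance
    ceph osd set-nearfull-ratio 0.70   # warn earlier than the 0.85 default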

3. As Anthony said, add RAM. You should go for 512G on 50 HDD-nodes. You can 
run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use 
ephemeral pinning. Depending on the CPUs you are using, 48 cores can be plenty. 
The latest generation of Intel Xeon Scalable Processors is so efficient with ceph 
that 1HT per HDD is more than enough.
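
If it helps, the two knobs mentioned here look roughly like this (the cache size and 
directory are placeholders to adapt to your RAM budget and tree layout):

    # MDS cache limit, e.g. 16 GiB
    ceph config set mds mds_cache_memory_limit 17179869184

    # distributed ephemeral pinning on the directory that holds the per-client trees
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/backupfs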

4. 3 MON+MGR nodes are sufficient. You can do something else with the remaining 
2 nodes. Of course, you can use them as additional MON+MGR nodes. We also use 5 
and it improves maintainability a lot.
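
With cephadm, pinning the MON/MGR daemons to the chosen hosts can be done with 
labels, e.g. (host and label names are placeholders):

    ceph orch host label add server1 mon
    ceph orch host label add server2 mon
    ceph orch host label add server3 mon
    ceph orch apply mon --placement="label:mon"
    ceph orch apply mgr --placement="label:mon"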

Something more exotic if you have time:

5. To improve sequential performance further, you can experiment with larger 
min_alloc_sizes for OSDs (at creation time, you will need to scrap and 
re-deploy the cluster to test different values). Every HDD has a preferred 
IO-size for which random IO achieves nearly the same bandwidth as sequential 
writes. (But see 7.)
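
For reference, the knob in question is bluestore_min_alloc_size_hdd; it only takes 
effect for OSDs created after it is set, which is why re-deployment is needed to 
test different values (the 64K below is just an example):

    ceph config set osd bluestore_min_alloc_size_hdd 65536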

6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With 
object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is 
the preferred IO size (usually between 256K-1M). (But see 7.)
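
A sketch of such a pool, with the chunk-size arithmetic as a comment (profile and 
pool names, PG counts and the failure domain are placeholders):

    # chunk per OSD = object size / k = 4 MiB / 4 = 1 MiB
    ceph osd erasure-code-profile set ec42-hdd k=4 m=2 \
        crush-failure-domain=host crush-device-class=hdd
    ceph osd pool create cephfs.data.hdd 1024 1024 erasure ec42-hdd
    # required for use as a CephFS data pool -- but note the caveat in point 7
    ceph osd pool set cephfs.data.hdd allow_ec_overwrites true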

7. Important: large min_alloc_sizes are only good if your workload *never* 
modifies files, but only replaces them. A bit like a pool without EC overwrite 
enabled. The implementation of EC overwrites has a "feature" that can lead to 
massive allocation amplification. If your backup workload does modifications to 
files instead of adding new+deleting old, do *not* experiment with options 
5.-7. Instead, use the default and make sure you have sufficient unused 
capacity to increase the chances for large bluestore writes (keep utilization 
below 60-70% and just buy extra disks). A workload with large min_alloc_sizes 
has to be S3-like, only upload, download and delete are allowed.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Saturday, November 18, 2023 3:24 PM
To: Simon Kepp
Cc: Albert Shih; ceph-users@ceph.io
Subject: [ceph-users] Re: How to use hardware

Common motivations for this strategy include the lure of unit economics and RUs.

Often ultra dense servers can’t fill racks anyway due to power and weight 
limits.

Here the osd_memory_target would have to be severely reduced to avoid 
oomkilling.  Assuming the OSDs are top load LFF HDDs with expanders, the HBA 
will be a bottleneck as well.  I’ve suffered similar systems for RGW.  All the 
clever juggling in the world could not override the math, and the solution was 
QLC.

“We can lose 4 servers”

Do you realize that your data would then be unavailable?  When you lose even 
one, you will not be able to restore redundancy and your OSDs likely will 
oomkill.

If you’re running CephFS, how are you provisioning fast OSDs for the metadata 
pool?  Are the CPUs high-clock for MDS responsiveness?

Even given the caveats this seems like a recipe for at best disappointment.

At the very least add RAM.  8GB per OSD plus ample for other daemons.  Better 
would be 3x normal additional hosts for the others.

> On Nov 17, 2023, at 8:33 PM, Simon Kepp  wrote:
>
> I know that your question is regarding the service servers, but may I ask,
> why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6)
> (= 50 OSDs per node)?
> This is possible to do, but sounds like the nodes were desi

[ceph-users] Re: How to use hardware

2023-11-18 Thread Anthony D'Atri
Common motivations for this strategy include the lure of unit economics and 
RUs. 

Often ultra dense servers can’t fill racks anyway due to power and weight 
limits. 

Here the osd_memory_target would have to be severely reduced to avoid 
oomkilling.  Assuming the OSDs are top load LFF HDDs with expanders, the HBA 
will be a bottleneck as well.  I’ve suffered similar systems for RGW.  All the 
clever juggling in the world could not override the math, and the solution was 
QLC. 
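
To put rough numbers on that: osd_memory_target defaults to 4 GiB, so 50 OSDs on a 
256 GB node would already want ~200 GiB before MON/MDS/MGR daemons and recovery 
overhead. Lowering it is a one-liner, at the cost of smaller BlueStore caches (the 
value below is only an example):

    ceph config set osd osd_memory_target 3221225472   # 3 GiB per OSD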

“We can lose 4 servers”

Do you realize that your data would then be unavailable?  When you lose even 
one, you will not be able to restore redundancy and your OSDs likely will 
oomkill.  
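
If the "lose 4 servers" refers to the five MON/MDS hosts, the limiting factor is MON 
quorum: 5 MONs need floor(5/2)+1 = 3 of them up, so losing 4 of those hosts stops 
the cluster no matter how the MDS daemons are placed. The current quorum can be 
checked with:

    ceph quorum_status --format json-pretty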

If you’re running CephFS, how are you provisioning fast OSDs for the metadata 
pool?  Are the CPUs high-clock for MDS responsiveness? 

Even given the caveats this seems like a recipe for at best disappointment.  

At the very least add RAM.  8GB per OSD plus ample for other daemons.  Better 
would be 3x normal additional hosts for the others. 

> On Nov 17, 2023, at 8:33 PM, Simon Kepp  wrote:
> 
> I know that your question is regarding the service servers, but may I ask,
> why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6)
> (= 50 OSDs per node)?
> This is possible to do, but sounds like the nodes were designed for
> scale-up rather than a scale-out architecture like ceph. Going with such
> "fat nodes" is doable, but will significantly limit performance,
> reliability and availability, compared to distributing the same OSDs
> on more, thinner nodes.
> 
> Best regards,
> Simon Kepp
> 
> Founder/CEO
> Kepp Technologies
> 
>> On Fri, Nov 17, 2023 at 10:59 AM Albert Shih  wrote:
>> 
>> Hi everyone,
>> 
>> With the aim of deploying a medium-size ceph cluster (300 OSDs), we have 6
>> bare-metal servers for the OSDs and 5 bare-metal servers for the services
>> (MDS, MON, etc.)
>> 
>> Those 5 bare-metal servers each have 48 cores and 256 GB.
>> 
>> What would be the smartest way to use those 5 servers? I see two ways:
>> 
>>  first :
>> 
>>Server 1 : MDS,MON, grafana, prometheus, webui
>>Server 2:  MON
>>Server 3:  MON
>>Server 4 : MDS
>>Server 5 : MDS
>> 
>>  so 3 MDS, 3 MON, and we can lose 2 servers.
>> 
>>  Second
>> 
>>KVM on each server
>>  Server 1 : 3 VM : one for grafana & co., and 1 MDS, 2 MON
>>  other servers : 1 MDS, 1 MON
>> 
>>  in total : 5 MDS, 5 MON, and we can lose 4 servers.
>> 
>> So on paper the second seems smarter, but it's also more complex,
>> so my question is: «is it worth the complexity to have 5 MDS/MON for 300
>> OSDs?»
>> 
>> Important : The main goal of this ceph cluster is not to get the maximum
>> I/O speed; I would not say speed is not a factor, but it's not the main
>> point.
>> 
>> Regards.
>> 
>> 
>> --
>> Albert SHIH 🦫 🐸
>> Observatoire de Paris
>> France
>> Heure locale/Local time:
>> ven. 17 nov. 2023 10:49:27 CET


[ceph-users] Re: How to use hardware

2023-11-18 Thread David C.
Hello Albert,

5 vs 3 MON => you won't notice any difference
5 vs 3 MGR => by default, only 1 will be active


On Sat, Nov 18, 2023 at 09:28, Albert Shih wrote:

> On 17/11/2023 at 11:23:49+0100, David C. wrote:
>
> Hi,
>
> >
> > 5 instead of 3 mon will allow you to limit the impact if you break a mon
> (for
> > example, with the file system full)
> >
> > 5 instead of 3 MDS, this makes sense if the workload can be distributed
> over
> > several trees in your file system. Sometimes it can also make sense to
> have
> > several FSs in order to limit the consequences of an infrastructure with
> > several active MDSs.
>
> So no disadvantage to having 5 instead of 3?
>
> > Concerning performance, if you see a node that is too busy which impacts
> the
> > cluster, you can always think about relocating certain services.
>
> Ok, thanks for the answer.
>
> Regards.
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> sam. 18 nov. 2023 09:26:56 CET
>


[ceph-users] Re: How to use hardware

2023-11-18 Thread Albert Shih
On 17/11/2023 at 11:23:49+0100, David C. wrote:

Hi, 

> 
> 5 instead of 3 mon will allow you to limit the impact if you break a mon (for
> example, with the file system full)
> 
> 5 instead of 3 MDS, this makes sense if the workload can be distributed over
> several trees in your file system. Sometimes it can also make sense to have
> several FSs in order to limit the consequences of an infrastructure with
> several active MDSs.

So no disadvantage to having 5 instead of 3? 
 
> Concerning performance, if you see a node that is too busy which impacts the
> cluster, you can always think about relocating certain services.

Ok, thanks for the answer. 

Regards.

-- 
Albert SHIH 🦫 🐸
Observatoire de Paris
France
Heure locale/Local time:
sam. 18 nov. 2023 09:26:56 CET


[ceph-users] Re: How to use hardware

2023-11-18 Thread Albert Shih
On 18/11/2023 at 02:31:22+0100, Simon Kepp wrote:
Hi, 

> I know that your question is regarding the service servers, but may I ask, why
> you are planning to place so many OSDs ( 300) on so few OSD hosts( 6) (= 50
> OSDs per node)?

> This is possible to do, but sounds like the nodes were designed for scale-up
> rather than a scale-out architecture like ceph. Going with such "fat nodes" is
> doable, but will significantly limit performance, reliability and 
> availability,
> compared to distributing the same OSDs on more, thinner nodes.

We will use the cluster only to back up a large amount of data; we would love
to have much more performance, but in that case the price would rise way
above the budget.

And the reason we chose ceph rather than a more classic system is the
migration of the data when the hardware reaches end of life. The plan is to
use the capability of ceph to migrate the data by itself from old to new
hardware.

So, short answer: not enough money ;-) ;-)

Regards. 

-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
sam. 18 nov. 2023 09:19:03 CET


[ceph-users] Re: How to use hardware

2023-11-17 Thread Simon Kepp
I know that your question is regarding the service servers, but may I ask,
why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6)
(= 50 OSDs per node)?
This is possible to do, but sounds like the nodes were designed for
scale-up rather than a scale-out architecture like ceph. Going with such
"fat nodes" is doable, but will significantly limit performance,
reliability and availability, compared to distributing the same OSDs
on more, thinner nodes.

Best regards,
Simon Kepp

Founder/CEO
Kepp Technologies

On Fri, Nov 17, 2023 at 10:59 AM Albert Shih  wrote:

> Hi everyone,
>
> With the aim of deploying a medium-size ceph cluster (300 OSDs), we have 6
> bare-metal servers for the OSDs and 5 bare-metal servers for the services
> (MDS, MON, etc.)
>
> Those 5 bare-metal servers each have 48 cores and 256 GB.
>
> What would be the smartest way to use those 5 servers? I see two ways:
>
>   first :
>
> Server 1 : MDS,MON, grafana, prometheus, webui
> Server 2:  MON
> Server 3:  MON
> Server 4 : MDS
> Server 5 : MDS
>
>   so 3 MDS, 3 MON, and we can lose 2 servers.
>
>   Second
>
> KVM on each server
>   Server 1 : 3 VM : one for grafana & co., and 1 MDS, 2 MON
>   other servers : 1 MDS, 1 MON
>
>   in total : 5 MDS, 5 MON, and we can lose 4 servers.
>
> So on paper the second seems smarter, but it's also more complex,
> so my question is: «is it worth the complexity to have 5 MDS/MON for 300
> OSDs?»
>
> Important : The main goal of this ceph cluster is not to get the maximum
> I/O speed; I would not say speed is not a factor, but it's not the main
> point.
>
> Regards.
>
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> ven. 17 nov. 2023 10:49:27 CET


[ceph-users] Re: How to use hardware

2023-11-17 Thread David C.
Hi Albert ,

5 instead of 3 mon will allow you to limit the impact if you break a mon
(for example, with the file system full)

5 instead of 3 MDS, this makes sense if the workload can be distributed
over several trees in your file system. Sometimes it can also make sense to
have several FSs in order to limit the consequences of an infrastructure
with several active MDSs.

Concerning performance, if you see a node that is too busy which impacts
the cluster, you can always think about relocating certain services.
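
A minimal sketch of spreading load over several trees with two active MDS ranks (FS 
and directory names are placeholders):

    ceph fs set cephfs max_mds 2
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/tree-a   # pin subtree to rank 0
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/tree-b   # pin subtree to rank 1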



On Fri, Nov 17, 2023 at 11:00, Albert Shih wrote:

> Hi everyone,
>
> With the aim of deploying a medium-size ceph cluster (300 OSDs), we have 6
> bare-metal servers for the OSDs and 5 bare-metal servers for the services
> (MDS, MON, etc.)
>
> Those 5 bare-metal servers each have 48 cores and 256 GB.
>
> What would be the smartest way to use those 5 servers? I see two ways:
>
>   first :
>
> Server 1 : MDS,MON, grafana, prometheus, webui
> Server 2:  MON
> Server 3:  MON
> Server 4 : MDS
> Server 5 : MDS
>
>   so 3 MDS, 3 MON, and we can lose 2 servers.
>
>   Second
>
> KVM on each server
>   Server 1 : 3 VM : one for grafana & co., and 1 MDS, 2 MON
>   other servers : 1 MDS, 1 MON
>
>   in total : 5 MDS, 5 MON, and we can lose 4 servers.
>
> So on paper the second seems smarter, but it's also more complex,
> so my question is: «is it worth the complexity to have 5 MDS/MON for 300
> OSDs?»
>
> Important : The main goal of this ceph cluster is not to get the maximum
> I/O speed; I would not say speed is not a factor, but it's not the main
> point.
>
> Regards.
>
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> ven. 17 nov. 2023 10:49:27 CET