[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Mike O'Connor

On 14/1/2024 1:57 pm, Anthony D'Atri wrote:

The OP is asking about new servers, I think.
I was looking at his statement below relating to using hardware lying
around, just putting out there some options which worked for us.
  
So we were going to replace a Ceph cluster with some hardware we had
lying around using SATA HBAs, but I was told that the only right way to
build Ceph in 2023 is with direct attach NVMe.


Cheers

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Anthony D'Atri
The OP is asking about new servers, I think.

> On Jan 13, 2024, at 9:36 PM, Mike O'Connor  wrote:
> 
> Because it's almost impossible to purchase the equipment required to convert
> old drive bays to U.2, etc.
> 
> The M.2s we purchased are enterprise-class.
> 
> Mike
> 
> 
>> On 14/1/2024 12:53 pm, Anthony D'Atri wrote:
>> Why use such a card and M.2 drives that I suspect aren’t enterprise-class?
>> Instead of U.2, E1.S, or E3.S?
>> 
 On Jan 13, 2024, at 5:10 AM, Mike O'Connor  wrote:
>>> 
>>> On 13/1/2024 1:02 am, Drew Weaver wrote:
 Hello,
 
 So we were going to replace a Ceph cluster with some hardware we had
 lying around using SATA HBAs, but I was told that the only right way to
 build Ceph in 2023 is with direct attach NVMe.
 
 Does anyone have any recommendation for a 1U barebones server (we just
 drop in RAM, disks, and CPUs) with 8-10 2.5" NVMe bays that are direct
 attached to the motherboard without a bridge or HBA, for Ceph specifically?
 
 Thanks,
 -Drew
 
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io
>>> Hi
>>> 
>>> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are
>>> cheap enough, around USD $180 from AliExpress.
>>> 
>>> There are companies with cards which have many more M.2 ports, but the cost
>>> goes up greatly.
>>> 
>>> We just built a 3 x 1RU HP G9 cluster with 4 x 2TB M.2 NVMe, using dual 40G
>>> Ethernet ports and dual 10G Ethernet, and a second-hand Arista 16-port 40G
>>> switch.
>>> 
>>> It works really well.
>>> 
>>> Cheers
>>> 
>>> Mike
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Mike O'Connor
Because it's almost impossible to purchase the equipment required to
convert old drive bays to U.2, etc.


The M.2s we purchased are enterprise-class.

Mike


On 14/1/2024 12:53 pm, Anthony D'Atri wrote:

Why use such a card and M.2 drives that I suspect aren’t enterprise-class?
Instead of U.2, E1.S, or E3.S?


On Jan 13, 2024, at 5:10 AM, Mike O'Connor  wrote:

On 13/1/2024 1:02 am, Drew Weaver wrote:

Hello,

So we were going to replace a Ceph cluster with some hardware we had lying
around using SATA HBAs, but I was told that the only right way to build Ceph in
2023 is with direct attach NVMe.

Does anyone have any recommendation for a 1U barebones server (we just drop in
RAM, disks, and CPUs) with 8-10 2.5" NVMe bays that are direct attached to the
motherboard without a bridge or HBA, for Ceph specifically?

Thanks,
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

Hi

You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are cheap
enough, around USD $180 from AliExpress.

There are companies with cards which have many more M.2 ports, but the cost goes
up greatly.

We just built a 3 x 1RU HP G9 cluster with 4 x 2TB M.2 NVMe, using dual 40G
Ethernet ports and dual 10G Ethernet, and a second-hand Arista 16-port 40G switch.

It works really well.

Cheers

Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Anthony D'Atri
Why use such a card and M.2 drives that I suspect aren’t enterprise-class?
Instead of U.2, E1.S, or E3.S?

> On Jan 13, 2024, at 5:10 AM, Mike O'Connor  wrote:
> 
> On 13/1/2024 1:02 am, Drew Weaver wrote:
>> Hello,
>> 
>> So we were going to replace a Ceph cluster with some hardware we had lying
>> around using SATA HBAs, but I was told that the only right way to build Ceph
>> in 2023 is with direct attach NVMe.
>> 
>> Does anyone have any recommendation for a 1U barebones server (we just drop
>> in RAM, disks, and CPUs) with 8-10 2.5" NVMe bays that are direct attached to
>> the motherboard without a bridge or HBA, for Ceph specifically?
>> 
>> Thanks,
>> -Drew
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> Hi
> 
> You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe are
> cheap enough, around USD $180 from AliExpress.
> 
> There are companies with cards which have many more M.2 ports, but the cost
> goes up greatly.
> 
> We just built a 3 x 1RU HP G9 cluster with 4 x 2TB M.2 NVMe, using dual 40G
> Ethernet ports and dual 10G Ethernet, and a second-hand Arista 16-port 40G
> switch.
> 
> It works really well.
> 
> Cheers
> 
> Mike
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recomand number of k and m erasure code

2024-01-13 Thread Anthony D'Atri
There are nuances, but in general the higher the sum of m+k, the lower the 
performance, because *every* operation has to hit that many drives, which is 
especially impactful with HDDs.  So there’s a tradeoff between storage 
efficiency and performance.  And as you’ve seen, larger parity groups 
especially mean slower recovery/backfill.

There’s also a modest benefit to choosing values of m and k that have small 
prime factors, but I wouldn’t worry too much about that.


You can find EC efficiency tables on the net:


https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html


I should really add a table to the docs, making a note to do that.

There’s a nice calculator at the OSNEXUS site:

https://www.osnexus.com/ceph-designer


The overhead factor is (k+m) / k 

So for a 4,2 profile, that’s 6 / 4 == 1.5

For 6,2, 8 / 6 = 1.33

For 10,2, 12 / 10 = 1.2

and so forth.  As k increases, the incremental efficiency gain sees diminishing 
returns, but performance continues to decrease.
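
If it helps to see that arithmetic in one place, here's a quick back-of-the-envelope
sketch (plain Python, nothing Ceph-specific; the list of profiles is just an example):

# ec_overhead.py -- rough EC overhead comparison
profiles = [(4, 2), (6, 2), (8, 2), (10, 2), (4, 6)]   # (k, m) examples only

print(f"{'profile':>8} {'overhead':>9} {'usable':>7} {'vs 3x repl':>11}")
for k, m in profiles:
    overhead = (k + m) / k      # raw bytes written per byte of user data
    usable = k / (k + m)        # fraction of raw capacity holding user data
    vs_repl = 3.0 / overhead    # usable capacity relative to 3x replication
    print(f"{k:>4},{m:<3} {overhead:>9.2f} {usable:>7.0%} {vs_repl:>10.1f}x")

For 4,2 that prints an overhead of 1.50, i.e. roughly twice the usable capacity of
3x replication.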

Think of m as the number of shards (OSDs) you can lose without losing data, and m-1 as
the number you can lose / have down and still have data *available*.

I also suggest that the number of failure domains — in your case this means
OSD nodes — be *at least* k+m+1, so with 10 nodes you want k+m to be at most 9.

With RBD and many CephFS implementations, we mostly have relatively large RADOS 
objects that are striped over many OSDs.

When using RGW especially, one should attend to average and median S3 object 
size.  There’s an analysis of the potential for space amplification in the docs 
so I won’t repeat it here in detail. This sheet 
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
 visually demonstrates this.

Basically, for an RGW bucket pool — or for a CephFS data pool storing unusually 
small objects — if you have a lot of S3 objects that are only a few KB in size,
you waste a significant fraction of the underlying storage.  This is exacerbated by
EC, and the larger the sum of k+m, the more waste.
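
To make that concrete, here is a deliberately simplified model (my own sketch, not
the exact analysis in the docs: it assumes a 4 KiB BlueStore min_alloc_size and
ignores stripe_unit, object metadata, and compression) of how per-chunk allocation
rounding amplifies small objects:

# ec_small_object_waste.py -- simplified EC space amplification model
import math

def raw_bytes(obj_size, k, m, min_alloc=4096):
    # Split the object into k data chunks plus m coding chunks, then round
    # every chunk up to the allocation unit.
    chunk = math.ceil(obj_size / k)
    per_shard = math.ceil(chunk / min_alloc) * min_alloc
    return per_shard * (k + m)

for size in (4 * 1024, 16 * 1024, 64 * 1024, 4 * 1024 * 1024):
    for k, m in ((4, 2), (10, 2)):
        raw = raw_bytes(size, k, m)
        print(f"{size // 1024:>5} KiB object, EC {k}+{m}: {raw // 1024:>6} KiB raw "
              f"({raw / size:.1f}x vs the nominal {(k + m) / k:.2f}x)")

Under those assumptions a 4 KiB object on a 4,2 pool consumes about 24 KiB of raw
space (6x rather than the nominal 1.5x), and a wider profile like 10,2 makes it
worse, not better.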

When people ask me about replication vs EC and EC profile, the first question I 
ask is what they’re storing.  When EC isn’t a non-starter, I tend to recommend 
4,2 as a profile until / unless someone has specific needs and can understand 
the tradeoffs. This lets you store ~2x the data of 3x replication while not
going overboard on the performance hit.

If you care about your data, do not set m=1.

If you need to survive the loss of many drives, say if your cluster is across
multiple buildings or sites, choose a larger value of m.  There are people
running profiles like 4,6 because they have unusual and specific needs.




> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh  wrote:
> 
> Hi ceph user!
> 
> I need to determine which erasure code values (k and m) to choose for a
> Ceph cluster with 10 nodes.
> 
> I am using the Reef release with RBD. Furthermore, when using a larger k
> (for example, EC 6+2 versus EC 4+2), which erasure coding profile performs
> better, and what are the criteria for choosing an appropriate one?
> Please help me
> 
> Email: tranphong...@gmail.com
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recomand number of k and m erasure code

2024-01-13 Thread Phong Tran Thanh
Hi ceph user!

I need to determine which erasure code values (k and m) to choose for a
Ceph cluster with 10 nodes.

I am using the Reef release with RBD. Furthermore, when using a larger k
(for example, EC 6+2 versus EC 4+2), which erasure coding profile performs
better, and what are the criteria for choosing an appropriate one?
Please help me

Email: tranphong...@gmail.com
Skype: tranphong079
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-13 Thread Konstantin Shalygin
Hi,

> On Jan 12, 2024, at 12:01, Frédéric Nass  
> wrote:
> 
> Hard to tell for sure since this bug hit different major versions of the 
> kernel, at least RHEL's from what I know. 

In which RH kernel release was this issue fixed?


Thanks,
k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-13 Thread Mike O'Connor

On 13/1/2024 1:02 am, Drew Weaver wrote:

Hello,

So we were going to replace a Ceph cluster with some hardware we had lying
around using SATA HBAs, but I was told that the only right way to build Ceph in
2023 is with direct attach NVMe.

Does anyone have any recommendation for a 1U barebones server (we just drop in
RAM, disks, and CPUs) with 8-10 2.5" NVMe bays that are direct attached to the
motherboard without a bridge or HBA, for Ceph specifically?

Thanks,
-Drew

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Hi

You need to use a PCIe card with a PCIe switch; cards with 4 x M.2 NVMe
are cheap enough, around USD $180 from AliExpress.


There are companies with cards which have many more M.2 ports, but the
cost goes up greatly.


We just built a 3 x 1RU HP G9 cluster with 4 x 2TB M.2 NVMe, using dual 40G
Ethernet ports and dual 10G Ethernet, and a second-hand Arista 16-port 40G
switch.


It works really well.
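
As an aside, if you want to check whether a given NVMe controller really is
direct-attached or sits behind a bridge/switch (as it will with these M.2 carrier
cards), a short sysfs walk is enough. This is just a sketch for a Linux host, not
anything Ceph-specific, and the "direct" test is a rough heuristic:

# nvme_topology.py -- show the PCIe chain for each NVMe controller
import glob
import os
import re

BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]$")  # PCI addresses

for ctrl in sorted(glob.glob("/sys/class/nvme/nvme[0-9]*")):
    pci_path = os.path.realpath(os.path.join(ctrl, "device"))
    # Path components that look like PCI addresses form the chain from the
    # root complex down to the NVMe endpoint.
    chain = [p for p in pci_path.split("/") if BDF.match(p)]
    try:
        model = open(os.path.join(ctrl, "model")).read().strip()
    except OSError:
        model = "unknown"
    attach = ("direct to a root port" if len(chain) <= 2
              else f"behind {len(chain) - 2} bridge(s)/switch(es)")
    print(f"{os.path.basename(ctrl)}: {model}: {' -> '.join(chain)} ({attach})")

lspci -tv shows the same tree interactively.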

Cheers

Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io