>      I didn't say I would accept the risk of losing data.

That's implicit in what you suggest, though.

>      I just said that it would be interesting if the objects were first 
> recorded only in the primary OSD.

What happens when that host / drive smokes before it can replicate?  What 
happens if a secondary OSD gets a read op before the primary updates it?  Swift 
object storage users have to code around this potential.  It's a non-starter 
for block storage.

This is similar to why RoC HBAs (which are a badly outdated thing to begin 
with) will only enter writeback mode if they have a BBU / supercap -- and of 
course only if their firmware and hardware aren't pervasively buggy.  Guess how 
I know this?

>      This way it would greatly increase performance (both for iops and 
> throuput).

It might increase low-QD IOPS for a single client on slow media over certain 
networks.  Depending on the media, it wouldn't increase throughput at all.

Consider QEMU drive-mirror.  If you're doing RF=3 replication, you use 3x the 
network resources between the client and the servers.
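A rough sketch of the network arithmetic, with placeholder numbers and my own framing
of client-side mirroring vs. RADOS-style primary fan-out:

    RF = 3
    app_writes_gib = 100                        # data the application writes

    # Client-side replication (drive-mirror style): the client ships every copy.
    client_nic_mirror = RF * app_writes_gib             # 300 GiB over the client link

    # RADOS primary fan-out: the client ships one copy; the primary OSD ships
    # the other RF-1 copies over the replication network.
    client_nic_rados  = app_writes_gib                   # 100 GiB over the client link
    osd_fanout        = (RF - 1) * app_writes_gib        # 200 GiB between OSDs

    print(client_nic_mirror, client_nic_rados, osd_fanout)

Either way the bytes get moved RF times; the question is only whose NIC and which links
pay for it.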

>      Later (in the background), record the replicas. This situation would 
> avoid leaving users/software waiting for the recording response from all 
> replicas when the storage is overloaded.

If one makes the mistake of using HDDs, they're going to be overloaded no 
matter how one slices and dices the ops.  Ya just canna squeeze IOPS from a 
stone.  Throughput is going to be limited by the SATA interface and seeking no 
matter what.
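Back-of-envelope, using generic assumptions for a 7200 RPM SATA drive rather than any
particular model:

    seek_ms       = 8.0                       # assumed average seek time
    rotational_ms = 60_000 / 7200 / 2         # half a revolution, ~4.2 ms
    service_ms    = seek_ms + rotational_ms

    hdd_random_iops = 1000 / service_ms       # ~80 random IOPS per spindle
    sata_mb_s       = 550                     # rough ceiling of a 6 Gb/s SATA link

    print(f"one HDD: ~{hdd_random_iops:.0f} random IOPS, <= ~{sata_mb_s} MB/s sequential")
    # A single mainstream NVMe SSD is on the order of hundreds of thousands of
    # IOPS and several GB/s -- orders of magnitude, not percentages.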

>      Where I work, performance is very important and we don't have money to 
> make a entire cluster only with NVMe.

If there isn't money, then it isn't very important.  But as I've written 
before, NVMe clusters *do not cost appreciably more than spinners* unless your 
procurement processes are bad.  In fact they can cost significantly less.  This 
is especially true with object storage and archival where one can leverage QLC. 

* Buy generic drives from a VAR, not channel drives through a chassis brand.  
Far less markup, and you get the full 5-year warranty, not just 3 years.  You 
can also painlessly RMA drives yourself - you don't have to spend hours going 
back and forth with $chassisvendor's TAC arguing about every single RMA.  I've 
found that process so painful that it is more economical to just throw away a 
failed component worth < USD 500 than to RMA it.  Do you pay for extended 
warranty / support?  That's expensive too.

* Certain chassis brands that shall remain nameless push RoC HBAs hard, with 
extreme markups -- list prices as high as USD 2000.  Per server, eschewing those 
abominations makes up for a lot of the drive-only unit economics.

* But this is the part that lots of people don't get:  You don't just stack up 
the drives on a desk and use them.  They go into *servers* that cost money and 
*racks* that cost money.  They take *power* that costs money.

* $ / IOPS are FAR better for ANY SSD than for HDDs

* RUs cost money, so do chassis and switches

* Drive failures cost money

* So does having your people and applications twiddle their thumbs waiting for 
stuff to happen.  I worked for a supercomputer company that put low-memory, 
low-end diskless workstations on engineers' desks.  Those engineers spent lots 
of time doing nothing, waiting for their applications to respond.  This company 
no longer exists.

* So does the risk of taking *weeks* to heal from a drive failure

Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc

I walked through this with a certain global company.  QLC SSDs were 
demonstrated to have roughly 30% lower TCO than spinners.  Part of the equation 
is that they were accustomed to limiting HDD size to 8 TB because of the 
bottlenecks, which meant more servers, more switch ports, more DC racks, more 
rack/stack time, and more administrative overhead.  You can fit 1.9 PB of raw 
SSD capacity in a 1U server.  That same RU will hold at most 88 TB of the 
largest spinners you can get today.  22 TIMES the density.  And since many 
applications can barely tolerate spinner bottlenecks as it is, capping spinner 
size at even 10 TB makes that more like 40 TIMES better density with SSDs.
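The density arithmetic from the above, with my assumption that the capped case means
the same spindle count at 10 TB per drive:

    ssd_per_ru_tb = 1900    # 1.9 PB of raw flash in a 1U server
    hdd_per_ru_tb = 88      # the largest spinners that fit in the same RU
    hdd_capped_tb = 40      # same spindle count capped at 10 TB drives (my assumption)

    print(ssd_per_ru_tb / hdd_per_ru_tb)    # ~22x
    print(ssd_per_ru_tb / hdd_capped_tb)    # ~48x, i.e. "40 TIMES" in round numbers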


> However, I don't think it's interesting to lose the functionality of the 
> replicas.
>      I'm just suggesting another way to increase performance without losing 
> the functionality of replicas.
> 
> 
> Rafael.
>  
> 
> From: "Anthony D'Atri" <anthony.da...@gmail.com>
> Sent: 2024/01/31 17:04:08
> To: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Performance improvement suggestion
>  
> Would you be willing to accept the risk of data loss?
>  
>> 
>> On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
>>  
>> Hello everybody,
>>      I would like to make a suggestion for improving performance in Ceph 
>> architecture.
>>      I don't know if this group would be the best place or if my proposal is 
>> correct.
>> 
>>      My suggestion would be in the item 
>> https://docs.ceph.com/en/latest/architecture/, at the end of the topic 
>> "Smart Daemons Enable Hyperscale".
>> 
>>      The Client needs to "wait" for the configured amount of replicas to be 
>> written (so that the client receives an ok and continues). This way, if 
>> there is slowness on any of the disks on which the PG will be updated, the 
>> client is left waiting.
>>      
>>      It would be possible:
>>      
>>      1-) Only record on the primary OSD
>>      2-) Write other replicas in background (like the same way as when an 
>> OSD fails: "degraded" ).
>> 
>>      This way, client has a faster response when writing to storage: 
>> improving latency and performance (throughput and IOPS).
>>      
>>      I would find it plausible to accept a period of time (seconds) until 
>> all replicas are ok (written asynchronously) at the expense of improving 
>> performance.
>>      
>>      Could you evaluate this scenario?
>> 
>> 
>> Rafael.
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
