> Hi Anthony,
>      Did you decide that it's not a feature to be implemented?

That isn't up to me.

>      I'm asking about this so I can offer options here.
> 
>      I wouldn't be comfortable enabling "mon_allow_pool_size_one" for a 
> specific pool.
> 
> It would be better if this feature could create the replica at a later time 
> on a selected pool.
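
For reference, the knobs that exist today pair a cluster-wide monitor setting
with a per-pool replica count; a rough sketch of how they are driven (the pool
name is a placeholder, and the exact invocation may vary by Ceph release):

    # Rough sketch of today's knobs; exact syntax may vary by Ceph release.
    # mon_allow_pool_size_one is a monitor-wide setting, not per-pool;
    # only the replica count itself ("size") is applied per pool.
    import subprocess

    subprocess.run(["ceph", "config", "set", "global",
                    "mon_allow_pool_size_one", "true"], check=True)
    subprocess.run(["ceph", "osd", "pool", "set", "somepool",   # "somepool" is a placeholder
                    "size", "1", "--yes-i-really-mean-it"], check=True)
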
> Thanks.
> Rafael.
> 
>  
> 
> De: "Anthony D'Atri" <anthony.da...@gmail.com>
> Enviada: 2024/02/01 15:00:59
> Para: quag...@bol.com.br
> Cc: ceph-users@ceph.io
> Assunto: [ceph-users] Re: Performance improvement suggestion
>  
> I'd totally defer to the RADOS folks.
> 
> One issue might be adding a separate code path, which can have all sorts of 
> problems.
> 
> > On Feb 1, 2024, at 12:53, quag...@bol.com.br wrote:
> >
> >
> >
> > Ok Anthony,
> >
> > I understood what you said, and I trust the professional history and 
> > experience you have.
> >
> > Anyway, could there be a configuration flag to make this happen?
> >
> > Much like the flags that already exist, such as "--yes-i-really-mean-it".
> >
> > This way, the default storage behavior would remain as it is. However, it 
> > would make situations like the one I mentioned possible.
> >
> > This would permit some rules to be relaxed (even if they are not OK by 
> > default).
> > Likewise, there are already mechanisms like lazyio that make exceptions to 
> > the standard behavior.
> > To be clear: it's just a suggestion.
> > If this type of functionality is not of interest, that's OK.
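
Purely to make the request concrete, and with the heavy caveat that no such
option exists in Ceph today, the kind of per-pool opt-in being asked for might
look something like this (the option name is invented):

    # HYPOTHETICAL: the "async_replication" pool option below does NOT exist in Ceph.
    # This only sketches the shape of the per-pool opt-in being suggested, by
    # analogy with existing safety valves such as --yes-i-really-mean-it.
    cmd = ["ceph", "osd", "pool", "set", "mypool",
           "async_replication", "true",          # invented option name
           "--yes-i-really-mean-it"]             # existing-style safety flag
    print(" ".join(cmd))                         # printed only, not executed
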
> >
> >
> >
> > Rafael.
> >
> >
> > De: "Anthony D'Atri" <anthony.da...@gmail.com>
> > Enviada: 2024/02/01 12:10:30
> > Para: quag...@bol.com.br
> > Cc: ceph-users@ceph.io
> > Assunto: [ceph-users] Re: Performance improvement suggestion
> >
> >
> >
> > > I didn't say I would accept the risk of losing data.
> >
> > That's implicit in what you suggest, though.
> >
> > > I just said that it would be interesting if the objects were first 
> > > recorded only in the primary OSD.
> >
> > What happens when that host / drive smokes before it can replicate? What 
> > happens if a secondary OSD gets a read op before the primary updates it? 
> > Swift object storage users have to code around this potential. It's a 
> > non-starter for block storage.
> >
> > This is similar to why RoC HBAs (which are a badly outdated thing to begin 
> > with) will only enter writeback mode if they have a BBU / supercap -- and of 
> > course only if their firmware and hardware aren't pervasively buggy. Guess 
> > how I know this?
> >
> > > This way it would greatly increase performance (both for IOPS and 
> > > throughput).
> >
> > It might increase low-QD IOPS for a single client on slow media with 
> > certain networking. Depending on media, it wouldn't increase throughput.
> >
> > Consider QEMU drive-mirror. If you're doing RF=3 replication, you use 3x 
> > the network resources between the client and the servers.
> >
> > > Later, in the background, write the replicas. This would avoid leaving 
> > > users/software waiting for the write acknowledgement from all replicas 
> > > when the storage is overloaded.
> >
> > If one makes the mistake of using HDDs, they're going to be overloaded no 
> > matter how one slices and dices the ops. Ya just canna squeeze IOPS from a 
> > stone. Throughput is going to be limited by the SATA interface and seeking 
> > no matter what.
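
For a sense of scale, the ceilings involved (rough, commonly cited figures,
not measurements from any particular cluster):

    # Back-of-the-envelope ceilings for a single SATA HDD; illustrative only.
    hdd_random_iops = 150      # roughly what a 7200 rpm spindle sustains for random I/O
    sata3_gbps = 6             # SATA III line rate
    # 8b/10b encoding: 6 Gb/s on the wire is ~600 MB/s of payload, a bit less in practice
    sata3_mb_s = sata3_gbps * 1e9 * (8 / 10) / 8 / 1e6
    print(f"~{hdd_random_iops} random IOPS per spindle, SATA ceiling ~{sata3_mb_s:.0f} MB/s")
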
> >
> > > Where I work, performance is very important and we don't have the money 
> > > to build an entire cluster only with NVMe.
> >
> > If there isn't money, then it isn't very important. But as I've written 
> > before, NVMe clusters *do not cost appreciably more than spinners* unless 
> > your procurement processes are bad. In fact they can cost significantly 
> > less. This is especially true with object storage and archival where one 
> > can leverage QLC.
> >
> > * Buy generic drives from a VAR, not channel drives through a chassis 
> > brand. Far less markup, and moreover you get the full 5 year warranty, not 
> > just 3 years. And you can painlessly RMA drives yourself - you don't have 
> > to spend hours going back and forth with $chassisvendor's TAC arguing about 
> > every single RMA. I've found that this is so bad that it is more economical 
> > to just throw away a failed component worth < USD 500 than to RMA it. Do 
> > you pay for extended warranty / support? That's expensive too.
> >
> > * Certain chassis brands who shall remain nameless push RoC HBAs hard with 
> > extreme markups, with list prices as high as USD 2000. Per server, eschewing 
> > those abominations makes up for a lot of the drive-only unit economics.
> >
> > * But this is the part that lots of people don't get: You don't just stack 
> > up the drives on a desk and use them. They go into *servers* that cost 
> > money and *racks* that cost money. They take *power* that costs money.
> >
> > * $ / IOPS are FAR better for ANY SSD than for HDDs
> >
> > * RUs cost money, so do chassis and switches
> >
> > * Drive failures cost money
> >
> > * So does having your people and applications twiddle their thumbs waiting 
> > for stuff to happen. I worked for a supercomputer company that put low-end, 
> > low-memory diskless workstations on engineers' desks. The engineers spent 
> > lots of time doing nothing, waiting for their applications to respond. That 
> > company no longer exists.
> >
> > * So does the risk of taking *weeks* to heal from a drive failure
> >
> > Punch honest numbers into https://www.snia.org/forums/cmsi/programs/TCOcalc
> >
> > I walked through this with a certain global company. QLC SSDs were 
> > demonstrated to have like 30% lower TCO than spinners. Part of the equation 
> > is that they were accustomed to limiting HDD size to 8 TB because of the 
> > bottlenecks, and thus requiring more servers, more switch ports, more DC 
> > racks, more rack/stack time, more administrative overhead. You can fit 1.9 
> > PB of raw SSD capacity in a 1U server. That same RU will hold at most 88 TB 
> > of the largest spinners you can get today. 22 TIMES the density. And since 
> > many applications can barely tolerate the spinner bottlenecks as it is, 
> > capping spinner size at even 10 TB makes that more like 40 TIMES better 
> > density with SSDs.
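
The density claim works out roughly as follows (the 4-bay 1U assumption for
the HDD case is mine, chosen to match the 88 TB figure above):

    # Back-of-the-envelope for the per-RU density comparison above.
    nvme_per_ru_tb  = 1900   # ~1.9 PB of raw NVMe in one 1U server (figure from the text)
    hdd_bays_per_ru = 4      # assumption: a typical 1U LFF chassis
    largest_hdd_tb  = 22     # 4 x 22 TB ~= the 88 TB per RU mentioned above
    capped_hdd_tb   = 10     # spinners capped at ~10 TB because of the bottlenecks

    print(nvme_per_ru_tb / (hdd_bays_per_ru * largest_hdd_tb))  # ~21.6x -> "22 TIMES"
    print(nvme_per_ru_tb / (hdd_bays_per_ru * capped_hdd_tb))   # ~47.5x -> ballpark of the "40 TIMES"
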
> >
> >
> > > However, I don't think it's desirable to lose the functionality of the 
> > > replicas.
> > > I'm just suggesting another way to increase performance without losing 
> > > that functionality.
> > >
> > >
> > > Rafael.
> > >
> > >
> > > De: "Anthony D'Atri" <anthony.da...@gmail.com>
> > > Enviada: 2024/01/31 17:04:08
> > > Para: quag...@bol.com.br
> > > Cc: ceph-users@ceph.io
> > > Assunto: Re: [ceph-users] Performance improvement suggestion
> > >
> > > Would you be willing to accept the risk of data loss?
> > >
> > >>
> > >> On Jan 31, 2024, at 2:48 PM, quag...@bol.com.br wrote:
> > >>
> > >> Hello everybody,
> > >> I would like to make a suggestion for improving performance in the Ceph 
> > >> architecture.
> > >> I don't know whether this group is the best place for it, or whether my 
> > >> proposal is sound.
> > >>
> > >> My suggestion concerns https://docs.ceph.com/en/latest/architecture/, at 
> > >> the end of the section "Smart Daemons Enable Hyperscale".
> > >>
> > >> The client needs to "wait" for the configured number of replicas to be 
> > >> written before it receives an ack and continues. As a result, if any of 
> > >> the disks holding the PG being updated is slow, the client is left 
> > >> waiting.
> > >>
> > >> It would be possible to:
> > >>
> > >> 1) Write only to the primary OSD.
> > >> 2) Write the other replicas in the background (much as happens when an 
> > >> OSD fails and the PG is "degraded").
> > >>
> > >> This way, the client gets a faster response when writing to storage, 
> > >> improving latency and performance (throughput and IOPS).
> > >>
> > >> I would find it acceptable to have a window of time (seconds) until all 
> > >> replicas are consistent (written asynchronously) in exchange for the 
> > >> improved performance.
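
Purely as a conceptual illustration of the latency argument (this is not Ceph
code, and the per-OSD delays are invented), a minimal sketch:

    # Conceptual sketch only -- NOT Ceph code. Simulated per-OSD write delays
    # illustrate why acking after the primary alone hides a slow replica,
    # and also where the data-loss window appears.
    import asyncio

    WRITE_DELAY = {"osd.0": 0.005, "osd.1": 0.005, "osd.2": 0.200}  # osd.2 is the slow disk

    async def write_to(osd):
        await asyncio.sleep(WRITE_DELAY[osd])   # pretend to persist the object
        return WRITE_DELAY[osd]

    async def ack_after_all_replicas():
        # current model: the client ack waits for the primary and every secondary
        delays = await asyncio.gather(*(write_to(o) for o in WRITE_DELAY))
        return max(delays)                      # latency tracks the slowest replica

    async def ack_after_primary_only():
        # proposed model: ack as soon as the primary persists, replicate later
        primary_delay = await write_to("osd.0")
        background = [asyncio.create_task(write_to(o)) for o in ("osd.1", "osd.2")]
        # the client already has its ack here; losing the primary before these
        # finish is exactly the data-loss window raised earlier in the thread
        await asyncio.gather(*background)       # let the demo finish cleanly
        return primary_delay

    async def main():
        print("ack after all replicas:", await ack_after_all_replicas())  # ~0.200
        print("ack after primary only:", await ack_after_primary_only())  # ~0.005

    asyncio.run(main())

The gap between the two printed numbers is the latency win being argued for;
the window between the early ack and the background writes completing is the
durability risk raised above.
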
> > >>
> > >> Could you evaluate this scenario?
> > >>
> > >>
> > >> Rafael.
> > >>
> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
